12 May 2015

Comparing Hadoop performance on DAS and Isilon and why disk locality is irrelevant


In a previous blog post [3] I discussed how Isilon enables you to create a Data Lake that can serve multiple Hadoop compute clusters with different Hadoop versions and distributions simultaneously. I stated that many workloads run faster on Isilon than on traditional Hadoop clusters that use DAS storage. This statement has recently been confirmed by IDC [2], who ran various Hadoop benchmarks against Isilon and against a native Hadoop cluster on DAS. Although I will show their results right at the beginning, the main purpose of this post is to discuss why Isilon is capable of delivering such good results and how the two architectures differ with regard to data distribution and balancing within the clusters.

The test environment

  • Cloudera Hadoop Distribution CDH5.
  • The Hadoop DAS cluster contained seven nodes: one master node and six worker nodes, each worker with eight 10k RPM 300 GB disks.
  • The Isilon cluster was built out of four X410 nodes, each with 57 TB of disk, 3.2 TB of SSD, and 10 GbE connections.
  • For more details see the IDC validation report [2].


NFS access

First of all, IDC tested NFS read and write access and, with no surprise, Isilon provides much more throughput even with just four nodes.

Figure 1: Runtimes for NFS write and read copy jobs while copying a 10 GB file (the block size is not mentioned, but I would assume 1 MB or larger)
NFS writes turned out to be 4.2 times faster on Isilon, and reads almost 37 times faster. This is quite important if you want to ingest data via NFS.
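Since an NFS mount looks like any other POSIX path to an application, ingest can be as simple as a plain file copy. Below is a minimal Java sketch; the mount point /mnt/isilon and the file names are made-up examples, not part of the IDC setup.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class NfsIngest {
    public static void main(String[] args) throws IOException {
        // Local source file and an assumed NFS mount point of an Isilon export.
        Path source = Paths.get("/data/staging/events.csv");
        Path target = Paths.get("/mnt/isilon/landing/events.csv");

        // A plain file copy is all that is needed to ingest over NFS; the same
        // file is then immediately visible to Hadoop jobs via the HDFS protocol.
        Files.createDirectories(target.getParent());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}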

Hadoop workloads

Three Hadoop workload types were run and compared using standard Hadoop benchmarks:
  1. Sequential write using TeraGen
  2. Mixed read/write using TeraSort
  3. Sequential read using TeraValidate
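These benchmarks ship with the standard hadoop-mapreduce-examples jar and are normally launched from the command line; the sketch below drives them programmatically via ToolRunner instead, just to make the three stages explicit. The row count and the /benchmarks/... paths are arbitrary examples, not the sizes IDC used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.examples.terasort.TeraValidate;
import org.apache.hadoop.util.ToolRunner;

public class TeraBenchmarks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 1. Sequential write: generate one billion 100-byte rows (~100 GB).
        ToolRunner.run(conf, new TeraGen(),
                new String[]{"1000000000", "/benchmarks/teragen"});

        // 2. Mixed read/write: sort the generated data.
        ToolRunner.run(conf, new TeraSort(),
                new String[]{"/benchmarks/teragen", "/benchmarks/terasort"});

        // 3. Sequential read: validate that the sorted output is in order.
        ToolRunner.run(conf, new TeraValidate(),
                new String[]{"/benchmarks/terasort", "/benchmarks/teravalidate"});
    }
}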
The results are illustrated in figure 2.
Figure 2: Runtimes for three different workloads using TeraGen, TeraSort and TeraValidate
It turns out that the runtime of the sequential write workload was about 2.6 times shorter on Isilon, and the runtimes of the other two workload types were about 1.5 times shorter. The related throughputs are outlined in the following table.

Job            Compute + Isilon    Hadoop DAS cluster
TeraGen        1681 MB/s           605 MB/s
TeraSort       642 MB/s            416 MB/s
TeraValidate   2832 MB/s           1828 MB/s
Table 1: Throughput for all three workloads on Isilon vs. the DAS cluster with a similar compute node configuration (results rounded).
The results speak for themselves, but let's look at the techniques OneFS uses to achieve this level of performance advantage over a DAS cluster.


The anatomy of file reads on Isilon

Although IOs on a DAS cluster are distributed across all nodes, an individual 64 MB block is served by a single node in the cluster. This is different on Isilon, where load distribution works at a much finer granularity. The steps for a read on Isilon can be described as follows:
  1. The compute node sends an HDFS metadata request to the Name Node service, which runs on all Isilon nodes (no single point of failure).
  2. The Name Node service will return the IP addresses and block numbers of any 3 Isilon nodes in the same rack as the compute node. This provides effective rack locality.
  3. The compute node sends a 64 MB HDFS block read request to the Data Node service on the first Isilon node returned.
  4. The contacted Isilon node retrieves, through the internal InfiniBand back-end network, all 512 of the 128 KB OneFS blocks that make up the 64 MB HDFS block. The blocks are read from disk if they are not already in the L2 cache. As noted above, this is fundamentally different from a DAS cluster, where the whole 64 MB block is read from one node only. That means an IO on Isilon is served by many more disks and CPUs than on a DAS cluster.
  5. The contacted Isilon node will return the entire HDFS block to the calling compute node.
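From the client's perspective nothing changes: the same HDFS API talks to the Name Node and Data Node services whether they run on Isilon or on a native cluster. A minimal read sketch is shown below; the SmartConnect hostname and the file path are assumptions for illustration only.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed SmartConnect zone name of the Isilon cluster; on a DAS
        // cluster this would point at the single NameNode instead.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // The client sees ordinary HDFS semantics; the fan-out of each 64 MB
        // block across many Isilon disks happens behind the protocol.
        try (InputStream in = fs.open(new Path("/landing/events.csv"))) {
            IOUtils.copyBytes(in, System.out, 65536, false);
        }
    }
}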

The anatomy of file writes on Isilon

When a client requests that a file be written to the cluster, the node to which the client is connected is the node that receives and processes the file.
  1. That node creates a write plan for the file, including calculating the FEC (forward error correction) protection blocks. This is much more space efficient than on a DAS cluster, where we typically keep three copies of each block for data protection.
  2. Data blocks assigned to the node are written to the NVRAM of that node. These NVRAM cards are specific to Isilon and not available on typical DAS clusters.
  3. Data blocks assigned to other nodes travel through the InfiniBand network to their L2 cache, and then to their NVRAM.
  4. Once all nodes have their data and FEC blocks in NVRAM, a commit is returned to the client. That means we do not need to wait until the data is written to disk, as all writes are securely buffered by NVRAM on Isilon.
  5. Data block(s) assigned to this node stay cached in L2 for future reads of that file.
  6. Data is then written onto the spindles.
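Again, this is transparent to the client: the commit described in step 4 simply surfaces as the successful close of a normal HDFS output stream. A minimal write sketch, using the same hypothetical cluster address as above:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // close() does not return until the data and FEC blocks are safely
        // acknowledged, which on Isilon means buffered in NVRAM rather than
        // already written to the spindles.
        try (FSDataOutputStream out = fs.create(new Path("/landing/hello.txt"), true)) {
            out.write("hello isilon".getBytes(StandardCharsets.UTF_8));
        }
    }
}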


The myth of disk locality importance for Hadoop

We sometimes hear objections from admins who claim that disk locality is critical for Hadoop. But remember that traditional Hadoop was designed for slow networks, which typically operated at 1 Gb/s. The only way to deal effectively with slow networks was to strive to keep all IO local to the server (disk locality).
There are several facts that make disk locality irrelevant:

I. Fast networks are standard today.

  • Today, a single non-blocking 10 Gbps switch port (up to 2500 MB/sec full duplex) can provide more bandwidth than a typical disk subsystem with 12 disks (360 – 1200 MB/sec).
  • We are no longer constrained to maintain data locality in order to provide adequate I/O bandwidth.
  • Isilon provides rack-locality, not disk-locality. This reduces the Ethernet traffic between racks.
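A quick back-of-the-envelope calculation makes this concrete (the per-disk range of 30-100 MB/s is implied by the 12-disk figures above):

  10 Gbps / 8 ≈ 1250 MB/s per direction (≈ 2500 MB/s full duplex)
  12 disks x 30-100 MB/s per disk ≈ 360-1200 MB/s aggregate disk throughput

So a single non-blocking 10 GbE port can absorb roughly everything a 12-disk node can read or write.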
By looking at the following illustration of the IO path, it is obvious that the bottleneck in the path is the disks, not the network (as long as it is a 10 GbE network).
Figure 3: IO path in a DAS architecture. Considering a non-blocking 10 Gbps network, it is obvious that the network is not the bottleneck. Even if we doubled the number of disks in the system, the disks would remain the bottleneck. As a result, disk locality is irrelevant for most workloads.


II. Disk locality is lost under several common situations:

  • All nodes of a DAS cluster with a replica of the block are running the maximum number of tasks. This is very common for busy clusters!
  • Input files are compressed with a non-splittable codec such as gzip.
  • “Analysis of Hadoop jobs from Facebook [1] underscores the difficulty in attaining disk-locality: overall, only 34% of tasks run on the same node that has the input data.”
  • Disk locality provides very low-latency IO; however, this latency has very little effect on batch operations such as MapReduce.

III. Data replication for performance

  • For very busy traditional clusters, a high replication count may be needed for hot files that are used often by many concurrent tasks. This is required for data locality and high concurrent reads.
  • On Isilon, a high replication count is not required because:
    a) Data locality is not required and
    b) Reads are split evenly over many Isilon nodes with a globally coherent cache, providing very high concurrent read performance
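For completeness, raising the replication factor of a hot file on a traditional cluster is typically done via the HDFS API or the dfs.replication setting; a minimal sketch follows, with the file path and replica count chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HotFileReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep ten replicas of a frequently read file so
        // that more DataNodes can serve it locally to concurrent tasks.
        // On Isilon this step is unnecessary: reads are already spread across
        // all nodes through the globally coherent cache.
        fs.setReplication(new Path("/warehouse/hot/part-00000"), (short) 10);
    }
}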

 

Other Isilon performance relevant technologies

As mentioned earlier, OneFS is very mature and has been engineered for more than a decade for high throughput and low latency with multi-protocol access. You can google a number of articles and papers describing the relevant features; I'll just give some keywords here:
  • All writes are buffered by redundant NVRAM, which makes writes extremely fast.
  • OneFS provides an L1 cache, a globally coherent L2 cache, and an L3 cache on SSDs for accelerated reads.
  • Access patterns can be configured per cluster, per pool, or even at the directory level to optimize and balance prefetching. The available patterns are random, concurrent, and streaming.
  • Metadata acceleration is provided by the L3 cache; alternatively, OneFS can be configured to store all filesystem metadata on SSDs.

 

Summary

Isilon is a scale-out NAS system with a distributed filesystem that has been built for massive throughput requirements and workloads like Hadoop. HDFS is implemented as a protocol, and the Name Node as well as Data Node services are delivered in a highly available manner by all Isilon nodes. IDC's performance validation [2] showed up to 2.5 times higher performance compared to a DAS cluster. Thanks to modern networking technologies, the often-cited disk locality is irrelevant for Hadoop on Isilon. Besides the better performance, Isilon provides many other advantages, such as much higher capacity efficiency and many enterprise storage features. Furthermore, storage and compute nodes can be scaled independently, and you can access the same data with different Hadoop versions and distributions simultaneously.

References

[1] Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesh Ananthanarayanan, University of California, Berkeley
[2] EMC Isilon Scale-out Data Lake Foundation – Essential Capabilities for Building Big Data Infrastructure, IDC White Paper, October 2014
[3] How to access data with different Hadoop versions and distributions simultaneously, Stefan Radtke, blog post, 2015
[4] EMC Isilon OneFS – A Technical Overview, White Paper, November 2013
The white papers mentioned here are all available for download at https://support.emc.com

Acknowledgement

I have stolen several aspects and topics of this discussion from the excellent training material that my colleague Claudio Fahey put together. Thanks to Matthias Radtke for improving my non-native English.






