12 May 2015

Comparing Hadoop performance on DAS and Isilon and why disk locality is irrelevant


In a previous blog post [3] I discussed how Isilon enables you to create a Data Lake that can serve multiple Hadoop compute clusters with different Hadoop versions and distributions simultaneously. I stated that many workloads run faster on Isilon than on traditional Hadoop clusters that use DAS storage. This statement has recently been confirmed by IDC [2], who ran various Hadoop benchmarks against Isilon and against a native Hadoop cluster on DAS. Although I will show their results right at the beginning, the main purpose of this post is to discuss why Isilon is capable of delivering such good results and how the two architectures differ with regard to data distribution and balancing within the clusters.

The test environment

  • Cloudera Hadoop Distribution CDH5.
  • The Hadoop DAS cluster contained seven nodes: one master node and six worker nodes, each worker with eight 10k RPM 300 GB disks.
  • The Isilon cluster was built out of four X410 nodes, each with 57 TB of disk, 3.2 TB of SSD, and 10 GbE connections.
  • For more details see the IDC validation report [2].


NFS access

First of all, IDC tested NFS read and write access and, with no surprise, Isilon provides much more throughput even with just four nodes.

Figure 1: Runtimes for NFS write and read copy jobs while copying a 10 GB file (the block size is not mentioned, but I would assume 1 MB or larger)
NFS writes turned out to be 4.2 times faster on Isilon, and reads almost 37 times faster. This is quite important if you want to ingest data via NFS.
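Since an NFS mount looks like any other POSIX path to an application, ingest can be as simple as a plain file copy. Below is a minimal Java sketch; the mount point /mnt/isilon and the file names are made-up examples, not part of the IDC setup.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class NfsIngest {
    public static void main(String[] args) throws IOException {
        // Local source file and an assumed NFS mount point of an Isilon export.
        Path source = Paths.get("/data/staging/events.csv");
        Path target = Paths.get("/mnt/isilon/landing/events.csv");

        // A plain file copy is all that is needed to ingest over NFS; the same
        // file is then immediately visible to Hadoop jobs via the HDFS protocol.
        Files.createDirectories(target.getParent());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}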

Hadoop workloads

Three Hadoop workload types were run and compared using standard Hadoop benchmarks:
  1. Sequential write using TeraGen
  2. Mixed read/write using TeraSort
  3. Sequential read using TeraValidate
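These benchmarks ship with the standard hadoop-mapreduce-examples jar and are normally launched from the command line; the sketch below drives them programmatically via ToolRunner instead, just to make the three stages explicit. The row count and the /benchmarks/... paths are arbitrary examples, not the sizes IDC used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.examples.terasort.TeraValidate;
import org.apache.hadoop.util.ToolRunner;

public class TeraBenchmarks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 1. Sequential write: generate one billion 100-byte rows (~100 GB).
        ToolRunner.run(conf, new TeraGen(),
                new String[]{"1000000000", "/benchmarks/teragen"});

        // 2. Mixed read/write: sort the generated data.
        ToolRunner.run(conf, new TeraSort(),
                new String[]{"/benchmarks/teragen", "/benchmarks/terasort"});

        // 3. Sequential read: validate that the sorted output is in order.
        ToolRunner.run(conf, new TeraValidate(),
                new String[]{"/benchmarks/terasort", "/benchmarks/teravalidate"});
    }
}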
The results are illustrated in figure 2.
Figure 2: Runtimes for three different workloads using TeraGen, TeraSort and TeraValidate
It turns out that the runtime of the sequential write workload was about 2.6 times shorter on Isilon, and the runtimes of the other two workload types were about 1.5 times shorter. The related throughputs are outlined in the following table.

Job            Compute + Isilon    Hadoop DAS cluster
TeraGen        1681 MB/s           605 MB/s
TeraSort       642 MB/s            416 MB/s
TeraValidate   2832 MB/s           1828 MB/s
Table 1: Throughput for all three workloads on Isilon vs. the DAS cluster with a similar compute node configuration (results rounded).
The results speak for themselves, but let's look at the techniques OneFS uses to achieve this level of performance advantage over a DAS cluster.


The anatomy of file reads on Isilon

Although IOs on a DAS cluster are distributed across all nodes, an individual 64 MB block is served by a single node in the cluster. This is different on Isilon, where load distribution works at a much finer granularity. The steps for a read on Isilon can be described as follows:
  1. The compute node sends an HDFS metadata request to the Name Node service, which runs on all Isilon nodes (no single point of failure).
  2. The Name Node service will return the IP addresses and block numbers of any 3 Isilon nodes in the same rack as the compute node. This provides effective rack locality.
  3. The compute node sends a 64 MB HDFS block read request to the Data Node service on the first Isilon node returned.
  4. The contacted Isilon node retrieves, through the internal InfiniBand back-end network, all 512 of the 128 KB OneFS blocks that make up the 64 MB HDFS block. The blocks are read from disk if they are not already in the L2 cache. As noted above, this is fundamentally different from a DAS cluster, where the whole 64 MB block is read from one node only. That means an IO on Isilon is served by many more disks and CPUs than on a DAS cluster.
  5. The contacted Isilon node will return the entire HDFS block to the calling compute node.
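From the client's perspective nothing changes: the same HDFS API talks to the Name Node and Data Node services whether they run on Isilon or on a native cluster. A minimal read sketch is shown below; the SmartConnect hostname and the file path are assumptions for illustration only.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed SmartConnect zone name of the Isilon cluster; on a DAS
        // cluster this would point at the single NameNode instead.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // The client sees ordinary HDFS semantics; the fan-out of each 64 MB
        // block across many Isilon disks happens behind the protocol.
        try (InputStream in = fs.open(new Path("/landing/events.csv"))) {
            IOUtils.copyBytes(in, System.out, 65536, false);
        }
    }
}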

The anatomy of file writes on Isilon

When a client requests that a file be written to the cluster, the node to which the client is connected is the node that receives and processes the file.
  1. That node creates a write plan for the file, including calculating the FEC (forward error correction) protection blocks. This is much more space efficient than on a DAS cluster, where we typically keep three copies of each block for data protection.
  2. Data blocks assigned to the node are written to the NVRAM of that node. These NVRAM cards are specific to Isilon and not available on typical DAS clusters.
  3. Data blocks assigned to other nodes travel through the InfiniBand network to their L2 cache, and then to their NVRAM.
  4. Once all nodes have their data and FEC blocks in NVRAM, a commit is returned to the client. That means we do not need to wait until the data is written to disk, as all writes are securely buffered by NVRAM on Isilon.
  5. Data block(s) assigned to this node stay cached in L2 for future reads of that file.
  6. Data is then written onto the spindles.
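Again, this is transparent to the client: the commit described in step 4 simply surfaces as the successful close of a normal HDFS output stream. A minimal write sketch, using the same hypothetical cluster address as above:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // close() does not return until the data and FEC blocks are safely
        // acknowledged, which on Isilon means buffered in NVRAM rather than
        // already written to the spindles.
        try (FSDataOutputStream out = fs.create(new Path("/landing/hello.txt"), true)) {
            out.write("hello isilon".getBytes(StandardCharsets.UTF_8));
        }
    }
}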


The myth of disk locality importance for Hadoop

We sometimes hear objections from admins who claim that disk locality is critical for Hadoop. But remember that traditional Hadoop was designed for slow networks, which typically operated at 1 Gb/s. The only way to deal effectively with slow networks was to strive to keep all IO local to the server (disk locality).
There are several facts that make disk locality irrelevant:

I. Fast networks are standard today.

  • Today, a single non-blocking 10 Gbps switch port (up to 2500 MB/sec full duplex) can provide more bandwidth than a typical disk subsystem with 12 disks (360 – 1200 MB/sec).
  • We are no longer constrained to maintain data locality in order to provide adequate I/O bandwidth.
  • Isilon provides rack-locality, not disk-locality. This reduces the Ethernet traffic between racks.
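A quick back-of-the-envelope calculation makes this concrete (the per-disk range of 30-100 MB/s is implied by the 12-disk figures above):

  10 Gbps / 8 ≈ 1250 MB/s per direction (≈ 2500 MB/s full duplex)
  12 disks x 30-100 MB/s per disk ≈ 360-1200 MB/s aggregate disk throughput

So a single non-blocking 10 GbE port can absorb roughly everything a 12-disk node can read or write.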
By looking at the following illustration of the IO path, it is obvious that the bottleneck in the path is the disks, not the network (as long as it is a 10 GbE network).
Figure 3: IO path in a DAS architecture. Considering a non-blocking 10 Gbps network, it is obvious that the network is not the bottleneck. Even if we doubled the number of disks in the system, the disks would remain the bottleneck. As a result, disk locality is irrelevant for most workloads.


II. Disk locality is lost under several common situations:

  • All nodes of a DAS cluster with a replica of the block are running the maximum number of tasks. This is very common for busy clusters!
  • Input files are compressed with a non-splittable codec such as gzip.
  • “Analysis of Hadoop jobs from Facebook [1] underscores the difficulty in attaining disk-locality: overall, only 34% of tasks run on the same node that has the input data.”
  • Disk locality provides very low-latency IO; however, this latency has very little effect on batch operations such as MapReduce.

III. Data replication for performance

  • For very busy traditional clusters, a high replication count may be needed for hot files that are used often by many concurrent tasks. This is required for data locality and high concurrent reads.
  • On Isilon, a high replication count is not required because:
    a) Data locality is not required and
    b) Reads are split evenly over many Isilon nodes with a globally coherent cache, providing very high concurrent read performance
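For completeness, raising the replication factor of a hot file on a traditional cluster is typically done via the HDFS API or the dfs.replication setting; a minimal sketch follows, with the file path and replica count chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HotFileReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the NameNode to keep ten replicas of a frequently read file so
        // that more DataNodes can serve it locally to concurrent tasks.
        // On Isilon this step is unnecessary: reads are already spread across
        // all nodes through the globally coherent cache.
        fs.setReplication(new Path("/warehouse/hot/part-00000"), (short) 10);
    }
}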

 

Other Isilon performance relevant technologies

As mentioned earlier, OneFS is very mature and has been engineered for more than a decade for high throughput and low latency with multi-protocol access. You can google a number of articles and papers describing the relevant features; I'll just give some keywords here:
  • All writes are buffered by redundant NVRAM, which makes writes extremely fast.
  • OneFS provides an L1 cache, a globally coherent L2 cache, and an L3 cache on SSDs for accelerated reads.
  • Access patterns can be configured per cluster, per pool, or even at the directory level to optimize and balance prefetching. The available patterns are random, concurrent, and streaming.
  • Metadata acceleration is provided by the L3 cache; alternatively, OneFS can be configured to store all filesystem metadata on SSDs.

 

Summary

Isilon is a scale-out NAS system with a distributed filesystem that has been built for massive throughput requirements and workloads like Hadoop. HDFS is implemented as a protocol, and the Name Node as well as Data Node services are delivered in a highly available manner by all Isilon nodes. IDC's performance validation [2] showed up to 2.5 times higher performance compared to a DAS cluster. Thanks to modern networking technologies, the often-cited disk locality is irrelevant for Hadoop on Isilon. Besides the better performance, Isilon provides many other advantages, such as much higher capacity efficiency and many enterprise storage features. Furthermore, storage and compute nodes can be scaled independently, and you can access the same data with different Hadoop versions and distributions simultaneously.

References

[1] Disk-Locality in Datacenter Computing Considered Irrelevant, Ganesh Ananthanarayanan, University of California, Berkeley
[2] EMC Isilon Scale-out Data Lake Foundation – Essential Capabilities for Building Big Data Infrastructure, IDC White Paper, October 2014
[3] How to access data with different Hadoop versions and distributions simultaneously, Stefan Radtke, blog post, 2015
[4] EMC Isilon OneFS – A Technical Overview, White Paper, November 2013
The white papers mentioned here are all available for download at https://support.emc.com

Acknowledgement

I have stolen several aspects and topics of this discussion from the excellent training material that my colleague Claudio Fahey put together. Thanks to Matthias Radtke for improving my non-native English.






