05 May 2015

How to access data with different Hadoop versions and distributions simultaneously

Many companies start rolling out or at least think about using Hadoop for data analysis and processing. Hadoop Distributed Filesystem (HDFS) is the underlying filesystem and typically local storage within the compute nodes is used to provide the storage capacity. HDFS has been designed to satisfy the workload characteristics for analytics and has been born during the time where 1 Gigabit Ethernet was the standard networking technology in the datacenter. The idea was to bring the data to the compute nodes in order to minimize network utilization and bandwidth and reduce latency through data locality. (I’ll post another article to discuss this in more detail and show that this requirement is less important these days where we have 10 Gigabit almost everywhere in the datacenter). For the moment we’ll look at some other side effects of this strategy which reminds me somehow to a lot of data silos. Business Intelligence dudes know what I am talking about.

Figure 1: The “Bring Data to Compute” strategy results in a lot of data silos, complex and time consuming workflows.
One thing you may already know is the fact that HDFS is not compatible with POSIX protocols like

NFS or SMB. That means you need special tools to copy your data into the filesystem. That’ll take a long time. For example, if you need to copy 100TB over a 10 GB Ethernet, you’d need more than 24 hours to so so if the network is not occupied with other traffic.

But it gets even worse: You may have multiple HDFS distributions or versions like you have different RDBMS systems

Have you ever thought about how many different RDBMS systems you have in the company ? Typically companies have several relational database systems like Oracle, MS SQL, MySQL, DB2, Sybase etc. But wouldn’t it be easier to have only one system? The answer of course is: yes it would, but that’s not the reality we are facing. In practice we have different RDBMS systems for various reasons:

  • Application dependencies
  • Different people or organizations within the company have different preferences
  • Mergers and acquisitions
  • Price and licensing models
  • Functionality
  • Performance
  • Historical reasons
  • Large IT organizations or service providers just have to support what their customers want. They cannot dictate the Hadoop version or setup a new cluster including storage for every customer
  • New innovative distributions appear in the market. Consider how many Linux distributions we have. It’s not only RedHat, SuSe, Debian, Ubuntu and you name it. Just recently Intel, IBM and Pivotal/EMC have announced that they’ll maintain their additional distributions that are optimized for virtual and cloud environments. The same may happen with HDFS.
  • …and others
I guess we’ll see the same development with HDFS for quite the same reasons. Furthermore, the development of Hadoop is currently very fast and we’ll see HDFS vendors starting to build their individual strength within different areas.
Now think about how you would use different versions or distributions when you have compute and data tightly integrated? It most probably will end up in more HDFS clusters, more copies of data and big data movements. You may also be stuck at a specific version or distribution because you need your production data to be available for analysis and you cannot just migrate and copy them every day.

Here is the solution: the Scale-out Data Lake Isilon

Fortunately there is a solution to this issue: EMC’s Isilon Scale Out NAS System has a very mature distributed filesystem OneFS. It’s also build on top of commodity hardware and uses internal disks to provide the space for the data. However, it’s much more advanced than HDFS in many regards and it has been built over more than 15 years to serve massive amounts of data with very high throughput and low latency. To serve Hadoop requests, HDFS has been implemented as a protocol rather than a filesystem. As a result, you can access your data over various protocols such as SMB, NFS, FTP, HTTP, Openstack Swift and HDFS simultaneously while consistency, protection, access control and global file locking is provided by OneFS.

Figure 2: Data on the Scale-Out filesystem OneFS can be accessed via multiple protocols

HDFS as a protocol

Instead of storing the data on a new filesystem type, the Isilon team has integrated HDFS as a protocol. A multi-threaded daemon called isi_hdfs_d is running on every Isilon node. It services both Name Node and Data Node protocols and it translates HDFS RPCs to POSIX system calls. As HDFS is stateless, the underlying filesystem handles coherency.
Figure 3: Multi-threaded HDFS daemon runs on every Isilon node.
With this approach, new protocol version can be integrated quickly and data migrations or modifications are not required as they reside on the POSIX scale-out filesystem.

Figure 4: Data Node and Name Node requests are served in a highly available manner.


Access to the data with different Hadoop versions or distributions

This “de-coupling” of compute and storage with Isilon as your “Data Lake”, you can now access the very same data with multiple Hadoop distributions and even different HDFS versions (at the time of writing this, Isilon supports almost everything from HDFS 1.0 to HDFS 2.6 and the development team has a strong focus to have new versions ready right after the major distributions come up with new HDFS versions).
If you think about this for a moment you’ll agree that this is huge! By pooling your data into Isilon, you get complete freedom which version of HDFS you want or need to use. You can test new versions, roll back to previous ones or use another distribution to access the same data simultaneously. Think a moment about the analogy with the different RDBMS versions in your company which I have mentioned above. There is a high probability that you’ll have the same with Hadoop: different Hadoop distributions and versions within the company. That’s no problem with Isilon. Solved.

Other Advantages

But there are further advantages:
  • No single point of failure. Name node requests are served by all Isilon Nodes in an active/active manner.
  • Isilon protects data with erasure coding across nodes. That’s much more efficient than just creating multiple copies of each block. The protection level can be set very flexible and dynamically for every directory  or pool. If you follow the guidelines, you’ll get a protection overhead between 20% and 30%. That’s much more efficient over native HDFS where you need to provide 300% of DAS capacity for 3 copies of data. See [3] for more details.
  • You can scale compute and storage independently. Your compute nodes don’t require storage anymore (you might want to use internal disks for the shuffle IO though). If you need compute power, you add servers, if you need capacity, you add Isilon nodes.
  • Most workloads run faster on Isilon [1,4].
  • For some data you can eliminate ingest since the data is already present on Isilon
  • If you need to ingest data, you can do it via POSIX protocols such as NFS, SMB or FTP. IDC found that NFS  writes work 4.2 times faster on Isilon and 36 times faster for reads [1,4].
  • Use existing authentication providers such as Kerberos, Active Directory, LDAP etc. for integrated security
  • Isilon balances the data equally across all nodes in the cluster. If you need more capacity, you just add a node. New capacity is available immediately, rebalancing takes place in background.
  • Use parallel synchronization over LAN or WAN for a disaster recovery strategy
  • Manage a single large scale-out filesystem (today up to 50PB) very easy via a WebUI, CLI (OneFS is based on FreeBSD so Unix/Linux dudes feel home) or API.
  • Use Data Tiering: you can use different Isilon nodes in one cluster and use policy based and transparent data tiering for optimal performance and cost efficiency. For details see [5].
  • Data at Rest Encryption on Isilon is done at drive level. There is almost no performance impact compared to evolving software encryption solutions.
  • Use Isilon de-duplication. It’s running as a background post process and as such doesn’t impact production performance.
  • Use Snapshots. Currently more than 20000 snapshots are supported.
  • Use SEC 17-a4 compliant WORM retention
  • Use certified file system auditing
  • Use Isilon Access Zones to provide Hadoop as a Service securely to multiple tenants.
  • Use existing backup mechanisms.



OneFS is a very mature scale-out filesystem that serves data via multiple protocols including HDFS to hundreds or thousands of clients. The biggest advantage is that you separate compute and storage and you can scale both independently. Most importantly, you can provide access to the data via multiple HDFS protocols and distributions at the same time. No matter which version or distributions your users or customers prefer, they all can be served by Isilon as long as it is a distribution that’s based on the Apache base, such as Hortonworks, Pivotal or Cloudera. Instead of using HDFS data silos, Isilon is a great foundation for your Data Lake with enterprise grade functionalities that integrates well into your datacenter’s infrastructure with respect to security, serviceability, high performance. 


[1]   EMC Isilon Scale-out Data Lake Foundation – Essential Capabilities for Building Big Data Infrastructure, IDC White Paper, October 2014.
[2] EMC Isilon OneFS – A Technical Overview; White Paper, November 2013.
[3] High Availability and Data Protection with EMC Isilon Scale-Out NAS, White Paper, November 2013.
[4] Comparing Hadoop performance on DAS and Isilon, Stefan Radtke, blog post 2015.
[5] Next Generation Storage Tiering with EMC Smartpools, White Paper, April 2013.
The White Papers mentioned here are all available for download at https://support.emc.com


Thanks to my colleague Ryan Peterson who brought up the idea of the analogy to RDBMS systems and why you have different ones during a recent discussion.


  1. There are lots of information about latest technology and how to get trained in them, like Big Data Training in Chennai have spread around the web, but this is a unique one according to me. The strategy you have updated here will make me to get trained in future technologies(Big Data Training). By the way you are running a great blog. Thanks for sharing this.

    Hadoop Training in Chennai | Big Data Training in Chennai

  2. Excellent post on iOS mobile apps development!!! The future of mobile application development is on positive note. You can make most it by having in-depth knowledge on mobile application development platform and other stunning features. iOS Training in Chennai | iOS Training Institutes in Chennai

  3. Nice it is thanks for sharing

    Visit - www.tekclasses.in/

  4. The pictorial representation was really good and i got some clarity about Hadoop.Thanks for posting such a unique and interesting blog.Hadoop is a platform for storing and processing of Data in an environment with clusters of computers using simple programming language.

    Hadoop Training Chennai | Hadoop Training in Chennai | Big Data Training in Chennai

  5. Very Nice Blog I like the way you explained these things.

  6. GREEN WOMEN HOSTELGreen Women hostel is one of the leading Ladies hostel in Adyar and we serving an excellent service to Staying people, We create a home atmosphere, it is the best place for Working WomenOur hostel Surrounded around bus depot, hospital, atm, bank, medical Shop & 24 hours Security Facility

  7. brilliant article that I was searching for. Helps me a lot
    call360 is Fastest local search Engine we have 12 years of experience in online industery, in our Search Engine we offer,
    more than 220 categories and 1 Million Business Listing most frequently search categories
    are Money exchange Chennai and Bike mechanic Chennai,
    we deliver 100% accure data to users & 100% Verified leads to our
    registered business vendors and our most popular categories are
    AC mechanic chennai,
    Advertising agencies chennai
    catering services chennai

  8. brilliant article that I was searching for. Helps me a lot.
    We are one of the Finest ladies hostel near OMR and our
    womens hostel in adyar is secure place for working womens
    we provide home based food with hi quality, our hostel located very near to Adyar bus depot.
    womens hostel near Adyar bus depot, we are one of the best and experienced
    womens hostel near omr

  9. This content is so informatics and it was motivating all the programmers and beginners to switch over the career into the Big Data Technology. This article is so impressed and keeps updating us regularly.
    Hadoop Training in Chennai | Hadoop Training Chennai | Big Data Training in Chennai

  10. It’s really amazing that we can record what our visitors do on our site. Thanks for sharing this awesome guide. I’m happy that I came across with your site this article is on point,thanks again and have a great day. Keep update more information..
    Salesforce Training in Chennai

    Web Designing Training in Chennai

  11. it’s really nice and meanful. it’s really cool blog. Linking is very useful thing.you have really helped lots of people who visit blog and provide them usefull information.
    Hadoop Training in Hyderabad
    Java Training in Hyderabad

  12. It is amazing and wonderful to visit your site.Thanks for sharing this information,this is useful to me...
    Android Training in Chennai
    Ios Training in Chennai

  13. Thank you for taking the time and sharing this information with us. It was indeed very helpful and insightful while being straight forward and to the point.
    mcdonaldsgutscheine.net/ | startlr.com/ | saludlimpia.com/

  14. Your new valuable key points simply much a person like me and extremely more to my office workers. With thanks; from every one of us.
    Best Python training Institute in chennai

  15. hi your post on hadoop to access data with different hdfs was very much useful as a beginner I learnt something new Hadoop Training in Velachery | Hadoop Training .

  16. This is extremely great information for these blog!! And Very good work. It is very interesting to learn from to easy understood. Thank you for giving information. Please let us know and more information get post to link. aws interview questions for devops

  17. AWS Training in Bangalore - Live Online & Classroom
    myTectra Amazon Web Services (AWS) certification training helps you to gain real time hands on experience on AWS. myTectra offers AWS training in Bangalore using classroom and AWS Online Training globally. AWS Training at myTectra delivered by the experienced professional who has atleast 4 years of relavent AWS experince and overall 8-15 years of IT experience. myTectra Offers AWS Training since 2013 and retained the positions of Top AWS Training Company in Bangalore and India.

    IOT Training in Bangalore - Live Online & Classroom
    IOT Training course observes iot as the platform for networking of different devices on the internet and their inter related communication. Reading data through the sensors and processing it with applications sitting in the cloud and thereafter passing the processed data to generate different kind of output is the motive of the complete curricula. Students are made to understand the type of input devices and communications among the devices in a wireless media.

  18. Its a wonderful post and very helpful, thanks for all this information. You are including better information regarding this topic in an effective way. T hank you so much.
    Salesforce Training in Chennai
    German Classes in Chennai
    Salesforce Course
    Salesforce Developer Training
    German Language Classes in Chennai
    German Language Course in Chennai

  19. Your post is really awesome. Your blog is really helpful for me to develop my skills in a right way. Thanks for sharing this unique information with us.
    - Learn Digital Academy

  20. Nice thanks for sharing this post

  21. Hmm, it seems like your site ate my first comment (it was extremely long) so I guess I’ll just sum it up what I had written and say, I’m thoroughly enjoying your blog. I as well as an aspiring blog writer, but I’m still new to the whole thing. Do you have any recommendations for newbie blog writers? I’d appreciate it.
    Advanced AWS Training in Bangalore | Best Amazon Web Services Training Institute in Bangalore
    Advanced AWS Training Institute in Pune | Best Amazon Web Services Training Institute in Pune
    Advanced AWS Online Training Institute in india | Best Online AWS Certification Course in india
    AWS training in bangalore | Best aws training in bangalore

  22. Have you been thinking about the power sources and the tiles whom use blocks I wanted to thank you for this great read!! I definitely enjoyed every little bit of it and I have you bookmarked to check out the new stuff you post
    microsoft azure training in bangalore
    rpa training in bangalore
    best rpa training in bangalore
    rpa online training

  23. Nice tutorial. Thanks for sharing the valuable information. it’s really helpful. Who want to learn this blog most helpful. Keep sharing on updated tutorials…
    Best Devops training in sholinganallur
    Devops training in velachery
    Devops training in annanagar
    Devops training in tambaram

  24. A good blog always comes-up with new and exciting information and while reading I have feel that this blog is really have all those quality that qualify a blog to be a one.I wanted to leave a little comment to support you and wish you a good continuation. Wishing you the best of luck for all your blogging efforts read this.
    Python Online certification training
    python Training institute in Chennai
    Python training institute in Bangalore

  25. I have gone through your blog, it was very much useful for me and because of your blog, and also I gained many unknown information, the way you have clearly explained is really fantastic. Kindly post more like this, Thank You.
    honor service centre
    honor mobile service center in chennai
    honor mobile service center
    honor mobile service centre in Chennai
    honor service center near me

  26. Good job in presenting the correct content with the clear explanation. The content looks real with valid information.
    R Training Institute in Chennai | R Programming Training in Chennai

  27. Nice post. Thanks for sharing! I want people to know just how good this information is in your article. It’s interesting content and Great work.
    Thanks & Regards,
    VRIT Professionals,
    No.1 Leading Web Designing Training Institute In Chennai.

    And also those who are looking for
    Web Designing Training Institute in Chennai
    SEO Training Institute in Chennai
    Photoshop Training Institute in Chennai
    PHP & Mysql Training Institute in Chennai
    Android Training Institute in Chennai

  28. Hey, would you mind if I share your blog with my twitter group? There’s a lot of folks that I think would enjoy your content. Please let me know. Thank you.
    excel training in chennai | advanced excel training in chennai | excel classes in chennai | advanced excel course in chennai | ms excel training in chennai | excel courses in chennai

    SAN Solutions in Dubai

  30. Software Development Company We specialize in Blockchain development, Artificial Intelligence, DevOps, Mobile App development, Web App development and all your customised online solutions. Get best impression at online by our services, we are familiar for cost effectiveness, quality, delivery and support.