05 May 2015

How to access data with different Hadoop versions and distributions simultaneously


Many companies are rolling out Hadoop, or at least thinking about using it, for data analysis and processing. The Hadoop Distributed Filesystem (HDFS) is the underlying filesystem, and typically the local storage within the compute nodes provides the storage capacity. HDFS was designed for the workload characteristics of analytics and was born at a time when 1 Gigabit Ethernet was the standard networking technology in the datacenter. The idea was to bring the data to the compute nodes in order to minimize network utilization and bandwidth and to reduce latency through data locality. (I’ll post another article to discuss this in more detail and show that this requirement matters less these days, now that we have 10 Gigabit Ethernet almost everywhere in the datacenter.) For the moment we’ll look at some other side effects of this strategy, which remind me somewhat of classic data silos. Business Intelligence dudes know what I am talking about.

Figure 1: The “Bring Data to Compute” strategy results in a lot of data silos, complex and time consuming workflows.
One thing you may already know is that HDFS cannot be accessed via standard file protocols such as NFS or SMB. That means you need special tools to copy your data into the filesystem, and that takes a long time. For example, if you need to copy 100 TB over 10 Gigabit Ethernet, you’d need roughly a full day to do so, even if the network is not occupied with other traffic.
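To put a number on that, here is a quick back-of-the-envelope calculation in Python (a minimal sketch; the 100 TB and 10 Gigabit figures are just the example values from above, and real-world throughput will be lower due to protocol overhead):

# Back-of-the-envelope: pushing 100 TB through a 10 Gigabit Ethernet link
data_tb = 100                     # amount of data to ingest, in terabytes
link_gbit_s = 10                  # raw line rate of 10 Gigabit Ethernet

data_bits = data_tb * 1e12 * 8    # 100 TB expressed in bits
seconds = data_bits / (link_gbit_s * 1e9)

print(f"At 100% line rate: {seconds / 3600:.1f} hours")                      # ~22.2 hours
print(f"At a realistic 70% utilization: {seconds / 0.7 / 3600:.1f} hours")   # ~31.7 hours

Even under ideal conditions the ingest alone occupies the network for the better part of a day.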

But it gets even worse: you may end up with multiple Hadoop distributions or versions, just like you have different RDBMS systems

Have you ever thought about how many different RDBMS systems you have in your company? Typically companies run several relational database systems such as Oracle, MS SQL Server, MySQL, DB2, Sybase, etc. But wouldn’t it be easier to have only one system? The answer of course is: yes it would, but that’s not the reality we are facing. In practice we have different RDBMS systems for various reasons:

  • Application dependencies
  • Different people or organizations within the company have different preferences
  • Mergers and acquisitions
  • Price and licensing models
  • Functionality
  • Performance
  • Historical reasons
  • Large IT organizations or service providers simply have to support what their customers want. They cannot dictate the Hadoop version or set up a new cluster including storage for every customer
  • New innovative distributions appear in the market. Consider how many Linux distributions we have. It’s not only Red Hat, SUSE, Debian, Ubuntu and so on. Just recently Intel, IBM and Pivotal/EMC announced that they’ll maintain their own additional distributions, optimized for virtual and cloud environments. The same may happen with HDFS.
  • …and others
I guess we’ll see the same development with HDFS for quite the same reasons. Furthermore, Hadoop development is currently moving very fast, and we’ll see HDFS vendors starting to build out their individual strengths in different areas.
Now think about how you would use different versions or distributions when compute and data are tightly integrated. You will most probably end up with more HDFS clusters, more copies of the data, and big data movements. You may also be stuck on a specific version or distribution, because you need your production data to be available for analysis and you cannot just migrate and copy it every day.

Here is the solution: Isilon, the scale-out Data Lake

Fortunately there is a solution to this issue: EMC’s Isilon scale-out NAS system is built around a very mature distributed filesystem, OneFS. It is also built on top of commodity hardware and uses internal disks to provide the space for the data. However, it is much more advanced than HDFS in many regards and has been developed over more than 15 years to serve massive amounts of data with very high throughput and low latency. To serve Hadoop requests, HDFS has been implemented as a protocol rather than a filesystem. As a result, you can access your data over various protocols such as SMB, NFS, FTP, HTTP, OpenStack Swift and HDFS simultaneously, while consistency, protection, access control and global file locking are provided by OneFS.

Figure 2: Data on the Scale-Out filesystem OneFS can be accessed via multiple protocols
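As a minimal illustration of what this multiprotocol access means in practice, the sketch below writes a file through an NFS mount of the Isilon cluster and then reads the very same file back over the HDFS protocol with the stock Hadoop command line tool. The mount point, the SmartConnect hostname and the paths are assumptions for the example; on a real cluster the HDFS root is typically mapped to a directory inside the access zone, so the visible path prefixes may differ.

# Sketch: the same file reached via two protocols (paths and hostname are assumed).
import subprocess

# 1) Write a file through an NFS mount of the Isilon cluster (assumed mount point).
with open("/mnt/isilon/analytics/events/2015-05-05.csv", "w") as f:
    f.write("timestamp,user,action\n")

# 2) Read the very same file back over the HDFS protocol, pointing the standard
#    Hadoop CLI at the Isilon SmartConnect name (assumed hostname, default port).
subprocess.run(
    ["hdfs", "dfs", "-cat",
     "hdfs://isilon.example.com:8020/analytics/events/2015-05-05.csv"],
    check=True,
)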

HDFS as a protocol

Instead of storing the data on a new filesystem type, the Isilon team has integrated HDFS as a protocol. A multi-threaded daemon called isi_hdfs_d runs on every Isilon node. It services both the NameNode and the DataNode protocol and translates HDFS RPCs into POSIX system calls. As the HDFS protocol is stateless, the underlying filesystem handles coherency.
Figure 3: Multi-threaded HDFS daemon runs on every Isilon node.
With this approach, new protocol versions can be integrated quickly, and data migrations or modifications are not required because the data resides on the POSIX scale-out filesystem.

Figure 4: Data Node and Name Node requests are served in a highly available manner.

 


Access to the data with different Hadoop versions or distributions

With this “de-coupling” of compute and storage, and with Isilon as your “Data Lake”, you can now access the very same data with multiple Hadoop distributions and even different HDFS versions (at the time of writing, Isilon supports almost everything from HDFS 1.0 to HDFS 2.6, and the development team has a strong focus on having new versions ready right after the major distributions ship them).
If you think about this for a moment you’ll agree that this is huge! By pooling your data into Isilon, you get complete freedom over which version of HDFS you want or need to use. You can test new versions, roll back to previous ones, or use another distribution to access the same data simultaneously. Think for a moment about the analogy with the different RDBMS versions in your company which I mentioned above. There is a high probability that you’ll see the same with Hadoop: different Hadoop distributions and versions within the company. With Isilon, that’s no problem. Solved.
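As a rough sketch of what this looks like from the client side, the snippet below lists the same Isilon-backed dataset with two separately installed Hadoop client stacks, for example an older and a newer distribution on the edge nodes. All hostnames, installation paths and the dataset path are assumptions for illustration.

# Sketch: two different Hadoop client installations listing the same Isilon-backed data.
import os
import subprocess

# Assumed locations of two separately installed Hadoop client stacks
# (e.g. different distributions or HDFS versions).
clients = {
    "distro A (older HDFS client)": "/opt/hadoop-distro-a",
    "distro B (newer HDFS client)": "/opt/hadoop-distro-b",
}

# Both clients point at the very same Isilon access zone (assumed SmartConnect name).
dataset = "hdfs://isilon.example.com:8020/analytics/events"

for name, home in clients.items():
    env = dict(os.environ, HADOOP_HOME=home, HADOOP_CONF_DIR=f"{home}/etc/hadoop")
    print(f"--- listing via {name} ---")
    subprocess.run([f"{home}/bin/hdfs", "dfs", "-ls", dataset], env=env, check=True)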

Other Advantages

But there are further advantages:
  • No single point of failure. NameNode requests are served by all Isilon nodes in an active/active manner.
  • Isilon protects data with erasure coding across nodes. That’s much more efficient than just creating multiple copies of each block. The protection level can be set flexibly and dynamically for every directory or pool. If you follow the guidelines, you’ll get a protection overhead between 20% and 30%. That’s much more efficient than native HDFS, where you need to provision 300% of DAS capacity for 3 copies of the data; see the back-of-the-envelope comparison after this list and [3] for more details.
  • You can scale compute and storage independently. Your compute nodes don’t require storage anymore (you might want to use internal disks for the shuffle IO though). If you need compute power, you add servers, if you need capacity, you add Isilon nodes.
  • Most workloads run faster on Isilon [1,4].
  • For some data you can eliminate ingest since the data is already present on Isilon
  • If you need to ingest data, you can do it via POSIX protocols such as NFS, SMB or FTP. IDC found that NFS writes are 4.2 times faster on Isilon, and reads 36 times faster [1,4].
  • Use existing authentication providers such as Kerberos, Active Directory, LDAP etc. for integrated security
  • Isilon balances the data equally across all nodes in the cluster. If you need more capacity, you just add a node. New capacity is available immediately, and rebalancing takes place in the background.
  • Use parallel synchronization over LAN or WAN for a disaster recovery strategy
  • Manage a single large scale-out filesystem (today up to 50 PB) very easily via a WebUI, a CLI (OneFS is based on FreeBSD, so Unix/Linux dudes feel at home) or an API.
  • Use Data Tiering: you can use different Isilon nodes in one cluster and use policy based and transparent data tiering for optimal performance and cost efficiency. For details see [5].
  • Data at Rest Encryption on Isilon is done at drive level. There is almost no performance impact compared to evolving software encryption solutions.
  • Use Isilon de-duplication. It’s running as a background post process and as such doesn’t impact production performance.
  • Use Snapshots. Currently more than 20000 snapshots are supported.
  • Use SEC 17a-4 compliant WORM retention
  • Use certified file system auditing
  • Use Isilon Access Zones to provide Hadoop as a Service securely to multiple tenants.
  • Use existing backup mechanisms.
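To make the capacity argument from the data protection bullet above concrete, here is a small back-of-the-envelope comparison. The 25% erasure-coding overhead is simply a value inside the 20–30% range quoted above; the exact number depends on the chosen protection level and node count.

# Raw capacity needed to store 1 PB of usable data under the two protection schemes.
usable_pb = 1.0

# Native HDFS default: three full copies of every block.
hdfs_replication_factor = 3
hdfs_raw_pb = usable_pb * hdfs_replication_factor       # 3.00 PB of DAS

# Isilon erasure coding: assume 25% overhead (within the 20-30% range above).
isilon_overhead = 0.25
isilon_raw_pb = usable_pb * (1 + isilon_overhead)       # 1.25 PB on the cluster

print(f"HDFS 3x replication : {hdfs_raw_pb:.2f} PB raw for {usable_pb} PB of data")
print(f"Isilon erasure code : {isilon_raw_pb:.2f} PB raw for {usable_pb} PB of data")
print(f"Raw capacity saved  : {(1 - isilon_raw_pb / hdfs_raw_pb) * 100:.0f}%")  # ~58%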

 

Summary

OneFS is a very mature scale-out filesystem that serves data via multiple protocols, including HDFS, to hundreds or thousands of clients. The biggest advantage is that you separate compute and storage and can scale both independently. Most importantly, you can provide access to the data via multiple HDFS versions and distributions at the same time. No matter which version or distribution your users or customers prefer, they can all be served by Isilon as long as the distribution is based on the Apache code base, such as Hortonworks, Pivotal or Cloudera. Instead of building HDFS data silos, Isilon is a great foundation for your Data Lake, with enterprise-grade functionality that integrates well into your datacenter’s infrastructure with respect to security, serviceability and performance.

References

[1] EMC Isilon Scale-out Data Lake Foundation – Essential Capabilities for Building Big Data Infrastructure, IDC White Paper, October 2014.
[2] EMC Isilon OneFS – A Technical Overview, White Paper, November 2013.
[3] High Availability and Data Protection with EMC Isilon Scale-Out NAS, White Paper, November 2013.
[4] Comparing Hadoop Performance on DAS and Isilon, Stefan Radtke, blog post, 2015.
[5] Next Generation Storage Tiering with EMC SmartPools, White Paper, April 2013.
The White Papers mentioned here are all available for download at https://support.emc.com

Acknowledgement

Thanks to my colleague Ryan Peterson, who during a recent discussion brought up the analogy to RDBMS systems and the reasons why companies run different ones.



