07 October 2015

EMC Isilon vs. IBM Elastic Storage Server, or how IBM performs an apples-to-apples comparison

 
Readers of this blog know that I share best practices and client experiences here. In the post Isilon as a TSM Backup Target – Analyses of a Real Deployment [8], I described the “before and after” situation of an Isilon deployment for an IBM Spectrum Protect (formerly Tivoli Storage Manager, TSM) solution. Those results were simply a view of what a production workload looked like and how throughput and the resulting backup windows evolved.

Interestingly, IBM, intent on positioning its IBM Spectrum Scale/Elastic Storage Server as the better solution, hired a marketing firm to evaluate a performance benchmark of the IBM Elastic Storage Server (ESS) against the numbers in that post and to publish the outcome as a white paper [1]. The results highlighted in the paper indicate that IBM Spectrum Scale is 11 times faster than a similar EMC Isilon configuration. I can only guess why IBM did not publish this itself rather than paying a marketing firm to do it. I assume IBM is too serious to publish a comparison between snapshots of an averagely loaded production environment and a prepared benchmark designed to show the maximum performance of its own solution.


Not a like-for-like workload comparison

The results published in my post do not by any means show the limits of the platform. They were influenced by external clients, additional server and network traffic, and so on. Nothing was said about server and storage utilization or about any other potential limits or bottlenecks. It’s obvious that the authors of the white paper did not read my blog; otherwise it would be hard to explain how they could accept such a comparison.

In summary: IBM sponsored a white paper that compared an early customer workload in a new production environment with a well-prepared benchmark and called it a “similar environment”.

 

IBM used three times the number of disks plus InfiniBand and called that a “similar configuration”

Besides the fact that this was not a benchmark-to-benchmark comparison, several other points are worth mentioning:
  • The IBM GPFS Storage Server contained 348 NL-SAS disks, whereas the Isilon system consisted of just three NL400 nodes with 108 SATA disks in total.
  • The TSM servers were connected to the GPFS storage servers through 56 Gb/s InfiniBand, whereas 10 Gb/s Ethernet was used to connect to the Isilon nodes. I hardly need to emphasize that InfiniBand networks are not widely deployed in commercial production environments.
  • The load on the IBM system was generated by TSM clients running co-located on the TSM servers. They generated benchmark data that was not read from disk but pumped directly to the servers.
  • In the environment I described in my blog post, the clients were connected via Ethernet, which adds latency and means shared network resources, and they backed up real file systems incrementally.
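
To put the hardware figures from the list above into proportion, here is a minimal sketch that simply divides the quoted numbers. The variable names are illustrative, and the link ratio compares nominal line rates only, not measured throughput.

```python
# Figures quoted in the list above; the names are illustrative only.
ess_disks = 348         # NL-SAS disks in the IBM GPFS Storage Server
isilon_disks = 108      # SATA disks across the three NL400 nodes
isilon_nodes = 3
ess_link_gbps = 56      # nominal link speed to the GPFS storage servers
isilon_link_gbps = 10   # nominal Ethernet link speed to the Isilon nodes

disk_ratio = ess_disks / isilon_disks
link_ratio = ess_link_gbps / isilon_link_gbps
disks_per_isilon_node = isilon_disks // isilon_nodes

print(f"Disk count ratio (ESS : Isilon): {disk_ratio:.1f} : 1")
print(f"Nominal link speed ratio:        {link_ratio:.1f} : 1")
print(f"Disks per NL400 node:            {disks_per_isilon_node}")
```

Neither ratio says anything about achievable throughput on its own, but together they show how far apart the two setups were before any workload was applied.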

 

The issue with IBM Elastic Storage Server is not performance – it’s the complexity

I worked for Big Blue for 17 years, and there is no doubt that the core of ESS, Spectrum Scale (formerly GPFS), is a scalable file system. For good reason, it is installed at a number of High Performance Computing sites, where the expert skills required to bring up a GPFS cluster and keep it running are available. This is typically not the case in commercial environments, where IT staff have limited resources and the business requires solutions that are simple to manage rather than complex. I know various customers that have moved or are moving from GPFS to Isilon, and they all prefer Isilon because it is simple to install and maintain.

 

Gartner’s view on the IBM Elastic Storage Server vs. EMC Isilon

Here is an original statement from Gartner regarding IBM Elastic Storage [2]: “[..] Elastic Storage lacks features such as built-in de-duplication, compression and thin provisioning. Although IBM has made improvements by modeling the graphical user interface (GUI) after the popular XIV interface, overall manageability continues to be complex.”

In the same paper, they write about Isilon: “Among the distributed file systems for scalable capacity and performance on the market, Isilon stands out, with its easy-to-deploy clustered storage appliance approach and well-rounded feature sets. The product includes a tightly integrated file system, volume manager and data protection in one software layer; [..]”

 

IDC findings

Back in 2011, IDC evaluated the business impact of scale-out NAS solutions [7]. It put the OPEX savings with Isilon at 48% compared to traditional solutions. I am confident that this is the result of Isilon’s integrated architecture and the resulting simplicity. This is confirmed by customers who responded to IDC in interviews:

“With Isilon, storage provisioning takes maybe an hour a year. Before it took three to four hours a week. Storage allocation used to take five to six hours a week, and now it takes six hours a year.”

“We're much more competitive because of Isilon and we're winning more jobs. Our three biggest competitors have all bought Isilon [solutions] since we did.”

“Isilon allows us to manage petabytes of storage with a tiny staff and scale easily so we don't have to worry about creating and managing volumes, or managing a bunch of other things that create costs.”

“As we grow, we can add a node in 60 seconds, which means we can take on large customers and also be more responsive to existing customers.”

Although IBM has now come up with a GUI that is intended to make some management tasks more intuitive, this does not mean that the architecture as a whole has been simplified. Let’s look in some detail at why IBM Spectrum Scale is still complex.
 

 

The complexity of Elastic Storage Server – it starts from the bottom: RAID!

Even though GPFS Native RAID (GNR) has removed some limitations of traditional hardware RAID implementations, it is still RAID, and it needs to be understood, configured and maintained. The Administration Guide for GNR [5] has 262 pages, just for GNR. There are a lot of concepts and details to learn and consider before you can even start to implement the other components of the cluster. Some important GNR concepts are based on entities like
  • declustered arrays
  • recovery groups
  • pdisks
  • pdisk-group fault tolerance
  • pdisk paths
  • vdisks
  • Log vdisks
  • GPFS Native RAID vdisk configuration data (VCD)
  • VCD spares
  • RAID Codes
  • Block size
  • vdisk size
Many of these entities depend on each other, for example [5]:

Vdisks are created within declustered arrays, and vdisk tracks are declustered across all of an array's pdisks. A recovery group may contain up to 16 declustered arrays. A declustered array can contain up to 256 pdisks (but the total number of pdisks in all declustered arrays within a recovery group cannot exceed 512). A pdisk may belong to only one declustered array. The name of a declustered array must be unique within a recovery group; that is, two recovery groups may each contain a declustered array named DA3, but a recovery group cannot contain two declustered arrays named DA3. The pdisks within a declustered array must all be of the same size and should all have similar performance characteristics.
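
To illustrate how many rules are packed into that single paragraph, here is a minimal sketch that encodes them as a consistency check. It is not an IBM tool or API: the classes, the function name and the pdisk names are hypothetical, and the check that a pdisk belongs to only one declustered array is applied within one recovery group only.

```python
from dataclasses import dataclass, field

# Limits quoted from the GNR documentation excerpt above.
MAX_ARRAYS_PER_GROUP = 16
MAX_PDISKS_PER_ARRAY = 256
MAX_PDISKS_PER_GROUP = 512


@dataclass
class DeclusteredArray:
    name: str
    pdisk_sizes: dict[str, int]  # hypothetical layout: pdisk name -> size in GB


@dataclass
class RecoveryGroup:
    name: str
    arrays: list[DeclusteredArray] = field(default_factory=list)


def check_recovery_group(rg: RecoveryGroup) -> list[str]:
    """Return a list of rule violations (empty if the layout is consistent)."""
    problems = []
    if len(rg.arrays) > MAX_ARRAYS_PER_GROUP:
        problems.append(f"{rg.name}: more than {MAX_ARRAYS_PER_GROUP} declustered arrays")

    # Array names must be unique within a recovery group.
    names = [da.name for da in rg.arrays]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"{rg.name}: duplicate array name {name}")

    seen_pdisks = set()
    total_pdisks = 0
    for da in rg.arrays:
        total_pdisks += len(da.pdisk_sizes)
        if len(da.pdisk_sizes) > MAX_PDISKS_PER_ARRAY:
            problems.append(f"{da.name}: more than {MAX_PDISKS_PER_ARRAY} pdisks")
        # All pdisks of one array must have the same size.
        if len(set(da.pdisk_sizes.values())) > 1:
            problems.append(f"{da.name}: pdisks differ in size")
        # A pdisk may belong to only one declustered array.
        for pdisk in da.pdisk_sizes:
            if pdisk in seen_pdisks:
                problems.append(f"{pdisk}: belongs to more than one array")
            seen_pdisks.add(pdisk)

    if total_pdisks > MAX_PDISKS_PER_GROUP:
        problems.append(f"{rg.name}: more than {MAX_PDISKS_PER_GROUP} pdisks in total")
    return problems


# A deliberately broken, made-up layout to exercise the rules.
rg = RecoveryGroup("rgL", [
    DeclusteredArray("DA1", {"e1d1s01": 2000, "e1d1s02": 2000}),
    DeclusteredArray("DA1", {"e1d1s02": 2000, "e1d1s03": 4000}),
])
problems = check_recovery_group(rg)
for problem in problems:
    print(problem)
```

The sketch leaves out the softer requirement that pdisks in an array "should all have similar performance characteristics", which cannot be checked from names and sizes alone.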

As we can see, there are many low-level concepts that an administrator needs to understand before he or she can configure a reliable and balanced system. And the level of detail the admin needs to consider goes even further. For example, when creating a declustered array, several attributes need to be configured for each array [5, p. 24]:

  • dataSpares
    The number of disks' worth of equivalent spare space used for rebuilding vdisk data if pdisks fail. This defaults to one for arrays with nine or fewer pdisks, and two for arrays with 10 or more pdisks.
  • vcdSpares
    The number of disks that can be unavailable while the GPFS Native RAID server continues to function with full replication of vdisk configuration data (VCD). This value defaults to the number of data spares. To enable pdisk-group fault tolerance, this parameter is typically set to a larger value during initial system configuration (half of the number of pdisks in the declustered array + 1, for example).
  • replaceThreshold
    The number of disks that must fail before the declustered array is marked as needing to have disks replaced. The default is the number of data spares.
  • scrubDuration
    The number of days over which all the vdisks in the declustered array are scrubbed for errors. The default is 14 days.
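
The default rules quoted above can be written down in a few lines. The following is only a sketch of those rules as I read them from the excerpt: the class and function names are hypothetical and not part of any IBM interface, and rounding "half of the number of pdisks" down to a whole number is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class DeclusteredArrayDefaults:
    data_spares: int
    vcd_spares: int
    replace_threshold: int
    scrub_duration_days: int


def default_attributes(num_pdisks: int) -> DeclusteredArrayDefaults:
    """Defaults as described in the excerpt above (illustrative helper only)."""
    if num_pdisks < 1:
        raise ValueError("a declustered array needs at least one pdisk")
    # One spare for arrays with nine or fewer pdisks, two for ten or more.
    data_spares = 1 if num_pdisks <= 9 else 2
    return DeclusteredArrayDefaults(
        data_spares=data_spares,
        vcd_spares=data_spares,         # defaults to the number of data spares
        replace_threshold=data_spares,  # defaults to the number of data spares
        scrub_duration_days=14,
    )


def vcd_spares_for_fault_tolerance(num_pdisks: int) -> int:
    """The example value from the excerpt: half of the pdisks plus one.

    Integer division is an assumption; the excerpt does not say how to round.
    """
    return num_pdisks // 2 + 1


# A made-up array size, just to show the helpers in use.
print(default_attributes(58))
print(vcd_spares_for_fault_tolerance(58))
```

Even in this simplified form, three of the four attributes depend on the pdisk count, so every change to an array's size means revisiting them.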
I’ll stop describing the details of the GPFS Native RAID architecture here; they are all well explained in the IBM product documentation. This should already be enough to show that the concept is extremely complex compared to EMC’s well-integrated Isilon architecture.

Let’s look at the task level to understand what the complexity of ESS means for management:

Tasks to set up GNR on ESS

Just to set up GNR on the Elastic Storage Server, you conceptually need to perform the following steps [5, p. 47]:
  • Configuring GNR recovery groups on the ESS
  • Preparing ESS recovery group servers
    - Disk enclosure and HBA cabling
    - Verifying that the GL4 building block is configured correctly
  • Creating recovery groups on the ESS
    - Configuring GPFS nodes to be recovery group servers
    - Defining the recovery group layout
  • Defining and creating the vdisks
  • Creating NSDs from vdisks
Once you have performed all these tasks, you have merely created some NSDs for GPFS. You can now start to implement IBM Spectrum Scale (GPFS) on top. I’ll stop here again rather than go into more detail; I just wanted to illustrate the complexity with an example at the management level. The fact remains that volumes can hardly be managed at scale, be it with hardware or software RAID, be it native or declustered.
 

 

Isilon has a well-integrated RAID, Volume Manager and Filesystem for Simplicity

In contrast, Isilon has a well-thought-out concept that integrates RAID, volume manager and file system. There is no need to deal with any of the above concepts, as there are no RAID arrays or volumes/vdisks in Isilon. There is just a single file system that is available right after you boot up the appliance. Data as well as parity data is spread across all nodes and disks in the system. The protection level can easily be set at the cluster, pool, directory or even file level, and it can be changed on the fly at any time. The training required to administer an Isilon cluster takes just a couple of hours, if any is needed at all.
Figure 1: OneFS has integrated RAID, Volume Manager and File System which makes it extremely easy to manage.
For further details on the Isilon and OneFS Architecture please refer to [6].
 

 

Adds to complexity: Lack of native Multi-Protocol support on the IBM Elastic Storage Server

As we have seen from the Edison Group paper [1], IBM installed GPFS clients on the TSM servers. On the one hand, this provides efficient and fast communication by leveraging the proprietary NSD protocol. On the other hand, it is required because the ESS lacks native support for any standard protocol. If your applications require NFS or CIFS support, you need to add a protocol server that has a GPFS client installed and serves the application via NFS or SMB. This is not something I would consider a scale-out solution. How would things like cluster locking be done in such a case? Install Samba and the clustered trivial database? That concept has already failed with SONAS and is not even supported by IBM. Accordingly, you must install and maintain GPFS clients on all your application servers that require access to the shared file system. You’d have Windows and/or Linux servers to maintain in addition to the ESS.

In addition, you need one management server running xCAT and another one for the Hardware Management Console. The two interact with each other for various management and monitoring tasks.
You may agree that another full-time employee would be required to keep the zoo running, with many problems on the horizon:
  • Manage the IBM Spectrum Scale file system, potentially hundreds of declustered RAID arrays, volumes, NSDs and clients
  • Consider interdependencies between OS levels, GPFS client/server versions and TSM versions
  • Non-disruptive upgrades of all GPFS clients and servers (this might be possible in theory, but in practice I haven’t seen it)
  • Cluster-aware locking
  • Homogeneous monitoring, reporting, auditing, authentication, security, ...

Isilon, on the other hand, has a rich set of natively supported protocols such as NFS3, NFS4, SMB2, SMB3 Multi-Channel, FTP, HTTP, NDMP, SWIFT, etc. No external protocol servers are required, and neither are plugins, connectors and the like. All protocols are implemented natively, even HDFS. Everything is tightly integrated in terms of authentication and authorization, supporting multiple instances of Active Directory, Kerberos, LDAP, NIS and local providers. For more information see [4].

 

The ESS dual server building block vs. Isilon Node types

The IBM Elastic Storage Server has been built around the concept of a ‘building block’. Each building block always contains two IBM Power S822L servers (one of which has a storage enclosure attached). You can choose between two model lines (GS and GL) with two types and various numbers of SAS storage enclosures. Whatever use case you are planning for, a building block always has two of the very same servers. Within the rack you can scale to something like a petabyte of storage, but it will still remain two servers. That looks to me like a legacy dual-controller concept rather than a scale-out architecture (at least within the rack; you might be able to add additional servers, but they would require additional SAS enclosures and cabling). It also seems to me that the one-size-fits-all strategy (it’s always the same server type, regardless of the workload) might not be the most cost-efficient one. With Isilon you have the choice of four node types, ranging from very fast S-nodes to the densest HD-nodes. You can choose the node type that fits your workload most efficiently. Nodes can be mixed, and data is placed or tiered by policies (the concept of tiering is quite similar in GPFS and OneFS).

 

Using an HPC-like setup for backup/archive or an easy-to-manage EMC Isilon cluster

As the evaluation report [1] illustrates, IBM is selling an HPC-like solution with impressive performance numbers as a backup/archive solution. The solution is based on a cluster where all clients (in this case the TSM servers) have to be members of the cluster. The interconnect technology is InfiniBand, which is well suited to and widely used in HPC environments. Besides the complexity that users need to manage, I’d be curious to see whether this is a cost-effective solution for backup and archive purposes.

The EMC Isilon solution is based on commodity components in a well-integrated and easy-to-use appliance. The interconnect technology between the TSM servers and the Isilon cluster is 10 Gigabit Ethernet, which is typically available in every data center. With dsmISI [9] for IBM Spectrum Protect, the whole operation of IBM Spectrum Protect becomes even simpler and is optimized in a fully automated manner.

Further differences to consider

There are many more aspects to consider when customers look for a solution, for example:
  • Reliability of the Architecture
  • Serviceability
  • Solution certification for 3rd Party solutions
  • Multi-Tenancy
  • Compliance/Auditing
  • Antivirus support
  • Monitoring/Reporting Tools
  • Integration into VMware and other environments
  • Maturity of the solution, number of shipped systems
  • Security
As time permits, I may follow up on one or another of these topics in more detail. Stay tuned.

 

Summary

The competitive evaluation report by the Edison Group [1] compared the workload profile of a real production environment with a specifically tailored benchmark that IBM performed to demonstrate that IBM Elastic Storage Server is better suited as a backup target for IBM Spectrum Protect. Besides not comparing similar configurations or using real Isilon maximum performance values, the evaluation report did not consider the biggest issue with IBM Elastic Storage Server and IBM Spectrum Scale: complexity. While it may provide high throughput values, the solution requires a very high degree of skill and administration effort compared to EMC Isilon. A similar finding can be read in the Gartner Research Note [2].

 

Acknowledgements

Thanks to Matthias Radtke and Lars Henningsen for reviewing my writing and providing useful comments.

 

References and Further Reading

[1] The Edison Group: IBM® Spectrum Scale™ vs EMC Isilon for IBM® Spectrum Protect™ Workloads; A Competitive Test and Evaluation Report;  https://www-01.ibm.com/marketing/iwm/iwm/web/….

[2] Gartner Research Note: Critical Capabilities for Scale-Out File System Storage, January 2015

[3] IDC Market Scape - Scale-Out File–Based Storage Market, January, 2013

[4] IDC Lab Validation Brief: EMC Isilon Scale Out Data Lake Foundation, Essential Capabilities for
     building Big Data Infrastructure, October 2014

[5] Product Documentation: GPFS Native RAID, Version 4 Release 1.0.5, Administration

[6] EMC Isilon – OneFS – A Technical Overview, White Paper, November 2013

[7] Quantifying the Business Benefits of Scale-Out NAS Solutions, IDC White Paper,
     November 2011

[8] Isilon as a TSM Backup Target – Analyses of a Real Deployment,
      Blog Post, http://stefanradtke.blogspot.com/2014/06/isilon-as-tsm-backup-target-analyses-of.html

[9] How to optimize Tivoli Storage Manager operations with dsmISI and the OneFS Scale-Out File System.
     Blog Post, http://stefanradtke.blogspot.com.es/2015/06/how-to-optimize-tsm-operations-with.html

5 comments:

  1. Now many organizations used elastic storage server system for storage their data. We can save our data long time with safe for long time. Now many companies provide this type of server system. I know SuperXpert provides good cloud server system in USA to government, private, big, small and medium enterprises.

  2. Thank you for sharing this blog. It is very helpful and very informative. I really enjoyed reading this blog, it definitely helped me a lot to enhance my knowledge.

  3. Spectrum Scale is a sw, so true SDS.
    But if you want to buy packaged building blocks we created the ESS. As far as the complexity it is taken care of by our Lab Services team(Senior Consulting) as 5 days of their time is included / ESS.

    The only thing customer need to do is normal daily admin tasks, from the GUI, which has been enhanced twice since this was written.

    Replies
    1. Hi Qurt, thanks for the comments! They support what I have said: it is complex. Think about it: even an IBM Senior Consultant needs 5 days for implementation!! He is trained and has done this many times... but still FIVE days! That backs up my own and my customers’ experiences with the solution. Installation is one thing, but management must be done by the customer (think about SW updates, firmware updates, disk firmware updates, patches, cluster topology changes, extensions...). I know it has got somewhat better compared to the very early days. But as you said, Spectrum Scale is SW, so YOU (the customer) must take care of all the integration and management work! That is the biggest difference between the two platforms.

  4. Nice to see your post, this is a great platform to get some useful information and facts!
