28 June 2015

How to optimize Tivoli Storage Manager operations with dsmISI and the OneFS Scale-Out File System

In a previous blog, Using Isilon as a Backup Target for TSM*), I explained why Isilon is a fantastic target for backup and archive data. While our field experience has matured through many successful implementations, there are still things that need improvement. In particular, TSM's unawareness of scale-out file systems has several side effects. In this blog I'll explain these side effects and how they can be solved with dsmISI, a solution developed by General Storage.

*) IBM has recently re-branded some of its products. Tivoli Storage Manager (TSM) also got a new name: it is now called IBM Spectrum Protect (TM). However, since the name TSM has been known in the user community for decades, I'll continue to use it in this article.

The Manual Approach

Consider the setup illustrated in figure 1.
Figure 1: Two TSM Servers, connected via 10 Gigabit Ethernet to a 5 node Isilon cluster.

Isilon provides a smart and dynamic load balancing feature called SmartConnect [1]. SmartConnect works as an advanced DNS service which can be configured to consider various metrics for dynamic load balancing. However, this does not work well with TSM, because TSM requires NFS mounts on Unix-type systems or SMB/UNC connections on Windows, which are static by nature. Once a mount has been performed, it stays for a very long time. In the worst case, all mounts from a TSM server are performed at the same time, which may result in a situation where all mounts end up on the very same Isilon node. From Isilon's backend perspective, this is not a problem, since OneFS distributes all data blocks equally across all nodes via the internal InfiniBand network. But two other problems would arise here:
  1. The data transfer would only utilize one out of five available network links.
  2. TSM is not aware of the fact that it ‘speaks’ to a scale-out file system. As a result, it would always go back to the same node if a restore is required.
We would address issue 1 by creating one subdirectory and mount point per Isilon node for TSM volumes:

mount node1:/ifs/tsm /tsm/n1
mount node2:/ifs/tsm /tsm/n2
mount node3:/ifs/tsm /tsm/n3
mount node4:/ifs/tsm /tsm/n4
mount node5:/ifs/tsm /tsm/n5
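Since the mounts (and the matching directory list for the device class) follow a simple pattern, they can be generated with a small script. The sketch below only prints the mount commands instead of executing them; node names and paths are taken from the example above:

```shell
# Dry-run sketch: print one mount command per Isilon node and build the
# comma-separated directory list for the TSM FILE device class.
EXPORT=/ifs/tsm
DIRS=""
i=1
for NODE in node1 node2 node3 node4 node5; do
    MNT="/tsm/n$i"
    echo "mount $NODE:$EXPORT $MNT"   # would be executed as root
    DIRS="${DIRS:+$DIRS,}$MNT"        # append with comma separator
    i=$((i + 1))
done
echo "DIRECTORY=$DIRS"
```

Running it prints the five mount commands shown above plus the list /tsm/n1,/tsm/n2,/tsm/n3,/tsm/n4,/tsm/n5 that can be used in the device class definition.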

To make TSM aware of this, we add all mount points to the device class definition in TSM:
define devclass <name> devtype=file … 
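Spelled out, such a device class definition might look like the following (the class name and the capacity/mount parameters are illustrative; the essential part is the comma-separated directory list covering all five mount points):

```
define devclass isifile devtype=file mountlimit=20 maxcapacity=50G directory=/tsm/n1,/tsm/n2,/tsm/n3,/tsm/n4,/tsm/n5
```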

By doing this, we make sure that the TSM server uses all network paths and thus utilizes all Isilon resources equally to store data. TSM would place its volumes in all the above subdirectories and decide which path to use next by examining how much capacity is left in each directory (file system). Since TSM is unaware that /tsm/n1 to /tsm/n5 reside on the same file system, it would always continue to distribute the capacity equally, because it always sees the same free capacity on all nodes (which is a desired behavior of scale-out file systems). At this point all is fine, as this is what we want. The distribution of workloads may not be perfect, because TSM's balancing doesn't consider the individual throughput per path, but it is still a lot better than using just a single Isilon node.

The side effect of TSM’s unawareness of a scale-out filesystem

When extending the cluster of a scale-out file system with additional nodes, it becomes apparent that TSM is not aware that all its volumes reside in the same file system. Assume that we extend the Isilon cluster with a 6th node. The first thing we would do is add a new mount point on the TSM server and add the path to the device class definition. Since we have a scale-out file system, TSM would continue to ‘see’ the same free capacity in all directories. Therefore, it would continue to store the same portion of data on each node. This is illustrated in figure 2.
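With the manual approach, adding the 6th node thus means one more mount plus an update of the device class directory list, along these lines (a sketch; the update devclass command replaces the directory list with the one given):

```
mount node6:/ifs/tsm /tsm/n6
update devclass <name> directory=/tsm/n1,/tsm/n2,/tsm/n3,/tsm/n4,/tsm/n5,/tsm/n6
```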


Figure 2: TSM’s capacity load balancing on a scale-out filesystem after a new node has been added to the system.

At times t1 and t2 we have 5 nodes in the cluster. Over time, all TSM volumes on all nodes get filled equally. Now, just a moment after t2, we add a new Isilon node. As said, TSM would continue to load balance equally, since it has no idea that all paths end up in one file system and it sees the same amount of free capacity on all paths. Assume that we back up 25TB between t2 and t3. Due to TSM’s dumb distribution, we’d have 75TB on node1 to node5 and only 5TB on node6 at time t3. Please be aware that this is just the logical view from TSM. From the backend perspective, OneFS makes sure that the physical data is distributed to all nodes. However, since TSM is not aware of the OneFS-internal load balancing and of the fact that it could access all its volumes through all Isilon nodes, it would always go back to node1 to access/read data that it has stored via that network path.

As we typically add more and more Isilon nodes over time, the access patterns (not the physical data) would become increasingly unbalanced, as more and more of the older TSM volumes would reside on the nodes that were present when the cluster was initially set up (from TSM’s perspective). Figure 3 illustrates this behavior.


Figure 3: Due to TSM’s unawareness of scale-out file systems, it would continue to balance capacity equally, resulting in more volumes being stored on the first (oldest) nodes in the cluster. Consequently, subsequent restores and other data access may become unbalanced.

Addressing the issue with dsmISI

The unawareness of TSM regarding scale-out file systems is not unique to Isilon. It’s a TSM issue that also applies to any other scale-out file system, such as GPFS, the IBM Elastic Storage Server or similar systems. To my knowledge, Isilon is the only platform where the problem has been solved with a software solution called dsmISI.

dsmISI has been developed by General Storage and runs as a daemon on the TSM server. It supports UNIX, Linux and Windows and is only available for Isilon. The mounts described above are managed automatically by dsmISI, which acts as an application-aware load balancer for TSM. Consequently, the device class definition for TSM gets very simple: it contains just a single directory such as /tsm1. During runtime, dsmISI load balances the data based on factors such as CPU load and network response times. This addresses the described issue, as TSM’s dumb round robin is avoided. The cluster will be balanced at all times, including restores, which will be made from the fastest available nodes instead of the node to which TSM originally backed up the data.
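The resulting device class definition is then as simple as this (class name illustrative, directory as in the text; dsmISI handles the per-node paths behind the scenes):

```
define devclass <name> devtype=file directory=/tsm1
```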


Figure 4: TSM server running dsmISI with dynamic load balancing

dsmISI simplifies TSM management

As mentioned, the TSM device class definition gets very short, and you don’t need to change it when you add new Isilon nodes. Because dsmISI actively communicates with the cluster, it recognizes new nodes, creates the required mounts automatically, and includes them in the load balancing. In summary, dsmISI does the following things:

  • Dynamic multipathing: the TSM data stream automatically uses the network paths with the lowest latencies. This works for a single TSM server instance as well as for many servers on many machines writing and reading data to and from the same Isilon cluster simultaneously.
  • It creates active NFS/SMB connections to all Isilon nodes automatically
  • It automatically detects failures, removals, and additions of nodes in Isilon clusters
  • It supports TSM versions 5, 6, and 7 under Linux, UNIX, and Windows
  • It supports multiple Isilon systems
  • It is installed on the operating system for the TSM server as a daemon/service. If it fails, backup and recovery operations will continue to run but without dynamic balancing.
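Conceptually, the dynamic multipathing can be pictured as repeatedly choosing the path with the lowest current latency. The following toy sketch (not dsmISI’s actual algorithm, which is proprietary, and with made-up latency values) illustrates the idea:

```shell
# Toy illustration: given per-node response times in milliseconds,
# select the node with the lowest latency for the next data stream.
# In reality, such metrics would be measured continuously.
LATENCIES='node1 12
node2 7
node3 9
node4 25
node5 11'
BEST=$(printf '%s\n' "$LATENCIES" | sort -k2 -n | head -n 1 | awk '{print $1}')
echo "next stream goes to: $BEST"
```

With the sample values above, the next stream would go to node2; as latencies change, subsequent streams shift to whichever node currently responds fastest.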


dsmISI has been installed by many customers, ranging from large automotive companies, retailers and manufacturers to international banks. Over the past couple of years, multiple petabytes of TSM backup data have been backed up with dsmISI to Isilon. It is very mature, and we have received very positive feedback from customers using it.


Thanks to Lars Henningsen from General Storage for various discussions around this topic. Thanks to Matthias Radtke for review and improving my non-native language writing.



For questions and any comments feel free to connect to the author via LinkedIn: https://de.linkedin.com/in/drstefanradtke

You may also contact General Storage for support, commercial or other topics.

References and further reading

[1] Isilon External Network Connectivity Guide - Routing, Network Topologies, and Best Practices for SmartConnect, White Paper, April 2015. https://support.emc.com/docu58740_Isilon_External_Network_Connectivity_Guide_-_Routing,_Network_Topologies,_and_Best_Practices_for_SmartConnect.pdf?language=en_US

[2] Isilon Community Forum: https://community.emc.com/community/products/isilon

