Stefan Radtke's Blog: Backup OneFS Data over NFS/CIFS with TSM

In several of my previous posts I have mentioned the shortcomings of NDMP. One of them is the lack of support for an incremental-forever strategy, a feature that TSM users are typically used to. Furthermore, TSM's support for NDMP is well below average compared to other backup software solutions (for example, EMC NetWorker can create and maintain OneFS snapshots, roll them over to tape and index them; watch out for my blog post on this soon).

One way around this is to back up the files in the file system via NFS or CIFS. To avoid the required file system scan (or treewalk), the ideal solution would be something like the isi_change_list feature mentioned in a previous post. However, the first version of that change_list API has not proven very efficient with TSM, so we have to wait for the next version, which we anticipate at the end of 2015. Until then, the only way to accelerate a backup via CIFS/NFS is massive parallelism. General Storage has developed a solution for this called MAGS (Massive Attack General Storage).

MAGS

During backup, TSM scans through file systems, compares their content with what it has already backed up, transfers changed and new files, expires deleted files, updates properties and so on. TSM performs backups in a multi-threaded fashion, spawning up to 128 independent threads (officially 10) for the actual data transfer (resourceutil parameter).
However, TSM does NOT multi-thread effectively when it comes to scanning a file system. Comparing existing files with their backed-up versions may therefore take a very long time, even if little or no data has changed.
Backing up an Isilon Filesystem with TSM could be as easy as entering

dsmc incremental \\my_isilon_cluster\ifs

on any TSM Windows client. Provided the appropriate permissions are in effect, this will work, but it will take a very long time. Depending on the file system structure, network latency, the kind of Isilon cluster nodes and the TSM infrastructure, there will probably be no more than 5,000 to 10,000 file system objects (files and directories) scanned per second. On an Isilon cluster hosting 500,000,000 file system objects, scanning alone would theoretically take roughly 14 to 28 hours. In real life, it usually takes much longer. Working around that “scanning” bottleneck usually involves logically splitting the file system and backing it up with multiple jobs. So instead of running:

dsmc incremental \\my_isilon_cluster\ifs

You could run:

dsmc incremental \\my_isilon_cluster\ifs\my_first_dir
dsmc incremental \\my_isilon_cluster\ifs\my_second_dir
dsmc incremental \\my_isilon_cluster\ifs\my_third_dir
…
dsmc incremental \\my_isilon_cluster\ifs\my_nth_dir

That would certainly speed things up, but you would have to consider the following:

You would have to keep track of your directories: a newly added directory won’t get backed up unless you explicitly add it to your TSM jobs.
You would have to balance manually how many jobs run against which directories. They won’t all be the same size: there will be a couple of very big ones, and the others will be small.
You would have to monitor a potentially large number of jobs, their individual logs etc.
Nothing checks whether your client can handle the number of parallel, memory-hungry jobs you are starting, so you’ll constantly have to tune it yourself.
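The first of these points can be softened with a small script that rebuilds the job list from the directory tree on every run. The following is only a minimal sketch in Python under a few assumptions: the function name per_directory_commands is hypothetical, the share is reachable as a normal path from the client, and the script merely prints command lines in the same form as above instead of starting them.

```python
from pathlib import Path


def per_directory_commands(root):
    """Return one 'dsmc incremental' command line per immediate subdirectory of root.

    Illustration only: the command lines are returned as strings, nothing is executed.
    """
    subdirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return ['dsmc incremental "{}"'.format(p) for p in subdirs]
```

Calling per_directory_commands(r"\\my_isilon_cluster\ifs") on a Windows client would list one command per top-level directory. Note that files sitting directly in the root are not covered by any of these jobs, and the other three points (balancing, monitoring, memory) remain unsolved.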

So for the time being, General Storage has developed MAGS to address these issues and to automate the massively parallel approach to backup. It requires one or more Windows servers, on which MAGS runs as a TSM client wrapper. It is started exactly like TSM’s own dsmc:

mags incremental \\my_isilon_cluster\ifs

Then, MAGS performs the following steps:

It scans the file system recursively to a configurable depth (e.g. 6 levels, which usually takes no more than a few seconds).
It starts one TSM client per sub-tree found, running as many of them in parallel as the memory of the machine it runs on can handle (the maximum number of jobs and the memory threshold are configurable).
It preserves the entire file system structure in a single, consistent view. For the user running a restore via the standard TSM GUI or command line, there is only one TSM node to address and only one tree to search.
It can spread its TSM client sessions across more than one Windows machine (the practical limit is about 8).
It can be scheduled via the standard TSM client scheduler, logs to a single schedlog/errorlog and ends in a single return code.
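The first two steps and the single return code can be illustrated with a short Python sketch. This is not how MAGS is implemented; it only shows the general technique under stated assumptions. The function names subtrees_at_depth and run_jobs are hypothetical, one dsmc incremental process is started per sub-tree, parallelism is bounded by a fixed job count only (the memory threshold is not modelled), and taking the highest return code as the overall result is my own assumption.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def subtrees_at_depth(root, depth):
    """Return the directories `depth` levels below root.

    Directories without subdirectories are returned as they are,
    even if they sit above the requested depth.
    """
    if depth == 0:
        return [root]
    with os.scandir(root) as entries:
        children = sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))
    if not children:
        return [root]
    found = []
    for child in children:
        found.extend(subtrees_at_depth(child, depth - 1))
    return found


def run_jobs(subtrees, max_jobs, command=("dsmc", "incremental")):
    """Run one backup process per sub-tree, at most max_jobs at a time.

    Returns (per-path return codes, overall return code).
    ASSUMPTION: the overall code is the highest individual code.
    """
    def run_one(path):
        return path, subprocess.run([*command, path]).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        results = dict(pool.map(run_one, subtrees))
    return results, max(results.values(), default=0)
```

A run would look like run_jobs(subtrees_at_depth(r"\\my_isilon_cluster\ifs", 6), max_jobs=8). The sketch leaves out what makes the real tool useful: files that live in directories above the split depth are not picked up by any job, nothing is merged into a single log, and there is no distribution across several machines.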

MAGS usually shortens backup times to 10-30% of what a “normal” incremental would take, depending on the infrastructure and other bottlenecks associated with TSM servers, network etc. Some large customers are using it already and it seems to do a good job. General Storage plans to integrate version 2.0 of Isilon’s change_list API once it is available and tested. This should shorten the scan time dramatically and will most probably also reduce the resources required on the TSM client machines.
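For a rough sense of scale, the scan rates quoted earlier (5,000 to 10,000 objects per second for 500,000,000 objects) can be combined with the 10-30% figure. The following back-of-the-envelope calculation in Python is only a sketch: it assumes a constant scan rate and, as a simplification of my own, that the scan dominates the run so the 10-30% can be applied to the scan time directly.

```python
def scan_hours(objects, objects_per_second):
    """Hours needed to scan `objects` file system objects at a constant rate."""
    return objects / objects_per_second / 3600


OBJECTS = 500_000_000                # object count from the example in the text
slow = scan_hours(OBJECTS, 5_000)    # lower scan rate from the text
fast = scan_hours(OBJECTS, 10_000)   # upper scan rate from the text
print(f"single job: {fast:.1f} to {slow:.1f} hours")

# ASSUMPTION: the scan dominates the run, so 10-30% is applied to the scan time.
best = 0.10 * fast
worst = 0.30 * slow
print(f"at 10-30% of that: {best:.1f} to {worst:.1f} hours")
```

Under these assumptions the single job needs about 13.9 to 27.8 hours for the scan alone, and the parallel run about 1.4 to 8.3 hours. Real numbers will depend on the bottlenecks mentioned above.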

Figure 1: Workflow of massive parallel OneFS backups with TSM using MAGS

Requirements

At least one Intel-based Windows Server 2008 R2 or 2012 machine with at least 16 GB RAM
Microsoft .NET 4.5
TSM Windows Backup/Archive Client V6.3 or newer
EMC Isilon with configured SMB shares for the data to back up
At least 3 GB of free disk space on each machine running MAGS

Impact of SSDs for meta data acceleration

File system scans, as well as all other metadata operations, can be accelerated considerably by using SSDs for metadata (read/write) acceleration in OneFS. Until recently, SSDs were not supported in NL nodes, which are typically used as backup/archive targets. The restriction was not a technical one, and EMC has recently announced that SSDs can now be used in NL and HD nodes as well. This is good news, because even a single SSD per node may speed up scans quite significantly.

More Info

Thanks to Lars Henningsen from General Storage for providing the information. If you are interested in MAGS, you can drop me a mail and I’ll forward it to the right people, or you can contact General Storage directly at http://www.general-storage.de

Stefan on linkedin:
https://de.linkedin.com/in/drstefanradtke/

23 January 2015