27 December 2016

Isilon Search: Mining User-Generated Data on OneFS in Real Time

 
During an average week, an ‘interaction worker[1]’ spends 19% of their time searching for and gathering information[2]. Another source estimates that in 2013 content searches cost companies over $14,000 and nearly 500 hours per worker[3]. An efficient tool that assists in this process can therefore deliver a considerable ROI.

On Dell EMC’s Isilon Scale-Out NAS platforms, users generate petabytes of unstructured data and billions of files. In addition to the data created by individual users, machine-generated data is exploding due to the growing number of sensors, log files, security devices, etc. To make it possible to mine and search these growing data lakes (imagine: the average size on Isilon clusters is approaching 1PB!), Dell EMC is working on a search appliance that indexes data on OneFS in real time and allows users and administrators to search metadata and content in a fraction of a second. Alongside increasing corporate efficiency through search, we are embarking on a journey to mine and analyze user-generated data and leverage it to create additional business value.

In its first version, the planned features are:

  • Index files from multiple Isilon clusters.
  • Search for files by name, location, size, owner, file type, and date.
  • Index files within containers such as zip and tar files.
  • Perform a targeted full content index (FCI) on search results to view a preview of the content and search for keywords and content inside.
  • Perform advanced search queries including symbols, wildcards, filters, and operators.
  • Preview and download content.

For example, administrators and end-users can execute the following use-cases on Isilon arrays:
  • As an End-user, find all my MS-Word files from last year, then index the full content of those files, and show me all the files that contain ‘project Andromeda’
  • As an End-user, show me a chart of how my files break down by size and/or last-accessed date
  • As an Admin, find all PDF files owned by corp/user1 that were modified in the first three months of this year, compact them, and export them to a specified location
  • As an Admin, find all MPG files that are over 1GB in the /ifs/recordings subtree
  • As an Admin, find all Word, Excel, and PowerPoint documents that have not been accessed in a year
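
Because the appliance relies on Elasticsearch, a use case like the MPG example above could plausibly be expressed as a bool/filter query. The following is only a minimal sketch: the field names `extension`, `size`, and `path` are hypothetical (the actual Isilon Search index schema is not described here), and "1GB" is read as 1 GiB.

```python
import json

GIB = 1024 ** 3  # bytes in one GiB; reading the post's "1GB" this way is an assumption


def build_query(extensions, min_size_bytes=None, path_prefix=None):
    """Build an Elasticsearch-style bool/filter query as a plain dict.

    The field names "extension", "size" and "path" are hypothetical;
    they are not taken from the real Isilon Search schema.
    """
    filters = [{"terms": {"extension": [e.lower() for e in extensions]}}]
    if min_size_bytes is not None:
        # Strictly greater than the given size, matching "over 1GB".
        filters.append({"range": {"size": {"gt": min_size_bytes}}})
    if path_prefix is not None:
        # Restrict matches to one subtree of the file system.
        filters.append({"prefix": {"path": path_prefix}})
    return {"query": {"bool": {"filter": filters}}}


# "Find all MPG files that are over 1GB in the /ifs/recordings subtree"
query = build_query(["mpg"], min_size_bytes=GIB, path_prefix="/ifs/recordings/")
print(json.dumps(query, indent=2))
```

Filters of this kind run entirely against the index, so the query itself would not touch the OneFS cluster.
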
To get an idea of the capabilities of the search appliance in its upcoming first release, watch the following video.



It's a true Scale-Out Solution

The product is a virtual appliance with wizards for configuration. It relies on Elasticsearch indexing technology and the Lucene search engine, and it has a Google-like UI with visual filtering capabilities. The technology is scalable: search nodes can be added ‘hot’, and the system scales to billions of files while providing responses in 1-2 seconds. Once the user has filtered the results appropriately, he or she can execute actions such as export and full-content indexing on them.

 

Real-Time Indexing

While the initial index scan may take some time to complete, the solution will then keep the index up to date in real time by plugging into Isilon’s audit notifications and the CEE framework. The solution will index metadata such as filename, file type, path, size, dates (last modified, created, last accessed), owner/uid, group/gid, and access mode. Optionally, it can also index full content and application-specific metadata.
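
To make the list of indexed attributes more concrete, here is a minimal sketch of how such a metadata record could be assembled for a single file using only the Python standard library. The record layout and key names are assumptions for illustration; they do not describe how Isilon Search actually stores its documents.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def iso(timestamp):
    """Convert a POSIX timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def metadata_record(path):
    """Collect the metadata fields named in the text for one file.

    The key names are hypothetical, not the product's schema.
    """
    st = os.stat(path)
    name = os.path.basename(path)
    # A true creation time (st_birthtime) only exists on some platforms,
    # for example FreeBSD and macOS; elsewhere we record None.
    birth = getattr(st, "st_birthtime", None)
    return {
        "filename": name,
        "file_type": os.path.splitext(name)[1].lstrip(".").lower(),
        "path": os.path.abspath(path),
        "size": st.st_size,
        "last_modified": iso(st.st_mtime),
        "created": iso(birth) if birth is not None else None,
        "last_accessed": iso(st.st_atime),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "access_mode": oct(stat.S_IMODE(st.st_mode)),
    }


# Usage: build a record for a throwaway file.
with tempfile.NamedTemporaryFile(suffix=".PDF", delete=False) as fh:
    fh.write(b"hello")
    sample_path = fh.name
record = metadata_record(sample_path)
os.remove(sample_path)
print(record["filename"], record["file_type"], record["size"])
```
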
 
Fig 1: Components of the solution

The solution uses the OneFS API to perform certain actions, such as deletes. Protocol auditing (create, delete, modify, ...) forwards notifications to a CEE server (typically running on a VM) so that index updates can be made in real time (watch the video to see it in action). A current limitation is that only file changes carried out via SMB and NFS are monitored and reflected in the index. Changes made via FTP, the OneFS API (HTTP), HDFS, or the local file system will not be reflected in the index without a re-scan at this point in time. User actions such as downloading files that show up in a search result are performed via an SMB share.
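
The event-driven update described above can be sketched as a small function that applies audit events to an index. This is a simplified illustration, not the product's implementation: the event format (`protocol`, `op`, `path`, `meta`) is hypothetical, and a plain dict stands in for the Elasticsearch index. Events from protocols other than SMB and NFS are ignored, mirroring the current limitation.

```python
MONITORED_PROTOCOLS = {"SMB", "NFS"}  # per the limitation described in the text


def apply_event(index, event):
    """Apply one audit event to an in-memory index (dict: path -> metadata).

    Returns True if the event was processed, False if it was ignored.
    The event keys used here are hypothetical, not the CEE wire format.
    """
    if event["protocol"] not in MONITORED_PROTOCOLS:
        return False  # e.g. FTP or HDFS changes need a re-scan instead
    op, path = event["op"], event["path"]
    if op == "create":
        index[path] = dict(event.get("meta", {}))
    elif op == "modify":
        index.setdefault(path, {}).update(event.get("meta", {}))
    elif op == "delete":
        index.pop(path, None)
    else:
        return False  # unknown operation: leave the index untouched
    return True


index = {}
events = [
    {"protocol": "SMB", "op": "create", "path": "/ifs/a.txt", "meta": {"size": 10}},
    {"protocol": "NFS", "op": "modify", "path": "/ifs/a.txt", "meta": {"size": 25}},
    {"protocol": "FTP", "op": "create", "path": "/ifs/b.txt", "meta": {"size": 1}},
    {"protocol": "SMB", "op": "create", "path": "/ifs/c.txt", "meta": {"size": 7}},
    {"protocol": "SMB", "op": "delete", "path": "/ifs/c.txt"},
]
applied = [apply_event(index, e) for e in events]
print(index)
print(applied)
```

In this run the FTP event is skipped, so `/ifs/b.txt` never appears in the index until a re-scan picks it up.
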

 

Searches

It is important to mention that searches are run against the index, so regardless of the complexity of the query, the OneFS cluster will not be affected by the search. The UI is very simple to use: it allows filtering, shows detailed metadata of search matches and visualizations, and supports user actions such as preview, download, and export.

Fig 2: The search UI

 

Installation

The install is self-contained. The user does not need to ‘leave’ the UI at any point during the process.


Interested in Beta testing?

If your customer is interested in participating in the Beta test, please register here. Be aware that we are interested in serious feedback and discussion with the user. The program is not intended as a casual test-and-play experience.


Requirements for the Beta Test

The customer needs to provide the following to be able to run the Beta code:
  • VMware ESX 5.x or 6.x
  • Resources for the VM
    • 32GB RAM
    • 8 vCPUs
    • Can be reduced for smaller Isilon clusters
    • 556GB disk space
      • Can be increased up to 2TB disk space
      • Can be thin provisioned
      • 2TB is enough for 6+ billion files and folders
  • Isilon Cluster with OneFS 7.2 or higher
  • Chrome or Firefox web browser  (IE will be supported for GA)
  • External Active Directory or LDAP server(s) (optional)
    • The Isilon Search virtual appliance has a built-in OpenLDAP server
    • Add additional external AD or LDAP servers to support specific users/groups for search or administration
  • OneFS must expose an SMB share on /ifs. The user specified when the Isilon Search is configured must have full access to this share. The share is used to download files and access them for full content indexing
Isilon Search will automatically:
  • Enable protocol auditing for all Access Zones (indexing per Access Zone is planned for a future release)
  • Point “Event Forwarding” to the CEE server on the Isilon Search virtual appliance
  • For the Beta, no existing CEE audit servers may be configured.  This will not be a restriction for GA
  • Only one Isilon Search system can point to a single Isilon cluster
  • Event forwarding can only be set for one destination


How many objects are on your cluster?

To determine the total number of objects on the Isilon cluster, SSH into one of the nodes and run isi job start lincount. This will return a job number. Once the job completes, use isi job reports view <job number> to see the results. The job may take a while to complete: typically about 30 minutes per 1 billion objects (as always, depending on utilization, node types, etc.).
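
As a rough planning aid, the two rules of thumb from this post (about 30 minutes of LinCount runtime per 1 billion objects, and 2TB of index disk being enough for 6+ billion files and folders) can be turned into a small calculation. This is a back-of-the-envelope sketch that assumes both figures scale linearly, which the post does not state; the disk figure is an upper bound derived from the "2TB is enough" statement, and the function names are made up for this example.

```python
MINUTES_PER_BILLION = 30           # "typically about 30m for 1 billion objects"
REFERENCE_DISK_TB = 2              # "2TB is enough for 6+ billion files and folders"
REFERENCE_OBJECTS = 6_000_000_000


def estimate_lincount_minutes(objects):
    """Rough LinCount runtime in minutes, assuming linear scaling."""
    return objects / 1_000_000_000 * MINUTES_PER_BILLION


def estimate_index_disk_tb(objects):
    """Rough upper bound on index disk in TB, assuming linear scaling."""
    return objects / REFERENCE_OBJECTS * REFERENCE_DISK_TB


objects = 1_500_000_000  # example value; use the count reported by LinCount
print(f"LinCount runtime: ~{estimate_lincount_minutes(objects):.0f} minutes")
print(f"Index disk (upper bound): ~{estimate_index_disk_tb(objects):.2f} TB")
```

Actual runtimes and disk usage will vary with utilization, node types, and whether full-content indexing is enabled.
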

 

More to come

Join us on this journey of creating business value from user-generated data. The next steps are support for additional Dell EMC platforms and for more high-value use cases.

 

References

[1] Defined by McKinsey as “high-skill knowledge workers, including managers and professionals”
[2] McKinsey Global Institute (MGI) report, July 2012: “The social economy: Unlocking value and productivity through social technologies”.
[3] Source: https://aci.info/2013/11/06/the-cost-of-content-clutter-infographic/


















