Resource Page Description
This resource page describes a sample near real-time classification application that builds on the File Classification Infrastructure (FCI) in Windows Server 2008 R2. The sample tool uses the public FCI API to demonstrate one possible way to implement a near real-time classification solution. The code sample is written in C# with Visual Studio 2008 and includes a Visual Studio project file that can be used to compile and run the code.
This sample shows third-party developers how to interact with FCI in Windows Server 2008 R2 to extend the in-box functionality. It is intended to let customers and partners implement a complete end-to-end scenario with little investment in code, building on the in-box FCI capabilities.


Introduction


Ever-expanding storage requirements and capacity have increased the cost of data management, making effective data management more important than ever. File Classification Infrastructure (FCI), shipping with Windows Server 2008 R2, is our attempt to help administrators gain insight into their data. FCI is also an important step toward fulfilling our commitment to provide the best tools for administrators to manage data more effectively, reduce cost, and mitigate risk.

The following blog entries dive deeper into FCI.

FCI allows administrators to schedule classification and policy execution. However, in some scenarios, administrators may want to classify a file, and optionally apply policies to it, in real time: right when the file is created or modified. For example, whenever an employee in the finance department creates a spreadsheet, an administrator may want that spreadsheet to be classified as a High Business Impact file and optionally have a leakage prevention policy applied to it.

Generally, a user creates a file and edits its content in a long editing session, during which the user may save the file multiple times. A real-time file classification solution classifies the file right after each modification. Classifying a file consumes system resources, so in some situations it is useful to classify the file only after it is stable, for example at the end of the editing session. This is referred to as near real-time file classification.

This resource page presents a solution that builds on in-box FCI capabilities to achieve near real-time file classification. The solution allows administrators to specify an UpdateWindow: the time period to wait for a file to become stable before classifying it. The UpdateWindow determines how close the solution is to a real-time solution.

What this sample does


  • This sample near real-time classification console application classifies specific file(s) after the UpdateWindow time period has elapsed since they were created or last modified. Optionally, the sample application also applies a policy to a classified file if the file meets the policy application criteria specified using a classification property.
  • This sample monitors user actions (file create, rename, modify, and changes to the file size, system attributes, last write time, or security permissions of a file) on the target scope.
  • The target scope can only be a path on the local computer. If a target scope contains a mount point, the mount point and all its subdirectories are excluded from the target scope. However, a mount point, or any directory underneath it, is itself a valid target scope. For example, if the path C:\foo\bar is a valid mount point, then
    • Target scope C:\foo excludes c:\foo\bar and all its subdirectories.
    • Target scope C:\foo\bar is a valid target scope.
    • Target scope C:\foo\bar\subDir is also a valid target scope.
  • This sample application allows users to specify wildcard filters that describe the target scope to monitor, such as "D:\Dir1\*.docx".
  • It allows users to specify the policy condition that triggers an action, using a classification property. Only one property condition is supported, in the form "PropertyName=PropertyValue".
  • It also supports a user-specified policy action in the form of a command to be executed. When executing this command, it expands the path of the file being processed. The command is executed when the policy condition above is met.
  • This sample application runs as administrator and executes the specified action command using administrator credentials.
  • It has embedded usage help.

What this sample does not do


  • This sample near real-time classification console application does not guarantee a 0% file miss rate.
  • This sample does not support rich configuration, such as complex conditions involving multiple file properties.
  • It does not present a customized UI for configuration.

Commandline Parameters


  • Path: Takes a <string> specifying the path of the directory to be monitored for changes. This is a required argument.
  • IncludeSubdirectories: Takes a <boolean>: true to monitor subdirectories; false otherwise. Defaults to false.
  • Filter: Takes a <string> specifying the filter used to determine which files are monitored. Filter string examples:

    Filter string        Monitors the following files
    *.*                  All files. Default.
    *.docx               All files with a "docx" extension.
    HR200?.ppt           HR review 2009.ppt
    SalesForecast.xls    Only SalesForecast.xls

  • Policy: Takes a <string> specifying the policy that governs execution of the command, in the form "PropertyName=PropertyValue". This parameter is optional.
  • Command: Takes a <string> specifying the command to be executed when the policy evaluates to true for a file. This parameter is optional.
  • CommandArguments: Takes a <string> specifying the arguments to be passed to the command. Use [FILEPATH] as a placeholder for the full path of the file being processed.
  • UpdateWindow: Takes a <number> specifying the time (in seconds) to wait after the last change before classifying a file. Defaults to 60.
  • CacheWindow: Takes a <number> specifying the time (in seconds) for which information about recently classified files is cached. Defaults to 300.
  • MaxAttempts: Takes a <number> specifying the maximum number of attempts made to classify a file that fails classification. Defaults to 10.
  • ?: Display help.
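
The "/Name:Value" switch format and the defaults listed above can be illustrated with a short sketch. This is a language-neutral illustration in Python, not the sample's C# code; the function name, the lowercase dictionary keys, and the case-insensitive matching of switch names are assumptions made for the example.

```python
# Hypothetical defaults mirroring the parameter list above (not the sample's source).
DEFAULTS = {
    "includesubdirectories": False,
    "updatewindow": 60,
    "cachewindow": 300,
    "maxattempts": 10,
}

NUMERIC = ("updatewindow", "cachewindow", "maxattempts")


def parse_args(argv):
    """Parse '/Name:Value' switches into a dict keyed by lowercase name."""
    options = dict(DEFAULTS)
    for arg in argv:
        if arg == "/?":
            options["help"] = True
            continue
        if not arg.startswith("/") or ":" not in arg:
            raise ValueError("Unrecognized argument: " + arg)
        # Split on the first colon only, so a value like c:\foo keeps its own colon.
        name, value = arg[1:].split(":", 1)
        key = name.lower()  # assumption: switch names are case-insensitive
        if key in NUMERIC:
            options[key] = int(value)
        elif key == "includesubdirectories":
            options[key] = value.lower() == "true"
        else:
            options[key] = value
    if not options.get("help"):
        if "path" not in options:
            raise ValueError("Path is required")
        if "policy" in options and "command" not in options:
            raise ValueError("Command is required when Policy is specified")
    return options
```

For instance, parse_args(["/Path:c:\\foo", "/UpdateWindow:30"]) yields a dictionary in which "path" is c:\foo, "updatewindow" is 30, and the remaining entries keep their defaults.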

Usage Samples:

  • Get embedded help:
NearRealTimeClassification.exe /?
  • Monitor and classify all files under c:\foo.
NearRealTimeClassification.exe /Path:c:\foo
  • Monitor and classify *.docx files under c:\foo, including its subdirectories. Invoke sample.exe with the file path as a parameter for files that have the BusinessImpact=HBI classification property.
NearRealTimeClassification.exe /Path:c:\foo /IncludeSubdirectories:true /Filter:*.docx /Policy:BusinessImpact=HBI /Command:sample.exe /CommandArguments:[FILEPATH]

Notes:

  • All parameters are optional, unless specified as required.
  • If a policy is specified, the Command parameter is required.

Design


Data Flow Diagram:

NRTC_DataFlowDiagram.jpg

File Discovery:

This sample tool uses the .NET System.IO.FileSystemWatcher class to monitor file changes.

Overview of FileSystemWatcher:

FileSystemWatcher listens to file system change notifications and raises events when a directory, or a file in a directory, changes. FileSystemWatcher uses a buffer allocated from non-paged memory to receive file change notifications from the Windows operating system. If there are many changes in a short time, the buffer can overflow, which causes FileSystemWatcher to lose track of changes in the directory. In such cases, it only provides a blanket notification and is silent about the lost events. It is possible to increase the size of the buffer, but doing so is expensive because the buffer comes from non-paged memory that cannot be swapped out to disk. Currently, this sample does not allow the size of this buffer to be changed. Allowing administrators to increase the buffer size is a possible enhancement to this sample. See the Enhancements section for the full list of enhancements.

File Discovery using FileSystemWatcher:

This sample tool uses the .NET System.IO.FileSystemWatcher class to receive file create, change, and rename notifications. It registers event handlers for the following events raised by the FileSystemWatcher class.
  • Changed: when changes are made to the size, system attributes, last write time, creation time, or security permission of a file or directory in the path being monitored.
  • Created: when a file or directory is created in the path being monitored.
  • Renamed: when a file or directory in the path being monitored is renamed.

This sample maintains a queue of files that need to be classified. Create, change, and rename event handlers append a new file entry (consisting of file name and current timestamp) to this queue. If a file entry already exists in the queue, then it is removed and appended at the end of queue with new timestamp. Hence, all the file entries in the queue are always sorted based on the time of last change notification received for them.
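
The re-append behavior described above can be sketched as follows. This is an illustrative Python sketch rather than the sample's C# implementation; the class and method names are hypothetical, and an insertion-ordered dictionary is one possible way to obtain the "remove, then append at the end" behavior.

```python
from collections import OrderedDict


class PendingQueue:
    """Queue of files awaiting classification, ordered by last notification time."""

    def __init__(self):
        # Maps file path -> timestamp of the most recent change notification.
        self._entries = OrderedDict()

    def touch(self, path, now):
        """Record a create/change/rename notification for path at time now."""
        # Remove any existing entry so the path is re-appended at the tail.
        self._entries.pop(path, None)
        self._entries[path] = now

    def paths(self):
        """Return queued paths from oldest to newest notification."""
        return list(self._entries)
```

Because each notification moves the entry to the tail, and assuming timestamps are non-decreasing, iterating from the head always visits entries in order of their last change time without any explicit sorting.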

FileSystemWatcher may raise multiple events for a single user file operation. For example, moving a file into a monitored directory raises both a file created and a file changed event. Also, during an editing session, a user may save a document multiple times. To conserve system resources, it is useful to avoid classifying a file multiple times in such scenarios. One way to do this is to wait for the file to become stable before classifying it. This sample tool uses the UpdateWindow parameter to indicate the time period to wait.
A file entry sits in the above queue for at least the UpdateWindow time period. Once it is older than that, the corresponding file is ready for processing. Because all file entries are ordered by timestamp, they are dequeued from the front of the queue and processed one at a time. A possible enhancement to the sample tool is to process multiple eligible file entries concurrently using multiple threads. See the Enhancements section for the full list of possible enhancements.
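
As a rough illustration of the UpdateWindow check, the following Python sketch removes eligible entries from the front of a time-ordered queue. It is not the sample's C# code; the function name and the use of an ordered dictionary of path-to-timestamp entries are assumptions for the example.

```python
from collections import OrderedDict


def pop_ready(entries, now, update_window):
    """Remove and return paths whose last change is at least update_window seconds old.

    entries is an OrderedDict mapping path -> last-change timestamp, oldest first.
    """
    ready = []
    while entries:
        path, stamp = next(iter(entries.items()))  # peek at the oldest entry
        if now - stamp < update_window:
            # Entries are time-ordered, so every later entry is newer still.
            break
        entries.popitem(last=False)
        ready.append(path)
    return ready
```

Stopping at the first entry that is still inside the window is what makes the time-ordered queue cheap to poll: the cost is proportional to the number of ready files, not to the queue length.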

File Processing:

A file entry is processed by classifying the corresponding file using the File Classification Infrastructure APIs. An instance of FsrmClassificationManager is created, and the file is classified by calling the IFsrmClassificationManager::EnumFileProperties API with the FsrmGetFilePropertyOptions.FsrmGetFilePropertyOptionsNone option. This API classifies the file on the fly and retrieves all properties from all enabled storage modules. However, it currently does not set the returned classification properties on the file, so the sample tool next calls the IFsrmClassificationManager::SetFileProperty API to save all file properties on the file. Note that properties are stored in all registered storage modules, including the in-file storage module (for example, for docx files). Setting properties on the file is an optional step and depends on the intended use case.

When a property is saved using the in-file storage module, the file itself changes. FileSystemWatcher notices this change and fires another change event. Processing that event re-classifies the file, which saves the properties again, and these activities continue in a loop.

Also, consider a scenario in which a file is updated without affecting its classification properties, for example changing the attributes of a file (hiding or un-hiding it) right after it is created, in the absence of a classification rule that depends on this file attribute. In such scenarios, FileSystemWatcher raises create and change events. If these two events occur within the UpdateWindow time period, this sample tool classifies the file only once. However, if the change event occurs after the UpdateWindow time period has elapsed since the create event, the tool classifies the file twice. One can argue that this can be avoided by increasing the UpdateWindow; after all, it is a parameter to the tool. However, the longer the UpdateWindow time period, the further this tool is from being real time. Nonetheless, it would be useful not to set file properties on the file again in such scenarios.

To avoid the loop described above, and to avoid setting file properties again when the properties have not changed, this sample tool maintains an in-memory cache of recently classified files. After classifying a file, the tool first checks whether the file exists in this cache. If so, it checks whether the file's classification properties have changed. If they have not changed, the tool skips setting file properties and skips policy evaluation for the file. Only if the classification properties have changed, or the file is not found in the cache, does the tool set all FSRM classification properties on the file by calling the IFsrmClassificationManager::SetFileProperty API and then move on to evaluating the policy condition. Note that this cache can grow without bound if it is not pruned periodically. The sample tool uses the CacheWindow parameter to indicate the time period after which a file in the cache is eligible to be discarded. The tool periodically discards all files older than the CacheWindow time period from the cache.
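
The cache logic above might look like the following Python sketch. It illustrates the idea only and is not the sample's C# code; the class and method names are hypothetical, and classification properties are modeled as a plain name-to-value dictionary.

```python
class RecentCache:
    """In-memory cache of recently classified files and their properties."""

    def __init__(self, cache_window):
        self.cache_window = cache_window  # seconds an entry stays eligible
        self._items = {}  # path -> (properties dict, time stored)

    def is_unchanged(self, path, properties):
        """True if path is cached with exactly these classification properties."""
        cached = self._items.get(path)
        return cached is not None and cached[0] == properties

    def store(self, path, properties, now):
        """Remember the properties just set on path."""
        self._items[path] = (dict(properties), now)

    def prune(self, now):
        """Discard entries older than cache_window; return how many were removed."""
        stale = [p for p, (_, t) in self._items.items()
                 if now - t > self.cache_window]
        for p in stale:
            del self._items[p]
        return len(stale)
```

When is_unchanged returns True, the caller skips both the property write and the policy evaluation, which is what breaks the classify, write, notify, re-classify loop for in-file storage.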

Policy Action Execution:

Once all the classification properties are set on the file, this sample tool evaluates the policy condition, if the user has specified one. If the condition evaluates to true, the tool invokes the user-specified command in a separate process and waits for it to finish. The tool also passes the user-specified command arguments to the command, after expanding the [FILEPATH] macro (if it appears in those arguments).
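
A minimal Python sketch of the policy check and macro expansion follows. It is not the sample's C# code: the function names are hypothetical, and exact, case-sensitive string comparison of the property value is an assumption, since this page does not say how values are compared.

```python
def policy_matches(policy, properties):
    """Evaluate a 'PropertyName=PropertyValue' condition against file properties."""
    name, separator, value = policy.partition("=")
    if not separator or not name:
        raise ValueError("Policy must have the form PropertyName=PropertyValue")
    # Assumption: exact string match on the property value.
    return properties.get(name) == value


def expand_arguments(template, file_path):
    """Replace the [FILEPATH] macro with the full path of the file being processed."""
    return template.replace("[FILEPATH]", file_path)
```

With the policy BusinessImpact=HBI, a file whose properties are {"BusinessImpact": "HBI"} matches, and the argument template "[FILEPATH]" expands to the file's full path before the command is launched.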

Error Conditions:

The recently classified files cache is updated to include the current file and its properties only after successful policy action execution (if requested by the user) or after successfully setting properties (if no policy action was requested). The file entry is then removed from the files-to-be-classified queue.

If any error occurs while classifying a file, the cache is not updated. For a "file does not exist" error, the file entry is deleted from the files-to-be-classified queue. For any other error (for example, a sharing violation), the file entry is given a new timestamp and moved to the end of the queue to reflect the new timestamp. This results in periodic retries on the same file. Administrators can control the maximum number of classification attempts using the MaxAttempts parameter. This step also ensures that an error in one file does not block other file entries in the queue from being processed.
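
The retry handling can be sketched in Python as shown below. This is an illustration under stated assumptions, not the sample's C# code: the function name and the outcome strings are hypothetical, and dropping the entry once MaxAttempts is reached is an assumption, since this page does not state what happens at that point.

```python
from collections import OrderedDict


def handle_outcome(queue, attempts, path, outcome, now, max_attempts=10):
    """Update the pending queue after one classification attempt.

    queue:    OrderedDict mapping path -> timestamp, oldest first.
    attempts: dict mapping path -> number of failed attempts so far.
    outcome:  "ok", "not_found", or any other string for a retryable error.
    """
    if outcome in ("ok", "not_found"):
        # Success, or the file no longer exists: nothing left to do for it.
        queue.pop(path, None)
        attempts.pop(path, None)
        return "removed"
    attempts[path] = attempts.get(path, 0) + 1
    if attempts[path] >= max_attempts:
        # Assumption: the entry is dropped once the attempt limit is reached.
        queue.pop(path, None)
        attempts.pop(path, None)
        return "gave_up"
    # Retryable error: move the entry to the tail with a fresh timestamp,
    # so other files are not blocked behind it.
    queue.pop(path, None)
    queue[path] = now
    return "retry"
```

Re-stamping the failed entry means the next attempt happens no sooner than one UpdateWindow later, which spaces retries out without a separate timer.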

Limitations


  • Hidden files are not ignored.
  • Monitored files may be reported using short 8.3 file name format.
  • The monitoring mechanism may miss a few triggers during heavy activity.

Enhancements


  • Support a rich set of operators for specifying policy conditions. Please refer to the FCI File Management Task UI for examples.
  • Allow the administrator to set the BufferSize for the FileSystemWatcher.
  • Use a USN Journal based file system watcher implementation in order to guarantee a 0% file miss rate.
  • Support a timeout for policy action execution, so that a long-running policy action does not block classification of other files.
  • Use a thread pool for policy execution, with N threads executing policy actions on multiple files in parallel.
Last edited Sep 4 2009 at 10:23 PM  by RonakDesai, version 8