Static Survey of File System Statistics (fsstats)

Introduction

Our goal is to make available tools and services that facilitate worldwide data collection on static file tree attributes and aggregate this data into a large database that can be queried and viewed by anyone.
In the past, people have collected data on how files within file systems change in terms of file size, access time, modification time, filename length, and various other attributes. We want users to be able to gather this data for themselves and to share it easily.

We are attempting to aggregate file information from various sources and build a repository that reflects what people have on disk, across industries and users. We plan to make this data available to anyone who is interested. Hopefully the storage community will be able to gather useful information from this repository and use it to build better systems to store data.

More Information on the Tool and Its Output

We are looking for voluntary contributions to this database from organizations and users. To enable users to gather data, we have written a small tool that walks file trees and gathers both file-system-wide and per-file attributes.
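As a rough illustration of this kind of survey, the sketch below walks a file tree and builds two aggregate histograms, one of file sizes and one of filename lengths. It is written in Python for brevity and is not taken from fsstats itself; the function names (`size_bucket`, `survey_tree`), the power-of-two bucketing, and the shape of the returned dictionary are all assumptions made for this example.

```python
import os
import stat
from collections import Counter


def size_bucket(size_bytes):
    """Return the smallest power of two (in bytes) that is >= the file size.

    Empty files fall into the 1-byte bucket. The bucketing scheme is an
    assumption for this sketch, not necessarily what fsstats uses.
    """
    bucket = 1
    while bucket < size_bytes:
        bucket *= 2
    return bucket


def survey_tree(root):
    """Walk `root` and return aggregate histograms only.

    File and directory names are used while walking but are never stored
    in the result, so the output carries counts rather than paths.
    """
    sizes = Counter()         # bucket upper bound -> number of files
    name_lengths = Counter()  # filename length in characters -> number of files
    errors = 0                # files that could not be examined

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat does not follow symbolic links
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                errors += 1
                continue
            if not stat.S_ISREG(info.st_mode):
                continue  # count regular files only
            sizes[size_bucket(info.st_size)] += 1
            name_lengths[len(name)] += 1

    return {
        "size_histogram": dict(sizes),
        "name_length_histogram": dict(name_lengths),
        "errors": errors,
    }
```

Calling `survey_tree("/some/tree")` returns a dictionary of counts that could be serialized and shared without revealing any file or directory names.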

We have been careful to gather only file attributes and information that is anonymous. To that end, the tool separates what the user sees (such as file and directory names in error reports) from what it uploads (data on file attributes, with no file content or user-specific information), so that anonymity is maintained and no private information is divulged. Here is a sample result that details the kind of information the tool will upload.

The tool itself is written in Perl (about 1200 lines), so that users concerned about privacy or security can inspect the script before running it.

Besides walking a file system and gathering attribute information, the tool has some other useful features. For instance, you can give it an option to create intermediate checkpoint files. Gathering statistics on every file in a large file tree can take a long time, especially if ‘fsstats’ is run at a low priority, so checkpoints are useful when you need to stop the scan for some reason. You can simply kill the running script and later restart the tool from where it left off by pointing it to the right checkpoint file.
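The checkpoint-and-resume idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the fsstats implementation: the function names, the JSON checkpoint layout, the `every` interval, and the `max_dirs` parameter (which stands in for an interruption) are all hypothetical. The state consists of the directories still waiting to be scanned plus the running totals, so restoring both resumes the scan without counting any file twice.

```python
import json
import os
import stat


def load_state(root, checkpoint_path):
    """Resume from the checkpoint if one exists, else start at `root`."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            return json.load(f)
    return {"pending": [root], "files": 0, "bytes": 0}


def save_state(state, checkpoint_path):
    """Write the checkpoint to a temporary file, then rename it into place,
    so an interrupted write never leaves a truncated checkpoint behind."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_path)


def survey_with_checkpoints(root, checkpoint_path, every=1000, max_dirs=None):
    """Scan the tree one directory at a time, checkpointing every `every`
    directories. `max_dirs` stops early to imitate an interrupted run."""
    state = load_state(root, checkpoint_path)
    done = 0
    while state["pending"]:
        if max_dirs is not None and done >= max_dirs:
            break
        directory = state["pending"].pop()
        try:
            with os.scandir(directory) as it:
                entries = list(it)
        except OSError:
            entries = []  # unreadable directory: skip it
        for entry in entries:
            try:
                info = entry.stat(follow_symlinks=False)
            except OSError:
                continue
            if stat.S_ISDIR(info.st_mode):
                state["pending"].append(entry.path)
            elif stat.S_ISREG(info.st_mode):
                state["files"] += 1
                state["bytes"] += info.st_size
        done += 1
        if done % every == 0:
            save_state(state, checkpoint_path)
    save_state(state, checkpoint_path)
    return state
```

If the process is killed between two checkpoints, the next run reloads the last saved state and repeats only the directories scanned since then. Note that in this sketch the checkpoint file contains directory paths, so it is meant to stay on the local machine and should be kept outside the tree being scanned.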

Related Publications

Characterizing HEC Storage Systems at Rest. Shobhit Dayal. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-109, July 2008.

A Large-Scale Study of File-System Contents. John R. Douceur and William J. Bolosky. ACM SIGMETRICS'99, Atlanta, GA, May 1-4, 1999.

A Study of File Sizes and Functional Lifetimes. M. Satyanarayanan. Proceedings of the 8th ACM Symposium on Operating Systems Principles, Asilomar, CA, December 1981.

Contact Information

Garth Gibson, CMU

Acknowledgements

This material is based upon research supported by the DOE Office of Advanced Scientific Computing Research (ASCR) program for Scientific Discovery through Advanced Computing (SciDAC) under Award Number DE-FC02-06ER25767, in the Petascale Data Storage Institute (PDSI). It is also supported by the Los Alamos National Lab under Award Number 54515-001-07, the CMU/LANL IRHPIT initiative.

We thank the members and companies of the PDL Consortium: American Power Conversion, Cisco Systems, EMC, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel, LSI, Network Appliance, Oracle, Panasas, Seagate Technology, and Symantec for their interest, insights, feedback, and support.


Last updated 2011-02-23 | © 2011 Carnegie Mellon University