

Parallel Log-Structured File System (PLFS)

 

Overview

Parallel applications running across thousands of processors must protect themselves from inevitable component failures. Many applications do so by checkpointing, a process in which they save their state to persistent storage. Following a failure, they can resume computation from this saved state. For many applications, saving this state into a single shared file is most convenient. With this approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files.

To address this fundamental mismatch, we have developed a parallel log-structured file system, PLFS, which sits between applications and the underlying parallel file system. PLFS remaps an application’s write access pattern into one that is optimized for the underlying file system. In testing on the Panasas ActiveScale Storage System and IBM’s General Parallel File System at Los Alamos National Laboratory, and on Lustre at the Pittsburgh Supercomputing Center, we have seen that this layer of indirection and reorganization can reduce checkpoint time by up to several orders of magnitude for several important benchmarks and real applications (Figure 1).
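The remapping idea described above can be illustrated with a minimal sketch. It is not the PLFS implementation: the class `LogStructuredWriter`, the function `read_logical_file`, the `data.<rank>` file naming, and the in-memory index are all hypothetical names and simplifications chosen for illustration. The sketch assumes that writes from different processes do not overlap. Each simulated process appends its small, strided writes sequentially to a private log and records where each piece belongs in the logical shared file; a reader later replays the index entries to rebuild that file.

```python
import os
import tempfile


class LogStructuredWriter:
    """Hypothetical per-process writer (illustration only, not the PLFS API).

    Every write is appended to a private data log, and an index entry
    records where those bytes belong in the logical shared file.
    """

    def __init__(self, container_dir, rank):
        self.rank = rank
        self.data_path = os.path.join(container_dir, f"data.{rank}")
        self.data = open(self.data_path, "wb")
        self.physical_end = 0  # next append position in the private log
        self.index = []        # (logical_offset, length, physical_offset)

    def write(self, logical_offset, payload):
        # The underlying file system only ever sees a sequential append.
        self.data.write(payload)
        self.index.append((logical_offset, len(payload), self.physical_end))
        self.physical_end += len(payload)

    def close(self):
        self.data.close()


def read_logical_file(writers):
    """Rebuild the logical file by replaying every writer's index.

    Assumes non-overlapping writes, so replay order does not matter.
    """
    entries = [
        (logical, length, physical, w.data_path)
        for w in writers
        for (logical, length, physical) in w.index
    ]
    size = max((logical + length for logical, length, _, _ in entries), default=0)
    out = bytearray(size)
    for logical, length, physical, path in entries:
        with open(path, "rb") as f:
            f.seek(physical)
            out[logical:logical + length] = f.read(length)
    return bytes(out)


def demo():
    """Three simulated ranks write small strided records to one logical file."""
    nranks, nrecords, rec_size = 3, 4, 5
    with tempfile.TemporaryDirectory() as container:
        writers = [LogStructuredWriter(container, r) for r in range(nranks)]
        for i in range(nrecords):
            for w in writers:
                # Strided layout: rank r owns every nranks-th record.
                offset = (i * nranks + w.rank) * rec_size
                w.write(offset, bytes([ord("A") + w.rank]) * rec_size)
        for w in writers:
            w.close()
        return read_logical_file(writers)


if __name__ == "__main__":
    print(demo())
```

In this sketch the application still addresses one logical file with interleaved 5-byte records, while each private log receives only sequential appends. The cost of reorganization moves to read time, when the index entries are merged.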

Figure 1 - Summary of results. This graph summarizes our results, which are explained in detail in our research report. The key observation is that our technique improved checkpoint bandwidth for all seven studied benchmarks and applications, by up to several orders of magnitude.

We expect that PLFS can improve the checkpoint bandwidth for any large parallel application that writes to a single file. The expected improvement is especially large for applications doing unaligned or random IO, patterns that have become increasingly prevalent with the widespread adoption of complex formatting libraries such as NetCDF and HDF5.

For detailed information on the design and implementation of PLFS and an in-depth evaluation of its results, please see LANL's PLFS web page and the LANL PLFS Research Report. An extensive collection of PLFS trace results is also available.

 

People

Los Alamos National Laboratory

Carnegie Mellon University

Pittsburgh Supercomputing Center

 

Publications

PLFS: A Checkpoint Filesystem for Parallel Applications. John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, Meghan Wingate. LANL Technical Release LA-UR 09-02117, April 2009.
Abstract / PDF [415K]

PLFS Traces: PLFS has been used to generate many IO traces of benchmarks and real applications.

 

 

 

Last updated 2009-04-19 | © 2010 Carnegie Mellon University