PDSI Research News

 

August 4, 2009 - CView software released by PNL

CView is a 3D graphics engine designed for displaying graphically represented cluster performance data. It also includes a data management library for representing groups of related data.

The current version is available on the ReleasePage.

June 16, 2009 - PLFS Source Code Released on Sourceforge.net

On May 4, 2009, an initial release of the PLFS code was made through sourceforge.net by the Los Alamos National Lab (LANL). The release was made in order to attract collaborators in co-development and testing. Instructions for access to the code through CVS or Subversion are linked from the project's main sourceforge.net page at http://sourceforge.net/projects/plfs/. There are also open discussion and help forums, public mailing lists for announcements, users and developers, as well as a wiki page.

The first paper generated by this work, "PLFS: A Checkpoint Filesystem for Parallel Applications," has been accepted at Supercomputing '09 and is a finalist for the conference's best paper award. A preprint is available at http://www.pdsi-scidac.org/research/plfs.html, and the final paper will be published in the SC'09 proceedings.

Background

Parallel applications running across thousands of processors must protect themselves from inevitable component failures. Many applications insulate themselves from failures by checkpointing, a process in which they save their state to persistent storage. Following a failure, they can resume computation using this state. For many applications, saving this state into a single shared file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files.

To address this fundamental mismatch, we have developed a parallel log-structured file system, PLFS, which is positioned between the applications and the underlying parallel file system. PLFS remaps an application’s write access pattern to be optimized for the underlying file system. Through testing on the Panasas ActiveScale Storage System and IBM’s General Parallel File System at Los Alamos National Lab (LANL) and on Lustre at the Pittsburgh Supercomputing Center, we have seen that this layer of indirection and reorganization can reduce checkpoint time by up to several orders of magnitude for several important benchmarks and real applications.

We expect that PLFS can improve the checkpoint bandwidth for any large parallel application that writes to a single file. The expected improvement is especially large for applications doing unaligned or random I/O, patterns that have recently become increasingly prevalent due to the widespread adoption of complex formatting libraries such as NetCDF and HDF5.
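The remapping idea described above can be illustrated with a toy Python sketch. This is not the PLFS implementation: the class name ToyLogStructuredFile, the per-writer file naming, and the in-memory index are all assumptions made for illustration. It only shows how small writes at scattered offsets in one logical shared file can be turned into sequential appends to non-shared files, with an index used to reassemble the logical file on read.

```python
import os
import tempfile


class ToyLogStructuredFile:
    """Toy model (hypothetical): one logical shared file, one append-only log per writer."""

    def __init__(self, directory):
        self.directory = directory
        # Index entries in write order: (logical_offset, length, writer_id, log_offset)
        self.index = []

    def _log_path(self, writer_id):
        return os.path.join(self.directory, f"data.{writer_id}")

    def write(self, writer_id, logical_offset, data):
        # Every write becomes a sequential append to the writer's own log,
        # regardless of where it lands in the logical file.
        with open(self._log_path(writer_id), "ab") as log:
            log_offset = log.seek(0, os.SEEK_END)
            log.write(data)
        self.index.append((logical_offset, len(data), writer_id, log_offset))

    def read(self, logical_offset, length):
        # Replay index entries in write order so later writes win;
        # bytes that were never written read back as zeros.
        result = bytearray(length)
        for off, n, writer_id, log_off in self.index:
            start = max(off, logical_offset)
            end = min(off + n, logical_offset + length)
            if start >= end:
                continue  # this entry does not overlap the requested range
            with open(self._log_path(writer_id), "rb") as log:
                log.seek(log_off + (start - off))
                chunk = log.read(end - start)
            result[start - logical_offset:end - logical_offset] = chunk
        return bytes(result)


def demo():
    with tempfile.TemporaryDirectory() as d:
        f = ToyLogStructuredFile(d)
        # Two writers interleave small strided writes into one logical file.
        f.write(0, 0, b"AAA")
        f.write(1, 3, b"bbb")
        f.write(0, 6, b"CCC")
        f.write(1, 9, b"ddd")
        return f.read(0, 12), f.read(2, 5)
```

In this sketch each writer only ever appends to its own file, which is the access pattern the text says underlying file systems handle well; the cost is moved to the read path, which must consult the index.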

 

April 18, 2009 - LANL releases techreport "PLFS: A Checkpoint Filesystem for Parallel Applications" and traces

Parallel applications running across thousands of processors must protect themselves from inevitable component failures. Many applications insulate themselves from failures by checkpointing, a process in which they save their state to persistent storage. Following a failure, they can resume computation using this state. For many applications, saving this state into a single shared file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files.

To address this fundamental mismatch, we have developed a parallel log-structured file system, PLFS, which is positioned between the applications and the underlying parallel file system. PLFS remaps an application’s write access pattern to be optimized for the underlying file system. Through testing on the Panasas ActiveScale Storage System and IBM’s General Parallel File System at Los Alamos National Lab and on Lustre at the Pittsburgh Supercomputing Center, we have seen that this layer of indirection and reorganization can reduce checkpoint time by up to several orders of magnitude for several important benchmarks and real applications.

We expect that PLFS can improve the checkpoint bandwidth for any large parallel application that writes to a single file. The expected improvement is especially large for applications doing unaligned or random I/O, patterns that have recently become increasingly prevalent due to the widespread adoption of complex formatting libraries such as NetCDF and HDF5.

August 6, 2008 - Sandia Releases Application Traces

The research community has long desired large-scale traces of real applications, because synthetic benchmarks lack the fidelity and credibility of actual traces. As part of the Petascale Data Storage Institute (PDSI), Sandia researcher Lee Ward has released input-output (I/O) system-call trace data from two representative runs of Sandia’s ALEGRA shock and multiphysics simulation code suite.

The two runs, performed on the Sandia/NNSA Red Storm supercomputer, captured information about four checkpoint dumps, run logs, and terminal I/O. The runs used 2,744 nodes and 5,832 virtual nodes, respectively. Links to the trace data and to a short paper describing the format and the environment from which the data was obtained follow.

For more information, contact Lee Ward.

Trace and Documentation Download

 

 

Code and Data Releases

 

 

 


 

Last updated 2009-08-04 | © 2010 Carnegie Mellon University