Non-Blocking Collective I/O Routines

Non-Blocking Collective MPI I/O
Ticket #273
• I/O is one of the main bottlenecks in HPC
• Many applications and higher-level libraries rely
on MPI I/O for parallel I/O.
• Several optimizations have been introduced in
MPI I/O to meet the needs of applications:
– Non-blocking individual I/O
– Different collective I/O algorithms
• Routines for non-blocking individual I/O
operations exist (MPI_File_i(read/write)(_at))
• Non-blocking point-to-point (existing) and
collective (to be added) communication
operations have demonstrated benefits.
• Split collective I/O operations have their
restrictions and limitations.
• What’s keeping us from adding non-blocking
collective I/O operations?
– Implementation
Use Case (I)
• HDF5 operations that modify metadata:
– Collective, to keep the metadata cache consistent across all
processes
– The metadata cache uses an LRU eviction scheme.
– Items at the bottom of the list are evicted with a collective
write to disk; the amount of data written is usually
small (< 1 KB).
– Non-blocking collective I/O would allow us to fire off those
writes and continue with other work, hiding the I/O overhead.
Use Case (II)
• HDF5 Raw data operations:
– Chunking data in the file is a key optimization HDF5 uses for
parallel I/O.
– If HDF5 can detect a pattern in the way chunks are
accessed, we can pre-fetch those chunks from disk.
– Asynchronous I/O operations would hide the cost of I/O
• Chunk cache for writes (currently disabled for
parallel HDF5):
– Similar concept to the metadata cache
New Routines
• MPI_File_iread_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iwrite_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iread_at_all(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iwrite_at_all(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• Ordered read/write (add non-blocking versions or deprecate the ordered routines)
• Deprecate split collectives
• Straw Vote: 22 - 0 - 0
• Major difference between collective communication
and collective I/O operations:
– Each process is allowed to provide a different volume of
data to a collective I/O operation, without knowing the
data volumes provided by the other processes.
• Collective I/O algorithms therefore perform aggregation.
• Need non-blocking collective communication
• Integrate with the progress engine
– Test/Wait on the request, like other non-blocking operations
– Explicit or implicit progress?
• Different collective I/O algorithms
• A recent implementation was done within an Open
MPI-specific I/O library (OMPIO) and uses LibNBC:
– leverages the same schedule concept as non-blocking
collective communication operations
– work is still at a preliminary stage, so a large-scale
evaluation is not yet available
– done at the PSTL at the University of Houston (Edgar Gabriel)
in collaboration with Torsten
– paper accepted at EuroMPI 2011:
• Design and Evaluation of Nonblocking Collective I/O Operations
Other MPI I/O Operations
• Several MPI I/O functions other than the
read/write functions are considered expensive:
– Open/Close
– Sync
– Set view
– Set/Get size
• It would be valuable to have non-blocking
versions of some of those functions too.
• Applications that open a file but don’t touch it
until a certain amount of computation has been
done:
– The cost of opening the file can be hidden.
• A non-blocking sync would likewise provide great
advantages when we flush data items to disk
before continuing with computation.
• The intention is to hide the cost (whenever
possible) of all the expensive MPI I/O operations.
Proposed Routines
• MPI_File_iopen(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh, MPI_Request *req);
• MPI_File_iclose(MPI_File fh, MPI_Request *req);
• MPI_File_isync(MPI_File fh, MPI_Request *req);
• MPI_File_iset_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info, MPI_Request *req);
• MPI_File_iset_size(MPI_File fh, MPI_Offset size, MPI_Request *req);
• MPI_File_ipreallocate(MPI_File fh, MPI_Offset size, MPI_Request *req);
• MPI_File_iset_info(MPI_File fh, MPI_Info info, MPI_Request *req);
• Straw Vote: 15 – 1 – [5(need to think), 1(doesn’t care)]
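A sketch of how the proposed open/sync routines might be used. Note these interfaces are the ticket's proposal, not part of any MPI standard, so this fragment is illustrative only; `heavy_setup_computation` and `more_computation` are hypothetical placeholders for application work:

```c
/* Sketch against the *proposed* interface (not standardized):
 * hide the cost of open and sync behind computation. */
MPI_File    fh;
MPI_Request req;

MPI_File_iopen(MPI_COMM_WORLD, "out.dat",
               MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL,
               &fh, &req);
heavy_setup_computation();          /* file is not touched yet */
MPI_Wait(&req, MPI_STATUS_IGNORE);  /* open completed */

/* ... write data ... */

MPI_File_isync(fh, &req);           /* flush to disk in the background */
more_computation();
MPI_Wait(&req, MPI_STATUS_IGNORE);
```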
• The need for non-blocking collective I/O is
fairly high.
• Implementation is the hard part.
• Performance benefits can be substantial.
• Users would also benefit from non-blocking
versions of some MPI I/O operations that are
considered fairly time consuming.
