Re: [gridcpr-wg] GridLab: In scope or out
I vote In Scope, although the extension to jobs that run across
multiple sites is beyond the scope of our problem statement.
N.
Paul Stodghill wrote:
In the GridLab project \cite{gridlab-homepage,gridlab-overview}, a job,
consisting of one or more processes, is running on a grid machine. In
the middle of the run, the job may be forced to migrate to a different
machine, possibly with a different architecture and/or number of CPUs.
The application program may either decide by itself to migrate (e.g.
because of poor performance on the current machine) or may be forced to
do so, either by the user (via an application manager) or by the local
resource management software that wishes to evict the job. The main
purpose of GridCPR in GridLab is thus the ability to interrupt and
migrate a job until it finally terminates. Fault tolerance is only a
secondary aspect.
An extension of the above use-case is dealing with jobs that run
concurrently at multiple grid sites.
Applications save their state to regular files. Checkpoint metadata can
be stored in GridLab's "advert service", allowing the checkpoint file(s)
to be found and retrieved after restart. File transport is done via
GridLab's data movement service (or via GridFTP) \cite{gridlab-day}.
Key functions:
\begin{itemize}
\item Services for checkpoint data transport, via GridLab's data
movement service or GridFTP.
\item Services for checkpoint data management, via the Advert Service.
\end{itemize}
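
The save, advertise, and retrieve flow described in this use case can be sketched roughly as follows. This is only an illustration under stated assumptions: `AdvertStub`, `write_checkpoint`, `restart_from_checkpoint`, and `demo` are hypothetical names, not part of any GridLab API; the in-memory dictionary stands in for the advert service, and a local file copy stands in for the data movement service or GridFTP.

```python
import json
import shutil
import tempfile
from pathlib import Path


class AdvertStub:
    """Hypothetical in-memory stand-in for a checkpoint metadata catalogue."""

    def __init__(self):
        self._entries = {}

    def publish(self, job_id, metadata):
        # Record where the latest checkpoint of this job can be found.
        self._entries[job_id] = dict(metadata)

    def lookup(self, job_id):
        return dict(self._entries[job_id])


def write_checkpoint(job_id, state, directory, advert):
    """Save application state to a regular file and advertise its location."""
    path = Path(directory) / f"{job_id}.ckpt.json"
    path.write_text(json.dumps(state))
    advert.publish(job_id, {"path": str(path), "step": state["step"]})
    return path


def restart_from_checkpoint(job_id, advert, work_dir):
    """Find the checkpoint via its metadata, fetch it, and load the state."""
    meta = advert.lookup(job_id)
    local_copy = Path(work_dir) / Path(meta["path"]).name
    # Assumption: a plain local copy stands in for real file transport.
    shutil.copy(meta["path"], local_copy)
    return json.loads(local_copy.read_text())


def demo():
    """Checkpoint on one 'machine' (directory) and restart on another."""
    advert = AdvertStub()
    with tempfile.TemporaryDirectory() as site_a, \
            tempfile.TemporaryDirectory() as site_b:
        state = {"step": 42, "values": [1.0, 2.5]}
        write_checkpoint("job-1", state, site_a, advert)
        return restart_from_checkpoint("job-1", advert, site_b)
```

Calling `demo()` writes a checkpoint in one temporary directory and restores it in another, which mirrors the migration case; a real deployment would replace the stub and the copy with the actual services.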
--
+-----------------------------+----------------------------------+
| Nathan T.B. Stone, Ph.D. | Pittsburgh Supercomputing Center |
| Advanced Systems Group | 4400 Fifth Avenue |
+-----------------------------+ Pittsburgh, PA 15213 |
| mailto:stone@psc.edu | phone: 412-268-4367 |
| http://www.psc.edu/~nstone/ | fax: 412-268-5832 |
+-----------------------------+----------------------------------+