Minutes from the GGF8 meeting
Sorry for being quite late. (Put 3 standard excuses here...)
The minutes are attached.
Thilo
--
Thilo Kielmann http://www.cs.vu.nl/~kielmann/
GGF Grid Checkpoint Recovery WG (GridCPR-WG)
Notes of the meeting at GGF8, June 26, 2003
Notes by Shantenu Jha <s.jha@ucl.ac.uk> and Thilo Kielmann <kielmann@cs.vu.nl>
Session 1, chaired by D. Simmel
Introductions:
Overview by D. Simmel
GridCPR objective:
- heterogeneity: checkpoint on one system, restart possibly somewhere else
- API does not imply any services; a minimal implementation may also be done
"somehow", e.g. as local libraries
- API needs to be separate from the services
- separate API from "best practices" recommendations on how to use checkpoints,
such as coordinated or uncoordinated checkpointing
ACTION ITEM: everybody watch other groups w.r.t. overlap/contradictions
GridCPR Architecture Review
GridCPR Goal (API definition)
will require an application developer to
modify his/her code to use the API
GridCPR API Specification Audience
Grid Application Developers
GridCPR Service Specification Audience
Grid Platform Developers and Vendors
-- is this critically required?
i.e. should CPR be dependent on grid services whose
presence on every platform would have to be guaranteed?
i.e. decouple the API from any Grid Service
Grid Resource Operators
GridCPR Architecture Discussion
Paul Stodghill: coordinated or uncoordinated checkpointing: v1 or v2?
e.g. multiphysics
Chair's notes:
map checkpoint to a certain application run
need to be able to NAME a checkpoint
management of checkpoint data
make assumption that jobs are parallel, sequential being a special case
checkpoint is a set of files that together
represent a recoverable state of an application
application run is not just a binary
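The notes above (a named checkpoint, mapped to an application run, consisting
of a set of files) can be illustrated with a minimal sketch. It is not a
proposed GridCPR API: the function names, the JSON manifest layout and the use
of SHA-256 digests are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def write_manifest(run_id, name, files, manifest_path):
    """Record which files make up checkpoint `name` of application run `run_id`.

    Hypothetical helper: the manifest layout is an assumption, not part of
    any GridCPR specification.
    """
    entries = []
    for f in files:
        data = Path(f).read_bytes()
        entries.append({
            "path": str(f),
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    manifest = {"run_id": run_id, "name": name, "files": entries}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def is_complete(manifest_path):
    """Return True if every file listed in the manifest exists unchanged."""
    manifest = json.loads(Path(manifest_path).read_text())
    for entry in manifest["files"]:
        p = Path(entry["path"])
        if not p.is_file():
            return False
        if hashlib.sha256(p.read_bytes()).hexdigest() != entry["sha256"]:
            return False
    return True
```

A restart would only be attempted from a checkpoint for which is_complete()
holds, since the files represent a recoverable state only together.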
Resource broker view
need to restart job
Security for the checkpoint files
State Approach? [Java-ish]
Register a certain set of variables?
when you call for a checkpoint, some built-in
[programmatically] thing to write those
variables out to the checkpoint
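The variable-registration idea above could look roughly like the following
sketch. The class and method names are hypothetical, and Python's pickle
module merely stands in for whatever portable data format would actually be
chosen; pickle itself is not portable across languages.

```python
import pickle


class CheckpointRegistry:
    """Hypothetical registry of the variables that make up the critical state."""

    def __init__(self):
        self._vars = {}  # name -> (object, attribute name)

    def register(self, name, obj, attr):
        """Declare that obj.<attr> belongs to the state to be checkpointed."""
        self._vars[name] = (obj, attr)

    def checkpoint(self, path):
        """Write the current values of all registered variables to `path`."""
        state = {name: getattr(obj, attr)
                 for name, (obj, attr) in self._vars.items()}
        with open(path, "wb") as f:
            pickle.dump(state, f)

    def restart(self, path):
        """Read the values back and store them into the registered variables."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        for name, value in state.items():
            obj, attr = self._vars[name]
            setattr(obj, attr, value)
```

The application registers its variables once at start-up; afterwards a single
checkpoint() call writes all of them, so the application code at the
checkpoint location stays small.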
What do we need to restart your application mid-stream?
HDF5 should be a pluggable component rather than a required data format
Application manager: keeps track of the progress of the job
from submission to completion
GridCPR service makes use of other [Grid] data movement and
job management services
Scope to be clarified; wording to be determined later.
SP: need for use cases
TK: need for .... analogous to Network Weather Service
GridLab, RealityGrid, Paul Stodghill offer to write use cases
===============================================================================
Session 2, chaired by D. Simmel
Derek Simmel: recap and agenda...
Agenda:
CPR API structured Brainstorming, Discussion
Action Items and Writing Assignments
Planning for GGF9,10 & onwards
Session objectives:
Gather requirements for a GridCPR API, Scope Issues,
Prioritize gridCPR API Requirements,
Draw line b/w v1.0 and v2.0...
Goal
---------------------
Objective
---------------------
Requirement
Objective:
Platform and OS independent
Development environment/tools independence
Predictable behaviour
Although gridCPR architecture may include separate, locally deployed
gridCPR services, the gridCPR API/library should be separable from them
Use case presentations & discussions:
------------------------------------------------------------------------------
1. Mikel and Shantenu from RealityGrid
component (seq/par app.) writes set of files
(might also be read by other component, e.g. viz for monitoring progress)
what is going into the checkpoint file?
data of a program -- not system-level
not all user-defined?
middle: using API to do portable checkpoint
------------------------------------------------------------------------------
2. Paul Stodghill from Cornell
precompiler; programmer only needs to say where in the code the ckpt is to be called
N->N ckpt / recovery
with virtualization, N->M possible, e.g. for load redistribution
API needed for compiler-generated code (underneath the compiler magic)
------------------------------------------------------------------------------
3. Tom Goodale and Thilo Kielmann from GridLab
application writes set of files
which can be tracked by the application manager (who is in charge of
monitoring an application from submission to completion)
checkpointing happens at application level
(part of API of GAT -- Grid Application Toolkit)
------------------------------------------------------------------------------
4. Karpjoo Jeong, Konkuk University
sequential program, defines critical state + location info,
portable data format
always recover from the newest checkpoint?
no big deal giving a parameter "which one"
for some, it is a requirement
user-visible checkpoint labels
incremental checkpoints, relative to previous ones (like backups)
checkpoint performance prediction?? based on last checkpoint??
-> checkpoint rate
disk size of checkpoint considered important
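The points on user-visible labels, choosing which checkpoint to recover from,
and incremental checkpoints relative to previous ones can be combined in one
sketch. Everything here is an assumption for illustration: the class name, the
in-memory storage, and the representation of application state as a flat
dictionary of immutable values.

```python
class CheckpointStore:
    """Hypothetical store of labelled, incremental checkpoints (in memory)."""

    def __init__(self):
        self._order = []    # labels, oldest first
        self._records = {}  # label -> {"parent", "changed", "removed"}

    def save(self, label, state):
        """Store `state` under `label` as a difference to the newest checkpoint."""
        if label in self._records:
            raise ValueError("checkpoint label already used: %r" % (label,))
        if not self._order:
            # First checkpoint: nothing to diff against, store everything.
            record = {"parent": None, "changed": dict(state), "removed": []}
        else:
            parent = self._order[-1]
            prev = self.restore(parent)
            changed = {k: v for k, v in state.items()
                       if k not in prev or prev[k] != v}
            removed = [k for k in prev if k not in state]
            record = {"parent": parent, "changed": changed, "removed": removed}
        self._records[label] = record
        self._order.append(label)
        return record

    def restore(self, label=None):
        """Rebuild the full state for `label` (default: the newest checkpoint)."""
        if not self._order:
            raise LookupError("no checkpoint has been saved")
        if label is None:
            label = self._order[-1]
        chain = []
        while label is not None:
            record = self._records[label]
            chain.append(record)
            label = record["parent"]
        state = {}
        for record in reversed(chain):  # replay oldest first
            for k in record["removed"]:
                state.pop(k, None)
            state.update(record["changed"])
        return state
```

As with incremental backups, restoring a checkpoint requires all of its
predecessors, so the saving in disk size is paid for with a longer recovery.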
------------------------------------------------------------------------------
5. Heon Yeom, Seoul National Univ
recompile MPI-programs
checkpoint N->N processors (restart with same size)
coordinated and uncoordinated checkpoint, also message logging
job manager
------------------------------------------------------------------------------
cases 1 and 3 are similar
case 4 also highlighted the need to do incremental checkpoint
[checkpoint difference]
case 5 does uncoordinated checkpoint with message logging
problems left open:
deal with open files?
save offset within checkpoint
-> journaling file system ??
transactions
standardize i/o interface??
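The "save offset within checkpoint" idea for open files can be sketched as
follows. This is only one possible reading of the note: the function names are
hypothetical, files are assumed to be opened in binary mode, and truncating a
write-only file back to the saved offset on restart is an assumption, not
something decided in the meeting.

```python
def capture_open_files(files):
    """Return a record of path, mode and current offset for each open file."""
    records = []
    for f in files:
        f.flush()  # make sure buffered writes are on disk before recording
        records.append({"path": f.name, "mode": f.mode, "offset": f.tell()})
    return records


def reopen_files(records):
    """Reopen the recorded files and seek each one to its saved offset."""
    reopened = []
    for rec in records:
        was_write_only = "w" in rec["mode"]
        # Reopening with "w" would truncate the file to zero length,
        # so a file that was being written is reopened for update instead.
        mode = "r+b" if was_write_only else rec["mode"]
        f = open(rec["path"], mode)
        f.seek(rec["offset"])
        if was_write_only:
            # Assumption: output written after the checkpoint is discarded.
            f.truncate()
        reopened.append(f)
    return reopened
```

This does not address files that were changed by someone else between
checkpoint and restart, which is where the journaling and transaction
questions above come in.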
goal:
Compile the proceedings of the discussion and present the responses in an
informational document.
ACTION ITEM: all: send URLs to API info to the list