Minutes from the GGF7 GridCPR meeting
Meeting of the Grid Checkpoint and Recovery Working Group (GridCPR-WG)
at GGF7, Tokyo, Japan, March 6, 2003
Chaired by: Derek Simmel, Pittsburgh Supercomputing Center, <dsimmel@psc.edu>
Minutes: Thilo Kielmann, Vrije Universiteit, <kielmann@cs.vu.nl>
CW Hobbs, VERITAS Software Corporation, <cw.hobbs@veritas.com>
Meeting agenda:
1. Opening, administrative updates
2. Discussion of first draft of GWD-I document
"An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
3. Review/Discussion of milestones and deliverables
4. Making better progress between meetings
5. Planning of next steps (e.g., GGF8)
6. Closing
1. Opening, administrative updates
----------------------------------
The group is now officially approved by GFSG
mailing list: gridcpr-wg@gridforum.org
web site: http://gridcpr.psc.edu/GGF (new contents in the making)
2. Discussion of first draft of GWD-I document
"An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
---------------------------------------------------------------------------
The document draft had been posted before GGF7:
http://gridcpr.psc.edu/GGF/docs/GridCPR001.doc
http://gridcpr.psc.edu/GGF/docs/GridCPR001.pdf
Derek briefly reviewed the document.
This spawned a lively discussion. The following statements are intended for
inclusion in further releases of the document:
- purpose of checkpoints: for fault-tolerance, and portability
- API: write/read from stable storage
  and parameterize this, for efficiency
- incremental checkpoints
- what kind of data to checkpoint? system-dependent??? (better not)
- what if time differences make a difference after restart?
- again review papers/presentations from our GGF6 discussion to get
  jump started
- communication channels may be critical, likely factor them out of the design.
- the GridCPR API shall talk to 1 or more (OGSA) services
- look at/combine APIs from EDG and PSC approaches
- data format might be HDF5
  actual data format is application-specific,
  should be a parameter to the checkpointing
- we need both an API and an infrastructure
  for storing checkpoint data
- management of history of checkpoints of a job, naming checkpoints
  leads to a tree of checkpoints
- using checkpoints for application steering
- interaction with AAA
- charging up to checkpoint only(?) if you do checkpoint
- we need to define which parts should be in the architecture
- e.g., data storage, meta data
- getting a handle for the checkpoint data, goes to where??
- a scheduler? a job manager?
- if OGSA, then service extensibility can help in building more or less
  specific service interfaces
- look at Avaki: secure, global naming scheme (in a WG)
- SRB: naming, might be a starting point
- we need to check what other GGF WGs have produced for that already,
  candidates: DATA, GRAAP
- collect users' requirements
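
To make some of the points above more concrete (write/read from stable
storage, an application-specific data format passed as a parameter, and
named checkpoints forming a tree), here is a minimal Python sketch. It is
only an illustration and not a proposal from the meeting: the class
CheckpointStore, its methods, the file layout, and the demo function are
all hypothetical names, and a local directory stands in for whatever
stable storage a Grid service would provide.

```python
import json
import os
import tempfile
import uuid
from typing import Any, Callable, List, Optional


class CheckpointStore:
    """Hypothetical store: one data file and one metadata file per checkpoint."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _atomic_write(self, path: str, payload: bytes) -> None:
        # Write to a temporary file, force it to disk, then rename, so a
        # crash never leaves a partial file under the final name.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    def write(self, state: Any, encode: Callable[[Any], bytes],
              parent: Optional[str] = None) -> str:
        """Store a checkpoint and return its name, used as the handle."""
        name = uuid.uuid4().hex
        # The data format is the application's choice: it supplies encode().
        self._atomic_write(os.path.join(self.root, name + ".data"),
                           encode(state))
        # Metadata records the parent, which links checkpoints into a tree.
        meta = json.dumps({"parent": parent}).encode("utf-8")
        self._atomic_write(os.path.join(self.root, name + ".meta"), meta)
        return name

    def read(self, name: str, decode: Callable[[bytes], Any]) -> Any:
        """Load a checkpoint using the application's decode()."""
        with open(os.path.join(self.root, name + ".data"), "rb") as f:
            return decode(f.read())

    def history(self, name: str) -> List[str]:
        """Names from the given checkpoint back to the root of its tree."""
        chain: List[str] = []
        current: Optional[str] = name
        while current is not None:
            chain.append(current)
            with open(os.path.join(self.root, current + ".meta"), "rb") as f:
                current = json.loads(f.read().decode("utf-8"))["parent"]
        return chain


def demo():
    """Write three checkpoints, two of them branching from the first."""
    def encode(state):
        return json.dumps(state).encode("utf-8")

    def decode(raw):
        return json.loads(raw.decode("utf-8"))

    with tempfile.TemporaryDirectory() as root:
        store = CheckpointStore(root)
        first = store.write({"step": 10}, encode)
        second = store.write({"step": 20}, encode, parent=first)
        branch = store.write({"step": 15}, encode, parent=first)
        lineage_ok = store.history(branch) == [branch, first]
        return store.read(second, decode), lineage_ok
```

In this sketch the encode/decode pair is where a format such as HDF5 could
be plugged in, and the parent link is one possible way to keep the history
of a job's checkpoints. Questions raised above, such as who receives the
handle (a scheduler? a job manager?) and how AAA interacts with the store,
are deliberately left out.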
3. Review/Discussion of milestones and deliverables
---------------------------------------------------
The discussion led to the following agreed-upon list of milestones:
March 2003 - GGF7 Tokyo
- Discuss and amend draft GFD-I document detailing scope of GridCPR API and
services; Discuss and establish GridCPR Working Group development (virtual)
meetings to be held on a regular schedule between GGF meetings.
(before next GGF) #1 Ratify Architecture for GridCPR Services & API
June 2003 - GGF8 Seattle
- Discuss initial draft of GridCPR API Specification; Discuss corresponding
Grid resource GridCPR service requirements, including interfaces to
underlying scheduling and accounting systems; Delegate specification writing
duties to selected authors.
Autumn 2003 - GGF9 Chicago
- Discuss proof of concept implementation of draft Grid Checkpoint Recovery
API/Service Specification.
Spring 2004 - GGF10 Frankfurt
- Publish Grid Checkpoint Recovery API/Service Specification 1.0
4. Making better progress between meetings
------------------------------------------
We need to make (much) more progress between the GGF meetings.
While discussing it, we have agreed on first pursuing discussion on the
mailing list, and spawning off phone conferences on a "by need" basis.
We also investigated Access Grid (AG) meetings (instead of phone conferences).
Tom Goodale proposed the use of a tool for using AG from a single laptop
(i.e., without expensive installations): http://www.vrvs.org
5. Planning of next steps (e.g., GGF8)
--------------------------------------
We identified the following immediate steps:
- Complete consensus diagram of GridCPR Services & API Architecture
- Identify and describe core components
- Interfaces and messaging within and to/from core
- Clearly draw scope boundaries of GridCPR-WG
- Complete GWD-I Architecture document draft & discuss within GridCPR-WG
(via mailing list)
- Reconcile architecture with other existing GGF research/working groups'
services & spec's.
- Publish GWD-I GridCPR Architecture Doc.
6. Closing
----------
The meeting ended and everybody went off for lunch ;-)