
Minutes from the GGF7 GridCPR meeting



Meeting of the Grid Checkpoint and Recovery Working Group (GridCPR-WG)
at GGF7, Tokyo, Japan, March 6, 2003 

Chaired by: Derek Simmel, Pittsburgh Supercomputing Center, <dsimmel@psc.edu>
Minutes:    Thilo Kielmann, Vrije Universiteit, <kielmann@cs.vu.nl>
            CW Hobbs, VERITAS Software Corporation, <cw.hobbs@veritas.com>



Meeting agenda:

1. Opening, administrative updates
2. Discussion of first draft of GWD-I document
   "An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
3. Review/Discussion of milestones and deliverables
4. Making better progress between meetings
5. Planning of next steps (e.g., GGF8)
6. Closing


1. Opening, administrative updates
----------------------------------

The group is now officially approved by GFSG
mailing list: gridcpr-wg@gridforum.org
web site:     http://gridcpr.psc.edu/GGF     (new contents in the making)


2. Discussion of first draft of GWD-I document
   "An Architecture for Grid Checkpoint Recovery Service and a GridCPR API"
---------------------------------------------------------------------------

The document draft had been posted before GGF7:

http://gridcpr.psc.edu/GGF/docs/GridCPR001.doc
http://gridcpr.psc.edu/GGF/docs/GridCPR001.pdf

Derek briefly reviewed the document.
This spawned a lively discussion. The following statements are intended for
inclusion in future releases of the document:

- purpose of checkpoints: fault tolerance and portability
- API: write/read from stable storage
  and parameterize this, for efficiency
- incremental checkpoints
- what kind of data to checkpoint? system-dependent??? (better not)
- what if time differences make a difference after restart?
- review the papers/presentations from our GGF6 discussion again to get
  a jump start
- communication channels may be critical, likely factor them out of the design.
- the GridCPR API shall talk to 1 or more (OGSA) services
- look at/combine APIs from EDG and PSC approaches
- data format might be HDF5
  actual data format is application specific,
  should be parameter to the checkpointing
- we need both an API and an infrastructure
  for storing checkpoint data
- management of the history of a job's checkpoints, naming checkpoints;
  this leads to a tree of checkpoints
  - using checkpoints for application steering
- interaction with AAA
  - charging up to checkpoint only(?) if you do checkpoint
- we need to define which parts should be in the architecture
  - e.g., data storage, meta data
  - getting a handle for the checkpoint data, goes to where??
  - a scheduler? a job manager?
- if OGSA, then service extensibility can help building more or less
  specific service interfaces
- look at Avaki: secure, global naming scheme (in a WG)
- SRB: naming, might be a starting point
- we need to check what other GGF WGs have produced for that already,
  candidates: DATA, GRAAP
- collect users' requirements
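
To make some of the points above more concrete (write/read from stable
storage, an application-specific data format passed as a parameter, and named
checkpoints forming a tree), here is a minimal Python sketch. It is purely
illustrative and not part of the draft document: the names CheckpointStore,
CheckpointRecord, write, read, history and children are hypothetical, the
in-memory dictionary merely stands in for stable storage, and JSON is an
assumed default format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class CheckpointRecord:
    """One named checkpoint; 'parent' links it into a tree (hypothetical)."""
    name: str
    parent: Optional[str]
    payload: bytes
    children: List[str] = field(default_factory=list)


class CheckpointStore:
    """Hypothetical in-memory stand-in for stable checkpoint storage."""

    def __init__(
        self,
        encode: Callable[[Any], bytes] = lambda s: json.dumps(s, sort_keys=True).encode("utf-8"),
        decode: Callable[[bytes], Any] = lambda b: json.loads(b.decode("utf-8")),
    ) -> None:
        # The data format is application specific, so the encoder and
        # decoder are parameters; JSON is only an assumed default.
        self._encode = encode
        self._decode = decode
        self._records: Dict[str, CheckpointRecord] = {}

    def write(self, name: str, state: Any, parent: Optional[str] = None) -> None:
        """Store 'state' under 'name', optionally as a child of 'parent'."""
        if name in self._records:
            raise ValueError(f"checkpoint {name!r} already exists")
        if parent is not None and parent not in self._records:
            raise KeyError(f"unknown parent checkpoint {parent!r}")
        self._records[name] = CheckpointRecord(name, parent, self._encode(state))
        if parent is not None:
            self._records[parent].children.append(name)

    def read(self, name: str) -> Any:
        """Return the decoded state stored under 'name'."""
        return self._decode(self._records[name].payload)

    def history(self, name: str) -> List[str]:
        """Return checkpoint names from the root of the tree down to 'name'."""
        chain: List[str] = []
        current: Optional[str] = name
        while current is not None:
            chain.append(current)
            current = self._records[current].parent
        chain.reverse()
        return chain

    def children(self, name: str) -> List[str]:
        """Return the names of checkpoints derived directly from 'name'."""
        return list(self._records[name].children)
```

In this sketch, writing two checkpoints with the same parent produces a branch
in the tree, which is one way the "application steering" idea could be
represented: a steered run restarts from an earlier checkpoint and records its
own line of descendants.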


3. Review/Discussion of milestones and deliverables
---------------------------------------------------

The discussion led to the following agreed-upon list of milestones:

March 2003 - GGF7 Tokyo 
- Discuss and amend draft GFD-I document detailing scope of GridCPR API and
  services; Discuss and establish GridCPR Working Group development (virtual)
  meetings to be held on a regular schedule between GGF meetings.
  (before next GGF) #1 Ratify Architecture for GridCPR Services & API

June 2003 - GGF8 Seattle
- Discuss initial draft of GridCPR API Specification; Discuss corresponding 
  Grid resource GridCPR service requirements, including interfaces to 
  underlying scheduling and accounting systems; Delegate specification writing
  duties to selected authors.

Autumn 2003 - GGF9 Chicago 
- Discuss proof of concept implementation of draft Grid Checkpoint Recovery
  API/Service Specification.

Spring 2004 - GGF10 Frankfurt 
- Publish Grid Checkpoint Recovery API/Service Specification 1.0


4. Making better progress between meetings
------------------------------------------

We need to make (much) more progress between the GGF meetings.
While discussing this, we agreed to first pursue discussion on the
mailing list, and to spawn off phone conferences on an "as needed" basis.
We also investigated AG meetings (instead of phone conferences).
Tom Goodale proposed a tool for using AG from a single laptop
(i.e., without expensive installations): http://www.vrvs.org


5. Planning of next steps (e.g., GGF8)
--------------------------------------

We identified the following immediate steps:

- Complete consensus diagram of GridCPR Services & API Architecture
- Identify and describe core components
- Interfaces and messaging within and to/from core
- Clearly draw scope boundaries of GridCPR-WG
- Complete GWD-I Architecture document draft & discuss within GridCPR-WG
  (via mailing list)
- Reconcile architecture with other existing GGF research/working groups'
   services & specs.
- Publish GWD-I GridCPR Architecture Doc.

6. Closing
----------

The meeting ended and everybody went off for lunch ;-)