Minutes of GGF9 GridCPR-WG Meeting
Folks, below are the collected minutes from our meeting in Chicago.
Many thanks to our notetakers and other contributors - Comments,
additions and corrections welcome.
Please do not forget to send your (Grid)CPR Use Cases (short
descriptions are OK) to Paul Stodghill and the list ASAP - Thanks! -
Derek
GGF9 - Chicago, Illinois
Grid Checkpoint Recovery Working Group (GridCPR-WG)
Sheraton Hotel, Michigan Room B
Tuesday October 7, 2003, 2:00~5:30pm
Session Chaired by Derek Simmel
Session Notetakers: Thilo Kielmann & Shantenu Jha
Session formalities:
Opened the meeting
Displayed and reviewed GGF Intellectual Property Policy with attendees
Circulated the attendee list
Reviewed the proposed Agenda:
2:00 Meeting Bootstrap, Introduction and Agenda Review
2:10 Review of GridCPR-WG after 1 year
3:00 Discussion/Revision of GridCPR Objectives,
Architecture, Diagram, Scope, Charter…
3:30 Break (30 minutes)
4:00 Structured Brainstorm to Generate Detailed Outline
of the GridCPR API & Services Architecture GWD-I document
5:00 Document Action Items and Writing Assignments
5:10 Planning for GGF10 (Frankfurt) and GGF11 (Hawaii),
Interim Objectives, and Charter Milestones Updates
5:30 Adjourn
Reviewed GridCPR-WG Administrative Status:
Website updates at http://gridcpr.psc.edu/GGF/
Mailing and Archive addresses
Reviewed GGF GridForge:
GridCPR Chairs (Derek Simmel and Thilo Kielmann) are administrators for
the GridCPR project.
All attendees are encouraged to create an account via
http://forge.gridforum.org/
Attendees should send e-mail to Derek Simmel <dsimmel@psc.edu> to have
their GridForge accounts added to the GridCPR project.
Reviewed 1st year of GridCPR-WG activities:
GGF5 (Edinburgh)
- Held a BoF session to gauge community interest in GridCPR
GGF6 (Chicago)
- Presentations
"Grid Checkpointing in the European DataGrid project" - Massimo
Sgaravatto
"The PSC CPR System: Scope, Applicability, and Implementation (SAI)"
- Nathan Stone
"Application-level Checkpointing for Parallel Applications" - Paul
Stodghill
"Checkpoint Recovery in Cactus 4.0" - Gabrielle Allen
"Checkpoint Recovery in Condor" - Derek Simmel
- Initial Scoping Discussion
GGF7 (Tokyo)
- Working Group Charter approved by GFSG (November 2002)
- Initial Draft of GridCPR Architecture Document reviewed
- More requirements, interactions and scope discussion
GGF8 (Seattle)
- GridCPR Architecture Discussion
Coordinated vs. Uncoordinated Checkpoints
Named Checkpoints
Checkpoint Data Management
Job Run complexity (not just a single binary, multiple systems,…)
Resource broker perspectives
Security for checkpoint files
- Use Case discussions
Examined the different views of GridCPR represented among
attendees
Mikel and Shantenu from RealityGrid
Paul Stodghill from Cornell
Tom Goodale and Thilo Kielmann from GridLab
Karpjoo Jeong, Konkuk University
Heon Yeom, Seoul National Univ
"Do we still want to write these (and others we will gather) up in more
detail and publish them as an informational document?" (e.g. Paul
Stodghill's use-case summary sent to the list)
(Rough consensus among attendees is to pursue development of a
use-cases document)
Reviewed GridCPR "Rough Sketch" and Architecture Goals/Audience from
GGF8
GridCPR Goal:
Applications developed correctly using the GridCPR API, which write
periodic checkpoint data sets, and which are interrupted during
execution, will be able to continue operations on a remote system
within a Grid, starting at an interim state represented by a retrieved
checkpoint data set recorded during the original execution.
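As a rough illustration of this goal, here is a minimal sketch in Python. It assumes file-oriented checkpoints in a directory that the restart system can also read. The names write_checkpoint, read_latest_checkpoint and run are hypothetical and are not part of any GridCPR API.

```python
import json
import os
import tempfile


def write_checkpoint(directory, step, state):
    """Atomically write one checkpoint data set as a JSON file."""
    final_path = os.path.join(directory, f"ckpt_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
    # Rename into place so a reader never sees a partially written file.
    os.replace(tmp_path, final_path)
    return final_path


def read_latest_checkpoint(directory):
    """Return (step, state) from the newest checkpoint, or None if none exist."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.startswith("ckpt_") and name.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(directory, names[-1])) as handle:
        record = json.load(handle)
    return record["step"], record["state"]


def run(directory, total_steps, interval, stop_after=None):
    """Sum the integers 1..total_steps, checkpointing every `interval` steps.

    `stop_after` simulates an interruption right after that step.
    """
    restored = read_latest_checkpoint(directory)
    step, total = restored if restored else (0, 0)
    while step < total_steps:
        step += 1
        total += step
        if step % interval == 0:
            write_checkpoint(directory, step, total)
        if stop_after is not None and step == stop_after:
            return None  # interrupted before completion
    return total
```

Calling run() once with stop_after set and then again on the same directory resumes from the last recorded checkpoint rather than from step 0. In a Grid setting the second call would happen on a different system.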
GridCPR API Specification Audience:
- Grid Application Developers
GridCPR Service Specification Audience:
- Grid Platform Developers and Vendors
- Grid Resource Operators
Other Stakeholders:
- Grid Standards and Specifications Developers
"What is the GRID in GridCPR?"
- User-level checkpointing
- Heterogeneity of source and restart platforms
- Ability to migrate a job from n nodes of a system to m nodes (likely
on a different system)
- Checkpoint API should be available everywhere
- Data/Checkpoint files are reusable anywhere
- Simple recompile for new platforms - no need to recode for a new
platform's checkpoint service
(Additional requirements)
- the grid must not be visible in the API (opaque)
- no dependencies on local names (e.g., volume names)
Action: vikas.deolaliker@sun.com to submit use case
Discussion ensued regarding existing APIs / uses of CPR
Derek Simmel reviewed the European Data Grid CPR API from Massimo
Sgaravatto's presentation.
Nathan Stone reported on portable CPR API early experience at the
Pittsburgh Supercomputing Center
- user-level API implemented originally on Lemieux (Terascale Computing
System)
- file-oriented semantics
- can also accommodate memory semantics
- checkpoint database (metadata) must be secured against tampering
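One possible way to detect tampering with checkpoint metadata is a keyed hash over each record. The sketch below is an assumption for illustration only. The minutes do not say how the PSC system protects its database, and the record layout and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_metadata(metadata, key):
    """Attach an HMAC-SHA256 tag computed over a canonical JSON encoding."""
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"metadata": metadata, "hmac": tag}


def verify_metadata(record, key):
    """Return True only if the metadata still matches its stored tag."""
    payload = json.dumps(record["metadata"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, record["hmac"])
```

A restart service holding the key could refuse any checkpoint whose record fails verify_metadata(). Key distribution across Grid sites is outside the scope of this sketch.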
Paul Stodghill (Cornell)
- making checkpointing transparent with compiler (and middleware)
- file-oriented APIs compatible
- expect GridCPR API and Service to work as a drop-in replacement
(Break 3:20-3:45pm)
(Grid)CPR Use Cases Document (to become GWD-I document)
Action: Paul Stodghill taking lead on pulling this informational
document together
Will generate rough template with points of comparison
Problem Statement
- What’s difficult for this particular scenario w.r.t. GridCPR
“How we would use GridCPR” for {this use case}
Within scope defined by Charter statement
Will send call for scenarios to mailing list
No formatting and other requirements
- Make it easy for people to send in content
Due dates
- Scenarios in to Paul by October 31, 2003
- Paul to send out first draft by November 30, 2003
- Revise via mailing list discourse for complete draft by January 15,
2004
- Send draft to GGF by February 1, 2004, to be presented at GGF10 for
final editing
- Complete final document review on/shortly after GGF10 -> GGF Doc
editor
Example use cases could include:
- migrate from n to m nodes
- migrate from platform a to platform b
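The "n to m nodes" case can be sketched as follows, assuming the application state is a single global array split into contiguous slices, one per node. The shard layout (a global offset plus data) and the function names are hypothetical and serve only to make the use case concrete.

```python
def partition(items, parts):
    """Split `items` into `parts` contiguous, near-equal slices."""
    base, extra = divmod(len(items), parts)
    slices, start = [], 0
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        slices.append(items[start:start + size])
        start += size
    return slices


def checkpoint(local_slices):
    """Record each node's slice together with its global offset."""
    shards, offset = [], 0
    for data in local_slices:
        shards.append({"offset": offset, "data": list(data)})
        offset += len(data)
    return shards


def restart(shards, new_parts):
    """Rebuild the global array from the shards, then split it for m nodes."""
    global_data = []
    for shard in sorted(shards, key=lambda s: s["offset"]):
        global_data.extend(shard["data"])
    return partition(global_data, new_parts)
```

For example, a run on n = 4 nodes could call checkpoint(partition(data, 4)), and a later restart on m = 3 nodes could call restart(shards, 3) to get the new per-node slices. Recording global offsets instead of node numbers keeps the checkpoint independent of the original node count.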
GridCPR Architecture Document (revamp original GWD-I document)
Action: Nathan Stone, Derek Simmel & Raghu Reddy (PSC) will draft the
outline for the Architecture Document
Headlines for major sections
Unresolved issues
Scope questions
Due Dates
- Draft out to mailing list by October 31, 2003.
- Review period - via mailing list - through November 30, 2003.
- Including meat added by gridcpr-wg members as we go
- Adding meat to bones…
- Jan. 15th - Checkpoint of the Architecture Document Outline
- Out for discussion (send in to GGF) Feb 1.
GridCPR World View Illustration
Action: Thilo will send a diagram of “Thilo’s view of GridCPR” (with
cunningly inserted errors as an exercise to readers ;)
Send to mailing list by October 20, 2003.
Supplement & organizing view for architecture document
It will be a piece of meat for the Architecture Doc
Comments back to mailing list between Oct 20~Nov 30, 2003.
Documentation of Existing APIs
Action: Those with some documentation of existing related APIs should
send pointers to them to the mailing list, e.g.,
- Nathan Stone, PSC CPR API
- European Data Grid
- Cactus
- Paul Stodghill, Cornell
- Others?
Send your descriptions to the mailing list
We will post them to the GridCPR website (later GridForge archive)
Action: Derek Simmel will review requirements for publishing a GGF
Recommended API regarding number and form of required independent
implementations.
Action: Stephen Pickles to ensure the RealityGrid use case is contributed
Discussion of Charter:
Original milestones were too ambitious given limited availability of
resources
Proposed Charter revisions to milestones:
Spring 2004 - GGF10 Frankfurt
Review Use Cases Document
Review Architecture Document
Initiate Draft API document
Summer 2004 - GGF11 Hawaii
Review Final Use Case & Architecture documents
Submit finalized GWD-I documents to GGF Editor
Review Draft API
Fall 2004 - GGF12
Review Existing Implementation Efforts
Review Final API document & submit to GGF Editor
Determine the nature and scope of required underlying GridCPR services
Spring 2005 - GGF13
Review Draft GridCPR Services Doc (if needed)
Summer 2005 - GGF14
Review Final GridCPR Services Doc & submit to GGF Editor
Action: Derek Simmel will submit charter milestone revisions to GFSG
for review.
Meeting Adjourned at 5:30pm
---
Derek Simmel <dsimmel@psc.edu>
Grid Computing Specialist
Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh PA 15213
(412) 268-1035