
Minutes from the GGF8 meeting



Sorry for being quite late. (Put 3 standard excuses here...)


The minutes are attached.


Thilo
-- 
Thilo Kielmann                                 http://www.cs.vu.nl/~kielmann/
GGF Grid Checkpoint Recovery WG (GridCPR-WG)

Notes of the meeting at GGF8, June 26, 2003

Notes by Shantenu Jha <s.jha@ucl.ac.uk> and Thilo Kielmann <kielmann@cs.vu.nl>


Session 1, chaired by D. Simmel

Introductions:

Overview by D. Simmel

GridCPR objective:
- heterogeneity: checkpoint on one system, restart possibly somewhere else 
- API does not imply any services, may also minimalistically be implemented
  "somehow", like local libraries
- API needs to be separate from the services
- separate API from "best practices" recommendations on how to use checkpoints,
  like coordinated or uncoordinated checkpointing

ACTION ITEM: everybody watch other groups w.r.t. overlap/contradictions


GridCPR Architecture Review
	GridCPR Goal (API definition)
		will require an application developer to
		modify his/her code to use the API
	GridCPR API Specification Audience
		Grid Application Developers
	GridCPR Service Specification Audience
		Grid Platform Developers and Vendors
		  -- is this critically required?
		     i.e. should CPR be dependent on grid services whose
		     presence on every platform be guaranteed?
		     i.e. decouple the API from any Grid Service
		Grid Resource Operators

GridCPR Architecture Discussion


	Paul Stodghill: coordinated or uncoordinated checkpoint. v1 or v2?
		       e.g. multiphysics


Chair's notes:
	map checkpoint to a certain application run
	need to be able to NAME a checkpoint

	management of checkpoint data

	make assumption that jobs are parallel, sequential being a special case
	checkpoint is a set of files that together 
            represent a recoverable state of an application
	application run is not just a binary

	Resource broker view
            need to restart job

	Security for the checkpoint files

	State Approach? [Java-ish]
	Register a certain set of variables?
		 when you call for a checkpoint, some built-in 
                 [programmatically] thing to write those
		 variables out to the checkpoint
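
A minimal sketch of the "register a set of variables" idea noted above, in
Python. Nothing here is a GridCPR API: the class CheckpointRegistry and its
methods register, checkpoint and restart are hypothetical names chosen for
illustration, and JSON is only an assumed stand-in for a portable data format.

```python
import json
import os
import tempfile


class CheckpointRegistry:
    """Hypothetical helper: remembers which variables belong in a checkpoint."""

    def __init__(self):
        self._getters = {}
        self._setters = {}

    def register(self, name, getter, setter):
        # The application registers each variable once, up front.
        self._getters[name] = getter
        self._setters[name] = setter

    def checkpoint(self, path):
        # Collect the current value of every registered variable.
        state = {name: get() for name, get in self._getters.items()}
        # Write to a temporary file and rename it into place, so that a
        # crash during writing never leaves a half-written checkpoint.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)

    def restart(self, path):
        # Push every saved value back into the application.
        with open(path) as f:
            state = json.load(f)
        for name, value in state.items():
            self._setters[name](value)
```

The sketch only handles values that JSON can represent; a real library would
need a pluggable format, as noted for HDF5 below.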

        What do we need to restart your application mid-stream?
	HDF5 should be a pluggable component rather than a required data format

        Application manager: keeps track of the progress of the job
		    from submission to completion

        GridCPR service makes use of other [Grid] data movement and 
                job management services


Scope to be clarified; wording to be determined later.

      SP: need for use cases
      TK: need for .... analogous to Network Weather Service

      GridLab, RealityGrid, Paul Stodghill offer to write use cases

===============================================================================

Session II, chaired by D. Simmel

Derek Simmel: recap and agenda...

Agenda:
CPR API structured Brainstorming, Discussion
Action Items and Writing Assignments
Planning for GGF9,10 & onwards


Session objectives:
Gather requirements for a GridCPR API, Scope Issues, 
       Prioritize GridCPR API Requirements,

Draw line b/w v1.0 and v2.0...

      Goal
---------------------
      Objective
---------------------
      Requirement

Objective:
Platform and OS independent
Development environment/tools independence
Predictable behaviour
Although GridCPR architecture may include separate, locally deployed 
      GridCPR services, the GridCPR API/library should be separable from them

use case presentations & discussions.
------------------------------------------------------------------------------
1.    Mikel and Shantenu from RealityGrid

component (seq/par app.) writes set of files
(might also be read by other component, e.g. viz for monitoring progress)

what is going into the checkpoint file?

data of a program -- not system-level
                     not all user-defined?
                     middle: using API to do portable checkpoint
------------------------------------------------------------------------------
2.    Paul Stodghill from Cornell

precompiler; programmer only needs to say where in the code the ckpt is to be called

N->N ckpt / recovery

with virtualization, N->M possible, e.g. for load redistribution

API needed for compiler-generated code (underneath the compiler magic)
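
A rough Python sketch of the N->M idea above, assuming the checkpoint is
stored per virtual rank and that restart simply hands contiguous blocks of
virtual ranks to the available physical processes. The function name
assign_virtual_ranks is hypothetical and is not taken from the Cornell work.

```python
def assign_virtual_ranks(n_virtual, m_physical):
    """Map n_virtual checkpointed ranks onto m_physical processes.

    Returns one list of virtual ranks per physical process, with block
    sizes differing by at most one.
    """
    if n_virtual < 0 or m_physical <= 0:
        raise ValueError("need n_virtual >= 0 and m_physical > 0")
    base, extra = divmod(n_virtual, m_physical)
    assignment = []
    start = 0
    for p in range(m_physical):
        # The first `extra` processes take one additional virtual rank.
        count = base + (1 if p < extra else 0)
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment
```

With N == M every process gets exactly one virtual rank, which is the plain
N->N case; other mappings (e.g. weighted by load) would fit the same shape.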
------------------------------------------------------------------------------
3.    Tom Goodale and Thilo Kielmann from GridLab

application writes set of files
which can be tracked by the application manager (who is in charge of
monitoring an application from submission to completion)

checkpointing happens at application level
(part of API of GAT -- Grid Application Toolkit)
------------------------------------------------------------------------------
4.    Karpjoo Jeong, Konkuk University

sequential program, defines critical state + location info, 
portable data format

always recover from the newest checkpoint?
no big deal giving a parameter "which one"
for some, it is a requirement

user-visible checkpoint labels

incremental checkpoints, relative to previous ones (like backups)

checkpoint performance prediction?? based on last checkpoint??
-> checkpoint rate

disk size of checkpoint considered important
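
A small Python sketch of user-visible labels combined with incremental
checkpoints, assuming the state is a flat dictionary of immutable values.
The class IncrementalStore and its methods are hypothetical and kept in
memory only; they illustrate the idea, not any presented implementation.

```python
class IncrementalStore:
    """Hypothetical store of labelled checkpoints kept as deltas."""

    _DELETED = object()  # marks a key removed since the previous checkpoint

    def __init__(self):
        self._labels = []  # labels in creation order
        self._deltas = {}  # label -> changes relative to previous checkpoint
        self._last = {}    # full state as of the newest checkpoint

    def save(self, label, state):
        """Store only what changed; return the number of entries written."""
        if label in self._deltas:
            raise ValueError("duplicate checkpoint label: %r" % (label,))
        delta = {k: v for k, v in state.items()
                 if k not in self._last or self._last[k] != v}
        for k in self._last:
            if k not in state:
                delta[k] = self._DELETED
        self._labels.append(label)
        self._deltas[label] = delta
        self._last = dict(state)
        return len(delta)

    def restore(self, label=None):
        """Rebuild the state at `label`; default is the newest checkpoint."""
        if not self._labels:
            raise LookupError("no checkpoints stored")
        if label is None:
            label = self._labels[-1]
        if label not in self._deltas:
            raise KeyError(label)
        state = {}
        # Replay deltas from the first checkpoint up to the requested one.
        for name in self._labels:
            for k, v in self._deltas[name].items():
                if v is self._DELETED:
                    state.pop(k, None)
                else:
                    state[k] = v
            if name == label:
                break
        return state
```

The return value of save gives a crude handle on checkpoint size; restore
cost grows with the number of deltas, as with incremental backups.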

------------------------------------------------------------------------------
5.    Heon Yeom, Seoul National Univ

recompile MPI-programs
checkpoint N->N processors (restart with same size)
coordinated and uncoordinated checkpoint, also message logging

job manager

------------------------------------------------------------------------------

cases 1 and 3 are similar
case 4 also highlighted the need to do incremental checkpoint 
       [checkpoint difference]
case 5 does uncoordinated checkpoint with message logging

problems left open: 

deal with open files?
save offset within checkpoint

-> journaling file system ??
transactions
standardize i/o interface??

goal:
Compile proceedings of the discussion and present response into an 
informational document.

ACTION ITEM: all: send URLs to API info to the list