Jeff, Anda, JP, Golbon,

Here are some notes I took Friday. Please modify and add to them as you wish; in particular, you may need to add something about your discussions later on Friday. Note where you disagree with what I have written. We'll eventually circulate these to the rest of the group. It is good to have a record of our discussions because it solidifies ideas and avoids the need to repeat things too many times.

Steve

========

SUMMARY OF metaneos discussion of 3/12/99.

PRESENT: Anda, Jean-Pierre, Golbon, Jeff, Steve.

DEFINITION OF METACOMPUTING. Anda had a 1992 paper by Catlett and Smarr that apparently states that programming for a metacomputer should be as transparent as programming for a PC. Steve opined that this would be great, but that general compilers producing efficient code for the heterogeneous, dynamic, unreliable platform of a typical metacomputer will not happen in our lifetimes, or probably ever. We are constrained to work in the real world, working with and building on the tools and algorithms at hand.

In metaneos, our job is to identify optimization problem classes, metacomputer platforms, and infrastructure tools that give us new capabilities (e.g. for solving large, difficult optimization problems cheaply). This process inevitably involves a dialog between the metacomputing people and the optimizers; it motivates new algorithm/software developments in both areas. When a clear separation between the two areas can be identified (e.g. the use of MPICH-G to implement codes written with MPI), it is good to exploit it, but the algorithm/code should still be cognizant of the nature of the metacomputer platform, to ensure fault tolerance and to avoid gross inefficiencies in the use of the resources.

BRANCH-AND-BOUND ALGORITHMS. Anda is working on general B&B frameworks in the context of metacomputing for her Master's thesis. She identified a tool called DIB, developed by a group at Madison, for performing tree search algorithms on parallel machines; B&B is a special case. When a node is branched, the parent remains responsible for whatever work it "gives away" to a child processor. Whenever the parent gives something to a child and then runs out of work before hearing back from that child, it starts to work on the task itself. If it hears from the child before finishing, it stops and accepts the child's result. There's redundancy here, but some fault tolerance is gained. However, it's not clear that sharing of global information is supported by this model. Global information is essential to some B&B applications and problems, but not to all.
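To make DIB's give-away bookkeeping concrete, here is a minimal sketch in Python. This is NOT the DIB interface - the names and structure are invented, and the real DIB runs across processors rather than inside one process - it just records the protocol described above: a parent stays responsible for work it gives away, redoes that work when it falls idle, and accepts the child's result if that arrives first.

    class Parent:
        # Invented names; a schematic of the protocol only, not DIB itself.
        def __init__(self):
            self.outstanding = {}   # task_id -> task handed to a child
            self.results = {}       # task_id -> accepted result

        def give_away(self, task_id, task):
            # The parent stays responsible until a result comes back.
            self.outstanding[task_id] = task

        def child_reported(self, task_id, result):
            # Child finished first: accept its result, drop our obligation.
            if task_id in self.outstanding:
                del self.outstanding[task_id]
                self.results[task_id] = result

        def when_idle(self, solve):
            # Out of local work: redo outstanding tasks ourselves. If the
            # child never reports, our answer stands; this redundancy is
            # what buys the fault tolerance noted above.
            for task_id, task in list(self.outstanding.items()):
                result = solve(task)
                if task_id in self.outstanding:   # child still silent
                    self.child_reported(task_id, result)

    p = Parent()
    p.give_away("subtree-1", [3, 1, 2])
    p.when_idle(solve=sum)    # no word from the child, so redo the task
    print(p.results)          # {'subtree-1': 6}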
CAN WE CLASSIFY B&B APPLICATIONS ACCORDING TO THEIR NEED FOR GLOBAL INFORMATION? Jeff said that this issue correlates strongly with the ease of finding a feasible solution. For instance, in the quadratic assignment problem (QAP) it is easy to find a feasible solution, so sharing of global information is not so critical. In some integer LPs, it is hard to find a feasible solution, so when one is found, it is critical to share this information; otherwise the B&B tree will not be pruned effectively. Sharing of information in integer LPs is also needed to form pseudocosts, for instance; Jeff has experience with this in PARINO.

MASTER-WORKER PARADIGM: COMMENTS. We have observed that the master-worker paradigm serves us quite well in a variety of optimization applications. What are some potential problems with it?

- SCALABILITY. As the pool of tasks and workers grows, the master will find it harder and harder to process worker requests and manage the task pool. However, Jeff thinks that this difficulty can often be circumvented by changing the grain size. For instance, in an integer LP solver, each task could be something like "explore a subtree starting at this node, for 5 minutes" or "until you have checked out X nodes", and it could return to the master a bunch of new nodes to be added to the pool. (The current FATCOP uses a small grain size - basically a single branch at the given node. Michael and Qun - are you interested in investigating the extension just outlined?) Jeff commented that if many more nodes become available to execute tasks, the task size should be INCREASED so that the master doesn't get swamped with too many "send me work" requests from the workers. A sketch of such a coarse-grained task appears at the end of these notes.

- ROBUSTNESS. The master has to stay up. It has to be checkpointed. What about replicating or splitting the master? Jeff has some experience with this idea in PARINO. Where does a worker go when it needs more work?

CONDOR/GLOBUS. We talked about the ways in which Condor and Globus currently interact, and in particular recalled what Miron discussed at the meeting of 2/26/99. Our understanding of the new facility is that it puts Condor at the center of things. A user starts a Condor job. They then use Globus (make a GRAM request) to grab some resources. When these resources are acquired, Condor is notified. Condor is thereafter free to allocate these nodes as part of its pool, but typically they would be available only to the person who claimed them by the GRAM call. This facility gives a way of getting scheduled resources allocated to your own job, by temporarily increasing the size of the Condor pool available to you. An earlier facility allowed submission to Condor through Globus; Golbon has already used this.
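As an appendix, here is a rough sketch of the coarse-grained task idea from the SCALABILITY item above. It is not taken from FATCOP or PARINO; all names (run_task, branch, bound_ok, the budget parameters) are invented for illustration, and a real worker would also carry incumbent and bound information.

    import time

    def run_task(root, branch, bound_ok, max_nodes=1000, max_seconds=300.0):
        # Explore the subtree under `root` depth-first, stopping when the
        # node or time budget is exhausted. Returns the count of nodes
        # explored plus the unexplored nodes, which the worker would ship
        # back to the master to be added to the task pool.
        deadline = time.time() + max_seconds
        stack = [root]
        explored = 0
        leftovers = []
        while stack:
            if explored >= max_nodes or time.time() >= deadline:
                leftovers = stack   # budget spent: return the rest
                break
            node = stack.pop()
            explored += 1
            if bound_ok(node):      # prune when the bound test fails
                stack.extend(branch(node))
        return explored, leftovers

    # Toy usage: nodes are integers, "branching" makes two children,
    # and the tree is capped at depth 20.
    explored, rest = run_task(
        0,
        branch=lambda n: [n + 1, n + 1] if n < 20 else [],
        bound_ok=lambda n: True,
        max_nodes=50,
    )
    print(explored, len(rest))

The point of the budget parameters is the tradeoff discussed above: raising max_nodes or max_seconds makes each "send me work" interaction with the master rarer, so the same master can serve many more workers.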