Notes and Comments from metaneos Meeting of March 12.
Present: Jean-Pierre Goux, Adriana Iamnitchi, Jeff Linderoth, Steve Wright, Golbon Zakeri

Topic 1. DEFINITION OF METACOMPUTING
-------------------------------------

Anda brought a 1992 paper by Catlett and Smarr that apparently states that programming for a metacomputer should be as transparent as programming for a PC: "The metacomputer is a network of heterogeneous, computational resources linked by software in such a way that they can be used as easily as a personal computer". We felt that this would be great, but that general compilers that produce efficient object code for the heterogeneous, dynamic, unreliable platform of a typical metacomputer would not happen in our lifetimes, or probably ever.

We are not the only ones to see this model as pie in the sky. A quote from the site http://www.uni-paderborn.de/pc2/projects/mol/index.htm:

"Beware! If you think about metacomputing as a comprehensive environment for both program development and program execution, metacomputing will remain a dream! But if you think about metacomputing as a global load-sharing facility for a limited number of registered applications, metacomputing is feasible today! We believe that Metacomputing is for end-users - not for program developers! Metacomputing is only for time-consuming applications! Metacomputing is done by interconnecting high-performance systems!"

Probably only the second of these claims about what metacomputing "is" is non-controversial, from our point of view.

In the metaneos project, we need to define clearly what *we* believe a metacomputer to be. At the same time, we need to identify the optimization problem classes, algorithms, and metacomputing infrastructure tools that will yield interesting new capabilities. This process inevitably involves a dialog between the metacomputing people and the optimizers; it motivates new algorithm/software developments in both areas. The taxonomy paper we are working on might actually be a somewhat valuable contribution -- if for nothing else than to help define a metacomputer. When a clear separation between the two areas can be identified (e.g. the use of MPICH-G to implement codes written with MPI), it is good to exploit it, but the algorithm/code should still be cognizant of the nature of the metacomputer platform, to ensure fault tolerance and to avoid gross inefficiencies in the use of the resources.

Topic 2. BRANCH-AND-BOUND ALGORITHMS.
-------------------------------------

Anda is working on general B&B frameworks in the context of metacomputing for her Master's thesis. She identified a tool called DIB, developed by a group at Madison, for performing tree search algorithms on parallel machines. B&B is a special case. Each node in the tree is a parent and is responsible for what it "gives away" to a child processor when it branches. Whenever it gives something to a child and then runs out of work before hearing back from the child, it starts to work on the task itself. If it hears from the child before finishing, it stops and accepts the child's result. There is redundancy here, but some fault tolerance is gained. However, it is not clear that sharing of global information is supported by this model.

Global information is essential to some B&B applications and problems, but not to all. CAN WE CLASSIFY B&B APPLICATIONS ACCORDING TO THEIR NEED FOR GLOBAL INFORMATION? One type of globally useful information is the value of the best known feasible solution; there are others.
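As a concrete reminder of why the incumbent matters, here is a minimal serial sketch in Python (a toy 0/1 knapsack with made-up data and hypothetical names, not FATCOP or PARINO code): any node whose bound cannot beat the best known feasible value is pruned, so a better incumbent, wherever it was found, immediately shrinks the remaining search.

# Toy serial branch-and-bound for a 0/1 knapsack (maximization), with
# made-up data and hypothetical names. The point: the incumbent (best
# feasible value found so far) is the one number that prunes the tree,
# so in a parallel or metacomputing setting it is the obvious candidate
# for global sharing. A stale incumbent only costs extra node
# evaluations; it never affects correctness.

values   = [10, 7, 6, 3]
weights  = [ 4, 3, 3, 2]
capacity = 7

def fractional_bound(k, value, room):
    """Upper bound from the LP (fractional) relaxation over items k..n."""
    bound = value
    for v, w in zip(values[k:], weights[k:]):
        if w <= room:
            room -= w
            bound += v
        else:
            bound += v * room / w   # take a fraction of the last item that fits
            break
    return bound

def solve():
    incumbent = 0                    # best feasible value found so far
    stack = [(0, 0, capacity)]       # (next item index, value so far, room left)
    explored = 0
    while stack:
        k, value, room = stack.pop()
        explored += 1
        if value > incumbent:
            incumbent = value        # <-- in a parallel code, broadcast this update
        if k == len(values):
            continue
        if fractional_bound(k, value, room) <= incumbent:
            continue                 # prune: this subtree cannot beat the incumbent
        stack.append((k + 1, value, room))                               # skip item k
        if weights[k] <= room:
            stack.append((k + 1, value + values[k], room - weights[k]))  # take item k
    return incumbent, explored

print(solve())   # -> (17, 5): the optimal value and the number of nodes explored

In this toy example the incumbent found early in the search prunes most of the tree; the classification question above is essentially asking, per application, how much is lost when that broadcast is delayed or omitted.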
The type of information sharing (or whether the information should be shared at all) should depend on the "importance" of the information being shared, the speed and ease of generating the information locally, and the underlying computer and network architecture. Our research should find suitable information sharing schemes for the types of information generated by our algorithms in a metacomputing environment. Information sharing issues are present in a "simple" parallel implementation, but we need to think of them in light of a metacomputing implementation.

Example: It is generally accepted (and shown through many computational experiments) that the value of the best known solution is an extremely important piece of information. In the quadratic assignment problem (QAP), it is easy to find a very good feasible solution, so the importance of sharing this information is lessened. In some integer LPs, it is hard to find a feasible solution, so when one is found it is critical to share this information; otherwise the B&B tree will not be pruned effectively. Sharing of information in integer LPs is needed to form pseudocosts, for instance - Jeff has experience of this in PARINO.

Topic 3. MASTER-WORKER PARADIGM
-------------------------------

We have observed that the master-worker paradigm serves us quite well in a variety of optimization applications. What are some potential problems with it?

- SCALABILITY. As the pool of tasks and workers grows, the master will find it harder and harder to process worker requests and manage the task pool. However, Jeff thinks that often this difficulty can be circumvented by changing the grain size. For instance, in an integer LP solver, each task could be something like "explore a subtree starting at this node, for 5 minutes" or "until you have checked out X nodes", and it could return to the master a bunch of new nodes to be added to the pool; a rough sketch of such a coarse-grained task appears at the end of this topic. (The current FATCOP uses a small grain size - basically a single branch at the given node. MICHAEL AND QUN - are you interested in investigating the extension just outlined?) Jeff commented that if many more nodes become available to execute tasks, maybe the task size should be INCREASED so that the master doesn't get swamped with too many "send me work" requests from the workers. There is a fine balance here, however: increasing the grain size also makes the control more decentralized, and perhaps the algorithm becomes more prone to "deceleration" anomalies. Increasing the grain size won't get us around all scalability problems, but it does help alleviate some of the difficulties.

- ROBUSTNESS. The master has to stay up, and it has to be checkpointed. What about replicating or splitting the master? Jeff's experience is with a parallel branch-and-bound implementation that distributes the active set among processors. There was a "distributor" entity that was responsible for overseeing the load balance and "quality" balance between the pools of active nodes. The algorithm was not "robust" in any sense: if any one of the nodes went down, you lost all the nodes currently in its active set. Robustness is a difficult issue. Anda is working on this.
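Here is the rough sketch of a coarse-grained task referred to under SCALABILITY above (hypothetical structures in Python, not the actual FATCOP or MW interface): the master hands a worker a subtree root, the current incumbent, and a time/node budget; the worker hands back the best incumbent it found together with its unexplored frontier, which the master folds back into the task pool.

# Sketch of a coarse-grained master-worker task for branch and bound
# (hypothetical structures, not the actual FATCOP/MW interface).
# The worker explores one subtree until a time or node budget runs out,
# then returns its best incumbent and the unexplored frontier nodes.

import time
from dataclasses import dataclass, field

@dataclass
class Task:
    root: object                 # subtree root handed out by the master
    incumbent: float             # best value known to the master when the task is sent
    time_limit: float = 300.0    # e.g. "explore for 5 minutes" ...
    node_limit: int = 10000      # ... or "until you have checked out X nodes"

@dataclass
class Result:
    incumbent: float                               # best value the worker found
    frontier: list = field(default_factory=list)   # unexplored nodes to return to the pool
    nodes_explored: int = 0

def run_task(task, branch, bound, value_if_leaf):
    """Bounded depth-first exploration of one subtree.

    branch(node)        -> list of child nodes
    bound(node)         -> upper bound on any solution below node
    value_if_leaf(node) -> feasible value if node is a leaf, else None
    """
    deadline = time.time() + task.time_limit
    result = Result(incumbent=task.incumbent)
    stack = [task.root]
    while stack:
        if time.time() > deadline or result.nodes_explored >= task.node_limit:
            result.frontier.extend(stack)   # budget spent: hand the rest back to the master
            break
        node = stack.pop()
        result.nodes_explored += 1
        if bound(node) <= result.incumbent:
            continue                        # pruned by the incumbent
        leaf_value = value_if_leaf(node)
        if leaf_value is not None:
            result.incumbent = max(result.incumbent, leaf_value)
            continue
        stack.extend(branch(node))
    return result

The time_limit and node_limit fields are then the grain-size knobs: a larger budget means fewer "send me work" requests hitting the master, at the price of more decentralized control and possibly more work done against a stale incumbent, which is exactly the balance discussed above.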
Topic 4. CONDOR/GLOBUS.
-----------------------

We talked about the ways in which Condor and Globus currently interact, and in particular recalled what Miron discussed at the meeting of 2/26/99. Our understanding of the new facility is that it puts Condor at the center of things. A user starts a Condor job. They then use Globus (make a GRAM request) to grab some resources. When these resources are acquired, Condor is notified. Condor is thereafter free to allocate these nodes as part of its pool, but typically they would be available only to the person who claimed them via the GRAM call. This facility gives a way of getting scheduled resources allocated to your own job, by temporarily increasing the size of the Condor pool available to you.

I think it would be an interesting and useful exercise to see whether we could get one of our MW or Condor/PVM applications to use this. Increasing the number and power of the resources available to us will almost assuredly...

* Show us the scalability (or lack thereof) of our MW approach
* Lead us to think of ways of overcoming this

In any event, in almost everyone's eyes, we won't *really* have done "metacomputing" (whatever that means) until we build an application that takes advantage of HUGE numbers of resources. Getting resources through the GRAM is a way to do this. An earlier facility allowed submission to Condor through Globus; Golbon has already used this.