This is a quite conservative approach, although it can still lead to errors. Following the automated assignment of function, we recommend that the user of WIT/WIT2 make a pass through the set of ORFs that have strong similarities to other proteins with known functional roles but for which no automated assignment could be made. WIT2 allows the user to peruse the BBHs for each protein, to align the protein against other proteins of known function, to analyze regions of similarity, and so forth. At this point, assignment of function is still a process of thoughtfully considering a wide range of alternatives, and the background of the user determines the quality of the assignments. We believe that the rapid addition of new genomes and the accumulation of a growing body of probable assignments of function, together with consistency checks based on clustering protein sequences, will lead to a situation in which most of the currently required judgment can be eliminated. However, we are not yet close to that point.
An Initial Set of Pathways
Once the initial assignment of functional roles has been completed (i.e., once the initial version of the entries in the protein-role table for the newly sequenced genome has been generated), one normally proceeds to the assertion of function diagrams (i.e., to the addition of entries to the asserted-diagrams table for the genome). As the collection of analyzed genomes increases, it becomes ever more likely that each new genome will contain a substantial similarity to a genome that has already been analyzed. If a fairly similar (biochemically and phenotypically) organism has already been analyzed, it is useful to begin the analysis of the new organism by asserting the diagrams that are believed to exist from the already analyzed organism. Some of the asserted pathways are likely to be wrong, but their removal can be deferred until after the initial assignment of pathways.
In any event, the user should move through the major areas of metabolism and ask the system to propose diagrams that might correspond to functionality present in the organism. A system supporting metabolic reconstruction should be able to support such requests. As we learn more about the reasoning required to accurately assert the presence of pathways, the proposal of pathways by the system can become increasingly precise. For now, we employ a very straightforward approach.
First, we take the entire collection of pathways and assign a score to each pathway. The score for a pathway is
(I + 0.5U) / (I + U + M),
where I is the number of functional roles in the diagram that have been connected to specific sequences in the genome, M is the number that have not been connected and for which known examples from other genomes exist, and U is the number of unconnected roles for which no exemplar exists from other genomes. This is a crude measure of the fraction of the functional roles that have been identified, considering that there are U roles for which reasoning by homology is impossible at this point.
Then, we sort the pathways by score and present to the user those that exceed some specified threshold. The user is expected to go through each proposed pathway and either assert it to the asserted-diagrams table or simply ignore the proposal.
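The scoring and ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the WIT2 implementation: the names PathwayCounts, pathway_score, and propose_pathways are hypothetical, and the treatment of an empty diagram (score 0) is an assumption that the text does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathwayCounts:
    """Role counts for one pathway diagram (hypothetical container)."""
    name: str
    identified: int  # I: roles connected to specific sequences in the genome
    missing: int     # M: unconnected roles with known exemplars elsewhere
    unknown: int     # U: unconnected roles with no exemplar in any genome


def pathway_score(p: PathwayCounts) -> float:
    """Compute (I + 0.5U) / (I + U + M) for one pathway."""
    total = p.identified + p.unknown + p.missing
    if total == 0:
        return 0.0  # assumption: a diagram with no roles scores zero
    return (p.identified + 0.5 * p.unknown) / total


def propose_pathways(pathways, threshold):
    """Return (score, name) pairs exceeding the threshold, best first."""
    scored = [(pathway_score(p), p.name) for p in pathways]
    kept = [pair for pair in scored if pair[0] > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

For example, a diagram with I = 8, M = 1, and U = 1 scores (8 + 0.5) / 10 = 0.85, whereas one with I = 2, M = 6, and U = 0 scores 0.25 and would be dropped at a threshold of 0.5.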
Locating Missing Functions
After we have accumulated an initial set of asserted diagrams, a pass through this asserted set must be made, focusing on the functional roles that remain unconnected to specific ORFs in the genome. Here, the system can provide a very useful function by collecting all known sequences that have been assigned the functional role, tabulating all similarities between ORFs in the new genome and these existing exemplars, and summarizing which of the existing ORFs is most likely to perform the designated functional role. Without a tool like WIT/WIT2, this process would be extremely time-consuming (and, in fact, would almost never be done systematically). In WIT2, we made the design decision to precompute similarities between all ORFs from the analyzed genomes and between these ORFs and entries in the nonredundant protein database maintained by NCBI. This allows an immediate response to requests to locate candidates for unconnected functional roles, summarizing BHs, BBHs, and all other similarities. The disadvantage of such a design commitment is that the collection of similarities is out of date almost immediately. Such a trade-off is commonly faced in developing bioinformatics servers. In our case, the severity of the problem is inevitably reduced by the addition of more genomes; that is, while the system may well not have access to all relevant similarities, the chances of establishing a solid connection between a new sequence and a previously analyzed sequence with an established function improve dramatically as the set of completely sequenced (and increasingly analyzed) genomes grows.
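The tabulation of candidates for an unconnected role might be sketched as follows. This is a hedged illustration over a precomputed similarity table, not WIT2 code: the function name rank_candidates, the tuple layout (ORF, other protein, score), and the convention that a higher score means a stronger similarity are all assumptions. The sketch ranks ORFs only by their hits against exemplars of the role and does not attempt to distinguish BHs from BBHs.

```python
from collections import defaultdict


def rank_candidates(similarities, exemplars):
    """Summarize which ORFs of the new genome best match a role's exemplars.

    similarities: iterable of (orf_id, other_id, score) tuples, assumed to be
                  precomputed; a higher score means a stronger similarity.
    exemplars:    set of protein ids already assigned the functional role.
    Returns rows of (orf_id, best_score, best_exemplar, number_of_hits),
    strongest candidate first.
    """
    hits = defaultdict(list)
    for orf, other, score in similarities:
        if other in exemplars:  # keep only hits against known exemplars
            hits[orf].append((score, other))

    summary = []
    for orf, orf_hits in hits.items():
        best_score, best_exemplar = max(orf_hits)
        summary.append((orf, best_score, best_exemplar, len(orf_hits)))

    # Strongest score first; break ties by number of hits, then by ORF id.
    summary.sort(key=lambda row: (-row[1], -row[3], row[0]))
    return summary
```

Because the similarities are looked up rather than recomputed, a request of this kind can be answered immediately, which is the benefit of the precomputation described above; the cost is that the table reflects only the sequences known when it was built.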
Once the system has located candidates for an unconnected functional role, the process of actually coming to a conclusion about whether a given sequence should be connected to the functional role is arbitrarily complex and corresponds to the types of decisions made while doing the initial assignments. In this case, however, the user of the system has the additional knowledge that assignments based on weak similarities may be strongly supported by the presence of assignments to other functional roles from the same diagram. This represents one of the pragmatic motivations for developing metabolic reconstructions: they offer a means of developing strong support for assignments based on relatively weak similarities.
We emphasize that the assertion of specific diagrams (i.e., pathways) should be considered in the context of known biochemical and phenotypic data. A variety of assignments cannot be made solely based on sequence similarities. For example, one might consider the choice between malate dehydrogenase and lactate dehydrogenase. Although examples of sequences that play these roles are extremely similar (exhibiting almost arbitrarily strong similarity scores), the choice between these functional roles often can be made only by using biochemical evidence or a more detailed sequence analysis based on either the construction of trees or the analysis of "signatures" (i.e., positions in the sequence that correlate with the functional role). Similarly, the choice between assigning a functional role of aspartate oxidase, fumarate reductase, or succinate dehydrogenase will require establishing an overview of the lifestyle of the organism, followed by a detailed analysis of all related sequences present in the genome. These examples are unusually difficult; in most cases the determination of function is much more straightforward. Even in these cases, however, the accumulation of more data will dramatically simplify things.
Balancing the Model
We turn now to the more difficult and critical step of balancing the model. By balancing, we mean considering questions of the following form:
"Since we know this compound is present (because we have asserted a given pathway for which it is a substrate), where does it come from? Is it synthesized, or is it imported?"
This consideration holds for all substrates to pathways, coenzymes, prosthetic groups, and so forth. In addition, we need to consider the issue of whether products of pathways are consumed by other cellular processes or are excreted.
To begin this process, the user must first make tables including all substrates of asserted pathways and all products of asserted pathways. As we stated above, our simplified notion of function diagram does not require that substrates and products be included. However, if one wishes to automate this aspect of metabolic reconstruction (which we have not yet done), the data must be accurately encoded. Once such tables exist, we can remove all compounds that occur as both substrates and products. Two lists remain: the substrates that are not produced by any asserted pathway, and the products that are not consumed by any asserted pathway.
The user must go through these lists carefully and assess how best to reconcile the situation. This task may require searching for a protein that might be a potential transporter, asserting a new pathway for which a limited amount of evidence exists, or formulating some other hypothesis about what is going on.
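The bookkeeping behind the two lists can be sketched with set operations. This is an illustration of the idea only, since the text notes that this step has not been automated in WIT2; the function name unbalanced_compounds and the pathway data in the example are hypothetical and highly simplified.

```python
def unbalanced_compounds(pathways):
    """Find compounds whose source or fate the asserted pathways do not explain.

    pathways: mapping of pathway name -> (substrates, products), where each
              element is an iterable of compound names.
    Returns (unexplained_substrates, unexplained_products), each sorted.
    """
    substrates = set()
    products = set()
    for subs, prods in pathways.values():
        substrates.update(subs)
        products.update(prods)

    # Compounds both produced and consumed are internally balanced.
    balanced = substrates & products
    return sorted(substrates - balanced), sorted(products - balanced)


# Hypothetical, simplified example of two asserted pathways.
asserted = {
    "glycolysis": (["glucose"], ["pyruvate"]),
    "lactate fermentation": (["pyruvate"], ["lactate"]),
}
needs_source, needs_sink = unbalanced_compounds(asserted)
```

In this example pyruvate is removed because it is both produced and consumed, leaving glucose as a substrate that must be imported or synthesized and lactate as a product that must be consumed or excreted.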
Once the user has analyzed the situation as it relates to substrates and products of pathways, a similar analysis must be applied to known cofactors, coenzymes, and prosthetic groups. In this case, the logical issue of potential producers and consumers of specific compounds must be analyzed, but additional issues relating to volumes of flows can also be analyzed. At this point, most of this type of analysis requires a substantial amount of expertise, and many of the decisions are necessarily impossible to make with any certainty. The situation is exacerbated by the difficulty of determining the precise function of a wide class of transport proteins, as well as by the potential for broad specificity for many enzymes. In this regard, while the situation is currently tractable only for those with substantial biochemical backgrounds (and not always even for them), it is clearly possible that rapid advances in our ability to perform more careful comparative analysis and to acquire biochemical confirmation of conjectures will gradually simplify this aspect of metabolic reconstruction as well.
Coordinating the Development of Metabolic Reconstructions
A metabolic reconstruction can be done by a number of individuals, often sharing a single model that is developed jointly. WIT2 includes the capability for multiple users either to work jointly on a single metabolic reconstruction or to develop such reconstructions in isolation. This is achieved as follows:
Our intent is that users develop metabolic reconstructions on many distinct Web servers, but that they be able to conveniently import the efforts of others working on the same genome.
Where Do We Stand?
At this point we are attempting to develop and maintain metabolic models for well over twenty organisms representing a remarkable amount of phylogenetic diversity (http://wit.at.msu). The development of these initial models will be, we believe, far more difficult than the efforts required to add new models for more organisms that are similar to these initially analyzed organisms. On the other hand, unicellular life exhibits an enormous amount of diversity; and when the task of analyzing multicellular organisms is contemplated, it is clear that an enormous amount of work is required to attain even approximate metabolic reconstructions.
As we develop these initial models, we have noted a clear core of functionality that is shared by a surprisingly varied set of organisms. Techniques for developing clusters of proteins that are clearly homologous and that perform identical functions in distinct organisms are now beginning to simplify efforts to develop metabolic reconstructions. Such techniques are also leading to a clear hypothesis about the historical origins of specific functions.
The task of constructing a detailed overview of the functional subsystems in specific organisms is closely related to the issue of characterizing the functions of genes in the gene pool. While specific organisms often have been analyzed in isolation, it is rapidly becoming clear that comparative analysis is the key to understanding even specific genomes and that characterization of the complete gene pool for unicellular life is far more tractable than previously imagined. Our goal is to develop accurate, although somewhat imprecise, functional overviews for unicellular organisms and to use these as a foundation for the analysis of multicellular eukaryotes. Just as protein families derived from unicellular organisms are beginning to form the basis for assigning function to many eukaryotic proteins, an understanding of the central metabolism of eukaryotes will be built on our rapidly expanding understanding of the evolution of functional systems within unicellular organisms.
A Growing Interest in Connecting Metabolic and Sequence Data
The growing perception that the metabolic structure must be encoded and used to interpret the emerging body of sequence data has resulted in a number of projects. Here we summarize the most successful of these projects at this time. With interest expanding so rapidly, the reader is encouraged to do a network search for other sites, which we believe will continue to appear at a growing rate.
Availability of the Pathways, Software, and Models
The PUMA (http://www.mcs.anl.gov/home/compbio/PUMA/Production/puma.html), WIT (http://www.cme.msu.edu/WIT/) [7], and WIT2 (http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi) systems were developed at Argonne National Laboratory in close cooperation with the team of Evgeni Selkov in Russia. The beta release for WIT2 has been sent to four sites and is currently available. The first actual release of WIT2 is scheduled for October 1997. It will include all of the software required to install WIT2 and develop a local Web server, all of our metabolic reconstructions for organisms with genomes in the publicly available archives, and detailed instructions for adding any new genomes to the existing system (perhaps, for local use only). Just as widespread availability of the Metabolic Pathway Database has stimulated a number of projects relating to the analysis of metabolic networks, we hope that the availability of WIT2 will foster the development and open exchange of detailed metabolic reconstructions.
Acknowledgments
R.O. was supported by the U.S. Department of Energy, under Contract W-31-109-Eng-38. N.L. was supported by the Center for Microbial Ecology at Michigan State University (DEB 9120006). We also thank the Free Software Foundation and Larry Wall for their excellent software.
References