Metabolic Reconstruction Using the WIT/WIT2 Systems

This is a quite conservative approach, although it can still lead to errors. Following the automated assignment of function, we recommend that the user of WIT/WIT2 make a pass through the set of ORFs that have strong similarities to other proteins with known functional roles but for which no automated assignment could be made. WIT2 allows the user to peruse the BBHs for each protein, to align the protein against other proteins of known function, to analyze regions of similarity, and so forth. At this point, assignment of function is still a process of thoughtfully considering a wide range of alternatives, and the background of the user determines the quality of the assignments. We believe that the rapid addition of new genomes and the accumulation of a growing body of probable assignments of function, together with consistency checks based on clustering protein sequences, will lead to a situation in which most of the currently required judgment can be eliminated. However, we are not yet close to that point.

An Initial Set of Pathways

Once the initial assignment of functional roles has been completed (i.e., once the initial version of the entries in the protein-role table for the newly sequenced genome has been generated), one normally proceeds to the assertion of function diagrams (i.e., to the addition of entries to the asserted-diagrams table for the genome). As the collection of analyzed genomes increases, it becomes ever more likely that each new genome will contain a substantial similarity to a genome that has already been analyzed. If a fairly similar (biochemically and phenotypically) organism has already been analyzed, it is useful to begin the analysis of the new organism by asserting the diagrams that are believed to exist from the already analyzed organism. Some of the asserted pathways are likely to be wrong, but their removal can be deferred until after the initial assignment of pathways.

In any event, the user should move through the major areas of metabolism and ask the system to propose diagrams that might correspond to functionality present in the organism. A system supporting metabolic reconstruction should be able to support such requests. As we learn more about the reasoning required to accurately assert the presence of pathways, the proposal of pathways by the system can become increasingly precise. For now, we employ a very straightforward approach.

First, we take the entire collection of pathways and assign a score to each pathway. The score for a pathway is

(I + 0.5U) / (I + U + M),

where I is the number of functional roles in the diagram that have been connected to specific sequences in the genome, M is the number that have not been connected and for which known examples from other genomes exist, and U is the number of unconnected roles for which no exemplar exists from other genomes. This is a crude measure of the fraction of the functional roles that have been identified, considering that there are U roles for which reasoning by homology is impossible at this point.

Then, we sort the pathways by score and present to the user those that exceed some specified threshold. The user is expected to go through each proposed pathway and either assert it to the asserted-diagrams table or simply ignore the proposal.

Locating Missing Functions

After we have accumulated an initial set of asserted diagrams, a pass through this asserted set must be made, focusing on the functional roles that remain unconnected to specific ORFs in the genome. Here, the system can provide a very useful function by collecting all known sequences that have been assigned the functional role, tabulating all similarities between ORFs in the new genome and these existing exemplars, and summarizing which of the existing ORFs is most likely to perform the designated functional role. Without a tool like WIT/WIT2, this process would be extremely time-consuming (and, in fact, would almost never be done systematically). In WIT2, we made the design decision to precompute similarities between all ORFs from the analyzed genomes and between these ORFs and entries in the nonredundant protein database maintained by NCBI. This allows an immediate response to requests to locate candidates for unconnected functional roles, summarizing BHs, BBHs, and all other similarities. The disadvantage of such a design commitment is that the collection of similarities is out of date almost immediately. Such a trade-off is commonly faced in developing bioinformatics servers. In our case, the severity of the problem is inevitably reduced by the addition of more genomes � that is, while the system may well not have access to all relevant similarities, the chances of establishing a solid connection between a new sequence and a previously analyzed sequence with an established function improve dramatically as the set of completely sequenced (and increasingly analyzed) genomes grow.

Once the system has located candidates for an unconnected functional role, the process of actually coming to a conclusion about whether a given sequence should be connected to the functional role is arbitrarily complex and corresponds to the types of decisions made while doing the initial assignments. In this case, however, the user of the system has the additional knowledge that assignments based on weak similarities may be strongly supported by the presence of assignments to other functional roles from the same diagram. This represents one of the pragmatic motivations for developing metabolic reconstructions: they offer a means of developing strong support for assignments based on relatively weak similarities.

We emphasize that the assertion of specific diagrams (i.e., pathways) should be considered in the context of known biochemical and phenotypic data. A variety of assignments cannot be made solely based on sequence similarities. For example, one might consider the choice between malate dehydrogenase and lactate dehydrogenase. Although examples of sequences that play these roles are extremely similar (exhibiting almost arbitrarily strong similarity scores), the choice between these functional roles often can be made only by using biochemical evidence or a more detailed sequence analysis based on either the construction of trees or the analysis of "signatures" (i.e., positions in the sequence that correlate with the functional role). Similarly, the choice between assigning a functional role of aspartate oxidase, fumarate reductase, or succinate dehydrogenase will require establishing an overview of the lifestyle of the organism, followed by a detailed analysis of all related sequences present in the genome. These examples are unusually difficult; in most cases the determination of function is much more straightforward. Even in these cases, however, the accumulation of more data will dramatically simplify things.

Balancing the Model

We turn now to the more difficult and critical step of balancing the model. By balancing, we mean considering questions of the following form:

"Since we know this compound is present (because we have asserted a given pathway for which it is a substrate), where does it come from? Is it synthesized, or is it imported?"

This consideration holds for all substrates to pathways, coenzymes, prosthetic groups, and so forth. In addition, we need to consider the issue of whether products of pathways are consumed by other cellular processes or are excreted.

To begin this process, the user must first make tables including all substrates of asserted pathways and all products of asserted pathways. As we stated above, our simplified notion of function diagram does not require that substrates and products be included. However, if one wishes to automate this aspect of metabolic reconstruction (which we have not yet done), the data must be accurately encoded. Once such tables exist, we can remove all compounds that occur as both substrates and products. Two lists remain:

A list of substrates that are not synthesized by any process depicted in any of the asserted function diagrams, and

A list of products that are not consumed by an processes depicted by asserted diagrams.

The user must go through these lists carefully and assess how best to reconcile the situation. This task may require searching for a protein that might be a potential transporter, asserting a new pathway for which a limited amount of evidence exists, or formulating some other hypothesis about what is going on.

Once the user has analyzed the situation as it relates to substrates and products of pathways, a similar analysis must be applied to known cofactors, coenzymes, and prosthetic groups. In this case, the logical issue of potential producers and consumers of specific compounds must be analyzed, but additional issues relating to volumes of flows can be analyzed. At this point, most of this type of analysis requires a substantial amount of expertise, and many of the decisions are necessarily impossible to make with any certainty. The situation is exacerbated by the difficulty of determining the precise function of a wide class of transport proteins, as well as by the potential for broad specificity for many enzymes. In this regard, while the situation is currently tractable only for those with substantial biochemical backgrounds (and not always by them), it is clearly possible that rapid advances in our ability to perform more careful comparative analysis and to acquire biochemical confirmation of conjectures will gradually simplify this aspect of metabolic reconstruction, as well.

Coordinating the Development of Metabolic Reconstructions

A metabolic reconstruction can be done by a number of individuals, often sharing a single model that is developed jointly. WIT2 includes the capability for multiple users either to work jointly on a single metabolic reconstruction or to develop such reconstructions in isolation. This is achieved as follows:

For each organism, a list of master users is installed. When these users alter a model, the change is visible by all users of the system.

When a user logs into a version of WIT2, he chooses a "user ID". Any set of users sharing the same ID will be working on the same model.

When any non-master user alters a model (asserts the existence of a diagram or makes an entry to the protein-role table), the change is visible only to the group of users sharing the same user ID. The model constructed within a given user ID should be viewed as an extension to the "standard" model generated by the master users.

A metabolic reconstruction for an organism (corresponding to a designated user ID) can be exported (i.e., converted to an external format), which can later be imported to any other version of WIT2 that includes the data for the organism.

Our intent is that users develop metabolic reconstructions on many distinct Web servers, but that they be able to conveniently import the efforts of others working on the same genome.

Where Do We Stand?

At this point we are attempting to develop and maintain metabolic models for well over twenty organisms representing a remarkable amount of phylogenetic diversity (http://wit.at.msu). The development of these initial models will be, we believe, far more difficult than the efforts required to add new models for more organisms that are similar to these initially analyzed organisms. On the other hand, unicellular life exhibits an enormous amount of diversity; and when the task of analyzing multicellular organisms is contemplated, it is clear that an enormous amount of work is required to attain even approximate metabolic reconstructions.

As we develop these initial models, we have noted a clear core of functionality that is shared by a surprisingly varied set of organisms. Techniques for developing clusters of proteins that are clearly homologous and that perform identical functions in distinct organisms are now beginning to simplify efforts to develop metabolic reconstructions. Such techniques are also leading to a clear hypothesis about the historical origins of specific functions.

The task of constructing a detailed overview of the functional subsystems in specific organisms is closely related to the issue of characterizing the functions or genes in the gene pool. While specific organisms often have been analyzed in isolation, it is rapidly becoming clear that comparative analysis is the key to understanding even specific genomes and that characterization of the complete gene pool for unicellular life is far more tractable than previously imagined. Our goal is to develop accurate, although somewhat imprecise, functional overviews for unicellular organisms and to use these as a foundation for the analysis of multicellular eukaryotes. Just as protein families derived from unicellular organisms are beginning to form the basis for assigning function to many eukaryotic proteins, an understanding of the central metabolism of eukaryotes will be built on our rapidly expanding understanding of the evolution of functional systems within unicellular organisms.

A Growing Interest in Connecting Metabolic and Sequence Data

The growing perception that the metabolic structure must be encoded and used to interpret the emerging body of sequence data has resulted in a number of projects. Here we summarize the most successful of these projects at this time. With interest expanding so rapidly, the reader is encouraged to do a network search for other sites, which we believe will continue to appear at a growing rate.

KEGG (http://www.genome.ad.jp/kegg/kegg3.html) [4]: This outstanding effort, based at Kyoto University in Japan, represents an attempt to maintain metabolic overviews for sequenced genomes. It has connected the genes from specific organisms to metabolic functions with excellent visual depictions of metabolic maps.

Boehringer Manheim Biochemical Pathways (http://expasy.hcuge.ch/cgi-bin/search-biochem-index): This excellent collection of metabolic pathways has been recently integrated into the SwissProt effort, allowing one to move between pathways, enzymes, and sequence data.

EcoCyc (http://www.ai.sri.com/ecocyc/ecocyc.html#Overview) [5]: This database is a detailed encoding of the metabolism of Escherichia coli and Haemophilus influenzae. Besides just the metabolic network, this collection includes some of the kinetic and thermodynamic parameters (when they are known).

Biocatalysis/Biodegradation Database (http://dragon.labmed.umn.edu/~lynda/index.html) [3]: This database covers a small, but significant, set of pathways that are of special interest in the area of xenobiotic degradation.

SoyBase (http://probe.nal.usda.gov:8000/plant/aboutsoybase.html): This databases captures genetic and metabolic data for soybeans.

Maize DB (http://teosinte.agron.missouri.edu/) [6]: This database is a comprehensive collection of maize genetic and biochemical data.

Availability of the Pathways, Software, and Models

The PUMA (http://www.mcs.anl.gov/home/compbio/PUMA/Production/puma.

html), WIT (http://www.cme.msu.edu/WIT/) [7], and WIT2 (http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi) systems were developed at Argonne National Laboratory in close cooperation with the team of Evgeni Selkov in Russia. The beta release for WIT2 has been sent to four sites and is currently available. The first actual release of WIT2 is scheduled for October 1997. It will include all of the software required to install WIT2 and develop a local Web server, all of our metabolic reconstructions for organisms with genomes in the publicly available archives, and detailed instructions for adding any new genomes to the existing system (perhaps, for local use only). Just as widespread availability of the Metabolic Pathway Database has stimulated a number of projects relating to the analysis of metabolic networks, we hope that the availability of WIT2 will foster the development and open exchange of detailed metabolic reconstructions.

Acknowledgments

R.O. was supported by the U.S. Department of Energy, under Contract W-31-109-Eng-38. N.L. was supported by the Center for Microbial Ecology at Michigan State University (DEB 9120006). We also thank the Free Software Foundation and Larry Wall for their excellent software.

References

Badger, J. H., and Olsen, G. J. CRITICA: Coding Region Identification Tool Invoking Comparative Analysis, Molec. Bil. Evol., 1977, in press.

Borodovsky M., and Peresetsky, A. Deriving Non-Homogeneous DNA Markov Chain Models by Cluster Analysis Algorithm Minimizing Multiple Alignment Entropy. Comput Chemistry, 18, no. 3, 1994, pp. 259-267.

Ellis, L.B.M., and Wackett, L. P. A Microbial Biocatalysis Database, Soc. Ind. Microb. News. 45, no. 4, 1995, pp. 167-173.

Kanehisa, M., Toward Pathway Engineering: A New Database of Genetic and Molecular Pathways, Science and Technology Japan, 59, 1996, pp. 34-38.

Karp, P, Riley, M., Paley, S., and Pellegrini-Toole, A. EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism, Nucleic Acids Research, 25, no. 1, 1997

Nelson, O., Coe, E., and Langdale, J., Genetic Nomenclature Guide. Maize. Trends Genet., March, 1995, 20-21

Overbeek O., Larsen, N., Smith, W., Maltsev, N., and Selkov, E.. Representation of Function: The Next Step. Gene-COMBIS (on-line): 31 January 1997; Gene 191, no. 1,: GC1-9

Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N., Nenashev, V., Overbeek, R., .Panushkina, E., Pronevitch, I., Selkov Jr., E.., and Yunus, I. The Methabolic Pathway Collection from EMP: The Enzymes and Metabolic Pathways Database. Nucleic Acids Research, 24, no. 1 (database issue), 1996, pp. 26-29.