Metabolic Reconstruction Using the WIT/WIT2 Systems

An Initial Set of Pathways

Once the initial assignment of functional roles has been completed (i.e., once

the initial version of the entries in the protein-role table for the newly

sequenced genome has been generated), one normally proceeds to the assertion

of function diagrams (i.e., to the addition of entries to the asserted-diagrams

table for the genome). As the collection of analyzed genomes increases, it

becomes ever more likely that each new genome will contain a substantial

similarity to a genome that has already been analyzed. If a fairly similar

(biochemically and phenotypically) organism has already been analyzed, it is

useful to begin the analysis of the new organism by asserting the diagrams that

are believed to exist from the already analyzed organism. Some of the asserted

pathways are likely to be wrong, but their removal can be deferred until after

the initial assignment of pathways.

In any event, the user should move through the major areas of metabolism

and ask the system to propose diagrams that might correspond to

functionality present in the organism. A system supporting metabolic

reconstruction should be able to support such requests. As we learn more

about the reasoning required to accurately assert the presence of pathways,

the proposal of pathways by the system can become increasingly precise.

For now, we employ a very straightforward approach.

First, we take the entire collection of pathways and assign a score to each

pathway. The score for a pathway is

(I + 0.5U) / (I + U + M),

where I is the number of functional roles in the diagram that have been

connected to specific sequences in the genome, M is the number that have not

been connected and for which known examples from other genomes exist, and

U is the number of unconnected roles for which no exemplar exists from other

genomes. This is a crude measure of the fraction of the functional roles that

have been identified, considering that there are U roles for which reasoning by homology is impossible at this point.

Then, we sort the pathways by score and present to the user those that

exceed some specified threshold. The user is expected to go through each

proposed pathway and either assert it to the asserted-diagrams table or

simply ignore the proposal.

Locating Missing Functions

After we have accumulated an initial set of asserted diagrams, a pass through

this asserted set must be made, focusing on the functional roles that remain

unconnected to specific ORFs in the genome. Here, the system can provide a

very useful function by collecting all known sequences that have been assigned

the functional role, tabulating all similarities between ORFs in the new

genome and these existing exemplars, and summarizing which of the existing

ORFs is most likely to perform the designated functional role. Without a tool

like WIT/WIT2, this process would be extremely time-consuming (and, in fact, would almost never be done systematically). In WIT2, we made the design

decision to precompute similarities between all ORFs from the analyzed

genomes and between these ORFs and entries in the nonredundant protein

database maintained by NCBI. This allows an immediate response to requests

to locate candidates for unconnected functional roles, summarizing BHs,

BBHs, and all other similarities. The disadvantage of such a design

commitment is that the collection of similarities is out of date almost

immediately. Such a trade-off is commonly faced in developing bioinformatics

servers. In our case, the severity of the problem is inevitably reduced by the

addition of more genomes – that is, while the system may well not have access

to all relevant similarities, the chances of establishing a solid connection

between a new sequence and a previously analyzed sequence with an

established function improve dramatically as the set of completely sequenced (and increasingly analyzed) genomes grow.

Once the system has located candidates for an unconnected functional role,

the process of actually coming to a conclusion about whether a given

sequence should be connected to the functional role is arbitrarily complex

and corresponds to the types of decisions made while doing the initial

assignments. In this case, however, the user of the system has the

additional knowledge that assignments based on weak similarities may be

strongly supported by the presence of assignments to other functional roles

from the same diagram. This represents one of the pragmatic motivations

for developing metabolic reconstructions: they offer a means of developing

strong support for assignments based on relatively weak similarities.

We emphasize that the assertion of specific diagrams (i.e., pathways)

should be considered in the context of known biochemical and phenotypic

data. A variety of assignments cannot be made solely based on sequence

similarities. For example, one might consider the choice between malate

dehydrogenase and lactate dehydrogenase. Although examples of

sequences that play these roles are extremely similar (exhibiting almost

arbitrarily strong similarity scores), the choice between these functional

roles often can be made only by using biochemical evidence or a more

detailed sequence analysis based on either the construction of trees or the

analysis of "signatures" (i.e., positions in the sequence that correlate with

the functional role). Similarly, the choice between assigning a functional

role of aspartate oxidase, fumarate reductase, or succinate dehydrogenase

will require establishing an overview of the lifestyle of the organism,

followed by a detailed analysis of all related sequences present in the

genome. These examples are unusually difficult; in most cases the

determination of function is much more straightforward. Even in these

cases, however, the accumulation of more data will dramatically simplify

things.

Balancing the Model

We turn now to the more difficult and critical step of balancing the model. By

balancing, we mean considering questions of the following form:

"Since we know this compound is present (because we have asserted a given pathway for which it is a substrate), where does it come from? Is it synthesized, or is it imported?"

This consideration holds for all substrates to pathways, coenzymes, prosthetic

groups, and so forth. In addition, we need to consider the issue of whether

products of pathways are consumed by other cellular processes or are excreted.

To begin this process, the user must first make tables including all

substrates of asserted pathways and all products of asserted pathways. As

we stated above, our simplified notion of function diagram does not

require that substrates and products be included. However, if one wishes to

automate this aspect of metabolic reconstruction (which we have not yet

done), the data must be accurately encoded. Once such tables exist, we

remove all compounds that occur as both substrates and products. Two lists

remain:

A list of substrates that are not synthesized by any process depicted in any of the asserted function diagrams, and

A list of products that are not consumed by an processes depicted by asserted diagrams.

The user must go through these lists carefully and assess how best to reconcile the situation. This task may require searching for a protein that might be a potential transporter, asserting a new pathway for which a limited amount of evidence exists, or formulating some other hypothesis about what is going on.

Once the user has analyzed the situation as it relates to substrates and

products of pathways, a similar analysis must be applied to known

cofactors, coenzymes, and prosthetic groups. In this case, the logical issue

of potential producers and consumers of specific compounds must be

analyzed, but additional issues relating to volumes of flows can

analyzed. At this point, most of this type of analysis requires a substantial

amount of expertise, and many of the decisions are necessarily impossible

to make with any certainty. The situation is exacerbated by the difficulty of

determining the precise function of a wide class of transport proteins, as

well as by the potential for broad specificity for many enzymes. In this

regard, while the situation is currently tractable only for those with

substantial biochemical backgrounds (and not always by them), it is clearly

possible that rapid advances in our ability to perform more careful

comparative analysis and to acquire biochemical confirmation of

conjectures will gradually simplify this aspect of metabolic reconstruction,

as well.

Coordinating the Development of Metabolic Reconstructions

A metabolic reconstruction can be done by a number of individuals, often

sharing a single model that is developed jointly. WIT2 includes the capability

for multiple users either to work jointly on a single metabolic reconstruction or

to develop such reconstructions in isolation. This is achieved as follows:

For each organism, a list of master users is installed. When these users alter a model, the change is visible by all users of the system.

When a user logs into a version of WIT2, he chooses a "user ID". Any set of users sharing the same ID will be working on the same model.

When any non-master user alters a model (asserts the existence of a diagram or makes an entry to the protein-role table), the change is visible only to the group of users sharing the same user ID. The model constructed within a given user ID should be viewed as an extension to the "standard" model generated by the master users.

A metabolic reconstruction for an organism (corresponding to a designated user ID) can be exported (i.e., converted to an external format), which can later be imported to any other version of WIT2 that includes the data for the organism.

Our intent is that users develop metabolic reconstructions on many distinct

Web servers, but that they be able to conveniently import the efforts of others

working on the same genome.

Where Do We Stand?

At this point we are attempting to develop and maintain metabolic models for

well over twenty organisms representing a remarkable amount of phylogenetic

diversity (http://wit.at.msu). The development of these initial models will be,

we believe, far more difficult than the efforts required to add new models for

more organisms that are similar to these initially analyzed organisms. On the

other hand, unicellular life exhibits an enormous amount of diversity; and

when the task of analyzing multicellular organisms is contemplated, it is clear

that an enormous amount of work is required to attain even approximate

metabolic reconstructions.

As we develop these initial models, we have noted a clear core of functionality that is shared by a surprisingly varied set of organisms. Techniques for developing clusters of proteins that are clearly homologous and

that perform identical functions in distinct organisms are now beginning to

simplify efforts to develop metabolic reconstructions. Such techniques are

also leading to a clear hypothesis about the historical origins of specific

functions.

The task of constructing a detailed overview of the functional subsystems

in specific organisms is closely related to the issue of characterizing the

functions or genes in the gene pool. While specific organisms often have

been analyzed in isolation, it is rapidly becoming clear that comparative

analysis is the key to understanding even specific genomes and that

characterization of the complete gene pool for unicellular life is far more

tractable than previously imagined. Our goal is to develop accurate,

although somewhat imprecise, functional overviews for unicellular

organisms and to use these as a foundation for the analysis of multicellular

eukaryotes. Just as protein families derived from unicellular organisms are

beginning to form the basis for assigning function to many eukaryotic

proteins, an understanding of the central metabolism of eukaryotes will be

built on our rapidly expanding understanding of the evolution of functional

systems within unicellular organisms.

A Growing Interest in Connecting Metabolic and Sequence Data

The growing perception that the metabolic structure must be encoded and used

to interpret the emerging body of sequence data has resulted in a number of

projects. Here we summarize the most successful of these projects at this time.

With interest expanding so rapidly, the reader is encouraged to do a network

search for other sites, which we believe will continue to appear at a growing

rate.

KEGG (http://www.genome.ad.jp/kegg/kegg3.html) [4]: This outstanding effort, based at Kyoto University in Japan, represents an attempt to maintain metabolic overviews for sequenced genomes. It has connected the genes from specific organisms to metabolic functions with excellent visual depictions of metabolic maps.

Boehringer Manheim Biochemical Pathways (http://expasy.hcuge.ch/cgi-bin/search-biochem-index): This excellent collection of metabolic pathways has been recently integrated into the SwissProt effort, allowing one to move between pathways, enzymes, and sequence data.

EcoCyc (http://www.ai.sri.com/ecocyc/ecocyc.html#Overview) [5]: This database is a detailed encoding of the metabolism of Escherichia coli and Haemophilus influenzae. Besides just the metabolic network, this collection includes some of the kinetic and thermodynamic parameters (when they are known).

Biocatalysis/Biodegradation Database (http://dragon.labmed.umn.edu/~lynda/index.html) [3]: This database covers a small, but significant, set of pathways that are of special interest in the area of xenobiotic degradation.

SoyBase (http://probe.nal.usda.gov:8000/plant/aboutsoybase.html): This databases captures genetic and metabolic data for soybeans.

Maize DB (http://teosinte.agron.missouri.edu/) [6]: This database is a comprehensive collection of maize genetic and biochemical data.

Availability of the Pathways, Software, and Models

The PUMA (http://www.mcs.anl.gov/home/compbio/PUMA/Production/puma.

html), WIT (http://www.cme.msu.edu/WIT/) [7], and WIT2

(http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi) systems were

developed at Argonne National Laboratory in close cooperation with the team

of Evgeni Selkov in Russia. The beta release for WIT2 has been sent to four

sites and is currently available. The first actual release of WIT2 is scheduled

for October 1997. It will include all of the software required to install WIT2

and develop a local Web server, all of our metabolic reconstructions for

organisms with genomes in the publicly available archives, and detailed

instructions for adding any new genomes to the existing system (perhaps, for

local use only). Just as widespread availability of the Metabolic Pathway

Database has stimulated a number of projects relating to the analysis of

metabolic networks, we hope that the availability of WIT2 will foster the

development and open exchange of detailed metabolic reconstructions.

Acknowledgments

R.O. was supported by the U.S. Department of Energy, under Contract W-31-

109-Eng-38. N.L. was supported by the Center for Microbial Ecology at

Michigan State University (DEB 9120006). We also thank the Free Software

Foundation and Larry Wall for their excellent software.

References

Badger, J. H., and Olsen, G. J. CRITICA: Coding Region Identification Tool Invoking Comparative Analysis, Molec. Bil. Evol., 1977, in press.

Borodovsky M., and Peresetsky, A. Deriving Non-Homogeneous DNA Markov Chain Models by Cluster Analysis Algorithm Minimizing Multiple Alignment Entropy. Comput Chemistry, 18, no. 3, 1994, pp. 259-267.

Ellis, L.B.M., and Wackett, L. P. A Microbial Biocatalysis Database, Soc. Ind. Microb. News. 45, no. 4, 1995, pp. 167-173.

Kanehisa, M., Toward Pathway Engineering: A New Database of Genetic and Molecular Pathways, Science and Technology Japan, 59, 1996, pp. 34-38.

Karp, P, Riley, M., Paley, S., and Pellegrini-Toole, A. EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism, Nucleic Acids Research, 25, no. 1, 1997

Nelson, O., Coe, E., and Langdale, J., Genetic Nomenclature Guide. Maize. Trends Genet., March, 1995, 20-21

Overbeek O., Larsen, N., Smith, W., Maltsev, N., and Selkov, E.. Representation of Function: The Next Step. Gene-COMBIS (on-line): 31 January 1997; Gene 191, no. 1,: GC1-9

Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N., Nenashev, V., Overbeek, R., .Panushkina, E., Pronevitch, I., Selkov Jr., E.., and Yunus, I. The Methabolic Pathway Collection from EMP: The Enzymes and Metabolic Pathways Database. Nucleic Acids Research, 24, no. 1 (database issue), 1996, pp. 26-29.