An Initial Set of Pathways
Once the initial assignment of functional roles has been completed (i.e., once
the initial version of the entries in the protein-role table for the newly
sequenced genome has been generated), one normally proceeds to the assertion
of function diagrams (i.e., to the addition of entries to the asserted-diagrams
table for the genome). As the collection of analyzed genomes grows, it
becomes ever more likely that each new genome will be substantially
similar to a genome that has already been analyzed. If a fairly similar
(biochemically and phenotypically) organism has already been analyzed, it is
useful to begin the analysis of the new organism by asserting the diagrams that
are believed to exist in the already analyzed organism. Some of the asserted
pathways are likely to be wrong, but their removal can be deferred until after
the initial assignment of pathways.
In any event, the user should move through the major areas of metabolism
and ask the system to propose diagrams that might correspond to
functionality present in the organism. A system supporting metabolic
reconstruction should be able to support such requests. As we learn more
about the reasoning required to accurately assert the presence of pathways,
the proposal of pathways by the system can become increasingly precise.
For now, we employ a very straightforward approach.
First, we take the entire collection of pathways and assign a score to each
pathway. The score for a pathway is
(I + 0.5U) / (I + U + M),
where I is the number of functional roles in the diagram that have been
connected to specific sequences in the genome, M is the number that have not
been connected and for which known examples from other genomes exist, and
U is the number of unconnected roles for which no exemplar exists from other
genomes. This is a crude measure of the fraction of the functional roles that
have been identified, considering that there are U roles for which reasoning by homology is impossible at this point.
Then, we sort the pathways by score and present to the user those that
exceed some specified threshold. The user is expected to go through each
proposed pathway and either assert it to the asserted-diagrams table or
simply ignore the proposal.
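The scoring and thresholding steps just described can be sketched in a few lines of Python. This is only an illustration of the formula (I + 0.5U) / (I + U + M). The class name, function names, example pathways, and threshold value are assumptions made for the example and are not part of WIT2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathwayCounts:
    """Role counts for one pathway (diagram) in the genome under analysis."""
    name: str
    identified: int  # I: roles connected to specific sequences in the genome
    missing: int     # M: unconnected roles with exemplars in other genomes
    unknown: int     # U: unconnected roles with no exemplar in any genome


def pathway_score(p: PathwayCounts) -> float:
    """Return (I + 0.5U) / (I + U + M); 0.0 for a diagram with no roles."""
    total = p.identified + p.unknown + p.missing
    if total == 0:
        return 0.0
    return (p.identified + 0.5 * p.unknown) / total


def propose_pathways(pathways, threshold):
    """Return (score, name) pairs that exceed the threshold, best first."""
    scored = [(pathway_score(p), p.name) for p in pathways]
    return sorted((s for s in scored if s[0] > threshold), reverse=True)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formula.
    candidates = [
        PathwayCounts("pathway A", identified=8, missing=1, unknown=1),
        PathwayCounts("pathway B", identified=1, missing=4, unknown=0),
    ]
    for score, name in propose_pathways(candidates, threshold=0.5):
        print(f"{score:.2f}  {name}")
```

With these counts, pathway A scores (8 + 0.5) / 10 = 0.85 and is proposed, while pathway B scores 1 / 5 = 0.2 and is not. Each role in U contributes half a point because homology can neither confirm nor exclude it.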
Locating Missing Functions
After we have accumulated an initial set of asserted diagrams, a pass through
this asserted set must be made, focusing on the functional roles that remain
unconnected to specific ORFs in the genome. Here, the system can provide a
very useful function by collecting all known sequences that have been assigned
the functional role, tabulating all similarities between ORFs in the new
genome and these existing exemplars, and summarizing which of the existing
ORFs is most likely to perform the designated functional role. Without a tool
like WIT/WIT2, this process would be extremely time-consuming (and, in fact, would almost never be done systematically). In WIT2, we made the design
decision to precompute similarities between all ORFs from the analyzed
genomes and between these ORFs and entries in the nonredundant protein
database maintained by NCBI. This allows an immediate response to requests
to locate candidates for unconnected functional roles, summarizing BHs,
BBHs, and all other similarities. The disadvantage of such a design
commitment is that the collection of similarities is out of date almost
immediately. Such a trade-off is commonly faced in developing bioinformatics
servers. In our case, the severity of the problem is inevitably reduced by the
addition of more genomes – that is, while the system may well not have access
to all relevant similarities, the chances of establishing a solid connection
between a new sequence and a previously analyzed sequence with an
established function improve dramatically as the set of completely sequenced (and increasingly analyzed) genomes grows.
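The candidate search over precomputed similarities might be sketched as follows. The sketch makes several assumptions that do not come from WIT2: similarities are (ORF, ORF, score) triples with higher scores meaning stronger similarity, ORF identifiers have the form GENOME:orf, a best hit (BH) is the highest-scoring partner of an ORF within one other genome, and a bidirectional best hit (BBH) is a pair in which each ORF is the other's best hit. All function names and data are hypothetical.

```python
def genome_of(orf_id):
    """Assumed ID convention: 'GENOME:orf', e.g. 'EC:b0001'."""
    return orf_id.split(":", 1)[0]


def best_hits(similarities):
    """Map (orf, other_genome) -> (best-scoring partner in that genome, score)."""
    best = {}
    for a, b, score in similarities:
        # Treat each similarity as symmetric and record it in both directions.
        for query, subject in ((a, b), (b, a)):
            key = (query, genome_of(subject))
            if key not in best or score > best[key][1]:
                best[key] = (subject, score)
    return best


def candidates_for_role(role, new_genome, role_exemplars, similarities):
    """Rank ORFs of new_genome that resemble known exemplars of role.

    Returns (orf, exemplar, score, kind) tuples, strongest first, where
    kind is 'BBH', 'BH' (best hit in one direction only), or 'other'.
    """
    best = best_hits(similarities)
    exemplars = {e for e in role_exemplars.get(role, ())
                 if genome_of(e) != new_genome}
    strongest = {}
    for a, b, score in similarities:
        for orf, exemplar in ((a, b), (b, a)):
            if genome_of(orf) == new_genome and exemplar in exemplars:
                pair = (orf, exemplar)
                strongest[pair] = max(score, strongest.get(pair, score))
    rows = []
    for (orf, exemplar), score in strongest.items():
        orf_to_ex = best[(orf, genome_of(exemplar))][0] == exemplar
        ex_to_orf = best[(exemplar, new_genome)][0] == orf
        if orf_to_ex and ex_to_orf:
            kind = "BBH"
        elif orf_to_ex or ex_to_orf:
            kind = "BH"
        else:
            kind = "other"
        rows.append((orf, exemplar, score, kind))
    return sorted(rows, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical precomputed similarities and role assignments.
    sims = [("NEW:1", "EC:a", 200.0), ("NEW:2", "EC:a", 90.0),
            ("NEW:1", "EC:b", 150.0)]
    roles = {"example role": ["EC:a"]}
    for row in candidates_for_role("example role", "NEW", roles, sims):
        print(row)
```

Because the similarity table is computed ahead of time, a request of this kind only filters and sorts existing rows. That is the source of the immediate response noted above, and also the reason the answers can be out of date.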
Once the system has located candidates for an unconnected functional role,
the process of actually coming to a conclusion about whether a given
sequence should be connected to the functional role is arbitrarily complex
and corresponds to the types of decisions made while doing the initial
assignments. In this case, however, the user of the system has the
additional knowledge that assignments based on weak similarities may be
strongly supported by the presence of assignments to other functional roles
from the same diagram. This represents one of the pragmatic motivations
for developing metabolic reconstructions: they offer a means of developing
strong support for assignments based on relatively weak similarities.
We emphasize that the assertion of specific diagrams (i.e., pathways)
should be considered in the context of known biochemical and phenotypic
data. A variety of assignments cannot be made solely based on sequence
similarities. For example, one might consider the choice between malate
dehydrogenase and lactate dehydrogenase. Although examples of
sequences that play these roles are extremely similar (exhibiting almost
arbitrarily strong similarity scores), the choice between these functional
roles often can be made only by using biochemical evidence or a more
detailed sequence analysis based on either the construction of trees or the
analysis of "signatures" (i.e., positions in the sequence that correlate with
the functional role). Similarly, the choice between assigning a functional
role of aspartate oxidase, fumarate reductase, or succinate dehydrogenase
will require establishing an overview of the lifestyle of the organism,
followed by a detailed analysis of all related sequences present in the
genome. These examples are unusually difficult; in most cases the
determination of function is much more straightforward. Even in these
cases, however, the accumulation of more data will dramatically simplify
things.
Balancing the Model
We turn now to the more difficult and critical step of balancing the model. By
balancing, we mean considering questions of the following form:
"Since we know this compound is present (because we have asserted a given pathway for which it is a substrate), where does it come from? Is it synthesized, or is it imported?"
This consideration holds for all substrates to pathways, coenzymes, prosthetic
groups, and so forth. In addition, we need to consider the issue of whether
products of pathways are consumed by other cellular processes or are excreted.
To begin this process, the user must first make tables including all
substrates of asserted pathways and all products of asserted pathways. As
we stated above, our simplified notion of function diagram does not
require that substrates and products be included. However, if one wishes to
automate this aspect of metabolic reconstruction (which we have not yet
done), the data must be accurately encoded. Once such tables exist, we
remove all compounds that occur as both substrates and products. Two lists
remain: substrates that no asserted pathway produces, and products that no asserted pathway consumes.
The user must go through these lists carefully and assess how best to reconcile the situation. This task may require searching for a protein that might be a potential transporter, asserting a new pathway for which a limited amount of evidence exists, or formulating some other hypothesis about what is going on.
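Although this step has not been automated in the system described here, the bookkeeping itself is simple set arithmetic. The following minimal sketch assumes each asserted pathway is encoded as a pair (substrates, products) of compound names. The function name, the encoding, and the example pathways are hypothetical.

```python
def unbalanced_compounds(pathways):
    """Split compounds into unsupplied substrates and unconsumed products.

    pathways maps a pathway name to (substrates, products), each an
    iterable of compound names. Compounds that occur both as a substrate
    and as a product somewhere in the model are removed from both lists.
    """
    substrates, products = set(), set()
    for subs, prods in pathways.values():
        substrates.update(subs)
        products.update(prods)
    balanced = substrates & products
    return sorted(substrates - balanced), sorted(products - balanced)


if __name__ == "__main__":
    # Deliberately simplified, hypothetical encoding of two pathways.
    asserted = {
        "pathway 1": (["glucose"], ["pyruvate"]),
        "pathway 2": (["pyruvate"], ["lactate"]),
    }
    needed, left_over = unbalanced_compounds(asserted)
    print("substrates with no source:", needed)
    print("products with no consumer:", left_over)
```

In this toy model pyruvate is both produced and consumed, so it drops out. Glucose remains as a substrate that must be imported or synthesized elsewhere, and lactate remains as a product that must be consumed or excreted.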
Once the user has analyzed the situation as it relates to substrates and
products of pathways, a similar analysis must be applied to known
cofactors, coenzymes, and prosthetic groups. In this case, the logical issue
of potential producers and consumers of specific compounds must be
analyzed, but additional issues relating to volumes of flows can also be
analyzed. At this point, most of this type of analysis requires a substantial
amount of expertise, and many of the decisions cannot be made
with any certainty. The situation is exacerbated by the difficulty of
determining the precise function of a wide class of transport proteins, as
well as by the potential for broad specificity for many enzymes. In this
regard, while the situation is currently tractable only for those with
substantial biochemical backgrounds (and not always for them), it is clearly
possible that rapid advances in our ability to perform more careful
comparative analysis and to acquire biochemical confirmation of
conjectures will gradually simplify this aspect of metabolic reconstruction,
as well.
Coordinating the Development of Metabolic Reconstructions
A metabolic reconstruction can be done by a number of individuals, often
sharing a single model that is developed jointly. WIT2 includes the capability
for multiple users either to work jointly on a single metabolic reconstruction or
to develop such reconstructions in isolation. This is achieved as follows:
Our intent is that users develop metabolic reconstructions on many distinct
Web servers, but that they be able to conveniently import the efforts of others
working on the same genome.
Where Do We Stand?
At this point we are attempting to develop and maintain metabolic models for
well over twenty organisms representing a remarkable amount of phylogenetic
diversity (http://wit.at.msu). The development of these initial models will be,
we believe, far more difficult than the efforts required to add new models for
more organisms that are similar to these initially analyzed organisms. On the
other hand, unicellular life exhibits an enormous amount of diversity; and
when the task of analyzing multicellular organisms is contemplated, it is clear
that an enormous amount of work is required to attain even approximate
metabolic reconstructions.
As we develop these initial models, we have noted a clear core of functionality that is shared by a surprisingly varied set of organisms. Techniques for developing clusters of proteins that are clearly homologous and
that perform identical functions in distinct organisms are now beginning to
simplify efforts to develop metabolic reconstructions. Such techniques are
also leading to a clear hypothesis about the historical origins of specific
functions.
The task of constructing a detailed overview of the functional subsystems
in specific organisms is closely related to the issue of characterizing the
functions of genes in the gene pool. While specific organisms often have
been analyzed in isolation, it is rapidly becoming clear that comparative
analysis is the key to understanding even specific genomes and that
characterization of the complete gene pool for unicellular life is far more
tractable than previously imagined. Our goal is to develop accurate,
although somewhat imprecise, functional overviews for unicellular
organisms and to use these as a foundation for the analysis of multicellular
eukaryotes. Just as protein families derived from unicellular organisms are
beginning to form the basis for assigning function to many eukaryotic
proteins, an understanding of the central metabolism of eukaryotes will be
built on our rapidly expanding understanding of the evolution of functional
systems within unicellular organisms.
A Growing Interest in Connecting Metabolic and Sequence Data
The growing perception that the metabolic structure must be encoded and used
to interpret the emerging body of sequence data has resulted in a number of
projects. Here we summarize the most successful of these projects at this time.
With interest expanding so rapidly, the reader is encouraged to do a network
search for other sites, which we believe will continue to appear at a growing
rate.
Availability of the Pathways, Software, and Models
The PUMA (http://www.mcs.anl.gov/home/compbio/PUMA/Production/puma.html),
WIT (http://www.cme.msu.edu/WIT/) [7], and WIT2
(http://www.mcs.anl.gov/home/overbeek/WIT2/CGI/user.cgi) systems were
developed at Argonne National Laboratory in close cooperation with the team
of Evgeni Selkov in Russia. The beta release for WIT2 has been sent to four
sites and is currently available. The first actual release of WIT2 is scheduled
for October 1997. It will include all of the software required to install WIT2
and develop a local Web server, all of our metabolic reconstructions for
organisms with genomes in the publicly available archives, and detailed
instructions for adding any new genomes to the existing system (perhaps, for
local use only). Just as widespread availability of the Metabolic Pathway
Database has stimulated a number of projects relating to the analysis of
metabolic networks, we hope that the availability of WIT2 will foster the
development and open exchange of detailed metabolic reconstructions.
Acknowledgments
R.O. was supported by the U.S. Department of Energy, under Contract
W-31-109-Eng-38. N.L. was supported by the Center for Microbial Ecology at
Michigan State University (DEB 9120006). We also thank the Free Software
Foundation and Larry Wall for their excellent software.
References