Use of Contiguity on the Chromosome to Predict Functional Coupling

Ross Overbeek, Michael Fonstein^*, Mark D'Souza, Gordon D. Pusch and Natalia Maltsev

Mathematics and Computer Science Division, Argonne National Laboratory, IL 60439, USA
^* Department of Molecular Genetics and Cell Biology, University of Chicago, IL 60637, USA.

Edited by E. Koonin; received April 16, 1998; revised June 1, 1999; accepted June 3, 1999

ABSTRACT

The availability of a growing number of completely sequenced genomes opens new opportunities for understanding of complex biological systems. Success of genome-based biology will, to a large extent, depend on the development of new approaches and tools for efficient comparative analysis of the genomes and their organization. We have developed a technique for detecting possible functional coupling between genes based on detection of potential operons. The approach involves computation of "pairs of close bi-directional best hits", which are pairs of genes that apparently occur within operons in multiple genomes. Using these pairs, one can compose evidence (based on the number of distinct genomes and the phylogenetic distance between the orthologous pairs) that a pair of genes is potentially functionally coupled. The technique has revealed a surprisingly rich and apparently accurate set of functionally coupled genes. The approach depends on the use of a relatively large number of genomes, and the amount of detected coupling grows dramatically as the number of genomes increases.

Key words: microbial genomes; operons; gene function identification; genetic sequence analysis; comparative analysis

INTRODUCTION

The availability of a growing number of completely sequenced genomes opens new opportunities. Success of genome-based biology [1] will to a large extent depend on the development of new approaches and tools for efficient comparative analysis of the genomes and their organization. The question of how to determine the function of a specific gene within a genome has been open for many years. The desire to rapidly characterize large numbers of genes from the newly available genomes has led to a growing sense of urgency, as well as a number of ambitious experimental projects to address the need. We have developed an algorithm to detect potential operons based on comparative analysis of multiple genomes. It depends on the availability of a relatively large number of genomes; however, it does not require that these genomes be complete. Furthermore, it provides significant clues as to the function of many genes of unknown or hypothetical function. Our initial results indicate that at least a fifth of the sequenced genes show detectable functional coupling by this method, and that there is every reason to expect this percentage to grow rapidly as the number of sequenced genomes increases.

BASIC APPROACH

It is well known that in many organisms the genes responsible for related functions are located close to each other on the chromosome. If one could accurately predict operons, the availability of a growing number of prokaryotic genomes would offer numerous significant clues relating to the function of "hypothetical proteins" (those for which no reliable identification of function has yet been made). However, an accurate prediction of operons is nontrivial, and the only reliable identifications are currently based on wet-lab experimentation. Nevertheless, given the growing number of genomes, it is reasonable to ask whether accurate predictions of potential operons could be based on a weaker (but related) notion of functional coupling. In this paper, we propose a straightforward computation that is remarkably accurate in predicting functional coupling of genes.

We begin with a small set of definitions that will be used throughout the rest of this paper:

Definition 1:

A set of genes occurring on a prokaryotic chromosome will be called a "run" if and only if they all occur on the same strand, and the gaps between adjacent genes are 300 base pairs or less. Any pair of genes from within a single run is called "close".
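Definition 1 can be illustrated with a minimal sketch. The `Gene` record, its coordinate fields, and the function names are hypothetical and introduced only for this illustration; the sketch also assumes that a gene on the opposite strand lying between two genes breaks the run, which is one reading of "adjacent".

```python
from dataclasses import dataclass

MAX_GAP = 300  # base pairs, the gap limit stated in Definition 1


@dataclass(frozen=True)
class Gene:
    gene_id: str
    start: int   # leftmost coordinate on the chromosome (hypothetical field)
    end: int     # rightmost coordinate on the chromosome (hypothetical field)
    strand: str  # "+" or "-"


def find_runs(genes, max_gap=MAX_GAP):
    """Split genes into runs: consecutive same-strand genes with gaps <= max_gap."""
    runs, current = [], []
    for gene in sorted(genes, key=lambda g: g.start):
        same_strand = bool(current) and gene.strand == current[-1].strand
        if same_strand and gene.start - current[-1].end <= max_gap:
            current.append(gene)
        else:
            if current:
                runs.append(current)
            current = [gene]
    if current:
        runs.append(current)
    return runs


def close_pairs(runs):
    """Every unordered pair of genes from within a single run is 'close'."""
    pairs = set()
    for run in runs:
        for i in range(len(run)):
            for j in range(i + 1, len(run)):
                pairs.add(frozenset((run[i].gene_id, run[j].gene_id)))
    return pairs
```

Overlapping genes give a negative gap and therefore stay in the same run under this sketch.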

Definition 2:

If we have two genes Xa and Xb from two genomes Ga and Gb, Xa and Xb are called "a bidirectional best hit (BBH)" if and only if

recognizable similarity exists between Xa and Xb (in our case, we required FASTA3 scores better than 1.0e-5),

there is no gene Yb in Gb that is more similar to Xa than Xb is, and

there is no gene Ya in Ga that is more similar to Xb than Xa is.
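The three conditions of Definition 2 can be sketched as follows. This is an illustration only: it assumes the similarity search has already been run and summarized as a dictionary of scores for which lower is better, and the names `hits` and `bidirectional_best_hits` are hypothetical. Ties are resolved in favor of the first hit seen, which is an arbitrary choice.

```python
THRESHOLD = 1.0e-5  # similarity cutoff quoted in Definition 2


def bidirectional_best_hits(hits, threshold=THRESHOLD):
    """Return the set of (xa, xb) pairs that are bidirectional best hits.

    hits maps (gene_in_Ga, gene_in_Gb) -> score, lower meaning more similar
    (an assumed input format, not the format of any particular tool).
    """
    best_for_a = {}  # gene in Ga -> (best gene in Gb, score)
    best_for_b = {}  # gene in Gb -> (best gene in Ga, score)
    for (xa, xb), score in hits.items():
        if score >= threshold:
            continue  # no recognizable similarity
        if xa not in best_for_a or score < best_for_a[xa][1]:
            best_for_a[xa] = (xb, score)
        if xb not in best_for_b or score < best_for_b[xb][1]:
            best_for_b[xb] = (xa, score)
    # Keep only the pairs in which each gene is the other's best hit.
    return {
        (xa, xb)
        for xa, (xb, _) in best_for_a.items()
        if best_for_b[xb][0] == xa
    }
```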

Definition 3:

Genes (Xa,Ya) from Ga and (Xb,Yb) from Gb form a "pair of close bidirectional best hits (PCBBH)" if and only if

Xa and Ya are close,

Xb and Yb are close,

Xa and Xb are a BBH, and

Ya and Yb are a BBH.

The basic notion of a PCBBH is really quite intuitive and is shown in Fig. 1.

Figure 1: The concept of a PCBBH
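Definition 3 combines the two previous definitions, and a minimal sketch makes the combination explicit. The input formats are assumptions made for this illustration: close pairs are given as sets of two-element frozensets, and BBHs as a dictionary from genes of Ga to their partners in Gb. The function name is hypothetical.

```python
def find_pcbbhs(close_a, close_b, bbh):
    """List the PCBBHs between genomes Ga and Gb.

    close_a, close_b: sets of frozenset({gene1, gene2}) close pairs per genome
    bbh: dict mapping a gene of Ga to its bidirectional best hit in Gb
    """
    result = []
    for pair in close_a:
        xa, ya = sorted(pair)
        xb, yb = bbh.get(xa), bbh.get(ya)
        if xb is None or yb is None:
            continue  # at least one gene has no BBH in Gb
        if frozenset((xb, yb)) in close_b:
            result.append(((xa, ya), (xb, yb)))
    return result
```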

It is reasonably straightforward to compute all PCBBHs for a set of genomes, and we have done so for 24 prokaryotic genomes maintained within the WIT system developed at Argonne National Laboratory by R. Overbeek et al. [2, 3]. We employed data from a number of partial genomes as well as complete genomes (hence we altered the definition of BBH to state that there is "no known gene" more similar, rather than "no gene" more similar).

The existence of a pair of close BBHs is certainly a good indication that there might actually be two operons in two genomes, each containing a pair of corresponding orthologs. However, this is not always the case. Our hope is that with enough genomes, the cases in which the implication does not hold will be "washed out with large numbers". It should be noted that the vast majority of PCBBHs derived from closely related organisms (say, two strains of the same species) are of no significance whatsoever and certainly should not be used to infer functional coupling. Conversely, PCBBHs from distantly related organisms are of high significance, since they are unlikely to occur because of chance alone, and can be used to infer functional coupling.

Throughout the examples discussed in the remainder of this work, the reader will note our use of WIT ids for genes and their products. In WIT the ORF identifiers all have the form Rxxddddd, where xx identifies the organism. The organisms corresponding to these two-letter codes are as shown in Tab. 1.

Table 1: Genome Codes

AA	Aquifex aeolicus	MP	Mycoplasma pneumoniae
AG	Archaeoglobus fulgidus	MT	Mycobacterium tuberculosis
BS	Bacillus subtilis	NG	Neisseria gonorrhoea
BB	Borrelia burgdorferi	NM	Neisseria meningitidis
CA	Clostridium acetobutylicum	PA	Pseudomonas aeruginosa
CY	Synechocystis sp.	PF	Pyrococcus furiosus
DR	Deinococcus radiodurans	PH	Pyrococcus horikoshii
EC	Escherichia coli	PN	Streptococcus pneumoniae
HI	Haemophilus influenzae	RC	Rhodobacter capsulatus SB1003
HP	Helicobacter pylori	ST	Streptococcus pyogenes
MG	Mycoplasma genitalium	TP	Treponema pallidum
MJ	Methanococcus jannaschii	TH	Methanobacterium thermoautotrophicum

We will use these codes as column headers in tables below.
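As a small illustration of the Rxxddddd identifier format, the sketch below maps a WIT id to its organism using the codes of Tab. 1. The regular expression assumes exactly two uppercase letters followed by five digits, as the format description suggests; the function name is hypothetical.

```python
import re

# Two-letter genome codes, copied from Tab. 1.
GENOME_CODES = {
    "AA": "Aquifex aeolicus", "MP": "Mycoplasma pneumoniae",
    "AG": "Archaeoglobus fulgidus", "MT": "Mycobacterium tuberculosis",
    "BS": "Bacillus subtilis", "NG": "Neisseria gonorrhoea",
    "BB": "Borrelia burgdorferi", "NM": "Neisseria meningitidis",
    "CA": "Clostridium acetobutylicum", "PA": "Pseudomonas aeruginosa",
    "CY": "Synechocystis sp.", "PF": "Pyrococcus furiosus",
    "DR": "Deinococcus radiodurans", "PH": "Pyrococcus horikoshii",
    "EC": "Escherichia coli", "PN": "Streptococcus pneumoniae",
    "HI": "Haemophilus influenzae", "RC": "Rhodobacter capsulatus SB1003",
    "HP": "Helicobacter pylori", "ST": "Streptococcus pyogenes",
    "MG": "Mycoplasma genitalium", "TP": "Treponema pallidum",
    "MJ": "Methanococcus jannaschii", "TH": "Methanobacterium thermoautotrophicum",
}

WIT_ID = re.compile(r"^R([A-Z]{2})(\d{5})$")  # assumed reading of "Rxxddddd"


def organism_of(wit_id):
    """Return the organism name for a WIT ORF id such as 'RAG37125'."""
    match = WIT_ID.match(wit_id)
    if match is None:
        raise ValueError(f"not a WIT ORF id: {wit_id!r}")
    return GENOME_CODES[match.group(1)]
```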

Given a set of PCBBHs, we can use them to amass evidence that two particular genes on a given genome are actually co-expressed (and are therefore likely to be functionally coupled). The process proceeds as follows:

First, a pair of close genes occurring in at least one PCBBH is selected. As an example, consider the following pair from the Mycoplasma genitalium genome:

MG155 (sp|P47401),
MG154 (sp|P47400).

These IDs correspond to two ORFs from the Mycoplasma genitalium genome sequenced by TIGR [4].

This pair of genes occurs in PCBBHs that include Mycoplasma genitalium and each of eighteen other organisms (AG, BS, BB, DR, EC, HI, HP, TH, MJ, MT, MP, NG, NM, PA, ST, PN, CY, and TP). Clearly, this amounts to overwhelming evidence that the co-occurrences are not random. In fact, this pair of genes corresponds to

MG155 (sp|P47401) SSU ribosomal protein S19P
MG154 (sp|P47400) LSU ribosomal protein L2P

After selecting a pair of genes from an organism and collecting the list of PCBBHs containing the pair, we "score" the evidence that the two genes are co-occurring. It is not clear exactly what scoring algorithm should be used, but the following seem to be desirable attributes of any potential approach:
- The value of a single PCBBH should increase with the phylogenetic distance between the two genomes, since closely related organisms are more likely to have similar genes nearby to one another due to chance alone (assuming no horizontal transfer).
- The existence of three pairs in the relationship

We have experimented with several scoring algorithms, and the final results seem to be almost identical. The algorithm used to produce the results presented in this paper was as follows:

Given a pair of genes Xa and Ya from genome Ga, the score reflecting the evidence that they co-occur was computed by adding an increment for each pair (Xi,Yi) from genomes Gi for which (Xa,Ya) and (Xi,Yi) form a PCBBH:

Add the distance between Ga and Gi to the score. By "distance" we mean some reasonable estimate of the distances between the organisms.
If there exist one or more third pairs from other genomes that form the triangular relationship, pick the genome Gb that maximizes

MinD = the minimum of

the distance between Ga and Gb,

the distance between Ga and Gi, and

the distance between Gi and Gb.

Add MinD to the score.

The result of summing these increments is the score that offers a rough measure that the co-occurrences of Xa and Ya are meaningful.
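The scoring steps above can be sketched as follows. This is one reading of the procedure rather than the code used to produce the results: it assumes at most one partner pair per genome, and it interprets the triangular relationship as three pairs, from three genomes, that form PCBBHs with one another. All names are hypothetical, and the distances in the usage note are invented for illustration.

```python
def make_dist(table):
    """Build a symmetric distance function from {frozenset({g1, g2}): distance}."""
    def dist(g1, g2):
        return 0.0 if g1 == g2 else table[frozenset((g1, g2))]
    return dist


def coupling_score(ga, partner_genomes, triangles, dist):
    """Score a close pair (Xa, Ya) from genome ga.

    partner_genomes: genomes Gi holding a pair that forms a PCBBH with (Xa, Ya)
    triangles: set of frozenset({Gi, Gb}) whose partner pairs also form a
               PCBBH with each other (assumed meaning of "triangular")
    dist: function (genome, genome) -> phylogenetic distance
    """
    score = 0.0
    for gi in partner_genomes:
        score += dist(ga, gi)  # increment for the PCBBH itself
        min_ds = [
            min(dist(ga, gb), dist(ga, gi), dist(gi, gb))
            for gb in partner_genomes
            if gb != gi and frozenset((gi, gb)) in triangles
        ]
        if min_ds:
            score += max(min_ds)  # MinD for the genome Gb that maximizes it
    return score
```

With invented distances d(A,B) = 0.5, d(A,C) = 1.0 and d(B,C) = 0.8, a pair in A with partners in B and C that also pair with each other receives 0.5 + 0.5 + 1.0 + 0.5 = 2.5; without the triangle it receives 1.5.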

In forming these scores, we used the distances from the phylogenetic tree distributed by the Ribosomal Database Project [6] as estimates of the "distance between genomes". These distances range from a minimum of 0 (between two strains of Rhodobacter capsulatus) to a maximum of 1.53 (between Mycoplasma genitalium and Methanobacterium thermoautotrophicum). At an intuitive level, it would seem that a score greater than about 1.0 would certainly indicate that a careful examination of the evidence is warranted. In fact, we believe that any score greater than about 0.1 is quite suggestive and often does indicate a real functional coupling of the genes. Once we have scored all pairs of genes for which PCBBHs exist, the entire list of gene pairs is sorted by score (with higher scores corresponding to pairs of genes Xa and Ya for which functional coupling is more likely). The central question then becomes:

Are the pairs with relatively high scores actually functionally coupled?

Clearly many pairs of genes that are, in fact, functionally coupled will not show up with high scores. However, it will become clear that almost all the pairs with relatively high scores are functionally coupled.

RESULTS

For the 24 genomes we included in the analysis, we computed 34,644 PCBBHs. Using these PCBBHs, we computed scores for 23,144 pairs of genes. Of these, 10,531 pairs had scores greater than 1.0; 17,247 had scores greater than 0.1.

We will argue that scores above 0.1 do, in fact, indicate probable functional coupling, which motivates the following definition:

Definition 4:
Genes X and Y from G form a "clustered pair (CP)" if and only if the score assigned to the two genes by the method described above is equal to or greater than 0.1.
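Definition 4 amounts to a threshold on the sorted score list, which the following minimal sketch shows. The function name and the dictionary input are assumptions; in the usage note, the first two scores are taken from Tab. 3 and the third pair is invented to show a pair that falls below the cutoff.

```python
CP_THRESHOLD = 0.1  # cutoff from Definition 4


def clustered_pairs(scores, threshold=CP_THRESHOLD):
    """Return (pair, score) tuples with score >= threshold, highest first.

    scores maps (gene_x, gene_y) -> coupling score (assumed input format).
    """
    kept = [(pair, s) for pair, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


example_scores = {
    ("REC00555", "REC00556"): 0.58,   # from Tab. 3
    ("RAG37125", "RAG47409"): 35.51,  # from Tab. 3
    ("geneP", "geneQ"): 0.05,         # hypothetical pair below the cutoff
}
```

Calling `clustered_pairs(example_scores)` keeps the two pairs from Tab. 3, and raising the threshold to 1.0 keeps only the ribosomal protein pair.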

Below we will present anecdotal evidence that a clustered pair actually represents a useful clue in determination of gene function. Before we discuss specific examples, however, let us describe an experiment that we used to evaluate the significance of clustered pairs, by examining what fraction of a set of ORFs that we already have good reason to believe are functionally coupled appear in CPs as a function of the score-threshold.

We examined the pathways in the MPW Database [16] and selected those pathways that contain one or more functional roles (normally enzymes) occurring only in that single pathway, and no other. For every organism in WIT that has been asserted to utilize pathways in this set, we then selected those ORFs assigned to each of these "single-pathway functions"; we found 1720 such ORFs.
We then examined how many of these 1720 ORFs were in CPs, as well as how many of those CPs were with other ORFs in the same pathway. Since each of the selected ORFs occurs in only a single pathway, and since two ORFs believed to code for functions in the same pathway are almost certainly functionally coupled, the fraction of pairs containing a selected ORF and an ORF from the same pathway that are also CPs (i.e., that have scores exceeding the chosen threshold) should provide a rough measure of the significance of CPs at that threshold. We found that, of these 1720 selected ORFs:

1044 were not a member of any CP (i.e., were not in any pair having a score above the cutoff threshold using the described technique).

Of the remaining 676 ORFs:

354 ORFs were connected via CPs to another ORF with a functional role from the same pathway; often, more than one link existed for a given ORF. There were 636 CPs linking these 354 ORFs to other functional roles within the same pathway.

The remaining 322 ORFs were connected via 1122 CPs to ORFs that either have not yet been assigned a functional role or were assigned a functional role that is in a different pathway. Manual analysis of the latter class of CPs showed that the vast majority of cases we have examined also represent either known or biochemically plausible examples of functional coupling.
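The three-way classification of the selected ORFs can be sketched as below. It is an illustration under stated assumptions, not the procedure run against WIT and MPW: the input formats, the function name, and the ORF and pathway labels used in the usage note are all hypothetical.

```python
def classify_orfs(selected, cps, pathway_of):
    """Count selected ORFs by the kind of CP partner they have.

    selected: set of ORF ids assigned to single-pathway functions
    cps: iterable of (orf1, orf2) clustered pairs
    pathway_of: dict ORF id -> set of pathway names (absent = unassigned)
    """
    partners = {}
    for x, y in cps:
        partners.setdefault(x, set()).add(y)
        partners.setdefault(y, set()).add(x)

    counts = {"no_cp": 0, "same_pathway": 0, "other": 0}
    for orf in selected:
        linked = partners.get(orf, set())
        own = pathway_of.get(orf, set())
        if not linked:
            counts["no_cp"] += 1
        elif any(own & pathway_of.get(p, set()) for p in linked):
            counts["same_pathway"] += 1
        else:
            counts["other"] += 1  # partner unassigned or in another pathway
    return counts
```

For example, with selected ORFs o1, o2, o3, clustered pairs (o1, o4) and (o2, o5), and pathways o1 and o4 in "trp", o2 in "lys", o5 in "pur", and o3 in "his", each of the three categories receives one ORF.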

Thus, out of a total of 1758 CPs (636 connections between ORFs within the same pathway, plus 1122 connections that are not in the same pathway), at least 36% (636/1758) almost certainly represent clear instances of known functional couplings, by virtue of the facts that both members of the CP have been assigned to have functions that are in the same metabolic pathway, and that at least one of them has been assigned a function that occurs in no other pathway.
Note that this estimate of 36% is a conservative lower bound, because ORFs that have not yet been assigned a function have been lumped in with ORFs that have been assigned a function that is in a different pathway. Note also that, just because two ORFs are not in the same metabolic pathway, it does not imply that they might not be functionally coupled, since the MPW pathways are not all independent -- many of them are in fact subsets of metabolic networks. Indeed, our manual analysis suggests that the majority of the remaining CPs also appear to represent what we believe to be real and biochemically justifiable instances of functional couplings -- they simply did not fit into the highly restrictive set of conditions imposed by our tests.
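The arithmetic behind the 36% figure can be checked directly from the counts reported above; the variable names are introduced here only for the check.

```python
selected_orfs = 1720   # ORFs with single-pathway functions
not_in_cp = 1044       # of these, ORFs in no clustered pair
in_cp = selected_orfs - not_in_cp          # 676 ORFs in at least one CP

same_pathway_orfs = 354
other_orfs = 322
# The two classes of ORFs partition the 676 ORFs that are in CPs.
assert same_pathway_orfs + other_orfs == in_cp

same_pathway_cps = 636
other_cps = 1122
total_cps = same_pathway_cps + other_cps   # 1758 CPs in total

fraction_same = same_pathway_cps / total_cps
print(f"{fraction_same:.1%} of CPs link ORFs within the same pathway")
```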

The results of this experiment are summarized in Tab. 2 below.

Table 2: Analysis of the ORFs with functional roles that occur in a single pathway using the functional coupling algorithm.

CP scores > 0.1
ORFs in 24 genomes with a function that occurs in a single pathway only	1720
ORFs with a function that occurs in a single pathway only, but that are not members of a CP	1044
ORFs with a function that occurs in a single pathway only, and that are members of a CP	676
Number of ORFs connected by a CP to another ORF that has an assigned function in the same pathway	354 (connected by 636 CPs)
Number of ORFs connected by a CP to another ORF that either has no assigned function, or has been assigned a function in a different pathway	322 (connected by 1122 CPs)

We then repeated the same steps using a cutoff of 1.0 for the calculation of CPs (i.e., we altered the definition of CP). In this case, somewhat more than half as many CPs were detected (957 for a 1.0 cutoff, versus 1758 for a 0.1 cutoff). However, the ratio of the number of connections between genes within the same pathway to the total number of connections rises only slightly, from 36% to 38%. We consider this to be a strong indication that, while a score of 1.0 is certainly far stronger than a score of 0.1, both scores are excellent indicators of functional coupling.

This "experiment" strongly suggests that PCBBHs do reflect actual functional coupling. However, we have not yet been able to accurately quantify the extent to which such evidence can be relied upon. We believe that the real value of this technology will become truly established only by actual verification of the hundreds of predictions that can easily be made from the existing data.

Tab. 3 offers an extraction of some pairwise scores to give a feel for the output.

Table 3: Pairwise Scores

Pairwise score	Organism	WIT id	Other names	Assigned Function
35.51	Archaeoglobus fulgidus	RAG37125	AF1922 gi|2648624	LSU ribosomal protein L2P
		RAG47409	AF1921 gi|2648642	SSU ribosomal protein S19P
19.26	Methanobacterium thermoautotrophicum	RTH01372	trpA MTH1660 gi|2622788	TRYPTOPHAN SYNTHASE ALPHA CHAIN (EC 4.2.1.20)
		RTH02024	trpB MTH1659 gi|2622787	TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
18.38	Archaeoglobus fulgidus	RAG45695	dppD gi|2648780	DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPD
		RAG45696	dppC gi|2648779	DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPC
13.96	Methanobacterium thermoautotrophicum	RTH01473	dnaJ gi|2622399	DNAJ PROTEIN
		RTH01629	dnaK gi|2622398	DNAK PROTEIN
11.72	Deinococcus radiodurans	RDR01648	infA	INITIATION FACTOR IF-1
		RDR01651	rpsK	SSU ribosomal protein S11P
11.38	Treponema pallidum	RTP00100	ntpI or ntpM	V-TYPE SODIUM ATP SYNTHASE SUBUNIT I (EC 3.6.1.34)
		RTP00107	ntpB	V-TYPE SODIUM ATP SYNTHASE SUBUNIT B (EC 3.6.1.34)
11.11	Deinococcus radiodurans	RDR02462	aroK	SHIKIMATE KINASE (EC 2.7.1.71)
		RDR02463	aroB	3-DEHYDROQUINATE SYNTHASE (EC 4.6.1.3)
10.53	Methanobacterium thermoautotrophicum	RTH00210	trpE gi|2622783	ANTHRANILATE SYNTHASE COMPONENT I (EC 4.1.3.27)
		RTH00596	trpD gi|2622789	ANTHRANILATE PHOSPHORIBOSYLTRANSFERASE (EC 2.4.2.18)
8.45	Helicobacter pylori	RHP00672	trpB sp|P56142	TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
		RHP00673	trpC-trpF gi|2314446	INDOLE-3-GLYCEROL PHOSPHATE SYNTHASE (EC 4.1.1.48) / N-(5'-PHOSPHO-RIBOSYL)ANTHRANILATE ISOMERASE (EC 5.3.1.24)
5.81	Clostridium acetobutylicum	RCA00929	thyBA	THYMIDYLATE SYNTHASE (EC 2.1.1.45)
		RCA00930	dfrA	DIHYDROFOLATE REDUCTASE (EC 1.5.1.3)
4.45	Bacillus subtilis	RBS03520	ftsX gi|2618835	CELL DIVISION PROTEIN FTSX
		RBS03521	ftsE gi|2618833	CELL DIVISION ATP-BINDING PROTEIN FTSE
3.11	Bacillus subtilis	RBS01632	fliN sp|P24073	FLAGELLAR MOTOR SWITCH PROTEIN FLIN
		RBS01636	fliQ sp|P35535	FLAGELLAR BIOSYNTHETIC PROTEIN FLIQ
3.00	Clostridium acetobutylicum	RCA00178	tpi gi|2829140	TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)
		RCA00179	pgmI gi|2829141	2,3-BISPHOSPHOGLYCERATE-INDEPENDENT PHOSPHOGLYCERATE MUTASE (EC 5.4.2.1)
1.91	Escherichia coli	REC06155	gyrB gi|1790134	DNA GYRASE SUBUNIT B (EC 5.99.1.3)
		REC06549	dnaA sp|P03004	CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA
1.40	Escherichia coli	REC00696	sdhD sp|P10445	SUCCINATE DEHYDROGENASE HYDROPHOBIC MEMBRANE ANCHOR PROTEIN
		REC00697	sdhA sp|P10444	SUCCINATE DEHYDROGENASE FLAVOPROTEIN SUBUNIT (EC 1.3.99.1)
0.58	Escherichia coli	REC00555	ylcD sp|P77239	HYPOTHETICAL 44.3 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION PRECURSOR
		REC00556	ybdE sp|P38054	HYPOTHETICAL 114.7 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION
0.58	Clostridium acetobutylicum	RCA00176	gap gi|2829138	GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE (EC 1.2.1.12)
		RCA00178	tpi gi|2829140	TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)

It should be noted that of the 2,545 pairs with scores greater than 1.0, 725 included at least one gene that we have not yet been able to assign a reliable function to based on homology or other evidence; inferences about their functional couplings to other genes should provide valuable additional clues to identifying the functions of many of these ORFs.

To offer anecdotal evidence of the ability of these simple computations to effectively infer functional coupling, we looked at several fairly complex known operons and at the coupling scores produced for the genes in these operons.

Example 1:

To illustrate the efficiency of the described method in finding functional connections between genes, we present the results of our analysis of the 1313800-1321100 region on the Escherichia coli chromosome (see Fig. 2).

Figure 2: Functional coupling between the ORFs in the trp region on the E. coli chromosome.

This region contains the trp operon with the genes trpE, trpD, trpC, trpB and trpA. These genes code for the following enzymes: trpE - anthranilate synthetase (EC 4.1.3.27); trpD - glutamine amidotransferase-phosphoribosyl anthranilate transferase; trpC - N-(5-phosphoribosyl)anthranilate isomerase/indole-3-glycerolphosphate synthetase; trpB - tryptophan synthetase (EC 4.2.1.20) beta chain; and trpA - tryptophan synthetase (EC 4.2.1.20) alpha chain. Solid lines connect clustered pairs; the numbers associated with the lines are the pairing scores (see above), which reflect the strength of the connections.

As one can see from Fig. 2, the TrpE, TrpD, TrpC, TrpB, and TrpA gene products form a highly conserved PCBBH cluster, which could be viewed as a potential operon. Each ORF in this cluster forms a high-scoring pair with the other members of the cluster, but not with any other ORF in the genome. All five members of this cluster belong to a well-studied tryptophan biosynthesis operon (for a review, see C. Yanofsky, 1996 [7]), which has become the basic reference structure for studies on tryptophan metabolism. As was shown by C. Yanofsky et al. [8], the full-length polycistronic trp mRNA encodes the five trp polypeptides (corresponding to the trpA, trpB, trpC, trpD, and trpE genes), which are the enzymes of the tryptophan biosynthetic pathway. Our prediction of the potential operon therefore agrees with the literature.

Example 2: Diaminopimelate Biosynthetic Pathway

Three different pathways of D,L-diaminopimelate and L-lysine synthesis are known in prokaryotes [9]. The pathway shown below represents the lysine branch of the aspartic amino acid family biosynthetic pathway [10]. This pathway is implemented by nine distinct reactions catalyzed by the enzymes shown in Tab. 4. The PCBBHs relevant to this pathway are shown as well.

Table 4: Functional couplings between the enzymes of the lysine biosynthetic pathway.

Enzymes, catalysing the consecutive reactions in the pathway	Gene name(s)	Coupled ORFs
ASPARTATE KINASE (EC 2.7.2.4)	ask; thrA; metL; lysC
ASPARTATE-SEMIALDEHYDE DEHYDROGENASE (EC 1.2.1.11)	asd	RCY44336	RPN01355	RBS01675
DIHYDRODIPICOLINATE SYNTHASE (EC 4.2.1.52)	dapA	RCY38981	RPN01356	RBS01677
		RCA03322	RAG28035	RTH00849
DIHYDRODIPICOLINATE REDUCTASE (EC 1.3.1.26)	dapB	RCA03321	RAG48606	RTH01270
ACETYL-L,L-DIAMINOPIMELATE AMINOTRANSFERASE	No sequences in the databases
TETRAHYDRODIPICOLINATE ACETYLTRANSFERASE	No sequences in the databases
N-ACETYLDIAMINOPIMELATE DEACETYLASE (EC 3.5.1.47)	No sequences in the databases
DIAMINOPIMELATE EPIMERASE (EC 5.1.1.7)	dapF	RPA02459	RTH01501
DIAMINOPIMELATE DECARBOXYLASE (EC 4.1.1.20)	lysA	RPA02462	RTH01007

Here, we have shown in distinct colors genes from distinct organisms that together form PCBBHs indicating linkages of function. From these PCBBHs, one can infer a tenuous connection between the three functions 1.2.1.11, 4.2.1.52, and 1.3.1.26, as well as the last two functions 4.1.1.20 and 5.1.1.7. Note, however, that no operon in any of the genomes actually couples more than two of these functions. The overall functional coupling emerges much like a holographic image from the set of genomes, but is not present in any single genome. We conjecture that the apparent lesser degree of "completeness" or "maturity" seen in the inferred connections between functions in this pathway might be the result of weaker evolutionary forces pushing these genes toward co-regulation, as compared to those that produced the obvious operons of translation, transcription, and other central cellular mechanisms. In cases such as shown in Tab. 4, we may still gain significant clues leading to meaningful inferences of functional coupling; however, a complete and compelling picture of such pathways might not fully emerge until we have many hundreds of genomes.
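The way a larger grouping of functions emerges from pairs that each come from a different genome can be sketched with a small union-find over EC numbers. The per-genome pairs below are our reading of Tab. 4, grouping coupled ORFs by their two-letter organism code; since the colors of the original table are not available here, that grouping is an assumption. The function name is hypothetical.

```python
# Pairs of functions coupled within a single genome (assumed reading of Tab. 4).
PAIRS_BY_GENOME = {
    "CY": [("1.2.1.11", "4.2.1.52")],
    "PN": [("1.2.1.11", "4.2.1.52")],
    "BS": [("1.2.1.11", "4.2.1.52")],
    "CA": [("4.2.1.52", "1.3.1.26")],
    "AG": [("4.2.1.52", "1.3.1.26")],
    "TH": [("4.2.1.52", "1.3.1.26"), ("5.1.1.7", "4.1.1.20")],
    "PA": [("5.1.1.7", "4.1.1.20")],
}


def connected_groups(edges):
    """Merge functions linked by any pair, using union-find with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


all_edges = [pair for pairs in PAIRS_BY_GENOME.values() for pair in pairs]
groups = connected_groups(all_edges)
```

Under these assumptions no single genome contributes a pair list that links three functions, yet pooling the genomes yields one group of three functions and one group of two.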

We have constructed a formatted version of these pairwise scores and made them available on the WWW via the WIT system [2, 3].

Example 3: Purine Metabolism

The PCBBHs that connect the enzymes participating in de novo synthesis of IMP are shown in Tab. 5. In this table each row corresponds to a single functional role, that is, an enzyme catalyzing a particular step in the purine biosynthetic pathway. Occurrences of these particular functional roles in each genome are presented in each column of the table. We use distinct colors to show potential operon groupings within a column (that is, ORFs sharing a common color within a column are all close on the chromosome in the sense discussed in Definition 1). The first twelve rows of the table correspond to the known enzymatic roles in this pathway. The last row contains evidence that an unknown protein present in seven of the genomes is somehow functionally coupled to IMP biosynthesis.

Table 5: Functional couplings between the enzymes participating in de novo synthesis of IMP.

Enzymes of Purine Biosynthetic Pathway	DR	CY	ST	PN	BS	CA	MT	PA	EC	HI	AG	TH	MJ	PF
AMIDOPHOSPHORIBOSYLTRANSFERASE (EC 2.4.2.14)	-	-	RST01489	RPN00061	RBS00650	RCA01861	-	-	-	-	-	-	RMJ08546	-
PHOSPHORIBOSYLAMINE-GLYCINE LIGASE (EC 6.3.4.13)	RDR00668	-	RST01483	RPN00065	RBS00654	RCA01857	-	-	REC06280	RHI05498	-	-	-	-
PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE (EC 2.1.2.2)	-	-	RST01487	RPN00063	RBS00652	RCA01859	RMT00037	-	REC02440	RHI15219	-	-	-	-
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE SYNTHASE (EC 6.3.5.3)	RDR03710	RCY15677	-	-	RBS00648	-	RMT00893	-	-	-	-	RTH01934	-	RPF00609
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE SYNTHASE II (EC 6.3.5.3)	RDR03711	-	RST01490	RPN00060	RBS00649	-	-	-	-	-	RAG50222	-	-	RPF00610
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE CYCLO-LIGASE (EC 6.3.3.1)	-	-	RST01488	RPN00062	RBS00651	RCA01860	-	-	REC02439	RHI15218	-	-	RMJ11899	-
PHOSPHORIBOSYLAMINOIMIDAZOLE CARBOXYLASE ATPASE SUBUNIT (EC 4.1.1.21)	RDR01499	-	RST01481	RPN00067	RBS00644	-	RMT02837	RPA03721	REC04505	RHI10627	-	-	-	RPF00920
PHOSPHORIBOSYLAMINOIMIDAZOLE CARBOXYLASE CATALYTIC SUBUNIT (EC 4.1.1.21)	RDR01497	-	RST01482	RPN00066	RBS00643	RCA01863	RMT02836	RPA03720	REC04506	RHI10626	-	-	-	RPF00919
PHOSPHORIBOSYLAMINOIMIDAZOLE-SUCCINOCARBOXAMIDE SYNTHASE (EC 6.3.2.6)	-	-	RST01491	RPN00059	RBS00646	RCA01862	-	-	-	-	-	RTH00832	RMJ01710	-
ADENYLOSUCCINATE LYASE (EC 4.3.2.2)	-	-	RST01479	RPN00069	RBS00645	-	-	-	-	-	-	-	-	-
PHOSPHORIBOSYLAMINOIMIDAZOLECARBOXAMIDE FORMYLTRANSFERASE (EC 2.1.2.3) / IMP CYCLOHYDROLASE (EC 3.5.4.10)	-	-	RST01486	RPN00064	RBS00653	RCA01858	RMT00036	-	REC06281	RHI09169	-	-	-	-
unknown	-	RCY24689	-	-	RBS00647	-	RMT00891	-	-	-	RAG29290	RTH01716	RMJ00185	RPF00608

One notable feature of this example is how the evidence from fourteen distinct genomes combines to present a vivid portrait of the pathway, as well as the possible coupling of ORFs of unknown function to this pathway. In particular, note that one would still have been able to reconstruct this pathway (as well as infer that it is connected to a protein of unknown function) even if the complete version of it in B. subtilis (BS) and the nearly complete versions in S. pneumoniae (PN) and S. pyogenes (ST) had not been included. Again, one sees that the pathways appear to emerge "holographically" from sets of operons having a greater or lesser degree of "completeness" or "maturity" in different organisms.
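Rows of Tab. 5 can be read mechanically, for instance to list the genomes in which a functional role occurs. The sketch below parses the tab-separated "unknown" row copied from the table; the function name is hypothetical, and a "-" cell is taken to mean that no ORF was found for that role in that genome.

```python
HEADER = "DR\tCY\tST\tPN\tBS\tCA\tMT\tPA\tEC\tHI\tAG\tTH\tMJ\tPF"
UNKNOWN_ROW = (
    "unknown\t-\tRCY24689\t-\t-\tRBS00647\t-\tRMT00891"
    "\t-\t-\t-\tRAG29290\tRTH01716\tRMJ00185\tRPF00608"
)


def genomes_with_role(header, row):
    """Return the genome codes whose cell in this row holds an ORF id."""
    codes = header.split("\t")
    cells = row.split("\t")[1:]  # drop the role name in the first column
    if len(cells) != len(codes):
        raise ValueError("row does not match the header width")
    return [code for code, cell in zip(codes, cells) if cell != "-"]


present = genomes_with_role(HEADER, UNKNOWN_ROW)
print(len(present), present)
```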

Example 4: A Problematic Instance

It is known that genes that do not have obvious functional connections may be co-transcribed [14,15]. Recognition of these atypical operons could provide very important insights into the interconnections between the functional subsystems and regulatory mechanisms involved in complex biological processes.
Our analysis detected a cluster of genes composed of the ribosomal protein S2P (rpsB), elongation factor Ts (tsf), the ribosome recycling factor (rrf), the enzyme phosphatidate cytidylyltransferase (cdsA), uridylate kinase (pyrH), and two hypothetical proteins yaeS and yaeL.
Using the technique described above, we analyzed functional coupling between the ORFs in this region in several completely sequenced genomes.

As one can see from Fig. 3, in most of the genomes under consideration rpsB, tsf, pyrH and frr (the gene encoding the ribosome recycling factor) form a highly conserved gene cluster, which includes the uncharacterized genes yaeS and yaeL. With a few exceptions the cluster also contains cdsA, which encodes CDP-diacylglycerol synthase (EC 2.7.7.41).

Figure 3: The rps--yae region on the chromosome from ten organisms.

The order of the genes in the described cluster is preserved in all genomes except M. tuberculosis, in which the pyrH gene precedes rpsB. The functional connections between the genes in this cluster are not obvious. Known experimental data, however, could provide some valuable insights.

It is known that the ribosome recycling factor is an essential protein for bacterial life. The two known functions of ribosome recycling factor (rrf, originally called ribosome releasing factor), are described in [11]: The first function relates to the disassembly of the termination complex, which consists of mRNA, tRNA, and the ribosome bound to the mRNA at the termination codon. The second function of rrf is to prevent errors in translation. In polyphenylalanine synthesis programmed by polyuridylic acid, misincorporation of isoleucine, leucine, or a mixture of amino acids was stimulated up to 17-fold when rrf was omitted from the in vitro system. rrf did not influence the large error (10-fold increase) induced by streptomycin. This means that rrf participates not only in the disassembly of the termination complex but also in peptide elongation.

Yamanaka et al. [12] demonstrated that the pyrH gene (formerly smbA) [13], which encodes uridylate kinase (an enzyme participating in pyrimidine biosynthesis), is also involved in chromosome partitioning in E. coli by suppressing mukB. pyrH was also found to be essential for cell proliferation in the range from 22 to 42 degrees C. Cells that lacked the pyrH protein ceased macromolecular synthesis. The pyrH mutants are sensitive to the detergent sodium dodecyl sulfate, and they show a novel morphological phenotype under nonpermissive conditions, suggesting a defect in specific membrane sites.

These observations provide only a very indirect grasp of any functional coupling; however, the functions all do appear to be essential during rapid growth of a cell. It may well turn out that the conserved positional relationship of these genes does not convey significant information relating to functional coupling (and there are many instances like this). However, the number of cases in which the functional coupling is obvious suggests that these cases should be considered carefully before dismissing them as accidental curiosities.

The results of our analysis of the potential operons in 24 genomes are available at the following locations:

The display for ORFs in the WIT system supports queries about related potential operons and possible functional couplings.

A browser provides tabular and graphical representations of the groups of conserved PCBBHs between pairs of organisms and potential operons within an organism.

DISCUSSION

The ability of such a simple computation to produce such detailed insights into functional coupling is striking. In addition, it must be remembered that our results were produced from only 24 genomes, many of which were incomplete. The ability of the technique proposed here to accurately determine functional coupling will improve dramatically as the number of genomes included in the analysis increases; a crude analysis suggests that the set of PCBBHs should grow as the square of the number of genomes. Given the level of detailed functional data that can be immediately inferred from these PCBBHs, it might well be the case that the cheapest way to acquire detailed evidence of functional coupling would be to rapidly sequence another 50-100 genomes to the point where average contig lengths would be 3-5 Kb. To illustrate this point (albeit quite crudely), we present the following short table, which can provide some insight into the rate of growth in the number of PCBBHs, with sufficient evidence to allow inference of functional coupling:

Table 6: Growth of number of PCBBHs with number of genomes.

Number of Genomes		Number of PCBBHs with Scores > 0.1
4		998
8		4859
16		12570

The first set of four genomes that we used were

Haemophilus influenzae
Mycoplasma genitalium
Methanococcus jannaschii
Synechocystis sp.

To arrive at the eight genomes, we added

Escherichia coli
Helicobacter pylori
Mycoplasma pneumoniae
Mycobacterium tuberculosis

Finally, to arrive at the sixteen, we added

Archaeoglobus fulgidus
Bacillus subtilis
Borrelia burgdorferi
Clostridium acetobutylicum
Deinococcus radiodurans
Methanobacterium thermoautotrophicum
Neisseria meningitidis
Streptococcus pneumoniae

Here, we counted all PCBBHs for which the simple scoring scheme described above produced scores of 0.1 or better (a weak score, but one suggesting a coupling).

It is difficult to make an accurate estimate of how rapidly the set of reliable PCBBHs will grow, given the number of variables in the genomes used for this tabulation (number of ORFs/genome, phylogenetic distribution, size of contigs, differing levels of operons in different domains, etc.). One would expect the growth to be better than linear.
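
The growth implied by Table 6 can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the original analysis: the helper names `growth_exponent` and `genome_pairs` are assumptions introduced here. It computes the exponent k for which the PCBBH count scales as (number of genomes)^k between successive rows of the table, alongside the number of distinct genome pairs, n(n-1)/2, which is the quantity behind the crude "square of the number of genomes" expectation.

```python
import math

# Table 6: number of genomes -> number of PCBBHs with scores > 0.1
pcbbh_counts = {4: 998, 8: 4859, 16: 12570}


def growth_exponent(n1, c1, n2, c2):
    """Exponent k such that c2 / c1 == (n2 / n1) ** k (log-log slope)."""
    return math.log(c2 / c1) / math.log(n2 / n1)


def genome_pairs(n):
    """Number of distinct pairs that can be formed from n genomes."""
    return n * (n - 1) // 2


sizes = sorted(pcbbh_counts)
for a, b in zip(sizes, sizes[1:]):
    k = growth_exponent(a, pcbbh_counts[a], b, pcbbh_counts[b])
    print(f"{a} -> {b} genomes: exponent {k:.2f}, "
          f"genome pairs {genome_pairs(a)} -> {genome_pairs(b)}")
```

With the values in Table 6, the exponent is about 2.3 for the first doubling and about 1.4 for the second. Both exceed 1, which is consistent with better-than-linear growth, but three data points drawn from genomes that differ in size, completeness and phylogenetic position are far too few to establish a growth law.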

We have described the computation of PCBBHs and illustrated their use by examples that are probably familiar to the reader. However, it should be stressed that the PCBBHs that we can already compute offer what appears to be substantial evidence coupling hundreds of genes with unknown function to complexes of genes for which the function is known. The detailed analysis of this data will require years, but it is already clear that this data provides a remarkably rich set of clues that should play a role in guiding wet lab verification or rejection of a rapidly emerging set of hypotheses.

ACKNOWLEDGEMENT

This work was supported in part by the U.S. Department of Energy under contract W-31-109-Eng-38.

CP Scores > 0.1
ORFs in 24 genomes with a function that occurs in a single pathway only: 1720
	ORFs with a function that occurs in a single pathway only, but are not members of a CP: 1044
	ORFs with a function that occurs in a single pathway only, and are members of a CP: 676
		Number of ORFs connected by a CP to another ORF that has an assigned function that is in the same pathway: 354 (connected by 636 CPs)
		Number of ORFs connected by a CP to another ORF that either has no assigned function, or has been assigned a function in a different pathway: 322 (connected by 1122 CPs)
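
The counts above can be checked for internal consistency and expressed as fractions. The following Python sketch is illustrative only; the variable names are assumptions introduced here, and the numbers are copied directly from the tabulation.

```python
# Counts copied from the tabulation above (CP scores > 0.1).
single_pathway_orfs = 1720   # ORFs whose function occurs in one pathway only
not_in_cp = 1044             # ... that are not members of a CP
in_cp = 676                  # ... that are members of a CP
same_pathway = 354           # CP partner has a function in the same pathway
other_or_unknown = 322       # CP partner has no function or another pathway

# The sub-counts should add up to their parent counts.
assert not_in_cp + in_cp == single_pathway_orfs
assert same_pathway + other_or_unknown == in_cp

frac_in_cp = in_cp / single_pathway_orfs
frac_same_pathway = same_pathway / in_cp

print(f"Single-pathway ORFs that are members of a CP: {frac_in_cp:.1%}")
print(f"Of those, partner in the same pathway: {frac_same_pathway:.1%}")
```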

Pairwise score	Organism	WIT id	Other names	Assigned Function
35.51	Archaeoglobus fulgidus	RAG37125	AF1922 gi|2648624	LSU ribosomal protein L2P
35.51	Archaeoglobus fulgidus	RAG47409	AF1921 gi|2648642	SSU ribosomal protein S19P
19.26	Methanobacterium thermoautotrophicum	RTH01372	trpA MTH1660 gi|2622788	TRYPTOPHAN SYNTHASE ALPHA CHAIN (EC 4.2.1.20)
19.26	Methanobacterium thermoautotrophicum	RTH02024	trpB MTH1659 gi|2622787	TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
18.38	Archaeoglobus fulgidus	RAG45695	dppD gi|2648780	DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPD
18.38	Archaeoglobus fulgidus	RAG45696	dppC gi|2648779	DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPC
13.96	Methanobacterium thermoautotrophicum	RTH01473	dnaJ gi|2622399	DNAJ PROTEIN
13.96	Methanobacterium thermoautotrophicum	RTH01629	dnaK gi|2622398	DNAK PROTEIN
11.72	Deinococcus radiodurans	RDR01648	infA	INITIATION FACTOR IF-1
11.72	Deinococcus radiodurans	RDR01651	rpsK	SSU ribosomal protein S11P
11.38	Treponema pallidum	RTP00100	ntpI or ntpM	V-TYPE SODIUM ATP SYNTHASE SUBUNIT I (EC 3.6.1.34)
11.38	Treponema pallidum	RTP00107	ntpB	V-TYPE SODIUM ATP SYNTHASE SUBUNIT B (EC 3.6.1.34)
11.11	Deinococcus radiodurans	RDR02462	aroK	SHIKIMATE KINASE (EC 2.7.1.71)
11.11	Deinococcus radiodurans	RDR02463	aroB	3-DEHYDROQUINATE SYNTHASE (EC 4.6.1.3)
10.53	Methanobacterium thermoautotrophicum	RTH00210	trpE gi|2622783	ANTHRANILATE SYNTHASE COMPONENT I (EC 4.1.3.27)
10.53	Methanobacterium thermoautotrophicum	RTH00596	trpD gi|2622789	ANTHRANILATE PHOSPHORIBOSYLTRANSFERASE (EC 2.4.2.18)
8.45	Helicobacter pylori	RHP00672	trpB sp|P56142	TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
8.45	Helicobacter pylori	RHP00673	trpC-trpF gi|2314446	INDOLE-3-GLYCEROL PHOSPHATE SYNTHASE (EC 4.1.1.48) / N-(5'-PHOSPHO-RIBOSYL)ANTHRANILATE ISOMERASE (EC 5.3.1.24)
5.81	Clostridium acetobutylicum	RCA00929	thyBA	THYMIDYLATE SYNTHASE (EC 2.1.1.45)
5.81	Clostridium acetobutylicum	RCA00930	dfrA	DIHYDROFOLATE REDUCTASE (EC 1.5.1.3)
4.45	Bacillus subtilis	RBS03520	ftsX gi|2618835	CELL DIVISION PROTEIN FTSX
4.45	Bacillus subtilis	RBS03521	ftsE gi|2618833	CELL DIVISION ATP-BINDING PROTEIN FTSE
3.11	Bacillus subtilis	RBS01632	fliN sp|P24073	FLAGELLAR MOTOR SWITCH PROTEIN FLIN
3.11	Bacillus subtilis	RBS01636	fliQ sp|P35535	FLAGELLAR BIOSYNTHETIC PROTEIN FLIQ
3.00	Clostridium acetobutylicum	RCA00178	tpi gi|2829140	TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)
3.00	Clostridium acetobutylicum	RCA00179	pgmI gi|2829141	2,3-BISPHOSPHOGLYCERATE-INDEPENDENT PHOSPHOGLYCERATE MUTASE (EC 5.4.2.1)
1.91	Escherichia coli	REC06155	gyrB gi|1790134	DNA GYRASE SUBUNIT B (EC 5.99.1.3)
1.91	Escherichia coli	REC06549	dnaA sp|P03004	CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA
1.40	Escherichia coli	REC00696	sdhD sp|P10445	SUCCINATE DEHYDROGENASE HYDROPHOBIC MEMBRANE ANCHOR PROTEIN
1.40	Escherichia coli	REC00697	sdhA sp|P10444	SUCCINATE DEHYDROGENASE FLAVOPROTEIN SUBUNIT (EC 1.3.99.1)
0.58	Escherichia coli	REC00555	ylcD sp|P77239	HYPOTHETICAL 44.3 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION PRECURSOR
0.58	Escherichia coli	REC00556	ybdE sp|P38054	HYPOTHETICAL 114.7 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION
0.58	Clostridium acetobutylicum	RCA00176	gap gi|2829138	GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE (EC 1.2.1.12)
0.58	Clostridium acetobutylicum	RCA00178	tpi gi|2829140	TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)