Use of Contiguity on the Chromosome to Predict Functional Coupling

Ross Overbeek, Michael Fonstein*, Mark D'Souza, Gordon D. Pusch and Natalia Maltsev




Mathematics and Computer Science Division, Argonne National Laboratory, IL 60439, USA
* Department of Molecular Genetics and Cell Biology, University of Chicago, IL 60637, USA.
[email protected]







ABSTRACT

The availability of a growing number of completely sequenced genomes opens new opportunities for understanding complex biological systems. Success of genome-based biology will, to a large extent, depend on the development of new approaches and tools for efficient comparative analysis of genomes and their organization. We have developed a technique for detecting possible functional coupling between genes based on detection of potential operons. The approach involves computation of "pairs of close bi-directional best hits", which are pairs of genes that apparently occur within operons in multiple genomes. Using these pairs, one can compose evidence (based on the number of distinct genomes and the phylogenetic distance between the orthologous pairs) that a pair of genes is potentially functionally coupled. The technique has revealed a surprisingly rich and apparently accurate set of functionally coupled genes. The approach depends on the use of a relatively large number of genomes, and the amount of detected coupling grows dramatically as the number of genomes increases.

Key words: microbial genomes; operons; gene function identification; genetic sequence analysis; comparative analysis


INTRODUCTION

The availability of a growing number of completely sequenced genomes opens new opportunities. Success of genome-based biology [1] will, to a large extent, depend on the development of new approaches and tools for efficient comparative analysis of genomes and their organization. The question of how to determine the function of a specific gene within a genome has been open for many years. The desire to rapidly characterize large numbers of genes from the newly available genomes has led to a growing sense of urgency, as well as to a number of ambitious experimental projects to address the need. We have developed an algorithm to detect potential operons based on comparative analysis of multiple genomes. It depends on the availability of a relatively large number of genomes; however, it does not require that these genomes be complete. Furthermore, it provides significant clues as to the function of many genes of unknown or hypothetical function. Our initial results indicate that at least a fifth of the sequenced genes show detectable functional coupling by this method, and that there is every reason to expect this percentage to grow rapidly as the number of sequenced genomes increases.


BASIC APPROACH

It is well known that in many organisms the genes responsible for related functions are located close to each other on the chromosome. If one could accurately predict operons, the availability of a growing number of prokaryotic genomes would offer numerous significant clues relating to the function of "hypothetical proteins" (those for which no reliable identification of function has yet been made). However, an accurate prediction of operons is nontrivial, and the only reliable identifications are currently based on wet-lab experimentation. Nevertheless, given the growing number of genomes, it is reasonable to ask whether accurate predictions of potential operons could be based on a weaker (but related) notion of functional coupling. In this paper, we propose a straightforward computation that is remarkably accurate in predicting functional coupling of genes.

We begin with a small set of definitions that will be used throughout the rest of this paper:

The basic notion of a pair of close bi-directional best hits (PCBBH) is quite intuitive and is shown in Fig. 1.

 
Figure 1: The concept of a PCBBH


It is reasonably straightforward to compute all PCBBHs for a set of genomes, and we have done so for 24 prokaryotic genomes maintained within the WIT system  developed at Argonne National Laboratory by R. Overbeek et al. [2, 3]. We employed data from a number of partial genomes as well as complete genomes (hence we altered the definition of BBH to state that there is "no known gene" more similar, rather than "no gene" more similar).
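The computation can be sketched in a few lines of Python. This is a minimal illustration, not the WIT implementation: the closeness rule (same strand, gap of at most `MAX_GAP` base pairs), the input layout, and all function names are assumptions made for the example. Best hits are taken as given, for instance from a prior similarity search.

```python
from itertools import combinations

# Assumed closeness rule for this sketch: same strand, gap <= MAX_GAP bp.
MAX_GAP = 300  # hypothetical threshold


def close_pairs(genes, max_gap=MAX_GAP):
    """Return the set of close gene pairs in one genome.

    genes: list of (gene_id, start, end, strand) tuples.
    """
    pairs = set()
    for (g1, s1, e1, st1), (g2, s2, e2, st2) in combinations(genes, 2):
        if st1 != st2:
            continue
        gap = max(s1, s2) - min(e1, e2)  # negative when the genes overlap
        if gap <= max_gap:
            pairs.add(frozenset((g1, g2)))
    return pairs


def bbh_map(best_ab, best_ba):
    """Map each gene of genome A to its bi-directional best hit in genome B.

    best_ab[x] is the best known hit of x in B; best_ba[y] the reverse.
    """
    return {x: y for x, y in best_ab.items() if best_ba.get(y) == x}


def pcbbhs(genes_a, genes_b, best_ab, best_ba, max_gap=MAX_GAP):
    """List PCBBHs as ((Xa, Ya), (Xb, Yb)) tuples."""
    bbh = bbh_map(best_ab, best_ba)
    close_b = close_pairs(genes_b, max_gap)
    found = []
    for pair in sorted(close_pairs(genes_a, max_gap), key=sorted):
        xa, ya = sorted(pair)
        if xa in bbh and ya in bbh:
            if frozenset((bbh[xa], bbh[ya])) in close_b:
                found.append(((xa, ya), (bbh[xa], bbh[ya])))
    return found


# Toy data (hypothetical genes and coordinates).
genes_a = [("a1", 100, 400, "+"), ("a2", 450, 900, "+"), ("a3", 5000, 5600, "+")]
genes_b = [("b1", 10, 300, "-"), ("b2", 320, 800, "-"), ("b3", 9000, 9500, "-")]
best_ab = {"a1": "b1", "a2": "b2", "a3": "b3"}
best_ba = {"b1": "a1", "b2": "a2", "b3": "a3"}
print(pcbbhs(genes_a, genes_b, best_ab, best_ba))
```

Because the best-hit tables only need to cover known genes, the same sketch applies unchanged to partial genomes.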

The existence of a pair of close BBHs is certainly a good indication that there might actually be two operons in two genomes, each containing a pair of corresponding orthologs. However, this is not always the case. Our hope is that with enough genomes, the cases in which the implication does not hold will be "washed out with large numbers". It should be noted that the vast majority of PCBBHs derived from closely related organisms (say, two strains of the same species) are of no significance whatsoever and certainly should not be used to infer functional coupling. Conversely, PCBBHs from distantly related organisms are of high significance, since they are unlikely to occur because of chance alone, and can be used to infer  functional coupling.

Throughout the examples discussed in the remainder of this work, the reader will note our use of WIT ids for genes and their products. In WIT the ORF identifiers all have the form Rxxddddd, where xx identifies the organism. The organisms corresponding to these two-letter codes are as shown in Tab. 1.



 
Table 1: Genome Codes
Code  Organism                    Code  Organism
AA    Aquifex aeolicus            MP    Mycoplasma pneumoniae
AG    Archaeoglobus fulgidus      MT    Mycobacterium tuberculosis
BS    Bacillus subtilis           NG    Neisseria gonorrhoeae
BB    Borrelia burgdorferi        NM    Neisseria meningitidis
CA    Clostridium acetobutylicum  PA    Pseudomonas aeruginosa
CY    Synechocystis sp.           PF    Pyrococcus furiosus
DR    Deinococcus radiodurans     PH    Pyrococcus horikoshii
EC    Escherichia coli            PN    Streptococcus pneumoniae
HI    Haemophilus influenzae      RC    Rhodobacter capsulatus SB1003
HP    Helicobacter pylori         ST    Streptococcus pyogenes
MG    Mycoplasma genitalium       TP    Treponema pallidum
MJ    Methanococcus jannaschii    TH    Methanobacterium thermoautotrophicum

We will use these codes as column headers in tables below.
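As a small aid to reading the identifiers, the following Python sketch splits a WIT ORF id into its genome code and serial number. It assumes that "ddddd" in the pattern Rxxddddd stands for exactly five digits, which matches the ids shown in this paper; the function name and the dictionary are illustrative only.

```python
import re

# Genome codes from Tab. 1.
GENOME_CODES = {
    "AA": "Aquifex aeolicus", "AG": "Archaeoglobus fulgidus",
    "BS": "Bacillus subtilis", "BB": "Borrelia burgdorferi",
    "CA": "Clostridium acetobutylicum", "CY": "Synechocystis sp.",
    "DR": "Deinococcus radiodurans", "EC": "Escherichia coli",
    "HI": "Haemophilus influenzae", "HP": "Helicobacter pylori",
    "MG": "Mycoplasma genitalium", "MJ": "Methanococcus jannaschii",
    "MP": "Mycoplasma pneumoniae", "MT": "Mycobacterium tuberculosis",
    "NG": "Neisseria gonorrhoeae", "NM": "Neisseria meningitidis",
    "PA": "Pseudomonas aeruginosa", "PF": "Pyrococcus furiosus",
    "PH": "Pyrococcus horikoshii", "PN": "Streptococcus pneumoniae",
    "RC": "Rhodobacter capsulatus SB1003", "ST": "Streptococcus pyogenes",
    "TP": "Treponema pallidum", "TH": "Methanobacterium thermoautotrophicum",
}

# Assumed layout: 'R', two uppercase letters, five digits.
WIT_ID = re.compile(r"^R([A-Z]{2})(\d{5})$")


def parse_wit_id(orf_id):
    """Return (genome_code, serial_number) for an id such as 'RAG37125'."""
    match = WIT_ID.match(orf_id)
    if match is None:
        raise ValueError(f"not a WIT ORF id: {orf_id!r}")
    code, serial = match.groups()
    return code, int(serial)


code, serial = parse_wit_id("RAG37125")
print(code, GENOME_CODES[code], serial)
```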

Given a set of PCBBHs, we can use them to amass evidence that two particular genes on a given genome are actually co-expressed (and are therefore likely to be functionally coupled). The process proceeds as follows:

  1. First, a pair of close genes occurring in at least one PCBBH is selected. As an example, consider the following pair from the Mycoplasma genitalium genome:

    MG155 (sp|P47401),
    MG154 (sp|P47400).

    These IDs correspond to two ORFs from the Mycoplasma genitalium genome sequenced by TIGR [4].

    This pair of genes occurs in PCBBHs that include Mycoplasma genitalium and each of eighteen other organisms (AG, BS, BB, DR, EC, HI, HP, TH, MJ, MT, MP, NG, NM, PA, ST, PN, CY, and TP). Clearly, this is overwhelming evidence that the co-occurrences are not random. In fact, this pair of genes corresponds to

    MG155 (sp|P47401) SSU ribosomal protein S19P
    MG154 (sp|P47400) LSU ribosomal protein L2P

    and these functional roles occur within the same operon in many organisms.

  2. After selecting a pair of genes from an organism and collecting the list of PCBBHs containing the pair, we "score" the evidence that the two genes are co-occurring. It is not clear exactly what scoring algorithm should be used, but the following seem to be desirable attributes of any potential approach:

We have experimented with several scoring algorithms, and the final results seem to be almost identical. The algorithm used to produce the results presented in this paper was as follows:

Given a pair of genes Xa and Ya from genome Ga, the score reflecting the evidence that they co-occur was computed by adding an increment for each pair (Xi,Yi) from genomes Gi for which (Xa,Ya) and (Xi,Yi) form a PCBBH:

  1. Add the distance between Ga and Gi to the score. By "distance" we mean some reasonable estimate of the distances between the organisms.
  2. If there exist one or more third pairs from other genomes that form the triangular relationship, pick the genome Gb that maximizes

    MinD = the minimum of

    Add MinD to the score.

The result of summing these increments is the score that offers a rough measure that the co-occurrences of Xa and Ya are meaningful.
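The accumulation of increments can be sketched as follows. This is an illustrative reading of the two steps, not the code used to produce the results. In particular, the form of MinD used here, min(D(Ga,Gb), D(Gi,Gb)), is an assumption of the sketch, as are the function names, the input layout, and every distance except the 1.53 between MG and TH.

```python
def make_distance(table):
    """Build a symmetric distance lookup from {(g1, g2): d} entries."""
    def distance(g1, g2):
        if g1 == g2:
            return 0.0
        return table[(g1, g2)] if (g1, g2) in table else table[(g2, g1)]
    return distance


def coupling_score(ga, partners, distance, triangles=None):
    """Score a gene pair (Xa, Ya) from genome ga.

    partners:  genomes Gi holding a pair that forms a PCBBH with (Xa, Ya).
    triangles: optional {Gi: set of Gb} giving, for each Gi, the third
               genomes whose pair forms PCBBHs with both other pairs.
    """
    triangles = triangles or {}
    score = 0.0
    for gi in partners:
        # Step 1: add the distance between Ga and Gi.
        score += distance(ga, gi)
        # Step 2 (assumed form of MinD): best third genome, if any.
        thirds = triangles.get(gi, ())
        if thirds:
            score += max(
                min(distance(ga, gb), distance(gi, gb)) for gb in thirds
            )
    return score


# Toy distances; only MG-TH (1.53) is a value quoted in the text.
D = make_distance({("MG", "TH"): 1.53, ("MG", "MP"): 0.05, ("TH", "MP"): 1.50})
print(coupling_score("MG", ["TH", "MP"], D))
print(coupling_score("MG", ["TH", "MP"], D, triangles={"TH": {"MP"}}))
```

Under this reading, a pair supported only by closely related genomes accumulates a small score, while each distant genome contributes close to its full phylogenetic distance.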

In forming these scores, we used the distances from the phylogenetic tree distributed by the Ribosomal Database Project [6] as estimates of the "distance between genomes". These distances range from a minimum of 0 (between two strains of Rhodobacter capsulatus) to a maximum of 1.53 (between Mycoplasma genitalium and Methanobacterium thermoautotrophicum). Intuitively, a score greater than about 1.0 certainly indicates that a careful examination of the evidence is warranted. In fact, we believe that any score greater than about 0.1 is quite suggestive and often does indicate a real functional coupling of the genes. Once we have scored all pairs of genes for which PCBBHs exist, the entire list of gene pairs is sorted by score (with higher scores corresponding to pairs of genes Xa and Ya for which functional coupling is more likely). The central question then becomes:

Are the pairs with relatively high scores actually functionally coupled?

Clearly many pairs of genes that are, in fact, functionally coupled will not show up with high scores. However, it will become clear that almost all the pairs with relatively high scores are functionally coupled.



RESULTS

For the 24 genomes we included in the analysis, we computed 34,644 PCBBHs. Using these PCBBHs, we computed scores for 23,144 pairs of genes. Of these, 10,531 pairs had scores greater than 1.0; 17,247 had scores greater than 0.1.

We will argue that scores above 0.1 do, in fact, indicate probable functional coupling, which motivates the following definition: a pair of genes is a clustered pair (CP) if its score is greater than 0.1.

Below we will present anecdotal evidence that a clustered pair actually represents a useful clue in the determination of gene function. Before we discuss specific examples, however, let us describe an experiment that we used to evaluate the significance of clustered pairs (CPs). We took a set of ORFs that we already have good reason to believe are functionally coupled and examined what fraction of them appear in CPs as a function of the score threshold.

We examined the pathways in the MPW Database [16] and selected those pathways that contain one or more functional roles (normally enzymes) occurring only in  that  single pathway, and no other. For every organism in WIT that has been asserted to utilize pathways in this set, we then selected those ORFs assigned to each of these "single-pathway functions"; we found 1720 such ORFs.
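The selection step can be illustrated with a short Python sketch. The pathway contents and ORF assignments below are toy data invented for the example (they are not taken from MPW or WIT), and the function names are hypothetical.

```python
from collections import defaultdict


def single_pathway_roles(pathways):
    """Return {role: pathway} for roles that occur in exactly one pathway.

    pathways: {pathway_name: set of functional roles}.
    """
    seen = defaultdict(set)
    for name, roles in pathways.items():
        for role in roles:
            seen[role].add(name)
    return {
        role: next(iter(names))
        for role, names in seen.items()
        if len(names) == 1
    }


def select_orfs(assignments, pathways):
    """Return {orf: pathway} for ORFs assigned a single-pathway role.

    assignments: {orf_id: functional role}.
    """
    unique = single_pathway_roles(pathways)
    return {
        orf: unique[role]
        for orf, role in assignments.items()
        if role in unique
    }


# Toy input (hypothetical pathway membership and ORF assignments).
pathways = {
    "pathway A": {"role 1", "role 2", "shared role"},
    "pathway B": {"role 3", "shared role"},
}
assignments = {
    "orf1": "role 1",
    "orf2": "shared role",
    "orf3": "role 3",
    "orf4": "unassigned",
}
print(select_orfs(assignments, pathways))
```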

We then examined how many of these 1720 ORFs were in CPs, as well as how many of those CPs were with other ORFs in the same pathway. Since each of the selected ORFs occurs in only a single pathway, and since two ORFs believed to code for functions in the same pathway are almost certainly functionally coupled, the fraction of pairs containing a selected ORF and an ORF from the same pathway that are also CPs (i. e., that have scores exceeding the chosen threshold) should provide a rough measure of the significance of CPs at that threshold. We found that, of these 1720 selected ORFs:

  1. 1044 were not members of a CP, and 676 were members of a CP.
  2. 354 were connected by a CP to another ORF that has an assigned function in the same pathway (636 CPs).
  3. 322 were connected by a CP to another ORF that either has no assigned function or has been assigned a function in a different pathway (1122 CPs).

Thus, out of a total of 1758 CPs (636 connections between ORFs within the same pathway, plus 1122 connections that are not in the same pathway), at least 36% (636/1758) almost certainly represent clear instances of known functional couplings: both members of the CP have been assigned functions in the same metabolic pathway, and at least one of them has been assigned a function that occurs in no other pathway.

Note that this estimate of 36% is a conservative lower bound, because ORFs that have not yet been assigned a function have been lumped in with ORFs that have been assigned a function that is in a different pathway. Note also that, just because two ORFs are not in the same metabolic pathway, it does not imply that they might not be functionally coupled, since the MPW pathways are not all independent -- many of them are in fact subsets of metabolic networks. Indeed, our manual analysis suggests that the majority of the remaining CPs also appear to represent what we believe to be real and biochemically justifiable instances of functional couplings -- they simply did not fit into the highly restrictive set of conditions imposed by our tests.

The results of this experiment are summarized in Tab. 2 below.

 
Table 2: Analysis of the ORFs with functional roles that occur in a single pathway, using the functional coupling algorithm (CP scores > 0.1).
ORFs in 24 genomes with a function that occurs in a single pathway only: 1720
ORFs with a function that occurs in a single pathway only, but are not members of a CP: 1044
ORFs with a function that occurs in a single pathway only, and are members of a CP: 676
Number of ORFs connected by a CP to another ORF that has an assigned function that is in the same pathway: 354 (connected by 636 CPs)
Number of ORFs connected by a CP to another ORF that either has no assigned function, or has been assigned a function in a different pathway: 322 (connected by 1122 CPs)

We then repeated the same steps using a cutoff of 1.0 for the calculation of CPs (i.e., we altered the definition of CP). In this case, somewhat more than half as many CPs were detected (957 for a 1.0 cutoff, versus 1758 for a 0.1 cutoff). However, the ratio of the number of connections between genes within the same pathway to the total number of connections rises only slightly, from 36% to 38%. We consider this to be a strong indication that, while a score of 1.0 is certainly far stronger than a score of 0.1, both scores are excellent indicators of functional coupling.
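The arithmetic behind these percentages can be checked directly. The snippet below only restates the counts reported in the text for the two cutoffs; the variable names are illustrative.

```python
# Counts reported for the 0.1 cutoff.
selected_orfs = 1720
not_in_cp, in_cp = 1044, 676
same_pathway_orfs, other_orfs = 354, 322
same_pathway_cps, other_cps = 636, 1122

# Total number of CPs and the share linking ORFs in the same pathway.
total_cps = same_pathway_cps + other_cps
same_fraction = same_pathway_cps / total_cps

# CPs detected with the 1.0 cutoff, relative to the 0.1 cutoff.
cps_at_1_0 = 957
retained = cps_at_1_0 / total_cps

print(f"total CPs at 0.1 cutoff: {total_cps}")
print(f"same-pathway share: {same_fraction:.1%}")
print(f"CPs retained at 1.0 cutoff: {retained:.1%}")
```

The ORF counts are internally consistent (1044 + 676 = 1720 and 354 + 322 = 676), and 957 of 1758 is indeed somewhat more than half.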

This "experiment" strongly suggests that PCBBHs do reflect actual functional coupling. However, we have not yet been able to accurately quantify the extent to which such evidence can be relied upon. We believe that the real value of this technology will become truly established only by actual verification of the hundreds of predictions that can easily be made from the existing data.

Tab. 3 offers an extraction of some pairwise scores to give a feel for the output.

 

Table 3: Pairwise Scores
Each entry gives the pairwise score and the organism, followed by one line per gene: WIT id; other names; assigned function.

35.51  Archaeoglobus fulgidus
    RAG37125; AF1922, gi|2648624; LSU ribosomal protein L2P
    RAG47409; AF1921, gi|2648642; SSU ribosomal protein S19P
19.26  Methanobacterium thermoautotrophicum
    RTH01372; trpA, MTH1660, gi|2622788; TRYPTOPHAN SYNTHASE ALPHA CHAIN (EC 4.2.1.20)
    RTH02024; trpB, MTH1659, gi|2622787; TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
18.38  Archaeoglobus fulgidus
    RAG45695; dppD, gi|2648780; DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPD
    RAG45696; dppC, gi|2648779; DIPEPTIDE TRANSPORT SYSTEM PERMEASE PROTEIN DPPC
13.96  Methanobacterium thermoautotrophicum
    RTH01473; dnaJ, gi|2622399; DNAJ PROTEIN
    RTH01629; dnaK, gi|2622398; DNAK PROTEIN
11.72  Deinococcus radiodurans
    RDR01648; infA; INITIATION FACTOR IF-1
    RDR01651; rpsK; SSU ribosomal protein S11P
11.38  Treponema pallidum
    RTP00100; ntpI or ntpM; V-TYPE SODIUM ATP SYNTHASE SUBUNIT I (EC 3.6.1.34)
    RTP00107; ntpB; V-TYPE SODIUM ATP SYNTHASE SUBUNIT B (EC 3.6.1.34)
11.11  Deinococcus radiodurans
    RDR02462; aroK; SHIKIMATE KINASE (EC 2.7.1.71)
    RDR02463; aroB; 3-DEHYDROQUINATE SYNTHASE (EC 4.6.1.3)
10.53  Methanobacterium thermoautotrophicum
    RTH00210; trpE, gi|2622783; ANTHRANILATE SYNTHASE COMPONENT I (EC 4.1.3.27)
    RTH00596; trpD, gi|2622789; ANTHRANILATE PHOSPHORIBOSYLTRANSFERASE (EC 2.4.2.18)
8.45  Helicobacter pylori
    RHP00672; trpB, sp|P56142; TRYPTOPHAN SYNTHASE BETA CHAIN (EC 4.2.1.20)
    RHP00673; trpC-trpF, gi|2314446; INDOLE-3-GLYCEROL PHOSPHATE SYNTHASE (EC 4.1.1.48) / N-(5'-PHOSPHO-RIBOSYL)ANTHRANILATE ISOMERASE (EC 5.3.1.24)
5.81  Clostridium acetobutylicum
    RCA00929; thyBA; THYMIDYLATE SYNTHASE (EC 2.1.1.45)
    RCA00930; dfrA; DIHYDROFOLATE REDUCTASE (EC 1.5.1.3)
4.45  Bacillus subtilis
    RBS03520; ftsX, gi|2618835; CELL DIVISION PROTEIN FTSX
    RBS03521; ftsE, gi|2618833; CELL DIVISION ATP-BINDING PROTEIN FTSE
3.11  Bacillus subtilis
    RBS01632; fliN, sp|P24073; FLAGELLAR MOTOR SWITCH PROTEIN FLIN
    RBS01636; fliQ, sp|P35535; FLAGELLAR BIOSYNTHETIC PROTEIN FLIQ
3.00  Clostridium acetobutylicum
    RCA00178; tpi, gi|2829140; TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)
    RCA00179; pgmI, gi|2829141; 2,3-BISPHOSPHOGLYCERATE-INDEPENDENT PHOSPHOGLYCERATE MUTASE (EC 5.4.2.1)
1.91  Escherichia coli
    REC06155; gyrB, gi|1790134; DNA GYRASE SUBUNIT B (EC 5.99.1.3)
    REC06549; dnaA, sp|P03004; CHROMOSOMAL REPLICATION INITIATOR PROTEIN DNAA
1.40  Escherichia coli
    REC00696; sdhD, sp|P10445; SUCCINATE DEHYDROGENASE HYDROPHOBIC MEMBRANE ANCHOR PROTEIN
    REC00697; sdhA, sp|P10444; SUCCINATE DEHYDROGENASE FLAVOPROTEIN SUBUNIT (EC 1.3.99.1)
0.58  Escherichia coli
    REC00555; ylcD, sp|P77239; HYPOTHETICAL 44.3 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION PRECURSOR
    REC00556; ybdE, sp|P38054; HYPOTHETICAL 114.7 KD PROTEIN IN NFRB-PHEP INTERGENIC REGION
0.58  Clostridium acetobutylicum
    RCA00176; gap, gi|2829138; GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE (EC 1.2.1.12)
    RCA00178; tpi, gi|2829140; TRIOSEPHOSPHATE ISOMERASE (EC 5.3.1.1)

It should be noted that of the 2,545 pairs with scores greater than 1.0,  725 included at least one gene that we have not yet been able to assign a reliable function to based on homology or other evidence; inferences about their functional couplings to other genes should provide valuable additional clues to identifying the functions of many of these ORFs.

To offer anecdotal evidence of the ability of these simple computations to effectively infer functional coupling, we looked at several fairly complex known operons and at the coupling scores produced for the genes in these operons.

Example 1:

To illustrate the effectiveness of the described method in finding functional connections between genes, we present the results of our analysis of the 1313800 -- 1321100 region of the Escherichia coli chromosome (see Fig. 2).

 
Figure 2: Functional coupling between the ORFs in the trp region on the E. coli chromosome.

This region contains the trp operon with the genes trpE, trpD, trpC, trpB and trpA. These genes code for the following enzymes: trpE - anthranilate synthetase (EC 4.1.3.27); trpD - glutamine amidotransferase-phosphoribosyl anthranilate transferase; trpC - N-(5-phosphoribosyl)anthranilate isomerase/indole-3-glycerolphosphate synthetase; trpB - tryptophan synthetase (EC 4.2.1.20) beta chain; and trpA - tryptophan synthetase (EC 4.2.1.20) alpha chain. Solid lines connect clustered pairs; the numbers associated with the lines are the pairing scores (see above), which reflect the strength of the connections.

As one can see from Fig. 2, the TrpE, TrpD, TrpC, TrpB, and TrpA gene products form a highly conserved PCBBH cluster, which could be viewed as a potential operon. Each ORF in this cluster forms a high-scoring pair with the other members of the cluster, but not with any other ORF in the genome. All five members of this cluster belong to the well-studied tryptophan biosynthesis operon (for a review see Yanofsky et al., 1996 [7]), which has become the basic reference structure for studies on tryptophan metabolism. As was shown by Yanofsky et al. [8], the full-length polycistronic trp mRNA encodes the five trp polypeptides (corresponding to the trpA, trpB, trpC, trpD, and trpE genes), the enzymes of the tryptophan biosynthetic pathway. Our prediction of the potential operon therefore agrees with the literature.

Example 2: Diaminopimelate Biosynthetic Pathway

Three different pathways of D,L-diaminopimelate and L-lysine synthesis are known in prokaryotes [9]. The pathway shown below represents the lysine branch of the aspartic amino acid family biosynthetic pathway [10]. This pathway is implemented by nine distinct reactions catalyzed by the enzymes shown in Tab. 4. The PCBBHs relevant to this pathway are shown as well.

 

Table 4: Functional couplings between the enzymes of the lysine biosynthetic pathway.
Each line gives: enzyme (in the order of the consecutive reactions in the pathway); gene name(s); coupled ORFs.

ASPARTATE KINASE (EC 2.7.2.4); ask, thrA, metL, lysC; -
ASPARTATE-SEMIALDEHYDE DEHYDROGENASE (EC 1.2.1.11); asd; RCY44336, RPN01355, RBS01675
DIHYDRODIPICOLINATE SYNTHASE (EC 4.2.1.52); dapA; RCY38981, RPN01356, RBS01677, RCA03322, RAG28035, RTH00849
DIHYDRODIPICOLINATE REDUCTASE (EC 1.3.1.26); dapB; RCA03321, RAG48606, RTH01270
ACETYL-L,L-DIAMINOPIMELATE AMINOTRANSFERASE; no sequences in the databases; -
TETRAHYDRODIPICOLINATE ACETYLTRANSFERASE; no sequences in the databases; -
N-ACETYLDIAMINOPIMELATE DEACETYLASE (EC 3.5.1.47); no sequences in the databases; -
DIAMINOPIMELATE EPIMERASE (EC 5.1.1.7); dapF; RPA02459, RTH01501
DIAMINOPIMELATE DECARBOXYLASE (EC 4.1.1.20); lysA; RPA02462, RTH01007

Here, we have shown in distinct colors genes from distinct organisms that together form PCBBHs indicating linkages of function. From these PCBBHs, one can infer a tenuous connection between the three functions 1.2.1.11, 4.2.1.52, and 1.3.1.26, as well as the last two functions 4.1.1.20 and 5.1.1.7. Note, however, that no operon in any of the genomes actually couples more than two of these functions. The overall functional coupling emerges much like a holographic image from the set of genomes, but is not present in any single genome. We conjecture that the apparent lesser degree of "completeness" or "maturity" seen in the inferred connections between functions in this pathway might be the result of weaker evolutionary forces pushing these genes toward co-regulation, as compared to those that produced the obvious operons of translation, transcription, and other central cellular mechanisms. In cases such as shown in Tab. 4, we may still gain significant clues leading to meaningful inferences of functional coupling; however, a complete and compelling picture of such pathways might not fully emerge until we have many hundreds of genomes.
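The way a pathway-level grouping emerges from pairwise evidence spread over several genomes can be sketched with a small union-find over functional roles. The per-genome role pairs below are an illustrative reading of Tab. 4 by genome code, not a statement of which ORFs are actually adjacent; the function name and input layout are assumptions of the sketch.

```python
def coupled_groups(role_pairs):
    """Group functional roles linked, directly or transitively, by pairs.

    role_pairs: iterable of (role_a, role_b) tuples, pooled over genomes.
    Returns a sorted list of sorted role lists.
    """
    parent = {}

    def find(role):
        parent.setdefault(role, role)
        while parent[role] != role:
            parent[role] = parent[parent[role]]  # path halving
            role = parent[role]
        return role

    for a, b in role_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for role in list(parent):
        groups.setdefault(find(role), set()).add(role)
    return sorted(sorted(group) for group in groups.values())


# Illustrative pairs keyed by genome code (assumed pairings).
pairs_by_genome = {
    "CY": [("1.2.1.11", "4.2.1.52")],
    "PN": [("1.2.1.11", "4.2.1.52")],
    "CA": [("4.2.1.52", "1.3.1.26")],
    "PA": [("5.1.1.7", "4.1.1.20")],
}
pooled = [pair for pairs in pairs_by_genome.values() for pair in pairs]
print(coupled_groups(pooled))
```

No single genome in this toy input links more than two roles, yet pooling the pairs yields one group of three roles and one group of two, matching the groupings described in the text.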

We have constructed a formatted version of these pairwise scores and made them available on the WWW via the WIT system [2, 3].

Example 3: Purine Metabolism

The PCBBHs that connect the enzymes participating  in  de novo synthesis of IMP are shown in Tab. 5. In this table each row corresponds to a single functional role -- an enzyme catalyzing a particular step in the Purine biosynthetic pathway. Occurrences of these particular functional roles in each genome are presented in each column of the table. We use distinct colors to show potential operon groupings within a column (that is, ORFs sharing a common color within a column are all close on the chromosome in the sense discussed in Definition 1). The first twelve rows of the table correspond to the known enzymatic roles in this pathway. The last row contains evidence that an unknown protein present in seven of the genomes is somehow functionally coupled to IMP biosynthesis.



 
Table 5: Functional couplings between the enzymes participating in de novo synthesis of IMP.
Genome columns: DR CY ST PN BS CA MT PA EC HI AG TH MJ PF. For each enzyme of the purine biosynthetic pathway, the ORFs found are listed in column order; the two letters after the leading R identify the genome.

AMIDOPHOSPHORIBOSYLTRANSFERASE (EC 2.4.2.14)
    RST01489  RPN00061  RBS00650  RCA01861  RMJ08546
PHOSPHORIBOSYLAMINE-GLYCINE LIGASE (EC 6.3.4.13)
    RDR00668  RST01483  RPN00065  RBS00654  RCA01857  REC06280  RHI05498
PHOSPHORIBOSYLGLYCINAMIDE FORMYLTRANSFERASE (EC 2.1.2.2)
    RST01487  RPN00063  RBS00652  RCA01859  RMT00037  REC02440  RHI15219
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE SYNTHASE (EC 6.3.5.3)
    RDR03710  RCY15677  RBS00648  RMT00893  RTH01934  RPF00609
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE SYNTHASE II (EC 6.3.5.3)
    RDR03711  RST01490  RPN00060  RBS00649  RAG50222  RPF00610
PHOSPHORIBOSYLFORMYLGLYCINAMIDINE CYCLO-LIGASE (EC 6.3.3.1)
    RST01488  RPN00062  RBS00651  RCA01860  REC02439  RHI15218  RMJ11899
PHOSPHORIBOSYLAMINOIMIDAZOLE CARBOXYLASE ATPASE SUBUNIT (EC 4.1.1.21)
    RDR01499  RST01481  RPN00067  RBS00644  RMT02837  RPA03721  REC04505  RHI10627  RPF00920
PHOSPHORIBOSYLAMINOIMIDAZOLE CARBOXYLASE CATALYTIC SUBUNIT (EC 4.1.1.21)
    RDR01497  RST01482  RPN00066  RBS00643  RCA01863  RMT02836  RPA03720  REC04506  RHI10626  RPF00919
PHOSPHORIBOSYLAMINOIMIDAZOLE-SUCCINOCARBOXAMIDE SYNTHASE (EC 6.3.2.6)
    RST01491  RPN00059  RBS00646  RCA01862  RTH00832  RMJ01710
ADENYLOSUCCINATE LYASE (EC 4.3.2.2)
    RST01479  RPN00069  RBS00645
PHOSPHORIBOSYLAMINOIMIDAZOLECARBOXAMIDE FORMYLTRANSFERASE (EC 2.1.2.3) / IMP CYCLOHYDROLASE (EC 3.5.4.10)
    RST01486  RPN00064  RBS00653  RCA01858  RMT00036  REC06281  RHI09169
unknown
    RCY24689  RBS00647  RMT00891  RAG29290  RTH01716  RMJ00185  RPF00608

One notable feature of this example is how the evidence from fourteen distinct genomes combines to present a vivid portrait of the pathway, as well as the possible coupling of ORFs of unknown function to this pathway. In particular, note that one would still have been able to reconstruct this pathway (as well as infer that it is connected to a protein of unknown function) even if the complete version of it in B. subtilis (BS) and the nearly complete versions in S. pneumoniae (PN) and S. pyogenes (ST) had not been included. Again, one sees that the pathways appear to emerge "holographically" from sets of operons having a greater or lesser degree of "completeness" or "maturity" in different organisms.

Example 4: A Problematic Instance

It is known that genes that do not have obvious functional connections may be co-transcribed [14,15]. Recognition of these atypical operons could provide very important insights into the interconnections between the functional subsystems and regulatory mechanisms involved in complex biological processes.
Our analysis detected a cluster of genes composed of the ribosomal protein S2P (rpsB), elongation factor Ts (tsf), the ribosome recycling factor (rrf), the enzyme phosphatidate cytidylyltransferase (cdsA), uridylate kinase (pyrH), and two hypothetical proteins yaeS and yaeL.
Using the technique described above, we analyzed  functional coupling between the ORFs in this region in several completely sequenced genomes.

As one can see from Fig. 3, in most of the genomes under consideration rpsB, tsf, pyrH and frr form a highly conserved gene cluster, which includes the uncharacterized genes yaeS and yaeL. With a few exceptions the cluster also contains cdsA, which encodes CDP-diacylglycerol synthase (EC 2.7.7.41).

 
Figure 3: The rps--yae region on the chromosome from ten organisms.

The order of the genes in the described cluster is preserved in all genomes except M. tuberculosis, in which the pyrH gene precedes rpsB. The functional connections between the genes in this cluster are not obvious. Known experimental data, however, could provide some valuable insights.

It is known that the ribosome recycling factor  is an essential protein for bacterial life. The two known functions of ribosome recycling factor (rrf, originally called ribosome releasing factor), are described in [11]: The first function relates to the disassembly of the termination complex, which consists of mRNA, tRNA, and the ribosome bound to the mRNA at the termination codon. The second function of rrf is to prevent errors in translation. In polyphenylalanine synthesis programmed by polyuridylic acid, misincorporation of isoleucine, leucine, or a mixture of amino acids was stimulated up to 17-fold when rrf was omitted from the in vitro system. rrf did not influence the large error (10-fold increase) induced by streptomycin. This means that rrf participates not only in the disassembly of the termination complex but also in peptide elongation.

Yamanaka et al. [12] demonstrated that the pyrH gene (formerly smbA) [13], which encodes uridylate kinase (an enzyme participating in pyrimidine biosynthesis), is also involved in chromosome partitioning in E. coli by suppressing mukB. pyrH was also found to be essential for cell proliferation in the range from 22 to 42 degrees C. Cells that lacked the pyrH protein ceased macromolecular synthesis. The pyrH mutants are sensitive to a detergent, sodium dodecyl sulfate, and they show a novel morphological phenotype under nonpermissive conditions, suggesting a defect in specific membrane sites.

These observations provide only a very indirect grasp of any functional coupling; however, the functions all do appear to be essential during rapid growth of a cell. It may well turn out that the conserved positional relationship of these genes does not convey significant information relating to functional coupling (and there are many instances like this). However, the number of cases in which the functional coupling is obvious suggests that these cases should be considered carefully before dismissing them as accidental curiosities.

The results of our analysis of the potential operons in 24 genomes are available at the following locations:



DISCUSSION

The ability of such a simple computation to produce such detailed insights into functional coupling is striking. In addition, it must be remembered that our results were produced from only 24 genomes, many of which were incomplete. The ability of the technique proposed here to accurately determine functional coupling will improve dramatically as the number of genomes included in the analysis increases; a crude analysis suggests that the set of PCBBHs should grow as the square of the number of genomes. Given the level of detailed functional data that can be immediately inferred from these PCBBHs, it might well be the case that the cheapest way to acquire detailed evidence of functional coupling would be to rapidly sequence another 50-100 genomes to the point where average contig lengths would be 3-5 Kb. To illustrate this point (albeit quite crudely), we present the following short table, which can provide some insight into the rate of growth in the number of PCBBHs, with sufficient evidence to allow inference of functional coupling:

 
Table 6: Growth of number of PCBBHs with number of genomes.
Number of Genomes     Number of PCBBHs with Scores > 0.1 
 4      998
 8     4859
16    12570

The first set of four genomes that we used were

To arrive at the eight genomes, we added

Finally, to arrive at the sixteen, we added

Here, we counted all PCBBHs for which the simple scoring scheme described above produced scores of 0.1 or better (a weak score, but one suggesting a coupling).

It is difficult to make an accurate estimate of how rapidly the set of reliable PCBBHs will grow, given the number of variables in the genomes used for this tabulation (number of ORFs/genome, phylogenetic distribution, size of contigs, differing levels of operons in different domains, etc.). One would expect the growth to be better than linear.
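One rough way to read Tab. 6 is to compute the local growth exponent between successive rows, that is, the slope of log(count) against log(number of genomes). The sketch below does only that; it is a back-of-the-envelope aid with illustrative names, not a fitted growth model.

```python
import math

# Rows of Tab. 6: number of genomes -> PCBBHs with scores > 0.1.
counts = {4: 998, 8: 4859, 16: 12570}


def local_exponents(table):
    """Return (n1, n2, exponent) for each pair of consecutive rows.

    The exponent k satisfies count(n2) / count(n1) == (n2 / n1) ** k.
    """
    sizes = sorted(table)
    return [
        (n1, n2, math.log(table[n2] / table[n1]) / math.log(n2 / n1))
        for n1, n2 in zip(sizes, sizes[1:])
    ]


for n1, n2, k in local_exponents(counts):
    print(f"{n1} -> {n2} genomes: exponent {k:.2f}")
```

For these three rows the exponents come out at roughly 2.3 and 1.4. Both exceed 1, which is consistent with better-than-linear growth, although three data points drawn from heterogeneous genomes cannot settle whether the growth is quadratic.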

We have described the computation of PCBBHs and illustrated their use by examples that are probably familiar to the reader. However, it should be stressed that the PCBBHs that we can already compute offer what appears to be substantial evidence coupling hundreds of genes with unknown function to complexes of genes for which the function is known. The detailed analysis of this data will require years, but it is already clear that this data provides a remarkably rich set of clues that should play a role in guiding wet-lab verification or rejection of a rapidly emerging set of hypotheses.


ACKNOWLEDGEMENT

This work was supported in part by the U.S. Department of Energy under contract W-31-109-Eng-38.



REFERENCES



  1. Koonin, E. V. and Galperin, M. Y. (1997). Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr. Opin. Genet. Dev. 7, 757-763.

  2. Overbeek, R., Larsen, N., Maltsev, N., Pusch, G. D. and Selkov, E., WIT: A system for metabolic reconstructions and  comparative analysis of the genomes. Molecular Biology Databases, ed. Letovsky, S. Kluwer, in press.

  3. WIT2 - http://wit.mcs.anl.gov/WIT/wit.html.

  4. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A., Fleischmann, R. D., Bult, C. J., Kerlavage, A. R., Sutton, G., Kelley, J. M., et al. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397-403.

  5. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997). A genomic perspective on protein families. Science 278, 631-637.

  6. Maidak B. L., Olsen, G. J., Larsen, N., Overbeek R., McCaughey, M. J. and Woese, C. R. (1997). The RDP (Ribosomal Database Project). Nucleic Acids Res. 25, 109-111.

  7. Yanofsky, C., Konan, K. V. and Sarsero, J. P. (1996). Some novel transcription attenuation mechanisms used by bacteria. Biochimie 78, 1017-1024.

  8. Yanofsky C., Platt, T., Crawford, I. P., Nichols, B. P., Christie, G. E., Horowitz, H., VanCleemput, M. and Wu, A. M. (1981). The complete nucleotide sequence of the tryptophan operon of Escherichia coli. Nucleic Acids Res. 9, 6647-6668.

  9. Schrumpf, B., Schwarzer, A., Kalinowski, J., Puehler, A., Eggeling, L. and Sahm, H. (1991). A functionally split pathway for lysine synthesis in Corynebacterium glutamicum. J. Bacteriol. 173, 4510-4516.

  10. Pavelka, M. S. Jr. and Jacobs, W. R. Jr. (1996). Biosynthesis of diaminopimelate, the precursor of lysine and a component of  peptidoglycan, is an essential function of Mycobacterium smegmatis. J. Bacteriol. 178, 6496-6507.

  11. Janosi, L., Ricker, R. and Kaji, A. (1996). Dual functions of ribosome recycling factor in protein biosynthesis: disassembling the termination complex and preventing translational errors. Biochimie 78, 959-969.

  12. Yamanaka, K., Ogura, T., Niki, H. and Hiraga, S. (1992). Identification and characterization of the smbA gene, a suppressor of the mukB null mutant of Escherichia coli. J. Bacteriol. 174, 7517-7526.

  13. Serina, L., Blondin, C., Krin, E., Sismeiro, O., Danchin, A., Sakamoto, H., Gilles, A. M. and Barzu, O. (1995). Escherichia coli UMP-kinase, a member of the aspartokinase family, is a hexamer regulated by guanine nucleotides and UTP. Biochemistry 34, 5066-5074.

  14. Schmidt, J., Bubunenko, M. and Subramanian, A. R. (1993). A novel operon organization involving the genes for chorismate synthase (aromatic biosynthesis pathway) and ribosomal GTPase center proteins (L11, L1, L10, L12: rplKAJL) in cyanobacterium Synechocystis PCC 6803. J. Biol. Chem. 268, 27447-27457.

  15. Bachmann, B. J. (1990). Linkage map of Escherichia coli K-12, edition 8. Microbiol. Rev. 54, 130-197.

  16. Selkov, E., Galimova, M., Goryanin, I., Gretchkin, Y., Ivanova, N., Komarov, Y., Maltsev, N., Mikhailova, N., Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L. and Selkov, E. Jr. (1997). The metabolic pathway collection: an update. Nucleic Acids Res. 25, 37-38.