2 What Mechanisms Exist for Accessing BioCyc Data?
3 Important Concepts
3.1 How are Pathway Boundaries Defined?
3.2 Super Pathways and Base Pathways
3.3 Pathway Variants
3.4 Do We Force a Pathway View of the Metabolic Network?
3.5 Reaction Direction
3.6 Reaction Locations and Compartments of Metabolites
3.7 Evidence Codes
4 Reaction Balancing and Protonation State in BioCyc
4.1 Background and Motivations
4.2 Protonation State Normalization
4.3 Computational Reaction Balancing for Hydrogen
4.4 Statistics on Reaction Balance and Protonation circa 2009
This document describes concepts involved in Pathway/Genome Databases (PGDBs) such as those in the BioCyc PGDB collection.
BioCyc data is accessible in several ways, which are described in more detail on the downloads page.
Query and visualization access is available through this BioCyc Web site
Data files for the BioCyc databases are available for download in multiple formats
The preceding databases can be loaded into SRI’s BioWarehouse relational database system (Oracle or MySQL based) for querying.
A downloadable "software/database bundle" is available that supports querying, visualization, and analysis of BioCyc data. It also allows users to create their own Pathway/Genome Databases. The software/database bundle includes functionality not available through the Web site, and also executes faster than the Web version.
Web services allow for various PGDB queries to be accepted via URLs and returned in XML for facile processing
The software/database bundle also allows users to query BioCyc data via the Java, Perl, and Common Lisp languages.
This section introduces a number of concepts that are important to understanding PGDBs.
Pathway boundaries are defined heuristically, using the judgement of expert curators. Curators consider the following aspects of a pathway when defining its boundaries.
What boundaries were defined historically for the pathway?
When possible, we prefer to define boundaries at the 13 common currency metabolites:
Coincidence with regulatory units
Coincidence with metabolic units that are evolutionarily conserved
The preceding philosophy toward pathway boundary definition contrasts sharply with KEGG maps. KEGG maps are on average 4.2 times larger than BioCyc pathways because KEGG tends to group into a single map multiple biological pathways that converge on a single metabolite [Pathway05].
We define a super-pathway as a cluster of related pathways. Typically, a super-pathway consists of a linked set of smaller pathways that share a common metabolite. For example, the super-pathway "superpathway of phenylalanine, tyrosine, and tryptophan biosynthesis" consists of several pathways that converge at the metabolite chorismate.
The components of super-pathways include base pathways (pathways that are not themselves super-pathways), other super-pathways, and individual reactions that have not necessarily been assigned to base pathways. Those reactions typically serve to connect together the component pathways within a super-pathway.
Super-pathways are stored within each BioCyc PGDB – they are not computed dynamically.
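The containment relationship described above can be sketched as a small recursive expansion. This is a minimal illustration in Python, not Pathway Tools code: the `PATHWAY_COMPONENTS` dictionary, the frame names in it, and the helpers `is_super_pathway` and `expand_pathway` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy data: each pathway maps to its components, which may be
# other pathways (base or super) or individual connecting reactions.
PATHWAY_COMPONENTS: Dict[str, List[str]] = {
    "SUPER-AROMATIC": ["CHORISMATE-SYN", "PHE-SYN", "TYR-SYN", "RXN-LINK"],
    "CHORISMATE-SYN": ["RXN-A", "RXN-B"],
    "PHE-SYN": ["RXN-C"],
    "TYR-SYN": ["RXN-D"],
}


def is_super_pathway(pathway: str) -> bool:
    """A pathway is a super-pathway if any of its components is itself a pathway."""
    return any(c in PATHWAY_COMPONENTS for c in PATHWAY_COMPONENTS[pathway])


def expand_pathway(pathway: str) -> Set[str]:
    """Collect every reaction reachable from a pathway, recursing into sub-pathways."""
    reactions: Set[str] = set()
    for component in PATHWAY_COMPONENTS[pathway]:
        if component in PATHWAY_COMPONENTS:
            # The component is a pathway (base or super): expand it in turn.
            reactions |= expand_pathway(component)
        else:
            # The component is an individual reaction stored on the super-pathway.
            reactions.add(component)
    return reactions
```

Because super-pathways are stored rather than computed, a lookup table of this kind is enough to walk the hierarchy; no clustering step is needed at query time.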
Experimentalists have elucidated variants of a given metabolic pathway in different organisms. These pathway variants are defined as distinct pathway objects in MetaCyc when they differ in their constituent reactions. For example, MetaCyc contains several variations of the TCA cycle that have been observed in different organisms. Variant pathways are assigned names such as “TCA cycle variation II” and “arginine degradation III” (meaning the third form of arginine degradation in MetaCyc).
Two pathways are not considered variants of one another in MetaCyc if their constituent reactions are identical. For example, if the glycolysis pathway occurs with identical constituent reactions in two different organisms, but with different enzymes (different protein sequences), these occurrences of glycolysis are not considered to be pathway variants.
When copied to other Pathway/Genome Databases by the PathoLogic pathway prediction program, the names of variant pathways are not changed. This approach provides for consistent naming across DBs, but does have the side effect that names may seem inconsistent if considered only in the context of one DB, such as a DB that contains only “arginine degradation III” and “arginine degradation VIII”.
Do we force a pathway view of the metabolic network? No. Pathways comprise a level defined on top of the reaction-based metabolic network. Users can compute with the metabolic (reaction) network directly, ignoring the pathway layer, if they so choose. Note also that virtually every PGDB contains some metabolic reactions that are not assigned to any metabolic pathway.
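As a rough illustration of working at the reaction level, the sketch below finds reactions that no pathway claims. The function name `unassigned_reactions` and the toy identifiers are hypothetical and do not come from any BioCyc API.

```python
from typing import Dict, Iterable, Set


def unassigned_reactions(
    all_reactions: Iterable[str],
    pathway_to_reactions: Dict[str, Set[str]],
) -> Set[str]:
    """Return the reactions that are not a member of any pathway."""
    # Union of every pathway's reaction set; empty if there are no pathways.
    assigned: Set[str] = set().union(*pathway_to_reactions.values())
    return set(all_reactions) - assigned


# Hypothetical toy network: RXN-4 belongs to no pathway but is still part
# of the reaction network and can be used in network-level computations.
network = ["RXN-1", "RXN-2", "RXN-3", "RXN-4"]
pathways = {"PWY-X": {"RXN-1", "RXN-2"}, "PWY-Y": {"RXN-2", "RXN-3"}}
orphans = unassigned_reactions(network, pathways)
```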
How do PGDBs handle reaction direction?
The direction in which a reaction is stored in a PGDB has no implication for the physiological directionality of that reaction. Each reaction is stored as an instance of the Reactions class that includes two slots, Left and Right. It is possible that the reaction is bidirectional; it is possible that the reaction proceeds physiologically in the left-to-right direction, and it is possible that the reaction proceeds in the right-to-left direction.
The equilibrium constant and change in Gibbs free energy stored for the reaction (if any) refer to the direction of the reaction as stored.
Currently, the best way to query the direction of a reaction is via the slot Reaction-Direction (within instances of classes Reactions and Enzymatic-Reactions). Values of this slot are computed on demand rather than stored.
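The separation between stored orientation and physiological direction can be sketched as follows. This is an illustration under stated assumptions, not the PGDB schema: the `StoredReaction` class and the `physiological_view` helper are hypothetical, and the direction labels used here are simplified stand-ins, so the actual values of the Reaction-Direction slot may differ. The only chemistry relied on is that reversing a reaction inverts its equilibrium constant and negates its Gibbs free energy change.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoredReaction:
    left: List[str]                    # contents of the Left slot
    right: List[str]                   # contents of the Right slot
    direction: str                     # assumed labels: "LEFT-TO-RIGHT", "RIGHT-TO-LEFT", "REVERSIBLE"
    keq: Optional[float] = None        # refers to left -> right, as stored
    delta_g: Optional[float] = None    # refers to left -> right, as stored


def physiological_view(
    rxn: StoredReaction,
) -> Tuple[List[str], List[str], Optional[float], Optional[float]]:
    """Return (substrates, products, Keq, deltaG) in the physiological direction.

    A reversible reaction is returned in its stored orientation.
    """
    if rxn.direction == "RIGHT-TO-LEFT":
        # Flip the sides and convert the thermodynamic values accordingly.
        keq = None if rxn.keq is None else 1.0 / rxn.keq
        delta_g = None if rxn.delta_g is None else -rxn.delta_g
        return rxn.right, rxn.left, keq, delta_g
    return rxn.left, rxn.right, rxn.keq, rxn.delta_g
```

The point of the sketch is that the thermodynamic values must be converted together with the sides; reading them without checking the direction would give the wrong sign for a reaction that runs right-to-left.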
An accurate representation of processes inside a cell needs to take into account that different compartments usually exist, and that the processes and metabolites are partitioned between these compartments. Even a simple cell consisting of a single compartment has a membrane boundary to the outside world, which is crossed by transport reactions. This section describes how BioCyc records the compartments where reactions and metabolites occur.
To avoid unnecessarily duplicating information, we store frames for metabolites and reactions only once, even if they may simultaneously be present in more than one compartment in a PGDB. Instead, the compartments are specified by auxiliary information attached to reactions.
There are two cases of reactions:
S: Reactions that have all of their metabolites in the same compartment. In a given PGDB, such a reaction could potentially occur in more than one compartment.
T: Reactions that have metabolites in multiple compartments. This can only happen at membranes, involving transport reactions or electron transfer reactions (ETRs). These reactions may use only the abstract directional compartments CCO-IN and CCO-OUT to annotate the metabolites. These abstract compartments need to be mapped to the actual compartments in a given PGDB for certain operations.
To store non-default compartment information, reactions have a slot called RXN-LOCATIONS, the values of which differ between the S and T cases, as follows:
S: If the reaction occurs in a non-default compartment, or in several compartments, then the RXN-LOCATIONS slot stores for every compartment the corresponding frame (a child of CCO-SPACE). The metabolites in the reaction’s LEFT and RIGHT slots do not have any COMPARTMENT annotations, as those would be redundant and could cause conflicts.
Example of slot values for a reaction in the periplasm:
LEFT: GLC
RIGHT: |D-Glucose|
RXN-LOCATIONS: CCO-PERI-BAC
SPONTANEOUS?: T
T: RXN-LOCATIONS contains one or more frames that are children of CCO-MEMBRANE, or potentially symbols that have to be unique in this slot, for situations where the metabolites are in spaces that are not directly adjacent to one membrane, or when three spaces are involved. If the reaction was not assigned to any particular membrane, then no value needs to be stored at all.
Additionally, each slot value in RXN-LOCATIONS will have annotations with the labels CCO-IN and CCO-OUT, and in the rare case of 3 compartments involved, also another label called CCO-MIDDLE. The values of these annotations have to be one valid child of CCO-SPACE each. These annotations define the mappings between the COMPARTMENT annotation values of the metabolites that are listed in the reaction’s LEFT and RIGHT slots, and the final compartments in this PGDB.
Every metabolite in the reaction’s LEFT and RIGHT slots needs to have a COMPARTMENT annotation, the value of which needs to be one of CCO-IN, CCO-OUT, or possibly CCO-MIDDLE in complex situations.
Example of slot values for a transport reaction in the periplasmic membrane:
LEFT:
AMMONIUM
---COMPARTMENT: CCO-OUT
RIGHT:
AMMONIUM
---COMPARTMENT: CCO-IN
RXN-LOCATIONS:
CCO-PM-BAC-NEG
---CCO-IN: CCO-CYTOSOL
---CCO-OUT: CCO-PERI-BAC
In BioCyc, compartments need to be valid frames in the CCO (Cell Compartment Ontology), and they can either be children of CCO-SPACE or CCO-MEMBRANE. The default compartment for metabolites and reactions is assumed to be CCO-CYTOSOL in the S case, if no other information is specified in the reaction’s RXN-LOCATIONS slot that would override this default. In the T case, nothing can truly be inferred about the default locations, if no membrane and other mappings for CCO-IN and CCO-OUT were specified. Such T reactions could not usefully become part of a reaction network model.
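The mapping rules for the S and T cases can be sketched in Python as below. This is a simplified model for illustration only: the functions `resolve_s_reaction` and `resolve_t_reaction` and the plain dictionaries standing in for slot annotations are hypothetical, while the frame names are taken from the examples above.

```python
from typing import Dict, List, Tuple

DEFAULT_COMPARTMENT = "CCO-CYTOSOL"


def resolve_s_reaction(rxn_locations: List[str]) -> List[str]:
    """S case: compartments in which the whole reaction occurs.

    With no RXN-LOCATIONS values, the default compartment applies.
    """
    return list(rxn_locations) if rxn_locations else [DEFAULT_COMPARTMENT]


def resolve_t_reaction(
    side: List[Tuple[str, str]],
    location_annotations: Dict[str, str],
) -> List[Tuple[str, str]]:
    """T case: map abstract labels (CCO-IN, CCO-OUT, CCO-MIDDLE) to real compartments.

    `side` holds (metabolite, COMPARTMENT annotation) pairs from LEFT or RIGHT.
    `location_annotations` holds the annotations on one RXN-LOCATIONS value.
    """
    resolved = []
    for metabolite, label in side:
        if label not in location_annotations:
            # Without a mapping nothing can be inferred for a T reaction.
            raise KeyError(f"no mapping for {label} on this RXN-LOCATIONS value")
        resolved.append((metabolite, location_annotations[label]))
    return resolved


# The ammonium transport example from the text.
annotations = {"CCO-IN": "CCO-CYTOSOL", "CCO-OUT": "CCO-PERI-BAC"}
left = resolve_t_reaction([("AMMONIUM", "CCO-OUT")], annotations)
right = resolve_t_reaction([("AMMONIUM", "CCO-IN")], annotations)
```

Raising an error for an unmapped label mirrors the statement that T reactions without mappings cannot usefully take part in a reaction network model.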
Whenever a reaction is transferred between PGDBs (by import or schema upgrade operations), all values in the RXN-LOCATIONS slot are filtered away (i.e. _not_ copied). This prevents inapplicable compartments from being introduced into other PGDBs. Instead, inferring appropriate compartments should be the responsibility of PathoLogic and TIP. The values of the RXN-LOCATIONS slot are listed in the flatfiles of the PGDB, though.
If the reaction is catalyzed by more than one enzyme (i.e. it has more than one enzymatic-reaction attached), then each value in the RXN-LOCATIONS slot has to have an annotation called ENZRXNS, which has as its values the frame IDs of the corresponding enzymatic-reactions. This allows determining the precise compartment(s) in which the catalyzed reaction is occurring.
PGDB objects contain evidence codes that describe the types of evidence that support and justify the inclusion of that entry in the PGDB. For example, evidence codes on an enzyme entry indicate the type of evidence supporting the inference that the enzyme catalyzes its associated biochemical reaction. See [1] for a general description of the evidence-code system. The evidence-code ontology is available here.
Because the data in MetaCyc are derived from the experimental literature, the vast majority of evidence codes within MetaCyc will be experimental evidence codes that identify different classes of experimental methods that support data within MetaCyc. However, the evidence-system supports a larger set of evidence types, including evidence based on computational inferences, which are used more extensively in the other BioCyc databases.
Evidence codes appear as icons in the upper-right corner of display pages such as pathway and enzyme pages. For example, a flask icon on a metabolic pathway page indicates that experimental evidence supports the existence of that pathway. Often a detailed evidence code has been assigned that indicates a specific type of experimental method; click on the flask icon to see a description of the type(s) of experimental method(s) used to elucidate the pathway (when available), and for associated citations (when available).
This section addresses the state of reaction mass balance and protonation state of chemical compounds in the BioCyc databases. Because these issues are still evolving and are influenced to a large degree by history, we include a historical discussion of these issues.
Our long-term goal is for all reactions in BioCyc to be fully mass balanced and charge balanced, and for all chemical compounds to be properly protonated at cellular pH. Although in some cases such a treatment may yield reactions or chemical structures that look non-traditional to biochemists, we believe this approach provides the most consistent and correct treatment. In addition, it provides a treatment that will facilitate automatic generation of flux-balance models from PGDBs.
Historically, the chemical structure data within BioCyc databases has been obtained from many different sources, including textbooks, articles from the primary research literature, and downloads from certain open databases. In the early years of the project we developed programs to check the mass balance and element balance of reactions within BioCyc databases. We found that these programs were extremely valuable because identification of unbalanced reactions allowed us to identify errors in both the reaction equations and the chemical structures. However, we also found that, because of the diverse sources from which we obtained chemical structure data, the structures were protonated inconsistently. Therefore, for many years we ignored element imbalances due to hydrogen only, while correcting imbalances due to other elements.
In 2008, we began to address the problem of inconsistent protonation to facilitate automatic generation of flux-balance models. Work was completed on ensuring that reactions in the MetaCyc and EcoCyc PGDBs are completely mass-balanced. The first releases of those fully mass-balanced MetaCyc and EcoCyc DBs were version 13.0 in early 2009. In time, other BioCyc PGDBs will become mass balanced as well. For example, because we periodically regenerate the Tier 3 BioCyc PGDBs, the next time these PGDBs are generated from version 13.0 or higher of MetaCyc, they will be based on the consistently protonated compounds, and the fully mass-balanced reactions.
The following sections describe the methodology by which the protonation-state normalization and reaction mass balancing were achieved.
For a given chemical compound, there can be atoms that will bind a variable number of hydrogen atoms, depending on their chemical structure and the pH of their environment. A term for the isomers of a compound that differ in the number of hydrogens bound to these atoms is proto-isomer. A term for the atoms with variable numbers of bonded hydrogens is the proto-isomerization centers of a compound. Oxygen, sulfur, phosphorus, and nitrogen are examples of typical proto-isomerization centers.
In order to bring a greater degree of consistency to our PGDBs, we protonated (i.e., assigned the correct number of bound hydrogens to the proto-isomerization centers of a compound) the compounds of EcoCyc with a reference pH value of 7.3, using the Marvin (version 5.1.02) computational chemistry software available from ChemAxon, Ltd [2]. The pH value of 7.3 was selected based on a paper on the measurement of cytoplasmic pH of E. coli [3]. In order to easily exchange compound data between MetaCyc and EcoCyc, MetaCyc was also protonated with a reference pH value of 7.3. This step is an approximation since MetaCyc contains reactions and compounds from many organisms and many cellular compartments.
The Marvin software calculates the protonation state of a compound’s proto-isomerization centers by first determining their pKa values. The pKa values of the proto-isomerization centers of a compound are obtained by computing the partial charge distribution. This, in turn, is calculated using a numerical partial differential equation solver, which computes the distribution from the structure of the compound and the known electronegativities of the constituent atoms. Although we have worked with ChemAxon to improve the accuracy of their calculations to match the experimentally verified pKa values of many biochemically relevant compounds, this calculation is still based on an approximation technique, and will not necessarily yield fully correct pKa values for every substance.
Some caveats about our protonation of compounds:
Some compounds are present in multiple reactions that take place in various different compartments in a cell, or across membranes, where the pH might vary from our stated value of 7.3.
For any given compound, only one proto-isomer is present in our PGDBs. We do not represent the other proto-isomers, nor do we represent the proto-isomerization reactions that inter-convert the various proto-isomers of a compound.
Sometimes a pKa value for a proto-isomerization center is very close to the pH of the solution, and therefore there is approximately a 50/50 split between the relative abundance of the two proto-isomers of that compound in solution. In such situations, the Marvin software will select the most likely proto-isomer based on a comparison of the floating point values of the relative abundances.
Our compounds might have a slightly different structure than what you will find for the same compound in an alternate chemical compound database. Please ensure that you are comparing the two compounds for the correct protonation state at a reference pH value of 7.3.
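The 50/50 caveat above follows from the Henderson-Hasselbalch relation for a single acidic site. The sketch below is only that textbook relation, assuming one independent site; it is not the Marvin algorithm, and the function names are hypothetical.

```python
REFERENCE_PH = 7.3  # reference pH used for EcoCyc and MetaCyc, as stated above


def deprotonated_fraction(pka: float, ph: float = REFERENCE_PH) -> float:
    """Fraction of a single acidic site found in its deprotonated form.

    Henderson-Hasselbalch: [A-]/[HA] = 10**(pH - pKa).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def dominant_form(pka: float, ph: float = REFERENCE_PH) -> str:
    """Pick the more abundant form; an exact tie falls to 'protonated' here."""
    return "deprotonated" if deprotonated_fraction(pka, ph) > 0.5 else "protonated"
```

When the pKa equals the pH the fraction is exactly one half, so the choice of a single proto-isomer rests on very small numerical differences. A site whose pKa lies two or more units away from 7.3 is at least 99% in one form, and the choice is unambiguous.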
Once the compounds of EcoCyc and MetaCyc were protonated, all reactions that had a mass-imbalance due only to hydrogen atoms were computationally balanced. This balancing procedure added or removed instances of the proton from the appropriate side of a reaction to achieve mass-balance.
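A minimal sketch of hydrogen-only balancing is shown below. It is not the program used for EcoCyc and MetaCyc: the function names are hypothetical, formulas are given as plain element counts, and the sketch only adds protons, whereas the procedure described above could also remove them. The ATP hydrolysis formulas are illustrative values for the species ATP(4-), ADP(3-), and hydrogen phosphate(2-).

```python
from collections import Counter
from typing import Dict, List, Tuple

Formula = Dict[str, int]            # element symbol -> atom count
Side = List[Tuple[int, Formula]]    # (stoichiometric coefficient, formula)


def element_totals(side: Side) -> Counter:
    """Sum atom counts over one side of a reaction."""
    totals: Counter = Counter()
    for coeff, formula in side:
        for element, count in formula.items():
            totals[element] += coeff * count
    return totals


def balance_hydrogen(left: Side, right: Side) -> Tuple[int, int]:
    """Return (protons to add on the left, protons to add on the right).

    Raises ValueError if any element other than hydrogen is imbalanced,
    since such a reaction needs curation rather than proton adjustment.
    """
    l, r = element_totals(left), element_totals(right)
    for element in set(l) | set(r):
        if element != "H" and l[element] != r[element]:
            raise ValueError(f"imbalance in {element}; not fixable with protons")
    diff = l["H"] - r["H"]
    return (0, diff) if diff > 0 else (-diff, 0)


# ATP + H2O -> ADP + phosphate, with all species protonated for pH 7.3.
atp = {"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}
water = {"H": 2, "O": 1}
adp = {"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}
phosphate = {"H": 1, "O": 4, "P": 1}
protons = balance_hydrogen([(1, atp), (1, water)], [(1, adp), (1, phosphate)])
```

For this example one proton is needed on the right-hand side, giving the familiar ATP + H2O → ADP + phosphate + H+. The sketch checks elements only; it does not verify charge balance.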
Some caveats about our computational reaction balancing:
You might notice some reactions that have more or fewer protons participating than you would typically see depicted. This might be most evident in our EC reactions. One reason for this, beyond our computational reaction balancing, is that traditionally protons and other small, ubiquitous chemical moieties were considered auxiliary to the main function of a reaction and thus not depicted. In general, our EC reactions may vary from the IUBMB reactions by including more or fewer protons than the original reaction.
For use in FBA models, one must be aware that we are only representing one of the possibly many proto-isomers of a compound. We also do not represent the fast protonation reactions that inter-convert the proto-isomers. Thus, an FBA model that is attempting to simulate the flux of hydrogen in a PGDB may be inaccurate.
As of 2009, reactions that were computationally balanced for mass are not necessarily balanced for charge.
The table at the end of this section provides information on the small-molecule reaction balance state for both EcoCyc and MetaCyc as of early 2009. Its categories represent reactions that are balanced, reactions that are unbalanced, and reactions for which it is not possible to determine the balance state.
Reactions remain unbalanced because of non-trivial imbalances (i.e., imbalances not due solely to hydrogens or protons). These imbalances are usually due to omissions or errors in the structures and/or reaction composition obtained from the literature. Our curation staff are actively researching such compounds and reactions and correct the data whenever possible.
Reactions for which it is not possible to determine the balance state are mainly of the following kinds:
Reactions that have compound classes as substrates
Polymerization reactions. As of the beginning of 2009, BioCyc.org is working to extend our representation of polymerization reactions to allow for mass and charge balance.
Reactions with substrates that lack a chemical structure
Reactions with substrates that include R-groups
| Category | Number of Reactions |
| EcoCyc: Balanced Reactions | 801 |
| EcoCyc: Unbalanced Reactions | 3 |
| EcoCyc: Reactions that cannot be balanced | 160 |
| MetaCyc: Balanced Reactions | 5,098 |
| MetaCyc: Unbalanced Reactions | 317 |
| MetaCyc: Reactions that cannot be balanced | 1,143 |
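The counts in the table can be turned into proportions with a few lines of arithmetic. The helper `balanced_fraction` below is a hypothetical convenience function; the numbers are copied from the table.

```python
def balanced_fraction(balanced: int, unbalanced: int, undetermined: int) -> float:
    """Share of all small-molecule reactions that are balanced."""
    return balanced / (balanced + unbalanced + undetermined)


# Counts as of early 2009, from the table above.
ecocyc = balanced_fraction(801, 3, 160)
metacyc = balanced_fraction(5098, 317, 1143)

print(f"EcoCyc balanced:  {ecocyc:.1%}")
print(f"MetaCyc balanced: {metacyc:.1%}")
```

On these figures roughly five in six EcoCyc reactions and somewhat more than three in four MetaCyc reactions were balanced, with most of the remainder falling in the category whose balance state cannot be determined.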
[1] P.D. Karp, S. Paley, C.J. Krieger, and P. Zhang. An evidence ontology for use in pathway/genome databases. In R. Altman and T. Klein, editors, Proc Pacific Symposium on Biocomputing, pages 190–201, Singapore, 2004. World Scientific.
[2] Marvin Chemical Editor. http://www.chemaxon.com/products/marvin/.
[3] J.C. Wilks and J.L. Slonczewski. pH of the cytoplasm and periplasm of Escherichia coli: rapid measurement by green fluorescent protein fluorimetry. J Bacteriol, 189(15):5601–7, 2007. Epub 2007 Jun 1.