
Report on the Computational Biology Workshop for the Genomes to Life Program
U.S. Department of Energy, Germantown, Maryland
August 7–8, 2001

Summary of Overview Talks and Discussions

A series of overview talks was presented with the goal of summarizing the current state of the art most relevant to the five aims of computational biology as stated in the GTL roadmap. Although these talks covered different topics (see summaries below), a number of common issues surfaced in all the presentations and subsequent discussions. Most prominent were the issues of data integration, data mining, derivation of knowledge from diverse data sources, data management, and synthesis of information from a large number of scientific publications.

Aim 1. Develop Methods for High-Throughput Automated Genome Assembly and Annotation

Genome assembly relies on mature approaches and algorithms. Current implementations can assemble whole mammalian genomes in a matter of tens of hours or less, using currently available computers. A recent assembly of all current public mouse genome sequences (approximately 2.7X coverage) took 8 hours using the NERSC Phase II system. Further development is needed for highly repetitive sequence domains and for assembling sequences from mixtures of microorganisms.
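
As a rough illustration of what a coverage figure such as 2.7X implies, the sketch below uses the idealized Lander-Waterman (Poisson) model. This model is an assumption introduced here for illustration and was not part of the workshop presentation. Under it, the expected fraction of bases hit by at least one read at coverage c is 1 - exp(-c). The model ignores repeats and sampling bias, which are exactly the areas the report flags as needing further work.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases covered by at least one read.

    Assumes reads land uniformly at random (idealized Poisson model),
    which real genomes with repeats and cloning bias do not satisfy.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # 2.7X is the figure quoted in the text; the others are for comparison.
    for c in (1.0, 2.7, 5.0, 10.0):
        frac = expected_fraction_covered(c)
        print(f"{c:4.1f}X coverage: about {frac:.3f} of bases expected to be covered")
```

Under these assumptions, 2.7X coverage leaves several percent of the genome unsampled, which is one reason a low-coverage assembly remains fragmented.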

Sequence annotation and comparative analyses across multiple genomes are recurring computational tasks that require a high-performance computing infrastructure to ensure that regular information updates are part of the most current annotation and to facilitate interactive exploratory genome analyses. For example, the genome analysis resource established at Oak Ridge National Laboratory (ORNL) is making extensive use of the computing resources at the ORNL Center for Computational Sciences, which include multiteraflop systems by IBM and Compaq. Annotation goes far beyond finding coding regions in genome sequences. Finding regulatory elements is an unsolved research problem in even the simplest genomes and is expected to involve significant computational and mathematical challenges. Some analysis of regulatory regions can be accomplished by large-scale genome comparisons.

There remain significant research challenges in high-level annotation, including assignment of functions to every gene found in whole-genome sequences. This is particularly difficult because the pathway databases are incomplete and the microbial genomes encode metabolic pathways about which there is very little biochemical data. At this time, most of the genes found in new genomic sequences do not have assigned functions. Some functions can be inferred by computational structure determination and protein folding, but a wide range of research problems remains to be solved in this area. Challenges in large-scale genome annotation easily could outpace the development of high-performance computer hardware and the software environments for effectively using that hardware. Within the next 5 years, genome sequences are likely to be completed at rates 10 to 100 times the current pace. High-throughput analytical approaches, as well as the informatics capabilities to manage the data and information for easy access by the biological research community, will present significant research challenges in this time period.

Aim 2. Develop Computational Tools to Support High-Throughput Experimental Measurements of Protein-Protein Interactions and Protein-Expression Profiles

The presentation focused primarily on high-throughput analysis of gene-expression profiles and relatively less on protein expression or protein-protein interactions. An enormous amount of data is being produced by experiments involving microarrays of oligonucleotides, cDNAs, and proteins/antibodies—all involving various tissues, exposures, other experimental conditions, and time-course studies. There are challenges associated with data quality, statistical analysis, variability of assays, and, in general, data-set reproducibility. Several analysis methods have been applied to microarray data sets. Various clustering approaches, singular-value decomposition, and pattern-recognition methods including several classes of neural net–based methods have been used. All current approaches fail to integrate into the analysis the often-substantial body of pre-existing knowledge, and most fail to account for experimental errors.
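
To make the idea of clustering expression profiles concrete, here is a minimal sketch of one simple approach: grouping genes whose time courses correlate strongly. The greedy scheme, the threshold, and the gene names and values are all hypothetical and chosen only for illustration; they do not represent any specific method discussed at the workshop. Like the methods criticized above, this sketch uses no prior knowledge and no model of measurement error.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def correlation_clusters(
    profiles: Dict[str, List[float]], threshold: float = 0.9
) -> List[List[str]]:
    """Greedy clustering: each gene joins the first cluster whose seed
    (first member) it correlates with at or above the threshold;
    otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Made-up log-ratio time courses for four hypothetical genes.
EXAMPLE_PROFILES = {
    "geneA": [0.1, 0.8, 1.6, 2.1, 2.0],
    "geneB": [0.0, 0.9, 1.5, 2.3, 1.9],
    "geneC": [2.0, 1.4, 0.7, 0.2, 0.1],
    "geneD": [1.8, 1.5, 0.9, 0.1, 0.0],
}

if __name__ == "__main__":
    for members in correlation_clusters(EXAMPLE_PROFILES):
        print(members)
```

On this toy input the induced genes (geneA, geneB) and the repressed genes (geneC, geneD) fall into separate groups. The result depends on input order and on the threshold, which is part of why reproducibility and statistical rigor are listed as open challenges.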

The situation is similar for the analysis and management of other types of biological data, such as mass spectral expression data or yeast two-hybrid data on protein-protein interactions. For all high-throughput experimental methods in biology, significant work is required to develop the tools for statistical analysis, interpretation, annotation, and curation of the data. Furthermore, the full promise of these experimental methods will be achieved only if methods can be developed for integrating the different data types. For example, mass spectrometry (MS), nuclear magnetic resonance (NMR), and crystallography generate complementary data on proteins and complexes that, if integrated appropriately, can have significant impact on accomplishing the stated goals of the GTL program.

Aim 3. Develop Predictive Models of Microbial Behavior Using Metabolic-Network Analysis and Kinetic Models of Biochemical Pathways

One of the ultimate goals of the GTL program is predictive modeling of microbes and microbial communities. This presentation described the current state of the art for data-driven approaches to deriving metabolic networks from “parts lists” of enzymes involved in the pathways. The approach presented subjects metabolic networks to known constraints, yielding a solution space that describes which biochemical behaviors are possible and under what conditions the reactions will occur. Constraints include capacity, maximum flux, connectivity, systemic stoichiometry, and physical/chemical factors (e.g., osmotic pressure, enzyme kinetics, and regulation). The red blood cell metabolic network was presented as an example, with 32 reactions, 29 external signals, and 19 metabolites. Recent work also has shown that using genome data and other information to predict many of the characteristics of Saccharomyces cerevisiae is possible. Shifts in gene-expression profiles can be predicted with 75% to 80% accuracy.
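
The constraint-based idea can be sketched on a toy network. The example below is hypothetical (three metabolites, five reactions, made-up capacity bounds) and is not the red blood cell model from the talk. It checks whether a candidate flux vector lies inside the solution space defined by two of the constraint types named above: stoichiometry (no net accumulation of internal metabolites at steady state) and capacity (each flux within its bounds).

```python
from typing import List, Sequence, Tuple

# Hypothetical toy network: uptake -> A, A -> B, A -> C, B -> out, C -> out.
METABOLITES = ["A", "B", "C"]
REACTIONS = ["uptake", "A_to_B", "A_to_C", "export_B", "export_C"]

# Stoichiometric matrix: rows are metabolites, columns are reactions.
STOICH: List[List[int]] = [
    [1, -1, -1, 0, 0],   # A
    [0, 1, 0, -1, 0],    # B
    [0, 0, 1, 0, -1],    # C
]

# Made-up capacity constraints (lower, upper) for each flux.
BOUNDS: List[Tuple[float, float]] = [(0, 10), (0, 8), (0, 8), (0, 8), (0, 8)]


def net_production(stoich: Sequence[Sequence[float]], flux: Sequence[float]) -> List[float]:
    """Net production rate of each metabolite (the matrix-vector product S v)."""
    return [sum(c * f for c, f in zip(row, flux)) for row in stoich]


def is_steady_state(stoich, flux, tol: float = 1e-9) -> bool:
    """True if no internal metabolite accumulates or is depleted."""
    return all(abs(rate) <= tol for rate in net_production(stoich, flux))


def within_capacity(flux, bounds) -> bool:
    """True if every flux lies within its (lower, upper) bound."""
    return all(lo <= f <= hi for f, (lo, hi) in zip(flux, bounds))


def is_feasible(flux: Sequence[float]) -> bool:
    """A flux vector is in the solution space if it meets both constraint types."""
    return is_steady_state(STOICH, flux) and within_capacity(flux, BOUNDS)


if __name__ == "__main__":
    print(is_feasible([10, 6, 4, 6, 4]))   # balanced and within bounds
    print(is_feasible([10, 6, 6, 6, 6]))   # A is consumed faster than supplied
    print(is_feasible([12, 6, 6, 6, 6]))   # balanced, but uptake exceeds capacity
```

A real analysis would go on to characterize the whole feasible region, for example with linear programming, instead of testing individual points.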

As this systems-level approach to understanding bacterial systems develops, several questions must be addressed. What are the biological design variables? Can biological systems be modeled in the same detail as physical/chemical systems? How do physical/chemical principles and approximations developed for modeling nonliving systems apply to the simulation of living systems? Are numerical values for parameters, such as enzyme-catalyzed reaction rates, known, or even knowable, since such properties change with time and environmental conditions, and from individual to individual?

Remaining challenges include the incorporation of kinetics and regulatory controls in current modeling approaches. Some molecules, including certain proteins and chemical signals, occur in such small numbers in the cell that they cannot be described accurately in terms of continuous concentrations. Instead, they must be described using discrete numbers of molecules, an approach that requires more complex mathematics and extensive statistical sampling; this approach is better simulated on novel architectures. Other challenges arise with questions of optimality criteria used in biological systems. For example, Bacillus subtilis does not appear to be optimized for growth, whereas Escherichia coli does.
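
One widely used way to simulate discrete molecule counts is the Gillespie stochastic simulation algorithm. The report does not name a specific method, so its use here is an assumption, and the single-species birth-death system and rate constants below are hypothetical. The sketch shows why statistical sampling is needed: each run gives one random trajectory, and many runs are required to estimate distributions.

```python
import random
from typing import List, Tuple


def gillespie_birth_death(
    k_make: float, k_decay: float, n0: int, t_end: float, seed: int = 0
) -> List[Tuple[float, int]]:
    """Simulate production (rate k_make) and first-order decay (rate k_decay * n)
    of a single molecular species, returning (time, count) pairs."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        a_make = k_make              # propensity of the production event
        a_decay = k_decay * n        # propensity of the decay event
        a_total = a_make + a_decay
        if a_total == 0.0:
            break                    # nothing can happen any more
        t += rng.expovariate(a_total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose which event fires, in proportion to its propensity.
        if rng.random() * a_total < a_make:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


if __name__ == "__main__":
    # Hypothetical rates; the long-run mean count is k_make / k_decay = 5.
    for seed in range(3):
        run = gillespie_birth_death(k_make=1.0, k_decay=0.2, n0=0, t_end=50.0, seed=seed)
        print(f"seed {seed}: {len(run) - 1} events, final count {run[-1][1]}")
```

With an average of only a few molecules, the count fluctuates by a large fraction of its mean, behavior that a continuous concentration model cannot represent.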

The discussions made clear that reaching the ultimate goal of predictively modeling such complex biological systems as cells requires many fundamental advancements, ranging from a better understanding of nonequilibrium processes, to the collection of complete data sets describing the properties of a cell. Hence, at this time, the limiting issue is not the availability of computational resources. Finally, the point was made that a significant amount of needed computing work actually requires integer arithmetic and rule-based systems. Participants recognized that vendors in the high-performance computing arena are unlikely to produce special-purpose hardware, but there may be opportunities to encourage vendors to optimize future processors for integer operations.

Aim 4. Develop and Apply Advanced Molecular and Structural Modeling Methods for Biological Systems

This talk described the current state of the art in the whole range of molecular-simulation methods, from the computational prediction of protein structure based on experimental data, to first-principles simulations of biochemical processes. The presentation began with a description of the wide range of size and time scales involved in biological systems, pointing out that different simulation approaches would be appropriate at different levels of description. These methods include, at the highest levels, qualitative network analyses of biological pathways (e.g., with Petri nets) and quantitative network analysis (e.g., using the Monte Carlo approach), and range all the way down to molecular simulations of protein-protein interactions and quantum mechanical (QM) predictions of chemical reaction energies.
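
For readers unfamiliar with Petri nets, the following minimal sketch shows the basic mechanics on a hypothetical enzyme reaction (substrate + enzyme -> complex -> product + enzyme). Places hold integer token counts, and a transition can fire only when every input place holds enough tokens. The network, names, and firing order are illustrative assumptions, not a model presented in the talk.

```python
from typing import Dict, Tuple

Marking = Dict[str, int]

# Each transition maps to (tokens consumed per place, tokens produced per place).
TRANSITIONS: Dict[str, Tuple[Marking, Marking]] = {
    "bind": ({"substrate": 1, "enzyme": 1}, {"complex": 1}),
    "catalyze": ({"complex": 1}, {"product": 1, "enzyme": 1}),
}


def enabled(marking: Marking, name: str) -> bool:
    """A transition is enabled when every input place has enough tokens."""
    inputs, _ = TRANSITIONS[name]
    return all(marking.get(place, 0) >= count for place, count in inputs.items())


def fire(marking: Marking, name: str) -> Marking:
    """Return the new marking after firing one enabled transition."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new_marking = dict(marking)
    for place, count in inputs.items():
        new_marking[place] -= count
    for place, count in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + count
    return new_marking


def run_to_deadlock(marking: Marking) -> Marking:
    """Fire the first enabled transition repeatedly until none is enabled."""
    marking = dict(marking)
    while True:
        ready = [name for name in TRANSITIONS if enabled(marking, name)]
        if not ready:
            return marking
        marking = fire(marking, ready[0])


if __name__ == "__main__":
    start = {"substrate": 3, "enzyme": 1, "complex": 0, "product": 0}
    print(run_to_deadlock(start))
```

This kind of analysis is qualitative: it answers which states are reachable (here, all substrate is converted and the enzyme is recovered) without requiring any rate constants.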

The talk included an overview of methods for predicting protein structures and also discussed many of the challenges to first-principles predictions of protein structure. These challenges include the long time scales (milliseconds to seconds) and very subtle energetics (often less than 10 kcal/mole) for protein folding. Nevertheless, empirically based methods including comparative modeling and “threading” often can successfully predict protein structure based on sequence similarities to proteins for which structures are known.

The talk went on to describe the two principal approaches to modeling biological processes at the molecular level. The most accurate are QM methods, which involve approximately solving the Schrödinger wave equation for the motion of electrons in atoms and molecules. There is a large hierarchy of methods for solving the electronic Schrödinger equation, ranging from those that scale almost linearly in the number of atoms to much more accurate methods that scale as the seventh power of the number of atoms in the system. Although the best of these methods can achieve accuracies for energies and structures as good as or better than experimental methods, they are too computationally costly to be applied to most biochemical processes. Research is needed to develop versions of these methods that scale less steeply with system size, or to develop ways to empirically correct less costly methods.
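
The practical meaning of these scaling laws is easy to work out. The short sketch below compares how the cost grows when the number of atoms increases by a given factor, for a method whose cost is proportional to N and for one whose cost is proportional to N^7. Only the exponents come from the text; prefactors are ignored, so the numbers are relative growth factors and not run times.

```python
def cost_ratio(size_factor: float, exponent: float) -> float:
    """Factor by which cost grows when system size grows by size_factor,
    for a method whose cost is proportional to N ** exponent."""
    return size_factor ** exponent


if __name__ == "__main__":
    for factor in (2, 10):
        linear = cost_ratio(factor, 1)
        seventh = cost_ratio(factor, 7)
        print(f"{factor}x more atoms: linear method costs {linear:g}x, "
              f"seventh-power method costs {seventh:g}x")
```

Doubling the system size multiplies the cost of a seventh-power method by 128, and a tenfold increase multiplies it by ten million, which is why such methods cannot simply be carried over from small molecules to proteins.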

The other approach to modeling molecular systems uses the much less accurate classical (ball-and-spring) force fields to describe the atomic interactions, but this method can be applied to much larger systems and much longer time scales. Such approaches include both molecular dynamics (MD), in which the motion in time of each atom is simulated, and Monte Carlo, in which a large ensemble of atomic configurations is randomly generated and sampled.
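
The difference between the two classical approaches can be shown on the simplest possible ball-and-spring system, a single particle on a harmonic spring. The sketch below is illustrative only: the velocity Verlet integrator and Metropolis sampling are standard choices assumed here, the report does not specify algorithms, and all parameters are in arbitrary reduced units.

```python
import math
import random
from typing import List, Tuple


def energy(x: float, v: float, k: float, m: float) -> float:
    """Total energy: kinetic plus harmonic potential 0.5 * k * x^2."""
    return 0.5 * m * v * v + 0.5 * k * x * x


def velocity_verlet(
    x: float, v: float, k: float, m: float, dt: float, steps: int
) -> List[Tuple[float, float]]:
    """MD: follow the motion in time by integrating Newton's equations."""
    a = -k * x / m
    trajectory = [(x, v)]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append((x, v))
    return trajectory


def metropolis(
    k: float, kT: float, steps: int, step_size: float = 1.0, seed: int = 0
) -> List[float]:
    """Monte Carlo: sample positions from the Boltzmann distribution
    by accepting or rejecting random trial moves (no notion of time)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(steps):
        trial = x + rng.uniform(-step_size, step_size)
        delta_e = 0.5 * k * (trial * trial - x * x)
        if delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kT):
            x = trial
        samples.append(x)
    return samples


if __name__ == "__main__":
    traj = velocity_verlet(x=1.0, v=0.0, k=1.0, m=1.0, dt=0.01, steps=1000)
    e_start = energy(*traj[0], k=1.0, m=1.0)
    e_end = energy(*traj[-1], k=1.0, m=1.0)
    print(f"MD energy drift over 1000 steps: {abs(e_end - e_start):.2e}")

    samples = metropolis(k=1.0, kT=1.0, steps=20000)
    mean_sq = sum(s * s for s in samples) / len(samples)
    print(f"Monte Carlo <x^2> = {mean_sq:.2f} (theory for this potential: kT/k = 1)")
```

MD yields a time-ordered trajectory whose step size is tied to the fastest motion in the system, whereas Monte Carlo yields equilibrium averages without any time information.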

A continuing challenge is long-time-scale MD for slow events (which actually involve multiple time scales). The issue of reaching macroscopic time scales from MD simulations cannot be solved solely by increases in hardware, that is, in the number of processors. Development of theoretically sound, time-coarsening methodologies is needed to permit dynamics-based methods for traversing much longer time scales. Another related high-priority research area is the development of improved force fields for MD, such as those that include polarization effects.

There are many areas of active research aimed at improving molecular-simulation methods. Promising emerging methods include mixed QM/molecular mechanics methods that may allow accurate QM methods to be applied only in the regions where they are necessary, such as in enzyme-active sites, while the larger system is modeled classically. Another area of active research is first-principles molecular dynamics simulation, which involves using a fully QM description of the atomic interactions and electronic structure calculations (Car-Parrinello approaches). These methods have been demonstrated to yield extremely accurate properties for water, solvated ions, and very small biochemical systems, but they are limited computationally to very short time scales and system sizes.

The talk concluded by exploring the developments necessary to transform biology into a “systems science.” Systems biology as described in the GTL roadmap requires significant expertise and resources that cross traditional disciplinary boundaries. Also needed is the development of new theories and mathematics, as well as the development of new algorithms, their implementation on high-performance computer systems, and extensive use of large, distributed, and heterogeneous databases with wide availability to make software and computer systems usable. Ultimately, this will lead to a new model for biological analysis that will involve a cycle beginning with the computational synthesis of available biological information to formulate specific biological hypotheses that will drive new experiments or, in some cases, specific computational simulations in place of experiments. The data from the new experiments will feed back into the next round of synthesis and hypothesis development.

Aim 5. Develop the Groundwork for Large-Scale Biological Computing Infrastructure and Applications

In addition to addressing requirements in terms of compute cycles and connectivity of computational resources for GTL, an important step is to address and resolve serious issues concerning data resources and access methods. The current state of the art in this arena for biology is less than desirable. There are myriad data silos and a few monolithic, asymmetric cross-references. A consequence of this poor data integration is the propagation of spurious information in databases; for example, there is the not-infrequent situation where gene A has a low level of similarity with gene B in another organism, and researchers find a gene similar to gene A and then claim it has the function of gene B. Many data resources have limited, idiosyncratic querying capabilities that are designed mostly for browsing human data. There are no third-party annotation mechanisms in common use. The distributed annotation system effort (http://stein.cshl.org/das/) under development shows great promise to remedy this deficiency. There are no accepted standards for defining, querying, and transmitting common data objects, nor are there effective strategies for discouraging data hoarding (delayed releases of data are not uncommon).

GTL will span the entire range of genomics—including sequence, proteins, expression function, and pathways—and the resolution of the data problems outlined above is paramount to the success of GTL. Scaling is a huge challenge for GTL, but scaling of data volume is only one part of the problem. An equally difficult challenge will be the seamless integration of such data resources as genomic sequence, protein analysis, genomic and protein expression arrays, and pathway information. Accomplishing the scaling among multiple laboratories will be even harder. Integration in the field of genomics is historically spotty at best, and GTL will bring in different disciplines, each with its own agenda. 

The key question is less “What databases does GTL need to build?” than how to establish a free and level market for data, so that GTL has a chance to scale and succeed. With such a free market for data, open competition could establish the needed data resources and integration. Free-market design principles for GTL data resources should include:

Data integration should be competed openly, not with the establishment of monolithic sites. GTL should provide grants for information-integration services and tools, and it should actively participate in genomics standards/integration efforts in the larger community. Traditional integration methods may have merit for some aspects of GTL: language-based approaches; flat file, text retrieval, and search engines; data federation and distributed databases; classical data warehousing; centralization; and web robots/agents. Each method will benefit from free-market data access.

Collaboratories and computational grids collect resources under a common set of middleware. The details of specific distributed resources are not apparent. Biology already has grids that come from a natural method of scientific investigation (i.e., inference from many data sources and analyses). However, the biology community neglected to use computer science terminology for this environment. An explicit GTL grid would encompass data and computational resources as well as collaboration technologies. Common technologies would enable annotation jamborees and other intensely interactive and computer-enabled biological investigations without scientists having to be physically at one site. A GTL grid would include several experimental devices, such as mass spectrometers, NMR systems, light and neutron sources, and other experimental facilities. This grid would tightly couple the experimentalists with computational experts and resources.

Application software infrastructure is equally important. The GTL program should create a free market for GTL software, with open sources and access available to consortium members.

The computer science community believes petaflop machines will be possible and personal teraflop machines will be available in the next 5 years. The amount of computing machinery that will be available as distributed resources will be amazing. As the GTL program develops, the next generation of computers (possibly with hundreds of thousands, or millions, of processors) will become operational. Biologists, like other user communities, will face the problem of developing algorithms that scale to petaflops. The problem of systems integration will become more important than in any biology program before GTL. The Defense Advanced Research Projects Agency has been dealing with issues of systems integration in several of its programs (e.g., Bio-Spice and Image Understanding). This paradigm is worth evaluating for GTL. Moreover, with a revolution in broadband networking expected over the next 5 years, raw, long-haul bandwidth may not be a limiting factor for the success of GTL.

Several developments are under way with respect to standards. For example, researchers at the University of Washington and the California Institute of Technology are writing CellXML to simulate cell functions. The successful example of the U.S. Department of Defense enforcing a hardware design standard indicates that an agency can make a huge difference toward developing a culture of interoperability. GTL needs to be more than the sum of independent, lab-centric projects bolted together. DOE could significantly influence the development of a set of interoperability standards for the biology community. GTL’s chances for success will be seriously compromised if its informatics and computational biology infrastructure is not treated as a first-class component of the program from the beginning.