
Report on the Computational Biology Workshop for the Genomes to Life Program
U.S. Department of Energy, Germantown, Maryland
August 7–8, 2001

Summary of Breakout Discussions

Based on the issues raised during the overview talks and discussions, three broad areas were chosen for more detailed analysis by three breakout groups. Each group, consisting of 10 to 15 people, met for 2 hours and then reported back to the entire workshop. These discussions are summarized below.

Biological Data Management, Analysis, and Access

This breakout group addressed an issue that emerged repeatedly during the workshop: the special challenge of data management in achieving the goals of Genomes to Life. A key component of GTL (and systems biology generally) is data integration, and there is a critical need for tools that allow biologists to derive inferences from massive amounts of heterogeneous and distributed biological data. The working group developed a long list of recommendations that provide a general framework for planning in this area. The recommendations ranged from data sharing and ownership to the computational hardware and communication bandwidth necessary to manage biological data. Most critically, this group emphasized that the challenges of data management and integration need to be addressed with high priority from the start of GTL.

Technically, GTL will need a flexible data framework because biology is moving at a fast pace. The types of data will be determined by experiments and also will impact infrastructure requirements. For this reason, the data-analysis and -storage strategies should be allowed to evolve over time in an organized and timely way. Despite this need for flexibility, the program needs a conceptually centralized integration repository—one portal to access data, with principles that define data interfaces.

The working group concluded that this data-management effort is too large to be independently solved within any single program. In particular, GTL should leverage the tools and intellectual output of SciDAC and other efforts in collaborative computing environments and scientific visualization. Investments are needed in integrated databases and new and improved algorithms that scale as the volume of data grows and the GTL program matures. However, the group also stressed that many of the issues in informatics for GTL will be solved by novel applications of existing techniques in computer science, mathematics, and statistics and will not always require fundamental research in these disciplines. For this reason, some mechanism is necessary to recognize and reward collaborative work among disciplines that primarily involves the transfer of established methods.

Finally, this subgroup concluded that a number of tasks in data management and annotation will have very large computational demands and, therefore, that high-performance computing resources must be available to the biology user community involved in data assembly, annotation, and curation. For some applications, compute-cycle requirements can be predicted, but for others the nature of the problems requires advancements in methods, so the algorithms and high-performance computing requirements are not yet clear.

Ultimately, the success of GTL will be judged by how well the program is accepted and serves groups within DOE and, just as importantly, the broader life sciences community. To achieve this success, the GTL program needs a new paradigm on data ownership in which the data is openly available.

Computational Prediction of Structure, Function, and Interactions

This subgroup focused on three aspects of GTL that will involve molecular-level simulations and prediction: high-throughput protein-structure prediction for genome functional annotation; integrated experimental and computational approaches to structures and function for hard-to-isolate proteins and complexes; and, for a selected set of proteins and protein complexes critical to the GTL program, advanced molecular simulations of biochemical activity.

Prediction of protein function will involve the use of a number of methods, including structure prediction by comparative modeling and threading, “Rosetta”-type methods, and those based on phylogeny. All of these methods will need extensive further development to be applied automatically, especially for large, multidomain proteins. There also is a need for research into such new approaches as evolutionary methods to analyze structure/function relationships.

Another issue emphasized by the subgroup was that because all current methods for annotating structure and function require finished genome sequences, either resources must be devoted to completely finishing the genomes or computational approaches must be developed to effectively annotate unfinished sequences.

Whole-genome functional annotation will require significant computer resources. For example, estimates based on recent high-throughput protein-threading studies predict that a half-teraflop computer could thread 200 genes per day, so that threading a whole bacterial genome would take 2 to 4 weeks. Assuming that the GTL program will involve sequencing 20 bacteria per year, 2 to 5 teraflops of sustained computing power will be required to keep up with that sequencing rate. Further, more advanced annotation methods will require significantly more computer resources, and there are certain types of protein structures (e.g., membrane-bound proteins) for which wholly new structure- and function-prediction methods will be necessary.
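
The per-genome threading time quoted above follows from simple throughput arithmetic. The Python sketch below is illustrative only: the function names and the gene counts (2,800 and 5,600, chosen so that 200 genes per day gives 2 to 4 weeks) are assumptions, not figures from the workshop. It models threading throughput alone with linear scaling, so the sustained-capacity value it returns is a lower bound. It is not meant to reproduce the 2 to 5 teraflop estimate, which may reflect factors the report does not itemize.

```python
def threading_days(genes_per_genome, genes_per_day=200.0):
    """Days needed to thread one genome at a fixed daily throughput."""
    return genes_per_genome / genes_per_day


def sustained_teraflops(genomes_per_year, days_per_genome,
                        machine_teraflops=0.5, days_per_year=365.0):
    """Average capacity needed to keep pace, assuming linear scaling.

    machine_teraflops is the size of the reference machine that achieves
    the quoted throughput (one-half teraflop in the report).
    """
    machine_years = genomes_per_year * days_per_genome / days_per_year
    return machine_years * machine_teraflops


if __name__ == "__main__":
    # Hypothetical genome sizes (assumed, see lead-in).
    for genes in (2800, 5600):
        days = threading_days(genes)
        need = sustained_teraflops(20, days)
        print(f"{genes} genes: {days:.0f} days per genome, "
              f"{need:.2f} TF sustained for threading alone")
```

Keeping the throughput and machine size as parameters makes it easy to see how the estimate shifts when a more expensive annotation method lowers the number of genes processed per day.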

Ultimately, reliable, high-throughput determination of protein and protein-complex structures and functions will require computational methods capable of integrating several sources of experimental data, such as mass spectrometry (MS), protein arrays, crosslinking, nuclear magnetic resonance (NMR), and others. In many cases, even relatively sparse data can be used to derive constraints that speed up optimization approaches significantly and render them more accurate. High-throughput MS experiments involving complexes and crosslinkers pose significant informatics and computational challenges.
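
One way sparse experimental data can speed up a structural search is by discarding candidate models before any expensive scoring is done. The Python sketch below is a hypothetical illustration, not a method described at the workshop. It assumes each crosslink has been reduced to a maximum allowed distance between two residues and each candidate structure is a list of 3D coordinates. The function names and the 10.0 distance cutoff are invented for the example.

```python
import math


def satisfies(coords, constraints):
    """True if every (i, j, max_dist) constraint holds for this candidate."""
    return all(math.dist(coords[i], coords[j]) <= max_dist
               for i, j, max_dist in constraints)


def filter_candidates(candidates, constraints):
    """Keep only candidates consistent with all distance constraints."""
    return [c for c in candidates if satisfies(c, constraints)]


if __name__ == "__main__":
    # Two toy two-residue "structures" (coordinates in arbitrary units).
    compact = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]     # residues 5 apart
    extended = [(0.0, 0.0, 0.0), (30.0, 40.0, 0.0)]  # residues 50 apart
    # One assumed crosslink: residues 0 and 1 lie within 10.0 units.
    crosslinks = [(0, 1, 10.0)]
    kept = filter_candidates([compact, extended], crosslinks)
    print(len(kept))
```

Even a single constraint removes the extended candidate here, which is the sense in which a handful of measurements can shrink the space an optimizer has to explore.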

An important driver for high-performance computing systems will be modeling and simulation to predict the behavior of complexes for specific sets of proteins chosen from network analyses and other experiments. The computational requirements for such simulations are the best characterized among all of the areas of computational biology; moreover, many of these simulation methods are already implemented on teraflop-scale computers. Pure computing power is the major limitation on the size and accuracy of many biochemical simulations, which will involve data and models of protein-protein interactions, ligand-protein interactions, electron-transfer interactions, and membrane characteristics. Molecular dynamics (MD) and quantum mechanics-based molecular modeling will push high-end computing and require development of more effective scalable algorithms.

Finally, the subgroup emphasized the need for the GTL program to push the envelope for biophysical modeling, in particular, to develop the ability to predict the actual behavior of proteins and protein complexes for a selected set of biological processes chosen for their importance to GTL goals.

High-Level Modeling of Metabolic Pathways and Signaling Networks for Cells and Microbial Communities

The ultimate goal of such research would be physically complete models of a cell that would be developed based on a mix of empirical and computed data. Such models ultimately would be able to predict how a cell’s genome and environmental factors combine to yield its phenotype. Models, therefore, would be powerful tools for both scientific discovery and the design of pathways or even whole microorganisms with novel capabilities. Such models have many drivers within the DOE mission areas, including environmental remediation, carbon sequestration, and alternative energy feedstocks.

The subgroup enumerated a series of specific scientific and engineering scenarios, including the engineering of modified Deinococcus radiodurans to clean up aromatic hydrocarbons in a radiation-intensive environment; elucidation of intercellular communication pathways in bacterial communities; and understanding the roles of cyanobacteria and diatoms for carbon sequestration. An ultimate culmination of such modeling methods would be the ability to automatically generate a complete description of a bacterium (as currently found in Bergey’s Manual) using only DNA sequence data from an environmentally collected sample.

The subgroup emphasized that achieving predictive capabilities will require overcoming many technical challenges. For example, cell modeling involves a more complex collection of components and materials than existing models of climate or mechanical systems. Many of the developments needed involve research in computer science and mathematics. New mathematical methods are needed for analysis of raw biological data for inclusion in models and the subsequent statistical design of experiments to validate those models. As described in the previous section, there are major research challenges related to database query and database design in support of modeling, as well as the development of effective databases to capture modeling output and the models themselves.

To create these extremely heterogeneous cellular models, advanced software-development techniques will be necessary. Relevant simulation levels range from individual molecules to molecular complexes, metabolic and signaling pathways, functional subsystems, individual cells, and, ultimately, cell communities. Any general cell-level model will involve a variety of components and “subgrid” models. Effective abstractions are needed for multiple modeling hierarchies. The subgroup concluded that the actual simulations would involve the use of collections of “community” codes, requiring robust interfaces for component coupling. Ultimately, such models will be most effective when integrated into problem-solving environments for integrating experimental data required to determine simulation parameters and to validate simulation results. Finally, the simulation codes need to be scalable from desktops to the largest machines.
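
The idea of coupling independently developed codes through a shared interface can be sketched in a few lines. The Python example below is a toy illustration under stated assumptions, not a design proposed by the subgroup. The `Component` protocol, the `Uptake` and `Metabolism` classes, the state keys, and all rate values are hypothetical. Each component reads a shared state and returns changes to it, and a small driver sums those changes.

```python
from typing import Dict, List, Protocol

State = Dict[str, float]


class Component(Protocol):
    def step(self, state: State, dt: float) -> State:
        """Return the change to each state variable over one time step."""
        ...


class Uptake:
    """Moves nutrient from outside the cell to inside (toy model)."""

    def __init__(self, rate: float) -> None:
        self.rate = rate

    def step(self, state: State, dt: float) -> State:
        moved = min(state["external"], self.rate * dt)
        return {"external": -moved, "internal": moved}


class Metabolism:
    """Converts internal nutrient to biomass at a fixed yield (toy model)."""

    def __init__(self, rate: float, yield_: float) -> None:
        self.rate = rate
        self.yield_ = yield_

    def step(self, state: State, dt: float) -> State:
        used = min(state["internal"], self.rate * dt)
        return {"internal": -used, "biomass": self.yield_ * used}


def advance(components: List[Component], state: State, dt: float) -> State:
    """Advance one step: every component sees the same input state."""
    new_state = dict(state)
    for component in components:
        for key, delta in component.step(state, dt).items():
            new_state[key] = new_state.get(key, 0.0) + delta
    return new_state


if __name__ == "__main__":
    start = {"external": 10.0, "internal": 2.0, "biomass": 0.0}
    model = [Uptake(rate=1.0), Metabolism(rate=4.0, yield_=0.5)]
    print(advance(model, start, dt=1.0))
```

Because the driver depends only on the `step` signature, a component could be replaced by a more detailed model without changing the others, which is the property a robust coupling interface is meant to provide.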

Computationally, no single architecture is appropriate for all aspects of predictive cell modeling. Whole-cell models will require tightly coupled parallel architectures, with smaller-component models running on workstations and whole-cell simulations on petaflop-scale systems. Whatever the form of the distributed computing infrastructure and data resources, they have to allow interactive access to both experimental groups and modeling groups. There may be a role for special-purpose hardware, for example, processors designed to allow very efficient integer operations.

Finally, the subgroup emphasized that a major issue in the development of such models is the interface between modeling and experiment. In particular, there will have to be a close coupling between the collection of cell data and its use in models, as well as validation of the models against very high quality experimental data sets.