Genomes to Life Contractor-Grantee Workshop II
February 29-March 2, 2004, Washington, D.C.
Genomics:GTL Program Projects
Oak Ridge National Laboratory and
Pacific Northwest National Laboratory
Genomics:GTL Center for Molecular and Cellular Systems
A Research Program for Identification and Characterization of Protein Complexes
6
Establishment of Protocols for the High Throughput Analysis of Protein Complexes at the Center for Molecular and Cellular Systems
Michelle V. Buchanan1 (buchananmv@ornl.gov), Gordon Anderson2, Robert L. Hettich1, Brian Hooker2, Gregory B. Hurst1, Steve J. Kennel1, Vladimir Kery2, Frank Larimer1, George Michaels2, Dale A. Pelletier1, Manesh B. Shah1, Robert Siegel2, Thomas Squier2, and H. Steven Wiley2
1Oak Ridge National Laboratory, Oak Ridge, TN and 2Pacific Northwest National Laboratory, Richland, WA
The first year of the Center for Molecular and Cellular Systems focused on evaluating methods for the efficient identification and characterization of protein complexes, identifying “bottlenecks” in the isolation and analysis processes, and developing approaches that could eliminate these bottlenecks. Oak Ridge National Laboratory (ORNL) and Pacific Northwest National Laboratory (PNNL) staff worked closely together to develop an integrated process for protein complex analysis. Emphasis has been placed on developing robust protocols that are adaptable to high throughput isolation and analysis methods. Progress has been made in all five major program areas: molecular biology, organism growth standardization, protein complex isolation/purification, protein complex analysis, and bioinformatics/computation. During this first year, we have evaluated a two-phased approach to identify protein complexes. The first is an exogenous bait approach using one or more purified proteins to pull down the components of the associated protein complex. The second is an endogenous approach involving the in vivo expression of tagged proteins that are used to pull down the components of the associated protein complex. These complementary approaches each have their advantages. The first permits the high-throughput isolation of complexes from a single sample grown under defined conditions, while the second permits the identification of complexes under cellular conditions and can be combined with the development of new imaging methods to identify synthesis, turnover, and complex localization in real time. Two organisms, Rhodopseudomonas palustris and Shewanella oneidensis, were used to test the established protocols. Techniques were optimized and standard protocols were established for endogenous and exogenous complex isolation; these protocols will be deployed in year two of the project.
Considerable progress has also been made in advancing capabilities for the characterization of protein complexes that will minimize current bottlenecks, reduce the amount of sample required, and automate sample handling and processing. We have made progress toward using an affinity-labeled crosslinker that allows selective isolation and subsequent mass spectrometric analysis of crosslinked peptides. Microfluidic technologies that reduce the amount of sample required for analysis and decrease the time required for separation have been applied to the analysis of peptides from protein complexes. Automated trypsinization and sample processing protocols have been developed that are designed around a 96-well format. Imaging of microbial cells, based upon introduction of fluorescent labels onto target proteins, has also been pursued. Automation of key parts of the cloning and complex isolation pipeline was initiated. Particular emphasis was given in this first year to establishing a common laboratory information management system (LIMS) and sample-tracking system that would facilitate distributed workflow across multiple laboratories. Results from the first year of this project have led to the design of a single, high-throughput production pipeline that will integrate efforts at both ORNL and PNNL. This will allow the high throughput analysis of hundreds of complexes during the next year. This pipeline will use complementary endogenous and exogenous pull-down methods to isolate protein complexes and provide greater confidence in complex characterization. The pipeline will be flexible enough to allow improved technologies to be incorporated as they are developed.
7
Isolation and Characterization of Protein Complexes from Shewanella oneidensis and Rhodopseudomonas palustris
Brian S. Hooker1 (Brian.Hooker@pnl.gov), Robert L. Hettich2, Gregory B. Hurst2, Stephen J. Kennel2, Patricia K. Lankford2, Chiann-Tso Lin1, Lye Meng Markillie1, M. Uljana Mayer-Clumbridge1, Dale A. Pelletier2, Liang Shi1, Thomas C. Squier1, Michael B. Strader2, and Nathan C. VerBerkmoes2
1Pacific Northwest National Laboratory, Richland, WA and 2Oak Ridge National Laboratory, Oak Ridge, TN
As part of the Center for Molecular and Cellular Systems pilot project, we have been evaluating both endogenous and exogenous approaches for the robust isolation and identification of protein complexes. Exogenous isolation uses bait proteins to capture the protein complexes. To evaluate various exogenous isolation approaches, five complexes with differing physical characteristics were employed, both stable and transiently associating protein complexes. These complexes included RNA polymerase, the degradosome, and oxidoreductase, all stable protein complexes of varying complexity, and protein tyrosine phosphatase (Ptp) and methionine sulfoxide reductase (Msr), which are signaling proteins that form transient protein complexes. Evaluation of several different approaches has shown that covalent immobilization of the affinity reagent to a solid support works well to isolate the protein complex away from nonspecifically bound proteins, whether this involves direct bait attachment or the immobilization of an antibody against the bait or epitope tag. Approaches evaluated include covalent attachment of bait protein to glass beads that were subsequently used to capture protein complexes and expression of bait proteins with 6xhis tags, which were used to isolate complexes with nickel-chelating resins.
For endogenous complex isolation, we have developed a convenient, broad host range plasmid system to prepare tagged proteins in the native host. A series of expression vectors has been developed that can be used to transform E. coli or R. palustris. These expression vectors have been constructed based on the broad host range plasmid pBBR1MCS5. This vector was modified to contain the Gateway® pDEST multiple cloning region that allows site-specific recombination cloning of targets from a Gateway® entry plasmid. Four modified Gateway® destination vectors were constructed that can be used for expression of N- or C-terminally tagged 6x histidine (6xhis) or glutathione S-transferase (GST) fusion proteins. Using this approach, methods have been developed to purify complexes using a double affinity approach (TAP), and complexes of suitable amounts and purity have been obtained for mass spectrometry evaluation. We have cloned a total of 22 R. palustris genes into these expression vectors to test expression and affinity purification methods for isolation of protein complexes using different affinity tags. The tested genes included those which code for proteins that are components of GroEL, GroES, ATP synthase, CO2 fixation, uptake hydrogenase, ribosome, photosynthesis reaction center, Clp protease, and signal recognition. Results suggest that while no one affinity tag worked well for all genes tested, at least one fusion protein expressed well for each target tested. The 6xhis and V5 tag combination yields a highly purified product in the test cases examined to date. We have therefore focused our effort on this TAP purification protocol, using the pBBRDEST-42 plasmid, as it encodes both the V5 and 6xhis tags. This approach has been incorporated as a part of standard protocols in a high throughput system, and a panel of 200 R. palustris genes is being processed to serve as the pilot group for this automated approach.
As a benchmark for developing and evaluating affinity-based methods for isolating molecular machines, we carried out a conventional biochemical isolation (sucrose density gradient centrifugation) of the R. palustris ribosome, followed by both “bottom-up” and “top-down” mass spectrometric analysis of the protein components of this large, abundant complex. We have identified 53 of the 54 predicted protein components of the ribosome using the “bottom-up” method, and obtained accurate intact masses of 42 ribosomal proteins using the “top-down” approach. Combining results from these two approaches provided information on post-translational modification of the ribosomal proteins, including N-terminal methionine truncation, methylation, and acetylation.
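The way an intact mass from the “top-down” measurement can point to the modifications listed above can be illustrated with a short sketch. It is not the Center's analysis software: the function name, the tolerance, and the limit of three methyl groups are assumptions made for illustration, and the mass shifts are standard monoisotopic values, so an average-mass measurement would need different constants.

```python
from itertools import product

# Standard monoisotopic mass shifts in daltons for the modifications named
# in the text. All other names and defaults here are illustrative assumptions.
MET_LOSS = -131.04049   # loss of the N-terminal methionine residue (C5H9NOS)
METHYL = 14.01565       # one methylation (adds CH2)
ACETYL = 42.010565      # one acetylation (adds C2H2O)


def explain_mass_shift(predicted, observed, tol=0.02, max_methyl=3):
    """List modification combinations whose summed shift matches
    (observed - predicted) within tol daltons."""
    delta = observed - predicted
    hits = []
    for met_lost, n_methyl, n_acetyl in product(
        (0, 1), range(max_methyl + 1), (0, 1)
    ):
        shift = met_lost * MET_LOSS + n_methyl * METHYL + n_acetyl * ACETYL
        if abs(shift - delta) <= tol:
            hits.append(
                {
                    "met_truncated": bool(met_lost),
                    "methyl": n_methyl,
                    "acetyl": n_acetyl,
                }
            )
    return hits


if __name__ == "__main__":
    # Hypothetical protein: sequence-predicted mass 10000.0 Da, observed
    # mass consistent with methionine truncation plus one acetylation.
    print(explain_mass_shift(10000.0, 10000.0 + MET_LOSS + ACETYL))
```

Three methylations (42.047 Da) and one acetylation (42.011 Da) differ by only about 0.036 Da, so the tolerance must be tighter than that to tell them apart from intact mass alone; otherwise “bottom-up” peptide data would be needed to decide.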
8
Bioinformatics and Computing in the Genomics:GTL Center for Molecular and Cellular Systems - LIMS and Mass Spectrometric Analysis of Proteome Data
F. W. Larimer1 (larimerfw@ornl.gov), G. A. Anderson2, K. J. Auberry2, G. R. Kiebel2, E. S. Mendoza2, D. D. Schmoyer1, and M. B. Shah1
1Oak Ridge National Laboratory, Oak Ridge, TN and 2Pacific Northwest National Laboratory, Richland, WA
Scientists at the Oak Ridge National Laboratory/Pacific Northwest National Laboratory (ORNL/PNNL) Genomics:GTL (GTL) Center for Molecular and Cellular Systems are generating large quantities of experimental and computational data. We have developed a prototype Laboratory Information Management System (LIMS) for data and sample tracking of laboratory operations and processes in the various laboratories of the Center. We have also developed a mass spectrometry data analysis system for automating the mass spectrometry data capture and storage, and computational proteomic analysis of this data.
A Laboratory Information Management System for the GTL Center for Molecular and Cellular Systems. The Laboratory Information Management System (LIMS) for the GTL Center for Molecular and Cellular Systems is a central data repository for all information related to production and analysis of GTL samples. It maintains a detailed pedigree for each GTL sample by capturing processing parameters, protocols, stocks, tests and analytical results for the complete life cycle of the sample. Project and study data are also maintained to define each sample in the context of the research tasks that it supports.
The LIMS system is implemented using the Nautilus™ software from Thermo Electron Corporation. This software provides a comprehensive yet extensible framework for a LIMS that can be customized to meet the requirements of the GTL project. Nautilus uses client/server architecture to access data maintained in a central Oracle database and presents an interface based on the Windows Explorer paradigm. The latest Nautilus release includes Web access and this will be added to GTL LIMS in the near future.
The LIMS is configured by establishing workflows that parallel the processing steps completed in the laboratory. For each process it is necessary to define the laboratory environment (stocks, storage locations, instruments, protocols), identify the items to track, the process parameters to collect, the tests that will be conducted, and the test results that will be reported. This information is then used to develop LIMS workflows that will ensure the collection of all critical data.
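The relationship between a workflow definition and the data it must collect can be sketched as a small data structure. This sketch does not use or imitate the Nautilus API; every class, field, and function name is hypothetical, and it only illustrates checking that the parameters and test results defined for a step are present before that step is added to a sample's pedigree.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowStep:
    """One laboratory processing step (hypothetical structure)."""
    name: str
    protocol: str
    required_parameters: tuple
    tests: tuple = ()


@dataclass
class Sample:
    """A tracked sample and the ordered record of steps applied to it."""
    sample_id: str
    pedigree: list = field(default_factory=list)


def record_step(sample, step, parameters, results=None):
    """Append a step to the sample pedigree, refusing incomplete records."""
    results = results or {}
    missing = [p for p in step.required_parameters if p not in parameters]
    if missing:
        raise ValueError(f"{step.name}: missing parameters {missing}")
    unreported = [t for t in step.tests if t not in results]
    if unreported:
        raise ValueError(f"{step.name}: missing test results {unreported}")
    sample.pedigree.append(
        {
            "step": step.name,
            "protocol": step.protocol,
            "parameters": dict(parameters),
            "results": dict(results),
        }
    )
    return sample


if __name__ == "__main__":
    pcr = WorkflowStep(
        name="PCR amplification",
        protocol="protocol-A",  # placeholder protocol identifier
        required_parameters=("primer_pair", "cycles"),
        tests=("gel_band_size",),
    )
    sample = Sample("S-0001")
    record_step(
        sample,
        pcr,
        {"primer_pair": "P1", "cycles": 30},
        {"gel_band_size": "expected"},
    )
    print(sample.pedigree)
```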
The initial GTL LIMS system configuration has been completed. This required customization of Nautilus to include additional GTL data items such as primers, genes, and vectors, and programmatic extensions to do GTL specific tasks such as copying files to the central file server and displaying files stored on the central file server. Program extensions also had to be developed to handle some of the processing steps for stocks stored in 96-well plates.
Future plans for the LIMS include additional reporting capabilities, integration with the mass spec data analysis pipeline, barcode implementation, and refinement of the process workflows.
PRISM Mass Spectrometry Proteomic Data Analysis System. The Proteomics Research Information Storage and Management (PRISM) System manages the very large amounts of data generated by the mass spectrometry facility and automatically performs the analytical processing that converts these data into information about the proteins observed in biological samples. PRISM also collects and maintains information about the biological samples and the laboratory protocols and procedures that were used to prepare them.
PRISM is composed of distributed software components that operate cooperatively on a network of commercially available PC computer systems. It uses several relational databases to hold information and a set of autonomous programs that interact with these databases to perform much of the automated file handling and information processing. A large and readily expandable data file storage space is provided by a set of storage servers. The basic database software is a commercial product, but the database schemata and content and the autonomous programs have all been developed in-house to meet the unique and continually evolving requirements of the MS facility.
PRISM has been in continuous operation since March 2000, and has been continually upgraded. There have been four major upgrade cycles, and numerous minor ones, including the addition of new functionality and the expansion of capacity as new instruments are added to the facility. Most recently, PRISM has been upgraded to maintain inter-system tracking information for GTL samples and the ability to maintain and process them in their as-delivered format (96-well plates).
PRISM manages data and research results for all of the mass spectrometry-based proteomics studies in our laboratory; this includes over 100 research campaigns or lines of investigation. This research has resulted in 15,334 datasets from a number of different mass spectrometers. These datasets have required 41,491 separate analysis operations to extract peptide and protein identifications. The total raw data volume managed by PRISM is in excess of 15 terabytes. The current rate of production results in approximately 800 datasets per month, with significant increases expected in FY04.
Data Abstraction Layer (DAL). The DAL is middleware that will provide a level of abstraction for any data storage system in the proteomics pipeline (LIMS, Freezer Software, PRISM, etc.). It will provide a generic interface for building tools and applications that require access to the experimental data and analysis results. It will also allow the pipeline data to be extended without making changes in the manner in which an application already looks at the data. For example, it could be used to facilitate a query performed utilizing proteomic data originating from both PNNL and ORNL. The DAL will be used to provide an interface to the pipeline data as required by selected bioinformatics/analysis tools.
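The idea of a generic interface over several storage systems can be sketched as follows. This is a minimal illustration and not the DAL design itself: the class names, the `find` method, and the record fields are all hypothetical, and in-memory lists stand in for systems such as the LIMS or PRISM.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Common interface that each storage system would implement."""

    @abstractmethod
    def find(self, **criteria):
        """Return a list of record dicts matching all criteria."""


class InMemorySource(DataSource):
    """Stand-in for a real back end; holds records in a list."""

    def __init__(self, records):
        self._records = list(records)

    def find(self, **criteria):
        return [
            dict(r)
            for r in self._records
            if all(r.get(k) == v for k, v in criteria.items())
        ]


class DataAbstractionLayer:
    """Routes one query to every registered source and merges the results."""

    def __init__(self):
        self._sources = {}

    def register(self, name, source):
        self._sources[name] = source

    def find(self, **criteria):
        results = []
        for name, source in self._sources.items():
            for record in source.find(**criteria):
                record["source"] = name  # tag each record with its origin
                results.append(record)
        return results


if __name__ == "__main__":
    dal = DataAbstractionLayer()
    dal.register("PNNL", InMemorySource(
        [{"organism": "S. oneidensis", "dataset": "pnnl-1"}]))
    dal.register("ORNL", InMemorySource(
        [{"organism": "S. oneidensis", "dataset": "ornl-1"},
         {"organism": "R. palustris", "dataset": "ornl-2"}]))
    print(dal.find(organism="S. oneidensis"))
```

Because applications call only `find` on the layer, a new back end can be added with `register` without changing application code, which is the kind of extensibility the paragraph above describes.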
9
Advanced Computational Methodologies for Protein Mass Spectral Data Analysis
Gordon Anderson1 (gordon@pnl.gov), Joshua Adkins1, Andrei Borziak2, Robert Day2, Tema Fridman2, Andrey Gorin2, Frank Larimer2, Chandra Narasimhan2, Jane Razumovskaya2, Heidi Sophia1, David Tabb2, Edward Uberbacher2, Inna Vokler2, and Li Wang2
1Pacific Northwest National Laboratory, Richland, WA and 2Oak Ridge National Laboratory, Oak Ridge, TN
Completed analysis of a variety of genomes has led to a revolution in the methods and approaches of what was traditionally protein biochemistry. Now, an analysis of a variety of protein functions can be undertaken on a genome-wide level. Among the most interesting and complicated functions of proteins is their tendency to form higher order functional complexes. Using a combination of protein pull-down techniques and combined capillary liquid chromatography/mass spectrometry (LC/MS) as a sensitive detector for proteins, new protein complexes are being identified as part of the Center for Molecular and Cellular Systems. Computational tools are being developed to assist in the interpretation of these data. For example, complications in these data arise from non-specific and transient protein interactions. Imperfect bioinformatic tools for the peptide identifications that lead to protein identifications in these complex pull-downs are also a problem. We are using the clustering program OmniViz as a tool for discovery of protein complexes among these complicating protein identifications. This includes the ability to view various experiments in a virtual 1D gel format to aid biologists in looking at the results, and adjustable features that can be used to compare different ratios of sensitivity and specificity in the putative protein complexes. We are automating the process, leading to standardized approaches and reports for protein components of complexes.
Improved scoring algorithms for matching theoretical tandem mass spectra of peptides to observed spectra are being developed to replace existing scoring algorithms such as that used by SEQUEST. The likelihood of matches can be estimated by probabilistic analysis of fragment ion matches rather than computationally expensive cross-correlation. This greatly improves the speed and accuracy of the peptide scoring system. A computational system for peptide charge determination that achieves 98% accuracy has also been developed using statistical and neural network methods. This allows a several-fold speedup in calculation time without loss of information.
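One simple form of probabilistic fragment-ion scoring can be sketched with a binomial model. This is a generic textbook-style illustration, not the scoring algorithm developed by the Center: the function names, the 0.5 m/z tolerance, the m/z range, and the way the random-match probability is estimated are all assumptions.

```python
import math


def count_matches(theoretical_mz, observed_mz, tol=0.5):
    """Count theoretical fragment ions with an observed peak within tol."""
    return sum(
        any(abs(t - o) <= tol for o in observed_mz) for t in theoretical_mz
    )


def random_match_probability(n_peaks, tol, mz_span):
    """Chance that a random m/z falls within tol of some observed peak,
    approximated as the covered fraction of the m/z span (an assumption)."""
    return min(1.0, n_peaks * 2.0 * tol / mz_span)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1)
    )


def match_score(theoretical_mz, observed_mz, tol=0.5, mz_range=(200.0, 2000.0)):
    """Score = -log10 of the probability of matching at least k of n
    theoretical ions by chance; larger means less likely to be random."""
    n = len(theoretical_mz)
    k = count_matches(theoretical_mz, observed_mz, tol)
    p = random_match_probability(
        len(observed_mz), tol, mz_range[1] - mz_range[0]
    )
    tail = min(1.0, binomial_tail(n, k, p))
    return -math.log10(tail)


if __name__ == "__main__":
    theoretical = [300.0, 400.0, 500.0, 600.0]
    print(match_score(theoretical, [300.1, 400.2, 499.9, 600.0]))
    print(match_score(theoretical, [310.0, 420.0, 530.0, 640.0]))
```

The score needs only peak counting and a short sum, which illustrates why a probabilistic estimate of this kind can be cheaper than computing a cross-correlation for every candidate peptide.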
Methods for de novo sequencing to construct sequence tags from MS/MS data have also been developed using a statistical combination of informational elements including the peaks in the neighborhood of expected B and Y ions. The approach utilizes all informational content of a given MS/MS experimental data set, including peak intensities, weak and noisy peaks, and unusual fragments. The ‘Probability Profile Method’ (PPM) is capable of recognizing ion types with good accuracy, making the identification of peptides significantly more reliable. The method requires a training database of previously resolved spectra, which are used to determine “neighborhood patterns” for peak categories that correspond to ion types (N- or C-terminus ions, their dehydrated fragments, etc.). The established patterns are applied to assign probabilities for experimental spectra peaks to fit into these categories. Using this model, a significant portion of peaks in a raw experimental spectrum can be identified with high confidence. PPM can be used in a number of ways: as a filter for the peptide database lookup approach, to determine peptides with post-translational modifications or peptide complexes, and for de novo sequencing and tag determination.
10
High-Throughput Cloning, Expression and Purification of Rhodopseudomonas palustris and Shewanella oneidensis Affinity Tagged Fusion Proteins for Protein Complex Isolation
Dale A. Pelletier1* (pelletierda@ornl.gov), Linda Foote1, Brian S. Hooker2, Peter Hoyt1, Stephen J. Kennel1, Vladimir Kery2, Chiann-Tso Lin2, Tse-Yuan Lu1, Lye Meng Markillie2, and Liang Shi2
1Oak Ridge National Laboratory, Oak Ridge, TN and 2Pacific Northwest National Laboratory, Richland, WA
This poster will describe the approaches and progress in the joint Oak Ridge National Laboratory/Pacific Northwest National Laboratory (ORNL/PNNL) Center for Molecular and Cellular Systems pilot project on protein complexes. We have adopted the following process design for isolation of protein complexes: (1) construct adaptable plasmids for expression in multiple organisms, (2) adapt standard gene primer design, (3) PCR amplify target genes, (4) clone into donor vector/expression vectors, and (5) express in selected organisms for exogenous and endogenous complex isolation.
We have developed software that designs appropriate PCR primers flanking each gene such that any gene can be amplified from genomic DNA. The resulting PCR products can be directly recombined into entry vectors. We have used the Gateway® cloning system (Invitrogen) to produce entry clones that can be recombined into our modified broad host range expression vectors which contain ori genes compatible with replication in a variety of bacterial hosts.
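The core of such primer design can be sketched in a few lines. This is not the Center's software: the function names, the fixed 20-base primer length, and the coordinate convention are assumptions, and the sketch omits steps a production tool would need, such as melting-temperature checks and the extra 5' sequences required for recombination into entry vectors.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_primers(genome, gene_start, gene_end, primer_len=20):
    """Return (forward, reverse) primers for a forward-strand gene.

    gene_start is 0-based inclusive and gene_end is exclusive. The forward
    primer copies the first primer_len bases of the gene; the reverse primer
    is the reverse complement of the last primer_len bases.
    """
    gene = genome[gene_start:gene_end]
    if len(gene) < 2 * primer_len:
        raise ValueError("gene is too short for non-overlapping primers")
    forward = gene[:primer_len]
    reverse = reverse_complement(gene[-primer_len:])
    return forward, reverse


if __name__ == "__main__":
    # Toy genome: an 18-base "gene" with four flanking bases on each side.
    genome = "GGGG" + "ATGAAACCCGGGTTTTAA" + "CCCC"
    print(design_primers(genome, 4, 22, primer_len=6))
```

Running the same function over every annotated gene coordinate would give one primer pair per gene, which is the form needed for amplification in 96-well format.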
We have performed PCR amplification from host genomic DNA in 96-well format and shown, using generic conditions, that 60-70% of the reactions yield products of the predicted size. We have previously demonstrated automated PCR amplification, cleanup and gel analysis using liquid handling robots and are transitioning to high-throughput hardware. We have successfully PCR amplified approximately 80 R. palustris genes and cloned 40 into expression vectors. Twenty of these constructs have been electroporated into R. palustris. To date, high-throughput electroporation has not been implemented, but 96-well systems for this purpose are commercially available and will be tested. Plans for automation at this step include an automated colony picker and subsequent robot-directed plasmid preps for QA and long-term cataloging and storage. We have also successfully cloned over 30 S. oneidensis genes into expression vectors using the Gateway® system. Over 20 of these constructs have been successfully expressed in both E. coli and S. oneidensis.
Expression in E. coli and in hosts R. palustris and S. oneidensis has to date been evaluated primarily using manual processes. Cell samples from relatively large cultures are lysed and IMAC or TAP isolations are completed followed by verification of product by SDS-PAGE and/or Western blot. Tagged S. oneidensis proteins expressed in E. coli and S. oneidensis for exogenous bait experiments are then purified using single-step IMAC on a Qiagen Biorobot 3000 LS. Milligram quantities of up to 12 proteins in parallel have been purified using this automated system.