WO2000029987A1 - Methods for identifying and classifying organisms by mass spectrometry and database searching - Google Patents

Methods for identifying and classifying organisms by mass spectrometry and database searching

Info

Publication number
WO2000029987A1
Authority
WO
WIPO (PCT)
Prior art keywords
mass
protein
proteins
database
sample
Prior art date
Application number
PCT/US1999/027191
Other languages
French (fr)
Inventor
Plamen A. Demirev
Catherine Fenseleau
Original Assignee
University Of Maryland
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Maryland
Priority to US09/856,044 (US7020559B1)
Priority to AU19150/00A (AU1915000A)
Publication of WO2000029987A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • Fig. 1 Molecular mass distribution (in bins of 1 kDa) of proteins deposited in the SwissPROT/TrEMBL sequence database: a) all prokaryotic proteins, and b) all proteins from B.subtilis.
  • Fig. 2 Positive ion MALDI spectra from: a) B.subtilis (8 hours growth time), matrix - SA; and b) E.coli (32 hours growth time), matrix - CHCA.
  • Fig. 3 Positive ion MALDI spectra from: a) E.coli (32 hours growth time), matrix - MCA:SA mixture; and b) E.coli (8 hours growth time), matrix - CHCA.
  • Fig. 4 Number of proteins combined from B.subtilis and E.coli with masses within a predetermined mass window (in ppm) as a function of molecular mass.
  • Fig. 5 Positive ion MALDI spectrum from a mixture of B.subtilis and E.coli, matrix - SA.
  • Fig. 6 An example of a flow chart for microorganism and cell identification by mass spectrometry and database searching.
  • the present invention relates to compositions of matter, instruments, processes (e.g., as carried out using computer software and/or hardware), and methods for identifying, classifying, and/or characterizing biological materials by measuring the molecular weights of protein constituents in such materials and using the molecular information to deduce the organismic source of the biological materials.
  • biological materials comprising proteins can be subjected to mass spectrometry, or other suitable means for determining mass, in order to determine the molecular weights of the protein constituents.
  • the resulting molecular weight information of the protein constituents can then be used to query databases which contain, among other information, lists of protein molecular weight information and the identity of the organism source from which the information was derived.
  • the present invention presents a method for rapid identification, classification, and/or characterization of biological materials, such as microorganisms, organisms, organs, tissues, cells, subcellular materials, and the like, which exploits the wealth of information contained in genome and protein sequence databases.
  • biological materials such as microorganisms, organisms, organs, tissues, cells, subcellular materials, and the like.
  • the massive efforts to sequence the human genome have brought about a rapid increase in the speed with which DNA sequences from all species are being accumulated in publicly available computer databases.
  • any instrument, method, process, etc. can be utilized to determine the molecular weight of proteins in a sample.
  • a preferred method of obtaining molecular weight is by mass spectrometry, where protein molecules in a sample are ionized and then the resultant mass and charge of the protein ions are detected and determined.
  • any suitable instrument, method, process, etc. for carrying out mass spectrometry can be utilized.
  • to use mass spectrometry to analyze proteins, it is preferred that the protein be converted to a gas-ion phase.
  • Various methods of protein ionization are useful, including, e.g., fast atom bombardment (FAB), plasma desorption, laser desorption, thermal desorption, and, preferably, electrospray ionization (ESI) and matrix-assisted laser desorption ionization (MALDI).
  • FAB fast atom bombardment
  • ESI electrospray ionization
  • MALDI matrix-assisted laser desorption ionization
  • Many different mass analyzers are available for peptide and protein analysis, including, but not limited to, Time-of-Flight (TOF), ion trap (ITMS), Fourier transform ion cyclotron (FTMS), quadrupole ion trap, and sector (electric and/or magnetic) spectrometers. See, e.g., U.S. Pat. No. 5,572,025 for an ion-trap MS.
  • Mass analyzers can be used alone, or in combination to form tandem mass spectrometers. In the latter case, a first mass analyzer can be used to separate the protein ions (precursor ions) from each other and determine the molecular weights of the various protein constituents in the sample. A second mass analyzer can be used to analyze the separated constituents, e.g., by fragmenting the precursor ions into product ions. Any desired combination of mass analyzers can be used, including, e.g., triple quadrupoles, tandem time-of-flights, ion traps, and/or combinations thereof. Different kinds of detectors can be used to detect the protein ions.
  • destructive detectors can be utilized, such as ion electron multipliers or cryogenic detectors (e.g., U.S. Pat. No. 5,640,010).
  • non-destructive detectors can be used, such as ion current pick-up devices, which are utilized in quadrupole ion trap mass analyzers or FTMS.
  • Any source of proteins can be used in accordance with the present invention, including whole organisms, such as multicellular and unicellular organisms, organs, tissues, cells, subcellular structures, and mixtures thereof.
  • microorganisms can be utilized, e.g., archaebacteria, bacteria, chlamydiae, rickettsia, viruses, mycoplasma, molds, yeasts, protozoa, algae, prions, etc.
  • Cells, microorganisms, etc. can be genetically-engineered, altered, modified, etc.
  • Proteins can be extracted from intact or treated materials. Any substrate comprising a biological material can be used. For instance, it may be desirable to characterize organisms found on surfaces, in food, in biological fluids, such as saliva, urine, fecal matter, blood, lymph, or plasma, on materials used to wipe surfaces suspected of containing organisms, in hair, objects handled or contacted by organisms, etc.
  • sample preparation methods can be utilized, including dried droplet (Karas and Hillenkamp, Anal. Chem., 60:2299-2301, 1988), vacuum-drying (Winberger et al, In Proceedings of the 41st ASMS Conference on Mass Spectrometry and Allied Topics, San Francisco, May 31-June 4, 1993, pp. 775a-b), crush crystals (Xiang et al, Rapid
  • samples of microorganisms can be lyophilized, extracted into a solution, such as a 70:30 solution of CH3CN:0.1% trifluoroacetic acid, and then embedded in the matrix.
  • a solution such as a 70:30 solution of CH3CN:0.1% trifluoroacetic acid
  • Various matrices can be used, e.g., sinapinic acid, 2,5-dihydroxybenzoic acid, alpha-cyano-4-hydroxycinnamic acid.
  • a sample can be processed in various ways prior to addition to the matrix. For instance, the sample can be extracted, subjected to corona discharge, chromatography, such as HPLC, etc., e.g., to remove particular unwanted constituents (such as lipids, small molecules, high molecular weight constituents) before mass spectrometry.
  • MALDI-TOF can detect proteins at the attomole level over a wide range of molecular weights. Masses can be determined, e.g., as low as 1000 and as high as one hundred thousand daltons. Any range of molecular weights can be used in accordance with the present invention, e.g., about 4000-20,000. Masses accurate to about 50 ppm or better will be most reliable in the general case.
  • a specific peak matches more than one different protein in the searched databases. This can happen when different proteins share the same, or similar, molecular weights.
  • a search of a database in such a case might reveal more than one protein which corresponds to the measured molecular weight of a protein in the sample.
  • the peak protein can be separated from the mixture and subjected to further physical characterization.
  • Separation can be accomplished by any suitable method, e.g., conventional techniques involving cell lysis, extraction, and two-dimensional gel electrophoresis, capillary electrophoresis, or high performance liquid chromatography
  • Useful information includes amino acid composition, amino acid sequence, proteolytic and enzymatic cleavage patterns, isoelectric point, hydrophobicity, and other physical characteristics.
  • Another scenario where additional information on a protein in a spectrum may be warranted is where such a protein has not matched any of the proteins listed in the database. Thus, such information can be useful to increase the specificity of the approach.
  • proteomics e.g. for rapid identification of proteins present in a mixture in picomolar amounts
  • MS-based procedures for identity assignment of individual proteins.
  • These methods include, but are not limited to, chemical/enzymatic digestion of material (obtained from a single spot in a two-dimensional gel electropherogram, or by another suitable chromatographic technique) and mass spectral determination of the molecular masses of the protein and resulting peptides (peptide mapping). Making use of already available information in protein sequence databases, a comparison can be made between proteolytic peptide mass patterns generated "in silico" and experimentally observed peptide masses.
  • a "hit-list" can be compiled, ranking candidate proteins in the database based on (among other criteria) the number of matches between the proteolytic fragments.
  • Several Web sites are accessible that provide software for protein identification on-line, based on peptide mapping and sequence database search strategies.
  • Data collected from a mass spectrometer typically comprises the intensity and mass to charge ratio for each detected event.
  • Spectral data can be recorded in any suitable form, including, e.g., in graphical, numerical, or electronic formats, either in digital or analog form.
  • Spectra are preferably recorded in a storage medium, including, e.g., magnetic, such as floppy disk, tape, or hard disk; optical, such as CD-ROM or laserdisc; or ROM chips.
  • the mass spectrum of a given sample typically provides information on protein intensity, mass to charge ratio, and molecular weight.
  • the molecular weights of proteins in the sample are used as a matching criterion to query a database.
  • the molecular weights are calculated conventionally, e.g., by subtracting the mass of the ionizing proton for singly-charged protonated molecular ions, or, for multiply-charged ions, by multiplying the measured mass-to-charge ratio by the number of charges and subtracting the total mass of the ionizing protons.
  • Various databases are useful in accordance with the present invention. Useful databases include databases which contain genomic sequences, expressed gene sequences, and/or expressed protein sequences.
  • Preferred databases contain nucleotide sequence-derived molecular masses of proteins present in a known organism, organ, tissue, or cell-type. There are a number of algorithms to identify open reading frames (ORFs) and convert nucleotide sequences into protein sequence and molecular weight information.
  • ORF open reading frames
  • Several publicly accessible databases are available, including the SwissPROT/TrEMBL database, which contains substantial entries for a variety of organisms, including B. subtilis and E. coli.
  • TIGR Microbial Database http://www.tigr.org/tdb/mdb/mdb.html
  • VanBogelen et al. Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press
  • Information contained in the databases includes, e.g., gene name, protein name, E.C. number, category of function, Swiss-Prot accession code, sequence code for Genbank, Kohara phage location, genetic map location, direction of transcription on the chromosome, predicted molecular weight and isoelectric point from DNA sequence, etc.
  • One or more databases can be searched using any suitable search algorithm.
  • for example, the SwissProt/TrEMBL database ("Expasy," Swiss Bioinformatics Institute) can be searched using the Sequence Retrieval System (SRSWWW) module.
  • SRSWWW Sequence Retrieval System
  • any search strategy can be utilized in accordance with the present invention.
  • a mass spectrometer is equipped with commercial software that identifies peaks above a certain threshold level and calculates mass, charge, and intensity of detected ions. Correlating molecular weight with a given output peak can be accomplished directly from the spectral data, i.e., where the charge on an ion is one and the molecular weight is therefore equal to the numerator value minus the mass of the ionizing proton.
  • protein ions can be complexed with various counter-ions and adducts, such as Na+ and K+. In such a case, it would be expected that a given protein ion would exhibit multiple peaks, such as a triplet, representing different ionic states (or species) of the same protein.
  • post-translational processing may have to be considered.
  • processing events which modify protein structure in a cell, including, proteolytic processing, removal of N-terminal methionine, acetylation, methylation, glycosylation, etc.
  • a database can be queried for a range of proteins which match the molecular mass of the unknown.
  • the range window can be determined by the accuracy of the instrument, the method by which the sample was prepared, etc. Based on the number of hits (where a hit is a match) in the spectrum, the unknown is identified or classified.
  • a preferred method of the present invention concerns identifying one or more unknown microorganisms in a sample, comprising: searching a sequence database for a plurality of different proteins that have the molecular weights of proteins in a mass spectrum of a sample, wherein said sample comprises a plurality of proteins from one or more unknown microorganisms, whereby said one or more unknown microorganisms are identified.
  • Identifying is meant in the general sense. For example, when an unknown microorganism is utilized in the aforementioned method, an objective is to determine the character of the unknown. This can mean finding out the particular taxonomic group(s) to which the microorganism belongs, such as its kingdom, phylum, class, order, family, genus, species, variety, and/or strain. By determining that the sample is derived from a bacterium, the sample is thus classified as a bacterium. Identification in this sense can be as precise as the materials and methods allow. For some purposes, it may be enough to identify a sample as derived from a set of possible groups; however, other purposes may demand more precision.
  • a database is searched for proteins which have the molecular weights of protein constituents in the sample.
  • a database is a collection of organized information in a form which can be searched and retrieved by a computer, or other electronic processing means.
  • the searching can be accomplished using any suitable, effective search algorithm that can determine the presence of entries in the database which have the same molecular weight as, or a molecular weight within a specified range of, proteins in the mass spectrum of the unknown sample.
  • the database as mentioned earlier, can comprise genomic sequences, expressed genes, protein sequences, protein molecular weights, etc.
  • the database contains nucleotide information
  • this information can be translated into protein data before the searching step, e.g., by identifying an ORF, proteolytic and cleavage sites, glycosylation sites, methylation sites, and other processing which can influence the mass of a protein.
  • the searching step can be characterized as searching for proteins which are "predicted" to have molecular weights.
  • a search in accordance with the present invention means, e.g., that a database is queried or probed for the presence of data which match or correspond to the measured data, such as the measured data obtained from a mass spectrum.
  • the database is searched for a plurality of different proteins, i.e., more than one, preferably more than 5, etc.
  • identification reliability will depend on a number of factors, including the number of peaks in a matched spectrum matched to proteins in a database, the number and accuracy of proteins predicted from the genome sequence in the mass range under study, etc.
  • different it is meant that the proteins arise from different genes, such as a gene coding for a protease and a gene coding for an amylase.
  • a search strategy can use the information generated by MS, or any other method, to search a database.
  • a simple search and find strategy can be used where the database is queried for proteins which match the molecular weight of the inputted data.
  • Fig. 6 is an example of a process of identifying an unknown organism, cell, or other biological material.
  • One or more of the steps depicted in the flow chart can be used to identify an organism in accordance with the present invention.
  • a mass spectrum of a sample comprising proteins from an unknown organism has already been generated using MALDI/TOF spectrometry.
  • the output from the mass spectrometer is represented as a series of m/z values, where m is the mass of a protein plus the mass of a proton or other charging species, and z is the net number of charges carried by the ion, and is used as the initial input 1.
  • the input masses are processed 2, e.g., by subtracting one dalton to correct for the proton added to the protein when it is ionized through the gas-phase proton transfer reaction of MALDI.
  • the input data can additionally be processed by determining an average molecular weight or a monoisotopic mass.
  • a protein will typically be represented in a mass spectrum by more than one peak because of the presence in it of more than one isotope. Carbon, for instance, occurs in nature as C-12 or C-13 in a ratio of about 100:1. Therefore, if a compound contains a single carbon, it would be expected that 99% of it would be C-12 and 1% of it would be C-13. The mass spectrum of such a compound would therefore have at least two peaks, each corresponding to a different carbon isotope.
  • a mass spectrum of such a compound would contain multiple peaks for each polyisotopic molecule.
  • a compound containing a plurality of atoms represented by more than one isotope will have a complex pattern of spectral peaks.
  • Such complex spectral information can be processed in a number of ways. An average mass can be calculated, e.g., using the empirical spectral information. Alternatively, a monoisotopic mass can be calculated, i.e., a mass derived by representing each atom in the molecule by only one of its isotopes.
  • a mass window is set 3 to define a mass range in which matches in the database will be scored as hits.
  • the mass window for a particular query can be set based on various criteria.
  • One consideration is the accuracy of the instrument. For instance, if the instrument can only measure values within three daltons, then the mass window could be ±3 Da.
  • Other considerations include post-translational processing.
  • the accuracy of the instrument can be determined routinely, e.g., using known standards and calibrating the instrument using an external and internal standard.
  • the processed data resulting from 2 is used as input data to initiate a search 4 of a database containing protein masses.
  • the database comprises nucleotide sequence information which has been analyzed to predict the occurrence of open reading frames (ORF) and the calculated molecular masses of such ORFs.
  • ORF open reading frames
  • various public and private databases are available that contain calculated protein mass information, or which can be mined by available software to derive such information.
  • the search mode queries the database for proteins having molecular masses which match up with the molecular masses in the mass spectrum input data 2. For each peak in the input data 2, the database is queried and a list is generated of putative database proteins which match it.
  • a match is identified in the database if it possesses the same mass as the peak, or if it is within the range indicated in the mass window 3.
  • a first list 6 can be generated which reflects the masses and organismic sources for each match. For example, each mass spectral peak of 2 can be associated with a family of proteins, representing proteins of the same molecular mass but from different organisms and proteins within the mass range set in 3.
  • the data in 6 can optionally be refined 7 by inputting additional data 8, e.g., from fragmented precursor ions of 1 and collecting data of peptide mass, sequence tag information from mass spectra, or other types of downstream information on the constituent proteins.
  • additional data e.g., from fragmented precursor ions of 1 and collecting data of peptide mass, sequence tag information from mass spectra, or other types of downstream information on the constituent proteins.
  • data, or orthogonal information can, e.g., increase certainty that the identification is correct and/or reduce the number of positive hits identified in a search.
  • a search identifies X possible proteins in the database which match the query by being within the mass window set in 3
  • a step 8 can be used to reduce the number of possible hits.
  • sequence information or proteolytic information can be used to determine which specific hit, in the set of hits identified for the specified mass range, corresponds to the data point of interest in the mass spectrum.
  • amino sequence or composition information obtained from a peak of interest can be used to search the set of hits identified as matching the peak of interest to ascertain which hit contains the sequence information.
  • Sequence information can be highly specific, eliminating all other peaks having the same molecular weight from the list generated in 7.
  • Amino acid composition information can also be used as a refining tool, although it may be less specific. Any supplemental information on the physical characteristics of a protein can be used to confirm and/or reduce the number of hits identified in a search, including sequence information as mentioned, cleavage patterns
  • the data from 6 or 7 can then be scored to generate an output list 10 which lists the possible organismic sources of the mass spectrum.
  • the identified organismic sources can be ranked based on a number of criteria, including, but not limited to, total number of proteins identified as matching an organismic source, orthogonal information obtained in 8, etc. Table 1, for instance, shows that B. subtilis contains 12/15 or 80% matching peaks and E. coli contains 6/15 or 40% matching peaks. If percent match is the sole criterion, B. subtilis would be ranked above E. coli.
  • the proteins which are unidentified in the list, e.g., the three proteins listed in Table 1 for B. subtilis, can be subjected to further analysis in an iteration step 10.
  • One reason that a matching protein is not identified in the database may be that the protein is subjected to post-translational modifications and therefore does not have the molecular weight predicted by ORF analysis.
  • An advantage of the present invention is that it can be independent of the specific ionization technique and mass analyzer utilized, alleviating the requirement for rigorous reproducibility, crucial in currently used fingerprint-based approaches.
  • the approach introduced here is independent of relative signal intensities in the mass spectrum. It does not even require that the same set of proteins be expressed and/or detected in each analysis of the same organism, only that a set is characteristic so that it can be associated with a microorganism source.
  • sample preparation, ionization and mass analysis for obtaining mass spectra are not restrictive for the described approach, which also has a potential to be used for identification of cells from individual tissues.
  • the present invention can be used in a variety of different ways and settings and has useful applications in the lab, field, and environmental testing. For example, it can be used in human and veterinary medicine to diagnose normal and pathological conditions from biological materials, such as blood, plasma, urine, sperm, fecal matter, and saliva.
  • biological materials such as blood, plasma, urine, sperm, fecal matter, and saliva.
  • the present invention is also useful in research and industry. Food samples can be obtained from food materials suspected of contamination.
  • SA Sinapinic acid
  • CHCA α-cyano-4-hydroxycinnamic acid
  • MCA 4-methoxycinnamic acid
  • MALDI mass spectrometry a solution of bovine insulin and bovine ubiquitin was added to the E.coli sample/matrix mixture on the sample slide in order to increase the accuracy of mass determination.
  • an internal mass calibration standard a solution of bovine insulin and bovine ubiquitin
  • B.subtilis external calibration of the instrument using a mixture of proteins (bovine insulin, bovine ubiquitin and horse heart cytochrome C) was performed prior to running the samples. All proteins were obtained from Sigma Chemical Co. (St. Louis, MO).
  • Positive ion mass spectra (typically from 50 single laser shots rastered uniformly across the sample spot) were recorded in linear mode at 20 kV accelerating voltage and a delay of 0.3 µs.
  • the estimated N₂ laser fluence was around 10 mJ·cm⁻².
  • the MALDI spectra (Fig. 2) of B.subtilis and E.coli contain multiple peaks between 4 and 10 kDa with a signal to noise ratio better than 3.
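
The Fig. 6 workflow summarized in the list above (input masses 1, proton correction 2, mass window 3, database search 4, and scoring of organismic sources) can be sketched roughly as follows. This is a minimal illustration, not the applicants' software: the toy database, the protein names, the peak values, and the function names are all hypothetical, and the score used here (fraction of peaks with at least one match per organism) is only one of the ranking criteria the text allows.

```python
from collections import defaultdict

PROTON_MASS = 1.00728  # Da; the text approximates this correction as one dalton

# Hypothetical toy database: (organism, protein name, predicted mass in Da).
# A real database would hold ORF-derived masses for many organisms.
TOY_DATABASE = [
    ("B. subtilis", "protein A", 5000.0),
    ("B. subtilis", "protein B", 7200.0),
    ("B. subtilis", "protein C", 9100.0),
    ("E. coli", "protein D", 5001.0),
    ("E. coli", "protein E", 8300.0),
]


def neutral_masses(mz_values):
    """Step 2: correct singly protonated [M+H]+ peaks to neutral masses."""
    return [mz - PROTON_MASS for mz in mz_values]


def search(masses, database, window_da):
    """Steps 3-4: for each peak, list database entries within +/- window_da."""
    return [
        [entry for entry in database if abs(entry[2] - mass) <= window_da]
        for mass in masses
    ]


def rank_organisms(hits_per_peak):
    """Score each organism by the fraction of peaks it can account for."""
    matched = defaultdict(int)
    for entries in hits_per_peak:
        # Count an organism once per peak, even if several proteins match.
        for organism in {entry[0] for entry in entries}:
            matched[organism] += 1
    n_peaks = len(hits_per_peak)
    scores = {org: count / n_peaks for org, count in matched.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


peaks_mz = [5001.0, 7201.0, 9101.0, 6000.0]  # made-up [M+H]+ peak values
hits = search(neutral_masses(peaks_mz), TOY_DATABASE, window_da=3.0)
ranking = rank_organisms(hits)
print(ranking)  # [('B. subtilis', 0.75), ('E. coli', 0.25)]
```

In this toy run the first peak matches proteins from both organisms, which mirrors the ambiguity the text describes; orthogonal data such as sequence tags would be needed to resolve it.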

Abstract

A method for rapid identification of biological materials is presented, which exploits the wealth of information contained in genome and protein sequence databases (5). In a preferred embodiment, the method utilizes the masses of a set of ions obtained by MALDI TOF mass spectrometry of intact or treated cells (1). Subsequent correlation (4) of each ion in the set to a protein, along with the organismic source of the protein, is performed by searching a database comprising protein molecular weights (9).

Description

METHODS FOR IDENTIFYING AND CLASSIFYING ORGANISMS BY MASS SPECTROMETRY AND DATABASE SEARCHING
Cross-reference to Applications
This application claims the benefit of U.S. Provisional Application Nos. 60/108,696, filed November 17, 1998, and 60/120,679, filed February 19, 1999, which are hereby incorporated by reference in their entirety.
Background of the Invention
The development of methods of rapidly identifying and characterizing biological materials, such as microorganisms and cells, is a major focus of academic and industrial research. The need for such methods has been felt in the health, laboratory, and environmental industries. In medicine, for example, the exponential rise in antibiotic-resistant bacteria and emerging viral disease has caused a crisis in the health-care and food industries. As a result, there has been continued pressure to find new, reliable, and rapid means of characterizing pathological and disease-causing organisms. Similarly, the threats of biological warfare and terrorist activities, which have been felt world-wide, have caused an escalated search for ways of identifying putative biological agents, especially in the field, in airports, and in other public areas.
Coupled with the need for advanced biological agent detection methods has been an escalating effort in the sequencing of DNA from all types of organisms and identifying expressed genes. The complete genomic sequences of a number of microorganisms have been completed. The availability of such information about the genome and proteome of whole organisms is an important reservoir to be exploited for identifying and characterizing unknown and known organisms.
Description of the Drawings
Fig. 1: Molecular mass distribution (in bins of 1 kDa) of proteins deposited in the SwissPROT/TrEMBL sequence database: a) all prokaryotic proteins, and b) all proteins from B.subtilis.
Fig. 2: Positive ion MALDI spectra from: a) B.subtilis (8 hours growth time), matrix - SA; and b) E.coli (32 hours growth time), matrix - CHCA.
Fig. 3: Positive ion MALDI spectra from: a) E.coli (32 hours growth time), matrix - MCA:SA mixture; and b) E.coli (8 hours growth time), matrix - CHCA.
Fig. 4: Number of proteins combined from B.subtilis and E.coli with masses within a predetermined mass window (in ppm) as a function of molecular mass.
Fig. 5: Positive ion MALDI spectrum from a mixture of B.subtilis and E.coli, matrix - SA.
Fig. 6: An example of a flow chart for microorganism and cell identification by mass spectrometry and database searching.
Description of the Invention
The present invention relates to compositions of matter, instruments, processes (e.g., as carried out using computer software and/or hardware), and methods for identifying, classifying, and/or characterizing biological materials by measuring the molecular weights of protein constituents in such materials and using the molecular information to deduce the organismic source of the biological materials. In preferred embodiments of the invention, biological materials comprising proteins can be subjected to mass spectrometry, or other suitable means for determining mass, in order to determine the molecular weights of the protein constituents. The resulting molecular weight information of the protein constituents can then be used to query databases which contain, among other information, lists of protein molecular weight information and the identity of the organism source from which the information was derived. By comparing the set of protein molecular masses of an unknown, as determined, for instance, in a mass spectrum, against a database containing the molecular masses of proteins present in known organisms, the unknown can be rapidly and reliably identified, classified, or characterized.
The present invention presents a method for rapid identification, classification, and/or characterization of biological materials, such as microorganisms, organisms, organs, tissues, cells, subcellular materials, and the like, which exploits the wealth of information contained in genome and protein sequence databases. The massive efforts to sequence the human genome have brought about a rapid increase in the speed with which DNA sequences from all species are being accumulated in publicly available computer databases. As a result, the complete genomes of many different organisms are now known (e.g., National Center for Biotechnology Information (NIH), http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html; The C. elegans Sequencing Consortium, Science 1998, 282, 2012-2018; TIGR Microbial Database, http://www.tigr.org/tdb/mdb/mdb.html). See, also, Table 6. There exists complementarity between the genome of an organism and its respective proteome, i.e., the dynamic set of all expressed proteins. In databases, such complementarity is realized via assignment of an amino acid sequence to each "open reading frame" (ORF) in a DNA sequence. By using bioinformatics tools, the complete proteomes of organisms with established DNA sequences have been made available and accessible, e.g., through the Internet. Characterization of such organisms can be achieved through knowledge of their complete genomes or complementary proteomes.
In accordance with the present invention, any instrument, method, process, etc. can be utilized to determine the molecular weight of proteins in a sample. A preferred method of obtaining molecular weight is by mass spectrometry, where protein molecules in a sample are ionized and then the resultant mass and charge of the protein ions are detected and determined.
Any suitable instrument, method, process, etc. for carrying out mass spectrometry can be utilized. To use mass spectrometry to analyze proteins, it is preferred that the protein be converted to a gas-phase ion. Various methods of protein ionization are useful, including, e.g., fast atom bombardment (FAB), plasma desorption, laser desorption, thermal desorption, and, preferably, electrospray ionization (ESI) and matrix-assisted laser desorption ionization (MALDI). Many different mass analyzers are available for peptide and protein analysis, including, but not limited to, Time-of-Flight (TOF), ion trap (ITMS), Fourier transform ion cyclotron resonance (FTMS), quadrupole ion trap, and sector (electric and/or magnetic) spectrometers. See, e.g., U.S. Pat. No. 5,572,025 for an ion-trap MS.
Mass analyzers can be used alone, or in combination to form tandem mass spectrometers. In the latter case, a first mass analyzer can be used to separate the protein ions (precursor ions) from each other and determine the molecular weights of the various protein constituents in the sample. A second mass analyzer can be used to analyze the separated constituents, e.g., by fragmenting the precursor ions into product ions. Any desired combination of mass analyzers can be used, including, e.g., triple quadrupoles, tandem time-of-flights, ion traps, and/or combinations thereof. Different kinds of detectors can be used to detect the protein ions. For example, destructive detectors can be utilized, such as ion electron multipliers or cryogenic detectors (e.g., U.S. Pat. No. 5,640,010). Additionally, non-destructive detectors can be used, such as the ion current pick-up devices which are utilized in quadrupole ion trap mass analyzers or FTMS.
Any source of proteins can be used in accordance with the present invention, including whole organisms, such as multicellular and unicellular organisms, organs, tissues, cells, subcellular structures, and mixtures thereof. Various microorganisms can be utilized, e.g., archaebacteria, bacteria, chlamydiae, rickettsia, viruses, mycoplasma, molds, yeasts, protozoa, algae, prions, etc. Cells, microorganisms, etc. can be genetically-engineered, altered, modified, etc. Proteins can be extracted from intact or treated materials. Any substrate comprising a biological material can be used. For instance, it may be desirable to characterize organisms found on surfaces, in food, in biological fluids, such as saliva, urine, fecal matter, blood, lymph, or plasma, on materials used to wipe surfaces suspected of containing organisms, in hair, on objects handled or contacted by organisms, etc.
Any method of preparing samples for analysis can be used. For MALDI-TOF, a number of sample preparation methods can be utilized, including dried droplet (Karas and Hillenkamp, Anal. Chem., 60:2299-2301, 1988); vacuum-drying (Winberger et al., In Proceedings of the 41st ASMS Conference on Mass Spectrometry and Allied Topics, San Francisco, May 31-June 4, 1993, pp. 775a-b); crush crystals (Xiang et al., Rapid Comm. Mass Spectrom., 8:199-204, 1994); slow crystal growing (Xiang et al., Org. Mass Spectrom., 28:1424-1429, 1993); active film (Mock et al., Rapid Comm. Mass Spectrom., 6:233-238, 1992; Bai et al., Anal. Chem., 66:3423-3430, 1994); pneumatic spray (Kochling et al., Proceedings of the 43rd ASMS Conference on Mass Spectrometry and Allied Topics, Atlanta, GA, May 21-26, 1995, p. 1225); electrospray (Hensel et al., Proceedings of the 43rd ASMS Conference on Mass Spectrometry and Allied Topics, Atlanta, GA, May 21-26, 1995, p. 947); fast solvent evaporation (Vorm et al., Anal. Chem., 66:3281-3287, 1994); sandwich (Li et al., J. Am. Chem. Soc., 118:11662-11663, 1996); and two-layer methods (Dai et al., Anal. Chem., 71:1087-1091, 1999). See also, e.g., Liang et al., Rapid Commun. Mass Spectrom., 10:1219-1226, 1996; van Adrichem et al., Anal. Chem., 70:923-930, 1998. For example, samples of microorganisms can be lyophilized, extracted into a solution, such as a 70:30 solution of CH3CN:0.1% trifluoroacetic acid, and then embedded in the matrix. Various matrices can be used, e.g., sinapinic acid, 2,5-dihydroxybenzoic acid, and alpha-cyano-4-hydroxycinnamic acid. A sample can be processed in various ways prior to addition to the matrix. For instance, the sample can be extracted, subjected to corona discharge, chromatography, such as HPLC, etc., e.g., to remove particular unwanted constituents (such as lipids, small molecules, high molecular weight constituents) before mass spectrometry.
MALDI-TOF can detect proteins at the attomole level over a wide range of molecular weights. Masses can be determined, e.g., as low as 1000 daltons and as high as a hundred thousand daltons. Any range of molecular weights can be used in accordance with the present invention, e.g., about 4000-20,000. Masses accurate to about 50 ppm or better will be most reliable in the general case.
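To make the accuracy figure concrete: a tolerance stated in parts per million scales with the mass being measured. The short Python sketch below converts a ppm tolerance into daltons. The function name `ppm_to_daltons` and the sample masses are illustrative assumptions, not part of the described method.

```python
def ppm_to_daltons(mass_da: float, ppm: float) -> float:
    """Absolute mass tolerance (Da) for a relative tolerance given in ppm."""
    return mass_da * ppm / 1_000_000


if __name__ == "__main__":
    # Illustrative masses spanning the 4000-20,000 range mentioned in the text.
    for mass in (4000.0, 10000.0, 20000.0):
        tol = ppm_to_daltons(mass, 50.0)
        print(f"{mass:>8.0f} Da at 50 ppm -> +/-{tol:.2f} Da")
```

Under this arithmetic, 50 ppm corresponds to roughly 0.5 Da for a 10,000 Da protein.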
In some cases, it may be desirable to obtain more information on a particular protein identified in a mass spectrum. One instance is where a specific peak matches more than one protein in the searched databases. This can happen when different proteins share the same, or similar, molecular weights. A search of a database in such a case might reveal more than one protein which corresponds to the measured molecular weight of a protein in the sample. To determine which protein in the database corresponds to the peak at issue, the peak protein can be separated from the mixture and subjected to further physical characterization. Separation can be accomplished by any suitable method, e.g., conventional techniques involving cell lysis, extraction, and two-dimensional gel electrophoresis, capillary electrophoresis, or high performance liquid chromatography. Useful information includes amino acid composition, amino acid sequence, proteolytic and enzymatic cleavage patterns, isoelectric point, hydrophobicity, and other physical characteristics. Another scenario where additional information on a protein in a spectrum may be warranted is where such protein has not matched any of the proteins listed in the database. Thus, such information can be useful to increase the specificity of the approach.
As mentioned, characterization of individual proteins in a mixture can be accomplished using any suitable means. The expanding requirements in proteomics, e.g., for rapid identification of proteins present in a mixture in picomolar amounts, have resulted in the development of powerful MS-based procedures for identity assignment of individual proteins. [9-14] These methods include, but are not limited to, chemical/enzymatic digestion of material (obtained from a single spot in a two-dimensional gel electropherogram, or by another suitable chromatographic technique) and mass spectral determination of the molecular masses of the protein and resulting peptides (peptide mapping). Making use of already available information in protein sequence databases, a comparison can be made between proteolytic peptide mass patterns generated "in silico" and experimentally-observed peptide masses. A "hit-list" can be compiled, ranking candidate proteins in the database, based on (among other criteria) the number of matches between the proteolytic fragments. Several Web sites are accessible that provide software for protein identification on-line, based on peptide mapping and sequence database search strategies. [15]
Methods of peptide mapping and sequencing using MS are described in WO95/25281, U.S. Pat. No. 5,538,897, U.S. Pat. No. 5,869,240, U.S. Pat. No. 5,572,025, and U.S. Pat. No. 5,696,376. See, also, Yates, J. Mass Spec., 33:1-19, 1998. The present invention can also be combined with methods that detect small molecules (other than proteins) in samples, such as the method described in WO98/09314. The latter method only measures molecules in the range of about 500-1500 Da, and not more than 1876 Da.
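The peptide-mapping comparison described above (in silico digestion, then counting matches against observed peptide masses) can be sketched as follows. This is a minimal illustration under stated assumptions: trypsin-like cleavage after K or R except before P, monoisotopic residue masses, a fixed tolerance, and an invented example sequence. None of the function names come from the cited software.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE[aa] for aa in peptide) + WATER


def count_matches(observed, sequence, tol=0.5):
    """Number of observed masses explained by the in silico digest."""
    predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(o - p) <= tol for p in predicted) for o in observed)


if __name__ == "__main__":
    seq = "MAKGRPLK"  # invented sequence, for illustration only
    print(tryptic_digest(seq))
    print(count_matches([348.18, 999.0], seq))
```

A hit-list would then rank each candidate database sequence by its `count_matches` score.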
Data collected from a mass spectrometer typically comprise the intensity and mass-to-charge ratio for each detected event. Spectral data can be recorded in any suitable form, including, e.g., in graphical, numerical, or electronic formats, either in digital or analog form. Spectra are preferably recorded in a storage medium, including, e.g., magnetic, such as floppy disk, tape, or hard disk; optical, such as CD-ROM or laserdisc; or ROM chips.
The mass spectrum of a given sample typically provides information on protein intensity, mass-to-charge ratio, and molecular weight. In preferred embodiments of the invention, the molecular weights of proteins in the sample are used as a matching criterion to query a database. The molecular weights are calculated conventionally, e.g., by subtracting the mass of the ionizing proton for singly-charged protonated molecular ions, or by multiplying the measured mass-over-charge ratio by the number of charges for multiply-charged ions and subtracting the mass of the ionizing protons.
Various databases are useful in accordance with the present invention. Useful databases include databases which contain genomic sequences, expressed gene sequences, and/or expressed protein sequences. Preferred databases contain nucleotide sequence-derived molecular masses of proteins present in a known organism, organ, tissue, or cell-type. There are a number of algorithms to identify open reading frames (ORF) and convert nucleotide sequences into protein sequence and molecular weight information. Several publicly accessible databases are available, including the SwissPROT/TrEMBL database, which contains substantial entries for a variety of organisms, including B. subtilis and E. coli. For other databases, see also, e.g., TIGR Microbial Database, http://www.tigr.org/tdb/mdb/mdb.html; VanBogelen et al., Escherichia coli and Salmonella: Cellular and Molecular Biology, ASM Press, Washington, D.C., 1996; http://pcsf.brcf.med.umich.edu/eco2dbase; http://expasy.hcuge.ch/cgi-bin/map2/def7ECOLI.ECO2DBASE. Information contained in the databases includes, e.g., gene name, protein name, E.C. number, category of function, Swiss-Prot accession code, sequence code for Genbank, Kohara phage location, genetic map location, direction of transcription on the chromosome, predicted molecular weight and isoelectric point from DNA sequence, etc.
One or more databases can be searched using any suitable search algorithm. For example, the SwissProt/TrEMBL database ("Expasy," Swiss Bioinformatics Institute) can be searched using the Sequence Retrieval System (SRSWWW) module. In general, any search strategy can be utilized in accordance with the present invention.
Typically, a mass spectrometer is equipped with commercial software that identifies peaks above a certain threshold level and calculates the mass, charge, and intensity of detected ions. Correlating molecular weight with a given output peak can be accomplished directly from the spectral data, i.e., where the charge on an ion is one and the molecular weight is therefore equal to the numerator value minus the mass of the ionizing proton. However, protein ions can be complexed with various counter-ions and adducts, such as Na+ and K+. In such a case, it would be expected that a given protein ion would exhibit multiple peaks, such as a triplet, representing different ionic states (or species) of the same protein. Thus, it may be necessary to analyze and process spectral data to determine families of peaks arising from the same protein. This analysis can be carried out conventionally, e.g., as described by Mann et al., Anal. Chem., 61:1702-1708, 1989.
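The two steps in this paragraph, converting m/z to a neutral mass and grouping adduct peaks into families, can be illustrated with the sketch below. It is not the procedure of Mann et al.; it is a simplified stand-in that assumes singly charged ions for the grouping step, a fixed 1 Da tolerance, and mass shifts computed from monoisotopic atomic masses. All names are hypothetical.

```python
PROTON = 1.00728  # mass of a proton, Da

# Shift in m/z when Na+ or K+ replaces the ionizing proton (singly charged).
ADDUCT_SHIFTS = {"Na": 21.98194, "K": 37.95588}


def neutral_mass(mz, z=1):
    """Neutral molecular mass from m/z for an ion carrying z protons."""
    return z * mz - z * PROTON


def group_adducts(peaks, tol=1.0):
    """Group singly charged peaks into families of one protein.

    Each family starts at the lowest remaining peak (taken as [M+H]+)
    and absorbs peaks found at the Na or K shift within `tol` Da.
    """
    remaining = sorted(peaks)
    families = []
    while remaining:
        base = remaining.pop(0)
        family = {"H": base}
        for name, shift in ADDUCT_SHIFTS.items():
            for p in remaining:
                if abs(p - (base + shift)) <= tol:
                    family[name] = p
                    remaining.remove(p)
                    break  # stop iterating after mutating the list
        families.append(family)
    return families


if __name__ == "__main__":
    # Invented peak list: one triplet and one isolated peak.
    for fam in group_adducts([5000.0, 5021.98, 5037.96, 7000.0]):
        print(fam, "->", round(neutral_mass(fam["H"]), 2), "Da")
```

Only the protonated member of each family would then be carried forward to the database query.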
In matching a molecular mass calculated from a mass spectrum to a molecular mass predicted from a database, such as a genomic or expressed gene database, post-translational processing may have to be considered. There are various processing events which modify protein structure in a cell, including proteolytic processing, removal of N-terminal methionine, acetylation, methylation, glycosylation, etc.
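One way to account for such processing is to expand each sequence-predicted mass into a small set of candidate masses before matching. The sketch below does this for three of the listed events, using average mass shifts (Met residue 131.19 Da, acetyl +42.04 Da, methyl +14.03 Da). The choice of events, the single-modification limit, and all names are assumptions made for illustration.

```python
MET_RESIDUE = 131.19  # average mass of a methionine residue, Da
MOD_SHIFTS = {"acetylation": 42.04, "methylation": 14.03}  # average, Da


def candidate_masses(predicted_mass, starts_with_met=True):
    """Candidate masses for an ORF-predicted protein mass.

    Considers optional loss of the N-terminal Met, each optionally
    combined with one acetylation or one methylation.
    """
    candidates = {"unmodified": predicted_mass}
    if starts_with_met:
        candidates["-Met"] = predicted_mass - MET_RESIDUE
    # Snapshot the base forms before adding modified variants.
    for label, base in list(candidates.items()):
        for mod, shift in MOD_SHIFTS.items():
            candidates[f"{label}+{mod}"] = base + shift
    return candidates


if __name__ == "__main__":
    for label, mass in candidate_masses(10000.0).items():
        print(f"{label:<25} {mass:10.2f}")
```

Expanding the candidate set raises the chance of a false match, so in practice the set would be kept small.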
A database can be queried for a range of proteins which match the molecular mass of the unknown. The range window can be determined by the accuracy of the instrument, the method by which the sample was prepared, etc. Based on the number of hits (where a hit is a match) in the spectrum, the unknown is identified or classified.
A preferred method of the present invention concerns identifying one or more unknown microorganisms in a sample, comprising: searching a sequence database for a plurality of different proteins that have the molecular weights of proteins in a mass spectrum of a sample, wherein said sample comprises a plurality of proteins from one or more unknown microorganisms, whereby said one or more unknown microorganisms are identified.
Identifying is meant in the general sense. For example, when an unknown microorganism is utilized in the aforementioned method, an objective is to determine the character of the unknown. This can mean finding out the particular taxonomic group(s) to which the microorganism belongs, such as its kingdom, phylum, class, order, family, genus, species, variety, and/or strain. By determining that the sample is derived from a bacterium, the sample is thus classified as a bacterium. Identification in this sense can be as precise as the materials and methods allow. For some purposes, it may be enough to identify a sample as derived from a set of possible groups; however, other purposes may demand more precision. For instance, it may be enough for certain purposes to describe the sample as comprising a bacterium, as opposed to a protozoan, or a pathogenic bacterium as opposed to a nonpathogenic bacterium.
In accordance with the method, a database is searched for proteins which have the molecular weights of protein constituents in the sample. A database is a collection of organized information in a form which can be searched and retrieved by a computer, or other electronic processing means. As described above, the searching can be accomplished using any suitable, effective search algorithm that can determine the presence of entries in the database which have the same molecular weight as, or a molecular weight within a specified range of, proteins in the mass spectrum of the unknown sample. The database, as mentioned earlier, can comprise genomic sequences, expressed genes, protein sequences, protein molecular weights, etc. If the database contains nucleotide information, then this information can be translated into protein data before the searching step, e.g., by identifying an ORF, proteolytic and cleavage sites, glycosylation sites, methylation sites, and other processing which can influence the mass of a protein.
In the case where a DNA database is being used to generate and deduce protein information, the knowledge of the protein's characteristics is indirect. Thus, the searching step can be characterized as searching for proteins which are "predicted" to have the specified molecular weights. A search in accordance with the present invention means, e.g., that a database is queried or probed for the presence of data which match or correspond to the measured data, such as the measured data obtained from a mass spectrum.
According to preferred embodiments of the present invention, the database is searched for a plurality of different proteins, i.e., more than one, preferably more than 5, etc. In general, identification reliability will depend on a number of factors, including the number of peaks in a spectrum matched to proteins in a database, the number and accuracy of proteins predicted from the genome sequence in the mass range under study, etc. By the term "different," it is meant that the proteins arise from different genes, such as a gene coding for a protease and a gene coding for an amylase. A search strategy can use the information generated by MS, or any other method, to search a database. A simple search and find strategy can be used where the database is queried for proteins which match the molecular weight of the inputted data.
Fig. 6 is an example of a process of identifying an unknown organism, cell, or other biological material. One or more of the steps depicted in the flow chart can be used to identify an organism in accordance with the present invention. In this example, a mass spectrum of a sample comprising proteins from an unknown organism has already been generated using MALDI/TOF spectrometry. The output from the mass spectrometer is represented as a series of m/z values, where m is the mass of a protein plus the mass of a proton or other charging species, and z is the net number of charges carried by the ion, and is used as the initial input 1. The input masses are processed 2, e.g., by subtracting one dalton to correct for the proton added to the protein when it is ionized through the gas-phase proton transfer reaction of MALDI. The input data can additionally be processed by determining an average molecular weight or a monoisotopic mass.
A protein will typically be represented in a mass spectrum by more than one peak because of the presence in it of more than one isotope. Carbon, for instance, occurs in nature as C-12 or C-13 in a ratio of about 100:1. Therefore, if a compound contains a single carbon, it would be expected that 99% of it would be C-12 and 1% of it would be C-13. The mass spectrum of such a compound would therefore have at least two peaks, each corresponding to a different carbon isotope. As the number of carbon atoms in a molecule increases, there is an increasing number of polyisotopic molecules, comprising varying ratios of the different carbon isotopes. A mass spectrum of such a compound would contain multiple peaks, one for each polyisotopic molecule. A compound containing a plurality of atoms represented by more than one isotope will have a complex pattern of spectral peaks. Such complex spectral information can be processed in a number of ways. An average mass can be calculated, e.g., using the empirical spectral information. Alternatively, a monoisotopic mass can be calculated, i.e., a mass derived by representing each atom in the molecule by only one of its isotopes.
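The difference between an average mass and a monoisotopic mass can be shown numerically. The sketch below computes both from an elemental formula using standard atomic masses. The formula used for bovine insulin, C254H377N65O75S6, is supplied here as an outside assumption (the text names insulin only as a calibrant), and the function names are hypothetical.

```python
import re

# Standard atomic weights (natural isotope mixture), Da.
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}
# Mass of the most abundant isotope of each element, Da.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.0078250, "N": 14.0030740,
    "O": 15.9949146, "S": 31.9720707,
}


def parse_formula(formula):
    """Turn a formula such as 'C2H5NO2' into {'C': 2, 'H': 5, ...}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def formula_mass(formula, table):
    """Sum of atomic masses for the formula, using the given mass table."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


if __name__ == "__main__":
    insulin = "C254H377N65O75S6"  # assumed formula for bovine insulin
    print(f"average:      {formula_mass(insulin, AVERAGE):.2f} Da")
    print(f"monoisotopic: {formula_mass(insulin, MONOISOTOPIC):.2f} Da")
```

For a molecule of this size the two values differ by about 4 Da, which is larger than a ±3 Da search window, so the database masses and the measured masses should be of the same kind.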
Before database searching is initiated, a mass window is set 3 to define a mass range in which matches in the database will be scored as hits. The mass window for a particular query can be set based on various criteria. One consideration is the accuracy of the instrument. For instance, if the instrument can only measure values to within three daltons, then the mass window could be ±3 Da. Other considerations include post-translational processing. The accuracy of the instrument can be determined routinely, e.g., using known standards and calibrating the instrument using an external and/or internal standard.
The processed data resulting from 2 is used as input data to initiate a search 4 of a database containing protein masses. In preferred embodiments, the database comprises nucleotide sequence information which has been analyzed to predict the occurrence of open reading frames (ORF) and the calculated molecular masses of such ORFs. As mentioned above, various public and private databases are available that contain calculated protein mass information, or which can be mined by available software to derive such information. The search mode queries the database for proteins having molecular masses which match up with the molecular masses in the mass spectrum input data 2. For each peak in the input data 2, the database is queried and a list is generated of putative database proteins which match it. A match is identified in the database if it possesses the same mass as the peak, or if it is within the range indicated in the mass window 3. A first list 6 can be generated which reflects the masses and organismic sources for each match. For example, each mass spectral peak of 2 can be associated with a family of proteins, representing proteins of the same molecular mass but from different organisms and proteins within the mass range set in 3.
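The search step 4 and the resulting list 6 can be sketched as a windowed lookup in a table of predicted protein masses. The example below is a minimal illustration, not the SRS query used in the Examples: the `Entry` type, the in-memory list, the ±3 Da default, and every mass and protein name in the toy database are invented for demonstration.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Entry(NamedTuple):
    mass: float     # predicted molecular mass, Da
    organism: str   # organismic source
    protein: str    # protein or ORF name


def search_peaks(entries, peak_masses, window_da=3.0):
    """For each peak, list database entries within +/- window_da."""
    ordered = sorted(entries, key=lambda e: e.mass)
    masses = [e.mass for e in ordered]
    hits = {}
    for peak in peak_masses:
        lo = bisect_left(masses, peak - window_da)
        hi = bisect_right(masses, peak + window_da)
        hits[peak] = ordered[lo:hi]
    return hits


if __name__ == "__main__":
    # Toy database; all values are hypothetical.
    db = [
        Entry(7332.0, "B. subtilis", "hypothetical protein A"),
        Entry(7334.5, "E. coli", "hypothetical protein B"),
        Entry(9890.0, "B. subtilis", "hypothetical protein C"),
    ]
    for peak, found in search_peaks(db, [7333.0, 9000.0]).items():
        print(peak, [(e.organism, e.protein) for e in found])
```

A peak can return several entries from different organisms, which is the "family of proteins" described above, or none at all.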
The data in 6 can optionally be refined 7 by inputting additional data 8, e.g., from fragmented precursor ions of 1 and collecting data of peptide mass, sequence tag information from mass spectra, or other types of downstream information on the constituent proteins. Such data, or orthogonal information, can, e.g., increase certainty that the identification is correct and/or reduce the number of positive hits identified in a search. When a search identifies X possible proteins in the database which match the query by being within the mass window set in 3, a step 8 can be used to reduce the number of possible hits. When the queried database contains amino acid sequence information (deduced or experimentally-derived), sequence information or proteolytic information can be used to determine which specific hit, in the set of hits identified for the specified mass range, corresponds to the data point of interest in the mass spectrum. For example, amino acid sequence or composition information obtained from a peak of interest can be used to search the set of hits identified as matching the peak of interest to ascertain which hit contains the sequence information. Sequence information can be highly specific, eliminating all other hits having the same molecular weight from the list generated in 7. Amino acid composition information can also be used as a refining tool, although it may be less specific. Any supplemental information on the physical characteristics of a protein can be used to confirm and/or reduce the number of hits identified in a search, including sequence information as mentioned, cleavage patterns (chemical or enzymatic), isoelectric point, hydrophobicity as deduced from chromatography, immunogenicity, etc.
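As a hedged illustration of the refinement step 7, the sketch below filters the hits for one peak by a short sequence tag. The dictionary layout, the function name, and the example records are assumptions; the masses and sequences are invented and are not mutually consistent.

```python
def refine_hits(hits, sequence_tag):
    """Keep only hits whose amino acid sequence contains the tag."""
    return [hit for hit in hits if sequence_tag in hit["sequence"]]


if __name__ == "__main__":
    # Two hypothetical hits that fall in the same mass window.
    hits = [
        {"protein": "hypothetical protein A", "organism": "B. subtilis",
         "mass": 7332.1, "sequence": "MKTAYIAKQR"},
        {"protein": "hypothetical protein B", "organism": "E. coli",
         "mass": 7333.4, "sequence": "MSLNQVEGKR"},
    ]
    # A tag read from fragment ions of the peak of interest (invented).
    for hit in refine_hits(hits, "NQVEG"):
        print(hit["organism"], hit["protein"])
```

Composition, isoelectric point, or cleavage-pattern filters could be written in the same shape, each taking the hit list and returning a shorter one.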
The data from 6 or 7 can then be scored to generate an output list 10 which lists the possible organismic sources of the mass spectrum. The identified organismic sources can be ranked based on a number of criteria, including, but not limited to, total number of proteins identified as matching an organismic source, orthogonal information obtained in 8, etc. Table 1, for instance, shows that B. subtilis contains 12/15 or 80% matching peaks and E. coli contains 6/15 or 40% matching peaks. If percent match is the sole criterion, B. subtilis would be ranked above E. coli. Optionally, the proteins which are unidentified (e.g., the three proteins listed in Table 1 for B. subtilis) in the list can be subjected to further analysis in an iteration step 10. One reason that a matching protein is not identified in the database may be that the protein is subjected to post-translational modifications and therefore does not have the molecular weight predicted by ORF analysis.
An advantage of the present invention is that it can be independent of the specific ionization technique and mass analyzer utilized, alleviating the requirement for rigorous reproducibility, which is crucial in currently used fingerprint-based approaches. The approach introduced here is independent of relative signal intensities in the mass spectrum. It does not even require that the same set of proteins be expressed and/or detected in each analysis of the same organism, only that a set is characteristic so that it can be associated with a microorganism source. The particular choices of sample preparation, ionization and mass analysis for obtaining mass spectra are not restrictive for the described approach, which also has the potential to be used for identification of cells from individual tissues.
The present invention can be used in a variety of different ways and settings and has useful applications in the lab, field, and environmental testing. For example, it can be used in human and veterinary medicine to diagnose normal and pathological conditions from biological materials, such as blood, plasma, urine, sperm, fecal matter, and saliva. The present invention is also useful in research and industry. Food samples can be obtained from food materials suspected of contamination.
EXAMPLES
For the purpose of illustrating the feasibility of the method, MALDI TOF mass spectrometry was employed. The described database search method is not restricted to that specific instrument combination and sample preparation. Sinapinic acid (SA) or α-cyano-4-hydroxycinnamic acid (CHCA), 50 mM in 70:30 CH3CN:H2O, and an equimolar mixture of SA and 4-methoxycinnamic acid (MCA) in 70:30 CH3CN:H2O were used as matrixes. The microorganisms studied were B. subtilis (strain 168, ATCC# 23857) and E. coli (ATCC# 11775). They were grown in-house according to standard procedures; 8 g/l nutrient broth (Difco Labs, Detroit, MI) was used as a growth medium. After harvesting, the material was centrifuged for 10 min at 10^4 g and washed with water three times prior to lyophilization for prolonged storage at -10 °C. Lyophilized vegetative cells were suspended in a 70:30 solution of CH3CN:0.1% trifluoroacetic acid at a concentration of 5 mg/ml. B. subtilis suspension (0.2 μl) was deposited on the sample slide before MALDI mass spectrometry. A mixed sample of E. coli and B. subtilis was prepared by mixing suspensions of the two microorganisms on the slide prior to CPD treatment and MALDI mass spectrometry. In some experiments, an internal mass calibration standard (a solution of bovine insulin and bovine ubiquitin) was added to the E. coli sample/matrix mixture on the sample slide in order to increase the accuracy of mass determination. For B. subtilis, external calibration of the instrument using a mixture of proteins (bovine insulin, bovine ubiquitin and horse heart cytochrome C) was performed prior to running the samples. All proteins were obtained from Sigma Chemical Co. (St. Louis, MO).
Positive ion mass spectra (typically from 50 single laser shots rastered uniformly across the sample spot) were recorded in linear mode at 20 kV accelerating voltage and a delay of 0.3 μs. The estimated N2 laser fluence was around 10 mJ·cm⁻².
A search by protein molecular mass (Mr), based on the set of protein molecular weights in the spectra, was carried out in the SwissProt/TrEMBL database ("Expasy", Swiss Bioinformatics Institute) using the Sequence Retrieval System (SRSWWW) module at http://expasy.hcuge.ch/srs5/. An interactive window ("Alternative Query Form") allows searching by a number of classifiers. In this case we chose average protein MW as the primary classifier. We selected a ±3 Da MW window, and the only restriction applied in the query was the choice of the "bacteria" protein subset of the database (in an earlier release of SwissPROT the identifier "prokaryota" was also available). Thus protein identities and organismic sources were tentatively assigned for all peaks from the experimental spectra within the range from 4 to 15 kDa.
Under the conditions used, the MALDI spectra (Fig. 2) of B. subtilis and E. coli contain multiple peaks between 4 and 10 kDa with a signal-to-noise ratio better than 3. They are listed in Tables 1 and 2, respectively. A database search was performed, based on the observed masses. It was assumed that singly-protonated molecules were detected, i.e., a proton mass was subtracted from the observed mass in order to obtain the average Mr. In assigning the respective peaks (i.e., proteins with Mr within the chosen Mr window of ±3 Da), the organisms from which each potential protein originates are also determined. These are presented in Tables 1 and 2. From Table 1, one microorganism, B. subtilis, is identified as the source of 12 of the 15 peaks. There are two runners-up in that example, which provide matches for 6 and 5 of the 15 major peaks. It is evident from Table 2 that 13 E. coli proteins match observed peaks (out of 17 total), while one other microorganism matches 5 of the 17 peaks. The possibility that unmatched peaks can correspond to alkali cation adducts and/or post-translationally modified products (including proteolytic fragments) of proteins already present in the database will be explored in a software implementation of the described approach.
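The ranking implied by these counts can be reproduced with a few lines of Python. The sketch below scores each organism by the fraction of peaks for which it supplies at least one matching protein. The peak indices and the second runner-up label "organism X" are placeholders, arranged only so that the totals match the 12, 6, and 5 of 15 reported above; the actual peak assignments are in Table 1.

```python
from collections import Counter


def rank_organisms(peak_hits):
    """Rank organisms by the fraction of peaks they match.

    peak_hits maps each peak to the organisms with a protein in its
    mass window; an organism counts at most once per peak.
    """
    total = len(peak_hits)
    counts = Counter()
    for organisms in peak_hits.values():
        counts.update(set(organisms))
    return [(org, n, n / total) for org, n in counts.most_common()]


if __name__ == "__main__":
    # Placeholder assignment of 15 peaks (indices 0-14).
    peak_hits = {i: [] for i in range(15)}
    for i in range(12):
        peak_hits[i].append("B. subtilis")
    for i in range(6):
        peak_hits[i].append("E. coli")
    for i in range(10, 15):
        peak_hits[i].append("organism X")  # hypothetical runner-up
    for org, n, frac in rank_organisms(peak_hits):
        print(f"{org:<12} {n:>2}/15  {frac:.0%}")
```

Percent match is only one possible criterion; orthogonal information could be added as a further weight on each organism's score.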
As already pointed out, there exist inherent problems with the reproducibility of MALDI mass spectra of the same organism. A comparison of published spectra of E. coli [26,29,30] shows that they do not match each other or the spectrum in Fig. 2b. However, searching the proteome database for masses observed in each spectrum leads to the positive identification of the bacteria in each case (Tables 3-5). This is not surprising, since all spectra should reflect the presence of expressed proteins from the same genome. The same type of robustness can be illustrated by comparing the MALDI spectra from the same sample of E. coli, obtained in different matrixes. The spectra have different fingerprints: peaks above 5 kDa are more prominent in the spectra obtained with MCA:SA matrix (Fig. 3a), in comparison to spectra with CHCA matrix (Fig. 2b). However, the database search method results in positive identification of the species in each spectrum (Tables 2 and 7). Effects of incubation time on experimentally obtained mass spectra from E. coli have been discussed in the literature [31]. Spectra from E. coli harvested after 8 and 32 hours of growth are compared in Fig. 3b and 2b. Again the overall spectral appearance is different for the two samples. Nevertheless the identification is straightforward in both cases (Tables 2 and 7). It appears that experimental factors such as choice of an "appropriate" MALDI matrix, variability in the levels of protein expression, etc. will have limited influence when microorganisms are identified by searching the proteome database.
References
1. National Center for Biotechnology Information (NIH), http://www.ncbi.nlm.nih.gov/Entrez/Gerome/org.html
2. The C. elegans Sequencing Consortium, Science 1998, 282, 2012-2018.
3. Arigoni, F.; Talabot, F.; Peitsch, M.; Edgerton, M.; Meldrum, E.; Allet, E.; Fish, R.; Jamotte, Th.; Curchod, M.-L.; Loferer, H. Nat. Biotechnology 1998, 16, 851-856.
4. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins; Baxevanis, A.; Ouellette, B., Ed.; Methods of Biochemical Analysis 39; Wiley-Interscience: New York, 1998.
5. Roepstorff, P. Curr. Opin. in Biotechnol. 1998, 8, 6-13.
6. Humphrey-Smith, I.; Blackstock, W. J. Protein Chem. 1997, 231, 1-6.
7. James, P. Biochem. Biophys. Res. Commun. 1997, 231, 1-6.
8. Kuster, B.; Mann, M. Curr. Opinions in Struct. Biology 1998, 8, 393-400.
9. Henzel, W.; Billeci, T.; Stults, J.; Wong, S.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. USA 1993, 90, 5011-5015.
10. Mann, M.; Hojrup, P.; Roepstorff, P. Biol. Mass Spectrom. 1993, 22, 338-345.
11. Pappin, D.; Hojrup, P.; Bleasby, A. Current Biology 1993, 3, 327-332.
12. James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Biochem. Biophys. Res. Commun. 1993, 195, 58-64.
13. Yates III, J.R.; McCormack, A.; Eng, J. Anal. Chem. 1996, 68, 534A-540A.
14. Fenyö, D.; Qin, J.; Chait, B. Electrophoresis 1998, 19, 998-1005.
15. prospector.uscf.edu; www.proteometrics.com; www.mann.embl-heidelberg.de/Services/PeptideSearch; chrg.inf.ethz.ch/MassSearch.html; expasy.hcuge.ch; www.sequet.dlac.uk/mowse.html
16. Jensen, O.; Podtelejnikov, A.; Mann, M. Rapid Commun. Mass Spectrom. 1996, 10, 1371-1378.
17. Mortz, E.; O'Connor, P.; Roepstorff, P.; Kelleher, N.; Wood, T.; McLafferty, F.; Mann, M. Proc. Natl. Acad. Sci. USA 1996, 93, 8264-8267.
18. Yates III, J.R.; Eng, J. US Patent No. 5,538,897 (issued July 23, 1996). "Use of Mass Spectrometry Fragmentation Patterns of Peptides to Identify Amino Acid Sequences in Databases."
19. Anhalt, J.P.; Fenselau, C. Anal. Chem. 1975, 47, 219-225.
20. Heller, D.; Fenselau, C.; Cotter, R.; Demirev, P.; Olthoff, J.; Honovich, J.; Uy, M.; Tanaka, T.; Kishimoto, Y. Biochem. Biophys. Res. Commun. 1987, 142, 194-199.
21. Mass Spectrometry for the Characterization of Microorganisms; Fenselau, C., Ed.; ACS Symposium Series 541; Am. Chem. Soc.: Washington DC, 1994.
22. Cain, T.; Lubman, D.; Weber Jr., W. Rapid Commun. Mass Spectrom. 1994, 8, 1026-1030.
23. Claydon, M.; Davey, S.; Edwards-Jones, N.; Gordon, D. Nature Biotechnology 1996, 14, 1584-1586.
24. Holland, R.; Wilkes, J.; Rafii, F.; Sutherland, J.; Person, C.; Voorhees, K.; Lay, J. Rapid Commun. Mass Spectrom.
25. Krishnamurthy, T.; Ross, P.; Rajamani, U. Rapid Commun. Mass Spectrom. 1996, 10, 883-888.
26. Arnold, R.; Reilly, J. Rapid Commun. Mass Spectrom. 1998, 12, 630-636.
27. Welham, K.; Domin, M.; Scannell, D.; Cohen, E.; Ashton, D. Rapid Commun. Mass Spectrom. 1998, 12, 176-180.
28. Haag, A.; Taylor, S.; Johnston, K.; Cole, R. J. Mass Spectrom. 1998, 33, 750-756.
29. Wang, Z.; Russon, L.; Li, L.; Roser, D.; Long, S. R. Rapid Commun. Mass Spectrom. 1998, 12, 456-464.
30. Dai, Y.; Li, L.; Roser, D.; Long, S. R. Rapid Commun. Mass Spectrom. 1999, 13, 73-78.
31. Arnold, R.; Reilly, J. A Study of Bacterial Culture Growth by MALDI-MS of Whole Cells; Proceedings of the 46th ASMS Conference on Mass Spectrometry and Allied Topics, Orlando, Florida, May 31-June 4, 1998, p. 180.
32. Birmingham, J.; Demirev, P.; Ho, Y-P.; Thomas, J.; Bryden, W.; Fenselau, C. Rapid Commun. Mass Spectrom. 1999, 13, 604-606.
33. Das, S.; Yu, L.; Gaitatzes, C.; Roger, R.; Freeman, J.; Bienkowska, J.; Adams, R.M.; Smith, T.F. Nature 1997, 385, 29-30.
Without further elaboration, it is believed that one skilled in the art can, using the preceding description, utilize the present invention to its fullest extent. The preceding preferred specific embodiments are, therefore, to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever. The entire disclosures of all applications, patents, and publications cited above and in the figures are hereby incorporated in their entirety by reference, including Demirev et al., Anal. Chem., 71:2732-2738, 1999.
From the foregoing description, one skilled in the art can easily ascertain the essential characteristics of this invention, and without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions.
Table 1. Ranking of organisms according to matched peaks in B. subtilis spectrum (Fig. 2.a).
Figure imgf000022_0001
*Only organisms with more than one matching peak (within ± 3 Da) are listed.
Table 2. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 2.b).
Figure imgf000023_0001
*Only organisms with more than one matching peak (within ± 3 Da) are listed.
Table 3. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 1.b of Ref. 29).
Figure imgf000024_0001
*Only organisms with more than one matching peak (within ± 5 Da) are listed.
Table 4. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 1.a of Ref. 30).
Figure imgf000025_0001
*Only organisms with more than one matching peak (within ± 5 Da) are listed.
Table 5. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 4 of Ref. 26).
Figure imgf000026_0001
*Only organisms with more than one matching peak (within ± 5 Da) are listed.
Table 6. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 3.a).
Figure imgf000027_0001
*Only organisms with more than one matching peak (within ± 3 Da) are listed.
Table 7. Ranking of organisms according to matched peaks in E. coli spectrum (Fig. 3.b).
Figure imgf000028_0001
*Only organisms with more than one matching peak (within ± 3 Da) are listed.
Table 8. Ranking of organisms according to matched peaks in spectrum of B. subtilis and E. coli mixture (Fig. 5).
Figure imgf000029_0001
*Only organisms with more than one matching peak (within ± 3 Da) are listed.
TABLE 9
Figure imgf000030_0001

Claims

1. A method of identifying one or more unknown microorganisms in a sample, comprising: searching a sequence database for a plurality of proteins that are predicted to have the molecular weights of proteins in a mass spectrum of a sample, wherein said sample comprises a plurality of proteins from one or more unknown microorganisms, and said database is searched for more than one different protein, whereby said one or more unknown microorganisms are identified.
2. A method of claim 1, wherein the sequence database is a protein sequence database.
3. A method of claim 1, wherein the sequence database is a nucleotide sequence database.
4. A method of claim 1, wherein the mass spectrometry data is MALDI-TOF data.
5. A method of claim 1, wherein the mass spectrometry data is obtained by electrospray on a time-of-flight, quadrupole, or ion trap mass analyzer.
6. A method of claim 1, wherein the sample comprises chemically or enzymatically digested polypeptide fragments.
7. A method of claim 1, further comprising: performing a mass spectral analysis on a sample comprising one or more microorganisms.
8. A method of claim 1, further comprising: identifying molecular weights of proteins in a mass spectrum of said sample.
9. A method of claim 1, wherein said sample comprises at least two different species of microorganisms.
10. A method of claim 1, wherein the sequence database is the NCBI/SwissProt/EMBL database.
11. A method of claim 1, further comprising chemical or enzymatic digestion of a protein in said sample.
PCT/US1999/027191 1998-11-17 1999-11-17 Methods for identifying and classifying organisms by mass spectrometry and database searching WO2000029987A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US09/856,044 US7020559B1 (en) 1998-11-17 1999-11-17 Methods for identifying and classifying organisms by mass spectrometry and database searching
AU19150/00A AU1915000A (en) 1998-11-17 1999-11-17 Methods for identifying and classifying organisms by mass spectrometry and database searching

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US10869698P 1998-11-17 1998-11-17
US60/108,696 1998-11-17
US12067999P 1999-02-19 1999-02-19
US60/120,679 1999-02-19

Publications (1)

Publication Number Publication Date
WO2000029987A1 (en) 2000-05-25

Family

ID=26806169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/027191 WO2000029987A1 (en) 1998-11-17 1999-11-17 Methods for identifying and classifying organisms by mass spectrometry and database searching

Country Status (2)

Country Link
AU (1) AU1915000A (en)
WO (1) WO2000029987A1 (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10038694A1 (en) * 2000-07-28 2002-02-14 Anagnostec Ges Fuer Analytisch Identification of micro-organisms, using MALDI-TOF-MS, uses a synthetic reference spectrum for comparison with the mass spectrum to register the micro-organism under study
WO2002027329A2 (en) * 2000-09-25 2002-04-04 Eastern Virginia Medical School Biomarkers of transitional cell carcinoma of the bladder
WO2003074727A1 (en) * 2002-03-01 2003-09-12 De Montfort University Rapid identification of yeasts
US6680203B2 (en) 2000-07-10 2004-01-20 Esperion Therapeutics, Inc. Fourier transform mass spectrometry of complex biological samples
EP1437673A1 (en) 2003-01-07 2004-07-14 AnagnosTec, Gesellschaft für Analytische Biochemie und Diagnostik mbH Method for the identification of microorganisms by mass spectrometry
US6800449B1 (en) 2001-07-13 2004-10-05 Syngenta Participations Ag High throughput functional proteomics
US7061605B2 (en) 2000-01-07 2006-06-13 Transform Pharmaceuticals, Inc. Apparatus and method for high-throughput preparation and spectroscopic classification and characterization of compositions
US7108970B2 (en) 2000-01-07 2006-09-19 Transform Pharmaceuticals, Inc. Rapid identification of conditions, compounds, or compositions that inhibit, prevent, induce, modify, or reverse transitions of physical state
US7133864B2 (en) 2001-08-23 2006-11-07 Syngenta Participations Ag System and method for accessing biological data
CN100364355C (en) * 2004-07-29 2008-01-23 大唐移动通信设备有限公司 Method for obtaining sector coverage by using tiled measured data
US7811772B2 (en) 2005-01-06 2010-10-12 Eastern Virginia Medical School Apolipoprotein A-II isoform as a biomarker for prostate cancer
WO2011123479A1 (en) * 2010-03-29 2011-10-06 Academia Sinica Quantitative measurement of nano / micro particle endocytosis with cell mass spectrometry
WO2012044170A1 (en) * 2010-10-01 2012-04-05 Nederlandse Organisatie Voor Toegepast-Natuurwetenschappelijk Onderzoek Tno New classification method for spectral data
CN107782886A (en) * 2011-04-21 2018-03-09 生物梅里埃公司 Use the method for at least one cephalosporin resistance mechanism of Mass Spectrometer Method
CN111257404A (en) * 2016-01-14 2020-06-09 萨默费尼根有限公司 Method for top-down multiplexed mass spectrometry of mixtures of proteins or polypeptides

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5538897A (en) * 1994-03-14 1996-07-23 University Of Washington Use of mass spectrometry fragmentation patterns of peptides to identify amino acid sequences in databases
US5809212A (en) * 1993-07-12 1998-09-15 New York University Conditional transition networks and computational processes for use interactive computer-based systems
US5869240A (en) * 1995-05-19 1999-02-09 Perseptive Biosystems, Inc. Methods and apparatus for sequencing polymers with a statistical certainty using mass spectrometry
US5930803A (en) * 1997-04-30 1999-07-27 Silicon Graphics, Inc. Method, system, and computer program product for visualizing an evidence classifier
US5930784A (en) * 1997-08-21 1999-07-27 Sandia Corporation Method of locating related items in a geometric space for data mining
US5977890A (en) * 1997-06-12 1999-11-02 International Business Machines Corporation Method and apparatus for data compression utilizing efficient pattern discovery
US5986652A (en) * 1997-10-21 1999-11-16 International Business Machines Corporation Method for editing an object wherein steps for creating the object are preserved
US5987470A (en) * 1997-08-21 1999-11-16 Sandia Corporation Method of data mining including determining multidimensional coordinates of each item using a predetermined scalar similarity value for each item pair

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7108970B2 (en) 2000-01-07 2006-09-19 Transform Pharmaceuticals, Inc. Rapid identification of conditions, compounds, or compositions that inhibit, prevent, induce, modify, or reverse transitions of physical state
US7061605B2 (en) 2000-01-07 2006-06-13 Transform Pharmaceuticals, Inc. Apparatus and method for high-throughput preparation and spectroscopic classification and characterization of compositions
US6680203B2 (en) 2000-07-10 2004-01-20 Esperion Therapeutics, Inc. Fourier transform mass spectrometry of complex biological samples
DE10038694C2 (en) * 2000-07-28 2003-09-25 Anagnostec Ges Fuer Analytisch Process for the identification of microorganisms using MALDI-TOF-MS
DE10038694A1 (en) * 2000-07-28 2002-02-14 Anagnostec Ges Fuer Analytisch Identification of micro-organisms, using MALDI-TOF-MS, uses a synthetic reference spectrum for comparison with the mass spectrum to register the micro-organism under study
WO2002027329A3 (en) * 2000-09-25 2003-08-07 Eastern Virginia Med School Biomarkers of transitional cell carcinoma of the bladder
WO2002027329A2 (en) * 2000-09-25 2002-04-04 Eastern Virginia Medical School Biomarkers of transitional cell carcinoma of the bladder
US6800449B1 (en) 2001-07-13 2004-10-05 Syngenta Participations Ag High throughput functional proteomics
US7133864B2 (en) 2001-08-23 2006-11-07 Syngenta Participations Ag System and method for accessing biological data
WO2003074727A1 (en) * 2002-03-01 2003-09-12 De Montfort University Rapid identification of yeasts
EP1437673A1 (en) 2003-01-07 2004-07-14 AnagnosTec, Gesellschaft für Analytische Biochemie und Diagnostik mbH Method for the identification of microorganisms by mass spectrometry
CN100364355C (en) * 2004-07-29 2008-01-23 大唐移动通信设备有限公司 Method for obtaining sector coverage by using tiled measured data
US7811772B2 (en) 2005-01-06 2010-10-12 Eastern Virginia Medical School Apolipoprotein A-II isoform as a biomarker for prostate cancer
WO2011123479A1 (en) * 2010-03-29 2011-10-06 Academia Sinica Quantitative measurement of nano / micro particle endocytosis with cell mass spectrometry
WO2012044170A1 (en) * 2010-10-01 2012-04-05 Nederlandse Organisatie Voor Toegepast-Natuurwetenschappelijk Onderzoek Tno New classification method for spectral data
EP2439536A1 (en) * 2010-10-01 2012-04-11 Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO New classification method for spectral data
CN107782886A (en) * 2011-04-21 2018-03-09 生物梅里埃公司 Use the method for at least one cephalosporin resistance mechanism of Mass Spectrometer Method
CN111257404A (en) * 2016-01-14 2020-06-09 萨默费尼根有限公司 Method for top-down multiplexed mass spectrometry of mixtures of proteins or polypeptides

Also Published As

Publication number Publication date
AU1915000A (en) 2000-06-05

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 09856044

Country of ref document: US

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase