WO2005036369A2

WO2005036369A2 - Database for microbial investigations

Info

Publication number: WO2005036369A2
Application number: PCT/US2004/033742
Authority: WO
Inventors: Kumar Hari; John Mcneil; Dave Ecker; Neill White; Vivek Samant; Vanessa Zapp; Alan Goates
Original assignee: Isis Pharmaceuticals, Inc.
Priority date: 2003-10-09
Filing date: 2004-10-12
Publication date: 2005-04-21
Also published as: WO2005036369A3

Abstract

A centralized system provides for locating, storing and providing microbial-related information, including pathogens, diseases, symptoms, genetic information, and related documentation. A centralized database relates pathogens to diseases that they cause, symptoms of those diseases, genetic information about the pathogens, and documentation that is associated with any of the above. By analyzing the stored data, the system recognizes duplicated information, for example a pathogen present in the database twice, each time under a different taxonomical name, and creates a link from one instance of the pathogen to the other. The system additionally identifies common genes and sequences among related organisms. A researcher uses the system to perform investigations. For example, a medical doctor can enter patient symptoms to obtain a list of consistent diseases and associated pathogens. The system also tracks the progression of an outbreak and identifies similar occurrences by searching the World Wide Web and reporting results.

Description

DATABASE FOR MICROBIAL INVESTIGATIONS

Inventors: Kumar Hari, John McNeil, Dave Ecker, Neill White, Vivek Samant, Vanessa Zapp, Alan Goates

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of United States Provisional Application No. 60/509,911, filed October 9, 2003, and United States Provisional Application No. 60/598,408, filed August 2, 2004; each of which is incorporated by reference herein in its entirety.

STATEMENT OF GOVERNMENT SUPPORT

[0002] This invention was made with United States Government support under FBI contracts J-FBI-02-127 and J-FBI-04-189. The United States Government may have certain rights in the invention.

BACKGROUND OF THE INVENTION

Field Of The Invention

[0003] This invention relates generally to the field of microbial investigations. In particular, this invention is directed towards a centralized computerized investigative system designed to assist researchers and clinicians with identifying and treating pathogens and related data.

Description Of The Related Art

[0004] The number of organisms that pose an infection risk, both bacterial and viral, to humans is amazingly large. A recent literature review identified 1,415 species of infectious organisms pathogenic to humans, including 217 viruses and prions, 538 bacteria and rickettsia, 307 fungi, 66 protozoa and 287 helminths (Taylor et al., Phil. Trans. R. Soc. Lond. B (2001) 356, 983). In addition, threats to the food supply, i.e. to crops and farm animals, and the environment also exist in abundance. Further, there are many strain variants below the species level, and organisms that are phylogenetically closely related to known infectious agents from which bioengineered or emerging infectious agents might originate. This includes the possibility of organisms shifting from animal to human hosts.

[0005] In addition to the bioagents themselves, there are many threat-related genes that encode toxins and antibiotic resistance and that mediate virulence, host range and pathogenicity. Many of these genes are important to consider in the identification and treatment of bioengineered threats.

[0006] In spite of the above, no efficient and reliable methodology exists for bringing together information about these threats in one centralized location, largely because the solution is far from trivial. For example, there are many different government, medical and veterinary sources of "threat lists". Typically, threat lists contain common or sometimes misspelled bioagents, rather than the accepted taxonomical designations, if an accepted designation even exists. Naming ambiguities extend to disease lists as well, where common names are often used, and the diseases may themselves be caused by multiple organisms. Pathogens are often named after the disease they cause— for example, severe acute respiratory syndrome (SARS), and foot-and-mouth disease are both pathogen names that describe the effect on their victims rather than inherently describing the pathogen. Disease and pathogen names also can vary depending on the host. Further, taxonomies are legitimately derived from multiple sources, including organism characteristics; disease effects; and molecular characteristics, e.g., rRNA sequence. Disease information and sequence data are often linked to different taxa. For example, viral sequences are most completely linked to the National Center for Biotechnology Information's (NCBI) taxonomy, while viral diseases and characteristics are most completely linked to the International Committee on Taxonomy of Viruses (ICTV) taxonomy.

[0007] Today, a number of individuals with specific local domain expertise are required to manually connect threat lists, biological agents, correct taxonomic names, and correct sequences in a genetic sequence database such as the National Institutes of Health's GenBank, since there is no comprehensive collection of threats, synonyms and consistent taxonomic names.

[0008] The benefit of having coordinated access to these disparate sources of data is substantial— clinical investigations of infections, for example, would be improved by allowing clinicians to cross-reference symptoms and epidemiology data with data already available from other sources. Today, that data may exist but only in the hands of experts doing research in discrete subject areas, effectively turning a clinician's search into something akin to searching for a needle in a haystack. The history of the recent SARS outbreak is an illustrative example.

[0009] Accordingly, there is a need for a centralized microbial knowledge base system that provides efficient access to a diverse array of expert-curated data.

SUMMARY OF THE INVENTION

[0010] The present invention enables a centralized system for locating, storing and providing microbial-related information, including pathogens, diseases, symptoms, genetic information, and related documentation.

[0011] The present invention maintains a database that relates pathogens to diseases that they cause, symptoms of those diseases, genetic information about the pathogens, and documentation that is associated with any of the above. By analyzing the stored data, the system recognizes duplicated information, for example a pathogen present in the database twice, each time under a different taxonomical name, and creates a link from one instance of the pathogen to the other. The system additionally identifies common genes and sequences among related organisms.

[0012] A user such as a clinician or researcher can use a system of the present invention to perform investigations. For example, a medical doctor can enter patient symptoms into the system to obtain a list of diseases consistent with the entered symptoms, and pathogens associated with each disease. The system can also be used, for example, to track the progression of an outbreak and identify similar occurrences by searching the World Wide Web on a nightly, automated basis and reporting results such as news articles, published research and the like to the doctor, who can then validate the new data, making it part of the system's database.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] Fig. 1 illustrates a system for performing microbial investigations in accordance with an embodiment of the present invention.

[0014] Fig. 2 illustrates a pathogen report in accordance with an embodiment of the present invention.

[0015] Fig. 3 illustrates an example of a query screen presented to a user in accordance with an embodiment of the present invention.

[0016] Fig. 4 provides an illustration of a pathogen report in accordance with an embodiment of the present invention.

[0017] Fig. 5 illustrates an NCBI-source pathogen page in accordance with an embodiment of the present invention.

[0018] Fig. 6 illustrates an example of a disease report in accordance with an embodiment of the present invention.

[0019] Fig. 7 illustrates a disease/host report in accordance with an embodiment of the present invention.

[0020] Fig. 8 illustrates a use of a list-based database query in accordance with an embodiment of the present invention.

[0021] Fig. 9 illustrates a method for configuring a database in accordance with an embodiment of the present invention.

[0022] Fig. 10 illustrates a method for executing a search in accordance with an embodiment of the present invention.

[0023] The figures depict preferred embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

System Architecture

[0024] Referring now to Fig. 1, there is shown a block diagram illustrating a system in accordance with one embodiment of the present invention. System 100 includes a user interface (UI) module 104 for communicating with a user 102 of the system; a query constructor module 106 for building search queries; a network access module 108 for executing the queries and searching a network such as the Internet for results of the query; a data analysis engine 110 for reviewing the results received from the search and compiling them for display to the user 102 via user interface module 104; and a Microbial Rosetta Stone (MRS) database 114 for storing the returned search results. MRS database 114 thus is an effective reference database of microbial organisms, diseases, genomics, threat management and forensics. Also illustrated in Fig. 1 are three example data sources 112a, 112b, and 112c, located across an Internet connection from system 100. As will be appreciated by those of skill in the art, more or fewer data sources 112 could be searched by system 100, and the inclusion of three data sources in Fig. 1 is purely for illustration.

[0025] User interface module 104 enables communication between system 100 and a user 102 of the system. In one embodiment, UI module 104 includes a web server, and user 102 accesses the web server using a conventional web browser such as Internet Explorer by Microsoft Corporation of Redmond, Washington, or Firefox by The Mozilla Corporation of Mountain View, California. In an alternative embodiment, user 102 uses client software developed specifically to interact with UI module 104; and in another embodiment user 102 interacts directly with UI module 104 by being physically located at the same location as system 100.

[0026] In one embodiment, UI module 104 includes a Java-based expert interface called a pluggable object viewer (POV), created by and available as open source from Isis Pharmaceuticals of Carlsbad, California. This framework allows for hierarchical browsing of data from any source that can be wrapped into a Java object as defined by a plug-in level API. The framework keeps track of lists of these objects and allows for querying related objects by any combination of a related list and a set of filters upon the data object in question. The logic of the query is always available to the user and can be branched out from any point to create alternate search paths. Queries of user-defined complexity and depth can be annotated, saved as XML files and re-run or edited at a later time. The lists of results can be manually manipulated or combined with other lists using set logic (union, intersection, difference, etc.). Result lists can display any level of depth for a query from a single result to an entire history for the query that produced those results. Individual items in the list can display further detail when selected, and the display can be simple text output, more complex HTML or even graphical in nature. HTML is preferably used not only for display formatting, but for the ability to embed URLs to other public data sources, allowing UI module 104 to link from within system 100 to externally available information. Examples of a typical user interface in accordance with embodiments of the present invention are described below.

System Operation

[0027] In a preferred embodiment, system 100 is initially configured by placing available information in MRS database 114, and associating pathogen, disease and gene sequence data wherever possible.

[0028] A list of known microbial threats is assembled from available sources. Depending on the source, a threat list might include pathogens, diseases, or pathogen/disease pairings. Once assembled, the pathogens associated with each list are standardized according to the correct internationally accepted taxonomic names. The threat list and associated pathogens are stored in MRS database 114. Next, a comprehensive collection of disease synonyms is loaded for each of the agents listed on the threat lists. In addition, a list of gene sequences and gene synonyms is also loaded into MRS database 114, including sequences that identify individual genotypes of the pathogens. MRS database 114 maintains an association between the disease synonyms, the genetic information, and pathogens. Note that in some cases, not all data is available for each pathogen— for example, genetic data might not yet be available. This does not present a problem for system 100, as the data is not a required element, and optionally can be added to MRS database 114 later, as the information becomes available.

[0029] Additionally, available publications that describe the threat list pathogens are preferably indexed and associated in MRS database 114 with the pathogens, diseases, and other elements of the database that they discuss.

[0030] In a preferred embodiment, initial data is loaded into system 100 in a variety of ways, including manual entry; bulk import using computational parsers and data validation tools; expert curation facilities; and from other database management systems (DBMS). Network access module 108 preferably includes automated data import scripts for loading taxonomic lineages and nucleic acid sequences from a source such as the National Center for Biotechnology Information (NCBI). In one embodiment, network access module 108 also imports data automatically from other microbial databases, including curated data from publicly available databanks such as the Swiss-Prot protein sequence database maintained by the Swiss Institute of Bioinformatics, the Protein Families (Pfam) database of alignments maintained by the Wellcome Trust's Sanger Institute, the Kyoto Encyclopedia of Genes and Genomes (KEGG) maintained by the Kyoto University Bioinformatics Center, and the Gene Ontology (GO) project maintained by the Gene Ontology Consortium.

[0031] In one embodiment, network access module 108 also imports data automatically from government sites, including the Centers for Disease Control (CDC); the United States Department of Agriculture (USDA); PBR (Pox Virus Database); ProMED, and the Department of Health and Human Services (HHS).

[0032] In one embodiment, data is also loaded from the International Committee on the Taxonomy of Viruses (ICTV), including data for viral isolates. Preferably, internationally-accepted taxonomy standards for viral organisms are used.

[0033] Where available, system 100 associates each of the pieces of initial data with other related data such as bacterial taxonomy data from NCBI or ortholog data from various publications.

[0034] Because, as noted, naming ambiguities may exist across different data sources, some analysis is preferably undertaken by data analysis engine 110 to determine which imported entries are duplicates of other imported entries, or, similarly, which seemingly disparate pieces of data in fact relate to a common pathogen.

[0035] Data ambiguities within the MRS database 114 are preferably resolved by creating additional tables in the database schema. These additional linking tables allow "many-to-many"-style relationships to be defined that otherwise would not be available in a relational schema. For example, one naming ambiguity involves three GenBank nucleotide sequences (X52374, X52505 and X52506) that were linked to the Berne virus at NCBI and to the Equine torovirus at ICTV. Despite the similar sequence associations, searching NCBI with the name "Equine torovirus" yields no results. To alert users that this naming ambiguity exists while maintaining the taxonomic context for both designations, system 100 creates a new table to connect the differing NCBI and ICTV names that, by sequence associations, appear to describe the same organism. Data relationships stored in this linking table are then preferably brought to the user's attention via user interface module 104, as illustrated in Fig. 2. In Fig. 2, the NCBI pathogen name, "Berne virus" 202 is listed as the Pathogen Name, and the taxonomy source 206 field indicates that the source is the NCBI 208 database. Under the equivalent pathogen section 210 is an indication of the corresponding ICTV name, equine torovirus 212. A similar linking table architecture is preferably used to track synonyms for pathogens and disease names, and to provide data source referencing for each name. Links to documents and keywords are also included to allow referencing of threat list, disease, epidemiology, formulation, transmission, forensic, protocol, characteristic and sequence data.
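
The linking-table idea in paragraph [0035] can be illustrated with a minimal sketch using Python's standard sqlite3 module. The table and column names (pathogen_name, sequence_link, equivalent_pathogen) are hypothetical and are not taken from the MRS schema; only the Berne virus / Equine torovirus example and the three GenBank accession numbers come from the text above. The sketch links names from different taxonomy sources whenever they share at least one sequence accession.

```python
import sqlite3

ACCESSIONS = ["X52374", "X52505", "X52506"]  # accessions cited in the text


def build_demo_db():
    """Create an in-memory database with hypothetical tables and the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pathogen_name (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            source TEXT NOT NULL            -- taxonomy source, e.g. NCBI or ICTV
        );
        CREATE TABLE sequence_link (
            pathogen_id INTEGER REFERENCES pathogen_name(id),
            accession TEXT NOT NULL
        );
        -- Linking table: one row per direction of an equivalence.
        CREATE TABLE equivalent_pathogen (
            pathogen_id INTEGER REFERENCES pathogen_name(id),
            equivalent_id INTEGER REFERENCES pathogen_name(id),
            PRIMARY KEY (pathogen_id, equivalent_id)
        );
    """)
    conn.executemany(
        "INSERT INTO pathogen_name VALUES (?, ?, ?)",
        [(1, "Berne virus", "NCBI"), (2, "Equine torovirus", "ICTV")],
    )
    conn.executemany(
        "INSERT INTO sequence_link VALUES (?, ?)",
        [(pid, acc) for pid in (1, 2) for acc in ACCESSIONS],
    )
    return conn


def link_by_shared_sequences(conn):
    """Record names from different sources that share a sequence accession."""
    pairs = conn.execute("""
        SELECT DISTINCT a.pathogen_id, b.pathogen_id
        FROM sequence_link a
        JOIN sequence_link b ON a.accession = b.accession
        JOIN pathogen_name pa ON pa.id = a.pathogen_id
        JOIN pathogen_name pb ON pb.id = b.pathogen_id
        WHERE pa.source <> pb.source
    """).fetchall()
    conn.executemany(
        "INSERT OR IGNORE INTO equivalent_pathogen VALUES (?, ?)", pairs
    )


def equivalents(conn, name):
    """Return (name, source) pairs recorded as equivalent to the given name."""
    return conn.execute("""
        SELECT p2.name, p2.source
        FROM pathogen_name p1
        JOIN equivalent_pathogen e ON e.pathogen_id = p1.id
        JOIN pathogen_name p2 ON p2.id = e.equivalent_id
        WHERE p1.name = ?
        ORDER BY p2.name
    """, (name,)).fetchall()
```

Because each name keeps its own row and source, both taxonomic contexts are preserved, and the equivalence lives only in the linking table, which is the property the paragraph describes.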

[0036] Once the initial data is loaded and validated, system 100 then begins an automated curation of the data. First, system 100 ensures that pathogens are associated with diseases that they may cause. Next, if available, a link is established from each pathogen to its genomic sequence. System 100 ensures a detailed and accurate taxonomic mapping of disease/threat organism by linking each included pathogen to its proper taxa, and each disease to the pathogen that causes it. Also, if available, system 100 creates a link from the pathogen to additional information on pathogen gene products and functions, including, for example, start and stop codons, protein coding regions, regions where PCR primers can bind, etc.

[0037] As part of the data validation process, system 100 attempts to identify organisms that are annotated as being different, but which are in fact the same. Preferably, system 100 includes scripts that make the identification of duplicate entries based on names and synonyms that define the threat agent in MRS database 114. For sequences, duplicate identification is preferably made by comparing and mapping features, e.g., exons, introns, untranslated regions (UTRs), primer binding sites, etc., from one sequence to another. System 100 then preferably checks to make sure that each of the pathogens identified as being duplicates of one another are in fact the same. Where they are not, they are unlinked from one another.
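
As a rough illustration of name-and-synonym matching of the kind paragraph [0037] describes, the following Python sketch groups records that share any normalized name. The normalization rule, the record identifiers, and the function names are assumptions made for this example; the text does not specify how the scripts of system 100 compare names. The groups it returns are only candidates, since the paragraph calls for a later check that unlinks false duplicates.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lower-case a name and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.casefold()).strip()


def candidate_duplicates(records):
    """Group record ids that share at least one normalized name or synonym.

    records maps a record id to an iterable of names and synonyms.
    Returns a sorted list of sorted id lists, one per group with 2+ members.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # normalized name -> first record id that used it
    for rid, names in records.items():
        for raw in names:
            key = normalize(raw)
            if not key:
                continue
            if key in first_seen:
                parent[find(rid)] = find(first_seen[key])
            else:
                first_seen[key] = rid

    groups = defaultdict(list)
    for rid in records:
        groups[find(rid)].append(rid)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical record ids; the names echo organisms mentioned in this document.
example_records = {
    "ncbi:1": ["Foot-and-mouth disease virus", "FMDV"],
    "list:7": ["foot and mouth disease virus"],
    "ncbi:2": ["Bacillus anthracis"],
}
```

Calling candidate_duplicates(example_records) groups the two foot-and-mouth entries, whose names differ only in case and hyphenation, and leaves the Bacillus anthracis record on its own.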

[0038] In addition to identifying duplicates and false duplicates, system 100 also identifies common genes and sequences among related organisms, by, for example, performing analyses of sequence similarities at the nucleotide or amino acid level. Such analyses are carried out using, e.g., available software utilities such as BLAST programs (basic local alignment search tools) and PowerBLAST programs known in the art (Altschul et al., J. Mol. Biol., 1990, 215, 403-410; Zhang and Madden, Genome Res., 1997, 7, 649-656), or other alignment utilities. For example, data analysis engine 110 preferably analyzes sequences and creates gene sequence alignments for hundreds or thousands of sequences at once. In comparing sequence features, data analysis engine 110 looks for 100% matches to a given DNA sequence feature, for example "Feature X = ACGTACGT". If the sequence feature is found in a full sequence but the feature name has not been mapped to the 100%-matched region, data analysis engine 110 designates that sequence as having "Feature X", and defines where in the full sequence it is found. Having identified duplicate, false duplicate and related organisms, in one embodiment system 100 then automatically notifies curators of inconsistent data or other errors it has discovered. Where possible, in a preferred embodiment, system 100 resolves the inconsistencies automatically; alternatively, curators are alerted to resolve the inconsistencies manually.
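
The exact-match feature mapping in paragraph [0038] can be sketched in a few lines of Python. This is a simplified illustration that uses plain substring search rather than BLAST; the function name, the annotation layout (a dictionary of feature names to 0-based start positions), and the sequence identifiers are assumptions for this example. Only "Feature X = ACGTACGT" comes from the text.

```python
def map_feature(feature_name, feature_seq, sequences, annotations):
    """Annotate every sequence that contains a 100% match to feature_seq.

    sequences maps a sequence id to its full nucleotide string.
    annotations maps a sequence id to {feature name: [0-based start positions]}
    and is updated in place. Sequences that already carry the feature name
    are left untouched. Returns only the annotations added by this call.
    """
    added = {}
    for seq_id, seq in sequences.items():
        existing = annotations.setdefault(seq_id, {})
        if feature_name in existing:
            continue  # the feature name is already mapped on this sequence
        positions = []
        start = seq.find(feature_seq)
        while start != -1:
            positions.append(start)
            # Advance by one so overlapping occurrences are also found.
            start = seq.find(feature_seq, start + 1)
        if positions:
            existing[feature_name] = positions
            added[seq_id] = positions
    return added


# Hypothetical sequences used only to exercise the function.
example_sequences = {"seqA": "TTACGTACGTGG", "seqB": "GGGGCCCC"}
```

With an empty annotations dictionary, map_feature("Feature X", "ACGTACGT", example_sequences, annotations) marks seqA as carrying the feature at position 2 and leaves seqB unannotated.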

[0039] In one embodiment, MRS database 114 is also initially provided with documents such as studies, academic research papers, clinical trial data, news articles or other relevant materials relating to diseases and pathogens. For each document, some metadata is created either manually or automatically about the document. In one embodiment, for example, keywords are automatically derived from the document using conventional document analysis technology such as the ht://Dig product available from the ht://Dig group, and the document is then indexed according to its keywords. Those of skill in the art will appreciate that other methods exist for indexing documents, including by performing latent semantic analysis, manual indexing, etc., any of which may be appropriate for a given implementation of system 100. Once the documents are indexed, they are linked to appropriate pathogens and/or diseases and/or genetic data. For example, a document describing the identification of the Ames strain of Bacillus anthracis as the threat agent in the 2001 Anthrax attacks is stored in MRS database 114 and linked to disease, pathogen, two taxonomic lineages, other authors who have contributed papers about Bacillus anthracis, relevant gene sequences, and related keywords.

Queries

[0040] A user who wishes to obtain some knowledge about a pathogen, set of symptoms, or other data tracked by system 100 makes a search request by interacting with UI module 104, as described further below. UI module 104 passes the search request received from the user to query constructor module 106, which then constructs a query that can be used to search both MRS database 114 and external data sources 112c, e.g., via a search of the Internet. Data analysis engine 110 correlates the responses to the searches and provides them to user 102 via UI module 104.

User Interface

[0041] Fig. 3 illustrates an example of a query screen 300 presented to user 102 via UI module 104 in an embodiment of the present invention. Query screen 300 preferably allows a user 102 to type a global free-form query into a text box 302 and click the "find" button 304 to perform a search. Additionally, the user 102 can choose to filter the search results according to filter criteria 306. In the illustrated embodiment of Fig. 3, filter criteria 306 includes Contact, Disease, Document, Forensics, Pathogen, Sequence, Threat List, Genome, and Characteristic fields. Each field additionally includes Boolean statements by which to filter. For example, to filter by contact, a user 102 can choose to filter by the contact's last name or first name. Those of skill in the art will appreciate that a multitude of filtering options can be used in connection with inputting a query.

[0042] Fig. 4 provides an illustration of a pathogen report 400. Report 400 could be obtained, for example, when a user 102 types "Foot and mouth" into text box 302 in order to do a full database search; or by entering "foot and mouth" in a synonym search box from a pathogen search page (not shown). A list of results matching the criteria is then returned, and the user can click on a hyperlink to one of the results to view the report. Pathogen reports 400 preferably display salient data for microorganisms, including pathogen names, taxonomic rank with information source, and synonyms. Taxonomic lineages that support upward and downward tree traversals are also shown. The display of "Equivalent Pathogen(s)" 402 indicates that alternative taxonomic lineages exist for the organism. Diagnostic methods, laboratories capable of performing them, organizations which supply reagents, and links to relevant protocol references are listed in the "Capability Information" section 404 of pathogen report 400. Clicking on the "foot-and-mouth disease virus (NCBI)" link 406, for example, links to Fig. 5, which is the NCBI-source pathogen page for foot and mouth disease.

[0043] Fig. 6 illustrates an example of a disease report 600 in accordance with an embodiment of the present invention. A disease report 600 preferably includes a description of host-pathogen associations, symptoms and treatments associated with the disease. In one embodiment, the report includes hyperlinks to other related data. For example, in Fig. 6, clicking on "glycine max" 602 would take a user to the pathogen page for the phakopsora pachyrhizi pathogen.

[0044] Fig. 7 illustrates a disease/host report 700 in accordance with an embodiment of the present invention. The disease/host report describes each unique host/pathogen correlation found as a result of the query search. In the illustrated embodiment, epidemiological information 702 is also displayed.

[0045] Other reports can also be provided. For example, in one embodiment a forensics report can be produced by system 100, including a history of the pathogen with links to seminal publications, and details about cases where the pathogen may have been used in a terrorist event.

[0046] Fig. 8 illustrates a way in which a list-based query can be made of the database using the Java-based POV interface. The user-configured Query Panel 802 displays: 1) data types that can be used in searches; 2) the number of results from a search ("7 Threat Organism(s)" or "83 Sequence(s)") with user-defined annotations ("Use these sequences"); and 3) data types available for additional searches based on a given result. The Result Panel 804 displays the data relationships between lists of results (here, the 7 Threat Organisms and the 83 Sequences). Details for any highlighted information in the Result Panel are shown in the Inspector Panel 806. Web links and other tools, e.g., a sequence "Feature Viewer" tool 808, are preferably embedded in the Inspector Panel 806.

[0047] For example, a medical doctor may recognize that five patients arrive in the clinic with similar symptoms of difficulty breathing, cough, and fever of 100.5 F. In order to understand more about these symptoms and how to diagnose the disease, the doctor logs into system 100 via UI module 104, clicks on the "Disease" link, and enters the symptoms into the "Symptoms" search field. A list of diseases is displayed, and by viewing the individual Disease reports, the etiologic agents, hosts, symptoms and epidemiologic properties are displayed. By clicking on the linked Pathogens in the Disease reports, the doctor can find microbiology lab protocols that may be used to uniquely identify the organism, diagnose the disease and possibly guide therapeutic intervention. Next, the doctor may wish to have the system track the progression of this outbreak and identify similar occurrences by searching the World Wide Web on a nightly, automated basis. To do so, the doctor enters the POV user interface, and constructs a query by entering the symptoms he has identified into a "Symptom" search and running the query. Using the results, he searches for all Diseases associated with those symptoms by running a second search, and then finds all pathogens which can cause those diseases by running a third search. Finally, the doctor may wish to find data on molecular diagnostic techniques, which he does by running a fourth search on "Capability" where he limits this search with the text "nucleic acid". While having similar content as the information in the web interface search, results of this POV query are preferably displayed in a table format, and the doctor can save the query logic as a template that can be used to automatically search the World Wide Web. This automated WWW search tool can be specified to run at designated intervals against the MRS database 114. 
The query results are then used to generate search terms that are used by a second tool to search the web in a conventional fashion, e.g., via the Google search engine. Data gathered from the web that matches the MRS results is e-mailed to the doctor as web pages, and the doctor can tell the tool to load into the MRS database 114 any or all of the new data that was found. The new information may be loaded into the system as Documents; keywords are indexed by existing tools as described above and linked to each Document, so that the next time the doctor runs a search for diseases with symptoms of difficulty breathing, cough, and fever of 100.5 F, the new data will be displayed.
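
The chained POV query in paragraph [0047] (symptoms to diseases, diseases to pathogens, pathogens to capabilities limited by a text filter) can be sketched as follows. This is a toy Python illustration, not the Java-based POV interface: the in-memory tables stand in for MRS database 114, the function names are hypothetical, and the table contents are placeholder rows assembled from examples mentioned elsewhere in this document ("example disease B" and "example pathogen B" are invented placeholders).

```python
# Placeholder data standing in for MRS database 114 (illustrative only).
DISEASE_SYMPTOMS = {
    "anthrax": {"fever", "cough", "chest pain"},
    "example disease B": {"rash"},
}
DISEASE_PATHOGENS = {
    "anthrax": {"Bacillus anthracis"},
    "example disease B": {"example pathogen B"},
}
PATHOGEN_CAPABILITIES = {
    "Bacillus anthracis": [
        "nucleic acid base composition analysis",
        "multi-locus VNTR analysis (MLVA)",
    ],
}


def diseases_for_symptoms(symptoms):
    """Diseases whose recorded symptoms include every entered symptom."""
    wanted = {s.casefold() for s in symptoms}
    return sorted(d for d, known in DISEASE_SYMPTOMS.items() if wanted <= known)


def pathogens_for_diseases(diseases):
    """All pathogens recorded as causing any of the given diseases."""
    found = set()
    for disease in diseases:
        found |= DISEASE_PATHOGENS.get(disease, set())
    return sorted(found)


def capabilities_for_pathogens(pathogens, text_filter):
    """Capabilities per pathogen whose description contains text_filter."""
    needle = text_filter.casefold()
    hits = {}
    for pathogen in pathogens:
        matching = [
            c for c in PATHOGEN_CAPABILITIES.get(pathogen, [])
            if needle in c.casefold()
        ]
        if matching:
            hits[pathogen] = matching
    return hits


def run_chain(symptoms, text_filter):
    """Run the searches in sequence, feeding each result into the next."""
    diseases = diseases_for_symptoms(symptoms)
    pathogens = pathogens_for_diseases(diseases)
    capabilities = capabilities_for_pathogens(pathogens, text_filter)
    return {
        "diseases": diseases,
        "pathogens": pathogens,
        "capabilities": capabilities,
    }
```

Because each step takes the previous result list as input, the same chain can be stored as a template and re-run later with unchanged logic, which is the behavior the paragraph attributes to saved queries.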

Data Updates

[0048] An advantage of system 100 is that it automatically presents the most current and most relevant data to a user in response to a request for information. In a preferred embodiment, network access module 108 re-executes queries with a specified frequency, thus allowing the data in MRS database 114 to reflect the most currently-available information. For example, during the recent SARS coronavirus outbreak, research progressed rapidly on the pathogen itself, disease symptoms, and treatment protocols, with new information being published frequently. A user 102 of system 100 could create a search request, and specify that the request should remain active for a certain number of days, weeks, or even indefinitely. Thus the results the user receives after one day would be updated on a second or subsequent days, with newly-available information. And, because data analysis engine 110 resolves variation in nomenclature and identifies similarities among seemingly different information, system 100 can provide a user 102 with a centralized source of global information about the search topic— for example, disease outbreaks in China, the United States and northern Africa may seem completely unrelated, but may be grouped together by system 100 because of similarities previously unrecognized. A user 102 in the United States may not even be aware of the Chinese or African outbreaks, but would in fact be made aware of them by the results provided by system 100. In this manner, system 100 provides a centralized source of information that is readily accessible to clinicians and researchers throughout the world.
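
One way to picture the standing search requests of paragraph [0048] is a small record that knows how long it stays active and how often it should be re-executed. The Python sketch below is an assumption-laden illustration: the class name, its fields, and the run_due helper are hypothetical, and the execute callback stands in for whatever network access module 108 actually does.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StandingQuery:
    text: str
    created: datetime
    interval: timedelta                      # how often to re-execute
    active_for: Optional[timedelta] = None   # None means "indefinitely"
    last_run: Optional[datetime] = None

    def is_active(self, now):
        """True while the user-specified active period has not elapsed."""
        return self.active_for is None or now < self.created + self.active_for

    def is_due(self, now):
        """True if the query is active and its interval has elapsed."""
        if not self.is_active(now):
            return False
        return self.last_run is None or now - self.last_run >= self.interval


def run_due(queries, now, execute):
    """Execute every due query and record when it ran.

    execute is a caller-supplied function taking the query text and
    returning its results.
    """
    results = {}
    for query in queries:
        if query.is_due(now):
            results[query.text] = execute(query.text)
            query.last_run = now
    return results
```

A query created with interval=timedelta(days=1) and active_for=timedelta(days=7) would be picked up by run_due once per day for a week and then ignored.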

[0049] Another example of how system 100 can be used is as follows:

[0050] A group of individuals is suspected of having been exposed to an unknown pathogen. A health care professional evaluates the fever, cough and chest pain symptoms exhibited by the group of individuals and provides a search request, via UI module 104, that includes the observed symptoms. The results returned by data analysis engine 110 include lists of pathogens which are known to give rise to the symptoms. Anthrax (Bacillus anthracis) appears as a member of the list. The health care professional then makes another search request, requesting information about specific tests that can be performed to distinguish among the pathogens in the list. Because the health care professional has a particular reason to suspect that the group has been exposed to anthrax, the health care professional then searches for known clinical tests that can be used to distinguish anthrax from other pathogens. The result of the query is that among other methods, base composition analysis and multi-locus VNTR analysis (MLVA) are efficient tests for identifying anthrax, both of which in this example are contained as data types within a data table in MRS database 114 designated "RS_Characteristic." The query may then provide links to a variety of data types that may include: contact information for scientists and other professionals with expertise in the testing methods, literature reports on the tests, and the like. For the case of base composition analysis, a link to laboratories performing triangulation identification for genetic evaluation of risk (TIGER) pathogen testing may be provided. Base composition analysis is described further in U.S. patent application Serial Nos. 10/323,233; 09/798,007; 09/891,793; 60/431,319; 60/443,443; 60/443,788; and 60/447,529, each of which is commonly owned and incorporated herein by reference in its entirety. 
Additionally, commonly owned US patent applications 10/728,486; 10/660,122; 10/660,997; 10/660,996; and 10/660,998 are each also incorporated herein by reference in their entirety.
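By way of illustration only, the symptom-based search of paragraph [0050] might be sketched as a lookup that returns every pathogen whose recorded symptoms include all of the observed symptoms. The table `PATHOGEN_SYMPTOMS` and the function `pathogens_consistent_with` are hypothetical; apart from the anthrax symptoms named above, the entries are placeholders and are not drawn from MRS database 114.

```python
# Illustrative symptom table. Only the Bacillus anthracis entry reflects the
# example above; the other rows are placeholders, not clinical data.
PATHOGEN_SYMPTOMS = {
    "Bacillus anthracis": {"fever", "cough", "chest pain"},
    "Pathogen B (placeholder)": {"fever", "cough"},
    "Pathogen C (placeholder)": {"rash"},
}


def pathogens_consistent_with(observed, table=PATHOGEN_SYMPTOMS):
    """Return, in sorted order, pathogens whose symptoms include every observed symptom."""
    wanted = {symptom.strip().lower() for symptom in observed}
    return sorted(name for name, symptoms in table.items() if wanted <= symptoms)
```

Requiring every observed symptom to be present is one possible matching rule; an implementation could instead rank pathogens by the number of symptoms in common.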

[0051] The health care professional may choose to perform a nucleic acid base composition analysis test and thus, he collects clinical samples from the group that are then sent to the TIGER pathogen testing laboratory. At the TIGER laboratory, technicians query the same database to obtain intelligent amplification primers with the aim of rapid identification of Bacillus anthracis. After amplification of the clinical samples, the amplification products are analyzed by mass spectrometry to determine their base compositions. System 100 is then queried for a match of the base composition of the amplification product, and returns information indicating that the experimentally-determined base compositions match the base compositions catalogued in the database for Bacillus anthracis, prompting the technicians to inform the health care professional to treat his patients for an anthrax infection. The health care professional can then ask system 100 again for the latest effective treatments for anthrax infection, which may include newly discovered antibiotics or other drugs.
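By way of illustration only, the base composition match of paragraph [0051] might be sketched as follows. A base composition is treated here as the count of A, G, C and T in an amplification product. The catalog layout, the primer label `primer_1`, the sequences, and the function names are hypothetical, and the sketch assumes exact matching with no allowance for measurement error.

```python
from collections import Counter


def base_composition(sequence: str) -> tuple:
    """Return the counts of A, G, C and T in a nucleic acid sequence."""
    counts = Counter(sequence.upper())
    return tuple(counts[base] for base in "AGCT")


# Hypothetical catalog keyed by (primer pair, organism). The sequences are
# invented for illustration and are not real amplicons.
CATALOG = {
    ("primer_1", "Bacillus anthracis"): base_composition("ATGGCGTTAACC"),
    ("primer_1", "Organism X (placeholder)"): base_composition("ATGGCGTTAAGG"),
}


def match_organisms(primer: str, measured: tuple, catalog=CATALOG) -> list:
    """List catalogued organisms whose composition for this primer equals the measurement."""
    return sorted(
        organism
        for (catalog_primer, organism), composition in catalog.items()
        if catalog_primer == primer and composition == measured
    )
```

Because a base composition ignores the order of the bases, distinct sequences can share a composition, so a single primer pair may return more than one candidate organism.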

[0052] At this point, an expert in microbial forensics may initiate an investigation wherein the strain of anthrax is determined by searching system 100 with nucleic acid sequence or base composition data. A result of such a query may provide links to laboratories known to harbor various strains of anthrax. A genetic engineering event may also be determined in a similar manner.

[0053] Likewise, an expert in epidemiological investigations may perform a search for information about the spread of an anthrax outbreak. Information such as the rate of spread of particular strains of anthrax spores and the resistance to disinfection may be provided by system 100.

[0054] Referring now to Fig. 9, there is shown an example of a method for initially configuring system 100 in the manner described above. First, pathogen data is loaded 902 into system 100, either on its own or as part of a threat list. Next, disease data is loaded 904 into system 100, again, either separately or as part of a threat list linking diseases with pathogens. After the pathogen and disease data is loaded, genetic data is preferably loaded 906. System 100 then identifies 908 duplicate entries in a manner as described above. The input data is then linked 910 together, so that related pathogens, diseases and genetic information can be cross-referenced as appropriate. System 100 then extracts 912 a set of keywords from the stored data and uses those keywords as search terms to perform 914 periodic updates over the Internet. As described above, information such as published articles or research found related to those keywords is then indexed in MRS database 114 along with the appropriate diseases, pathogens or genetic information.
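By way of illustration only, steps 902 through 912 of Fig. 9 might be sketched as follows. The function `configure`, its argument shapes, and the use of a synonym table to detect duplicate entries are assumptions made for this sketch; the periodic Internet update of step 914 is not shown.

```python
def configure(pathogens, diseases, genetic, synonyms):
    """Load, de-duplicate and link input data, then extract search keywords.

    pathogens: list of pathogen names (step 902)
    diseases:  dict mapping disease name to pathogen name (step 904)
    genetic:   dict mapping pathogen name to a list of gene names (step 906)
    synonyms:  dict mapping an alternate name to its canonical name (assumed
               input used here to identify duplicates, step 908)
    """
    records = {}

    def record_for(name):
        # Fold alternate names into a single canonical record (step 908).
        key = synonyms.get(name, name)
        record = records.setdefault(
            key, {"aliases": set(), "diseases": set(), "genes": set()}
        )
        if name != key:
            record["aliases"].add(name)
        return record

    for name in pathogens:
        record_for(name)
    for disease, pathogen in diseases.items():  # link diseases (step 910)
        record_for(pathogen)["diseases"].add(disease)
    for pathogen, genes in genetic.items():  # link genetic data (step 910)
        record_for(pathogen)["genes"].update(genes)

    keywords = set()  # keyword extraction (step 912)
    for key, record in records.items():
        keywords |= {key} | record["aliases"] | record["diseases"] | record["genes"]
    return records, sorted(keywords)
```

In this sketch an alternate name is retained as an alias so that it can still serve as a search keyword after the duplicate records are merged.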

[0055] Referring now to Fig. 10, there is shown a flow chart illustrating a method for executing a search in accordance with an embodiment of the present invention. System 100 receives 1002 a search request and first checks to see 1004 whether the search is one that has previously been requested and is therefore cached. If the search has not been done previously, then a query is constructed 1008; otherwise, the query is retrieved 1006 from the cache. In either case, once a query is constructed or retrieved, it is then used to execute 1010 a search over the web to obtain the latest information available about the queried items. For example, if the search request is for treatments for anthrax, the query might include a search on the terms "anthrax," "Bacillus anthracis," "Ames," and "treatment." Once the search results have been obtained, they are analyzed and correlated 1012 with data already in MRS database 114, and the search results are then returned 1014. Preferably, the query is also cached so that it can be quickly retrieved again if needed.
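By way of illustration only, the flow of Fig. 10 might be sketched as follows. The class `SearchService` and the three callables supplied to it are hypothetical stand-ins for the query constructor, the network access module, and the data analysis engine. Consistent with the description above, the sketch caches the constructed query rather than the search results, so that the web search is executed on every request.

```python
class SearchService:
    """Sketch of the Fig. 10 flow; all collaborators are hypothetical callables."""

    def __init__(self, build_query, web_search, correlate):
        self._build_query = build_query  # constructs a query from a request
        self._web_search = web_search    # executes a query over the web
        self._correlate = correlate      # correlates results with stored data
        self._query_cache = {}

    def search(self, request: str):
        key = request.strip().lower()
        query = self._query_cache.get(key)      # steps 1004 and 1006
        if query is None:
            query = self._build_query(request)  # step 1008
            self._query_cache[key] = query      # cache the query for reuse
        raw_results = self._web_search(query)   # step 1010, always re-executed
        return self._correlate(raw_results)     # steps 1012 and 1014
```

Normalizing the request text before the cache lookup is a design choice of this sketch, so that requests differing only in case or surrounding whitespace share one cached query.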

[0056] The present invention has been described in particular detail with respect to a limited number of embodiments. Those of skill in the art will appreciate that the invention may additionally be practiced in other embodiments. First, the particular naming of the components, capitalization of terms, the attributes, data structures, or any other programming or structural aspect is not mandatory or significant, and the mechanisms that implement the invention or its features may have different names, formats, or protocols. Further, the system may be implemented via a combination of hardware and software, as described, or entirely in hardware elements. Also, the particular division of functionality between the various system components described herein is merely exemplary, and not mandatory; functions performed by a single system component may instead be performed by multiple components, and functions performed by multiple components may instead be performed by a single component. For example, the particular functions of the data analysis engine 110 and so forth may be provided in one module or in many.

[0057] Some portions of the above description present the features of the present invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the bioagent identification arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or code devices, without loss of generality.

[0058] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the present discussion, it is appreciated that throughout the description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.

[0059] Certain aspects of the present invention include process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.

[0060] The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.

[0061] The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description above. In addition, the present invention is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any references to specific languages are provided for disclosure of enablement and best mode of the present invention.

[0062] Finally, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention.

[0063] We claim:

Claims

1. A system for performing microbial investigations, the system comprising: a data store for storing microbial data; a query constructor module, for receiving search parameters and constructing a query from the parameters; a network access module, communicatively coupled to the query constructor module, for performing a search on the constructed query, the search domain including a wide area network, and receiving search results; a data analysis engine, communicatively coupled to the network access module and the data store, for creating an association between microbial data in the data store and the received search results; and a user interface module, communicatively coupled to the data analysis engine and the query constructor module, for receiving input from a user and providing responses to the user.

2. A method for performing microbial investigations, the method comprising: receiving a search request; constructing a query from the search request; executing the constructed query to retrieve a first result from at least one first data source; executing the constructed query to retrieve a second result from a second data source; associating the first result with the second result; and providing the associated first result and second result as a response to the search request.

3. The method of claim 2 wherein the search request is for diseases associated with a specified pathogen.

4. The method of claim 2 wherein the search request is for pathogens associated with a specified disease.

5. The method of claim 2 wherein the search request is for symptoms associated with a specified pathogen.

6. The method of claim 2 wherein the search request is for pathogens associated with a specified symptom.

7. The method of claim 2 wherein the search request is for diseases associated with a specified symptom.

8. The method of claim 2 wherein the search request is for genetic data associated with a pathogen.

9. A computer program product for performing microbial investigations, the computer program product stored on a computer-readable medium and including instructions for causing a processor to carry out the steps of: receiving a search request; constructing a query from the search request; executing the constructed query to retrieve a first result from at least one first data source; executing the constructed query to retrieve a second result from a second data source; associating the first result with the second result; and providing the associated first result and second result as a response to the search request.

10. The computer program product of claim 9 wherein the search request is for diseases associated with a specified pathogen.

11. The computer program product of claim 9 wherein the search request is for pathogens associated with a specified disease.

12. The computer program product of claim 9 wherein the search request is for symptoms associated with a specified pathogen.

13. The computer program product of claim 9 wherein the search request is for pathogens associated with a specified symptom.

14. The computer program product of claim 9 wherein the search request is for diseases associated with a specified symptom.

15. The computer program product of claim 9 wherein the search request is for genetic data associated with a pathogen.