WO2000058863A1 - Techniques for performing a data query in a computer system - Google Patents

Techniques for performing a data query in a computer system

Publication number
WO2000058863A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
super
data
term
terms
Prior art date
Application number
PCT/US2000/008450
Other languages
French (fr)
Inventor
Jay Ponte
Original Assignee
Verizon Laboratories Inc.
Priority date
Filing date
Publication date
Priority claimed from US09/283,268 (US6826559B1)
Priority claimed from US09/282,730 (US7047242B1)
Application filed by Verizon Laboratories Inc.
Priority to AU43280/00A (AU4328000A)
Publication of WO2000058863A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/9538 Presentation of query results
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/912 Applications of a database
    • Y10S707/944 Business related
    • Y10S707/99931 Database or file accessing
    • Y10S707/99932 Access augmentation or optimizing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99934 Query formulation, input preparation, or translation
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access
    • Y10S707/99936 Pattern matching access
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99944 Object-oriented database structure
    • Y10S707/99945 Object-oriented database structure processing

Definitions

  • This application relates to the field of telecommunications and more particularly to the field of electronic commerce.
  • markup language pages displayed to a user 800 using a browser 824 typically include a mix of content and advertisements.
  • a user may see the content of a search engine, such as a search template, along with advertisements from one or more companies.
  • the advertisements, typically referred to as "banner ads," may include links to other site locations, such as the home page of the advertising company.
  • banner ads are displayed on pages that include content related to the banner ad.
  • a web page for an automobile dealer might include an advertisement and a link to a site offering financing for automobiles.
  • some web content is not clearly associated with a particular demographic group or user interest.
  • a search engine is likely to be used by a wide range of users who may be interested in a wide range of goods and services. Accordingly, a need exists for methods and systems that target banner ads to such users.
  • "Targeted banner ads" is used herein to refer to such methods and systems.
  • Targeted banner ad methods and systems present a number of programming challenges. Among other things, such methods and systems may need to provide relevancy ranking of categories of information related to a user's query. Accordingly, a need exists to provide improved methods and systems for performing such relevancy ranking.
  • Provided herein are methods and systems for targeting advertisements.
  • At least one category is associated with at least one super-category.
  • An advertisement is associated with at least one of the super-categories.
  • At least one term is determined that is associated with a data query.
  • a first of the at least one super-category is determined in accordance with at least one term of the data query and the at least one category.
  • An advertisement associated with the first super-category is determined.
  • Super-category terms may be linked to advertisements, so that an advertisement assigned to a super-category is displayed to a user if that super-category is identified as a relevant super-category based on a user's query.
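The associations described above (categories to super-categories, and super-categories to advertisements) can be sketched as two lookup tables. This is only an illustrative sketch: the category names, super-category names, ad identifiers, and function names are all hypothetical and do not come from the application.

```python
# Hypothetical data: category names, super-category names, and ad identifiers
# are invented for illustration only.
CATEGORY_TO_SUPER = {
    "shoe stores": "apparel",
    "boot repair": "apparel",
    "auto dealers": "automotive",
}
SUPER_TO_ADS = {
    "apparel": ["ad-footwear-banner"],
    "automotive": ["ad-auto-financing"],
}


def ads_for_super_category(super_category):
    """Return the advertisements assigned to a super-category (may be empty)."""
    return SUPER_TO_ADS.get(super_category, [])


def ads_for_categories(categories):
    """Collect ads for every super-category reached from the given categories."""
    ads = []
    for category in categories:
        super_category = CATEGORY_TO_SUPER.get(category)
        for ad in ads_for_super_category(super_category):
            if ad not in ads:  # keep each ad once, in first-seen order
                ads.append(ad)
    return ads
```

Under these assumptions, two categories that share a super-category yield the same advertisement once, and an unmapped category yields no advertisement.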
  • These techniques include establishing a super-category term list for each term appearing in one of a super-category or a category of document to be searched, each element of the super-category term list including terms in the super-category and terms in categories associated with that super-category.
  • the terms in a data query are obtained. Terms in categories are obtained in response to the data query.
  • a modified query is formed consisting of the terms in the data query and the terms in the categories. Terms of the modified query are weighted.
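The formation and weighting of the modified query can be sketched as follows. The text does not state the weighting scheme, so the weights used here (1.0 for each original query term, 0.5 for each category-derived term) are assumptions, as are the category index and the function names.

```python
from collections import Counter

# Hypothetical category index: category name -> terms describing it.
CATEGORY_TERMS = {
    "shoe stores": ["shoe", "footwear", "retail"],
    "shoe repair": ["shoe", "repair", "cobbler"],
    "restaurants": ["food", "dining"],
}


def categories_matching(query_terms):
    """Categories obtained in response to the query (any shared term)."""
    return [name for name, terms in CATEGORY_TERMS.items()
            if any(term in terms for term in query_terms)]


def build_modified_query(query_terms, query_weight=1.0, category_weight=0.5):
    """Combine query terms with category terms and assign assumed weights."""
    weights = Counter()
    for term in query_terms:
        weights[term] += query_weight
    for name in categories_matching(query_terms):
        for term in CATEGORY_TERMS[name]:
            weights[term] += category_weight
    return dict(weights)
```

A term that appears both in the user's query and in several matching categories accumulates weight, which is one plausible way to favor terms that the query and the categories agree on.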
  • the super-category term list is ranked by applying the modified query to the super-category term lists to determine the most relevant super-category to the data query.

Brief Description of Drawings

  • Figure 1 is an example of an embodiment of a system that includes an on-line query tool;
  • Figure 2 is an example of a block diagram of a hardware view of an embodiment of an on-line query tool;
  • Figure 3 is an example of an embodiment of a user interface displayed with an online query tool;
  • Figure 4 is an example of a block diagram of a software view of an online query tool of Figure 2;
  • Figure 5 is an example of an embodiment of a table illustrating data storage for denormalized objects in the databases;
  • Figure 6 is an example of an embodiment of a table representing data stored in the generic object dictionary;
  • Figure 7 is an example of an embodiment of a portion 440 of a PHTML execution tree;
  • Figure 8 is an example of an embodiment showing more detail of the parse driver;
  • Figures 9 and 10 are an example of a user interface displayed in response to a user request with an online query tool;
  • Figure 11 is an example of an embodiment of a user interface displayed with user query information;
  • Figure 12 is an example of the query results displayed in response to performing a user query of Figure 11;
  • Figure 13 is an example of a user interface which includes user-specified query information;
  • Figure 14 is an example of a resulting display page in response to the query performed with information specified in Figure 13;
  • Figure 15 is a more detailed display in response to choosing a particular category of Figure 14;
  • Figures 16 and 17 are an example of a user interface displayed in response to selecting an option from the menu of Figure 3 to add or change a listing;
  • Figure 18 is an example of a display screen in response to updating the business listing specified in Figures 16 and 17;
  • Figures 19 and 20 are an example of a user interface screen display results in response to a user request with regard to Figure 18;
  • Figure 21 is an example of a screen display to a user with more information with regard to the business listing selected from screen 20;
  • Figure 22 is the business information displayed with regard to the business in Figure 21;
  • Figure 23 is an example of an embodiment of the processes included in the request router of Figure 22;
  • Figure 24 is an example of a block diagram of an embodiment of the BackOffice component;
  • Figure 25 is an example of the flow process representing the processing of normalized data to the various data forms included in the Front End Server;
  • Figure 26 is an example of normalized data as may be included in an embodiment of the invention;
  • Figure 27 is an example of denormalized data form as may be included in an embodiment of the invention;
  • Figure 28 is a flowchart of an example of an embodiment of a method for performing request processing in the system of Figures 2 and 4;
  • Figure 29 is a flowchart of an example of an embodiment of the method steps for performing parser processing in the system of Figures 2 and 4;
  • Figure 30 is a flowchart of an example of a method with steps for performing query engine processing in the system of Figures 2 and 4;
  • Figure 31 is an example of a dependency graph as may be included in one embodiment of the invention for performing incremental update;
  • Figure 32 is an example of a flowchart of the method steps for performing different update techniques in accordance with the number of transactions;
  • Figure 33 is a flowchart of an example of method steps of one embodiment for performing data query cache lookup as used in performing a data query;
  • Figure 34 represents an example of applying the minimum cost derivation sequence as applied in the step of Figure 33;
  • Figure 35 is a flowchart of an embodiment of a method with steps for forming a name and determining if the corresponding data set is located in the query cache;
  • Figure 36 is an example of an entity as stored in the data query cache;
  • Figure 37 is a flowchart of an embodiment of a method including steps for performing an additional total-city cache lookup;
  • Figures 37 and 38 are flowcharts for a method in one embodiment for performing total-city and multi-city cache searches;
  • Figure 39 is an example of more details that may be included in an embodiment of the query engine;
  • Figure 40 is an example of an embodiment of method steps by which the information retrieval software may obtain results;
  • Figure 41 is a flow chart showing an example of an embodiment of method steps for obtaining results;
  • Figure 42 is a flow chart showing an example of method steps for classifying results for queries using common terms;
  • Figure 43 depicts an example of a user interface for an on-line query tool, including a screen for initiating a user query;
  • Figure 44 depicts an example of a user interface for an on-line query tool, including categories that may be retrieved in response to initiation of a user query;
  • Figure 45 is a block diagram of an embodiment of the database as may be included in the BackOffice component;
  • Figure 59 is an example of an embodiment of data tables included on a sending node for a multi-media data transfer;
  • Figure 60 is an example of an embodiment of the tables as appearing on the sending side and the receiving side in the multi-media data transfer;
  • Figure 61 is an example of a representation of a tree structure representing the relationships between entities used in the multi-media transfer;
  • Figure 62 is a snapshot of the tables that may be included in a preferred embodiment in sending data in a multi-media data transfer;
  • Figure 63 is a snapshot of an example of an embodiment of the tables on the sending and receiving side at another point when performing a multi-media data transfer;
  • Figure 64 is an example of an embodiment of tables and external processes on the sending and receiving side using the multi-media data transfer;
  • Figure 65 is an example of an embodiment of the tables resulting from the text data integration;
  • Figure 66 is an example of a block diagram of an embodiment of the data table whose contents have been transferred to the receiving side;
  • Figure 67 is a flowchart of a method of the steps of one embodiment for assembling blob data into a repository table when performing a multi-media data transfer;
  • Figure 68 is a flow chart setting forth method steps for establishing super-category term lists and for matching advertisements to super-categories, to assist in targeting an advertisement to a user of an on-line query tool;
  • Figure 69 is a flow chart setting forth method steps for mapping categories to super-categories;
  • Figure 70 is a flow chart setting forth method steps for executing a modified query in an on-line query tool designed to assist in targeting an advertisement to a user of an online query tool; and
  • Figure 71 is a diagram showing an example of a linked super-category term list.

Best Mode for Carrying Out the Invention

  • Referring to FIG. 1, shown is an embodiment of an on-line query tool 1910.
  • one or more users 1900-1904 may connect to the on-line query tool 1910 via a network 1906. Users may interact with the query tool using conventional hardware and software, such as, in an embodiment, a web browser through the Internet.
  • Referring to FIG. 2, shown is an embodiment of a hardware view of an online query tool.
  • this on-line query tool may be the GTE Superpages SM query tool.
  • Figure 2 shows a hardware view of the components that may be included in one embodiment of the query tool in typical operation as being accessed by a user through a network.
  • the user 800 enters a query request which is sent via a network 802, such as the Internet, to the GTE Superpages Front End Server 804.
  • the GTE Superpages Front End Server 804 includes a hardware router 806 for receiving incoming query requests.
  • the hardware router routes the request, using a simple hardware-based technique, to one of the server nodes 808-810 which may be designated to service the request by performing the requested query.
  • the servers 808 through 810, server 1 through server n, respectively, interact with the Primary Database 812 and Secondary Database 814 to perform a data query.
  • the Primary Database 812 interacts with the BackOffice component 818 at times, as will be described in paragraphs elsewhere herein, to obtain data used in performing the queries.
  • the BackOffice component 818 performs data filtering and other processing, for example, to combine information that may be obtained from various data sets producing a resultant data set.
  • the resultant data set is subsequently transferred to the Primary
  • the process of data integration and updating the data may be performed at a time other than peak demand time.
  • These processes and data transfer techniques are generally performed "off-line" and not in response to user query requests. Rather, these techniques may be performed as part of a data maintenance and update process performed in accordance with the load and the number and type of update transactions.
  • Figure 2 depicts a Superpages Front End Server 804 which includes a varying number of server nodes 808-810 to respond to the various query requests as made by a user 800.
  • the techniques and concepts which are described in paragraphs that follow may be used in a variety of different systems which include one or more server systems. Additionally, a single database or other datastore may be used. The techniques described herein may generally be applied to a large distributed system. Additionally, these same concepts and techniques may be applied to a single-user system performing data queries and searches upon a local database.
  • Referring to Figure 3, shown is an example of a user interface screen as included in one embodiment of the system of Figure 2. Generally, Figure 3 is the initial screen 1800 that may be displayed to a user entering a URL corresponding to the GTE Superpages Internet site.
  • Figure 3 includes fields for query information 1802-1808, hyperlinks to other tools 1810, such as on-line shopping or placing advertisements, and other links 1812, for performing other tasks such as modifying an existing business listing.
  • the GTE Superpages Internet site is related to on-line yellow pages, similar to those included in a paper phone book.
  • various business services and user services may be provided.
  • a user may query the on-line yellow page information for various businesses in the United States based on particular search criteria.
  • On-line shopping information regarding products and business services may be provided to a user performing a data query.
  • Advertisers, such as the business providers of the various products and services, may also purchase advertisements similar to those that may be purchased in the paper copy of a phone book that includes yellow page listings of businesses.
  • the interface 1800 may include links to various services and functions.
  • one service provided permits businesses to advertise in the on-line yellow pages.
  • Functions associated with this service may include, for example, purchasing advertisements and adding or changing a business listing that an advertiser or business includes in the yellow pages.
  • some of these functions are included in the interface portion 1812, with links to other tools in the screen portion 1810.
  • a user may connect with any of these tools or functions to perform tasks related to the yellow pages advertising by selecting an option from the user interface 1800, such as by left-clicking with a mouse.
  • Referring to FIG. 4, shown is an embodiment of the various software components for an on-line query system.
  • One embodiment may be the on-line query tool of the GTE Superpages system.
  • Figure 4 depicts a software view of the typical operation of the system as being accessed by a user 800 through a network 802 using the hardware as described in conjunction with Figure 2.
  • the user may enter a request, as through a browser. This request is communicated through the GTE Superpages Front End Server 804 over the network 802. As shown in Figure 4, the Front End Server 804 includes a server node 808 that includes a web server engine 852.
  • the web server engine 852 is a Netscape™ engine which serves as a central coordinating task for accessing files and displaying information to the user on the browser 824.
  • the server node 808 also includes a request router 854, a monitor process 856 and a parser 866.
  • the parser 866 generally includes a parse driver 858, a generic object dictionary 860, a query engine 862, and a data manager 864.
  • the parse driver 858 operates upon data from a constructed ad repository 842 and the PHTML files 844.
  • the parse driver 858 stores and retrieves data from the PHTML execution tree 846 and the page cache 848.
  • the data manager 864 included in the parser 866 is responsible for interacting with the database, which in Figure 4 is the Primary Database 812. It should also be noted that the data manager 864 may also obtain data from a Secondary Database as previously shown in Figure 4. If there are multiple databases other than a Primary and Secondary Database, the data manager may also interact with these to obtain the necessary data upon which data queries are performed.
  • the query engine 862 operates upon data from, and writes data to, the data query cache 850.
  • the query engine uses data from the term lists 836 to obtain identifiers and possibly other retrievable data in accordance with various key terms upon which a data query is being performed.
  • the request router 854 generally interacts with the parser and reads data from the configuration file 830 and load file 834.
  • the monitor process 856 also reads data from, and writes data to, the load file 834.
  • the web server engine 852, in this embodiment the Netscape engine 852, obtains data from the HTML repository 838 and the image repository 840 in accordance with various requests from the browser for different types of files.
  • the monitor process 856 is generally responsible for indicating the availability of server nodes 808-810 in performing data queries.
  • the monitor is also generally responsible for receiving incoming messages from other server nodes as to their availability for servicing requests.
  • the load file 834, upon which the monitor process 856 reads and writes data, is a dynamic file in that its contents are updated in response to incoming messages indicating machine availability and the current load of the corresponding machine.
  • the load file also includes static information components, such as the maximum load of each system. Generally, the actual executing load (current load) of a system is less than or equal to the maximum load (max load) as indicated in accordance with the load file.
  • Each server has its own unique copy of the load file which is updated in accordance with messages which it receives from the other nodes. Below is an example of an entry that may be included in the load file representing the information described above:

      SERVER, MAX LOAD, CURRENT LOAD
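A load file of this kind could be read and updated along the following lines. The sketch assumes a comma-separated text layout with integer loads and `#` comment lines, and it clamps the current load to the maximum load; none of these details are stated in the text, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    server: str
    max_load: int       # static component of the entry
    current_load: int   # dynamic component, updated from incoming messages

    @property
    def has_capacity(self):
        return self.current_load < self.max_load


def parse_load_file(text):
    """Parse 'SERVER, MAX LOAD, CURRENT LOAD' lines into LoadEntry objects."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        server, max_load, current_load = (f.strip() for f in line.split(","))
        entries[server] = LoadEntry(server, int(max_load), int(current_load))
    return entries


def update_load(entries, server, current_load):
    """Apply a load message; the clamp to max_load is an assumption."""
    entry = entries[server]
    entry.current_load = min(current_load, entry.max_load)
```

Each server node would hold its own parsed copy and call something like `update_load` whenever a message arrives from another node.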
  • the configuration file 830 may be a static file physically located on one of the server nodes 808-810 with a copy replicated on each other server node. Generally, this file is created prior to use of the system. It may specify which servers may service requests based on weighted parameters of a particular search domain associated with a particular server. Below is an example of an entry in a configuration file:
  • the domain weight may be a normalized value representing costs (e.g., time) associated with processing a request for this associated search domain or partition. This domain weight is based on the median time to service a request in that domain based on the analysis of past data logs, for example, as normalized by the number of listings in the domain. Similarly, server weights may represent the cost associated with processing a request on a particular server.
  • the domain/partition indicates a portion of the search domain upon which a user query may be performed that is associated with a particular server.
  • an incoming request may be processed by one of a plurality of parsers 858 on each of the server nodes.
  • the parser 858 generally transforms the user input query into a form used by other components, such as the request router.
  • the request router generally receives an incoming request as forwarded by the hardware router 806 of Figure 2.
  • the request router subsequently uses the load file and the configuration file to decide which server node 808-810 a request is routed to based on the load and the availability of the server node, and the designated server for each partition or domain.
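The routing decision can be sketched as a lookup of the designated servers for a domain, followed by a check of each server's load. The data structures below are hypothetical stand-ins and do not reproduce the configuration file format, which is not shown in the text; the fallback to the server with the most spare capacity is likewise an assumption.

```python
def route_request(domain, designated, load):
    """Pick a server node for a request in the given search domain.

    designated: domain -> ordered list of designated servers (hypothetical).
    load: server -> (max_load, current_load), or None if unavailable.
    """
    def spare(server):
        entry = load.get(server)
        if entry is None:  # unknown or unavailable server
            return 0
        max_load, current_load = entry
        return max_load - current_load

    # Prefer the servers designated for this domain, in order.
    for server in designated.get(domain, []):
        if spare(server) > 0:
            return server

    # Assumed fallback: any available server with the most spare capacity.
    candidates = [server for server in load if spare(server) > 0]
    if not candidates:
        return None
    return max(candidates, key=spare)
```

For example, with `server1` fully loaded and `server2` lightly loaded, a request for a domain designated to both is routed to `server2`.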
  • the query is performed producing data query information that may be cached, for example, in the memory of a data query cache 850.
  • One use of the data query cache 850, as will be described in paragraphs that follow, is to improve performance in response to a subsequent user query that may use a subset or superset of the data stored in the data query cache 850.
  • a superset or composition query is one which is a boolean composite of several querying terms.
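One way to picture the reuse of cached data for a composition query is to cache the result set of each single term and combine the cached sets with set operations. This is a simplified sketch under that assumption; it does not reproduce the cache naming or lookup scheme of Figures 33 through 36, and the class and method names are hypothetical.

```python
class DataQueryCache:
    """Caches per-term result sets and composes them for boolean queries."""

    def __init__(self, lookup):
        self._lookup = lookup  # fallback: term -> iterable of listing ids
        self._cache = {}
        self.misses = 0

    def term_results(self, term):
        if term not in self._cache:
            self.misses += 1  # only uncached terms reach the database
            self._cache[term] = frozenset(self._lookup(term))
        return self._cache[term]

    def and_query(self, terms):
        sets = [self.term_results(term) for term in terms]
        return frozenset.intersection(*sets) if sets else frozenset()

    def or_query(self, terms):
        return frozenset().union(*(self.term_results(term) for term in terms))
```

Once "shoes" and "boston" have each been looked up, both the conjunction and the disjunction of the two terms can be answered without another database access.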
  • a composition query may be determined by the parser 866, and the request router 854 may decide to which server node 808-810 the composition query or other query is sent for processing in accordance with domain weights as indicated in the configuration file.
  • Reallocation of requests when a server is unavailable may be performed generally with a bias toward the initial allocation scheme as indicated also by the configuration file.
  • the PHTML execution tree 846 includes an expanded version of a PHTML file requested from the PHTML file 844 as the result, for example, of a user query. PHTML generally is a modified version of the HTML language, which is a markup language according to the Standard Generalized Markup Language (SGML) standard capable of interpretation by browsers, such as a Netscape browser. PHTML generally is a scripted version of HTML with conditional statements that provide for alternate inclusion of blocks of HTML code in a resulting HTML page transmitted to a browser in accordance with certain run time query conditions.
  • PHTML file may be described as a parse tree representing parsed and expanded PHTML files. For example, if a PHTML file conditionally includes accesses to other PHTML files or various portions of HTML commands, the parse tree structure reflects this in its representation of the parse tree which is cached in the PHTML execution tree 846. Upon a subsequent request for the same PHTML file, the cached, expanded version is retrieved from the PHTML execution tree 846 to increase system efficiency, thereby decreasing user response time for the subsequent query.
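The caching of an expanded parse tree can be sketched as below. PHTML syntax is not given in the text, so parsing is left to a caller-supplied function; the node classes, the condition flags, and all names are hypothetical and serve only to show a tree being parsed once and rendered many times under different run-time conditions.

```python
class Html:
    """A literal block of HTML in the parsed tree."""

    def __init__(self, text):
        self.text = text

    def render(self, conditions):
        return self.text


class Conditional:
    """Includes one of two branches depending on a run-time condition."""

    def __init__(self, flag, then_nodes, else_nodes=()):
        self.flag = flag
        self.then_nodes = list(then_nodes)
        self.else_nodes = list(else_nodes)

    def render(self, conditions):
        branch = self.then_nodes if conditions.get(self.flag) else self.else_nodes
        return "".join(node.render(conditions) for node in branch)


class ExecutionTreeCache:
    """Caches parsed trees by file name so each file is parsed only once."""

    def __init__(self, parse_file):
        self._parse_file = parse_file  # hypothetical parser: name -> node list
        self._trees = {}
        self.parse_count = 0

    def render(self, name, conditions):
        if name not in self._trees:
            self.parse_count += 1
            self._trees[name] = self._parse_file(name)
        return "".join(node.render(conditions) for node in self._trees[name])
```

The second request for the same file skips parsing entirely and only re-evaluates the conditions, which is the efficiency gain the passage describes.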
  • a request is received by the webserver engine 852 which interacts with the parser 866.
  • a PHTML file is obtained and executed from the PHTML file store 844.
  • the expanded version of the PHTML file is cached in the PHTML execution tree 846.
  • an HTML page is generally constructed and cached in the page cache 848.
  • constructed HTML pages are stored in the page cache 848 if the amount of time taken to produce the resulting HTML page is greater than a predetermined threshold. Implementations of the page cache may implement different replacement schemes. In one preferred embodiment, the page cache implements an LRU replacement scheme.
  • the threshold, that is, the amount of time used to determine which pages are stored in the page cache, may vary with system and response time requirements.
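A page cache with a build-time threshold and LRU replacement can be sketched as follows. The capacity, the threshold value, and the class and method names are assumptions chosen for illustration; only the two policies (store a page only if it took longer than the threshold to build, evict the least recently used page) come from the text.

```python
from collections import OrderedDict


class PageCache:
    """Stores only pages that were slow to build; evicts in LRU order."""

    def __init__(self, capacity, threshold_seconds):
        self.capacity = capacity
        self.threshold_seconds = threshold_seconds
        self._pages = OrderedDict()  # oldest (least recently used) first

    def get(self, key):
        if key not in self._pages:
            return None
        self._pages.move_to_end(key)  # mark as most recently used
        return self._pages[key]

    def offer(self, key, html, build_seconds):
        """Cache the page only if building it exceeded the threshold."""
        if build_seconds <= self.threshold_seconds:
            return False
        self._pages[key] = html
        self._pages.move_to_end(key)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # drop least recently used
        return True
```

Raising the threshold keeps cheap pages out of the cache so that its limited capacity is spent on pages that are expensive to regenerate.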
  • a particular search order of the previously described caches and file systems may be performed. Initially, it is determined whether the HTML page to be displayed to the user is located in the page cache 848. If not, search results are obtained from the query cache and the resulting HTML page is constructed and itself may be placed in the page cache 848. If a PHTML file is required to be executed in constructing the resulting HTML file, the PHTML execution tree 846 may be accessed to determine if there is a parsed version of the required PHTML file already expanded in the PHTML execution tree. If no such file is located in the PHTML execution tree 846, the PHTML file 844 is accessed to obtain the required
  • the constructed ad repository generally includes constructed advertisement pages which may include, for example, text and non-text data, such as audio and graphic images to be displayed in response to a user query which represent, for example, a yellow pages ad.
  • the webserver engine 852 accesses information from the image repository 840 and HTML repository 838.
  • the image repository 840 includes various graphic images and other non-text data which may also be directly accessed by the webserver engine 852 in response to a user request, as by a user request for a specific URL.
  • the HTML repository 838 includes various HTML files which may be provided to the user, for example, in response to a user request with a specific URL which indicates a file.
  • included in each of the server nodes 808-810 are one or more parsers 866 which perform, for example, parsing of the text of a user data query request. Figure 4 includes some of
  • the data manager 864 generally interacts with a database to actually retrieve the data to be included in the resultant data query as displayed to the user.
  • the parse driver 858 generally uses a data schema description to interpret various data fields of the generic data objects.
  • abstraction of the data interpretation into the data schema description enables different components of the parser 866 to operate upon and use generic data objects without requiring code changes or recompilation of these components when new data presentation types are introduced.
  • This technique insulates code as included in the parser 866 from the introduction of new presentation types which may be represented as generic data objects.
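The idea of interpreting generic data objects through a schema description can be illustrated as follows. The schema layout, the field names, and the presentation type names are hypothetical; the sketch only shows that a new presentation type can be supported by adding a schema entry rather than changing the interpreting code.

```python
# Hypothetical schema description: presentation type -> ordered (field, converter).
SCHEMA = {
    "business_listing": [("name", str), ("city", str), ("distance_miles", float)],
}


def interpret(presentation_type, raw_fields, schema=SCHEMA):
    """Interpret the raw fields of a generic data object using the schema."""
    fields = schema[presentation_type]
    if len(raw_fields) != len(fields):
        raise ValueError("field count does not match schema for %s" % presentation_type)
    return {name: convert(value)
            for (name, convert), value in zip(fields, raw_fields)}
```

Supporting a new type, for instance a hypothetical `"banner_ad"` with a single integer field, requires only a new schema entry; `interpret` itself is unchanged.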
  • One common use of the GTE Superpages Internet site is to perform a data query.
  • data field 1802 is a category query field by which queries may be performed in accordance with specified search categories that may be associated with business listings included in the yellow pages database. Additionally, field 1802 also includes predetermined top categories, as may be determined by examining log files in accordance with user query selections and search criteria. In this embodiment, selection of the "top categories" of the field 1802, as by left-clicking with a mouse button, causes the interface 1820 of Figure 9 to be displayed in a user's browser.
  • FIG. 9 shows one embodiment of a user interface for displaying a first page of the top query categories 1820.
  • these categories are associated with the various business listings and are tags by which a user may perform queries.
  • the user may select the "top categories" from the initial interface as included in the field 1802.
  • this user interface screen may be displayed by selecting "detailed search" from the field 1808 from the initial user interface 1800. For example, the user interface 1830 may be displayed if the user wants to perform a data query for specified categories and certain distance criteria. As shown in the example of user interface 1830, a data query may be performed for restaurants within five
  • FIG. 13 shows an example of one embodiment of a user interface display 1850 for performing a user query in accordance with user-specified search criteria.
  • User interface 1850 of Figure 13 is the interface 1800 of Figure 3, but with user-specified data query information included in various data fields.
  • a data query is performed for "shoes" as the category 1802 for "Boston, MA" in field 1804.
  • the query is performed by selecting the "Find It" button of field 1806.
  • the resulting screen displayed in response to selection of the "Find It" button is included in Figure 14.
  • the screen results 1860 may include displayed summarized business listing information in accordance with the search criteria previously specified in Figure 14.
  • Various business listings may be grouped together in categories. In this example, 154 business listings relating to "shoes" are included in thirteen (13) categories.
  • FIG. 15 shows the business listings relating to the user-specified search criteria selection relating to "custom made shoes". From this screen 1870, the user may further select one of the businesses for more information pertaining to the business, such as directions and business-provided advertisements.
  • FIGS. 16 and 17 show one embodiment of a user interface that may be displayed when a business or advertiser updates a business listing.
  • This screen may be displayed, for example, by selection of the "add or change your listing" option 1812 of Figure 3 of the initial user interface.
  • the user interface 1880 provides data fields which allow a user to enter information such as a telephone number corresponding to a business listing. Corresponding business listing information is then updated.
  • a phone number 617-832-5000 is entered into field 1882 to retrieve business listing information corresponding to this phone number.
  • the resulting screen of Figure 18 is subsequently displayed to the user in this embodiment.
  • the phone number corresponds to a business as displayed in Figure 18. If this is the correct business, a user may select a displayed business, for example, by clicking on the "matching business" information.
  • a section of the displayed interface 1883 indicates options for creating a website linked to a particular business listing. Note also that in some embodiments, it is possible to enhance a business listing and/or link a listing to a pre-existing website or to one that is created.
  • the request router 854 may be executed within a Netscape server process space and may be invoked when a user, via a browser, makes a request which results in a PHTML file being executed
  • the PHTML files, as generally included in the PHTML file store 844, are in the form of a script activated when a server node 808-810 is forwarded a user request
  • the request router 854 is generally responsible for routing a request to the proper server node in accordance with data stored in the configuration and load files. The request is also forwarded to one of the plurality of parsers for processing once the proper server node has been located.
  • the request router 854 generally includes a housekeeping thread 880, a router thread 882, and one or more worker threads 884. Generally, the housekeeping thread 880 is responsible for maintaining a parser status table 886 and a parser queue 888.
  • the router thread 882 generally responds to the monitor process changes as recorded in the various data files with regard to server node availability.
  • the router thread 882 reads data from the configuration and load files, and maintains an in-memory copy for use by the various threads of the request router 854.
  • the router thread 882 updates the in-memory copy of the configuration and load files in accordance with predetermined node fail-over and reallocation-of-request policies. For example, if in reading the configuration and load files, the router thread 882 determines that a first server node is at maximum utilization, the router thread updates its in-memory, server-node, local version of the files. The router thread then determines not to forward requests to the first server node.
  • When the first server node's actual utilization decreases and the node is again available for processing additional requests, the router thread accordingly updates its in-memory copy.
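The router thread's bookkeeping could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation described in the application: the `NodeLoad` record, the dictionary standing in for the load file, and the `max_utilization` threshold are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    """Hypothetical per-node entry kept in the router thread's in-memory copy."""
    utilization: float       # fraction of capacity in use, 0.0 to 1.0
    accepting: bool = True   # whether requests may be forwarded to this node


def refresh_in_memory_copy(load_file, in_memory, max_utilization=1.0):
    """Update the in-memory copy from freshly read load data.

    A node at or above max_utilization (an assumed threshold) is marked as
    not accepting forwarded requests; it becomes eligible again as soon as
    its reported utilization drops below the threshold.
    """
    for node_id, utilization in load_file.items():
        entry = in_memory.setdefault(node_id, NodeLoad(utilization))
        entry.utilization = utilization
        entry.accepting = utilization < max_utilization
    return in_memory


def eligible_nodes(in_memory):
    """Return the node ids that requests may currently be forwarded to."""
    return sorted(n for n, e in in_memory.items() if e.accepting)
```

Calling `refresh_in_memory_copy` each time the load file is reread keeps the routing decision a cheap in-memory lookup for the other threads.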
  • each of the worker threads 884 is initially forwarded a request which arrives at a server node.
  • the worker thread 884 makes the decision whether the request should be routed to another node.
  • the worker thread 884 makes this decision generally in accordance with the contents of the configuration and load files as previously described. If a request is determined to be routed to another server, the worker thread forwards the request to another worker thread on another server node. If the worker thread does not forward the request to another server, the worker thread determines which parser to send the request to for further processing.
  • the list of available parsers is stored in the parser queue 888, which in this particular embodiment is implemented as an AT&T System 5™ with a system message queue.
  • the parser queue is generally maintained by the housekeeping thread 880.
  • the Netscape™ or other HTTP server provides as a service the dispatching of requests to the various worker threads.
  • Other implementations may provide this function using other techniques such as callback mechanisms which dispatch the user requests to one of the plurality of available worker threads 884.
  • the parser status table 886 includes information about use, availability and location of each of the plurality of parsers on each server node.
  • the parser status information may be used in determining where to route requests, for example, as performed by the worker thread 884.
  • the parser status information as included in the parser status table 886 may be used to route requests based on an adaptive technique similar to the adaptive caching technique which will be described in paragraphs that follow. This may be particularly useful in systems with multiple processors, for example, those in which certain CPUs are dedicated processors associated with predetermined parsers.
  • the parsing results may be stored in the PHTML execution tree accessed by the particular processor. Subsequent requests which are also processed by the same parser may access the cached parsing results stored in the PHTML execution tree.
  • the request processing model includes a plurality of parsers and a plurality of worker threads.
  • an incoming request is associated with a particular worker thread which then forwards the request to a parser for processing.
  • the worker thread is disassociated with the request, and is then available for use in the pool of worker threads.
  • the number of parsers and worker threads may be tuned in accordance with the number of user requests.
  • One point to note using this model is that the worker thread and the parser are disassociated and thought of as distinct processing units rather than as a unit in which a worker thread is associated with a particular parser for processing an entire life of a request.
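The decoupling of worker threads from parsers might be sketched as below. This is a simplified illustration, not the described system: the `Parser` class, its per-parser `inbox`, and the in-process `queue.Queue` standing in for the parser queue are assumptions, and the worker pool is reduced to a plain function call for brevity.

```python
import queue
import threading


class Parser(threading.Thread):
    """Hypothetical parser that runs independently of any worker thread."""

    def __init__(self, name, available, results):
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()
        self.available = available   # shared queue of idle parsers
        self.results = results       # shared queue of finished responses

    def run(self):
        while True:
            self.available.put(self)          # advertise this parser as idle
            request = self.inbox.get()
            if request is None:               # shutdown signal
                return
            self.results.put((request, f"parsed by {self.name}"))


def worker_dispatch(request, available):
    """Worker step: pick an idle parser, hand off the request, return at once."""
    parser = available.get()      # blocks until some parser is idle
    parser.inbox.put(request)
    # The worker is now disassociated from the request and can be reused.


def serve(requests, num_parsers=2):
    """Run all requests through a pool of parsers and collect the results."""
    available, results = queue.Queue(), queue.Queue()
    parsers = [Parser(f"parser-{i}", available, results) for i in range(num_parsers)]
    for p in parsers:
        p.start()
    for request in requests:
        worker_dispatch(request, available)
    finished = [results.get() for _ in requests]
    for p in parsers:
        p.inbox.put(None)
    return finished
```

Because the hand-off returns immediately, the number of parsers and the number of workers can be sized separately, which is the tuning freedom the text points to.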
  • the BackOffice component includes a database 892 which provides data, for example, to the Front End Server 804 through connection 822.
  • the database 892, as stored in the BackOffice component, may be updated, as through a webserver via a connection to a user. Such a connection as 896 may be used, for example, when a modification is made to an entry to correct a typographical error.
  • a user may connect, such as via a browser, using connection 896, to the webserver 894 included in the BackOffice component.
  • the database 892 is then accessed and updated in accordance with requests or updates made by the user.
  • a user may update entries included in database 892 using techniques other than by a connection 896 via a webserver to the database 892.
  • different types of updates to database 892 may be performed in different embodiments of the invention.
  • the database 892 may be updated on a per-entry basis by a variety of users connecting via multiple webserver connections.
  • periodic updates, for example, for a particular data set, may be provided from a particular vendor and accordingly integrated into database 892 through a database integration technique, rather than having a user manually enter these updates such as via a connection to the webserver 894.
  • the connection 822 to the Front End Server may be used, for example, to load a new copy of the database 892 into the Front End Server Primary and Secondary Databases 812, 814 as shown in Figure 2.
  • these updates may be sent across the connection 822 to the Front End Server as previously described, in terms of database operational commands which perform updates from the computer system which includes database 892.
  • the database 892 included in the BackOffice component and both the Primary and Secondary Databases, as included in Figure 24, are Oracle™ databases.
  • Oracle provides remote database update and access commands which allow for remote database access and updating, such as update requests from the database server node 892 to update the Primary Database 812 as stored in the
  • Data is stored in the BackOffice component in this particular embodiment in a normalized data form, as will be further described in paragraphs that follow.
  • These normalized data changes are transferred to the Front End Server 804 from the BackOffice component in one of several forms. For example, the entire database may be transferred to the Front End Server 804. Additionally, changes or updates to particular entries may also be transmitted to the Front End Server 804 from the BackOffice component rather than updating or overwriting the entire copy of the database as stored in the Front End Server 804.
  • Each of these types of database updates from the BackOffice component to the Front End Server 804 may be done in accordance with the number of transactions or updates to be performed. This is further described in other sections of this description.
  • a markup language file generally includes tags which represent commands or text identifiers for processing the contents of the file.
  • Standard Generalized Markup Language (SGML)
  • the process depicted in Figure 25 is performed once data has been received in the Primary Database 812, and is first stored in the Primary Database 812 in normalized data form, as in the normalized data store 900.
  • Extraction routines 902 examine the normalized data store 900 and rearrange the information to place it in the denormalized data form, also included in the Primary Database 812 of this embodiment.
  • the extraction routines 902 produce markup language files 906 which are primarily used by the information retrieval software to produce identifiers and corresponding words or terms upon which a query may be performed.
  • These lists of key words or terms which may be searchable or retrievable and the corresponding record identifiers as included in the denormalized data store 904 may be stored in a list structure as included in the term list data store 836.
  • the markup language files include one file or document per business for which there is an advertisement, for example, in this particular embodiment.
  • Each of the markup language files 906 includes markup language statements, such as SGML-like statements, with tags identifying key data items in the document for each business.
  • the information retrieval software is Verity software which uses as input markup language files 906. Additionally, Verity uses its own schema file by which a user indicates what key words or terms as indicated in the markup language files are searchable and which of the data fields contain retrievable information.
  • Searchable as used herein means fields or key words and terms upon which searches may be performed, like index searching keys.
  • Retrievable as used herein generally means fields or categories with associated data that may be retrieved. All searchable fields have a tag, such as a business name or city. Identifiers are generally produced by the information retrieval software 908. Verity™, in this particular embodiment, produces term lists 836 in which there exists a list for each particular key word, term or category followed by a chain of identifiers that indicate the record number in the denormalized data store 904. Additionally, associated with each element in the term list which indicates a record in the denormalized data, retrievable data associated with that record may also be included.
  • the term list and the term list data store 836 contain a list corresponding to the key word "zip code". There is a term list for each particular value of a zip code.
  • Attached to each key word "zip code" and the particular value may be a list or a chain of identifiers. Each identifier on the chain may have associated data, such as the city and state, which may be retrieved when a particular zip code is searched.
  • the data included in the term lists may be data that is also needed in performing search optimizations, weighted searches, or different types of searches, such as proximity searches.
  • This data may further be stored in the various data files and caches of the Front End Server as needed in accordance with each implementation, for example in accordance with the types of searches and data upon which queries may be performed or otherwise operated upon by the Front End Server.
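The term-list structure described above resembles an inverted index keyed by field and value. A minimal sketch follows, assuming a plain dictionary layout; the function names and the record layout are hypothetical and are not taken from the information retrieval software mentioned in the text.

```python
from collections import defaultdict


def build_term_lists(records, searchable, retrievable):
    """Build {(field, value): [(record_id, retrievable_data), ...]}.

    records: {record_id: {field: value}}, standing in for the denormalized store.
    searchable: fields that become term-list keys.
    retrievable: fields whose values are attached to each identifier.
    """
    term_lists = defaultdict(list)
    for record_id, fields in sorted(records.items()):
        attached = {f: fields[f] for f in retrievable if f in fields}
        for field in searchable:
            if field in fields:
                term_lists[(field, fields[field])].append((record_id, attached))
    return dict(term_lists)


def lookup(term_lists, field, value):
    """Return the chain of (record_id, data) pairs for one key and value."""
    return term_lists.get((field, value), [])
```

With records carrying a zip code, city and state, `lookup(term_lists, "zip code", value)` returns the chain of record identifiers for that zip code together with the city and state attached to each, without touching the underlying database.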
  • the Primary Database 812 includes both normalized and denormalized data form.
  • the Secondary Database 814 includes only denormalized data form.
  • Normalized data is that representation of the data in which each data relation is represented independent of other relations.
  • denormalized data is the antithesis of normalized data, in which one data relation represents all relations.
  • Different databases may be of different degrees of normalized and denormalized data.
  • the BackOffice component 818 generally stores the data in normalized data form of a certain degree.
  • the databases used in this server store the data in a normalized form, also of a certain degree, and additionally in a denormalized form to optimize search performance for data queries.
  • the data is stored in third degree normal form.
  • sets of data may be stored together within a single field, such as multiple mailing addresses. Other embodiments may have one field per address. This may prove to be advantageous, for example, for high performance and better flexibility in systems subject to multiple and diverse data sources, and a high rate of modifications.
  • each particular business entry may have a unique identifier (ID). Additionally, three pieces of information may be stored for each particular business.
  • the normalized data form may look as in Figure 26. In this particular example, there may be a separate table for each ID corresponding to a business and its business address 910. Additionally, there may be two other data tables of information also indexed by each particular business ID, such as email address 912 and telephone number 914. Generally, as indicated in Figure 26, the normalized data representation for each business associated with a particular ID is represented as a separate data relation independent of the other relations.
  • The conceptual opposite of normalized data is denormalized data, as depicted in Figure 27.
  • Figure 27 shows an example of denormalized data stored in table 916.
  • in denormalized data, for each ID associated with a business, the business address, email and telephone number may be stored in a single record.
  • one data relation, which is a single record in the table 916, represents all relations for one particular data set, such as the ID corresponding to a business. Various degrees of denormalized and normalized data, as known to those skilled in the art, may be used.
  • the optimal degree of normalized and denormalized data forms may vary with each particular implementation and embodiment.
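The relationship between the two forms can be illustrated with a small join. The sketch below assumes three ID-indexed dictionaries standing in for the address, email and telephone tables; the function and key names are hypothetical.

```python
def denormalize(addresses, emails, phones):
    """Join three ID-indexed tables into one record per business ID.

    Each input maps a business ID to a single value, mirroring the separate
    relations of the normalized form. The output holds every value for an
    ID in one record, mirroring the denormalized form.
    """
    records = {}
    for business_id, address in addresses.items():
        records[business_id] = {
            "id": business_id,
            "address": address,
            "email": emails.get(business_id),   # None when no entry exists
            "phone": phones.get(business_id),
        }
    return records
```

A query against the denormalized records reads one row per business instead of joining three tables at search time, at the cost of storing the data twice.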
  • the BackOffice component 818 may include one or more database servers 892.
  • a user may directly interact with the web server 894 included in the BackOffice component via connection 896 which, for example, may be a network connection of a user accessing the web server through the Internet.
  • the user may also interact directly with the BackOffice component through the Front End Server Connection 822.
  • the particular type and number of data fields may vary with embodiment. Additional structure may also be imparted to data fields; for example, a telephone number may include an area code and exchange component. Additionally, interactions between the Primary Database 812 of the Front End Server 822 and the BackOffice component may be driven or controlled by the BackOffice component. For example, when there is an update to be performed to the Primary Database server 820, an automatic transfer of the new information may be transmitted to the Primary Database 812 by the BackOffice component. Data may be transmitted to the Primary Database 812 using connection 822. Additionally, connection 822 may be used to provide feedback or status information to the back office component 818, for example, regarding success or failure of a data transfer using connection 822.
  • the PHTML files 844 of Figure 4 are generally HTML instructions as interpreted generally by a browser with additional embedded processing instructions.
  • the PHTML execution tree 846 may be implemented as a C++ applet class with various execute methods which are conditionally performed based upon the evaluation of certain conditions as indicated in the PHTML scripting language statements.
  • Each of the PHTML files 844 may be expanded and evaluated in accordance with the particular conditions of the user request. The first time a PHTML file is accessed, it is expanded and the expanded version is placed in the PHTML execution tree 846 of Figure 4.
  • an HTML page may be formed by the parser after interaction with the data manager and query engine to select a specific number of items to be displayed to the user.
  • the HTML page may be stored in the page cache 848.
  • the page cache generally includes a naming convention, such as a file system in which the name of the file corresponds to the arguments and parameters of the query. The technique for forming the name is described in other paragraphs of this application.
  • the query engine 862 is generally responsible for performing any required sorting of the query information or subsetting and supersetting of information. Generally, the query engine 862 retrieves various identifiers which act as keys into the Primary Database 812 or Secondary Database 814 for accessing particular pieces of information in response to a user query. After the query engine 862 formulates and retrieves various identifiers, for example as from the term lists, which correspond to a particular user query, this query information in the form of term list and retrieved information may be stored in the data query cache 850. A technique similar to the page cache query-to-filename mapping technique may be used to map a particular query request to a naming scheme by which data is accessed in the data query cache. The technique for forming this name is described in other sections of this application.
  • data which is stored in the data query cache 850 may be compressed or stored in a particular format which facilitates easy retrieval as well as attempting to optimize storage of the various data queries which are cached, as discussed in other portions of this application.
  • Figures 28-30 show flowcharts of method steps of embodiments for performing processing in various components of the previously described system of Figures 2 and 4. Referring now to Figure 28, shown are steps of one embodiment of a method of processing a request in the system of Figures 2 and 4.
  • the Webserver engine invokes the Request Router in accordance with the PHTML MIME (Multipurpose Internet Mail Extension)
  • the Worker thread as included in the Request Router is initially forwarded the request for processing.
  • a determination is made as to whether or not this request is serviced by this node in accordance with the information included in the configuration and load files.
  • If, at step 924, a determination is made that the request is not to be serviced by this node, the request is forwarded to another server node in accordance with the load and configuration file information. If, at step 924, a determination is made that this request is to be serviced by this node, control proceeds to step 926 where the Worker thread allocates an available parser from the parser queue to process the incoming request. At step 928, the incoming request is passed to the designated parser for processing.
  • the parse driver of the parser parses the incoming request.
  • the query request that is parsed is included as a URL parameter that is processed by the parse driver. For example, if the query includes syntax errors, the parse driver will detect and report such errors.
  • a unique file name is determined in accordance with the query request. This filename corresponds to the display results that may be included in the page cache.
  • this filename is unique for a particular user query and in accordance with "look and feel" parameters of the display results.
  • "look and feel" refers to parameters that describe the displayed results, such as the number of business listings displayed in an HTML page and the particular starting point of the displayed results with regard to the resulting data set.
  • 15 items may be displayed. The same query performed by a second user from a different display window may display 17 items.
  • the resulting HTML page in both cases is different even though the resulting data set used in forming each of the HTML pages is the same.
  • the page cache may include a different HTML page for each of the 15 and 17 item displays.
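One way such a filename could be derived is sketched below. The actual naming technique is described elsewhere in the application and is not reproduced here, so this hashing scheme is purely an assumption for illustration; `page_cache_filename`, the parameter names and the `.html` suffix are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def page_cache_filename(query, look_and_feel):
    """Derive a deterministic filename from query and display parameters.

    Both dictionaries are sorted before encoding so that the same logical
    request always maps to the same name, regardless of parameter order.
    """
    canonical = (
        urlencode(sorted(query.items()))
        + "|"
        + urlencode(sorted(look_and_feel.items()))
    )
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"{digest}.html"
```

Under this scheme the 15-item and 17-item displays of the same query map to different names, so each page is cached separately while the underlying result set can still be shared through the data query cache.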
  • a determination is made at step 944 as to whether the page cache includes the data in the filename determined at step 942. If a determination is made that the data is included in the page cache by the existence of the file, control proceeds to step 946 where the data in the filename is retrieved from the page cache. Control proceeds to step 956 where the resulting HTML including the data in display format is delivered to the user's browser. If a determination is made at step 944 that the data is not in the page cache, control proceeds to step 948 where a determination is made as to whether or not the PHTML file is in the PHTML execution tree.
  • at step 950, the expanded PHTML representation is retrieved.
  • at step 954, portions of the PHTML file are executed in accordance with the user query to obtain data to produce the resulting HTML page by invoking the Query engine for data results.
  • the data results are returned to the parse driver that creates a resulting HTML file returned to the user's browser at step 956.
  • the resulting HTML file may be cached in the Page cache in accordance with predetermined criteria, as previously described.
  • the resulting HTML file is communicated directly to the user's browser.
  • If a determination is made at step 948 that the PHTML file is not in the PHTML cache, control proceeds to step 952 where the PHTML file is retrieved from the PHTML file storage and subsequently expanded. The expanded PHTML file is stored in the PHTML cache. Control proceeds to step 954, which is described above.
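The lookup order of steps 944 through 956 can be sketched as a single function. This is an illustrative sketch only: the dictionaries standing in for the page cache, execution tree and PHTML file store, and the `expand` and `execute` callables, are assumptions introduced for the example.

```python
def serve_page(filename, phtml_name, page_cache, execution_tree, phtml_store,
               expand, execute):
    """Return HTML for a request, consulting the caches in the described order.

    page_cache, execution_tree, phtml_store: dict-like stand-ins for the stores.
    expand: callable turning PHTML source into an expanded form.
    execute: callable producing HTML from the expanded form.
    """
    # Steps 944/946: serve directly from the page cache when the file exists.
    if filename in page_cache:
        return page_cache[filename]
    # Steps 948/950/952: reuse the expanded PHTML, or expand it and cache it.
    if phtml_name not in execution_tree:
        execution_tree[phtml_name] = expand(phtml_store[phtml_name])
    expanded = execution_tree[phtml_name]
    # Step 954: execute the expanded script to build the page.
    html = execute(expanded)
    # Cached unconditionally here; the text applies predetermined criteria.
    page_cache[filename] = html
    return html
```

Each level is consulted only when the cheaper level above it misses, so a repeated request costs a single page cache lookup.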
  • at step 962, the query engine receives an incoming request, as forwarded by the parse driver in step 954.
  • at step 964, the data is retrieved for the "normal" search results as appropriate from the data query cache, or using an alternate technique.
  • "normal" search results refers to the resulting data set formed by business listing data associated with a well-defined geographic area.
  • in contrast to "normal" search result data, other search result data may not be associated with a single well-defined geographic area, such as virtual businesses on the Internet.
  • These other search results that may not be associated with a single well-defined geographic area are described in more detail in paragraphs relating to the data query cache and its use.
  • other search data in addition to the "normal" search data may be retrieved and integrated into the resulting data set.
  • the result data set is formulated in accordance with the user query request, such as displaying results in a particular order or beginning at a particular point.
  • the resulting data set is returned to the parse driver for formatting in a display format in an
  • the Standard Industrial Classification may be used to indicate various name categories and synonyms. These various name categories and synonyms are produced, for example, by the extraction routines which produce the markup files, as used in this particular embodiment by the information retrieval software.
  • certain portions of the data storage such as the image repository 840, are updated on an incremental change or delta basis.
  • Other preferred embodiments may have different thresholds or techniques to update various data stores included in the Front End Server 804. These techniques may vary with implementation.
  • the architecture described in Figures 2 and 4 is a highly optimized, distributed, fault tolerant, collaborative architecture.
  • the primary purpose of this architecture is to support a high volume of searches, which may be performed, for example, through the Internet.
  • the databases may include business information, such as for specific businesses or classifications of businesses.
  • data queries may be performed based on characteristics of the various businesses, such as location, name, or category.
  • the architecture described herein supports a flexible presentation of these businesses, based on business agreements and service offerings.
  • the architecture described herein uses various techniques and combinations to achieve high performance while maintaining flexibility and scalability.
  • the architecture as depicted in Figures 2 and 4 includes a set of fully redundant server nodes in which each node is capable of responding to any search request.
  • Each server node communicates with all the other nodes, as previously described, establishing the health and availability of each server node.
  • Incoming requests are classified by each node, as routed by the hardware router, using a classification scheme held common and by consensus.
  • the nodes agree to a disjoint partitioning of requests to each of the server nodes in which one server node will service a set of classes of requests that no other node will generally service.
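A disjoint partitioning of request classes across nodes might be sketched as follows. The classification scheme itself is not specified in this passage, so hashing the query category is an assumption made for the example, as are the function names and the class count of 64.

```python
import hashlib


def classify(request, num_classes=64):
    """Map a request to a class number using a hash every node can compute."""
    key = request["category"].strip().lower().encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_classes


def owner(request_class, healthy_nodes):
    """Assign each class to exactly one of the currently healthy nodes."""
    nodes = sorted(healthy_nodes)
    return nodes[request_class % len(nodes)]


def should_service_locally(request, this_node, healthy_nodes):
    """True when this node is the single owner of the request's class."""
    return owner(classify(request), healthy_nodes) == this_node
```

Because every node computes the same class and the same owner from the same list of healthy nodes, exactly one node claims each request class. When a node drops out of the list, its classes are reassigned among the remaining nodes.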
  • a number of complementary techniques, including Subsumption and Highly Redundant Caching, may then be used to adapt a particular node to a particular class of requests.
  • the latency for request servicing by that node decreases as additional user queries are performed for each particular class of requests.
  • Adaptive techniques may be most effective when dealing with repeated requests or queries similar to those previously performed.
  • an initial search request may be the most costly in terms of system resources and search time. Therefore, other techniques are used in conjunction with the adaptive techniques to further facilitate performing an optimal query in response to a user request.
  • common term optimization (CTO) is one technique which is used that generally takes advantage of a statistical bias in both submitted queries and result sets towards particular words or combinations of words. By anticipating particular word combinations and precalculating result lists that match, the CTO addresses the initial search problem.
  • the Front End Server 804 has a data set domain which includes electronic yellow pages and advertising requiring a high degree of flexibility in the presentation of data. Data is generally presented using the look and feel of business partners in each business listing which may have distinct requirements for presentation.
  • new modes of data presentation may be defined on a monthly basis, requiring updates to large numbers of data stored in the back office component in the primary and secondary database.
  • the architecture described uses several techniques that also support performance requirements of the particular data domain in this embodiment and application.
  • techniques such as the generic object and the generic presentation language may be used to facilitate rapid introduction of new services and additional presentation data in a variety of forms to a user.
  • each server may be fully redundant, and there are two additional servers that are designated database servers which have additional supporting software and hardware for facilitating database access.
  • Other embodiments of the invention include additional configurations of servers and databases in their particular implementation. While including concepts and techniques described herein, the different databases and commercially available packages which may be used, as known to those skilled in the art, vary with the type of data access using searches to be performed. In this particular embodiment, a relational database structure is used to store and retrieve information in the Front End Server 804. Other embodiments may include additional types of database storage using other commercially available packages or specialized software which facilitate each particular application.
  • [Generic Objects]
  • the PHTML files 844 that are provided to the parse driver 858 are scripts that direct the parse driver 858 to perform queries, view the results of queries, and provide information to the browser 824.
  • the PHTML files 844 are expanded into the PHTML execution trees 846 the first time the parser 866 accesses the PHTML files 844.
  • the parse driver 858 accesses the PHTML execution trees 846 during operation in a manner described in more detail below.
  • the scripts that are stored in the PHTML files 844 may include commands that are interpreted by the parse driver 858, C++ objects that are executed, blocks of HTML code that are provided by the parse driver 858 to the browser 824, and any other appropriate data and/or executable statements.
  • the PHTML scripts perform operations of objects in a way that is somewhat independent of specific attributes of the objects and thus, as described in more detail below, provide a generic mechanism for displaying and presenting many types of objects.
  • the PHTML scripts include conventional commands to include other files.
  • each business listing may be represented as a document stored in the primary and secondary databases 812, 814.
  • the documents may be manipulated as generic objects. As discussed in more detail below, representing each business listing as a generic object facilitates subsequent handling of the business listings
  • a table 400 illustrates data storage for a plurality of denormalized objects in the databases 812, 814.
  • the differences between normalized and denormalized data are discussed in more detail elsewhere herein.
  • the denormalized data format is optimized for fast performance while, perhaps, forgoing some storage compaction.
  • a plurality of rows 402, 404, 406 represent a plurality of denormalized generic objects, each of which corresponds to a business listing.
  • Attributes represent various attributes of the denormalized objects.
  • the first attribute 412 corresponds to an identifier for the objects 402, 404, 406 and thus identifies a particular listing.
  • Each of the attributes contains a number of fields and contains descriptor information identifying the type, size, and number of fields. Attributes may be added to the normalized objects, or only to a specific subset thereof.
  • a denormalized representation of any one of the objects 402, 404, 406 contains the same number of attributes as any other one of the objects 402, 404, 406.
  • each object can be identified. Accordingly, if values for a new attribute are added to only a subset of the objects, then the other objects, outside the subset, will contain a null value or some other conventional marker indicating that the particular attribute is not defined (or contains no data) for the objects in question. For example, assume that a new attribute 420 is added. Further assume that the new attribute 420 only contains values for the object 402, but is not defined for the objects 404, 406. In that case, data space for the attribute 420 is still added to the denormalized version of the objects 404, 406, but no value is provided in the attribute 420 for the objects 404, 406.
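The padding behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `denormalize`, the constant `NULL_MARKER`, and the sample attribute names and values are hypothetical and do not come from the application.

```python
# Hypothetical sketch: pad normalized objects to one common attribute set.
NULL_MARKER = None  # assumed marker meaning "attribute not defined for this object"

def denormalize(objects):
    """Return (columns, rows) where every row has a slot for every attribute."""
    # Collect every attribute used by any object, preserving first-seen order.
    columns = []
    for obj in objects:
        for name in obj:
            if name not in columns:
                columns.append(name)
    # Each row gets data space for each column; missing values become NULL_MARKER.
    rows = [[obj.get(name, NULL_MARKER) for name in columns] for obj in objects]
    return columns, rows

# Only object 402 defines the newly added attribute (called "color" here).
normalized = [
    {"id": 402, "name": "Acme Florist", "color": "red"},
    {"id": 404, "name": "Bay Plumbing"},
    {"id": 406, "name": "City Diner"},
]
columns, rows = denormalize(normalized)
```

Every row ends up the same width, which is the property the denormalized table 400 relies on; the cost is the unused slots in objects 404 and 406.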
  • a table 430 represents data stored in the generic object dictionary 860 corresponding to results of a search query provided by the query engine 862 or from the data query cache 850 in the case of a previous search having been performed.
  • the objects may be object identifiers.
  • the field 412 may correspond to an object identifier of each of the objects 402,
  • the parse driver 858 uses the table 430 provided by the generic object dictionary 860, along with the PHTML execution trees 846, to provide specific HTML code from the parse driver 858 to the browser 824 of the user 802.
  • a diagram illustrates a portion 440 of the PHTML execution trees 846.
  • the portion 440 is constructed using the scripts in the PHTML files 844 and consists of a plurality of nodes corresponding to the decision points set forth in the PHTML scripts and a plurality of C++ objects and HTML pages that are executed and/or passed to the browser in response to reaching a node corresponding thereto.
  • a node 442 can correspond to a PHTML if-then-else statement having two possible outcomes wherein one branch from the node 442 corresponds to one outcome (i.e., the conditional statement evaluates to true) and another branch from the node 442 corresponds to another outcome (i.e., the conditional statement evaluates to false).
  • a structure may be implemented in a conventional manner given a scripting language such as that described above in connection with the PHTML language. That is, implementing such a tree structure using a scripting language is straightforward to one of ordinary skill in the art using conventional techniques.
  • Representing the documents (business listings) of the databases 812, 814 as generic objects facilitates modifying the documents, or a subset thereof, without modifying the parser 866. For example, if an attribute is added to some of the objects, then it is only necessary to modify the objects (schema and data) that will contain that attribute and to also modify the PHTML files 844 to include new scripting to handle that new attribute.
  • the scripting may include statements to determine if the particular attribute exists for each object. For example, suppose the business listings were in black and white and then color was added to some of the listings. The color attribute could be added to some, but not all, of the objects only in normalized form.
  • the denormalized versions of all of the objects would contain a data space for the attribute, but the objects that do not possess a color attribute will have a null marker.
  • the PHTML files 844 can be modified to test if the color attribute is available in a particular object (e.g., to test for a null value) and to perform particular operations (such as displaying the color) if the attribute exists or, if the attribute does not exist for a particular object, displaying the object in black and white. In this way, the color attribute is added to some of the objects without modifying the parser 866 and without modifying existing objects that do not contain the attribute.
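The if-then-else node of the execution tree and the null test on the color attribute can be sketched together as follows. This is a minimal Python illustration, not PHTML; the class names `IfNode` and `Emit`, the function `interpret`, and the HTML fragments are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Emit:
    """Leaf node: produces a block of HTML for the browser."""
    html: Callable[[dict], str]

@dataclass
class IfNode:
    """Decision node for an if-then-else statement with two branches."""
    condition: Callable[[dict], bool]
    then_branch: "Node"
    else_branch: "Node"

Node = Union[Emit, IfNode]

def interpret(node, obj):
    """Walk the tree for one generic object and return the HTML it yields."""
    while isinstance(node, IfNode):
        node = node.then_branch if node.condition(obj) else node.else_branch
    return node.html(obj)

# Script logic: if the color attribute is not null, display in color,
# otherwise fall back to a plain (black and white) rendering.
tree = IfNode(
    condition=lambda o: o.get("color") is not None,
    then_branch=Emit(lambda o: f'<p style="color:{o["color"]}">{o["name"]}</p>'),
    else_branch=Emit(lambda o: f'<p>{o["name"]}</p>'),
)

colored = interpret(tree, {"name": "Acme Florist", "color": "red"})
plain = interpret(tree, {"name": "Bay Plumbing", "color": None})
```

Only the tree (the script) knows about the color attribute; `interpret` itself is unchanged when attributes are added, which mirrors the point that the parser 866 need not be modified.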
  • the query engine 862 determines whether the query is found in the data query cache 850 or whether it is necessary to perform a query operation using the Verity software (discussed elsewhere herein) and the term list 836. In either instance, the results of the query are provided by the query engine 862 to the generic object dictionary 860 in a form set forth above in connection with the description of Figure 6.
  • the parse driver 858 and PHTML execution trees 846 then operate on the generic object dictionary 860 to determine what data is displayed to the user by the browser 824.
  • the PHTML execution trees 846 may require the parse driver 858 to obtain additional data from the databases 812, 814 through the data manager 864. For example, in instances where the categories corresponding to the retrieved documents (business listings) are displayed, the PHTML execution trees 846 may cause the parse driver 858 to obtain information from the generic object dictionary 860 that identifies each category and the number of listings corresponding to each category.
  • the portion of the PHTML execution trees 846 may cause the parse driver 858 to use the data manager 864 to access additional information from the databases 812, 814, such as the names of the categories corresponding to the category identifiers provided in the generic object dictionary 860.
  • An instantiator 452 creates the PHTML files 844 and constructs the PHTML execution trees 846 from the PHTML scripts the first time the PHTML is invoked by the parse driver 858.
  • Instantiation includes reading the PHTML files and constructing trees, such as that shown in Figure 7, based on the PHTML scripts provided in the PHTML files 844. As discussed above, constructing such trees from a scripting language is generally known in the art.
  • An interpreter 454 accesses the PHTML execution trees 846 and, based on the information provided therein, provides HTML data to the browser 824 and/or executes a C++ object.
  • the interpreter 454 also accesses a configuration file 456 and a state file 458 which keeps track of the state of various values during traversal of the PHTML execution trees 846.
  • the interpreter 454 also receives other data that is used to traverse the PHTML execution trees 846 and to provide information to the browser 824.
  • the other data may include, for example, data from the data manager 864 and data from the generic object dictionary 860.
  • the state data 854 includes information such as the number of iterations
  • the technique disclosed herein relates to a new data type which abstracts the data interpretation from the data typing by using data schemas.
  • a novel approach is the use of this data typing for rapid service deployment in search engines for advertising services on the Internet.
  • new presentation types may be introduced by an advertiser due to the large number of possible ways to present data to a user.
  • An advertiser may wish to change the information displayed when a user performs a query that results in displaying information regarding the advertiser's business. If there are tens of thousands of advertisers which perform this task on a monthly basis, this implies a very high rate of new presentation types which an online advertising service must be able to accommodate.
  • Use of this generic data type in GTE Superpages™ provides a flexible and efficient approach to incorporate these additional and new presentation types for large numbers of advertisers.
  • this technique provides for rapid integration of new data types without requiring recompilation or code changes in source code which uses instances of data that include the additional data types. This provides for the flexible and efficient introduction of data changes.
  • the generic data typing is optimized for performing multiple data operations by providing a small subset of possible operations or accesses upon any data of the generic data type. Therefore, this small subset of known operations may be optimized wherever there is a data access, for example, within the parser. This is in contrast to a non-generic data typing scheme which requires the introduction of a new data type and additional associated access patterns. In a non-generic data typing scheme there is an unlimited and unknown number of access patterns for which optimizations must be performed on an ad-hoc basis as new data types are introduced. Thus, when a new data type is introduced, the possible accesses need to be analyzed and optimized. In addition, the technique described herein provides for denormalized, flat representations of the objects that facilitate rapid and efficient handling thereof.
  • the parse driver 858 uses a data schema description to interpret the various data attributes and fields of the generic data objects.
  • the abstraction of the data interpretation into the data schema description enables different components of the parse driver to operate upon and use generic data objects without having these components require code changes or recompilation due to the introduction of new presentation types.
  • This embodiment relates to concepts that may be included in a variety of applications.
  • One embodiment that includes these is the GTE Super Pages on-line Internet tool that may be used to perform data queries.
  • GTE Super Pages performs this query returning search results to an on-line user.
  • Concepts which will be described in the paragraphs that follow may be generally used and adapted for use in querying any search domain.
  • Query partitioning is the strict classification and routing of a particular query, based on its input term characteristics, to a node or a particular set of nodes. This information is stored in the various configuration and load files, as described in other sections of this application.
  • Query partitioning ensures that any adaptation a node undergoes based on the characteristics of queries that it processes is maintained.
  • Specific nodes may serve specific query partitions. Caching and result set manipulation techniques may then be used on each particular node to bias each particular node to the query partition to which it has been assigned.
  • Highly redundant caching is generally a technique that trades storage space against time by storing result sets along with subsets of these result sets.
  • the highly redundant caching technique generally relies on the fact that the search time to locate an existing result is generally less than the amount of time needed to create the query result from a much larger search space.
  • One highly effective set manipulation technique, referred to as subsumption, is especially important in the adaptation of a particular node.
  • Subsumption is generally the derivation of query results from previous results, which can be either a superset of the requested result or subsets of the requested result.
  • Subsumption is also the recognition of the relationship between queries and the determination of the shortest derivation path to a result set.
  • That derivation may be the composition of several subsets resulting in a superset, or the extraction of a subset from a recognized result set.
  • an additional conjunctive ("and") search term corresponds to the formation of a subset from the superset described without the additional term.
  • the presence of an additional disjunctive ("or") search term corresponds to the identification and composition of existing subsets each described by one of the disjunctive clauses.
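The two subsumption cases can be illustrated with ordinary set operations. The sketch below is an assumption-laden Python illustration: the cache layout, the per-term index `listings_with_term`, and the listing identifiers are all hypothetical.

```python
# Cached result sets, keyed by the query that produced them (illustrative data).
cache = {
    ("MA",): {1, 2, 3, 4},
    ("TX",): {5, 6},
}

# Hypothetical per-term index used to filter a cached superset.
listings_with_term = {"RESTAURANTS": {2, 3, 6, 9}}

def add_conjunct(cached_key, term):
    """'cached AND term' is a subset extracted from the cached superset."""
    return {doc for doc in cache[cached_key] if doc in listings_with_term[term]}

def disjunction(keys):
    """'a OR b OR ...' is the composition (union) of the cached subsets."""
    result = set()
    for key in keys:
        result |= cache[key]
    return result

ma_restaurants = add_conjunct(("MA",), "RESTAURANTS")
ma_or_tx = disjunction([("MA",), ("TX",)])
```

In both cases the work is proportional to the size of the cached sets rather than to the full search space, which is the trade that highly redundant caching makes.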
  • the data included in the data query cache is placed in nonvolatile storage such that if the node were to become unavailable, data from the data cache may be fully restored once the node resumes service.
  • a composition query also uses the data in the data query cache.
  • a composition query may generally be referred to as one which is a composition of several queries, for example, when using several conjunctive search terms. For example, a request of all the French restaurants in Massachusetts, Texas and California is a composition query that may reuse any existing cached data from previous queries stored individually regarding restaurants in Massachusetts, Texas and California.
  • a composition query is generally determined by the Parse Driver, and the request router decides to which server node 808-810 within the Front End Server the composition query is sent for processing in accordance with domain weights of the configuration file. Consider the following Configuration File information based upon the previous composition query:
  • the Request Router may route the composition request to either server 1 or 2. If the request is routed to server 1, data may be cached regarding MA and TX for reuse and a new query may be performed for the CA information. If the request is routed to server 2, data may be cached for reuse regarding CA and new queries performed for the MA and TX information. The Request Router, based on the weights, sends the request to server 2 since the cost associated with performing the MA and TX queries is less than the cost of performing the CA query.
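The routing decision can be sketched as a comparison of the cost of the queries each server would still have to run. The weights below are invented purely for illustration, since the Configuration File contents are not reproduced here; the names `domain_weight`, `cached_on`, `routing_cost`, and `route` are likewise hypothetical.

```python
# Hypothetical weights: cost of running a fresh (uncached) query per domain.
domain_weight = {"MA": 2, "TX": 3, "CA": 8}

# Domains each server node already holds in its data query cache.
cached_on = {"server1": {"MA", "TX"}, "server2": {"CA"}}

def routing_cost(server, domains):
    """Sum the weights of the domains the server would have to query anew."""
    return sum(domain_weight[d] for d in domains if d not in cached_on[server])

def route(domains):
    """Pick the server with the lowest cost of uncached work."""
    return min(sorted(cached_on), key=lambda s: routing_cost(s, domains))

request = ["MA", "TX", "CA"]
chosen = route(request)
```

With these assumed weights, server 1 would need a new CA query (cost 8) while server 2 would need new MA and TX queries (cost 5), so the request goes to server 2, consistent with the outcome described above.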
  • a particular domain is associated with a particular server node upon which data query caching is performed for designated domains.
  • the domain and server weights reflect the cost associated with processing a request on each node using the data query cache. Accordingly, routing a request in accordance with these weights results in faster subsequent query times for those requests.
  • Reallocation of the requests when a server is unavailable is performed with a bias toward the initial allocation scheme as indicated by the Configuration File. There is an assumption that reallocation is on a transient basis and that the initial allocation scheme is the one to be maintained.
  • server nodes M1-M4
  • domains initially allocated to each node as indicated below: Domains D1 and D2 allocated to node M1. Domains D3 and D4 allocated to node M2. Domains D5 and D6 allocated to node M3.
  • Domains D7 and D8 allocated to node M4.
  • node M1 becomes unavailable and the routers reallocate Domain D1 to node M2 and D2 to node M3.
  • node M2 also becomes unavailable.
  • Domains D1 and D3 are reallocated to node M3 in addition to domains D5 and D6.
  • Domain D4 is reallocated to node M4 in addition to domains D7 and D8.
  • node M1 is restored and node M2 is still unavailable. Domains D1 and D2 are reallocated to M1 in addition to Domain D3.
  • Domains D5, D6 and D4 are allocated to node M3.
  • Domains D7 and D8 are allocated to node M4.
  • there is a bias toward restoring the initial allocation scheme when a node becomes available. This bias contributes to faster subsequent query times upon re-entry of a server node due to the use of the data query cache, and routing of subsequent requests to the particular nodes in accordance with this bias.
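One way to express the bias toward the initial allocation is to always return a domain to its home node when that node is available, and to spread only the displaced domains over the remaining nodes. The Python sketch below assumes a least-loaded rule for displaced domains; that rule, and the names `INITIAL` and `allocate`, are assumptions, and because the sketch keeps no history it need not reproduce every intermediate assignment in the example above.

```python
# Initial allocation scheme from the example: two domains per node.
INITIAL = {"D1": "M1", "D2": "M1", "D3": "M2", "D4": "M2",
           "D5": "M3", "D6": "M3", "D7": "M4", "D8": "M4"}

def allocate(available):
    """Map each domain to an available node, preferring its initial node.

    Assumes at least one node is available.
    """
    load = {node: 0 for node in sorted(available)}
    result = {}
    # First pass: a domain whose home node is up stays on (or returns to) it.
    for domain, home in INITIAL.items():
        if home in load:
            result[domain] = home
            load[home] += 1
    # Second pass (assumed rule): displaced domains go to the least-loaded node.
    for domain in INITIAL:
        if domain not in result:
            target = min(load, key=lambda n: (load[n], n))
            result[domain] = target
            load[target] += 1
    return result

m1_down = allocate({"M2", "M3", "M4"})
m2_down = allocate({"M1", "M3", "M4"})
all_up = allocate({"M1", "M2", "M3", "M4"})
```

Because home nodes are considered first, a restored node immediately regains its own domains and can reuse whatever its data query cache still holds.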
  • at step 200, a determination is made as to whether a data set in the data query cache corresponds to the current query being made. If so, control proceeds to step 202 where this data is retrieved and used by the query engine in formulating the query results that are displayed to the user. At this point, the processing stops at step 216.
  • parents of the current query are determined by dropping one of the terms. For example, if the query being made is for "MA AND RESTAURANTS AND FLOWERSHOPS", each of the three terms is sequentially dropped to form all combinations of two possible terms.
  • the set of parents is the following:
  • a search is made for only the parent terms.
  • other embodiments may go further in searching for results in the data query cache by also forming grandparent terms, as by dropping two terms. This process can be repeated for any number of terms being dropped and subsequently determining if any data sets in the data query cache correspond to the resulting terms.
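Forming parents and grandparents by dropping terms amounts to enumerating term combinations of decreasing size. The following Python sketch shows one way to do that and to probe a cache with the results; the function names and the cache representation (a dictionary keyed by frozensets of terms) are assumptions.

```python
from itertools import combinations

def ancestors(terms, dropped=1):
    """Term sets obtained by dropping `dropped` terms (1 = parents, 2 = grandparents)."""
    keep = len(terms) - dropped
    if keep <= 0:
        return []
    return [frozenset(combo) for combo in combinations(terms, keep)]

def cached_ancestors(terms, cache, max_dropped=2):
    """Return (terms_dropped, hits) for the closest ancestors found in the cache."""
    for dropped in range(1, max_dropped + 1):
        hits = [a for a in ancestors(terms, dropped) if a in cache]
        if hits:
            return dropped, hits
    return None, []

query = ["MA", "RESTAURANTS", "FLOWERSHOPS"]
parents = ancestors(query, dropped=1)
grandparents = ancestors(query, dropped=2)

# Assumed cache contents: only the single-term "MA" data set is present.
cache = {frozenset({"MA"}): {1, 2, 3}}
level, hits = cached_ancestors(query, cache)
```

Searching parents before grandparents keeps the starting data set as close as possible to the requested query, so fewer terms remain to be applied afterwards.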
  • preprocessing ensures that ancestor-based geography exists.
  • that ancestor is a Verity term list associated with a particular state. This implementation uses API calls to retrieve the data identifiers corresponding to the resulting data to be included in the query results.
  • If, at step 205, it is determined that there are one or more data sets in the data query cache that correspond to one or more of the parent terms, control proceeds to step 206 where a cost is associated with each parent.
  • a cost is associated with each parent.
  • One embodiment associates a cost with each parent term in accordance with the number of listings of each parent term. This may also be normalized and used in a percentage form by dividing the number of listings in the parent domain by the total number of listings in the query domain. This percentage represents the probability of a business listing belonging to the parent data set appearing in the database. Control proceeds to step 208 where the parent with the minimum cost is chosen as the starting data set for formulating the data results. At step 210,
  • the minimum cost derivation sequence is applied to produce the resulting data query.
  • the minimum cost derivation sequence is obtained by operating upon the least probability terms first.
  • the determination of the start data set in step 208 may be the data set which is closest in terms of parentage and with the least number of listings in the data set. The proximity in parentage is the primary ranking basis, with the number of listings being secondary in determining ranking.
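The ranking used to pick the starting data set can be written as a two-part sort key: parentage proximity first, listing count second. The Python sketch below uses invented listing counts, and the tuple layout and function names are assumptions.

```python
def normalized_cost(listing_count, total_listings):
    """Fraction of the query domain covered by a parent's data set."""
    return listing_count / total_listings

def choose_start(candidates):
    """Pick the cached data set ranked first.

    Each candidate is (terms_dropped, listing_count, name); fewer dropped
    terms means closer parentage, and fewer listings breaks ties.
    """
    return min(candidates, key=lambda c: (c[0], c[1]))

# Illustrative candidates found in the data query cache (counts are made up).
candidates = [
    (1, 5000, "MA AND RESTAURANTS"),
    (1, 300, "MA AND FLOWERSHOPS"),
    (2, 50, "FLOWERSHOPS"),
]
start = choose_start(candidates)
share = normalized_cost(300, 1200)
```

Note that the grandparent with only 50 listings is not chosen: parentage proximity outranks listing count, as described above.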
  • Figure 34 shows a diagram of one example used in step 210 for determining and applying the best derivation sequence.
  • the query is for MA AND RESTAURANTS AND FLOWERSHOPS.
  • MA is the starting data set which is located in the data query cache.
  • the parentage has been extended to grandparents, and MA has been determined to be the first ranking data set in terms of parentage and number of listings in the data set.
  • control proceeds to one of two states, 232 representing "MA AND RESTAURANTS", or 234 representing "MA AND FLOWERSHOPS".
  • the state to which control is advanced depends generally on choosing the path with the minimum associated cost at each step. In this instance, the number of elements in the data sets "FLOWERSHOPS" (state 234) and "RESTAURANTS" (state 232) may be considered in determining cost.
  • control proceeds to state 236 where searching of the data set elements is performed to produce the final resulting data set representing "MA AND RESTAURANTS AND FLOWERSHOPS".
  • the approach just described is to advance to the next state which has the minimum cost associated until the final resulting data set is determined.
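The state-by-state advance can be sketched as a greedy loop that, at each step, applies whichever remaining term has the smallest data set. Using set size as the cost follows the example above; the function name `derive` and the listing identifiers are assumptions.

```python
def derive(start_ids, term_sets):
    """Apply the remaining terms in order of increasing cost (data set size).

    Returns the final result set and the order in which terms were applied.
    """
    result = set(start_ids)
    remaining = dict(term_sets)
    order = []
    while remaining:
        # Advance to the next state with the minimum associated cost.
        term = min(remaining, key=lambda t: (len(remaining[t]), t))
        result &= remaining.pop(term)
        order.append(term)
    return result, order

# Starting data set "MA" comes from the data query cache (illustrative ids).
ma = {1, 2, 3, 4, 5}
term_sets = {
    "RESTAURANTS": {1, 2, 3, 9, 10, 11},
    "FLOWERSHOPS": {3, 4},
}
final, path = derive(ma, term_sets)
```

Applying the smaller set first shrinks the intermediate result sooner, so the later, more expensive step scans fewer elements.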
  • the data is partitioned by states
  • the adaptive techniques as described with regard to the GTE Superpages application described herein include partitioning the data sets based on geography, particularly within each state. In this instance, particular server nodes are designated as primary query servers based on geographic location by state. Additionally, as part of this partitioning of requests, the data query caches and term lists of identifiers are also partitioned according to state. In this embodiment, this partitioning is done as a preprocessing step prior to servicing a request in that the identifiers are formed and placed on each dedicated server node.
  • this partitioning may be determined based on expected data queries and data sets formed accordingly, for example, by examining log files with recorded data query search histories to determine frequently searched categories or combinations of categories.
  • a query request is generally the combination of boolean operators and search terms.
  • Key-value pairs or terms may be joined by the logical boolean AND operation, represented, for example, as "&”.
  • the logical boolean OR operation may also be represented, for example, by another symbolic operator such as a ",". For example, when looking for either of the cities ACTON or BOSTON, this may be represented as
  • keys include (T) City, (B) Business Listing, (S) State, (R) Sort Order, (LT) Latitude, (LO) Longitude, and (A) Area Code.
  • LT and LO may be used to calculate data sets relating to proximity searches, such as restaurants within thirty (30) miles of Boston.
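A proximity search over the LT and LO keys needs some distance measure between coordinates. The sketch below uses the haversine great-circle formula, which is a choice made for this example and not something stated in the text; the coordinates and function names are likewise illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def within(listings, center, miles):
    """Keep the listings whose (LT, LO) lies within `miles` of `center`."""
    lat, lon = center
    return [l for l in listings
            if distance_miles(lat, lon, l["LT"], l["LO"]) <= miles]

boston = (42.3601, -71.0589)
listings = [
    {"name": "Cambridge Cafe", "LT": 42.3736, "LO": -71.1097},
    {"name": "Worcester Grill", "LT": 42.2626, "LO": -71.8023},
]
nearby = within(listings, boston, 30)
```

A production system would more likely pre-filter with a bounding box or spatial index before computing exact distances; the linear scan here only illustrates the calculation.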
  • the Data Query Cache 850 in this embodiment, generally includes a "hot” and "cold” cache
  • the caching technique implemented is the LRU (Least Recently Used) policy by which elements of the cache are selected for replacement in accordance with time from last use. These and other policies are generally known to those skilled in the art.
  • the "hot" cache may include the most recently used items and the cold cache the remaining items
  • each of the data query caches and other caching elements as depicted in Figure 2 may be fast memory access devices, as known to those skilled in the art, used generally for caching.
  • the "hot" cache is implemented as storing the data in random access memory. This may be distinguished from the storage medium associated with the "cold" cache representing those items which are determined, in accordance with caching policies such as the LRU, to be least likely to be accessed when compared with the items in the hot cache which are determined to be more likely to be accessed.
  • a double ended queue structure is used to store cached objects, but other data structures known to those skilled in the art may be used in accordance with each implementation.
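A two-tier cache with LRU demotion can be sketched with a double ended queue tracking recency in the hot tier. In this Python illustration both tiers are plain dictionaries; treating `cold` as a stand-in for slower storage, and the class and method names, are assumptions.

```python
from collections import deque

class HotColdCache:
    """Hot tier with LRU demotion into a cold tier (both in memory here)."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.order = deque()  # hot keys, most recently used on the left
        self.hot = {}         # stand-in for random access memory
        self.cold = {}        # stand-in for the slower "cold" medium

    def _touch(self, key, value):
        # Make `key` the most recently used hot entry.
        if key in self.hot:
            self.order.remove(key)
        self.cold.pop(key, None)
        self.hot[key] = value
        self.order.appendleft(key)
        # Demote least recently used entries once the hot tier overflows.
        while len(self.order) > self.hot_capacity:
            lru = self.order.pop()
            self.cold[lru] = self.hot.pop(lru)

    def put(self, key, value):
        self._touch(key, value)

    def get(self, key):
        if key in self.hot:
            value = self.hot[key]
        elif key in self.cold:
            value = self.cold[key]  # cold hit: promote back to hot
        else:
            return None
        self._touch(key, value)
        return value

cache = HotColdCache(hot_capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)   # "a" is least recently used and is demoted to cold
```

`deque.remove` is linear in the number of hot entries, which is acceptable for a sketch; an implementation with many hot entries would typically pair the queue with a direct index.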
  • Data sets that are stored in the data query cache and page cache each correspond to a particular search query.
  • a mapping technique may be used to map a particular query to corresponding data as stored in the data query cache and the page cache.
  • this mapping uniquely maps a data query to a name referring to the data set of the data query. In this embodiment, this allows quick access of the data set associated with a particular query and quick determination of whether such a data set exists, for example, in the data query cache.
  • Figure 35 shows a flowchart of an embodiment of the steps for forming a name associated with a data set, as may be stored in the data query cache or page cache.
  • a subset of query terms is determined such that a string representing a particular query is uniquely mapped to a name corresponding to a data set.
  • the subset of keys that are used in mapping a string corresponding to a query to a name of a data set include
  • Proximity represents the proximity in physical distance to or from a geographic entity, such as a city.
  • "City", "State", "Street", "Zip", "Area Code", "Phone Number", and "Business Name" represent what the keys semantically describe as pertaining to a business listing. "Category" represents a classification as associated with each business, such as representing a type of business service.
  • Category Identifier is an integer identifier representing a category ID.
  • Keywords indicate an ordering priority for the resulting data set.
  • "National Account" represents a business or service level parent-child relationship where the national account indicates the parent. An example is a parent-child relationship between a parent corporation and its franchises.
  • a query string corresponding to a particular user query is formed using the original string as formed, for example, by the Parser of Figure 2.
  • the query string includes only those terms which are included in the subset as identified in step 240. If the original string does not include an item that is in the subset, for example, since the user query does not include the item as a search term, that item is omitted in forming the query string corresponding to the data set.
  • this query string is used to determine if a data set is located in the data query cache that corresponds to the current user query request.
  • the data sets each correspond to a filename.
  • a lookup as to whether a data set corresponding to a particular user query exists may be determined by performing a directory lookup, for example, using file system services as may be included in an operating system, upon a device which serves as a fast memory access or other caching device. It should be noted that this technique may be used generally within the Superpages Front End Server and BackOffice to form unique names that correspond to particular search terms.
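The name-forming steps and the directory lookup can be sketched as follows. The particular key subset in `NAME_KEYS`, the `key=value&...` name syntax, and the helper names are assumptions chosen for illustration; the sketch also assumes that the values are safe to use in a filename.

```python
import os
import tempfile

# Assumed subset of keys that participate in the name, in a fixed order.
NAME_KEYS = ("State", "City", "Category")

def cache_name(query):
    """Build the data set name from the subset keys present in the query."""
    parts = [f"{key}={query[key]}" for key in NAME_KEYS if key in query]
    return "&".join(parts)

def cached_path(cache_dir, query):
    """Directory lookup: return the file path if the data set exists, else None."""
    path = os.path.join(cache_dir, cache_name(query))
    return path if os.path.exists(path) else None

# Usage: "Display" is not in the subset, so it does not affect the name.
query = {"City": "BOSTON", "State": "MA", "Display": "short"}
name = cache_name(query)
with tempfile.TemporaryDirectory() as cache_dir:
    miss = cached_path(cache_dir, query)
    open(os.path.join(cache_dir, name), "w").close()  # simulate caching the result
    hit = cached_path(cache_dir, query)
```

Fixing the key order matters: two queries with the same terms in a different order must map to the same name, otherwise the exact-match check would miss data that is already cached.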
  • one embodiment may include services for operating upon the original query string as formed by the Parser to produce parents and grandparents of the terms included in a query when performing the method steps of Figures 33 and 34 if there is no exact data set match in the data query cache.
  • This may provide the advantage of insulating other code, such as in data encapsulation, from knowing the internal structure of the query string.
  • this is a common programming technique to insulate code portions from changes in data types and structures and to minimize, for example, the amount of recompilation when a new data type is introduced or an existing data type is modified.
  • Other techniques, such as hashing, may be used to generate a unique identifier for the input string, as known to those skilled in the art.
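As a sketch of the hashing alternative, a cryptographic digest of the query string yields a fixed-length identifier that is safe to use as a filename. The choice of SHA-256 and the function name are assumptions; a digest is unique only in the practical sense that collisions are extremely unlikely.

```python
import hashlib

def hashed_name(query_string):
    """Fixed-length hexadecimal identifier derived from the query string."""
    return hashlib.sha256(query_string.encode("utf-8")).hexdigest()

first = hashed_name("State=MA&City=BOSTON")
again = hashed_name("State=MA&City=BOSTON")
other = hashed_name("State=MA&City=ACTON")
```

Unlike a name built directly from the terms, a digest hides the query structure, so producing parents and grandparents would still have to start from the original string.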
  • the Page Cache name includes a parameter in forming the name uniquely identifying the filename including the result set for a query in this particular display format.
  • the data query cache includes cache objects in which each cache object corresponds to a particular cached query resulting data set. Referring now to Figure 36, shown is a block diagram of one embodiment of a data set as stored in the data query cache.
  • each data set 250 includes header information 252 and information corresponding to one or more business listings.
  • header information may include information describing the data query set, such as the number of business listings in the data set. Other types of information may be included in accordance with each particular application and implementation.
  • Each business listing 254 generally includes information that describes the business listing. More particularly, this information includes data that is cached as needed by other components of the Front End Server, for example, in performing various searches, data retrieval, and other operations upon data in accordance with functionality provided by the embodiment. In this instance, the following types of fields of information are stored for each business listing 254:
  • relevance information is Verity-specific information as it relates to the query. For example, this generally represents the frequency of words or terms in a document.
  • the advertiser priority indicates a service level that may be used in presenting business listings, for example, in a particular order to a user. For example, if a first advertiser purchases "gold" level advertising services, and a second advertiser purchases "silver" level advertising services, when a user requests only 15 listings to be displayed, the "gold" level advertisements may be displayed prior to the other advertisements by other advertisers, such as the "silver" level service purchaser. Thus, a higher level of service may guarantee an advertisement be placed earlier in the displayed results.
  • the technique used to store the data in the data cache from memory includes object serialization and deserialization techniques, as known to those of ordinary skill in the art. These techniques transform an internal storage format, as may be stored in random access memory, to a format suitable for persistent storage in a file system, as in the data query cache.
  • the complementary operation is also performed from persistent storage to the in-memory copy.
  • object serialization, i.e., from memory to persistent storage device in cache, is performed by storing the data type, its length, and the data itself. It should be noted that the length may not be needed for each data field, for example, in fixed length data types.
  • the complementary operation of object deserialization is generally performed by reading the fields in the same order as written to the cache.
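The type, length, data layout can be sketched with Python's `struct` module. The type tags, the little-endian byte order, and the field widths below are assumptions; integers are written without a length because they are fixed length, as noted above.

```python
import struct

TYPE_INT, TYPE_STR = 1, 2  # assumed one-byte type tags

def serialize(fields):
    """Write each field as: type tag, optional length, then the data."""
    out = bytearray()
    for value in fields:
        if isinstance(value, int):
            # Fixed-length type: 8-byte signed integer, no length prefix.
            out += struct.pack("<Bq", TYPE_INT, value)
        else:
            data = value.encode("utf-8")
            out += struct.pack("<BI", TYPE_STR, len(data)) + data
    return bytes(out)

def deserialize(blob):
    """Read the fields back in the same order in which they were written."""
    fields, pos = [], 0
    while pos < len(blob):
        tag = blob[pos]
        pos += 1
        if tag == TYPE_INT:
            (value,) = struct.unpack_from("<q", blob, pos)
            pos += 8
        elif tag == TYPE_STR:
            (length,) = struct.unpack_from("<I", blob, pos)
            pos += 4
            value = blob[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unknown type tag {tag}")
        fields.append(value)
    return fields

listing = [402, "Acme Florist", 3]
blob = serialize(listing)
restored = deserialize(blob)
```

Because the reader relies only on the tags and lengths in the stream, new field values can be appended without changing the reader, as long as the write order is preserved.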
  • the Page Cache may be implemented as HTML files in a file structure located on a disk or other storage device
  • the PHTML execution tree may be implemented as an in-memory linked list or other abstract data structure representation of the C++ objects.
  • the data query cache may include different types of cached geographical data as may be used in performing different data queries.
  • the type of data cached described in the prior paragraphs is the "normal" business listing data as associated with a well-defined geographic area.
  • Other businesses for example, such as a florist or an airline, may not be associated with a single well-defined geographic location.
  • a business may not have any geographic bounds, such as if it is an Internet business with a virtual storefront accessible on the Internet.
  • other businesses may be located in a particular well-defined geographic area, such as an airline with a physical presence in a particular city, but the service area which corresponds to the service offered does not correspond to the location of the business itself.
  • multi-city placement may be described as representing a business' service area in multiple cities when data queries are performed.
  • An example may be a plumbing service located in three (3) cities with service areas in ten (10) cities.
  • the total-city placement may generally be described as representing a business' service area in all cities when searches are performed.
  • An airline which services all major U.S. cities is generally an example of this.
  • the total-city and multi-city search results are cached separately from the "normal" query results, but are composited with the normal search results prior to retrieving the data from the database.
  • the total and multi-city query results are retrievable independent of the "normal" search results.
  • the storage format for this information may be as described for "normal" query results.
  • other embodiments may use a different format for storage than the "normal" search results, for example, if other information is deemed to be important in accordance with each implementation.
  • FIGS. 37 and 38 show a flowchart of an embodiment of a method for integrating total-city and multi-city cache results into "normal" cached search results.
  • a total-city cache name conesponding to the data query is formed.
  • the total city cache name is formed by starting with the string
  • At step 264, the total-city data set cached item is moved to the hot cache, if not already in the hot cache. A reference to this data set is saved for later retrieval in other processing steps. If at step 262, a determination is made that the total-city query cached data set corresponding to the total-city cache name does not exist, control proceeds to step 266 where a search is performed for the total-city query. At step 268, the search results are cached, as in the "hot" cache. A reference to these search results is stored for use in later processing steps.
  • A multi-city cache name is constructed representing the multi-city cache corresponding to the current data query.
  • This multi-city cache name may be constructed by forming a string using the same fields extracted from the original data query string as formed by the parser in conjunction with forming the total-city name. Similar to forming the data query name for the "normal" cached search results, the string corresponding to the cached data set for a given query uniquely identifies the data set.
  • At step 272, a determination is made as to whether there is multi-city cached data corresponding to the current multi-city cache name. If, at step 272, a determination is made that such a data set exists in the multi-city cache, control proceeds to step 274 where the data is moved to the "hot" cache, if not already located there. Additionally, a reference to this location in the "hot" cache is saved for use in later processing steps. If, at step 272, a determination is made that such a data set does not exist in the multi-city cache, control proceeds to step 276 where a search of the database is performed. The query results, if any, are cached in the "hot" cache with a reference to the results saved for use in later processing steps.
  • The total-city and multi-city data cache results are integrated with the "normal" query results. After the "normal" query is performed, but before sorting the search results, the total-city cached results, if any, may be combined with the "normal" query results. If there are no total-city cached results, the multi-city results may be included, if any.
  • The combined search results are then sorted such that any redundant listings are removed. Any additional processing is performed, as in accordance with the user query, for example, as producing the listings which begin with "B", or only listing the top ranked fifteen (15) listings as ranked in accordance with other user specified criteria.
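The cache lookup and compositing steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of a plain dictionary as the cache, and the representation of a listing as an `(id, name)` pair are all assumptions. The sketch follows one literal reading of the text, in which multi-city results are included only when there are no total-city results.

```python
def lookup_or_search(cache, cache_name, run_search):
    """Return the data set cached under cache_name, searching on a miss."""
    if cache_name not in cache:
        cache[cache_name] = run_search()  # cache the fresh search results
    return cache[cache_name]


def composite_results(normal, total_city, multi_city):
    """Combine result lists of (listing_id, name) pairs, dropping duplicates.

    Assumption: multi-city results are used only when the total-city
    result list is empty, per one reading of the description.
    """
    extra = total_city if total_city else multi_city
    merged = {}
    for listing_id, name in list(normal) + list(extra):
        merged.setdefault(listing_id, name)  # first occurrence wins
    # Sort by listing name; redundant listings were removed above.
    return sorted(merged.items(), key=lambda item: item[1])
```

The cache name is passed in by the caller because the text does not fully specify how the name string is built.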
  • A garbage collection technique may be included to remove or delete cached objects that have been determined to be "old" in accordance with predetermined criteria. For example, in one embodiment using the LRU caching scheme, whenever the amount of free cache space falls below a threshold level, the garbage collection routine is invoked.
  • The threshold level includes parameters relating to a predetermined number of cache objects and the accumulated size of the objects in the cache.
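As a rough sketch of the LRU scheme with a two-part threshold, the class below evicts least-recently-used objects whenever either the object count or the accumulated size exceeds its limit. The class name, method names, and the choice to express "free space below a threshold" as "usage above a limit" are assumptions made for illustration.

```python
from collections import OrderedDict


class LruCache:
    """Least-recently-used cache bounded by object count and total size."""

    def __init__(self, max_objects, max_bytes):
        self.max_objects = max_objects
        self.max_bytes = max_bytes
        self.items = OrderedDict()  # key -> (value, size), oldest first
        self.total_bytes = 0

    def get(self, key):
        value, _ = self.items[key]
        self.items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, size):
        if key in self.items:
            self.total_bytes -= self.items.pop(key)[1]
        self.items[key] = (value, size)
        self.total_bytes += size
        self.collect_garbage()

    def collect_garbage(self):
        """Evict the oldest objects until both limits are respected."""
        while self.items and (
            len(self.items) > self.max_objects
            or self.total_bytes > self.max_bytes
        ):
            _, (_, size) = self.items.popitem(last=False)
            self.total_bytes -= size
```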
  • Although there may be multiple conceptual caches, such as the "normal" data query cache, the multi-city cache, and the total-city cache, the cached results may physically reside in the same "hot" and "cold" caching devices.
  • The query engine 862 may, in an embodiment of the invention, include information retrieval software 908 to retrieve records from the Primary Database 812 that correspond to the user's query.
  • The query engine 862 may include more than one form of information retrieval software. For example, the query engine, in addition to including the information retrieval software 908 that is to be used to obtain listings in response to user queries, may further include banner ad retrieval software 909 for retrieving advertisements that relate to the user's query.
  • The information retrieval software 908 may include functionality of software such as the Information Server Version 3.6 software commercially available from a company known as Verity. Other commercial packages of information retrieval software are available, and the techniques described herein could also be employed using proprietary software coded by the user.
  • The information retrieval software 908 includes the Information Server Version 3.6 software and additional extensions provided by the host of the GTE Superpages system.
  • The information retrieval software 908 may at a step 82 access markup language files 906, as depicted in Figure 25, which are produced by the extraction routines 902 from the normalized data 900.
  • The markup language files consist of business listings that are stored in the Primary Database 812.
  • The information retrieval software 908 may then, at a step 84, produce term lists 836 that are further used by the information retrieval software 908 to handle queries that are delivered to the query engine 862.
  • The term lists 836 may consist of a linked list for each term that appears in one of the business listings, with the elements of the linked list including a document identifier for the business listing and certain statistics regarding the frequency of occurrence of the particular term in each document and in the document set as a whole.
  • The banner ad retrieval software 909 may similarly generate and use banner ad term lists 837 that are further used by the banner ad retrieval software 909 to handle generation of appropriate banner ads.
  • The term lists, which in an embodiment are generated using Verity software, may be expanded at a step 86 to include synonyms for the terms appearing in the business listing.
  • the term "diner" appears in a business listing
  • the term "restaurant" might be assigned to the file for that business listing as stored in the Primary Database 812
  • the expansion of the listings to include synonyms of the words included in the listings may be accomplished by execution of PHTML scripts or other programming techniques.
  • The expansion may establish a hierarchical structure; for example, the term "restaurant" may be stored in a tree that includes the sub-category of "ethnic restaurant," which may further include the sub-category "greek restaurant."
  • PHTML scripts may be provided to establish the tree structure and to operate on the tree structure to retrieve results that will be provided to the user.
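The synonym expansion and category tree described above might look like the sketch below. The text performs this with PHTML scripts; Python is used here purely for illustration, and the two lookup tables and function names are hypothetical, populated only with the examples the text gives.

```python
# Hypothetical lookup tables built from the examples in the text.
SYNONYMS = {"diner": ["restaurant"]}
CATEGORY_TREE = {
    "restaurant": ["ethnic restaurant"],
    "ethnic restaurant": ["greek restaurant"],
}


def expand_listing_terms(terms):
    """Append known synonyms to the terms of a business listing."""
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def subcategories(category):
    """Return every sub-category below `category`, depth first."""
    found = []
    for child in CATEGORY_TREE.get(category, []):
        found.append(child)
        found.extend(subcategories(child))
    return found
```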
  • the steps 82, 84 and 86 may be accomplished at initialization of the system, thus establishing and expanding the term lists 836, 837 for later use.
  • the system may operate to obtain results that are to be displayed to the user.
  • the steps for obtaining results may be seen in a flow chart 88 displayed in Figure 41.
  • The parse driver 858 may at a step 20 parse a user query and deliver the parsed query in suitable form for handling by the query engine 862.
  • The query engine may include the information retrieval software 908.
  • the query engine 862 may operate the information retrieval software 908 to take the parsed user request and expand the query, turning the user request into a detailed query.
  • the information retrieval software may operate on the expanded term lists 836 by identifying documents associated with the terms identified in the expanded query.
  • the term lists 836 are the business listings described in connection with steps 82, 84 and 86 above, expanded to include synonyms and terms that are determined to be related to the words in the business listing.
  • Identification of documents may be accomplished by a variety of information retrieval techniques. Documents may also be associated with queries by sorted relevancy ranking, clustering (automated grouping of related documents), automated document summarization (creation of content abstracts, not simply the first few sentences of the document) and query-by-example (turning an individual document into a query in order to retrieve "more documents like this"). These functions may be accomplished by software techniques, such as having a table of pointers having as an argument a tokenized version of each possible term from the expanded user query from the step 22.
  • the table of pointers may point to the location of a term list 836 for each such term.
  • the term list may be a linked list of documents that include the term.
  • The linked list may include information about each document, such as the number of occurrences of the term in the document, the inverse frequency of the term in the entire set of documents, the association of the document with other documents, the association of the document with categories, and the like.
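A term list structure of this kind can be sketched as a dictionary (standing in for the table of pointers) that maps each token to a list of postings. This is an assumed simplification: whitespace tokenization, Python lists in place of linked lists, and only the per-document occurrence count stored in each posting.

```python
from collections import Counter, defaultdict


def build_term_lists(documents):
    """Map each term to a list of (document_id, occurrences) postings.

    `documents` maps a document identifier to its listing text.
    """
    term_lists = defaultdict(list)
    for doc_id, text in documents.items():
        for term, count in Counter(text.lower().split()).items():
            term_lists[term].append((doc_id, count))
    return dict(term_lists)
```

Looking up `term_lists["boston"]` then plays the role of following the pointer for the token "boston" to its list of documents.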
  • an indexing architecture such as that provided by Verity allows for incremental indexing, so that only new, updated or deleted documents require changes, avoiding the need for a complete re-index each time a document changes.
  • Online identifiers may be provided, so that searches can continue while the identifiers are modified. This function is also provided by the Verity software.
  • a variety of weighting algorithms can be used to rank documents identified in the step 24 according to the information stored in the term lists 836.
  • A simple weighting algorithm might take a single term query, such as a category of information, and rank each document in a term list 836 in numerical order according to the product of the term frequency (the number of times a term appears in the document) and the inverse document frequency (the inverse of the number of times the term appears in the entire document set).
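The simple weighting just described can be sketched for a single-term query as below. The function name and the posting format `(doc_id, term_frequency)` are assumptions; the inverse document frequency is computed exactly as the text defines it, as the inverse of the term's total count across the document set.

```python
def rank_single_term(term_list):
    """Rank documents for one term by term frequency times inverse frequency.

    `term_list` is a list of (doc_id, term_frequency) postings.
    """
    if not term_list:
        return []
    collection_count = sum(tf for _, tf in term_list)
    inverse_frequency = 1.0 / collection_count
    scored = [(doc_id, tf * inverse_frequency) for doc_id, tf in term_list]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For a single term the inverse frequency is the same for every document, so the ordering reduces to ordering by term frequency; the factor matters once scores for several terms are combined.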
  • A list of the ranked documents may be further processed by the information retrieval software to provide a results page.
  • The information retrieval software 908 may determine categories into which the retrieved documents fall.
  • The categories are yellow pages categories, which have been previously assigned to the documents, which are business listings, prior to entry of the business listings in the Primary Database 812.
  • The information retrieval software 908 determines what categories are associated with the business listings retrieved by the ranking at the step 28.
  • the information retrieval software 908 may compare the categories identified at the step 30 to the terms in the user query.
  • If categories are present that do not include any of the terms in the user query, then, at a step 92, such categories may be discarded. Thus, the user will not retrieve categories that are unrelated to the user query. Such categories might otherwise appear, for example, if the information retrieval software 908 retrieves a business listing that is associated with two unrelated categories, only one of which is relevant to the user query. For example, a query for a restaurant might retrieve a listing for "Joe's restaurant and bowling alley." The information retrieval software 908 might then retrieve the categories "restaurants" and "bowling" that would have been associated with that listing.
  • the "bowling" category would be discarded, because the user query for a restaurant is unrelated to the "bowling" category.
  • The term comparison may use an expanded version of the terms in the query and in the categories. Thus, a category would not be discarded if it includes a synonym of a query term, even if the category does not include an exact term match.
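The category filter with expanded term comparison can be sketched as follows. The function names and the shape of the synonym table (a dictionary from a word to a set of related words) are assumptions for illustration.

```python
def expand(words, synonyms):
    """Return the lower-cased words plus any synonyms listed for them."""
    words = [word.lower() for word in words]
    expanded = set(words)
    for word in words:
        expanded.update(synonyms.get(word, ()))
    return expanded


def keep_relevant_categories(categories, query_terms, synonyms):
    """Discard categories sharing no expanded term with the query."""
    wanted = expand(query_terms, synonyms)
    return [
        category
        for category in categories
        if expand(category.split(), synonyms) & wanted
    ]
```

With the text's example, a query for "restaurant" keeps the "restaurants" category and drops "bowling", provided the synonym table relates "restaurant" to "restaurants".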
  • The information retrieval software may, at a step 94, determine whether there are any remaining categories.
  • The system may display a results page that consists of a list of the remaining categories.
  • The results page may further include an indication of the number of listings that are associated with each category.
  • The document identifiers established for information retrieval software 908 may maintain pointers to other documents or to sources of the documents, such as URLs or file names. Thus, the identifiers may be stored apart from the documents, allowing separate, non-invasive use of the identifiers, while maintaining the integrity of the data.
  • CTO Common Term Optimization
  • A series of steps may be performed as pre-processing operations in order to classify and establish query result sets for common queries.
  • Common terms may be identified prior to system initialization. Designation of common terms may be performed based on a number of different factors. For example, a single word might in theory be designated a common term, if it appears with a high frequency in result sets obtained by users. It is noted that a single word common term may offer relatively little benefit in search efficiency, because the term lists 836 already permit searching based on individual terms. Alternatively, common terms might consist of multiple word combinations of any length, whether bi-grams, tri-grams, or n-grams.
  • Words that co-occur in high frequency can be designated as common terms, such as in a bi-gram format.
  • the bi-gram “Boston - restaurant” might be designated a common term.
  • terms may be linked to specific contexts; that is, terms may be designated or classified as common terms in part according to their context.
  • The term “Boston” might be considered a common term if entered in the "city” field, but it might not be considered a common term if entered in a "business name” field or a "category” field.
  • the term “restaurant” might be a common term in the "category” field, but would not be considered a common term in the "city” field.
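Context-linked common terms can be sketched by keying each common term on the field in which it was entered. The set contents below come from the examples in the text; the set name and function name are hypothetical.

```python
# Hypothetical common-term set keyed by (field, term), per the examples.
COMMON_TERMS = {("city", "boston"), ("category", "restaurant")}


def is_common_term(field, term, common_terms=COMMON_TERMS):
    """Return True if `term` is a common term in the context of `field`."""
    return (field.lower(), term.lower()) in common_terms
```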
  • The common term sets may be structured to reflect context.
  • Each term might be expanded to include both synonyms for the term and other terms that are semantically related to the common term in the established context for the term.
  • The pre-processing steps 32, 33 and 35 might be accomplished in a different order, and other steps might be included in embodiments of the invention.
  • Following steps 32, 33 and 35, it is possible to establish lists or identifiers at a step 46 that include the expanded common term n-grams.
  • One way of dealing with common term combinations would be to generate in advance term lists 836 that are predicted to be used with some frequency (e.g., restaurants, Boston, New York, etc.) and to pre-calculate the intersection of the likely combinations. This approach requires substantial processing and would have to be performed frequently, given frequent changes to the identifiers.
  • A term list 836 might consist of a linked list of documents, such as business listings, that contain the terms "Boston" and "restaurant," (or synonyms thereof) in the contexts in which those terms are common.
  • The term lists 836 may, like other term lists 836 described elsewhere herein, further include information as to the term frequency of each term, synonym or related term, and the inverse document frequency of the term, synonym or related term in all documents in the set.
  • The synonyms and related terms may be included in the actual business listings that are used to generate term lists 836, so that those listings will be included in the generation of common term lists. In an embodiment, the listings themselves may be classified as to common terms and synonyms or related terms of those terms. Listings may be further classified as to sub-contexts, depending on the search context. Listings using identical terms should also be included in term lists, because they use identical token identifiers for such terms. For example, the term "Boston" should be understood in a nationwide search to include listings in both Boston, Massachusetts and Boston, Kentucky, because the token for the term "Boston" will be the same in each case.
  • Result sets must be identified as tokenwise semantically related to the classifications that are possible in a search. Results are thus classified into common term groups on a listing-by-listing basis.
  • The common term lists 836 for combined terms can be stored in a designated area of the primary database 812, front end server 804, or server node 808-810 that allows a rapid search in the event common term combinations are included in the user query.
  • the common term lists are thus assigned to a special results area for common term searches.
  • the steps 46 and 48 may be performed upon initialization of the system.
  • result sets are established for common term searches, and the result sets are stored in a special location in memory for rapid retrieval.
  • Query rules may be established that direct appropriate user queries to the special location in memory established at the step 48.
  • The user might enter a query on a template 34 that is displayed as a page, such as a markup language page, on the user's browser 824.
  • the template might include fields 36, such as a category field 38, a business name field 40, a city field 42 and a state field 44.
  • the query is delivered to the parser 866 of the server 808 to which that user has been routed.
  • The query is then used, as described above in connection with Figure 41, to retrieve documents.
  • The documents that are retrieved at the step 28 and displayed at the step 30 of Figure 41 are a set of matching categories for the query. For example, as depicted in Figure 44, if the user enters the category "art supplies," the information retrieval software 908 may retrieve a set of matching categories that relate to art supplies.
  • The retrieved categories may be ordered alphabetically, by order of significance, or grouped by sub-categories.
  • The user then may select categories among the matching categories to receive either further sub-categories or documents, such as advertisements or other markup language pages, that correspond to the categories. In an embodiment, the information retrieval software 908 may immediately retrieve matching documents, such as specific advertisements or other markup language pages, rather than categories of documents.
  • This direct retrieval step may be accomplished, for example, when one of the user-entered categories is an exact match to one of the categories included in the term lists 836.
  • The information retrieval software 908 retrieves documents from the term lists 836 that correspond to a ranking of an expansion of the user-entered query.
  • The information retrieval software 908 may, in a conventional manner, retrieve term lists 836 that correspond to each of the terms of the query, such as a list corresponding to the category "restaurant" and a list corresponding to the city field "Boston."
  • The information retrieval software 908 could then perform an intersection of the two sets and perform a ranking of the related categories (e.g., Italian restaurants in Boston, French restaurants in Boston, etc.) or related listings (for specific Boston restaurants).
  • The information retrieval software 908 may be programmed to execute the search for the user's query in the special area of memory that was established for storage of the special common term lists 836 at the step 48 of Figure 42.
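The two retrieval paths above, a lookup in the pre-built common term area versus a conventional intersection of per-term lists, can be contrasted in a short sketch. The function name, the use of a `frozenset` of terms as the lookup key, and the representation of term lists as lists of document identifiers are assumptions.

```python
def search(query_terms, term_lists, common_results):
    """Return document ids matching every query term.

    `common_results` stands in for the special area holding pre-computed
    result sets for common term combinations.
    """
    key = frozenset(query_terms)
    if key in common_results:
        return common_results[key]  # fast path: pre-computed result set
    # Conventional path: intersect the term list of each query term.
    doc_sets = [set(term_lists.get(term, ())) for term in query_terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []
```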
  • Data updates included in the database come from three different sources in this particular embodiment.
  • One source is on-line updates, as provided by users making updates or entering new information for business listing via network connections through the BackOffice component as through the Front End Server.
  • a second source of data updates is based on foreign source updates.
  • Foreign source updates are those update records which come from a different data source than the original existing database.
  • A third type of data integration or update source is referred to as a native source update.
  • a native source update is when an updated version of the existing database having the same source as the existing database is provided.
  • a database copy may be provided as an update on a monthly basis using full sets of data where a data provider provides an updated version of the same data set.
  • The native source data integration procedure integrates those changes in the new data set into the existing database. This is in contrast to a foreign source update, for example, where the existing database is provided by one vendor, and the update records, for example, are provided by a different vendor.
  • Updates in which the update vendor is a foreign source are called foreign source data integration or updates.
  • The native source update records are provided using full sets of data.
  • The existing database is a complete database.
  • The native source updates are provided in the form of a complete database as opposed to only providing update records.
  • The foreign source update records are generally records obtained from a source different from the working database and are merged into the existing database.
  • Shown in Figure 45 is a native source update database 1500 which is integrated into the unfiltered database 1504. Generally, this is done by performing comparisons of the records of the native source update database 1500 and the unfiltered database records 1504 in determining the various types of operations that need to be performed to integrate the changes from the native source update into the unfiltered database.
  • the unfiltered database 1504 is a complete version equivalent to the working database.
  • the records included in the unfiltered database 1504 generally include raw data which has not had the benefit of the data enhancement techniques as applied to the working database records 1508.
  • The on-line update records 1506 and the foreign source update records 1510 are integrated directly into the working database copy 1508. It should generally be noted that the foreign source update records 1510 are integrated or merged into the working database records 1508 by applying data merging techniques that will be described in more detail in paragraphs that follow.
  • The denormalized data, as included in the BackOffice component and the Front End Server, include, in this particular embodiment, three tables or components of data.
  • the three components of data include a category file, a fact file, and a business listing file.
  • The business listing file has been previously described in conjunction with the architecture in other sections of this description.
  • The fact file includes information additionally provided by various advertisers or business services.
  • The third file is a category file, which may include a category identifier and a corresponding heading.
  • The category identifier is a numeric quantity or other identifier that may be used in performing queries.
  • The heading is a textual description of the various category identifiers, which may be used, for example, for performing data queries.
  • In the various data integration and updates, as will be described in paragraphs that follow, the business listing file is generally what is updated when considering the techniques which will be described.
  • the category file is also updated as part of the native source update, as will also be described in paragraphs that follow.
  • A business listing is the atomic unit of granularity by which updates are performed. Any information and data such as a phone number, name and address associated with a particular business entity is considered to be part of one logical piece of information or record. Thus, in the descriptions that follow, updates are made with regard to the information associated with one particular business listing or entity.
  • The techniques which will be described regarding the foreign source update generally assume that an existing database and update records are provided, and that each originate from different or foreign sources. It should generally be noted that since the sources are different, there is no general assumption made as to particular data fields or the structure of the foreign records as compared to the existing database. It is first determined whether there is a matching entry in the existing database for an entry in the updated version of the database. If no match is found in the existing database for an entry or business listing which appears in the updated version of the database, this new entry is added and integrated into the existing database.
  • The techniques which will be described in paragraphs that follow may be adaptable, as known to those skilled in the art, to update situations in which an implementation uses something other than two complete sets of data when performing a system update.
  • This process of foreign source update is performed in the BackOffice component 818 in which the existing database to be updated is generally in normalized form.
  • the updated version of the database may be in normalized or denormalized form.
  • additional processing steps as known to those skilled in the art, may be needed to retrieve and update the actual files that include the data, for example, associated with a particular business entity or record.
  • Each business listing generally includes the following data items: business name, zip code, and at least one of a primary phone number or toll-free phone number.
  • the foreign source integration technique is based on the premise that a phone number and zip code of a business are sufficiently unique to significantly reduce the matching problem to comparisons of a few listings.
  • A determination is being made as to whether entries in the update and the existing database match, to further determine if update records are to be added, or if existing database records are to be deleted or modified.
  • The matching technique described for foreign source update determines a correspondence between the foreign source update records 1510 and the records in the existing working database 1508.
  • The matching technique generally includes: 1) determining which records in the existing working database match which update records; 2) if more than one record in the existing database corresponds to the same update record, determining which record in the existing database is the closest match for the update record; and 3) if the foreign source update records include duplicate records such that multiple update records correspond to the same set of one or more existing database records, collapsing the duplicate foreign source update records into a single update record that is matched to a single record in the existing database.
  • An update to an existing record is performed so as not to lose any existing information while also incorporating the new additional information or updated information.
  • an existing listing includes a business name and address, and phone number, but no e-mail address.
  • a foreign source update record includes a business name and address, e-mail address, and phone number.
  • The information from the foreign source update record is included in the existing database in union with the fields that are blank in the update record such that the e-mail address in the existing database is not removed when the updated information from the update record is applied. It should be noted that in this embodiment, no delete operations are performed with the foreign source update data integration due to the nature of combining data originating from different sources. However, other embodiments may include delete operations in addition to update and modify operations in foreign source data integration.
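The non-destructive merge described above can be sketched as a field-by-field union in which a blank field in the update record never overwrites an existing value. The function name and the use of dictionaries for records are assumptions; consistent with this embodiment, no field is ever deleted.

```python
def merge_listing(existing, update):
    """Merge an update record into an existing listing without losing data.

    Non-blank update fields replace or add to the existing fields;
    blank update fields leave the existing values untouched.
    """
    merged = dict(existing)
    for field, value in update.items():
        if value not in (None, ""):
            merged[field] = value
    return merged
```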
  • A comparison is made between the phone number of an update record and the phone number field of each entry in the existing database.
  • a determination is made as to whether or not the record in the latest version of the database copy is an 800 phone number. If a determination is made at step 1000 that the phone number of the cunent update entry is not an 800 number, control proceeds to step 1008.
  • The procedure "match phone number" is performed to produce a subset of one or more entries of the existing database which match the existing phone number. Control proceeds to step 1010 where the procedure "name match" is performed. Generally, "name match" will be described in paragraphs that follow to determine whether there is a business name match for a particular entry.
  • Control proceeds to step 1012 where "derive score" is performed based on the zip code and the name match score. Generally, the result of step 1012 produces a score representing a statistic relative to determining whether two entries in a particular database and an updated version of the database match.
  • From step 1012, control proceeds to step 1020 of Figure 47 where a determination is made as to whether or not the derived score is greater than 50%. If the derived score is greater than 50%, control proceeds to step 1034 where a determination is made whether there is only one matching entry in the database for an update record. If a determination is made at step 1034 that there is only one matching entry in the database, control proceeds to step 1042, where a determination is made that a match has been found.
  • If at step 1034 there is more than one matching entry in the database for a record in the current updated version of the database, control proceeds to step 1036, where a determination is made whether there is only one entry with a maximum score. If there is only one entry with a maximum score, control proceeds to step 1046, where this maximum scoring entry in the existing database is determined to be the matching entry for the updated version. If at step 1036 there are multiple entries with the same maximum score, control proceeds to step 1038 where additional processing is required to determine which is the matching entry, if any.
  • The score threshold of 50% may be tuned and varied for each particular implementation and embodiment.
  • This value is generally a configurable threshold value that may be defined heuristically, for example, by examining data samples.
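The score-based decision in steps 1020 through 1046 can be sketched as below. The sketch assumes each candidate entry already carries a derived score between 0 and 1; how that score is computed from the zip code and name match is not reproduced here. The function name and the returned labels are hypothetical.

```python
def pick_match(scored_entries, threshold=0.5):
    """Choose the matching entry from (entry_id, score) candidates.

    Returns ("match", entry_id) for a single best entry above the
    threshold, ("review", [ids]) when several entries tie for the
    maximum score, and ("fallback", None) when no score exceeds the
    threshold and the name-length and edit-distance checks would apply.
    """
    above = [(eid, score) for eid, score in scored_entries if score > threshold]
    if not above:
        return ("fallback", None)
    best = max(score for _, score in above)
    top = [eid for eid, score in above if score == best]
    if len(top) == 1:
        return ("match", top[0])
    return ("review", top)
```

Keeping `threshold` as a parameter reflects the note that the 50% value is configurable per implementation.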
  • The processing of step 1038 is generally performed off-line. It may be done manually or in an automated fashion in accordance with the types of data in the existing database. For example, at step 1038, having multiple entries with the same maximum score may indicate that there is an error or corruption in data. For example, in one embodiment, an alternate technique is used where if any record has the same zip code, that record is considered as being a matching record.
  • step 1020 determines whether or not the score is less than or equal to 50%. If at step 1020 a determination is made that the score is less than or equal to 50%, control proceeds to step 1022.
  • At step 1022, a determination is made as to whether or not the difference in the name length is less than or equal to three. If the difference in the name length field is not less than or equal to three, control proceeds to step 1028 where a determination is made that no matching entry exists in the database. It should generally be noted that the decision process and the comparison process performed in steps 1020 and 1022 are performed for each matching entry in the subset as produced from step 1008. It should generally be noted that the threshold length of three for the name length used in step 1022 may be varied and tuned for each particular embodiment and implementation.
  • At step 1022, if a determination is made that there is at least one entry in the existing database with a name length difference less than or equal to three, control proceeds to step 1024, where the name edit distance heuristic may be used to compute the name distance.
  • the name edit distance is the minium number of insertions, deletions, and substitutions at the character lev el to turn one name entry or st ⁇ ng into a second name entry or string
  • the number of states that string A must pass through to be transformed into St ⁇ ng B is an entry or quantity refened to herein as the name edit distance.
  • the textbook entitled “Text Algorithms”, by Maxime Crochemore and Wojciech Rytter generally descnbe a technique for the name edit distance heuristic.
  • the name edit distance is computed, for example, using dynamic programming techniques known to those skilled in the ait, such as using a finite state machine, for each matching entry as in the subset produced by step 1008.
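  • As an illustration of the name edit distance defined above, the following is a minimal sketch of the standard dynamic programming (Levenshtein) computation. It is not taken from this description: the function name `name_edit_distance` is hypothetical, and the finite state machine formulation mentioned above is not shown.

```python
def name_edit_distance(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        prev = curr
    return prev[-1]
```

  • In such a sketch, the distance would be computed between the name of the update record and the name of each entry in the subset produced by step 1008.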
  • Control proceeds to step 1100 of Figure 52 where a determination is made as to whether or not there is only one matching entry in the subset as derived from step 1008.
  • At step 1112, a determination is made that a matching entry has been found. If at step 1100 a determination is made that there is more than one matching entry in the existing database for a foreign source update record, control proceeds to step 1102, where a determination is made as to whether or not there is only one matching entry with a minimum distance. If a determination is made that there is only one matching entry with a minimum distance, control proceeds to step 1108 where it is determined that the entry in the existing database with the minimum distance is considered a match to the update record in the foreign source update.
  • If at step 1102 a determination is made that there is more than one matching entry with a minimum distance, control proceeds to step 1104 where additional processing may be required in accordance with the types of data included in the database.
  • The additional processing required is generally the same type of processing that may be performed in accordance with the previously described step 1038 of Figure 47.
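  • The selection logic of steps 1100 through 1108 can be sketched as follows, assuming each candidate in the subset has already been paired with its name edit distance. The function name, the result labels ("match", "needs_review", "no_match"), and the handling of an empty subset are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple


def pick_by_min_distance(
    candidates: Sequence[Tuple[str, int]],
) -> Tuple[str, Optional[str]]:
    """candidates holds (entry_id, name_edit_distance) pairs.

    Returns a (status, entry_id) pair; entry_id is None unless a single
    entry was selected.
    """
    if not candidates:
        # Assumption: an empty subset is treated as "no match".
        return ("no_match", None)
    if len(candidates) == 1:
        # Only one matching entry in the subset (step 1100).
        return ("match", candidates[0][0])
    best = min(distance for _, distance in candidates)
    winners = [eid for eid, distance in candidates if distance == best]
    if len(winners) == 1:
        # Exactly one entry has the minimum distance (step 1108).
        return ("match", winners[0])
    # Several entries tie at the minimum distance (step 1104).
    return ("needs_review", None)
```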
  • If at step 1000 a determination is made that the phone number of the updated record is an 800 phone number, control proceeds to step 1002 where a determination is made as to whether or not the phone number, including the area code, and the zip code match one or more entries in the existing database.
  • At step 1002, if there is a determination that one or more entries in the existing database match the phone number and zip code of the update record, control proceeds to step 1006 where a subset of one or more matching entries is found. Control then proceeds to point B indicated at step 1010 in Figure 46 where execution continues.
  • If a determination is made at step 1002 that the phone number and zip code do not match any entries in the existing database, a determination is made at step 1004 that no match exists in the database for the current update record.
  • Referring now to FIG. 48, shown is a flow chart of an embodiment for the "match phone number" routine as performed at step 1008.
  • A table is used with old and new area codes and exchanges to determine if there are one or more matching entries in the existing database which match the phone number of the current update entry.
  • The processing of step 1050 and the decision made at step 1052 may be used, for example, where area codes have changed due to an increased volume of phone numbers which requires additional area codes to be added for a particular locality.
  • The 508 area code may be expanded to include the 781 area code.
  • An existing phone number may be included in the database with either the 781 or the 508 area code depending on the age of the data in the database. If a determination is made at step 1052 that either an old area code and exchange, or a new area code and exchange match, control proceeds to step 1054 where a subset of one or more matching entries is formed. Control proceeds to step 1056 where control returns to the calling procedure. In this instance, control returns to step 1008 where subsequent control proceeds to step 1010 of Figure 46.
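  • A minimal sketch of the old/new area code lookup of steps 1050 through 1054 follows. The table contents, the exchange "555", the ten-digit string representation of phone numbers, and the function names are all hypothetical; only the 508/781 pairing comes from the example above.

```python
# Hypothetical table of ((old_area, exchange), (new_area, exchange)) pairs.
AREA_CODE_CHANGES = [
    (("508", "555"), ("781", "555")),
]


def phone_variants(phone: str) -> set:
    """Return the phone number plus its old or new area code equivalent."""
    area, exchange, line = phone[:3], phone[3:6], phone[6:]
    variants = {phone}
    for old, new in AREA_CODE_CHANGES:
        if (area, exchange) == old:
            variants.add(new[0] + new[1] + line)
        elif (area, exchange) == new:
            variants.add(old[0] + old[1] + line)
    return variants


def match_phone_number(update_phone: str, existing: list) -> list:
    """Form the subset of existing entries whose phone number matches
    under either the old or the new area code and exchange."""
    wanted = phone_variants(update_phone)
    return [record for record in existing if record["phone"] in wanted]
```

  • An empty result corresponds to the case where neither the old nor the new area code and exchange match, in which case processing would continue with the secondary search.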
  • If at step 1052 a determination is made that there is no old or new area code and exchange in the existing database which match the current entry in the updated version of the database, control proceeds to node C of the "secondary search" in Figure 51 at step 1086.
  • The processing which occurs in the steps of Figure 51 attempts to find semantic equivalents of the name fields indicating a possible match.
  • the name of the update record is tokenized.
  • stop words are removed from the name field.
  • Stop words may be words which may be ignored when doing a name comparison. For example, in this particular embodiment, stop words include the words "and" and "or".
  • a search of the existing database is performed on the conjunction of the tokenized name field components and the zip code. Generally, the search is being performed for entries in the existing database which match zip code and the different components of the name field.
  • A determination is made as to whether or not there are more than five matching entries in the existing database for the current update record. If at step 1092 a determination is made that there are more than five matching entries in the existing database, control proceeds to step 1094 where a determination is made that no match has been found. If at step 1092 a determination is made that there are not more than five matching entries, control proceeds to point B in the processing which is shown in Figure 46, step 1010, where these name matching entries are used as the subset upon which subsequent processing is performed.
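  • The secondary search of Figure 51 can be sketched as follows. This is an illustrative reading only: the record layout (dictionaries with "name" and "zip" keys), the lowercasing of tokens, and the function names are assumptions, and the stop word list contains just the two words given above.

```python
STOP_WORDS = {"and", "or"}
MAX_CANDIDATES = 5  # threshold tested at step 1092


def tokenize_name(name: str) -> list:
    """Split a name into tokens and drop stop words."""
    return [tok for tok in name.lower().split() if tok not in STOP_WORDS]


def secondary_search(update: dict, existing: list):
    """Search on the conjunction of the name tokens and the zip code.

    Returns None when more than MAX_CANDIDATES entries match (no match
    found); otherwise returns the list of matching entries.
    """
    tokens = set(tokenize_name(update["name"]))
    hits = [
        record for record in existing
        if record["zip"] == update["zip"]
        and tokens <= set(tokenize_name(record["name"]))
    ]
    if len(hits) > MAX_CANDIDATES:
        return None
    return hits
```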
  • Referring now to FIG. 49, shown is a flow chart of the steps of one embodiment for performing a "name match" as part of the routine processing invoked from step 1010 of Figure 46.
  • The steps of Figure 49 attempt to find semantic equivalents of the names of a business in this particular instance.
  • The name entries are canonized.
  • canonization rules are a set of transformations which occur, for example, transforming abbreviations and the like to semantic equivalents allowing for a common denominator of terms to be searched for.
  • Control proceeds to step 1062 where the name field is tokenized into components.
  • At step 1064, a setwise contents comparison of the name components of each entry is determined against the current update entry.
  • At step 1066, a score is computed for each name comparison of the existing database entry with a record of the updated version of the database. The score is computed as one point per matching component.
  • At step 1068, control returns to step 1010 where subsequent processing resumes with step 1012.
  • The processing steps of Figure 49 attempt to formulate a numeric quantity or metric for determining whether two name entries match. This weighted value or concatenation is used in further comparison in combination with other fields, such as the zip code, in arriving at a final quantity for determining whether or not the name fields of an existing database entry and an update record match.
  • Referring now to FIG. 50, shown is a flow chart of the steps of one embodiment for performing the routine "derive score", as performed from step 1012 of Figure 46.
  • Derive score attempts to produce a normalized metric or score based on the name field and the zip code.
  • the score previously derived from name match for each entry is updated by one if the zip codes of an existing database entry match an updated entry.
  • This score is normalized by taking the score computed thus far and dividing it by the number of tokens in the foreign source entry name field. It should be noted that other techniques may be used to produce a normalized score as in step 1082.
  • control returns to the point of call. In this particular instance, control returns to step 1012 where processing resumes with step 1020 of Figure 47.
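  • The "name match" and "derive score" routines described above can be sketched together as follows. The canonization rules shown, the record layout, and the function names are hypothetical; only the scoring rule (one point per matching name component, one more point for a matching zip code, divided by the number of tokens in the foreign source name) is taken from the description.

```python
# Hypothetical canonization rules mapping abbreviations to one form.
CANONICAL = {
    "corp": "corporation",
    "corp.": "corporation",
    "inc": "incorporated",
    "inc.": "incorporated",
}


def canonize(name: str) -> list:
    """Lowercase, tokenize, and replace known abbreviations."""
    return [CANONICAL.get(tok, tok) for tok in name.lower().split()]


def name_match_score(update_name: str, existing_name: str) -> int:
    """Setwise comparison: one point per matching name component."""
    return len(set(canonize(update_name)) & set(canonize(existing_name)))


def derive_score(update: dict, existing: dict) -> float:
    """Add one point for a zip code match, then normalize by the number
    of tokens in the foreign source (update) name field."""
    score = name_match_score(update["name"], existing["name"])
    if update["zip"] == existing["zip"]:
        score += 1
    n_tokens = len(canonize(update["name"]))
    return score / n_tokens if n_tokens else 0.0
```

  • Note that, read literally, this normalization can yield a value above 1.0 when every name token and the zip code match; the resulting score is what would then be compared against the 50% test of step 1020.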
  • The update techniques for native source assume that two full sets of data are used: the updated database version, and an unfiltered or raw version 1504 of the existing working database.
  • the techniques that are described below with regard to native source processing are data enhancement techniques applied to the unfiltered database 1504 to produce the working database 1508 of Figure 45.
  • At step 1400, the computation of the data update is performed using two complete sets of data from native sources.
  • the latest set of data received such as from a data provider is submitted into the database and compared against the set that is in the existing database. All of the records in the data set are loaded in the following form.
  • For comparison purposes in the steps that follow, each record is loaded as a distinct record ID followed by a string, where the string is all the fields from the record concatenated together.
  • Record IDs are unique within the set and indexed.
  • the delta or difference between the two data sets is produced.
  • Each entry in this delta or difference is classified as an insert, delete, or update operation.
  • A record is inserted into the existing database if its identifier is in the new version of the data set but not in the existing database. All records which have identifiers in the existing database, but not in the new version, are slated for deletion from the existing database. Records whose identifiers are in both sets, but whose associated strings differ, are considered update records having data contents in the string that are updated for the corresponding identifiers.
  • the update records which include inserts and update transactions are applied to the existing database.
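  • The delta computation described above can be sketched as follows, assuming each data set has been loaded as a mapping from record ID to the concatenated field string. The "|" separator and the function names are assumptions made for illustration.

```python
def record_string(fields) -> str:
    """Concatenate all fields of a record into one comparison string.
    The "|" separator is an assumption; the description only says the
    fields are concatenated."""
    return "|".join(str(field) for field in fields)


def compute_delta(existing: dict, updated: dict):
    """Classify differences between two {record_id: string} data sets.

    Returns (inserts, deletes, updates):
      inserts - IDs only in the updated set, with their strings
      deletes - IDs only in the existing set
      updates - IDs in both sets whose strings differ, with new strings
    """
    inserts = {rid: s for rid, s in updated.items() if rid not in existing}
    deletes = [rid for rid in existing if rid not in updated]
    updates = {
        rid: s for rid, s in updated.items()
        if rid in existing and existing[rid] != s
    }
    return inserts, deletes, updates
```

  • Records whose identifier and string both match fall into none of the three groups and require no further processing.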
  • Certain data post-processing is performed as will be described further in the paragraphs that follow.
  • Figures 46-54 generally describe data integration of the native source updates which are applied to the database of business listings and categories. In summary, for both business listings and categories, comparisons are made between records of the native source unfiltered database and the native source update.
  • At step 1406, a comparison is made between the existing database copy and the updated database copy by comparing the record identifiers and the string concatenation which represents the remainder of the records.
  • Each update record is classified as one of a matching entry, an insertion, a deletion, or an update with respect to the existing database.
  • At step 1416, a record is determined to be matching if the record identifier and string field in the existing and updated database copies match.
  • a record has been classified as one to be inserted if there is a record with a record identifier in the update database which is not in the existing database.
  • Data enhancements are performed and the record is integrated into the working database. It should be noted that the data enhancements performed in step 1428 are described in more detail in paragraphs that follow.
  • a record has been classified as one to be deleted from the existing database if there is a record with the record identifier in the existing database not in the updated database.
  • the data operation is performed integrating the data updates into the existing working database.
  • a record is considered an update transaction to an existing record in the existing database if the record identifiers match, but the remainder of the record represented as a stnng does not match.
  • the longitude and latitude of a record may be updated if the address has been modified.
  • data enhancements may be performed to the record, and the data update is applied to the existing working database as well as the unfiltered database. In the case of step 1416 where matching entries are found, no further processing may be required for existing database or the updated database record.
  • update records or transactions are generated to modify the existing database.
  • any of the foregoing operations which are modifications, including updates and deletions, to the existing working database records may be conditionally performed in an embodiment of the invention.
  • a protection or locking technique may be included in the database, for example, which prevents a deletion or modification of a particular business listing included in the database regardless of the processing classifications of Figure 54.
  • the data enhancements, as performed at steps 1418 and 1428, are generally data filtering steps prior to integrating the data update into the working database 1508.
  • The data filtering techniques generally facilitate matching corresponding records when performing updates.
  • Data enhancements may include, for example, upper/lower case justification, detection of synonyms and/or acronyms, transformation of abbreviations as may be used in business names (e.g., corp., inc.), street addresses (e.g., St., Pl.), and city and state names.
  • Referring now to FIG. 55, shown is an embodiment of a method for performing the update computation of step 1400 as applied to the category file.
  • The category file in one embodiment includes a category identifier and a corresponding header that is a text description of the associated category identifier.
  • These updates are applied in a model similar to that of the business listing files for native source updates. The updates are first applied to a "raw" or unfiltered version of the category file, followed by data enhancements as appropriate, and then integration of the data updates into a working copy of the category file included in the working database 1508.
  • each update record is classified as one of several types of transactions.
  • A record in the updated category file is considered matching if the record identifier and the associated header match an entry in the current category file.
  • A record is inserted into the existing unfiltered database and working database if the record identifier is not in the existing unfiltered database copy of the categories.
  • Data enhancements may be performed and the resulting filtered data further integrated into the existing category file in the working database 1508. The data enhancements, as included in steps 1468 and 1476, are described in more detail in paragraphs that follow.
  • A record in the existing category file is deleted if the record identifier of an existing record is not in the updated version.
  • this deletion operation may be performed to the working copy of categories included in the working database 1508.
  • An update record is used to update the database copies if the record identifiers of an existing record and an update record match, but the heading names differ.
  • Data enhancements are performed and the update operation is integrated into the working copy of the categories included in the working database 1508.
  • the data enhancements, as performed at steps 1468 and 1476, upon the category listings may include processing of the headings.
  • The processing to enhance the text of the headings may include text transformations such as: upper/lower case justification, consolidation of abbreviations, and removal of idiosyncratic and slang terminology.
  • The function of these data enhancements is to generally filter the data to provide more accurate determination of matching or corresponding categories.
  • At step 1440, new categories may be added.
  • A data vendor may not provide an integrated version of all business categories. It may be possible to enhance some record categories as additional data is added. For example, a restaurant may be a particular type of category and there may be other subdata organized in the structure of the record indicating that there is a particular type of restaurant in accordance with the various ethnic cuisines, such as French or Italian.
  • Post-processing as in step 1440 may be written to search the data file in accordance with a recognized structural format and add additional categories in accordance with any categories and subcategories. For example, if a determination is made that there is a large number of restaurants with a subcategory of French, a new record category may be added which is "French restaurant". Similarly, an Italian restaurant category may be added. This is generally performed in accordance with the data organization and categories of the particular data being examined in each implementation.
  • At step 1442, redundant categories as stored by business are detected and collapsed by removing the equivalent categories.
  • Semantically equivalent categories are determined. Generally, this includes locating equivalent categories for which the spelling might be slightly different, or those fields which may be subsets or equivalents of other fields. For example, "animal doctor" may be interpreted as a semantic equivalent for "vet" or "veterinarian".
  • this step may be done in an automated fashion using any programming language which is commercially available and may be used with the existing database.
  • The technique involves dropping or not including special non-alphanumeric characters or other words, similar to the stop words. White space may be compressed and comparison may be done in a case-insensitive manner.
  • the comparison may further be done by requiring an exact character match or with some at-a-distance technique similar to those previously described with other data processing.
  • The duplicate categories and records may be removed from the existing version as stored in the working database 1508.
  • At step 1442, where redundant categories are collapsed by detecting and removing equivalent categories, different rules may be used to decide which category of several identified duplicates is the one to keep. For example, the category kept may be the one with the longest name, the shortest name, or simply the first name.
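  • A minimal sketch of the exact-match variant of this collapsing step follows. Headings are compared after dropping non-alphanumeric characters, compressing white space, and ignoring case; the function names and the `keep` parameter are hypothetical, and stop word removal and distance-based comparison are not shown.

```python
import re


def normalize_heading(heading: str) -> str:
    """Lowercase, drop non-alphanumeric characters, compress white space."""
    cleaned = re.sub(r"[^0-9a-z\s]", "", heading.lower())
    return " ".join(cleaned.split())


def collapse_equivalent(headings: list, keep: str = "first") -> list:
    """Group headings that normalize to the same text and keep one
    heading per group according to the chosen rule."""
    groups = {}
    for heading in headings:
        groups.setdefault(normalize_heading(heading), []).append(heading)
    rules = {
        "first": lambda group: group[0],
        "longest": lambda group: max(group, key=len),
        "shortest": lambda group: min(group, key=len),
    }
    choose = rules[keep]
    return [choose(group) for group in groups.values()]
```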
  • Referring now to Figure 57, shown is a flowchart of one embodiment of a method of more detailed processing steps of step 1442 for collapsing redundant categories.
  • Duplicate categories are determined. A technique for determining duplicate categories is described in paragraphs that follow in conjunction with Figure 58.
  • Duplicate categories in the unfiltered database may be examined as a group and one of the category names or headings is chosen to be the heading included in the collapsed category record.
  • One technique for choosing the heading is by determining which category name is most frequently used, such as by examining the business listing files for frequency determination.
  • The business listing files, as included in the unfiltered database, may be patched with the new heading and identifier corresponding to the collapsed resulting record.
  • The category file is also updated to reflect the collapsed entry. It should be noted that these changes are made to the existing working database.
  • A first category name in the category file of the unfiltered database is tokenized. In other words, each word included in the heading or category name is associated with a token.
  • At step 1504, the next record of a category is examined and also tokenized.
  • At step 1506, a comparison of the two tokenized names is performed to derive a score in accordance with the number of matching name components. This may also be normalized, as described in accordance with the foreign source update processing techniques.
  • A determination is made as to whether or not the score is greater than a predetermined threshold. In this instance, the threshold is 75%. If the score is greater than the threshold, control proceeds to step 1512 where the categories are tagged as duplicates, propagating any previous matching identifier tag. In other words, a transitive matching technique is used in marking matching categories. For example, suppose ID1 = ID2. If it is then determined that ID2 = ID5, ID5 is also marked as having ID1 as a matching identifier. Similarly, subsequent matches to ID5 further propagate the value ID1. Subsequently, control proceeds to step 1510 for advancement to the next record. If it is determined at step 1508 that the score is not greater than the threshold, no match is found and control proceeds to step 1510 where the next category is advanced to. At step 1514, a determination is made as to whether all the categories have been processed in the category file. If they have, control proceeds to step 1516 where processing stops. Otherwise, control proceeds to step 1504 for further comparisons and determinations of equivalent categories.
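  • The transitive tagging described above can be sketched as follows. The exact loop structure of Figure 58 is not recoverable from this text, so the sketch assumes each category is compared against all earlier categories, and it assumes the score is normalized by the larger of the two token counts. The function names and the sample headings are hypothetical.

```python
THRESHOLD = 0.75  # the 75% threshold of step 1508


def token_score(a: str, b: str) -> float:
    """Matching tokens divided by the larger token count (assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / max(len(tokens_a), len(tokens_b))


def tag_duplicates(categories: list) -> dict:
    """categories is a list of (category_id, heading) pairs.

    Returns {category_id: matching_id}, where matching_id is the tag
    propagated from the earliest category in the chain of matches.
    """
    tag = {}
    for index, (cid, heading) in enumerate(categories):
        tag[cid] = cid  # a category initially matches only itself
        for prev_id, prev_heading in categories[:index]:
            if token_score(heading, prev_heading) > THRESHOLD:
                tag[cid] = tag[prev_id]  # propagate the earlier tag
                break
    return tag
```

  • In the test data for this sketch, category 5 does not exceed the threshold against category 1 directly, yet it still receives the tag 1 through its match with category 2. A single pass of this kind does not merge two groups that are only later bridged by a new category; handling that case would need an extra step such as a union-find structure.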
  • The foregoing data integration techniques may be tuned or varied for each particular embodiment in accordance with, for example, the data type and record lengths.
  • Adaptive tuning of values used in making determinations may be automated, for example, by adjusting thresholds in accordance with actual data values to filter out extreme data values.
  • The category table or file may be used by the query engine when processing a data query.
  • The category file may be used to identify valid categories specified in a user query. It may also be used to categorize information displayed to a user. In other words, a resulting data set may be partitioned in accordance with the categories as included in business listings for the resulting query.
  • A resulting data set includes 10 listings.
  • These listings may be categorized or grouped in accordance with whether or not particular categories are associated with each listing.
  • The information displayed to the user for these 10 listings may be 5 listings included in category A, and 5 listings included in category B.
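  • The partitioning of a result set by category can be sketched as follows. The listing layout (a "name" and a list of "categories") and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_category(listings: list) -> dict:
    """Partition a query result set by the categories of each listing.
    A listing associated with several categories appears in each group."""
    groups = defaultdict(list)
    for listing in listings:
        for category in listing["categories"]:
            groups[category].append(listing["name"])
    return dict(groups)
```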
  • the table is propagated as part of the update data to the Front End Server and, subsequently, further to the query engine.
  • An efficient data transfer technique is used to transfer data between databases, such as between the BackOffice component 818 and the Primary Database 812 of Figure 4.
  • The types of data that are transferred generally relate to advertisements such as those displayed to the user 800 of Figure 2.
  • advertisement data includes text data and non-text data.
  • The non-text data may be referred to as "blob" data which includes, for example, image and audio data, as well as machine-executable programs, JAVA bytecode, and the like.
  • The technique, which will be described in paragraphs that follow, generally uses different data channels depending on the type of data. For example, text data is transferred from the BackOffice component to the Front End Server 804 using a different data channel than blob data that is also transferred between the two components.
  • A sending component may be located within the BackOffice component 818 which includes software that decides the type of data, the channel used to transfer the data, and how to break up the data into portions which are transferred to a receiving component located in the Front End Server 804, such as the primary database 812. Located on the receiving component, as may be included in the Primary Database 812, is software which decides how to synchronize or assemble data received from the BackOffice component 818.
  • The advertisement data is generally data that is displayed in response to a user query.
  • The text data included in this data transfer may be characterized as structured data, as included in text which is displayed to the user.
  • The second type of data generally transferred is denoted as "blob" data, which is generally not able to be decomposed or operated upon in different portions.
  • Blob data may include a machine-executable program which is generally a binary data type.
  • The technique uses two separate data channels in which each channel transfers a different type of data.
  • One data channel is used to transfer the text data.
  • Database Link™ software, as included in the commercially available Oracle™ database, is used to facilitate database communication of text data. Therefore the database routines, such as those included in the Database Link software, may be used in transferring text data between databases.
  • The Oracle database does not support direct non-text manipulation, such as for transferring data of different types, such as blob data.
  • A second, different data channel is used to transfer the blob data from one database to another, in which the second channel is external to the database since the version of the Oracle database software used in this embodiment does not provide the needed support for direct non-text data manipulation.
  • The blob data, which may also generally be characterized as multi-media data, is transferred asynchronously from the text data between databases.
  • The blob data in this embodiment is copied from one database to another using a C++ program with calls to vendor-supplied library routines.
  • This is in contrast to the text data transfer which is done by a separate data channel, and the software used performs remote database copies as if they were local.
  • the text data transfer may be performed by calls to the Oracle procedures executed under the control of the Oracle database software.
  • the data channels used to transfer both the text and the blob or multi-media data may be network connections between the databases.
  • Other connections between the databases may also be possible, such as a dedicated hard line to facilitate database communication, as known to those skilled in the art.
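  • The decision made by the sending component, namely which channel carries which part of a record, can be sketched as follows. This is only an illustration of splitting one row by data type; it assumes blob fields are represented as Python `bytes` values and everything else is text-channel data, and the function name is hypothetical. The actual channels (the database link and the external process) are not modeled.

```python
def split_by_channel(row: dict):
    """Split one data row into the fields sent over the text channel
    and the fields sent over the blob (multi-media) channel."""
    text_fields, blob_fields = {}, {}
    for name, value in row.items():
        if isinstance(value, (bytes, bytearray)):
            blob_fields[name] = value  # binary data: external channel
        else:
            text_fields[name] = value  # structured data: text channel
    return text_fields, blob_fields
```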
  • Figure 59 is a block diagram of two tables in a preferred embodiment depicting one technique for storing the advertisement data.
  • The advertisement data and the relation between the different components of the advertisement data are described in two tables stored in the sending databases.
  • Table 1200 is a relational mapping table which generally describes the relation between the various data entities as included in a particular advertisement page.
  • The relational mapping data describes a parent/child relationship between various data entities of an advertisement page forming a tree-like structure.
  • The data table 1220 includes the actual data as described by the relational mapping table 1200.
  • The data included in the data table 1220 includes a variety of data types as may be displayed with regard to an advertisement.
  • The data included in table 1220 may be text data, machine executable code, or a JAVA program.
  • one restriction is that each row of the data table 1220
  • The data table 1220 generally includes multiple columns depending on how many data fields are required for a particular implementation.
  • A record identifier 1208 is used to uniquely identify a particular data entity in a table.
  • Data fields data-1 1210 through data-n 1214 each include one particular type of data entity as may be displayed to the user in response to a data query.
  • Referring now to FIG. 60, shown is a more detailed diagram of the tables as used in a data transfer on a sending and receiving side using this data transfer technique.
  • Shown in Figure 60 is an example of a relational mapping table 1200 which includes multiple advertisement pages.
  • One tree-like structure is used to represent one advertisement page.
  • two tree structures may be produced using the data described in the relational mapping table 1200.
  • The data transfer described is that of the advertisement page associated with the root node with the identifier 104, which includes identifiers 104, 105 and 106 in its tree-like structure.
  • Referring now to FIG. 61, shown is the tree-like structure described by the relational mapping table 1200 for the advertisement page with the root node identifier 104 shown in Figure 60.
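  • The way a relational mapping table yields one tree per advertisement page can be sketched as follows. The row format (record identifier paired with a parent identifier, with `None` marking a root) is an assumption, as are the function names and the exact shape of each tree; the description only states that the page rooted at 104 includes identifiers 104, 105 and 106, and the identifiers 102 and 103 in the sample data are invented for illustration.

```python
from collections import defaultdict


def build_trees(mapping_rows):
    """mapping_rows holds (record_id, parent_id) pairs.

    Returns (roots, children), where roots lists the root identifier of
    each advertisement page and children maps a parent to its children.
    """
    children = defaultdict(list)
    roots = []
    for record_id, parent_id in mapping_rows:
        if parent_id is None:
            roots.append(record_id)
        else:
            children[parent_id].append(record_id)
    return roots, children


def page_ids(root, children):
    """All identifiers in the page rooted at root, in depth-first order."""
    ordered, stack = [], [root]
    while stack:
        node = stack.pop()
        ordered.append(node)
        # Reverse so that children are visited in table order.
        stack.extend(reversed(children.get(node, [])))
    return ordered
```

  • With such a structure, the rows belonging to one advertisement page can be selected for transfer by walking the tree from its root identifier.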
  • Referring to FIG. 60, on the receiver side of the data transfer, shown are two tables, temporary table 1216 and ad page table 1218. In this particular embodiment, these two tables are created on the receiver side for each advertisement transferred from the sender.
  • The two tables of data on the receiver side depict the tables after the transfer of the ad page with the root node of identifier 101 and prior to the transfer of the data associated with the advertisement page with the root node of identifier 104.
  • the temporary table 1216 is filled with data during the data transfer and after the data is properly assembled on the receiver side, the temporary table 1216 is not used until the next data transfer operation.
  • The table ends in a state such that no data from the data transfer having just occurred is located in the table 1216.
  • Referring now to FIG. 62, shown is a block diagram of the data on the sender side and the receiver side as associated with the data table 1220 previously discussed in Figure 59.
  • In the example described in paragraphs that follow involving the data transfer of identifiers 104-106, each identifier is associated with only blob data. It should be noted that this general technique and the data included in the data table 1220 may additionally include text data associated with each identifier or row in the table.
  • An entry in the table 1220 may also include only text data.
  • The limitation is that only one field entry of blob data may be associated with each row in table 1220.
  • Three tables are associated with transferring data which is blob data from the data table 1220.
  • These three tables include a blob temporary table 1222, a blob table 1224, and a repository table 1226.
  • Any text data included in table 1220 on the sender side may be transferred using the data transfer channel. What is described in Figure 62 is that portion of the data included in the data table 1220 which is blob data. In this example, only blob data is included in the advertisement page with the root node 104 which will be described.
  • the blob temporary table 1222 is a temporary table used in the transfer of text information associated with blobs from the sending node to the receiving node.
  • the blob table 1224 in this particular embodiment, is an aggregate blob table which includes the blob data for multiple advertisement pages.
  • the snapshot of the data tables of Figure 62 shows that data associated with one advertisement page with the root node identifier 101.
  • the blob table 1224 will also include information to retrieve the blob data associated with identifiers 104 through 106. It should be noted that the contents of the blob table 1224 do not include the actual blob data itself.
  • the fields included in the blob table 1224 point to and further describe the actual blob data which is contained in the repository table 1226.
  • the blob table 1224 in this embodiment includes three fields per each entry associated with a blob data entity. It includes a sending record identifier 1228, a size 1230, and a pointer 1232 to the actual blob data.
  • The sending record identifier 1228 identifies a particular blob uniquely within a particular table or advertising page in this particular embodiment. Thus, each of the entries in the record identifier column 1228 may not be unique for all of the advertisement pages or data.
  • the purpose of the record identifier is to map or identify the particular blob pointer associated with a unique record identifier from the sending database.
  • The size 1230 indicates the size in bytes of the blob described by the blob pointer field 1232.
  • The size field may include other units to identify the size of the particular blob data.
  • The blob pointer field 1232 acts as an identifier or pointer into the repository 1226 to uniquely identify within the repository a particular piece of blob data. It should be noted that other embodiments or implementations may include additional fields in the blob table 1224 as well as in the repository 1226 in accordance with other pieces of data that may be required in order to enable the transfer to occur in a particular implementation.
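  • The relation between the blob table 1224 and the repository table 1226 can be sketched as follows. The in-memory representation (a list of entries and a dictionary keyed by pointer), the pointer allocation scheme, and the function names are assumptions; only the three fields per entry come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlobEntry:
    sending_record_id: int  # record identifier from the sending database
    size: int               # size of the blob in bytes
    pointer: int            # key of the blob within the repository


def store_blob(blob_table: List[BlobEntry], repository: Dict[int, bytes],
               record_id: int, data: bytes) -> BlobEntry:
    """Place the blob in the repository and describe it in the blob table."""
    pointer = len(repository)  # hypothetical allocation: next free slot
    repository[pointer] = data
    entry = BlobEntry(record_id, len(data), pointer)
    blob_table.append(entry)
    return entry


def fetch_blob(blob_table: List[BlobEntry], repository: Dict[int, bytes],
               record_id: int) -> bytes:
    """Follow the pointer of the first entry with this record identifier."""
    for entry in blob_table:
        if entry.sending_record_id == record_id:
            return repository[entry.pointer]
    raise KeyError(record_id)
```

  • Because record identifiers may repeat across advertisement pages, a fuller version would need to qualify the lookup, for example by page; this sketch simply returns the first entry found.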
  • FIGS. 62 through 66 show the block diagrams of an embodiment of transferring the data associated with an advertisement from the sending side to the receiving side.
  • Figure 63 depicts a snapshot of the tables associated with the text or Database Link transfer channel as included in the sending and receiving sides.
  • The data table 1200 on the sending side has no modifications from the previously described initial table as depicted in Figure 60. However, the tables on the receiving side have been modified from those previously described in Figure 60.
  • The temporary table 1216 serves as a temporary placeholder for the data involved in the data transfer of the particular ad page described beginning with root node identifier 104.
  • The data associated with a particular advertisement page is extracted from the relational mapping table 1200 and is temporarily copied to and stored in the temporary table 1216 on the receiving side.
  • Shown in Figure 64 are the tables associated with transferring the actual data from the sending side to the receiving side.
  • the data included in the data table 1220 is segregated into text data and non-text data.
  • the text data is transferred using the text channel.
  • the non-text, multimedia, or blob data is transferred using an external process which creates a second multimedia data transfer channel in order to send data from the sending side to the receiving side.
  • the id and the size fields are copied to the blob temporary table 1222.
  • a global id (Gid) is generated on the sending side prior to transmitting these fields to the receiving side.
  • This global id is transferred to the receiving side and included in each associated entry of the temporary table 1222. Generally, the Gid is a unique identifier associated with each record, uniquely identifying the record among all tables associated with database information.
  • the blob data from table 1202 and the associated information in table 1242 are transferred to an external process 1240 located on the sending side.
  • an Oracle™ pipe is the communication means used to transfer the data from the data table 1220 to the external process 1240.
  • the external process 1240 further transmits the data via a multimedia data channel to the receiving side.
  • Table 1242 may also be viewed as a temporary table which serves as a placeholder for the data which is transferred by the external process 1240 to the receiving side.
  • Located in temporary table 1242 are four pieces of information including a table name, a field name, an identifier, and a global identifier associated with each blob data entity.
  • the table name generally describes or identifies the particular table within which a piece of blob data is located or with which it is associated.
  • each table is associated with a particular advertisement or advertisement name.
  • the field name identifies the type of non-text data.
  • the field name is "Blob", referring to blob or multi-media data.
  • the identifier field (Id) of table 1242 is the unique record identifier copied from table 1220.
  • the global identifier (Gid) is a unique global identifier, identical to that which is produced on the sending side prior to sending the text data to the temporary blob table 1222. This information is passed or transferred to the external process 1240, which copies the actual blob data to the receiving side as well as the additional information described in temporary table 1222.
  • the external process 1240 is a
  • the external process may copy blob data from multiple tables in which the associated field name may differ with each table. Therefore, the field name may also be included in table 1242. The external process uses this field name to retrieve blob data to be copied. Other embodiments may communicate this field name using other mechanisms.
  • the external process 1240 uses the data included in the temporary table 1242 to fetch or access the blob data associated with a particular table name and field name to subsequently index into each particular table name using the identifier to extract the actual blob data.
  • This blob data is copied to the repository table 1226 on the receiving node by process 1240.
  • the repository table 1226 includes the blob data associated with advertisement identifier 104. This data is appended to already existing data in the repository 1226.
  • FIGS. 65 and 66 show the integration process of the tables of the text and the blob data for the advertisement page identified by the sending identifier 104. Referring now to Figure 65, shown is a block diagram of an embodiment of the tables resulting from the text data integration.
  • table 1200 on the sending side remains the same as in previously described figures.
  • table 1216 data has been integrated and copied into the table 1218.
  • the function of temporary table 1216 is generally to hold the text data associated with the relational mapping table which is transferred from the sending side to the receiving side until all of the data entities associated with the particular advertising page or table being transferred have arrived on the receiving side.
  • the software on the receiving side performs
  • Referring now to FIG. 66, shown is a block diagram of an embodiment of the data table 1220 whose contents have been transferred to the receiving side.
  • the assembling software on the receiver side integrates the data from temporary table 1222 into table 1224. Additionally, a link is established in table 1224 to the data in table 1226 and the associated global identifier removed. Each entry in table 1222 is copied into table 1224. In particular, the Id and Size fields are copied into table 1224 for identifiers 104, 105, and 106.
  • the integration software then uses the global Id obtained from temporary table 1222 to index into the repository 1226 in search of a matching global identifier entry.
  • the record identifier and table size are copied from the temporary blob table to the blob table.
  • the global identifier from the temporary blob table is used as an index into the repository table to find a matching global identifier. For this matching entry, as in step 1256, the repository identifier is copied from the repository table to the blob pointer field of the blob table.
  • the global identifier field of the repository table is reinitialized.
  • the end result of performing the steps described in Figure 67 is the set of tables displayed in Figure 66, representing the integrated or assembled blob table, in which the blob data is integrated into the repository table 1226 as further described by the blob table 1224.
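  • By way of illustration only, the blob integration steps just described might be sketched as follows. The class names, field names, and the in-memory lists standing in for tables 1222, 1224, and 1226 are hypothetical and do not appear in the specification; using `None` to represent a reinitialized global identifier is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TempBlobEntry:
    """One row of the temporary blob table (1222): id, size, transient Gid."""
    record_id: int
    size: int
    gid: str


@dataclass
class RepositoryEntry:
    """One row of the repository table (1226) holding the actual blob data."""
    repo_id: int
    data: bytes
    gid: Optional[str]  # transient global identifier; None once integrated


@dataclass
class BlobEntry:
    """One row of the blob table (1224): id, size, pointer into the repository."""
    record_id: int
    size: int
    pointer: int


def integrate_blobs(temp_rows: List[TempBlobEntry],
                    repository: List[RepositoryEntry]) -> List[BlobEntry]:
    """Build blob-table rows by matching each temporary row to the repository on Gid."""
    by_gid = {row.gid: row for row in repository if row.gid is not None}
    blob_table = []
    for temp in temp_rows:
        repo_row = by_gid[temp.gid]  # raises KeyError if the blob has not arrived yet
        # Copy the record identifier and size, and point at the repository entry.
        blob_table.append(BlobEntry(temp.record_id, temp.size, repo_row.repo_id))
        # Reinitialize the transient global identifier in the repository.
        repo_row.gid = None
    return blob_table
```

  • In this sketch a missing Gid surfaces as a `KeyError`, which corresponds to attempting integration before all portions of the transfer have arrived; an actual implementation would presumably defer integration until the transfer is complete.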
  • table 1200 and table 1218 are "mirror images" of each other.
  • the temporary table 1216 is used in performing the transfer as a temporary table until all of the data for this particular data transfer has arrived on the receiving side. At that point, the data is integrated from the temporary table into the final resulting table 1218, resulting in a table 1218 on the receiving side which mirrors that which is on the sending side in table
  • the resulting tables 1224 and 1226 are functionally equivalent to the data described on the sending side in table 1220.
  • one of the reasons for not further merging the data of tables 1224 and 1226 is that transferring blob data, including a copy of the blob data from table 1226 to be integrated into table 1224, requires the use of an external program in order to compress the tables further.
  • transfer is between two databases.
  • the techniques described may be adapted and used within other applications and a variety of environments.
  • the overall technique is generally to copy the text and blob or multi-media data asynchronously on two separate channels.
  • This data is copied from a first database to a second database.
  • the data is located on the second database in a temporary location until all of the portions of the data associated with a particular data transfer arrive at the second database.
  • the assembly process of copying the data from the temporary locations and merging the information into other data tables is performed on the second database.
  • the foregoing technique for data transfer may be used in a variety of applications, such as for the data transfer between databases.
  • this technique is included in a system for online Interactive Yellow Pages, GTE Superpages for the publication of multimedia advertisement content of GTE Superpages business customers.
  • the GTE Superpages system includes two major components: (1) the server component, which serves versatile user requests for the information of more than 11 million businesses in the United States, and (2) the BackOffice component, which facilitates advertisement content creation, management, and publication. Both these subsystems include databases where advertisement business information is persistently stored. The advertisement content produced or modified in the back office is published in the
  • the business advertisement includes an integrated set of structured textual information, such as business name, address, and multimedia or blob data, such as graphics, video, audio, job applets.
  • the data transfer technique described is generally a technique for transferring data using two data links between two databases. One of these data links is an internal data link with respect to the database; the second data link is an external data link with respect to the database.
  • the internal data link is optimized for the structured text data transfer while the external one is optimized for the multimedia data transfer, such as the transference of data stored in binary objects in the database.
  • This technique for data transfer generally alleviates the limitations of the existing database technology, which does not provide for the transferring of multimedia objects using the internal data link. Moreover, by using the two data links to transfer the various data types, performance and stability are improved over an alternative prior art approach which uses only the external link for transferring both text and multimedia or blob data.
  • the transfer technique includes four collaborative processes: a process on a sending component which decomposes data structures and the like into text and non-text components, assigning transient tags to the non-text components; two asynchronous transfer processes, one per data type, that transfer, respectively, text and non-text components to a receiving component; and a process on the receiving component that reassembles transferred data and replaces transient tags with persistent unique tags.
  • This technique uses a multimedia data repository table which is created and maintained in the receiving component, such as the receiving database in this embodiment.
  • the non-text or multimedia data items are stored in this repository with transient tags.
  • the reassembly process correlates the text tables with the multimedia objects and replaces the transient tags with persistent unique tags, thus leading to the reintegration of the transferred data.
  • the previously described technique includes features which provide for efficient decomposition and reassembly of data for efficient data transfer, as between two databases. Additionally, the multimedia repository serves as a vehicle for the reassembly of decomposed data items which are reassembled on a receiving component, such as a receiving database. [Incremental Update]
  • a description is provided of an incremental update procedure as performed upon the various databases included in the Front End Server component 804.
  • the data in the BackOffice component 818 may be updated, for example, on a daily basis.
  • These deltas or changes to this database in the BackOffice component are subsequently also applied to the copy of the database in the Front End Server component.
  • the number of transactions or updates to a database ranges from 30,000 to half a million on a daily basis in accordance with the required data updates for the existing database. However,
  • this update technique is used to provide data updates for both native and foreign sources, and on-line updates, as described in accordance with data processing techniques in other sections of this application.
  • data updates to the databases included in the Front End Server may first be integrated into the BackOffice component. Subsequently, these data modifications may be "pushed" to the Front End Server and integrated into the various data stores included therein, as will be described in more detail in following sections.
  • data updates may originate from several sources, including native and foreign source updates, and on-line data entry, such as through an Internet connection via a browser.
  • the native and foreign source updates may generally be characterized as larger updates or data integration efforts. These are generally described in other sections of this application.
  • the on-line data entry technique for updating information that may be included in the BackOffice component may be performed as previously described through the menus initially displayed to a user, such as at the GTE Superpages Internet site, that provide access to the BackOffice component data information.
  • the data integration techniques related to the foreign and native source updates, which integrate the data updates into the BackOffice component, are generally more detailed and involved than the integration of the on-line specified modifications.
  • the data updates may generally be a large number of data modifications requiring more computer resources than in the latter case.
  • the on-line modifications may be incorporated on a daily basis or at another predetermined time period using some data enhancement techniques as described in other sections of this application.
  • Other data updates may require additional time and computer resources and may not be able to be completed, for example, during non-peak usage, such as overnight on a daily basis.
  • additional planning and different processing techniques may be used with the various types and volumes of data updates included in each embodiment.
  • the data updates may be propagated to the Front End Server component.
  • the non-text or multimedia data, for example, as included in advertisements with image files, may be transferred to the Front End Server from the BackOffice using multimedia transfer techniques, as generally described in other sections of this description.
  • the updates to the Primary Database included in the Front End Server may be communicated as a table of commands created in the BackOffice component and transferred, as by a network connection, to the Front End Server.
  • the table created in the BackOffice includes an application-developed command language corresponding to the various types of record updates and modifications that may be included in this particular embodiment.
  • Each of these commands may be further translated in the Front End Server into one or more actual database commands that perform the table operation.
  • an entry in the table of database update commands may be specified as follows:
  • a Command field specifies the type of data command.
  • the Record # field identifies the records in the Primary Database to which this command applies.
  • the Optional Data includes data that may be related to the specified command. For example, if the command were update, the data field may specify the data which is to be included in the records specified. In the above example, the command is to delete records 1-5.
  • This single table command may be translated, for example, by software included in the Primary Database, into 5 database commands in accordance with the particular database software.
  • the software which builds the table in the BackOffice and translates the commands into one or more database commands may be developed using a commercially available software system that is capable of communicating with the underlying database to perform the required operations.
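  • As a minimal sketch only, the translation of a single table command into several database commands might look like the following. The table name `listings`, the column names, the parameter placeholder style, and the representation of Record # as an inclusive range are all assumptions made for illustration; the specification does not define the command language or the generated database commands.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Statement = Tuple[str, Dict[str, object]]  # (parameterized SQL text, bound values)


@dataclass
class TableCommand:
    """One entry of the command table: Command, Record #, Optional Data."""
    command: str                 # e.g. "delete" or "update"
    first_record: int            # inclusive start of the record range
    last_record: int             # inclusive end of the record range
    data: Optional[str] = None   # Optional Data, used by "update"


def translate(cmd: TableCommand, table: str = "listings") -> List[Statement]:
    """Expand one table command into one database statement per record."""
    statements: List[Statement] = []
    for record in range(cmd.first_record, cmd.last_record + 1):
        if cmd.command == "delete":
            statements.append((f"DELETE FROM {table} WHERE id = :id", {"id": record}))
        elif cmd.command == "update":
            statements.append(
                (f"UPDATE {table} SET payload = :data WHERE id = :id",
                 {"id": record, "data": cmd.data})
            )
        else:
            raise ValueError(f"unsupported command: {cmd.command!r}")
    return statements
```

  • Under these assumptions, `translate(TableCommand("delete", 1, 5))` yields five delete statements, one per record, mirroring the example of deleting records 1-5.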
  • each command may be sent as a separate message in other embodiments in accordance with the number of updates and other associated computer resources and costs for each data transaction.
  • This may vary with implementation. Referring to Figure 31, shown is an embodiment of a dependency graph for performing the various processes in an incremental update.
  • the BackOffice data transfer must complete prior to beginning the update to the database in the Front End Server component.
  • the BackOffice data transfer is complete when multimedia and text data have been transferred from the BackOffice component, such as data required when updating an advertisement page.
  • the operational table may include information about the updated normalized data, which has been applied to the BackOffice component, and which is now to be applied in this incremental update procedure to the Primary Database copy of the normalized data.
  • an initialization procedure may be executed to synchronize the beginning of the update procedure for the steps that will be described in paragraphs that follow.
  • steps 1604, 1606, and 1608 may be performed independently and at the same time as steps 1610 through 1620.
  • the coordinating point labeled DB Prep at step 1622 serves as the coordinating point for the different procedures performed in updating the database on the Primary Database, and the local copies of necessary files, such as the
  • at step 1604, the various advertisements are extracted from the data tables, such as those transferred from the BackOffice component in the multimedia and text data transfer.
  • the various advertisement pages are packaged and made into a complete advertisement page to be stored in the Constructed Ad Repository 842.
  • the constructed ads are transferred and included in the Constructed Ad Repository. It should be noted that in this embodiment the existing copy of the Constructed Ad Repository is updated in accordance with those particular ads which have changed. Thus, the Constructed Ad Repository is updated on a delta or change basis. Simultaneously, steps 1610 through 1620 may be performed in conjunction with steps 1604 through 1608.
  • Steps 1610 through 1620 indicate the process by which the various identifiers and other files associated with the Primary and Secondary databases are updated.
  • Steps 1604 through 1608 reflect the updating of the Constructed Ad Repository 842 on an as-needed basis in accordance with changes which have occurred in the advertisements.
  • at step 1610, various changes to the Term lists identifiers are extracted. In other words, it is determined at step 1610 which identifiers in the Term lists need to be updated in accordance with the changes transferred from the BackOffice component. This is described in more detail in paragraphs that follow.
  • these various identifier updates are packaged.
  • these various identifier changes are transferred to each of the server nodes.
  • the actual data transferred at step 1614 are the raw operational commands as may be supplied by the BackOffice component to be applied to the existing Term lists.
  • at step 1616, at each node, a working copy is made of the existing Term lists.
  • the changes are made to the working copy local to each server node.
  • the updated term list is installed.
  • the updated term list is not yet available for public use in the sense that it is published.
  • a new version of the Term lists has been created which includes the updated information as supplied in the transfer step 1614.
  • at step 1622, database preparation steps are performed.
  • Step 1622 serves several purposes. One is a coordination point for the updates of the various ads, as well as the various term list identifiers.
  • step 1622 serves as a step within which the normalized Primary Database information is propagated from the normalized copy of the Primary Database to a denormalized form in the Primary Database and the denormalized form in the Secondary Database. In other words, the changes which are transmitted from the BackOffice component and reflected in the normalized Primary Database copy are now further propagated to the denormalized Primary Database and the denormalized Secondary Database copy.
  • at step 1622, as part of the database preparation, the validity of the transactions and updates is verified such that at step 1626 the database knows it may fully commit to performing the update to the denormalized copies as used in performing user queries.
  • Steps 1624 and 1630 and, respectively, step 1626 may be performed in parallel.
  • the ads may actually be published as in step 1624, in which the updated copies of the Constructed Ad Repository are actually made available for use. Additionally, any updated images as stored in the Image Repository are also available for use.
  • Term lists, as installed in step 1620, are published in step 1630.
  • the publication of the various identifiers included in the Term lists generally means that the Term lists are available for use, as by the Query Engine.
  • the database commits to performing the update.
  • steps 1614 through 1620 are performed independently for each server node in this embodiment. Additionally, the actual amount of processing performed on the Term lists varies in accordance with the number of updates or transactions, as will be described in conjunction with Figure 32. Referring now to Figure 32, shown is one embodiment of the various method steps for performing update steps in accordance with a particular number of update transactions as sent from the BackOffice component 818.
  • a determination is made as to the number of update transactions. This determination involves a comparison with two threshold values, each describing a particular threshold number of transactions.
  • THRESHOLD 1 describes a relatively small number of transactions.
  • a relatively small number of updates generally refers to less than 30,000 update transactions.
  • THRESHOLD 2 value which generally represents a second, larger number of transactions.
  • THRESHOLD 2 represents approximately half a million transactions or update entries, which corresponds to approximately five to ten percent of the number of records included in the Primary
  • at step 1636, the normalized Primary Database is updated. Generally, this is performed at step 1602 of Figure 31, in which the copy of the normalized Primary Database is updated in accordance with the operational table as transferred from the BackOffice component indicating the actual database update operations. At step 1638,
  • the actual identifiers of the Term lists are updated. In other words, the Term lists are updated as opposed to being rebuilt.
  • steps 1640 and 1642 are executed.
  • the Primary Database is updated, as previously described in conjunction with step 1602 in which the normalized copy of the Primary Database is updated.
  • all of the identifiers as included in the Term lists are rebuilt. In this particular embodiment, both identifiers and markup files are rebuilt due to the use of the mark-up files by the Verity Information Retrieval software.
  • the Extraction Routines are executed to again produce the markup language files and various update records needed to update the denormalized data of the Primary Database.
  • the Information Retrieval software is executed to produce entire new sets of the Term lists.
  • Step 1642 is in contrast to step 1638. Rather than being rebuilt as in step 1642, the Term lists are updated in step 1638.
  • If a determination is made at step 1634 that the number of update transactions is greater than or equal to the larger threshold, THRESHOLD 2, step 1644 is executed. At this point, a determination has been made that the number of update transactions is so large that it has been deemed more efficient to rebuild the entire database and associated files, rather than update or patch the existing database and associated files, as in updating the identifiers of the Term lists of step 1638.
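  • The threshold comparison of Figure 32 may be summarized, purely as an illustrative sketch, by the following function. The default values reflect the approximate figures given in the text; the strategy names, and the assumption that the middle branch (steps 1640 and 1642) applies to counts at or above THRESHOLD 1 but below THRESHOLD 2, are not stated in the specification and are labeled here as assumptions.

```python
THRESHOLD_1 = 30_000    # "relatively small" number of update transactions
THRESHOLD_2 = 500_000   # approximately half a million update transactions


def choose_update_strategy(num_transactions: int,
                           threshold_1: int = THRESHOLD_1,
                           threshold_2: int = THRESHOLD_2) -> str:
    """Pick how much of the data store to rebuild for a given update volume."""
    if num_transactions < 0:
        raise ValueError("number of transactions cannot be negative")
    if num_transactions < threshold_1:
        # Steps 1636 and 1638: update the database and patch the Term lists.
        return "update_term_lists"
    if num_transactions < threshold_2:
        # Steps 1640 and 1642 (assumed range): update the database, rebuild the Term lists.
        return "rebuild_term_lists"
    # Step 1644: rebuild the entire database and associated files.
    return "rebuild_database"
```

  • As noted elsewhere in the text, the actual threshold values would be tuned per system, which is why they are parameters here rather than fixed constants.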
  • the previously described procedure of performing a multimedia data transfer is used to transfer, for example, the multimedia and text data associated with ads, as may be included in the Constructed Ad Repository 642 and Image Repository 842 of Figure 4.
  • the granularity of change that requires an entire advertisement page to be replaced in the Constructed Ad Repository is a change to a single component within an ad page. In this case, the entire ad page is reconstructed and replaced in the Constructed Ad Repository 842. For other systems, a different granularity of change may be used. Generally,
  • the various markup files and Term lists are built as needed in accordance with the number of transactions as described in conjunction with Figure 32.
  • the actual threshold values may be determined in accordance with tuning of a particular system, the size of the database, and the number of transactions in each particular system.
  • the databases included in both the Front End Server and the BackOffice component are Oracle™ databases.
  • the Oracle™ procedural language, PL/SQL, may be used to read the operational table and perform the updates as needed to the normalized form of the data as stored in the Primary Database included in the Front End Server component.
  • the same procedural language in files may also be used to update the denormalized Primary Database copy and the denormalized form of the data as stored in the Secondary Database.
  • Other embodiments may employ other techniques to update both the Primary and Secondary databases in accordance with a particular implementation.
  • the previously described incremental update procedure is one that is generally used to perform daily updates.
  • the same procedure may be used on a larger time period of transactions or updates. Due to the volume and size of the previously described embodiment, this procedure is one which performs well when performed on a daily basis. For other systems which may perform a similar number of transactions for a larger time period, the previously described techniques may also be used.
  • the various updates to a particular record or for a particular business or service may be collapsed before actually issuing the various database commands to perform the updates.
  • within a certain amount of time, such as five hours, a single record may be inserted, deleted, and modified dozens of times. The end result of these modifications for the small time interval may be no net modification or amendment to a particular record.
  • one optimization may collapse various updates associated with a particular record or business before actually issuing commands which perform a database update as applied to the copies in the BackOffice 818 and Front End Server 804 components. Generally, this may be determined by using a finite state machine with the states of "insert", "delete", and "modify". If the same record, for example, is modified twice and then deleted, the net result is that only a "delete" database command should be issued rather than two updates followed by a delete.
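  • One possible sketch of such a finite state machine is given below. The transition table, the extra "noop" state used for an insert that is later deleted, and the treatment of a delete followed by an insert as a net "modify" are assumptions introduced for illustration; the specification names only the three states. Data payloads are omitted for brevity.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (pending net operation, incoming operation) -> new net operation.
# "noop" is a hypothetical helper state: the record was inserted and then
# deleted within the batch, so nothing needs to be issued for it.
_TRANSITIONS: Dict[Tuple[Optional[str], str], str] = {
    (None, "insert"): "insert",
    (None, "modify"): "modify",
    (None, "delete"): "delete",
    ("insert", "modify"): "insert",   # still a single insert, with newer data
    ("insert", "delete"): "noop",     # no net change
    ("modify", "modify"): "modify",
    ("modify", "delete"): "delete",
    ("delete", "insert"): "modify",   # record existed before; net effect is a change
    ("noop", "insert"): "insert",
}


def collapse(ops: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Reduce a sequence of (record_id, operation) pairs to one net operation per record."""
    state: Dict[int, str] = {}
    for record_id, op in ops:
        key = (state.get(record_id), op)
        if key not in _TRANSITIONS:
            raise ValueError(f"invalid operation sequence for record {record_id}: {key}")
        state[record_id] = _TRANSITIONS[key]
    # Records whose operations cancel out produce no database command.
    return [(rid, op) for rid, op in state.items() if op != "noop"]
```

  • With this table, a record that is modified twice and then deleted collapses to a single delete, matching the example in the text, and a record that is inserted and then deleted within the same batch produces no command at all.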
  • the contents of the Page Cache 848 and the Query Cache 850 are reinitialized when an update is performed, as in performing the incremental update procedures described in conjunction with Figures 31 and 32.
  • the data included in the PHTML execution tree is also reinitialized.
  • a failure may occur when performing any of the steps associated with Figures 31 and 32. If a failure occurs when performing certain steps, then a recovery procedure may be performed. In this particular embodiment, a failure may occur, for example, when using the Information Retrieval software, as depicted in conjunction with Figure 25. This may be due, for example, to a problem, such as a software bug, with the Information Retrieval software 908. For example, an error may occur when extracting the identifiers associated with step 1610. Generally, step 1610 as previously described includes building the Term lists as determined in accordance with the number of update transactions in accordance with Figure 32.
  • if an error occurs when producing or rebuilding the identifiers in the Term lists, as in performing step 1642 and step 1644, it may be a recoverable error if another node has successfully built the identifier files, for example.
  • a recovery procedure may be to copy the updated version of the Term lists from one node to another node which has been unsuccessful in building the Term lists. This copy may occur, for example, after a predetermined number of builds of the Term lists on a particular node have failed.
  • Other embodiments of the invention may also include other alternative techniques in accordance with those steps associated with a particular system which it determines to be recoverable.
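  • The recovery procedure described above (retrying a local build and, after a predetermined number of failures, copying the Term lists from a node that succeeded) could be sketched as follows. The `BuildError` exception, the callables `build` and `copy_from_peer`, and the default of three attempts are hypothetical stand-ins; the specification does not give a failure count or an interface.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class BuildError(Exception):
    """Hypothetical error raised when a local Term list build fails."""


def build_with_recovery(build: Callable[[], T],
                        copy_from_peer: Callable[[], T],
                        max_failures: int = 3) -> T:
    """Try the local build up to max_failures times, then fall back to a peer's copy."""
    if max_failures < 1:
        raise ValueError("max_failures must be at least 1")
    for _ in range(max_failures):
        try:
            return build()
        except BuildError:
            continue  # retry until the predetermined number of failures is reached
    # All local attempts failed: copy the Term lists built successfully on another node.
    return copy_from_peer()
```

  • Only `BuildError` is caught in this sketch, so unrelated failures still propagate rather than silently triggering the copy.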
  • the update techniques may be included in a distributed computing system having multiple data representations as stored in a plurality of server nodes. The foregoing techniques provide for synchronized updates of the various data stores in the plurality of server nodes. [Targeted Banner Advertisements]
  • User query information may be used to influence the displays shown to the user by the browser 824.
  • the information retrieval software 908 can be used to assist in selecting other information to be displayed to the user, based on the nature of the user's query.
  • a banner ad 50 can be displayed to the user. Based on the user's query, the banner ad 50 may be targeted to characteristics of the user that are inferred from the user's query. For example, an advertiser might conclude that a user who has entered a query with the category "art supplies" is interested in art, so that an advertisement for an art show or related matter would be an appropriate banner ad 50.
  • Banner ads 50 can also be targeted geographically, so that ads for businesses from a selected geographical area can be associated with search queries that include that geographical area as a search term. It should be understood that a system for targeting banner ads using user queries can use a range of information retrieval techniques, such as the Verity techniques described above in connection with processing of information retrieval requests using the term lists 836. However, in an embodiment, a separate banner ad retrieval program 909 is part of the query engine 862.
  • Initialization steps that permit execution of a banner ad retrieval program 909 are set forth in a flow chart 52 on Figure 68.
  • the system initiates the banner ad retrieval software 909.
  • the banner ad retrieval software 909 uses extraction routines to access markup language files and extract data.
  • the banner ad retrieval software then generates banner ad term lists 837.
  • the banner ad retrieval software retrieves a list of all yellow pages categories.
  • the categories are all of the available categories of business listings, such as all available yellow pages categories.
  • the system establishes a set of super-categories.
  • the super-categories may consist of a sub-set of the categories or other categories.
  • the super-categories are preferably smaller in number than the categories, as the super-categories will be used to simplify assignment of targeted banner ads to particular user queries and results of the queries.
  • the system may map categories to super-categories in a step 70.
  • the mapping at the step 70 may be a many-to-many mapping.
  • a variety of techniques may be used to map categories to super-categories.
  • One such technique uses a combination of automatic and manual mapping. Steps for accomplishing such a technique are set forth in a flow chart 73 depicted in Figure 69. First, at a step 104,
  • control is returned, as represented by off-page connector B, to the flow chart 52 of Figure 68.
  • If at the step 110 it is determined that no additional categories exist, then all categories to be assigned manually have been assigned, and control proceeds to a step 114, where the system returns to the first category that was not manually assigned, and it is determined whether the category will be assigned automatically based on the manual assignments. If at the step 114 it is determined that the category will be assigned automatically based on the manual assignments, then, at a step 116, the system may compare terms that appear in the category to terms that appear in each of the manually assigned categories. The system may thus obtain a ranking of the manually assigned categories in order of the degree of co-occurrence of terms. Next, at a step 118,
  • the system may assign the same super-category as was assigned to the highest-ranked of the manually assigned categories.
  • the system may determine whether there are any additional categories. If not, then control passes, as depicted by off-page connector B, to the flow chart 52 of Figure 68. If additional categories remain, then control proceeds to the step 114 for the next category.
  • If at the step 114 for a particular category it is determined that a category will not be automatically assigned based on the manual assignments, then at a step 122 a determination is made whether additional categories remain to be assigned. If so, then at a step 124 processing skips to the next category and control is returned to the step 114 for the next category.
  • all categories that are to be automatically assigned based on the manual assignments may be completed at the steps 115 through 118 before control proceeds to the step 126.
  • processing returns to the first remaining category that was not previously assigned.
  • the system may determine certain statistics regarding the co-occurrence of terms between the category and one of the super-categories (perhaps also including the terms in the categories assigned to the super-categories).
  • a variety of co-occurrence techniques can be used.
  • the system may assign the category to the super-category for which the highest co-occurrence is found.
  • it is determined whether additional categories remain to be assigned. If not, then control proceeds, represented by off-page connector B, to the flow chart 52 of Figure 68. If so, then control proceeds to the step 126 for processing of the next un-assigned category.
  • While an embodiment of a technique for mapping categories to super-categories is disclosed herein, it should be understood that other techniques are available. For example, manual mapping could be executed after all automatic mapping is completed, or the system could rely entirely on automatic mapping.
  • the banner ad retrieval software 909 may index the various super-categories in a banner ad term list 837.
  • the banner ad term list 837 may take the form of a linked list of the super-categories, with each element in the list consisting of all of the terms that appear in the super-category, as well as all of the terms that appear in each of the categories that was matched to the super-category. It should be understood that these terms may be expanded, as described in connection with Figure 40 above, so that synonyms and related terms are also stored with each super-category element. Storage of these terms may be in a hierarchical structure that is capable of execution using PHTML scripts or similar techniques.
  • the system may match one or more banner advertisements to each super-category. Thus, if that super-category is found to be the appropriate super-category, the matching banner ad or ads will be displayed.
  • the system may generate a banner ad for display to the user.
  • the banner ads may be stored on a server, which in an embodiment is a separate banner ad server 809.
  • the banner ads may be either conventional banner ads or targeted banner ads
  • the banner ad server 809 may store the banner ads in a conventional manner and cycle between different ads according to a predetermined routine, such as a round-robin routine, so that when the system calls for a banner ad (such as via an appropriate URL for the banner ad server), the current banner ad is sent to the front end server 804 for further processing and display to the user in a banner on the user's browser 824.
  • the banner ad retrieval software 909 may be initiated. Steps that may be accomplished by an embodiment of the banner ad retrieval software 909 are depicted in a flow chart 132 as shown in Figure 70. First, at a step 60, the banner ad retrieval software 909 obtains the user's query. Next, at a step 62, the banner ad retrieval software obtains the categories that match the user's query. These categories may be the categories that are obtained by the information retrieval software 909 in response to a user query.
  • the user might retrieve a list of matching categories, such as the eight matching categories depicted in Figure 44.
  • the categories are those that were displayed as a results page in the flow chart 88 at the step 102 in Figure 41. That is, the categories are yellow pages categories of each of the business listings retrieved in the information retrieval query that was executed by the system
  • the banner ad retrieval software 909 may locate the particular terms that appear in the user query and in the categories obtained at the steps 60 and 62 in the banner ad term lists 837.
  • Location of a relevant term list 837 may be accomplished through use of a table of pointers or other conventional technique.
  • the argument of the table may consist of a tokenized version of the term and the table may point to the location of the linked term list 837 for that term in the database that stores the banner ad term lists 837.
  • a structure for a linked banner ad term list 837 is depicted, showing a linked list of super-categories
  • One linked list may be established for each term that appears in a user's query or in a category, such as a yellow pages category, retrieved by the information retrieval software 909.
  • the linked list may link elements 74, with each element 74 corresponding to a document (a document in this case consisting of all of the words in a particular super-category, plus all words in the categories mapped to the super-category) that includes the term
  • the elements 74 may include sub-elements, including a document identifier 76 for identifying the category and certain statistics regarding the document, including the term frequency 78, TF, which indicates the number of times the term appears in the document, and the inverse document frequency 80, IDF, which indicates the inverse of the number of times the term appears in the entire set of documents that are being searched.
  • the banner ad retrieval software 909 may at a step 81 rank the super-categories.
  • the system at the step 81 may rank the documents, i.e., the super-categories, according to the appearance of the words occurring in the user query and in the categories.
  • the ranking may be performed by a variety of techniques.
  • One such technique obtains a number for each term that appears in the user query and in the categories that consists of the product of the term frequency for that term and the inverse document frequency for that term.
  • the sum of all the resulting numbers may be calculated for all super-categories, and the super-category with the highest sum may be the highest ranked document.
  • the banner ad that was assigned to that highest ranked super-category at the step 72 of the flow chart 52 can then be displayed upon completion of the ranking step 81 of the flow chart 132.
  • IDF = log (N - IDF) / log (N), where N is the number of documents in the document set and the IDF on the right-hand side is the raw inverse document frequency number.
  • TF term frequency
  • RTF = TF / (TF + 0.5 + 1.5 (DL / ADL)), where TF is the raw frequency of a term in a document, DL is the length of the document, and ADL is the average length of a document in the search.
  • weighting may be further improved by weighting other factors. For example, it is possible to weight each term that appears in one of the categories that is retrieved upon execution of a user query and to normalize the IDF and RTF statistics over the weights. Thus, if a particular category deserves a higher weight, then it might be accorded higher weight in ranking super-categories. For example, a category that is manually mapped to a super-category might be given a higher weight than a category that is automatically mapped. The user query might be given a higher or lower weight than other information. Categories with a large number of listings may be given higher weight. In an embodiment, each category is given a weight corresponding to the number of listings that are associated with the category, normalized by dividing by the total number of listings.
  • the user query terms are each given a weight of one.
  • the weight may be multiplied by the term element in performing the sum of the product of term frequency and inverse document frequency over all terms for all documents in the super-category linked list.
  • the highest ranked super-category is selected, and a banner ad that was assigned to that super-category at the step 72 of the flow chart 52 of Figure 68 is selected.
  • the banner ad may be retrieved, such as via a URL, from the banner ad server 809, for display to the user via the browser 824.
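
The ranking described in the items above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: each super-category "document" is modeled as a plain list of terms, the function names `rtf` and `rank_super_categories` are hypothetical, and the inverse document frequency is a conventional logarithmic form chosen for the example rather than the normalization given above. The term-frequency normalization follows the RTF formula above, and each term's contribution is multiplied by its weight before summing.

```python
import math


def rtf(tf, doc_len, avg_doc_len):
    """Normalized term frequency: TF / (TF + 0.5 + 1.5 * (DL / ADL))."""
    return tf / (tf + 0.5 + 1.5 * (doc_len / avg_doc_len))


def rank_super_categories(docs, weighted_terms):
    """Rank super-categories by a weighted sum of TF x IDF products.

    docs: {super_category_name: list of terms in the super-category and
           in the categories mapped to it} (hypothetical representation).
    weighted_terms: {term: weight} for the terms of the user query and of
           the retrieved categories.
    """
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs

    # Document frequency: number of documents in which each term appears.
    doc_freq = {}
    for terms in docs.values():
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = {}
    for name, terms in docs.items():
        score = 0.0
        for term, weight in weighted_terms.items():
            tf = terms.count(term)
            if tf == 0:
                continue
            # Assumption: conventional IDF, offset so it stays positive.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            score += weight * rtf(tf, len(terms), avg_len) * idf
        scores[name] = score

    # Highest-scoring super-category first.
    return sorted(scores, key=scores.get, reverse=True)
```

The first element of the returned list would correspond to the highest ranked super-category, whose assigned banner ad is then selected.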

Abstract

Disclosed is a system for performing online data queries. The system for performing online data queries is a distributed computer system with a plurality of server nodes, each fully redundant and capable of processing a user query request. Each server node (808) includes a data query cache (850) and other caches (848) that may be used in performing data queries. The data query, as well as request allocation, is performed in accordance with an adaptive partitioning technique with a bias towards an initial partitioning scheme. Generic objects are created and used to represent business listings upon which the user may perform queries. Various data processing and integration techniques are included which enhance data queries. An update technique is used for synchronizing data updates as needed in updating the plurality of server nodes. A multi-media data transfer technique is used to transfer non-text or multi-media data between various components of the online query tool. Optimizations for searching, such as the common term optimization, are included for those commonly performed data queries. Also disclosed is a system for targeting advertisements that are displayed to a user of the system.

Description

TECHNIQUES FOR PERFORMING A DATA QUERY IN A COMPUTER SYSTEM
Technical Field
This application relates to the field of telecommunications and more particularly to the field of electronic commerce.
Background Art
In electronic commerce, such as conducted over the Internet, markup language pages displayed to a user 800 using a browser 824 typically include a mix of content and advertisements. Thus, for example, a user may see the content of a search engine, such as a search template, along with advertisements from one or more companies. The advertisements, typically referred to as "banner ads," may include links to other site locations, such as the home page of the advertising company.
As with other advertising, it is understood to be desirable to target the advertisement to a category of users. Thus, just as television advertisements are targeted to the demographic profile of the users who are believed to watch particular programming, companies wish to target online advertisements to the users. One method of such targeting is to display banner ads on pages that include content related to the banner ad. Thus, for example, a web page for an automobile dealer might include an advertisement and a link to a site offering financing for automobiles. However, some web content is not clearly associated with a particular demographic group or user interest. For example, a search engine is likely to be used by a wide range of users who may be interested in a wide range of goods and services. Accordingly, a need exists for methods and systems that target banner ads to such users. The term "targeted banner ads" is used herein to refer to such methods and systems. Targeted banner ad methods and systems present a number of programming challenges. Among other things, such methods and systems may need to provide relevancy ranking of categories of information related to a user's query. Accordingly, a need exists to provide improved methods and systems for performing such relevancy ranking.
Disclosure of Invention
Provided herein are methods and systems for targeting advertisements. The techniques used include associating at least one category with documents that may be retrieved in which the document includes at least one term. At least one category is associated with at least one super-category. An advertisement is associated with at least one of the super-categories. At least one term is determined that is associated with a data query. A first of the at least one super-category is determined in accordance with at least one term of the data query and the at least one category. An advertisement associated with the first super-category is determined.
In accordance with another aspect of the invention are methods and systems for targeting banner advertisements in which it may be desirable to map categories of documents to super-categories of documents. Such a mapping may assist in obtaining improved relevancy ranking of advertisements to user queries. These techniques establish super-category term lists used in performing a data query. Categories of documents are obtained that may be retrieved in accordance with said data query. Super-categories are established for the documents. Each of the categories is mapped to a super-category, wherein at least one of the categories is mapped to a super-category automatically in accordance with one or more previously determined mappings of categories to super-categories. A super-category term list is established for each term. Each element of a list includes terms in the super-category and the terms in categories that are mapped to that super-category.
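The automatic mapping of a category in accordance with previously determined mappings can be sketched as follows. This is a minimal illustration only: the function name `auto_assign`, the dictionary layout, and the use of a simple shared-term count as the co-occurrence measure are assumptions made for the example, and other co-occurrence statistics could be substituted.

```python
def auto_assign(category, category_terms, manual_assignments):
    """Pick a super-category for `category` from earlier manual mappings.

    category_terms: {category_name: set of terms appearing in the category}
    manual_assignments: {category_name: super_category_name} for the
        categories that were already mapped by hand.
    Assumption: co-occurrence is measured as the number of shared terms.
    """
    target_terms = category_terms[category]

    def shared_term_count(manual_category):
        return len(target_terms & category_terms[manual_category])

    # Rank the manually assigned categories by co-occurrence and take the
    # super-category of the highest-ranked one.
    best_match = max(manual_assignments, key=shared_term_count)
    return manual_assignments[best_match]
```

For instance, with "pizza restaurants" manually mapped to a hypothetical "Dining" super-category and "auto repair" to "Automotive", an unassigned "italian restaurants" category shares the term "restaurants" with the former and would therefore be assigned to "Dining".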
In accordance with another aspect of the invention are methods and systems for ranking super-category terms for use in performing a data query. Super-category terms may be linked to advertisements, so that an advertisement assigned to a super-category is displayed to a user if that super-category is identified as a relevant super-category based on a user's query. These techniques include establishing a super-category term list for each term appearing in one of a super-category or a category of document to be searched, each element of the super-category term list including terms in the super-category and terms in categories associated with that super-category. The terms in a data query are obtained. Terms in categories are obtained in response to the data query. A modified query is formed consisting of the terms in the data query and the terms in the categories. Terms of the modified query are weighted. The super-category term list is ranked by applying the modified query to the super-category term lists to determine the most relevant super-category to the data query.
Brief Description of Drawings
Figure 1 is an example of an embodiment of a system that includes an on-line query tool;
Figure 2 is an example of a block diagram of a hardware view of an embodiment of an on-line query tool;
Figure 3 is an example of an embodiment of a user interface displayed with an online query tool;
Figure 4 is an example of a block diagram of a software view of an online query tool of Figure 2;
Figure 5 is an example of an embodiment of a table illustrating data storage for denormalized objects in the databases;
Figure 6 is an example of an embodiment of a table representing data stored in the generic object dictionary;
Figure 7 is an example of an embodiment of a portion 440 of a PHTML execution tree;
Figure 8 is an example of an embodiment showing more detail of the parse driver;
Figures 9 and 10 are an example of a user interface displayed in response to a user request with an online query tool;
Figure 11 is an example of an embodiment of a user interface displayed with user query information;
Figure 12 is an example of the query results displayed in response to performing a user query of Figure 11;
Figure 13 is an example of a user interface which includes user-specified query information;
Figure 14 is an example of a resulting display page in response to the query performed with information specified in Figure 13;
Figure 15 is a more detailed display in response to choosing a particular category of Figure 14;
Figures 16 and 17 are an example of a user interface displayed in response to selecting an option from the menu of Figure 3 to add or change a listing;
Figure 18 is an example of a display screen in response to updating the business listing specified in Figures 16 and 17;
Figures 19 and 20 are an example of a user interface screen display results in response to a user request with regard to Figure 18;
Figure 21 is an example of a screen display to a user with more information with regard to the business listing selected from screen 20;
Figure 22 is the business information displayed with regard to the business in Figure 21;
Figure 23 is an example of an embodiment of the processes included in the request router of Figure 22;
Figure 24 is an example of a block diagram of an embodiment of the BackOffice component;
Figure 25 is an example of the flow process representing the processing of normalized data to the various data forms included in the Front End Server;
Figure 26 is an example of normalized data as may be included in an embodiment of the invention;
Figure 27 is an example of denormalized data form as may be included in an embodiment of the invention;
Figure 28 is a flowchart of an example of an embodiment of a method for performing request processing in the system of Figures 2 and 4;
Figure 29 is a flowchart of an example of an embodiment of the method steps for performing parser processing in the system of Figures 2 and 4;
Figure 30 is a flowchart of an example of a method with steps for performing query engine processing in the system of Figures 2 and 4;
Figure 31 is an example of a dependency graph as may be included in one embodiment of the invention for performing incremental update;
Figure 32 is an example of a flowchart of the method steps for performing different update techniques in accordance with the number of transactions;
Figure 33 is a flowchart of an example of method steps of one embodiment for performing data query cache lookup as used in performing a data query;
Figure 34 represents an example of applying the minimum cost derivation sequence as applied in the step of Figure 33;
Figure 35 is a flowchart of an embodiment of a method with steps for forming a name and determining if the corresponding data set is located in the query cache;
Figure 36 is an example of an entity as stored in the data query cache;
Figure 37 is a flowchart of an embodiment of a method including steps for performing an additional total-city cache lookup;
Figures 37 and 38 are flowcharts for a method in one embodiment for performing total-city and multi-city cache searches;
Figure 39 is an example of more details that may be included in an embodiment of the query engine;
Figure 40 is an example of an embodiment of method steps by which the information retrieval software may obtain results;
Figure 41 is a flow chart showing an example of an embodiment of method steps for obtaining results;
Figure 42 is a flow chart showing an example of method steps for classifying results for queries using common terms;
Figure 43 depicts an example of a user interface for an on-line query tool, including a screen for initiating a user query;
Figure 44 depicts an example of a user interface for an on-line query tool, including categories that may be retrieved in response to initiation of a user query;
Figure 45 is a block diagram of an embodiment of the database as may be included in the BackOffice component;
Figures 46 through 52 are flowcharts depicting processing steps in a method of one embodiment for performing foreign source data integration;
Figures 53 through 58 are flowcharts of a method of one embodiment for performing native source data integration processing;
Figure 59 is an example of an embodiment of data tables included on a sending node for a multi-media data transfer;
Figure 60 is an example of an embodiment of the tables as appearing on the sending side and the receiving side in the multi-media data transfer;
Figure 61 is an example of a representation of a tree structure representing the relationships between entities used in the multi-media transfer;
Figure 62 is a snapshot of the tables that may be included in a preferred embodiment in sending data in a multi-media data transfer;
Figure 63 is a snapshot of an example of an embodiment of the tables on the sending and receiving side at another point when performing a multi-media data transfer;
Figure 64 is an example of an embodiment of tables and external processes on the sending and receiving side using the multi-media data transfer;
Figure 65 is an example of an embodiment of the tables resulting from the text data integration;
Figure 66 is an example of a block diagram of an embodiment of the data table whose contents have been transferred to the receiving side;
Figure 67 is a flowchart of a method of the steps of one embodiment for assembling blob data into a repository table when performing a multi-media data transfer;
Figure 68 is a flow chart setting forth method steps for establishing super-category term lists and for matching advertisements to super-categories, to assist in targeting an advertisement to a user of an on-line query tool;
Figure 69 is a flow chart setting forth method steps for mapping categories to super-categories;
Figure 70 is a flow chart setting forth method steps for executing a modified query in an on-line query tool designed to assist in targeting an advertisement to a user of an online query tool; and
Figure 71 is a diagram showing an example of a linked super-category term list.
Best Mode for Carrying Out the Invention
Referring now to Figure 1, shown is an embodiment of an on-line query tool 1910. In an embodiment, one or more users 1900-1904 may connect to the on-line query tool 1910 via a network 1906. Users may interact with the query tool using conventional hardware and software, such as, in an embodiment, a web browser through the Internet.
Referring now to Figure 2, shown is an embodiment of a hardware view of an online query tool. In one embodiment, this on-line query tool may be the GTE SuperpagesSM query tool. Figure 2 shows a hardware view of the components that may be included in one embodiment of the query tool in typical operation as being accessed by a user through a network. The user 800 enters a query request which is sent via a network 802, such as the Internet, to the GTE Superpages Front End Server 804. The GTE Superpages Front End Server 804 includes a hardware router 806 for receiving incoming query requests. The hardware router routes the request, using a simple hardware-based technique, to one of the server nodes 808-810 which may be designated to service the request by performing the requested query. The servers 808 through 810, server 1 through server n, respectively, interact with the Primary Database 812 and Secondary Database 814 to perform a data query. The Primary Database 812 interacts with the BackOffice component 818 at times, as will be described in paragraphs elsewhere herein, to obtain data used in performing the queries. The BackOffice component 818 performs data filtering and other processing, for example, to combine information that may be obtained from various data sets producing a resultant data set. The resultant data set is subsequently transferred to the Primary Database for use by the various server nodes 808 through 810.
The process of data integration and updating the data, for example, from the BackOffice to the Front End Server, may be performed at a time other than peak demand time. These processes and data transfer techniques, as will be described in following paragraphs, are generally performed "off-line" and not in response to user query requests. Rather, these techniques may be performed as part of a data maintenance and update process performed in accordance with the load and the number and type of update transactions.
Figure 2 depicts a Superpages Front End Server 804 which includes a varying number of server nodes 808-810 to respond to the various query requests as made by a user 800. The techniques and concepts which are described in paragraphs that follow may be used in a variety of different systems which include one or more server systems. Additionally, a single database or other datastore may be used. The techniques described herein may generally be applied to a large distributed system. Additionally, these same concepts and techniques may be applied to a single-user system performing data queries and searches upon a local database.
Referring now to Figure 3, shown is an example of a user interface screen as included in one embodiment of the system of Figure 2. Generally, Figure 3 is the initial screen 1800 that may be displayed to a user entering a URL corresponding to the GTE Superpages Internet site. Figure 3 includes fields for query information 1802-1808, hyperlinks to other tools 1810, such as on-line shopping or placing advertisements, and other links 1812, for performing other tasks such as modifying an existing business listing.
The GTE Superpages Internet site is related to on-line yellow pages, similar to those included in a paper phone book. With these on-line yellow pages, various business services and user services may be provided. For example, a user may query the on-line yellow page information for various businesses in the United States based on particular search criteria. On-line shopping information regarding products and business services may be provided to a user performing a data query. Advertisers, such as the business providers of the various products and services, may also purchase advertisements similar to those that may be purchased in the paper copy of a phone book that includes yellow page listings of businesses.
The interface 1800 may include links to various services and functions. For example, one service provided permits businesses to advertise in the on-line yellow pages. Functions associated with this service may include, for example, purchasing advertisements and adding or changing a business listing that an advertiser or business includes in the yellow pages. In Figure 3, some of these functions are included in the interface portion 1812, with links to other tools in the screen portion 1810. A user may connect with any of these tools or functions to perform tasks related to the yellow pages advertising by selecting an option from the user interface 1800, such as by left-clicking with a mouse.
Other interfaces with varying functions may be directed to a user. Other types of network connections in addition to the Internet may also be included in other embodiments and may vary with each application and embodiment.
Referring now to Figure 4, shown is an embodiment of the various software components for an on-line query system. One embodiment may be the on-line query tool of the GTE Superpages system. Figure 4 depicts a software view of the typical operation of the system as being accessed by a user 800 through a network 802 using the hardware as described in conjunction with Figure 2. As previously described, the user may enter a request, as through a browser. This request is communicated through the GTE Superpages Front End Server 804 over the network 802. As shown in Figure 4, the Front End Server 804 includes server node 808 that includes a web server engine 852. In one embodiment, the web server engine 852 is a Netscape™ engine which serves as a central coordinating task for accessing files and displaying information to the user on the browser 824. The server node 808 also includes a request router 854, a monitor process 856 and a parser 866. The parser 866 generally includes a parse driver 858, a generic object dictionary 860, a query engine 862, and a data manager 864. The parse driver 858 operates upon data from a constructed ad repository 842 and the PHTML files 844. Additionally, the parse driver 858 stores and retrieves data from the PHTML execution tree 846 and the page cache 848. The data manager 864 included in the parser 866 is responsible for interacting with the database, which in Figure 4 is the Primary Database 812. It should also be noted that the data manager 864 may also obtain data from a Secondary Database as previously shown in Figure 4. If there are multiple databases other than a Primary and Secondary Database, the data manager may also interact with these to obtain the necessary data upon which data queries are performed. The query engine 862 operates upon data from, and writes data to, the data query cache 850. Additionally, the query engine uses data from the term lists 836 to obtain identifiers and possibly other retrievable data in accordance with various key terms upon which a data query is being performed. The request router 854 generally interacts with the parser and reads data from the configuration file 830 and load file 834. The monitor process 856 also reads data from and writes data to the load file 834. The web server engine 852, in this embodiment the Netscape engine 852, obtains data from the HTML repository 838 and the image repository 840 in accordance with various requests from the browser for different types of files. Each of the foregoing components will be described in more detail in terms of function and operation in paragraphs that follow. The monitor process 856 is generally responsible for indicating the availability of server nodes 808-810 in performing data queries. The monitor is also generally responsible for receiving incoming messages from other server nodes as to their availability for servicing requests.
The load file 834, which the monitor process 856 reads and writes, is a dynamic file in that its contents are updated in response to incoming messages indicating machine availability and the current load of the corresponding machine. The load file also includes static information components, such as the maximum load of each system. Generally, the actual executing load (current load) of a system is less than or equal to the maximum load (max load) as indicated in accordance with the load file. Each server has its own unique copy of the load file which is updated in accordance with messages which it receives from the other nodes. Below is an example of an entry that may be included in the load file representing the information described above:
SERVER, MAX LOAD, CURRENT LOAD
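As an illustration of how such entries might be read, the following Python sketch parses comma-separated load file lines into records. The concrete text layout (one entry per line, integer loads), the class name `LoadEntry`, and the function name `parse_load_file` are assumptions made for the example; the disclosure specifies only the three fields.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    server: str
    max_load: int      # static component of the entry
    current_load: int  # dynamic component, updated from node messages

    def has_capacity(self):
        # The current load is expected to stay at or below the maximum.
        return self.current_load < self.max_load


def parse_load_file(text):
    """Parse lines of the form 'SERVER, MAX LOAD, CURRENT LOAD'."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        server, max_load, current_load = [f.strip() for f in line.split(",")]
        entries[server] = LoadEntry(server, int(max_load), int(current_load))
    return entries
```

A request router could consult `has_capacity()` on each entry before selecting a server node.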
The configuration file 830 may be a static file physically located on one of the server nodes 808-810 with a copy replicated on each other server node. Generally, this file is created prior to use of the system. It may specify which servers may service requests based on weighted parameters of a particular search domain associated with a particular server. Below is an example of an entry in a configuration file:
DOMAIN/PARTITION, SERVER, DOMAIN WEIGHT, SERVER WEIGHT
The domain weight may be a normalized value representing costs (e.g., time) associated with processing a request for this associated search domain or partition. This domain weight is based on the median time to service a request in that domain based on the analysis of past data logs, for example, as normalized by the number of listings in the domain. Similarly, server weights may represent the cost associated with processing a request on a particular server. The domain/partition indicates a portion of the search domain upon which a user query may be performed that is associated with a particular server.
Other particular embodiments of the load and configuration files may include additional or different information in accordance with the particular policies and data required to implement the policies, such as lequest routing In this particular embodiment, an incoming lequest may be processed by one of a plurality of parsers 858 on each of the sen ei nodes The parser 858 generally transforms the user input query into a form used by other components, such as the request routei. The request router generally receives an incoming request as forwarded by the hardware router 806 of Figure 2. The request router subsequently uses the load file and the configuration file to decide which server node 808-810 a request is routed to based on the load and the availability of the server node, and the designated seiver for each partition or domain Once a request is routed to one of the server nodes 808-810, the query is performed producing data query information that may be cached, for example, in the memory of a data query cache 850. One use of the data query cache 850. as will be descnbed in paragraphs that follow, is its use in improving the performance in response to a user request in a subsequent query that may use a subset or superset of the data stored in the data query cache 850. A superset or composition query is one which is a boolean composite of several querying terms. A composition query may be determined by the parser 866, and the request router 854 may decide to which server node 808-810 the composition query or other query is sent for processing in accordance with domain weights as indicated m the configuration file Reallocation of requests when a server is unavailable may be performed generally with a bias toward the initial allocation scheme as indicated also by the configuration file. 
There is an assumption that reallocation of a request is on a transient basis, and that the initial allocation scheme is the one to be maintained. This concept will be described in paragraphs that follow in accordance with request routing and data query caching.
Also shown in Figure 4 are the PHTML execution tree 846, the page cache 848, and the PHTML file store 844. Generally, the PHTML execution tree 846 includes an expanded version of a PHTML file requested from the PHTML file 844 as the result, for example, of a user query. PHTML generally is a modified version of the HTML language, which is a markup language according to the Standard Generalized Markup Language (SGML) standard capable of interpretation by browsers, such as a Netscape browser. PHTML generally is a scripted version of HTML with conditional statements that provide for alternate inclusion of blocks of HTML code in a resulting HTML page transmitted to a browser in accordance with certain run time query conditions. The expanded version of a PHTML file may be described as a parse tree representing parsed and expanded PHTML files. For example, if a PHTML file conditionally includes accesses to other PHTML files or various portions of HTML commands, the parse tree structure reflects this in its representation of the parse tree which is cached in the PHTML execution tree 846. Upon a subsequent request for the same PHTML file, the cached, expanded version is retrieved from the PHTML execution tree 846 to increase system efficiency, thereby decreasing user response time for the subsequent query.
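The expand-once, evaluate-per-request behavior of the execution tree may be sketched as follows. PHTML syntax is not given in this description, so the sketch assumes files that have already been tokenized into literal HTML strings, ("include", name) nodes and ("if", condition, then_nodes, else_nodes) nodes. This node format and the function names expand and render are hypothetical, and circular includes are not handled.

```python
def expand(name, file_store, execution_tree):
    """Expand a file once, inlining its includes, and cache the parse tree."""
    if name not in execution_tree:
        execution_tree[name] = _expand_nodes(file_store[name], file_store, execution_tree)
    return execution_tree[name]


def _expand_nodes(nodes, file_store, execution_tree):
    expanded = []
    for node in nodes:
        if isinstance(node, str):
            expanded.append(node)  # literal block of HTML
        elif node[0] == "include":
            expanded.extend(expand(node[1], file_store, execution_tree))
        elif node[0] == "if":
            _, condition, then_nodes, else_nodes = node
            expanded.append((
                "if",
                condition,
                _expand_nodes(then_nodes, file_store, execution_tree),
                _expand_nodes(else_nodes, file_store, execution_tree),
            ))
        else:
            raise ValueError(f"unknown node type: {node[0]!r}")
    return expanded


def render(tree, conditions):
    """Evaluate a cached tree against the run time conditions of one request."""
    parts = []
    for node in tree:
        if isinstance(node, str):
            parts.append(node)
        else:
            _, condition, then_nodes, else_nodes = node
            branch = then_nodes if conditions.get(condition) else else_nodes
            parts.append(render(branch, conditions))
    return "".join(parts)
```

Only render runs on each request; a second call to expand for the same file returns the cached tree without consulting the file store again.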
The first time a user makes a request via the browser 824, a request is received by the webserver engine 852 which interacts with the parser 866. For a particular user request, a PHTML file is obtained and executed from the PHTML file store 844. The expanded version of the PHTML file is cached in the PHTML execution tree 846. In response to a user's request, an HTML page is generally constructed and cached in the page cache 848. Generally, constructed HTML pages are stored in the page cache 848 if the amount of time taken to produce the resulting HTML page is greater than a predetermined threshold. Implementations of the page cache may implement different replacement schemes. In one preferred embodiment, the page cache implements an LRU replacement scheme. Additionally, the threshold, the amount of time used to determine which pages are stored in the page cache, may vary with system and response time requirements.
When processing an incoming user request which results in returning an HTML page to a user, a particular search order of the previously described caches and file systems may be performed. Initially, it is determined whether the HTML page to be displayed to the user is located in the page cache 848. If not, search results are obtained from the query cache and the resulting HTML page is constructed and itself may be placed in the page cache 848. If a PHTML file is required to be executed in constructing the resulting HTML file, the PHTML execution tree 846 may be accessed to determine if there is a parsed version of the required PHTML file already expanded in the PHTML execution tree. If no such file is located in the PHTML execution tree 846, the PHTML file 844 is accessed to obtain the required PHTML file. The order in which these caches and file systems are searched is generally in accordance with a graduated processing state of producing the resulting HTML file. Caches associated with a later state of processing are generally searched prior to ones associated with an earlier processing state in producing the resulting HTML file.
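A page cache with the two properties described above, admission only when construction time exceeds a threshold and LRU replacement, may be sketched as follows. The class name PageCache, the function serve_page, and the capacity and threshold parameters are hypothetical; the description does not specify how the cache is sized or how construction time is measured.

```python
import time
from collections import OrderedDict


class PageCache:
    """LRU cache that only admits pages whose construction was slow."""

    def __init__(self, capacity, min_build_seconds):
        self.capacity = capacity
        self.min_build_seconds = min_build_seconds
        self._pages = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        page = self._pages.get(key)
        if page is not None:
            self._pages.move_to_end(key)  # mark as most recently used
        return page

    def put(self, key, page, build_seconds):
        if build_seconds <= self.min_build_seconds:
            return False  # cheap to rebuild, so not worth a cache slot
        self._pages[key] = page
        self._pages.move_to_end(key)
        while len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return True


def serve_page(cache, key, build_page):
    """Return the cached page if present; otherwise build it and offer it to the cache."""
    page = cache.get(key)
    if page is None:
        started = time.perf_counter()
        page = build_page()
        cache.put(key, page, time.perf_counter() - started)
    return page
```

In this sketch build_page stands in for the later lookup stages (query cache, PHTML execution tree, PHTML file store), which are consulted only after the page cache misses.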
Also accessed by the parse driver 858 is a constructed ad repository 842. As will be described in paragraphs that follow, the constructed ad repository generally includes constructed advertisement pages which may include, for example, text and non-text data, such as audio and graphic images to be displayed in response to a user query which represent, for example, a yellow pages ad. The webserver engine 852 accesses information from the image repository 840 and HTML repository 838. Generally, the image repository 840 includes various graphic images and other non-text data which may also be directly accessed by the webserver engine 852 in response to a user request, as by a user request for a specific URL. Similarly, the HTML repository 838 includes various HTML files which may be provided to the user, for example, in response to a user request with a specific URL which indicates a file.
Included in each of the server nodes 808-810 are one or more parsers 866 which perform, for example, parsing of the text of a user data query request. Figure 4 includes some of the software components as included in the parser 866. The components of the parser 866, which are described in more detail in the following paragraphs, generally communicate using a generic object dictionary 860. The parser may include a parse driver 858 which performs the actual parsing of a user query. The parse driver 858 interacts with the query engine 862 once a request has been parsed to formulate a data query which is further passed to the data manager 864. As previously described, the data manager 864 generally interacts with a database to actually retrieve the data to be included in the resultant data query as displayed to the user. The parse driver 858 generally uses a data schema description to interpret various data fields of the generic data objects.
Generally, abstraction of the data interpretation into the data schema description enables different components of the parser 866 to operate upon and use generic data objects without requiring code changes to, or recompilation of, these components when new data presentation types are introduced. Components which need to know the details of the generic data object to perform certain functions, such as the parse driver 858, do this on a per-component basis using data schema descriptions to interpret a generic data object. This technique insulates code as included in the parser 866 from the introduction of new presentation types which may be represented as generic data objects. One common use of the GTE Superpages Internet site is to perform a data query.
In performing a data query, a user enters data query information, as in fields 1802-1808 of Figure 3, or may select other detailed search options, such as searching by distance, as included in field 1808. In this embodiment, data field 1802 is a category query field by which queries may be performed in accordance with specified search categories that may be associated with business listings included in the yellow pages database. Additionally, field 1802 also includes predetermined top categories, as may be determined by examining log files in accordance with user query selections and search criteria. In this embodiment, selection of the "top categories" of the field 1802, as by left-clicking with a mouse button, causes the interface 1820 of Figure 9 to be displayed in a user's browser. Referring now to Figures 9 and 10, shown is one embodiment of a user interface for displaying a first page of the top query categories 1820. Generally, these categories are associated with the various business listings and are tags by which a user may perform queries. In this embodiment, for example, the user may select the "top categories" from the initial interface as included in the field 1802.
Referring now to Figure 11, shown is one embodiment of a user interface for displaying a "search by distance" option. In this embodiment, this user interface screen may be displayed by selecting "detailed search" from the field 1808 from the initial user interface 1800. For example, the user interface 1830 may be displayed if the user wants to perform a data query for specified categories and certain distance criteria. As shown in the example of user interface 1830, a data query may be performed for restaurants within five (5) miles of Boston, MA. This query is performed when the user selects the "Find It" button 1832 as included in the user interface 1830. In this embodiment, a first screen 1840 of the data query results is shown in Figure 12.
Referring now to Figure 13, shown is an example of one embodiment of a user interface display 1850 for performing a user query in accordance with user-specified search criteria. User interface 1850 of Figure 13 is the interface 1800 of Figure 3, but with user-specified data query information included in various data fields. In Figure 13, a data query is performed for "shoes" as the category 1802 for "Boston, MA" in field 1804. The query is performed by selecting the "Find It" button of field 1806. The resulting screen displayed in response to selection of the "Find It" button is included in Figure 14.
Referring to Figure 14, shown is one example of a screen display in response to performing a user query. The screen results 1860 may include displayed summarized business listing information in accordance with the search criteria previously specified in Figure 14. Various business listings may be grouped together in categories. In this example, relating to "shoes", there are 154 business listings included in thirteen (13) categories.
From this listing of thirteen (13) categories, the user may select one of these relating to shoes. For example, selection, as by using a mouse, of "custom made shoes" 1862 results in the screen display of Figure 15.
Referring now to Figure 15, shown are the business listings relating to the user-specified search criteria selection relating to "custom made shoes". From this screen 1870, the user may further select one of the businesses for more information pertaining to the business, such as directions and business-provided advertisements.
Referring now to Figures 16 and 17, shown is one embodiment of a user interface that may be displayed when a business or advertiser updates a business listing. This screen may be displayed, for example, by selection of the "add or change your listing" option 1812 of Figure 3 of the initial user interface. A user interface 1880 provides data fields which allow a user to enter information such as a telephone number corresponding to a business listing. Corresponding business listing information is then updated. In this example, a phone number 617-832-5000 is entered into field 1882 to retrieve business listing information corresponding to this phone number. By selecting the phone number field that is filled in with this phone number, the resulting screen of Figure 18 is subsequently displayed to the user in this embodiment. The phone number corresponds to a business as displayed in Figure 18. If this is the correct business, a user may select a displayed business, for example, by clicking on the "matching business" information of Figure 18. In response to selecting the "matching business" information, the screen display of Figures 19 and 20 may be displayed to a user. To update the basic listing information associated with the business, selection of field 1890 of Figure 20 results in display of the screen of Figure 21 where the user has the option to either update the business information or change categories. If business information is selected, Figure 22 may be displayed. Figure 22 includes the business listing information that may be updated, such as a street address or e-mail address associated with this business listing.
Referring back to Figure 16, a section of the displayed interface 1883 indicates options for creating a website linked to a particular business listing. Note also that in some embodiments, it is possible to enhance a business listing and/or link a listing to a pre-existing website or to one that is created.
The foregoing user interfaces and display results may vary with embodiments and user-specified search criteria. Various other user interfaces and other techniques known to those of ordinary skill in the art for specifying user search criteria may be used in other embodiments of the invention.
Referring to Figure 23, shown is an embodiment of the request router 854. In this particular embodiment, the request router 854 may be executed within a Netscape server process space and may be invoked when a user, via a browser, makes a request which results in a PHTML file being executed. The PHTML files, as generally included in the PHTML file store 844, are in the form of a script activated when a server node 808-810 is forwarded a user request. The request router 854 is generally responsible for routing a request to the proper server node in accordance with data stored in the configuration and load files. The request is also forwarded to one of the plurality of parsers for processing once the proper server node has been located. In this embodiment, the request router 854 may include several threads of execution as shown in Figure 23, which operate under the control of, and in the same process space as, the Netscape browser. As shown in Figure 23, the request router 854 generally includes a housekeeping thread 880, a router thread 882, and one or more worker threads 884. Generally, the housekeeping thread 880 is responsible for maintaining a parser status table 886 and a parser queue 888, both of which are further described below.
The router thread 882 generally responds to the monitor process changes as recorded in the various data files with regard to server node availability. The router thread 882 reads data from the configuration and load files, and maintains an in-memory copy for use by the various threads of the request router 854. The router thread 882 updates the in-memory copy of the configuration and load files in accordance with predetermined node fail-over and reallocation-of-request policies. For example, if in reading the configuration and load files, the router thread 882 determines that a first server node is at maximum utilization, the router thread updates its in-memory, server-node, local version of the files. The router thread determines not to forward requests to the first server. When the first server node's actual utilization decreases and the node is again available for processing additional requests, the router thread accordingly updates its in-memory copy.
Each of the worker threads 884 is initially forwarded a request which arrives at a server node. The worker thread 884 makes the decision whether the request should be routed to another node. The worker thread 884 makes this decision generally in accordance with the contents of the configuration and load files as previously described. If a request is determined to be routed to another server, the worker thread forwards the request to another worker thread on another server node. If the worker thread does not forward the request to another server, the worker thread determines which parser to send the request to for further processing. The list of available parsers is stored in the parser queue 888, which in this particular embodiment is implemented as an AT&T System 5™ with a system message queue. The parser queue is generally maintained by the housekeeping thread 880.
It should be noted that the Netscape™ or other HTTP server provides as a service the dispatching of requests to the various worker threads. Other implementations may provide this function using other techniques such as callback mechanisms which dispatch the user requests to one of the plurality of available worker threads 884. Generally, the parser status table 886 includes information about use, availability and location of each of the plurality of parsers on each server node. The parser status information may be used in determining where to route requests, for example, as performed by the worker thread 884. The parser status information as included in the parser status table 886 may be used to route requests based on an adaptive technique similar to the adaptive caching technique which will be described in paragraphs that follow. This may be particularly useful in systems with multiple processors, for example, those in which certain CPUs are dedicated processors associated with predetermined parsers. For example, as particular requests are processed by particular parsers, each associated with a particular CPU, the parsing results may be stored in the PHTML execution tree accessed by the particular processor. Subsequent requests which are also processed by the same parser may access the cached parsing results stored in the PHTML execution tree.
In this particular embodiment, the request processing model includes a plurality of parsers and a plurality of worker threads. Using this request processing model, an incoming request is associated with a particular worker thread which then forwards the request to a parser for processing. Once this request has been associated or forwarded to a particular parser, the worker thread is disassociated with the request, and is then available for use in the pool of worker threads. The number of parsers and worker threads may be tuned in accordance with the number of user requests. One point to note using this model is that the worker thread and the parser are disassociated and thought of as distinct processing units rather than as a unit in which a worker thread is associated with a particular parser for processing an entire life of a request.
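The disassociation of worker threads from parsers may be sketched as follows, assuming an in-process queue in place of the system message queue and Python threads in place of parser processes. The class Parser, the function worker_dispatch, the per-parser inbox and the results queue are hypothetical illustration devices, and the parsing work itself is reduced to recording which parser handled which request.

```python
import queue
import threading


class Parser(threading.Thread):
    """A parser that advertises itself on the parser queue whenever it is idle."""

    def __init__(self, name, parser_queue, results):
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()        # requests handed to this parser
        self.parser_queue = parser_queue  # shared queue of available parsers
        self.results = results            # shared queue of finished work

    def run(self):
        while True:
            self.parser_queue.put(self)   # announce availability
            request = self.inbox.get()
            if request is None:           # shutdown signal
                return
            # Stand-in for parsing and query processing.
            self.results.put((request, self.name))


def worker_dispatch(request, parser_queue):
    """Hand a request to an available parser and return at once.

    The worker blocks only while waiting for a free parser; it does not wait
    for the parse to finish, so it is immediately available for other requests.
    """
    parser = parser_queue.get()
    parser.inbox.put(request)
    return parser.name
```

Because the worker returns as soon as the hand-off is made, the number of workers and the number of parsers can be tuned independently, as the paragraph above notes.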
Referring now to Figure 24, shown is a block diagram of an embodiment of the BackOffice component 818. Generally, the BackOffice component includes a database 892 which provides data, for example, to the Front End Server 804 through connection 822. The database 892, as stored in the BackOffice component, may be updated, as through a webserver via a connection to a user. Such a connection as 896 may be used, for example, when a modification is made to an entry to correct a typographical error. A user may connect, such as via a browser, using connection 896, to the webserver 894 included in the BackOffice component. The database 892 is then accessed and updated in accordance with requests or updates made by the user.
Other embodiments of the BackOffice component may include other software components than those displayed in Figure 24. Additionally, a user may update entries included in database 892 using techniques other than by a connection 896 via a webserver to the database 892. As described in other sections of this description, different types of updates to database 892 may be performed in different embodiments of the invention. For example, the database 892 may be updated on a per-entry basis by a variety of users connecting via multiple webserver connections. Additionally, periodic updates, for example, for a particular data set may be provided from a particular vendor, and accordingly integrated into database 892 through a database integration technique rather than having a user manually enter these updates such as via a connection to the webserver 894.
The connection to the Front End Server 822 may be used, for example, to load a new copy of the database 892 into the Front End Server Primary and Secondary Databases 812, 814 as shown in Figure 2. The way in which these updates may be sent across the connection 822 to the Front End Server may be as previously described in terms of database operational commands which perform updates from the computer system which includes database 892. For example, in one embodiment, the database 892 included in the BackOffice component and both the Primary and Secondary Databases, as included in Figure 24, are Oracle™ databases. Oracle provides remote database update and access commands which allow for remote database access and updating, such as update requests from the database server node 892 to update the Primary Database 812 as stored in the Front End Server 804. In this embodiment, updates as made to the database 892 are "pushed" to the Front End Server 804 via the connection 822. These modifications are pushed via database-provided update techniques such as those included when sending the operational table commands to the Front End Server 804. In this particular embodiment, when information is sent via connection 822 to the Front End Server 804 from the BackOffice component 818, error messages and other status codes may be sent back to the BackOffice component 818 in accordance with an indication as to whether a data transfer, for example, has been successfully completed.
Referring now to Figure 25, shown is an embodiment of a general process by which data that is transferred from the BackOffice 818 to the Front End Server 804 is further integrated into other data stores within the Front End Server 804. Data is stored in the BackOffice component in this particular embodiment in a normalized data form, as will be further described in paragraphs that follow. These normalized data changes are transferred to the Front End Server 804 from the BackOffice component in one of several forms. For example, the entire database may be transferred to the Front End Server 804. Additionally, changes or updates to particular entries may also be transmitted to the Front End Server 804 from the BackOffice component rather than updating or overwriting the entire copy of the database as stored in the Front End Server 804. Each of these types of database updates from the BackOffice component to the Front End Server 804 may be done in accordance with the number of transactions or updates to be performed. This is further described in other sections of this description.
Data which is stored in the Front End Server 804 may be stored in a normalized data format 900. Extraction routines 902 operate upon this normalized data to produce denormalized data 904 and markup language files 906. The markup language files 906 serve as input to information retrieval software 908 which outputs term lists 836. As known to those skilled in the art, a markup language file generally includes tags which represent commands or text identifiers for processing the contents of the file. For example, Standard Generalized Markup Language, SGML, is a standards-based markup language known to those skilled in the art.
The process depicted in Figure 25 is performed once data has been received in the Primary Database 812, and is first stored in the Primary Database 812 in normalized data form, as in the normalized data store 900. Extraction routines 902 examine the normalized data store 900 and rearrange the information to place it in the denormalized data form, also included in the Primary Database 812 of this embodiment. These changes or updates for the normalized data which are transformed into the denormalized data form are integrated into the denormalized data store 904. Additionally, the extraction routines 902 produce markup language files 906 which are primarily used by the information retrieval software to produce identifiers and corresponding words or terms upon which a query may be performed. These lists of key words or terms which may be searchable or retrievable and the corresponding record identifiers as included in the denormalized data store 904 may be stored in a list structure as included in the term list data store 836.
Generally, the markup language files include one file or document per business for which there is an advertisement, for example, in this particular embodiment. Each of the markup language files 906 includes markup language statements, such as SGML-like statements, with tags identifying key data items in the document for each business. In this particular embodiment, the information retrieval software is Verity software which uses as input markup language files 906. Additionally, Verity uses its own schema file by which a user indicates what key words or terms as indicated in the markup language files are searchable and which of the data fields contain retrievable information. "Searchable" as used herein means fields or key words and terms upon which searches may be performed, like index searching keys. "Retrievable" as used herein generally means fields or categories with associated data that may be retrieved. All searchable fields have a tag, such as a business name or city. Identifiers are generally produced by the information retrieval software 908. Verity™, in this particular embodiment, produces term lists 836 in which there exists a list for each particular key word, term or category followed by a chain of identifiers that indicate the record number in the denormalized data store 904. Additionally, associated with each element in the term list which indicates a record in the denormalized data, retrievable data associated with that record may also be included. For example, if the field "zip code" includes a tag as included in the markup language file 906 which indicates that this particular field is searchable, it may be desired that whenever a user wishes to do a search for "zip code" what is actually retrieved or displayed to the user is the city and the state. Accordingly, in this instance, the term list and the term list data store 836 contain a list corresponding to the key word "zip code". There is a term list for each particular value of a zip code.
Attached to each key word "zip code" and the particular value may be a list or a chain of identifiers. Associated with each identifier on the chain may be associated data, such as the city and state, which may be retrieved when a particular zip code is searched.
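The term list structure described above may be sketched as a mapping from a (field, value) pair to a chain of record identifiers, each carrying its retrievable data. This is an illustration only and is not the Verity software's interface or file format. The function name build_term_lists, the field names and the sample records are hypothetical.

```python
from collections import defaultdict


def build_term_lists(records, searchable, retrievable):
    """Build one term list per (searchable field, value) pair.

    Each list is a chain of (record_id, retrievable_data) elements, where
    record_id identifies a record in the denormalized store.
    """
    term_lists = defaultdict(list)
    for record_id, record in records.items():
        data = {field: record[field] for field in retrievable if field in record}
        for field in searchable:
            if field in record:
                term_lists[(field, record[field])].append((record_id, data))
    return dict(term_lists)


# Hypothetical denormalized records keyed by record number.
records = {
    1: {"name": "Acme Shoes", "zip": "02110", "city": "Boston", "state": "MA"},
    2: {"name": "Custom Cobbler", "zip": "02110", "city": "Boston", "state": "MA"},
    3: {"name": "Metro Footwear", "zip": "01701", "city": "Framingham", "state": "MA"},
}
term_lists = build_term_lists(records, searchable=["zip"], retrievable=["city", "state"])
```

A search on zip code "02110" then reads term_lists[("zip", "02110")] and obtains records 1 and 2 together with their city and state, without a second lookup in the denormalized store.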
Other types of data may also be included in other preferred embodiments of the term lists. For example, the data included in the term lists may be data that is also needed in performing search optimizations, weighted searches, or different types of searches, such as proximity searches. This data may further be stored in the various data files and caches of the Front End Server as needed in accordance with each implementation, for example in accordance with the types of searches and data upon which queries may be performed or otherwise operated upon by the Front End Server.
Referring now to Figure 26, shown is a detailed description of one embodiment of an example of normalized data, as may be stored in the BackOffice component and one copy in the Primary Database 812. Generally, in the Primary and Secondary Databases 812 and 814, respectively, of Figure 2, the Primary Database 812 includes both normalized and denormalized data form, and the Secondary Database 814 includes only denormalized data form. Normalized data is that representation of the data in which each data relation is represented independent of other relations. Generally, denormalized data is the antithesis of normalized data, in which one data relation represents all relations. Different databases may be of different degrees of normalized and denormalized data. The BackOffice component 818 generally stores the data in normalized data form of a certain degree.
Similarly, the databases used in this server store the data in a normalized form, also of a certain degree, and additionally in a denormalized form for search performance optimizations on performing data queries. In one embodiment, for example, the data is stored in third degree normal form. Additionally, in the denormalized form, sets of data may be stored together within a single field, such as multiple mailing addresses. Other embodiments may have one field per address. This may prove to be advantageous, for example, for high performance and better flexibility in systems subject to multiple and diverse data sources, and a high rate of modifications.
As shown in Figure 26, for example, each particular business entry may have a unique identifier (ID). Additionally, three pieces of information may be stored for each particular business. The normalized data form may look as in Figure 26. In this particular example, there may be a separate table for each ID corresponding to a business and its business address 910. Additionally, there may be two other data tables of information also indexed by each particular business ID, such as email address 912 and telephone number 914. Generally, as indicated in Figure 26, the normalized data representation for each business associated with a particular ID is represented as a separate data relation independent of the other relations.
The conceptual opposite of normalized data is denormalized data, as depicted in Figure 27. Referring now to Figure 27, shown is an example of denormalized data stored in table 916. In this example of denormalized data, for each ID associated with a business, the business address, email and telephone number may be stored in a single record. In other words, one data relation, which is a single record in the table 916, represents all relations for one particular data set, such as the ID corresponding to a business. Various degrees of denormalized and normalized data, as known to those skilled in the art, may be used. The optimal degree of normalized and denormalized data forms may vary with each particular implementation and embodiment.
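The relationship between the two forms may be sketched with three small relations keyed by business ID, in the manner of tables 910, 912 and 914, and a single combined table in the manner of table 916. The function name denormalize and all sample values are hypothetical; the actual extraction routines 902 are not specified at this level of detail.

```python
def denormalize(addresses, emails, phones):
    """Combine three ID-indexed relations into one record per business ID."""
    ids = sorted(set(addresses) | set(emails) | set(phones))
    return {
        business_id: {
            "address": addresses.get(business_id),
            "email": emails.get(business_id),
            "phone": phones.get(business_id),
        }
        for business_id in ids
    }


# Hypothetical normalized data: one independent relation per attribute.
addresses = {101: "1 Main St, Boston, MA", 102: "22 Elm St, Waltham, MA"}
emails = {101: "info@example.com"}
phones = {101: "555-0100", 102: "555-0199"}

table = denormalize(addresses, emails, phones)
```

A query against the denormalized table reads one record per business instead of joining three relations, which is the search performance trade-off the description attributes to the denormalized form. Updates are simpler against the normalized relations, since each fact is stored once.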
Referring back to Figure 20, it may generally be noted that the BackOffice component 818 may include one or more database servers 892. A user may directly interact with the web server 894 included in the BackOffice component via connection 896 which, for example, may be a network connection of a user accessing the web server through the Internet. The user may also interact directly with the BackOffice component through the Front End Server Connection 822.
In this embodiment, the particular type and number of data fields may vary with embodiment. Additional structure may also be imparted to data fields; for example, a telephone number may include an area code and exchange component. Additionally, interactions between the Primary Database 812 of the Front End Server 822 and the BackOffice component may be driven or controlled by the BackOffice component. For example, when there is an update to be performed to the Primary Database server 820, an automatic transfer of the new information may be transmitted to the Primary Database 812 by the BackOffice component. Data may be transmitted to the Primary Database 812 using connection 822. Additionally, connection 822 may be used to provide feedback or status information to the BackOffice component 818, for example, regarding success or failure of a data transfer using connection 822.
As generally described, the PHTML files 844 of Figure 4 are generally HTML instructions as interpreted generally by a browser with additional embedded processing instructions. Generally, the PHTML execution tree 846 may be implemented as a C++ applet class with various execute methods which are conditionally performed based upon the evaluation of certain conditions as indicated in the PHTML scripting language statements. Each of the PHTML files 844 may be expanded and evaluated in accordance with the particular conditions of the user request. The first time a PHTML file is accessed, it is expanded and the expanded version is placed in the PHTML execution tree 846 of Figure 4. Subsequent accesses to the same PHTML file result in the conditional evaluation of the stored and expanded PHTML file in accordance with the run time performance and evaluation of a user request, as from browser 824. An HTML page is generally formed and displayed to the user. For example, the HTML page may be formed by the parser after interaction with the data manager and query engine to select a specific number of items to be displayed to the user. The HTML page may be stored in the page cache 848. The page cache generally includes a naming convention, such as a file system in which the name of the file corresponds to the arguments and parameters of the query. The technique for forming the name is described in other paragraphs of this application.
The query engine 862 is generally responsible for performing any required sorting of the query information or subsetting and supersetting of information. Generally, the query engine 862 retrieves various identifiers which act as keys into the Primary Database 812 or Secondary Database 814 for accessing particular pieces of information in response to a user query. After the query engine 862 formulates and retrieves various identifiers, for example as from the term lists, which correspond to a particular user query, this query information in the form of term list and retrieved information may be stored in the data query cache 850. A technique similar to the page cache query-to-filename mapping technique may be used to map a particular query request to a naming scheme by which data is accessed in the data query cache. The technique for forming this name is described in other sections of this application.
Additionally, data which is stored in the data queiy cache 850 ma\ be compiessed or stored in a paiticulai format which facilitates easy letneval as well as attempting to optimize storage of the various data queries w hich aie cached, as discussed in othei portions of this application
In the following Figures 28-30, shown are flowcharts of method steps of embodiments for performing processing in various components of the previously described system of Figures 2 and 4. Referring now to Figure 28, shown are steps of one embodiment of a method of processing a request in the system of Figures 2 and 4. At step 920, the Webserver engine invokes the Request Router in accordance with the PHTML MIME (Multipurpose Internet Mail Extensions). At step 922, the Worker thread as included in the Request Router is initially forwarded the request for processing. At step 924, a determination is made as to whether or not this request is serviced by this node in accordance with the information included in the configuration and load files. If, at step 924, a determination is made that the request is not to be serviced by this node, the request is forwarded to another server node in accordance with the load and configuration file information. If, at step 924, a determination is made that this request is to be serviced by this node, control proceeds to step 926 where the Worker thread allocates an available parser from the parser queue to process the incoming request. At step 928, the incoming request is passed to the designated parser for processing.
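The routing decision of steps 924 through 928 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the class name, the dictionary standing in for the configuration and load files, and the use of plain callables as parsers are all assumptions made for illustration, and the sketch assumes a parser is always available in the queue.

```python
from collections import deque


class RequestRouter:
    """Hypothetical stand-in for the Request Router (names are illustrative)."""

    def __init__(self, node_id, domain_to_node, parsers):
        self.node_id = node_id
        # Assumed shape of the configuration/load file data: domain -> node.
        self.domain_to_node = domain_to_node
        # Queue of available parsers; each parser is a callable here.
        self.parser_queue = deque(parsers)

    def route(self, request):
        # Step 924: is this request serviced by this node?
        owner = self.domain_to_node[request["domain"]]
        if owner != self.node_id:
            # Not serviced here: forward to the node named in the config.
            return ("forwarded", owner)
        # Step 926: allocate an available parser from the parser queue.
        parser = self.parser_queue.popleft()
        try:
            # Step 928: pass the request to the designated parser.
            return ("handled", parser(request))
        finally:
            # Return the parser to the queue for later requests.
            self.parser_queue.append(parser)


router = RequestRouter(
    "node1",
    {"MA": "node1", "TX": "node2"},
    [lambda req: "page for " + req["query"]],
)
```

With this setup, a request for domain "TX" is forwarded to "node2", while a request for domain "MA" is handled locally by the single queued parser.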
Referring now to Figure 29, shown is a flowchart of one embodiment of method steps as may be performed by the parser. At step 940, the parse driver of the parser parses the incoming request. In this embodiment, the query request that is parsed is included as a URL parameter that is processed by the parse driver. For example, if the query includes syntax errors, the parse driver will detect and report such errors. At step 942, a unique file name is determined in accordance with the query request. This filename corresponds to the display results that may be included in the page cache. It should be noted that this filename is unique for a particular user query and in accordance with "look and feel" parameters of the display results. For example, "look and feel" refers to parameters that describe the displayed results, such as the number of business listings displayed in an HTML page and the particular starting point of the displayed results with regard to the resulting data set. For a given resulting data set corresponding to a user query, on a particular type of user display window, 15 items may be displayed. The same query performed by a second user from a different display window may display 17 items. Thus, the resulting HTML page in both cases is different even though the resulting data set used in forming each of the HTML pages is the same. The page cache may include a different HTML page for each of the 15 and 17 item displays.
A determination is made at step 944 as to whether the page cache includes the data in the filename determined at step 942. If a determination is made that the data is included in the page cache by the existence of the file, control proceeds to step 946 where the data in the filename is retrieved from the page cache. Control proceeds to step 956 where the resulting HTML including the data in display format is delivered to the user's browser. If a determination is made at step 944 that the data is not in the page cache, control proceeds to step 948 where a determination is made as to whether or not there is a PHTML file in the PHTML execution tree. If a determination is made that the expanded PHTML representation for this request is included in the PHTML execution tree, control proceeds to step 950 where the expanded PHTML representation is retrieved. Control proceeds to step 954 where portions of the PHTML file are executed in accordance with the user query to obtain data to produce the resulting HTML page by invoking the Query engine for data results. The data results are returned to the parse driver that creates a resulting HTML file returned to the user's browser at step 956. Additionally, it should be noted that the resulting HTML file may be cached in the Page cache in accordance with predetermined criteria, as previously described. The resulting HTML file is communicated directly to the user's browser. If a determination is made at step 948 that the PHTML file is not in the PHTML cache, control proceeds to step 952 where the PHTML file is retrieved from the PHTML file storage and subsequently expanded.
The expanded PHTML file is stored in the PHTML cache. Control proceeds to step 954, which is described above.
Referring now to Figure 30, shown is a flowchart of the method steps of one embodiment for performing query engine processing. At step 962, the query engine receives an incoming request, as forwarded by the parse driver in step 954. At step 964, the data is retrieved for the "normal" search results as appropriate from the data query cache, or using an alternate technique. This step is described in more detail in following paragraphs describing the use of the data query cache. Generally, "normal" search results refers to the resulting data set formed by business listing data associated with a well-defined geographic area. In addition to "normal" search result data are other search result data that may not be associated with a single well-defined geographic area, such as virtual businesses on the Internet. These other search results that may not be associated with a single well-defined geographic area are described in more detail in paragraphs relating to the data query cache and its use. At step 966, other search data in addition to the "normal" search data may be retrieved and integrated into the resulting data set. At step 968, the result data set is formulated in accordance with the user query request, such as displaying results in a particular order or beginning at a particular point. At step 970, the resulting data set is returned to the parse driver for formatting in a display format in an HTML file.
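The parser flow of Figure 29 (steps 942 through 956) can be sketched as a two-level cache lookup. This is a hedged illustration only: the actual naming technique is described elsewhere in the application, so the SHA-1 hash used here is an assumption, as are the function names, the single template name "results", and the dictionaries standing in for the page cache and the PHTML cache.

```python
import hashlib


def page_cache_name(query, look_and_feel):
    """Derive a file name from the query plus the display parameters.

    Hashing is an assumed stand-in for the naming technique.
    """
    parts = [query] + [f"{k}={look_and_feel[k]}" for k in sorted(look_and_feel)]
    digest = hashlib.sha1("&".join(parts).encode("utf-8")).hexdigest()
    return digest + ".html"


def serve(query, look_and_feel, page_cache, phtml_cache, load_phtml, run_query):
    name = page_cache_name(query, look_and_feel)      # step 942
    if name in page_cache:                            # step 944
        return page_cache[name]                       # steps 946 and 956
    tree = phtml_cache.get("results")                 # step 948
    if tree is None:
        # Step 952: load and expand the PHTML file, then keep the expansion.
        tree = phtml_cache["results"] = load_phtml("results")
    page = tree(run_query(query), look_and_feel)      # step 954
    page_cache[name] = page                           # optional page caching
    return page


# Usage with hypothetical stand-ins for the query engine and a PHTML script.
calls = []


def run_query(q):
    calls.append(q)
    return ["a", "b", "c"]


def load_phtml(_name):
    def tree(rows, lf):
        items = "".join(f"<li>{r}</li>" for r in rows[: lf["items"]])
        return "<ul>" + items + "</ul>"
    return tree


pages, trees = {}, {}
p1 = serve("q", {"items": 2}, pages, trees, load_phtml, run_query)
p2 = serve("q", {"items": 2}, pages, trees, load_phtml, run_query)
p3 = serve("q", {"items": 3}, pages, trees, load_phtml, run_query)
```

The second call is answered from the page cache without invoking the query engine, while the third call, which differs only in a "look and feel" parameter, maps to a different file name and produces a different page from the same data set.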
In this particular embodiment, the Standard Industrial Classification (SIC) may be used to indicate various name categories and synonyms. These various name categories and synonyms are produced, for example, by the extraction routines which produce the markup files, as used in this particular embodiment by the information retrieval software. Other techniques may be used to facilitate name categories, and equivalents thereof, for searching in other preferred embodiments.
It should generally be noted that in the various descriptions included herein, certain portions of the data storage, such as the image repository 840, are updated on an incremental change or delta basis. Other preferred embodiments may have different thresholds or techniques to update various data stores included in the Front End Server 804. These techniques may vary with implementation.
The architecture described in Figures 2 and 4 is a highly optimized, distributed, fault tolerant, collaborative architecture. The primary purpose of this architecture is to support a high volume of searches, which may be performed, for example, through the Internet. In this particular embodiment, the databases may include business information, such as for specific businesses or classifications of businesses. Additionally, data queries may be performed based on characteristics of the various businesses, such as location, name, or category. Furthermore, the architecture described herein supports a flexible presentation of these businesses, based on business agreements and service offerings. The architecture described herein uses various techniques and combinations to achieve high performance while maintaining flexibility and scalability.
The architecture as depicted in Figures 2 and 4 includes a set of fully redundant server nodes in which each node is capable of responding to any search request. Each server node communicates with all the other nodes, as previously described, establishing the health and availability of each server node. Incoming requests are classified by each node, as routed by the hardware router, using a classification scheme held common and by consensus. The nodes agree to a disjoint partitioning of requests to each of the server nodes in which one server node will service a set of classes of requests that no other node will generally service. A number of complementary techniques, including Subsumption and Highly Redundant Caching, may then be used to adapt a particular node to a particular class of requests. Thus, the latency for request servicing by that node decreases as additional user queries are performed for each particular class of requests.
Adaptive techniques, such as those performed by the Front End Server 804, may be most effective when dealing with repeated requests or queries similar to those previously performed. Based on the adaptive techniques used herein, an initial search request may be the most costly in terms of system resources and search time. Therefore, other techniques are used in conjunction with the adaptive techniques to further facilitate performing an optimal query in response to a user request. For example, common term optimization (CTO) is one technique which generally takes advantage of a statistical bias in both submitted queries and result sets towards particular words or combinations of words. By anticipating particular word combinations, or by precalculating result lists that match them, the CTO addresses the initial search problem.
In the embodiment described herein, the Front End Server 804 has a data set domain which includes electronic yellow pages and advertising requiring a high degree of flexibility in the presentation of data. Data is generally presented using the look and feel of business partners, and each business listing may have distinct requirements for presentation. Additionally, new modes of data presentation may be defined on a monthly basis, requiring updates to large amounts of data stored in the back office component in the primary and secondary databases. To support flexibility, the architecture described uses several techniques that also support the performance requirements of the particular data domain in this embodiment and application. Generally, techniques such as the generic object and the generic presentation language may be used to facilitate rapid introduction of new services and additional presentation data in a variety of forms to a user. Additionally, in the embodiment described in Figures 2 and 4, each server may be fully redundant, and there are two additional servers that are designated database servers which have additional supporting software and hardware for facilitating database access. Other embodiments of the invention may include additional configurations of servers and databases in their particular implementation. While including concepts and techniques described herein, the different databases and commercially available packages which may be used, as known to those skilled in the art, vary with the type of data access and searches to be performed. In this particular embodiment, a relational database structure is used to store and retrieve information in the Front End Server 804. Other embodiments may include additional types of database storage using other commercially available packages or specialized software which facilitate each particular application.
[Generic Objects]
The PHTML files 844 that are provided to the parse driver 858 are scripts that direct the parse driver 858 to perform queries, view the results of queries, and provide information to the browser 824. In a preferred embodiment, the PHTML files 844 are expanded into the PHTML execution trees 846 the first time the parser 866 accesses the PHTML files 844. The parse driver 858 accesses the PHTML execution trees 846 during operation in a manner described in more detail below.
The scripts that are stored in the PHTML files 844 may include commands that are interpreted by the parse driver 858, C++ objects that are executed, blocks of HTML code that are provided by the parse driver 858 to the browser 824, and any other appropriate data and/or executable statements. The PHTML scripts perform operations on objects in a way that is somewhat independent of specific attributes of the objects and thus, as described in more detail below, provide a generic mechanism for displaying and presenting many types of objects. The PHTML scripts include conventional commands to include other files (such as other PHTML files), conditional file/text inclusion commands, switch statements, loop statements, variable assignments, random number generation, string operations, commands to sort and iterate on attributes/fields of an object according to aspects thereof, such as the name, and commands for logging values to files. The specific syntax used for the PHTML scripting commands is implementation-dependent but includes conventional key words (such as "if" and "then") and conventional arrangements of parts of the various types of statements. As described in more detail below, the scripts provided in the PHTML files 844 are used to construct the PHTML execution trees 846 that control the operation of the parse driver 858.
Each business listing may be represented as a document stored in the primary and secondary databases 812, 814. The documents may be manipulated as generic objects. As discussed in more detail below, representing each business listing as a generic object facilitates subsequent handling of the business listings.
Referring to Figure 5, a table 400 illustrates data storage for a plurality of denormalized objects in the databases 812, 814. The difference between normalized and denormalized data is discussed in more detail elsewhere herein. The denormalized data format is optimized for fast performance while, perhaps, foregoing some storage compaction.
A plurality of rows 402, 404, 406 represent a plurality of denormalized generic objects, each of which corresponds to a business listing. A plurality of columns 412, 414, 416, 418 represent various attributes of the denormalized objects. In a preferred embodiment, the first attribute 412 corresponds to an identifier for the objects 402, 404, 406 and thus identifies a particular listing. Each of the attributes contains a number of fields and contains descriptor information identifying the type, size, and number of fields. Attributes may be added to the normalized objects, or only to a specific subset thereof. A denormalized representation of any one of the objects 402, 404, 406 contains the same number of attributes as any other one of the objects 402, 404, 406. This allows the denormalized objects to be transferred from the primary or secondary databases to the data manager 864 in a string format wherein each object can be identified. Accordingly, if values for a new attribute are added to only a subset of the objects, then the other objects, outside the subset, will contain a null value or some other conventional marker indicating that the particular attribute is not defined (or contains no data) for the objects in question. For example, assume that a new attribute 420 is added. Further assume that the new attribute 420 only contains values for the object 402, but is not defined for the objects 404, 406. In that case, data space for the attribute 420 is still added to the denormalized version of the objects 404, 406, but no value is provided in the attribute 420 for the objects 404, 406.
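The effect of adding an attribute to only a subset of the objects can be shown with a short sketch. The dictionary-based normalized form, the attribute names, and the use of None as the null marker are assumptions chosen for illustration; the application does not prescribe them.

```python
NULL = None  # assumed stand-in for the conventional "not defined" marker


def denormalize(normalized_objects, attribute_order):
    """Flatten objects so every row has the same number of attributes.

    Attributes missing from a normalized object still occupy a slot in the
    denormalized row, filled with the null marker.
    """
    return [
        tuple(obj.get(attr, NULL) for attr in attribute_order)
        for obj in normalized_objects
    ]


# Hypothetical listings: only the first carries the newly added attribute.
listings = [
    {"id": 402, "name": "Cafe A", "color": "red"},
    {"id": 404, "name": "Cafe B"},
    {"id": 406, "name": "Cafe C"},
]
rows = denormalize(listings, ["id", "name", "color"])
```

Every row in `rows` has three slots, and the rows for listings 404 and 406 carry the null marker in the slot of the new attribute, mirroring the treatment of attribute 420 described above.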
Referring to Figure 6, a table 430 represents data stored in the generic object dictionary 860 corresponding to results of a search query provided by the query engine 862 or from the data query cache 850 in the case of a previous search having been performed. In the table 430, it is assumed that a search returns a plurality of objects corresponding to n categories and up to m listings for each of the categories. The notation o_jk means the object corresponding to the jth category and the kth listing. In the case of the table 430 (and thus the generic object dictionary 860), the objects may be object identifiers. For example, the field 412 may correspond to an object identifier of each of the objects 402, 404, 406. As discussed in more detail below, the parse driver 858 uses the table 430 provided by the generic object dictionary 860, along with the PHTML execution trees 846, to provide specific HTML code from the parse driver 858 to the browser 824 of the user 802.
Referring to Figure 7, a diagram illustrates a portion 440 of the PHTML execution trees 846. The portion 440 is constructed using the scripts in the PHTML files 844 and consists of a plurality of nodes corresponding to the decision points set forth in the PHTML scripts and a plurality of C++ objects and HTML pages that are executed and/or passed to the browser in response to reaching a node corresponding thereto. Thus, for example, a node 442 can correspond to a PHTML if-then-else statement having two possible outcomes wherein one branch from the node 442 corresponds to one outcome (i.e., the conditional statement evaluates to true) and another branch from the node 442 corresponds to another outcome (i.e., the conditional statement evaluates to false). Such a structure may be implemented in a conventional manner given a scripting language such as that described above in connection with the PHTML language. That is, implementing such a tree structure using a scripting language is straightforward for one of ordinary skill in the art using conventional techniques.
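A tree of the kind shown in Figure 7 can be sketched with two node types: one that emits HTML and one that branches on a condition. The class names, the format-string templating, and the sample attributes are illustrative assumptions; the embodiment itself describes C++ objects, and this Python sketch only mirrors the structure.

```python
class Emit:
    """Leaf node: passes a block of HTML to the output."""

    def __init__(self, html):
        self.html = html

    def execute(self, env, out):
        out.append(self.html.format(**env))


class IfNode:
    """Decision node: one branch per outcome of the condition."""

    def __init__(self, cond, then_branch, else_branch):
        self.cond = cond
        self.then_branch = then_branch
        self.else_branch = else_branch

    def execute(self, env, out):
        branch = self.then_branch if self.cond(env) else self.else_branch
        for node in branch:
            node.execute(env, out)


def render(tree, env):
    """Traverse the tree once for a given object and return the HTML."""
    out = []
    for node in tree:
        node.execute(env, out)
    return "".join(out)


# Hypothetical tree: show the listing in color only if the attribute exists.
tree = [
    IfNode(
        lambda env: env.get("color") is not None,
        [Emit('<b style="color:{color}">{name}</b>')],
        [Emit("<b>{name}</b>")],
    )
]
colored = render(tree, {"name": "Cafe A", "color": "red"})
plain = render(tree, {"name": "Cafe B", "color": None})
```

The same tree serves both objects; only the traversal differs, which is the property that lets new attributes be handled by changing scripts rather than the parser.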
Representing the documents (business listings) of the databases 812, 814 as generic objects facilitates modifying the documents, or a subset thereof, without modifying the parser 866. For example, if an attribute is added to some of the objects, then it is only necessary to modify the objects (schema and data) that will contain that attribute and to also modify the PHTML files 844 to include new scripting to handle that new attribute. The scripting may include statements to determine if the particular attribute exists for each object. For example, suppose the business listings were in black and white and then color was added to some of the listings. The color attribute could be added to some, but not all, of the objects only in normalized form. Once the new color attribute has been added, the denormalized versions of all of the objects would contain a data space for the attribute, but the objects that do not possess a color attribute will have a null marker. The PHTML files 844 can be modified to test if the color attribute is available in a particular object (e.g., to test for a null value) and to perform particular operations (such as displaying the color) if the attribute exists or, if the attribute does not exist for a particular object, displaying the object in black and white. In this way, the color attribute is added to some of the objects without modifying the parser 866 and without modifying existing objects that do not contain the attribute.
For each query that is presented to the query engine 862, the query engine 862 determines whether the query is found in the data query cache 850 or whether it is necessary to perform a query operation using the Verity software (discussed elsewhere herein) and the term list 836. In either instance, the results of the query are provided by the query engine 862 to the generic object dictionary 860 in a form set forth above in connection with the description of Figure 6.
The parse driver 858 and PHTML execution trees 846 then operate on the generic object dictionary 860 to determine what data is displayed to the user by the browser 824. In some instances, the PHTML execution trees 846 may require the parse driver 858 to obtain additional data from the databases 812, 814 through the data manager 864. For example, in instances where the categories corresponding to the retrieved documents (business listings) are displayed, the PHTML execution trees 846 may cause the parse driver 858 to obtain information from the generic object dictionary 860 that identifies each category and the number of listings corresponding to each category. Then, the portion of the PHTML execution trees 846 may cause the parse driver 858 to use the data manager 864 to access additional information from the databases 812, 814, such as the names of the categories corresponding to the category identifiers provided in the generic object dictionary 860.
Referring to Figure 8, the parse driver 858 is shown in more detail. An instantiator 452 creates the PHTML files 844 and constructs the PHTML execution trees 846 from the PHTML scripts the first time the PHTML is invoked by the parse driver 858. Instantiation includes reading the PHTML files and constructing trees, such as that shown in Figure 7, based on the PHTML scripts provided in the PHTML files 844. As discussed above, constructing such trees from a scripting language is generally known in the art.
An interpreter 454 accesses the PHTML execution trees 846 and, based on the information provided therein, provides HTML data to the browser 824 and/or executes a C++ object. The interpreter 454 also accesses a configuration file 456 and a state file 458 which keeps track of the state of various values during traversal of the PHTML execution trees 846. The interpreter 454 also receives other data that is used to traverse the PHTML execution trees 846 and to provide information to the browser 824. The other data may include, for example, data from the data manager 864 and data from the generic object dictionary 860. The state data 458 includes information such as the number of iterations (in the case of an iterative loop), the values of various environment and other variables from the PHTML execution trees 846, and the values of other variables and data necessary for performing the operations set forth in the PHTML execution trees 846.
The technique disclosed herein relates to a new data type which abstracts the data interpretation from the data typing by using data schemas. A novel approach is the use of this data typing for rapid service deployment in search engines for advertising services on the Internet. For example, new presentation types may be introduced by an advertiser due to the large number of possible ways to present data to a user. An advertiser may wish to change the information displayed when a user performs a query that results in displaying information regarding the advertiser's business. If there are tens of thousands of advertisers which perform this task on a monthly basis, this implies a very high rate of new presentation types which an online advertising service must be able to accommodate. Use of this generic data type in GTE Superpages™ provides a flexible and efficient approach to incorporate these additional and new presentation types for large numbers of advertisers. Generally, this technique provides for rapid integration of new data types without requiring recompilation or code changes in source code which uses instances of data that include the additional data types. This provides for the flexible and efficient introduction of data changes.
The generic data typing is optimized for performing multiple data operations by providing a small subset of possible operations or accesses upon any data of the generic data type. Therefore, this small subset of known operations may be optimized wherever there is a data access, for example, within the parser. This is in contrast to a non-generic data typing scheme which requires the introduction of a new data type and additional associated access patterns. In a non-generic data typing scheme there is an unlimited and unknown number of access patterns for which optimizations must be performed on an ad-hoc basis as new data types are introduced. Thus, when a new data type is introduced, the possible accesses need to be analyzed and optimized. In addition, the technique described herein provides for denormalized, flat representations of the objects that facilitate rapid and efficient handling thereof. The parse driver 858 uses a data schema description to interpret the various data attributes and fields of the generic data objects. Generally, the abstraction of the data interpretation into the data schema description enables different components of the parse driver to operate upon and use generic data objects without having these components require code changes or recompilation due to the introduction of new presentation types. Components which need to know the details of the generic data object, such as the parse driver 858, to perform certain functions, do this on a per component basis by using the data schema description to interpret a generic data object. This insulates code from the introduction of new presentation types which are represented as the generic data objects.
[Query Cache and Request Allocation]
When performing the routing of particular requests, such as data queries, existing systems may perform request routing to a particular server in a distributed computer system without reference to certain available factors, such as an initial partitioning of the entire domain, or an assumption that data queries will be cached in a data query cache and subsequently reused for additional searches. Generally, using the concepts which will be described in paragraphs that follow, the larger the number of queries that are performed when routed to a particular node in accordance with an initial allocation scheme, the quicker subsequent searches on this same particular node may be performed due to the use of the data query cache.
This embodiment relates to concepts that may be included in a variety of applications. One embodiment that includes these is the GTE Super Pages on-line Internet tool that may be used to perform data queries. As an example, consider using this tool to perform an on-line query of all French restaurants within thirty (30) miles of Boston. Generally, GTE Super Pages performs this query, returning search results to an on-line user. Concepts which will be described in paragraphs that follow may be generally used and adapted for use in querying any search domain.
A worker thread classifies a request and performs query partitioning in accordance with the URL information. For example, this may include data from the query request such as a specified state, zip code, or area code. The request router 854 receives an incoming request as forwarded by the hardware router. Within the request router 854 of Figure 4 is generally machine-executable code which embodies the concepts of an adaptive and partitioning scheme with regard to routing requests. Use of this technique allows for high performance search optimizations that leverage and ensure server node adaption to a particular class of requests. The technique of adaptive query partitioning generally increases the performance in terms of high throughput and low latency where queries include Boolean search terms. This search optimization technique may include three components: query partitioning, highly redundant caching, and subsumption.
Query partitioning is the strict classification and routing of a particular query, based on its input term characteristics, to a node or a particular set of nodes. This information is stored in the various configuration and load files, as described in other sections of this application. Query partitioning ensures that any adaption a node undergoes based on the characteristics of queries that it processes is maintained. Specific nodes may serve specific query partitions. Caching and result set manipulation techniques may then be used on each particular node to bias each particular node to the query partition to which it has been assigned. Highly redundant caching is generally a technique that trades storage space against time by storing result sets along with subsets of these result sets. The highly redundant caching technique generally relies on the fact that the search time to locate an existing result is generally less than the amount of time required to create the query result from a much larger search space. One highly effective set manipulation technique, referred to as subsumption, is especially important in the adaption of a particular node. Subsumption is generally the derivation of query results from previous results, which can be either a superset of the requested result or subsets of the requested result. Subsumption is also the recognition of the relationship between queries and the determination of the shortest derivation path to a result set. That derivation may be the composition of several subsets resulting in a superset, or the extraction of a subset from a recognized result set. In subsumption, the presence of an additional conjunctive ("and") search term corresponds to the formation of a subset from the superset described without the additional term. The presence of an additional disjunctive ("or") search term corresponds to the identification and composition of existing subsets each described by one of the disjunctive clauses.
Consider the following example of the use of the data query cache and subsequent searches which use a subset of the data stored in the cache. For example, suppose the first request results in a query of all of the restaurants within thirty (30) miles of Boston. This query data is placed in the data query cache. A second request results in a query of all the seafood restaurants within thirty (30) miles of Boston. The second request is routed to the same node as the first request in accordance with the load and configuration files, for example, as shown in Figure 4. The second query is performed quickly by using the data query cache information: the cached data indicating restaurants within thirty (30) miles of Boston is searched for the subset of this first search data which indicates seafood restaurants. Subsequently, this second request query data, which indicates all the seafood restaurants within thirty (30) miles of Boston, is also stored as a separate data set within the data query cache.
It should generally be noted that the data included in the data query cache is placed in nonvolatile storage such that if the node were to become unavailable, data from the data cache may be fully restored once the node resumes service.
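The restaurant example can be sketched as a cache keyed by the set of conjunctive query terms. This is a minimal illustration under stated assumptions: the class name, the tag-based listings, and the rule of choosing the smallest cached superset as the "shortest derivation path" are all hypothetical, and only the conjunctive ("and") case is shown, not the disjunctive composition.

```python
class DataQueryCache:
    """Hypothetical cache keyed by the set of conjunctive search terms."""

    def __init__(self):
        self._results = {}  # frozenset of terms -> list of listings

    def query(self, terms, full_search, matches):
        key = frozenset(terms)
        if key in self._results:
            return self._results[key], "hit"
        # A cached query with fewer "and" terms describes a superset result.
        candidates = [k for k in self._results if k < key]
        if candidates:
            # Assumed heuristic: derive from the smallest cached superset.
            base = min(candidates, key=lambda k: len(self._results[k]))
            extra = key - base
            result = [
                item for item in self._results[base]
                if all(matches(item, term) for term in extra)
            ]
            source = "subsumed"
        else:
            result = full_search(key)
            source = "full"
        # Store the new result as its own data set (redundant caching).
        self._results[key] = result
        return result, source


# Hypothetical data: tags stand in for category and geographic terms.
LISTINGS = [
    {"name": "Legal Harbor", "tags": {"restaurant", "boston-30mi", "seafood"}},
    {"name": "Chez Nous", "tags": {"restaurant", "boston-30mi", "french"}},
    {"name": "Dock Books", "tags": {"bookstore", "boston-30mi"}},
]


def full_search(terms):
    return [item for item in LISTINGS if terms <= item["tags"]]


def matches(item, term):
    return term in item["tags"]


cache = DataQueryCache()
r1, s1 = cache.query({"restaurant", "boston-30mi"}, full_search, matches)
r2, s2 = cache.query({"restaurant", "boston-30mi", "seafood"}, full_search, matches)
r3, s3 = cache.query({"restaurant", "boston-30mi", "seafood"}, full_search, matches)
```

The first query searches the full domain, the second is derived by filtering the cached restaurant set for the additional term, and the third is a direct cache hit on the separately stored seafood result.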
The composition query also uses the data in the data query cache. A composition query may generally be referred to as one which is a composition of several queries, for example, when using several conjunctive search terms. For example, a request of all the French restaurants in Massachusetts, Texas and California is a composition query that may reuse any existing cached data from previous queries stored individually regarding restaurants in Massachusetts, Texas and California. A composition query is generally determined by the Parse Driver, and the request router decides to which server node 808-810 within the Front End Server the composition query is sent for processing in accordance with domain weights of the configuration file. Consider the following Configuration File information based upon the previous composition query:

DOMAIN    SERVER    DOMAIN WEIGHT
MA        1         1000
TX        1         2000
CA        2         4000
The Request Router may route the composition request to either server 1 or 2. If the request is routed to server 1, data may be cached regarding MA and TX for reuse and a new query may be performed for the CA information. If the request is routed to server 2, data may be cached for reuse regarding CA and new queries performed for the MA and TX information. The Request Router, based on the weights, sends the request to server 2 since the cost associated with performing the MA and TX queries is less than the cost of performing the CA query.
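The weight comparison above can be worked through in a few lines. The sketch assumes that a domain's weight approximates the cost of running that domain's query without cached data, which is one reading of the example; the function names and the tuple layout of the configuration data are hypothetical.

```python
# Configuration File data from the example: domain -> (server, domain weight).
CONFIG = {"MA": (1, 1000), "TX": (1, 2000), "CA": (2, 4000)}


def cost_on_server(server, domains, config):
    """Sum the weights of the domains this server would have to query anew."""
    total = 0
    for domain in domains:
        owner, weight = config[domain]
        if owner != server:
            total += weight
    return total


def choose_server(domains, config):
    """Pick the candidate server with the lowest cost of new queries."""
    servers = sorted({config[d][0] for d in domains})
    return min(servers, key=lambda s: cost_on_server(s, domains, config))


domains = ["MA", "TX", "CA"]
cost1 = cost_on_server(1, domains, CONFIG)  # new CA query only
cost2 = cost_on_server(2, domains, CONFIG)  # new MA and TX queries
best = choose_server(domains, CONFIG)
```

Under this reading, server 1 would incur a cost of 4000 for the new CA query and server 2 a cost of 3000 for the new MA and TX queries, so server 2 is chosen, in agreement with the text.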
In the above caching scheme, a particular domain is associated with a particular server node upon which data query caching is performed for designated domains. The domain and server weights reflect the cost associated with processing a request on each node using the data query cache. Accordingly, routing a request in accordance with these weights results in faster subsequent query times for those requests.
Reallocation of the requests when a server is unavailable is performed with a bias toward the initial allocation scheme as indicated by the Configuration File. There is an assumption that reallocation is on a transient basis and that the initial allocation scheme is the one to be maintained. Consider the following server nodes (M1-M4) and the domains initially allocated to each node as indicated below:
Domains D1 and D2 allocated to node M1.
Domains D3 and D4 allocated to node M2.
Domains D5 and D6 allocated to node M3.
Domains D7 and D8 allocated to node M4.
At a first time, node M1 becomes unavailable and the routers reallocate Domain D1 to node M2 and D2 to node M3. At a second time, node M2 also becomes unavailable. Domains D1 and D3 are reallocated to node M3 in addition to domains D5 and D6. Domain D4 is reallocated to node M4 in addition to domains D7 and D8. At a third time, node M1 is restored and node M2 is still unavailable. Domains D1 and D2 are reallocated to M1 in addition to Domain D3. Domains D5, D6 and D4 are allocated to node M3. Domains D7 and D8 are allocated to node M4. There is a bias toward restoring the initial allocation scheme when a node becomes available. This bias contributes to faster subsequent query times upon re-entry of a server node due to the use of the data query cache, and routing of subsequent requests to the particular nodes in accordance with this bias.
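A minimal sketch of an allocation routine with this bias is given below. The rule for placing displaced domains (the available node with the fewest domains, ties broken by node name) is an assumption made only for illustration and is not stated in this description; the names INITIAL and allocate are hypothetical. The sketch recomputes the allocation from the initial scheme each time, so every domain returns to its initial node as soon as that node is available.

```python
# Initial allocation scheme, as it might be read from the Configuration File.
INITIAL = {"D1": "M1", "D2": "M1", "D3": "M2", "D4": "M2",
           "D5": "M3", "D6": "M3", "D7": "M4", "D8": "M4"}


def allocate(initial, available):
    """Map every domain to a node, preferring its initial node when it is up.

    Assumption: a displaced domain goes to the available node that currently
    holds the fewest domains (ties broken by node name).
    """
    alloc = {d: n for d, n in initial.items() if n in available}
    load = {n: 0 for n in available}
    for node in alloc.values():
        load[node] += 1
    displaced = sorted(d for d, n in initial.items() if n not in available)
    for domain in displaced:
        target = min(sorted(available), key=lambda n: load[n])
        alloc[domain] = target
        load[target] += 1
    return alloc


first = allocate(INITIAL, {"M2", "M3", "M4"})   # M1 unavailable
third = allocate(INITIAL, {"M1", "M3", "M4"})   # M1 restored, M2 unavailable
print(first["D1"], first["D2"])   # M2 M3
print(third["D3"], third["D4"])   # M1 M3
```

Under this assumed tie-breaking rule the sketch gives the same placements as the first and third times in the example; it is not claimed to match the routers' behavior in every situation.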
In paragraphs that follow, described are data query caching techniques as may be used in conjunction with the foregoing described request routing techniques. Referring now to Figure 33, shown is an example embodiment of a flowchart of method steps for performing a data query. At step 200, a determination is made as to whether a data set in the data query cache corresponds to the current query being made. If so, control proceeds to step 202 where this data is retrieved and used by the query engine in formulating the query results that are displayed to the user. At this point, the processing stops at step 216.
If a determination is made at step 200 that no data set in the data query cache corresponds to the current query being made, control proceeds to step 204 where parents of the data query are determined. In this embodiment, parents of the current query are determined by dropping one of the terms. For example, if the query being made is for "MA AND RESTAURANTS AND FLOWERSHOPS", each of the three terms is sequentially dropped to form all combinations of two possible terms. In this instance, the set of parents is the following:
MA AND RESTAURANTS
MA AND FLOWERSHOPS
RESTAURANTS AND FLOWERSHOPS
It should be generally noted that in this embodiment, a search is made for only the parent terms. Similarly, other embodiments may go further in searching for results in the data query cache by also forming grandparent terms, as by dropping two terms. This process can be repeated for any number of terms being dropped and subsequently determining if any data sets in the data query cache correspond to the resulting terms.
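The formation of parents and grandparents by dropping terms may be sketched as follows. The function name ancestors and the representation of a query as a list of terms are assumptions for illustration only.

```python
from itertools import combinations


def ancestors(terms, dropped=1):
    """Return every conjunction formed by dropping `dropped` terms.

    dropped=1 yields the parents, dropped=2 the grandparents, and so on.
    The original order of the remaining terms is preserved.
    """
    keep = len(terms) - dropped
    if keep < 1:
        return []
    return [" AND ".join(combo) for combo in combinations(terms, keep)]


query = ["MA", "RESTAURANTS", "FLOWERSHOPS"]
print(ancestors(query, 1))
# ['MA AND RESTAURANTS', 'MA AND FLOWERSHOPS', 'RESTAURANTS AND FLOWERSHOPS']
print(ancestors(query, 2))
# ['MA', 'RESTAURANTS', 'FLOWERSHOPS']
```

Each resulting string could then be checked against the data query cache to see whether a corresponding data set exists.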
At step 205, a determination is made as to whether data results in the data query cache correspond to any of the parent terms. If not, control proceeds to step 212 where a closest ancestor may be used as a basis for starting to form the resulting data set. In one embodiment, preprocessing ensures that ancestor-based geography exists. In one implementation, that ancestor is a Verity term list associated with a particular state. This implementation uses API calls to retrieve the data identifiers corresponding to the resulting data to be included in the query results.
If, at step 205, it is determined that there are one or more data sets in the data query cache that correspond to one or more of the parent terms, control proceeds to step 206 where a cost is associated with each parent. One embodiment associates a cost with each parent term in accordance with the number of listings of each parent term. This may also be normalized and used in a percentage form by dividing the number of listings in the parent domain by the total number of listings in the query domain. This percentage represents the probability of a business listing belonging to the parent data set appearing in the database. Control proceeds to step 208 where the parent with the minimum cost is chosen as the starting data set for formulating the data results. At step 210, the minimum cost derivation sequence is applied to produce the resulting data query. Generally, the minimum cost derivation sequence is obtained by operating upon the least probability terms first. It should generally be noted that in other embodiments in which other extended parentage thresholds are used, such as grandparents, the determination of the start data set in step 208 may be the data set which is closest in terms of parentage and with the least number of listings in the data set. The proximity in parentage is the primary ranking basis, with the number of listings being secondary in determining ranking.
Referring now to Figure 34, shown is a diagram of one example used in step 210 for determining and applying the best derivation sequence. In this example, the query is for MA AND RESTAURANTS AND FLOWERSHOPS. As represented in state 230, it has been determined that MA is the starting data set which is located in the data query cache. In this example, the parentage has been extended to grandparents, and MA has been determined to be the first ranking data set in terms of parentage and number of listings in the data set.
At this point, control proceeds to one of two states, 232 representing "MA AND RESTAURANTS", or 234 representing "MA AND FLOWERSHOPS". The state to which control is advanced depends generally on choosing the path with the minimum associated cost at each step. In this instance, the number of elements in the data sets "FLOWERSHOPS" (state 234) and "RESTAURANTS" (state 232) may be considered in determining cost. If the number of elements in FLOWERSHOPS is less than the number of elements in the data set RESTAURANTS, control proceeds to state 234 where each business listing in the data set FLOWERSHOPS is examined to determine if it is also in MA. The resulting data set forms the set of all business listings in MA AND FLOWERSHOPS. In contrast, if the number of elements in the data set RESTAURANTS is less than FLOWERSHOPS, state 232 is entered and similar searching of the data set is performed. From either state 232 or 234, control proceeds to state 236 where searching of the data set elements is performed to produce the final resulting data set representing "MA AND RESTAURANTS AND FLOWERSHOPS". Generally, the approach just described is to advance to the next state which has the minimum cost associated until the final resulting data set is determined.
It should also be noted that some of the determination of data sets as used in performing queries may be done as preprocessing to partition the data sets. For example, in one embodiment, the data is partitioned by states. The adaptive techniques as described with regard to the GTE Superpages application described herein include partitioning the data sets based on geography, particularly within each state. In this instance, particular server nodes are designated as primary query servers based on geographic location by state. Additionally, as part of this partitioning of requests, the data query caches and term lists of identifiers are also partitioned according to state. In this embodiment, this partitioning is done as a preprocessing step prior to servicing a request in that the identifiers are formed and placed on each dedicated server node. Similarly, other data partitioning may also be performed as part of a preprocessing step. Generally, this partitioning may be determined based on expected data queries and data sets formed accordingly, for example, by examining log files with recorded data query search histories to determine frequently searched categories or combinations of categories.
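The derivation sequence of Figure 34 may be sketched as follows, under the simplifying assumption that the cost of each remaining term is the number of listings in its data set and that the order is fixed once by sorting on that count. The listing identifiers, the function name derive, and the use of in-memory sets are hypothetical.

```python
def derive(start_set, remaining):
    """Narrow the starting data set by the remaining terms, smallest set first.

    `remaining` maps a term name to its set of listing identifiers.
    Returns the final data set and the order in which terms were applied.
    """
    result = set(start_set)
    order = []
    for name, listing_ids in sorted(remaining.items(), key=lambda kv: len(kv[1])):
        # Examine each listing of the smaller set and keep those that are
        # also in the running result.
        result = {listing for listing in listing_ids if listing in result}
        order.append(name)
    return result, order


# Hypothetical listing identifiers for illustration only.
ma = {1, 2, 3, 4, 5}
remaining = {"RESTAURANTS": {2, 3, 4, 9, 10, 11}, "FLOWERSHOPS": {3, 4, 7}}
print(derive(ma, remaining))  # ({3, 4}, ['FLOWERSHOPS', 'RESTAURANTS'])
```

Because FLOWERSHOPS has fewer elements than RESTAURANTS in this made-up data, the sketch passes through the equivalent of state 234 before reaching the final data set of state 236.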
A query request, as made by a user, is generally the combination of boolean operators and search terms. In this embodiment, the general form of a term in a query request is: key=value in which the "key" represents some category or search term, such as STATE. "Value" represents the value which this key has in this particular query. With regard to the previous example, "S=MA" may represent the query term STATE=MA. Key-value pairs or terms may be joined by the logical boolean AND operation, represented, for example, as "&". The logical boolean OR operation may also be represented, for example, by another symbolic operator such as a ",". For example, when looking for either cities of ACTON or BOSTON, this may be represented as
T=ACTON,BOSTON
The number and types of "keys" varies with embodiment. For example, in this embodiment, keys include (T) City, (B) Business Listing, (S) State, (R) Sort Order, (LT) Latitude, (LO) Longitude, and (A) Area Code. In this application, for example, LT and LO may be used to calculate data sets relating to proximity searches, such as restaurants within thirty (30) miles of Boston.
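As an illustration of the key=value form, the following minimal sketch splits a query string on "&" for AND and on "," for OR. The function name parse_query and the dictionary result are assumptions; the actual Parser output format is not specified here.

```python
def parse_query(query):
    """Parse a string such as 'S=MA&T=ACTON,BOSTON' into key -> list of values.

    '&' joins terms (AND); ',' separates alternative values of one key (OR).
    """
    terms = {}
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        terms[key.strip()] = [v.strip() for v in value.split(",")]
    return terms


print(parse_query("S=MA&T=ACTON,BOSTON"))
# {'S': ['MA'], 'T': ['ACTON', 'BOSTON']}
```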
The Data Query Cache 850, in this embodiment, generally includes a "hot" and "cold" cache. In this embodiment, the caching technique implemented is the LRU (Least Recently Used) policy by which elements of the cache are selected for replacement in accordance with time from last use. These and other policies are generally known to those skilled in the art. Generally, the "hot" cache may include the most recently used items and the cold cache the remaining items. In this embodiment, each of the data query caches and other caching elements as depicted in Figure 2, may be fast memory access devices, as known to those skilled in the art, used generally for caching.
It should generally be noted that in this particular embodiment, the "hot" cache is implemented as storing the data in random access memory. This may be distinguished from the storage medium associated with the "cold" cache representing those items which are determined, in accordance with caching policies such as the LRU, to be least likely to be accessed when compared with the items in the hot cache which are determined to be more likely to be accessed.
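A minimal sketch of a two-tier LRU arrangement is shown below. It assumes that an item evicted from the hot tier is demoted to the cold tier and that a cold item is promoted when accessed; the class name HotColdCache, the capacities, and the use of Python's OrderedDict to stand in for both storage media are illustrative assumptions.

```python
from collections import OrderedDict


class HotColdCache:
    """Two-tier LRU sketch: least recently used hot items are demoted to cold."""

    def __init__(self, hot_capacity, cold_capacity):
        self.hot = OrderedDict()    # stands in for random access memory
        self.cold = OrderedDict()   # stands in for slower persistent storage
        self.hot_capacity = hot_capacity
        self.cold_capacity = cold_capacity

    def get(self, name):
        if name in self.hot:
            self.hot.move_to_end(name)      # mark as most recently used
            return self.hot[name]
        if name in self.cold:
            value = self.cold.pop(name)
            self.put(name, value)           # promote to the hot tier
            return value
        return None

    def put(self, name, value):
        self.cold.pop(name, None)
        self.hot[name] = value
        self.hot.move_to_end(name)
        while len(self.hot) > self.hot_capacity:
            old_name, old_value = self.hot.popitem(last=False)
            self.cold[old_name] = old_value  # demote least recently used item
            while len(self.cold) > self.cold_capacity:
                self.cold.popitem(last=False)  # drop the oldest cold item


cache = HotColdCache(hot_capacity=2, cold_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, name.upper())
print(list(cache.hot), list(cache.cold))  # ['b', 'c'] ['a']
```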
In this embodiment, a double ended queue structure is used to store cached objects, but other data structures known to those skilled in the art may be used in accordance with each implementation. Data sets that are stored in the data query cache and page cache each correspond to a particular search query. In other words, a mapping technique may be used to map a particular query to corresponding data as stored in the data query cache and the page cache. Generally, this mapping uniquely maps a data query to a name referring to the data set of the data query. In this embodiment, this allows quick access of the data set associated with a particular query and quick determination if such a data set exists, for example, in the data query cache.
Referring now to Figure 35, shown is a flowchart of an embodiment of the steps for forming a name associated with a data set, as may be stored in the data query cache or page cache. At step 240, a subset of query terms is determined such that a string representing a particular query is uniquely mapped to a name corresponding to a data set. In this embodiment, the subset of keys that are used in mapping a string corresponding to a query to a name of a data set include:
Proximity, City, State, Street, Zip, Category, Category Identifier, Business name, Area code, Phone number, Keywords, and National Account. Generally, "Proximity" represents the proximity in physical distance to/from a geographic entity, such as a city. "City", "State", "Street", "Zip", "Area Code", "Phone Number", and "Business Name" represent what the keys semantically describe as pertaining to a business listing. "Category" represents a classification as associated with each business, such as representing a type of business service. "Category Identifier" is an integer identifier representing a category id. "Keywords" indicate an ordering priority for the resulting data set. "National Account" represents a business or service level parent-child relationship where the national account indicates the parent. An example is a parent-child relationship between a parent corporation and its franchises.
At step 244, a query string corresponding to a particular user query is formed using the original string as formed, for example, by the Parser of Figure 2. The query string includes only those terms which are included in the subset as identified in step 240. If the original string does not include an item that is in the subset, for example, since the user query does not include the item as a search term, that item is omitted in forming the query string corresponding to the data set. At step 248, this query string is used to determine if a data set is located in the data query cache that corresponds to the current user query request. In this embodiment, the data sets each correspond to a filename. Thus, a lookup as to whether a data set corresponding to a particular user query exists may be determined by performing a directory lookup, for example, using file system services as may be included in an operating system upon a device which serves as a fast memory access or other caching device. It should be noted that this technique may be used generally within the Superpages Front End Server and BackOffice to form unique names that correspond to particular search terms. For example, one embodiment may include services for operating upon the original query string as formed by the Parser to produce parents and grandparents of the terms included in a query when performing the method steps of Figures 33 and 34 if there is no exact data set match in the data query cache. This may provide the advantage of insulating other code, such as in data encapsulation, from knowing the internal structure of the query string. Generally, as known to those skilled in the art, this is a common programming technique to isolate code portions from changes in data types and structures to minimize, for example, the amount of recompilation when a new data type is introduced or an existing data type modified.
Other techniques, such as hashing, may be used to generate a unique identifier for the input string, as known to those skilled in the art.
It should be generally noted that a similar mapping technique is used in forming a Page Cache name. The technique used is as described for forming the Query Cache filename with additional qualifying terms in accordance with the "look and feel", such as display features, used to produce the Page Cache name. For example, if the displayed resulting HTML page includes 15 listings/page, the Page Cache name includes a parameter in forming the name uniquely identifying the filename including the result set for a query in this particular display format.
Generally, in this embodiment, the data query cache includes cache objects in which each cache object corresponds to a particular cached query resulting data set. Referring now to Figure 36, shown is a block diagram of one embodiment of a data set as stored in the data query cache. Generally, each data set 250 includes header information 252 and information corresponding to one or more business listings. Generally, header information may include information describing the data query set, such as the number of business listings in the data set. Other types of information may be included in accordance with each particular application and implementation.
Each business listing 254 generally includes information that describes the business listing. More particularly, this information includes data that is cached as needed by other components of the Front End Server, for example, in performing various searches, data retrieval, and other operations upon data in accordance with functionality provided by the embodiment. In this instance, the following types of fields of information are stored for each business listing 254:
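The name-forming steps of Figure 35 may be sketched as follows. The key spellings, the "&" separator, and the function names cache_name and hashed_name are assumptions for illustration; the hashing variant uses SHA-256 from Python's standard hashlib module only as one possible hash, for which collisions are improbable rather than impossible.

```python
import hashlib

# Subset of keys used for naming, in a fixed order (spellings are assumed).
NAME_KEYS = ["Proximity", "City", "State", "Street", "Zip", "Category",
             "Category Identifier", "Business Name", "Area Code",
             "Phone Number", "Keywords", "National Account"]


def cache_name(query_terms, name_keys=NAME_KEYS):
    """Build a name from only the naming keys that the query actually uses."""
    parts = [f"{key}={query_terms[key]}" for key in name_keys if key in query_terms]
    return "&".join(parts)


def hashed_name(query_terms):
    """Alternative: a fixed-length identifier derived by hashing the name."""
    return hashlib.sha256(cache_name(query_terms).encode("utf-8")).hexdigest()


terms = {"Category": "RESTAURANTS", "State": "MA", "Sort Order": "A"}
print(cache_name(terms))  # State=MA&Category=RESTAURANTS
```

Because the keys are always emitted in the same fixed order, two queries with the same naming terms map to the same name regardless of the order in which the user supplied them, and terms outside the subset (here the hypothetical "Sort Order") do not affect the name.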
1) number of categories associated with this business listing
2) latitude
3) longitude
4) business name
6) state
7) list of categories associated with this business listing
8) database key or identifier used as an index into the databases
9) relevance information
10) advertiser priority
In the above fields, relevance information is Verity-specific information as it relates to the query. For example, this generally represents the frequency of words or terms in a document. The advertiser priority indicates a service level that may be used in presenting business listings, for example, in a particular order to a user. For example, if a first advertiser purchases "gold" level advertising services, and a second advertiser purchases "silver" level advertising services, when a user requests only 15 listings to be displayed, the "gold" level advertisements may be displayed prior to the other advertisements by other advertisers, such as the "silver" level service purchaser. Thus, a higher level of service may guarantee an advertisement be placed earlier in the displayed results.
The technique used to store the data in the data cache from memory includes object serialization and deserialization techniques, as known to those of ordinary skill in the art. These techniques transform an internal storage format, as may be stored in random access memory, to a format suitable for persistent storage in a file system, as in the data query cache. The complementary operation is also performed from persistent storage to the in-memory copy. For each of the above-named fields, object serialization, i.e., from memory to persistent storage device in cache, is performed by storing the data type, its length, and the data itself. It should be noted that the length may not be needed for each data field, for example, in fixed length data types. The complementary operation of object deserialization is generally performed by reading the fields in the same order as written to the cache.
In this embodiment, other caches may have other storage techniques. For example, the Page Cache may be implemented as HTML files in a file structure located on a disk or other storage device. The PHTML execution tree may be implemented as an in-memory linked list or other abstract data structure representation of the C++ objects.
It should be noted that in this particular embodiment, the data query cache may include different types of cached geographical data as may be used in performing different data queries. For example, the type of data cached described in the prior paragraphs is the "normal" business listing data as associated with a well-defined geographic area. Other businesses, for example, such as a florist or an airline, may not be associated with a single well-defined geographic location. A business may not have any geographic bounds, such as if it is an Internet business with a virtual storefront accessible on the Internet. Also, other businesses may be located in a particular well-defined geographic area, such as an airline with a physical presence in a particular city, but the service area which corresponds to the service offered does not correspond to the location of the business itself. To include businesses with these particularities, in addition to the "normal" business listing just described in which the geographic business location and service areas correspond, the concepts of multi-city and total-city placements have been included in this embodiment. Generally, multi-city placement may be described as representing a business' service area in multiple cities when data queries are performed. An example may be a plumbing service located in three (3) cities with service areas in ten (10) cities. The total-city placement may generally be described as representing a business' service area in all cities when searches are performed. An airline is generally an example of this which services all major U.S. cities. Generally, in this embodiment, the total-city and multi-city search results are cached separately from the "normal" query results, but are composited with the normal search results prior to retrieving the data from the database.
It should generally be noted that in this embodiment, the total and multi-city query results are retrievable independent of the "normal" search results. However, the storage format for this information, in this embodiment, may be as described for "normal" query results. Generally, other embodiments may use a different format for storage than the "normal" search results, for example, if other information is deemed to be important in accordance with each implementation.
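The type, length, and data layout described above may be sketched as follows. The one-byte type tags, the four-byte little-endian length, and the function names are assumptions for illustration; the actual on-disk format is not specified in this description. For simplicity the sketch writes a length for every field, including the fixed length types for which it would not strictly be needed.

```python
import struct


def write_field(buf, value):
    """Append one field as: 1-byte type tag, 4-byte length, then the data."""
    if isinstance(value, int):
        tag, payload = b"I", struct.pack("<q", value)
    elif isinstance(value, float):
        tag, payload = b"F", struct.pack("<d", value)
    else:
        tag, payload = b"S", str(value).encode("utf-8")
    buf.extend(tag + struct.pack("<I", len(payload)) + payload)


def read_fields(data):
    """Read fields back in the same order in which they were written."""
    fields, pos = [], 0
    while pos < len(data):
        tag = bytes(data[pos:pos + 1])
        (length,) = struct.unpack_from("<I", data, pos + 1)
        payload = bytes(data[pos + 5:pos + 5 + length])
        pos += 5 + length
        if tag == b"I":
            fields.append(struct.unpack("<q", payload)[0])
        elif tag == b"F":
            fields.append(struct.unpack("<d", payload)[0])
        else:
            fields.append(payload.decode("utf-8"))
    return fields


# Hypothetical listing: category count, latitude, longitude, name, state.
listing = [2, 42.36, -71.06, "Joe's Diner", "MA"]
buffer = bytearray()
for field in listing:
    write_field(buffer, field)
print(read_fields(buffer) == listing)  # True
```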
The technique of performing the total and multi-city query search optimization in conjunction with the normal query caching will be described in paragraphs relating to Figures 37 and 38 that follow.
Referring now to Figures 37 and 38, shown is a flowchart of an embodiment of a method for integrating total-city and multi-city cache results into "normal" cached search results. At step 260, a total-city cache name corresponding to the data query is formed. In one embodiment, the total-city cache name is formed by starting with the string "SCOPE=T" to identify a total-city name. Additionally, the following information is extracted from the original query string, as formed by the parser: category, category id, business name, street address, keywords, longitude, latitude. These key-value pairs are extracted from the original query string and appended to the "SCOPE=T" to form the total-city cache name. In one embodiment, these functions of extracting the information from the original query string and forming the total-city cache name may be performed by the same software as forming the name for the data query cache "normal" query name, such as by API calls to the same routines with parameters, as known to those of ordinary skill in the art of programming. At step 262, it is determined if the total-city query data set corresponding to the total-city cache name for the current query exists. If it does, control proceeds to step 264 where the total-city data set cached item is moved to the hot cache, if not already in the hot cache. A reference to this data set is saved for later retrieval in other processing steps. If at step 262, a determination is made that the total-city query cached data set corresponding to the total-city cache name does not exist, control proceeds to step 266 where a search is performed for the total-city query. At step 268, the search results are cached, as in the "hot" cache. A reference to these search results is stored for use in later processing steps. Generally, an empty or null search result stored in cache may be just as important for performance as a non-null search result that is cached. Control proceeds to step 270 of Figure 38 where a multi-city cache name is constructed representing the multi-city cache corresponding to the current data query.
In one embodiment, this multi-city cache name may be constructed by forming a string using the same fields extracted from the original data query string as formed by the parser in conjunction with forming the total-city name. Similar to forming the data query name for the "normal" cached search results, the string corresponding to the cached data set for a given query uniquely identifies the data set. In forming the multi-city cache name, appended to the concatenated key-value pairs is a string of "SCOPE=M" rather than the string "SCOPE=T", as with the total-city cache name.
At step 272, a determination is made as to whether there is multi-city cached data corresponding to the current multi-city cache name. If, at step 272, a determination is made that such a data set exists in the multi-city cache, control proceeds to step 274 where the data is moved to the "hot" cache, if not already located there. Additionally, a reference to this location in the "hot" cache is saved for use in later processing steps. If, at step 272, a determination is made that such a data set does not exist in the multi-city cache, control proceeds to step 276 where a search of the database is performed. The query results, if any, are cached in the "hot" cache with a reference to the results saved for use in later processing steps.
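The name construction and lookup steps of Figures 37 and 38 may be sketched as follows. The "&" separator, the dictionary used as the "hot" cache, and the function names are assumptions for illustration. Consistent with the description, "SCOPE=T" leads the total-city name and "SCOPE=M" trails the multi-city name, and empty search results are cached as well.

```python
# Keys extracted from the original query string for both scoped names.
SCOPE_KEYS = ["category", "category id", "business name", "street address",
              "keywords", "longitude", "latitude"]


def scoped_cache_name(query_terms, scope):
    """Form a total-city ('T') or multi-city ('M') cache name."""
    pairs = "&".join(f"{key}={query_terms[key]}"
                     for key in SCOPE_KEYS if key in query_terms)
    if scope == "T":
        return "SCOPE=T" + ("&" + pairs if pairs else "")
    if scope == "M":
        return (pairs + "&" if pairs else "") + "SCOPE=M"
    raise ValueError("scope must be 'T' or 'M'")


def cached_or_search(name, hot_cache, search):
    """Return the cached data set, performing and caching a search if absent.

    An empty result list is cached too, so a repeated query does not
    trigger a second search.
    """
    if name not in hot_cache:
        hot_cache[name] = search(name)
    return hot_cache[name]


terms = {"category": "AIRLINES", "city": "BOSTON"}
print(scoped_cache_name(terms, "T"))  # SCOPE=T&category=AIRLINES
print(scoped_cache_name(terms, "M"))  # category=AIRLINES&SCOPE=M
```

Since city and state are not among the extracted keys, the same scoped name results for the same category query regardless of the city searched.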
At step 280, the total-city and multi-city data cache results are integrated with the "normal" query results. After the "normal" query is performed, but before sorting the search results, the total-city-cached results, if any, may be combined with the "normal" query results. If there are no total-city cached results, the multi-city results may be included, if any.
The combined search results are then sorted such that any redundant listings are removed. Any additional processing is performed, as in accordance with the user query, for example, as producing the listings which begin with "B", or only listing the top ranked fifteen (15) listings as ranked in accordance with other user specified criteria.
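The integration step may be sketched as follows. The listing fields ("id", "priority", "name"), the sort key, and the function name integrate are hypothetical; the rule that multi-city results are used when there are no total-city results follows the description of step 280.

```python
def integrate(normal, total_city, multi_city, limit=15):
    """Combine scoped results with normal results, de-duplicate, sort, truncate.

    Assumption: listings are dictionaries with a unique 'id', and ranking is
    by advertiser priority (lower first), then by business name.
    """
    extra = total_city if total_city else multi_city
    combined = {}
    for listing in list(normal) + list(extra):
        combined.setdefault(listing["id"], listing)   # drop redundant listings
    ranked = sorted(combined.values(), key=lambda l: (l["priority"], l["name"]))
    return ranked[:limit]


normal = [{"id": 1, "priority": 2, "name": "Joe's Diner"},
          {"id": 2, "priority": 1, "name": "Acme Grill"}]
total = [{"id": 3, "priority": 1, "name": "Sky Air"},
         {"id": 1, "priority": 2, "name": "Joe's Diner"}]
multi = [{"id": 4, "priority": 3, "name": "Pipe Pros"}]
print([l["id"] for l in integrate(normal, total, multi)])  # [2, 3, 1]
```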
In all the caches, a garbage collection technique may be included to remove or delete cached objects that have been determined to be "old" in accordance with predetermined criteria. For example, in one embodiment using the LRU caching scheme, whenever the amount of free cache space falls below a threshold level, the garbage collection routine is invoked. The threshold level includes parameters relating to a predetermined number of cache objects and the accumulated size of the objects in the cache. In this embodiment, although there may be multiple conceptual caches, such as the "normal" data query cache, the multi-city cache, and the total-city cache, the cached results may physically reside in the same "hot" and "cold" caching devices. However, in this embodiment, the different types of caching results may be accessed independent of the other caching results. Other embodiments may have other organizations of the caches in accordance with other implementation and associated data requirements.
[Information Retrieval]
A variety of information retrieval techniques may be used to retrieve records stored in the Primary Database 812. Further details of the query engine 862 are presented in schematic format in Figure 39. When the parse driver 858 of the parser 866 of one of the servers 808 delivers a parsed instruction to the query engine 862, the query engine 862 may, in an embodiment of the invention, include information retrieval software 908 to retrieve records from the Primary Database 812 that correspond to the user's query. The query engine 862 may include more than one form of information retrieval software. For example, the query engine, in addition to including the information retrieval software 908 that is to be used to obtain listings in response to user queries, may further include banner ad retrieval software 909 for retrieving advertisements that relate to the user's query.
In an embodiment of the invention, the information retrieval software 908 may include functionality of software such as the Information Server Version 3.6 software commercially available from a company known as Verity. Other commercial packages of information retrieval software are available, and the techniques described herein could also be employed using proprietary software coded by the user. In an embodiment, the information retrieval software 908 includes the Information Server Version 3.6 software and additional extensions provided by the host of the GTE Superpages system.
Referring to Figure 40, steps by which the information retrieval software 908 obtains results are set forth in a flow chart 83. The information retrieval software 908 may at a step 82 access markup language files 906, as depicted in Figure 25, which are produced by the extraction routines 902 from the normalized data 900. In an embodiment, the markup language files consist of business listings that are stored in the Primary Database 812. The information retrieval software 908 may then, at a step 84, produce term lists 836 that are further used by the information retrieval software 908 to handle queries that are delivered to the query engine 862. The term lists 836 may consist of a linked list for each term that appears in one of the business listings, with the elements of the linked list including a document identifier for the business listing and certain statistics regarding the frequency of occurrence of the particular term in each document and in the document set as a whole. The banner ad retrieval software 909 may similarly generate and use banner ad term lists 837 that are further used by the banner ad retrieval software 909 to handle generation of appropriate banner ads. Next, at a step 90, the term lists, which in an embodiment are generated using Verity software, may be expanded at a step 86 to include synonyms for the terms appearing in the business listing. For example, if the term "diner" appears in a business listing, then the term "restaurant" might be assigned to the file for that business listing as stored in the Primary Database 812. The expansion of the listings to include synonyms of the words included in the listings may be accomplished by execution of PHTML scripts or other programming techniques. The expansion may establish a hierarchical structure; for example, the term "restaurant" may be stored in a tree that includes the sub-category of "ethnic restaurant," which may further include the sub-category "greek restaurant."
PHTML scripts may be provided to establish the tree structure and to operate on the tree structure to retrieve results that will be provided to the user. The steps 82, 84 and 86 may be accomplished at initialization of the system, thus establishing and expanding the term lists 836, 837 for later use.
Once the system is initialized, the system may operate to obtain results that are to be displayed to the user. The steps for obtaining results may be seen in a flow chart 88 displayed in Figure 41. Referring to Figure 41, the parse driver 858 may at a step 20 parse a user query and deliver the parsed query in suitable form for handling by the query engine 862. The query engine may include the information retrieval software 908. At a step 22, the query engine 862 may operate the information retrieval software 908 to take the parsed user request and expand the query, turning the user request into a detailed query. Next, at a step 24, the information retrieval software may operate on the expanded term lists 836 by identifying documents associated with the terms identified in the expanded query. In an embodiment, the term lists 836 are the business listings described in connection with steps 82, 84 and 86 above, expanded to include synonyms and terms that are determined to be related to the words in the business listing. Identification of documents may be accomplished by a variety of information retrieval techniques. Documents may also be associated with queries by sorted relevancy ranking, clustering (automated grouping of related documents), automated document summarization (creation of content abstracts, not simply the first few sentences of the document) and query-by-example (turning an individual document into a query in order to retrieve "more documents like this"). These functions may be accomplished by software techniques, such as having a table of pointers having as an argument a tokenized version of each possible term from the expanded user query from the step 22. The table of pointers may point to the location of a term list 836 for each such term. The term list may be a linked list of documents that include the term.
The linked list may include information about each document, such as the number of occurrences of the term in the document, the inverse frequency of the term in the entire set of documents, the association of the document with other documents, the association of the document with categories, and the like.
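A minimal sketch of building term lists is shown below. It uses a Python dictionary in place of the table of pointers and Python lists in place of linked lists; the whitespace tokenization, the sample listings, and the function name build_term_lists are assumptions for illustration and do not represent the Verity software.

```python
from collections import Counter, defaultdict


def build_term_lists(documents):
    """Map each term to a list of (document identifier, occurrences) entries."""
    term_lists = defaultdict(list)
    for doc_id, text in documents.items():
        for term, count in Counter(text.lower().split()).items():
            term_lists[term].append((doc_id, count))
    return term_lists


# Hypothetical business listings keyed by document identifier.
listings = {"b1": "joe's diner and bowling", "b2": "greek diner diner"}
term_lists = build_term_lists(listings)
print(term_lists["diner"])        # [('b1', 1), ('b2', 2)]
print(len(term_lists["diner"]))   # number of documents containing the term: 2
```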
A variety of different techniques can be used to index documents for information retrieval. In an embodiment, an indexing architecture such as that provided by Verity allows for incremental indexing, so that only new, updated or deleted documents require changes, avoiding the need for a complete re-index each time a document changes. Online identifiers may be provided, so that searches can continue while the identifiers are modified. This function is also provided by the Verity software. At a step 28 a variety of weighting algorithms can be used to rank documents identified in the step 24 according to the information stored in the term lists 836. For example, a simple weighting algorithm might take a single term query, such as a category of information, and rank each document in a term list 836 in numerical order according to the product of the term frequency (the number of times a term appears in the document) and the inverse document frequency (the inverse of the number of times the term appears in the entire document set).
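The simple weighting algorithm described above may be sketched as follows, taking the inverse document frequency literally as the inverse of the number of times the term appears in the whole document set. The data layout and the function name rank are assumptions for illustration. For a single term query the inverse document frequency is the same for every document, so the resulting order is the order of term frequency.

```python
def rank(term, doc_term_counts):
    """Rank documents for a single-term query by tf x idf, highest first.

    `doc_term_counts` maps a document identifier to a {term: count} mapping.
    Ties are broken by document identifier.
    """
    total = sum(counts.get(term, 0) for counts in doc_term_counts.values())
    if total == 0:
        return []
    idf = 1.0 / total
    scored = [(doc_id, counts[term] * idf)
              for doc_id, counts in doc_term_counts.items() if counts.get(term)]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical term counts for three business listings.
docs = {"d1": {"pizza": 3}, "d2": {"pizza": 1, "diner": 2}, "d3": {"diner": 1}}
print(rank("pizza", docs))  # [('d1', 0.75), ('d2', 0.25)]
```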
Once the documents are ranked, at a step 30 a list of the ranked documents may be further processed by the information retrieval software to provide a results page. In particular, at the step 30, the information retrieval software 908 may determine categories into which the retrieved documents fall. In an embodiment, the categories are yellow pages categories, which have been previously assigned to the documents, which are business listings, prior to entry of the business listings in the Primary Database 812. Thus, at the step 30, the information retrieval software 908 determines what categories are associated with the business listings retrieved by the ranking at the step 28. Next, at a step 98, the information retrieval software 908 may compare the categories identified at the step 30 to the terms in the user query. If categories are present that do not include any of the terms in the user query, then, at a step 92, such categories may be discarded. Thus, the user will not retrieve categories that are unrelated to the user query. Such categories might otherwise appear, for example, if the information retrieval software 908 retrieves a business listing that is associated with two unrelated categories, only one of which is relevant to the user query. For example, a query for a restaurant might retrieve a listing for "Joe's restaurant and bowling alley." The information retrieval software 908 might then retrieve the categories "restaurants" and "bowling" that would have been associated with that listing. The "bowling" category would be discarded, because the user query for a restaurant is unrelated to the "bowling" category. The term comparison may use an expanded version of the terms in the query and in the categories. Thus, a category would not be discarded if it includes a synonym of a query term, even if the category does not include an exact term match.
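The category comparison described above, in which categories sharing no expanded term with the query are discarded, might be sketched as follows. The function `filter_categories` and the `SYNONYMS` table are hypothetical; the actual expansion mechanism is not specified at this level of detail.

```python
def filter_categories(categories, query_terms, synonyms):
    """Keep categories that share at least one expanded term with the query."""
    def expand(terms):
        expanded = set()
        for term in terms:
            expanded.add(term)
            expanded.update(synonyms.get(term, ()))
        return expanded

    query = expand(t.lower() for t in query_terms)
    kept = []
    for category in categories:
        # Compare the expanded category terms against the expanded query terms.
        if expand(category.lower().split()) & query:
            kept.append(category)
    return kept


# Hypothetical synonym table used for term expansion.
SYNONYMS = {"restaurant": {"restaurants", "diner", "eatery"},
            "restaurants": {"restaurant"}}
kept = filter_categories(["restaurants", "bowling"], ["restaurant"], SYNONYMS)
```

With the "Joe's restaurant and bowling alley" example, `kept` retains "restaurants" and drops "bowling"; a category named "diner" would also survive, because it matches a synonym of the query term.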
Once the non-matching categories are discarded at the step 92, the information retrieval software may, at a step 94, determine whether there are any remaining categories.
If not, then control proceeds to a step 96, at which the user is informed that there are no matching categories. The user may then be returned to the query screen. If, at the step 94, at least one category remains, then, at a step 98, the information retrieval software determines whether there is more than one category. If not, then at a step 100 the system may display the actual business listings that appear in that one category to the user. If at the step 98 it is determined that more than one category remains, then at a step 102 the system may display a results page that consists of a list of the remaining categories. The results page may further include an indication of the number of listings that are associated with each category. The document identifiers established for the information retrieval software 908 may maintain pointers to other documents or to sources of the documents, such as URLs or file names. Thus, the identifiers may be stored apart from the documents, allowing separate, non-invasive use of the identifiers, while maintaining the integrity of the data.
[Common Term Optimization (CTO)]
In an embodiment of the information retrieval system disclosed herein, common terms may be identified in order to optimize the retrieval of information in cases where user queries employ such terms.
A series of steps may be performed as pre-processing operations in order to classify and establish query result sets for common queries. Referring to a flow chart 31 in Figure 42, at a step 32 common terms may be identified prior to system initialization. Designation of common terms may be performed based on a number of different factors. For example, a single word might in theory be designated a common term, if it appears with a high frequency in result sets obtained by users. It is noted that a single word common term may offer relatively little benefit in search efficiency, because the term lists 836 already permit searching based on individual terms. Alternatively, common terms might consist of multiple word combinations of any length, whether bi-grams, tri-grams, or n-grams. Thus, words that co-occur in high frequency can be designated as common terms, such as in a bi-gram format. For example, the bi-gram "Boston - restaurant" might be designated a common term. Next, at a step 33, terms may be linked to specific contexts; that is, terms may be designated or classified as common terms in part according to their context. For example, the term "Boston" might be considered a common term if entered in the "city" field, but it might not be considered a common term if entered in a "business name" field or a "category" field. Similarly, the term "restaurant" might be a common term in the "category" field, but would not be considered a common term in the "city" field. Thus, at the step 33, the common term sets may be structured to reflect context. Thus, the bi-gram "Boston - Restaurant" might be stored in an expanded form that reflects both the term and the context in which it is to be treated as a common term, for example "City = Boston, Category = Restaurant." Referring to Figure 42, it may be desirable to expand, at a step 35, the terms that are to be designated as common terms.
Thus, each term might be expanded to include both synonyms for the term and other terms that are semantically related to the common term in the established context for the term. For example, the common term "category = restaurant" might be expanded to cover results in which synonyms for restaurant are included in the results, such as "diner," "bar and grill," "eatery" and the like. Similarly, a city term might be expanded to include suburbs or neighborhoods; thus, the term "City = New York" would be expanded to include "City = Brooklyn," "City = Queens," and "City = Manhattan." Note that the synonyms for a given term might be different depending on the context. For example, the term "Dorchester" might be a related term for "City = Boston," but it might not be a related term for "business name = Boston."
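Context-dependent expansion of this kind can be illustrated by keying the expansion table on the pair of context and term rather than on the term alone. The table `RELATED` and the function `expand_common_term` below are hypothetical names chosen for this sketch; the entries simply restate the examples given in the text.

```python
# Hypothetical expansion table keyed by (context, term).
RELATED = {
    ("city", "new york"): ["brooklyn", "queens", "manhattan"],
    ("city", "boston"): ["dorchester"],
    ("category", "restaurant"): ["diner", "bar and grill", "eatery"],
}


def expand_common_term(context, term):
    """Return the term plus any terms related to it in the given context."""
    key = (context.lower(), term.lower())
    return [key[1]] + RELATED.get(key, [])
```

Because the key includes the context, "Boston" entered in the city field expands to include "Dorchester", while "Boston" entered in the business name field expands to nothing beyond itself.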
The pre-processing steps 32, 33 and 35 might be accomplished in a different order, and other steps might be included in embodiments of the invention. Once common terms are identified, linked to contexts, and expanded at the pre-processing steps 32, 33 and 35, it is possible to establish lists or identifiers at a step 46 that include the expanded common term n-grams. One way of dealing with common term combinations would be to generate in advance term lists 836 that are predicted to be used with some frequency (e.g., restaurants, Boston, New York, etc.) and to pre-calculate the intersection of the likely combinations. This approach requires substantial processing and would have to be performed frequently, given frequent changes to the identifiers. Instead, it is possible, at the step 46, to create special identifiers, or term lists 836, that represent the expanded common terms, as linked to their contexts. Thus, a term list 836 might consist of a linked list of documents, such as business listings, that contain the terms "Boston" and "restaurant" (or synonyms thereof) in the contexts in which those terms are common. The term lists 836 may, like other term lists 836 described elsewhere herein, further include information as to the term frequency of each term, synonym or related term, and the inverse document frequency of the term, synonym or related term in all documents in the set. In an embodiment, the synonyms and related terms may be included in the actual business listings that are used to generate term lists 836, so that those listings will be included in the generation of common term lists. In an embodiment, the listings themselves may be classified as to common terms and synonyms or related terms of those terms. Listings may be further classified as to sub-contexts, depending on the search context.
Listings using identical terms should also be included in term lists, because they use identical token identifiers for such terms. For example, the term "Boston" should be understood in a nationwide search to include listings in both Boston, Massachusetts and Boston, Kentucky, because the token for the term "Boston" will be the same in each case.
Result sets must be identified as tokenwise semantically related to the classifications that are possible in a search. Results are thus classified into common term groups on a listing-by-listing basis.
At a step 48, the common term lists 836 for combined terms can be stored in a designated area of the primary database 812, front end server 804, or server node 808-810 that allows a rapid search in the event common term combinations are included in the user query. The common term lists are thus assigned to a special results area for common term searches.
The steps 46 and 48 may be performed upon initialization of the system. Thus, with the pre-processing steps 32, 33 and 35 and the initialization steps 46 and 48, result sets are established for common term searches, and the result sets are stored in a special location in memory for rapid retrieval.
Next, at a step 49, query rules may be established that direct appropriate user queries to the special location in memory established at the step 48. Referring to Figure 43, the user might enter a query on a template 34 that is displayed as a page, such as a markup language page, on the user's browser 824. The template might include fields 36, such as a category field 38, a business name field 40, a city field 42 and a state field 44. When the user enters a term into one or more of the fields 36 and initiates a query, such as by pressing "enter" on the keyboard or clicking the appropriate screen location, the query is delivered to the parser 866 of the server 808 to which that user has been routed. The query is then used, as described above in connection with Figure 41, to retrieve documents. In an embodiment of the invention, the documents that are retrieved at the step 28 and displayed at the step 30 of Figure 41 are a set of matching categories for the query. For example, as depicted in Figure 44, if the user enters the category "art supplies," the information retrieval software 908 may retrieve a set of matching categories that relate to art supplies.
The retrieved categories may be ordered alphabetically, by order of significance, or grouped by sub-categories. The user then may select categories among the matching categories to receive either further sub-categories or documents, such as advertisements or other markup language pages, that correspond to the categories. In an embodiment, rather than matching categories, the information retrieval software 908 may immediately retrieve matching documents, such as specific advertisements or other markup language pages. This direct retrieval step may be accomplished, for example, when one of the user-entered categories is an exact match to one of the categories included in the term lists 836. A similar series of steps takes place if the user enters a query for a particular location in the city field 42 or the state field 44, or for a business name in the business name field 40. The information retrieval software 908 retrieves documents from the term lists 836 that correspond to a ranking of an expansion of the user-entered query.
When both a category and a location or a business name, or all three, are entered by the user, then the information retrieval software 908 may, in a conventional manner, retrieve term lists 836 that correspond to each of the terms of the query, such as a list corresponding to the category "restaurant" and a list corresponding to the city field "Boston." The information retrieval software 908 could then perform an intersection of the two sets and perform a ranking of the related categories (e.g., Italian restaurants in Boston, French restaurants in Boston, etc.) or related listings (for specific Boston restaurants).
Because the term list 836 for documents containing the term "Boston" (including all businesses in Boston) and the term list 836 for documents containing the term "restaurant" (including all restaurants, nationwide) are both very large, the processing involved in retrieving each list and performing an intersection in order to identify matching categories or documents can be substantial. Accordingly, it is desirable to reduce the processing involved.
The information retrieval software 908 may be programmed with query rules at the step 49 to recognize when a query includes a common term n-gram, such as "City = Boston; category = restaurant." That is, whatever common terms are identified at the pre-processing steps 32, 33 and 35 should be recognized by the information retrieval software 908, so that queries that use the common terms in the appropriate contexts (or synonyms or related terms in those contexts) are designated for special processing. In particular, the information retrieval software 908 may be programmed to execute the search for the user's query in the special area of memory that was established for storage of the special common term lists 836 at the step 48 of Figure 42.
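The routing of queries to precomputed common term lists, with the conventional intersection as a fallback, might look like the following Python sketch. The function `lookup`, the dictionary layout, and the document identifiers are assumptions made for illustration, and synonym expansion of the query is omitted for brevity.

```python
def lookup(query, common_lists, term_lists):
    """Answer a fielded query, preferring a precomputed common term list."""
    # Normalize the query into a context-qualified key, e.g.
    # (("category", "restaurant"), ("city", "boston")).
    key = tuple(sorted((f.lower(), v.lower()) for f, v in query.items()))
    if key in common_lists:
        # Query rule matched: read the stored result set directly.
        return common_lists[key], "common"
    # Conventional path: intersect the per-term lists for each field.
    sets = [set(term_lists.get(pair, ())) for pair in key]
    result = set.intersection(*sets) if sets else set()
    return sorted(result), "intersection"


# Illustrative per-term lists of document identifiers.
term_lists = {
    ("city", "boston"): [1, 2, 3],
    ("category", "restaurant"): [2, 3, 4],
    ("category", "bowling"): [3],
}
# Precomputed at initialization for the common bi-gram only.
common_lists = {
    (("category", "restaurant"), ("city", "boston")): [2, 3],
}
```

A query for "City = Boston, Category = Restaurant" is served from `common_lists` without touching the two large per-term lists, while a less common combination such as "City = Boston, Category = Bowling" falls back to the intersection.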
In one embodiment of the invention, referred to as "CCC-indexing," the common terms that are selected for combined common term lists and special storage are bi-grams in the form "City = xxx, category = yyy" in which the most common categories, such as restaurants, are found in the category field and the largest cities, such as New York, Boston, and the like, are found in the city field.
[Data Integration]
Referring now to Figure 45, shown is one embodiment of the database included in the BackOffice component as included in Figures 2 and 4. Generally, data updates included in the database come from three different sources in this particular embodiment. One source is on-line updates, as provided by users making updates or entering new information for a business listing via network connections through the BackOffice component, as through the Front End Server. A second source of data updates is foreign source updates. Generally, foreign source updates are those update records which come from a different data source than the original existing database. A third type of data integration or update source is referred to as a native source update. Generally, a native source update is when an updated version of the existing database having the same source as the existing database is provided. For example, a database copy may be provided as an update on a monthly basis using full sets of data where a data provider provides an updated version of the same data set. The native source data integration procedure integrates those changes in the new data set into the existing database. This is in contrast to a foreign source update, for example, where the existing database is provided by one vendor, and the update records, for example, are provided by a different vendor. Updates from such a foreign source are referred to as foreign source data integration or foreign source updates.
It should be noted that in this particular embodiment the native source update records are provided using full sets of data. In other words, the existing database is a complete database. The native source updates are provided in the form of a complete database as opposed to only providing update records. The foreign source update records are generally records obtained from a source different from the working database and are merged into the existing database. Shown in Figure 45 is a native source update database 1500 which is integrated into the unfiltered database 1504. Generally, this is done by performing comparisons of the records of the native source update database 1500 and the unfiltered database records 1504 in determining the various types of operations that need to be performed to integrate the changes from the native source update into the unfiltered database. This will be described in more detail in paragraphs that follow. Applying data enhancement techniques to the unfiltered database, these record changes are integrated into the working database 1508. Generally, the unfiltered database 1504 is a complete version equivalent to the working database. However, the records included in the unfiltered database 1504 generally include raw data which has not had the benefit of the data enhancement techniques as applied to the working database records 1508. The on-line update records 1506 and the foreign source update records 1510 are integrated directly into the working database copy 1508. It should generally be noted that the foreign source update records 1510 are integrated or merged into the working database records 1508 by applying data merging techniques that will be described in more detail in paragraphs that follow. It should also be noted that the denormalized data, as included in the BackOffice component and the Front End Server, include, in this particular embodiment, three tables or components of data.
Generally, the three components of data include a category file, a fact file, and a business listing file. The business listing file has been previously described in conjunction with the architecture in other sections of this description. The fact file includes information additionally provided by various advertisers or business services which are generally static in nature. For example, the fact file may contain information such as hours of operation and extra attributes such as brand names or products produced by a business. This file generally does not change with updates. The third file is a category file, which may include a category identifier and a corresponding heading. Generally, the category identifier is a numeric quantity or other identifier that may be used in performing queries. The heading is a textual description of the various category identifiers which may also be used for performing data queries. In the various data integration and updates, as will be described in paragraphs that follow, it should be noted that the business listing file is generally what is updated when considering the techniques which will be described. However, the category file is also updated as part of the native source update, as will also be described in paragraphs that follow.
The paragraphs that follow describe general integration techniques for the foregoing types of data updates. Each of the techniques which will be described is associated with one type of data integration. However, in other preferred embodiments, each technique may be associated with and applied to other data types.
The foreign source update will be described in paragraphs that follow. However, the concepts and techniques included herein may also be applied to different types of data updates.
Generally, in the description that follows for data entries, there is one existing record or data entry per business listing. In this particular embodiment, a business listing is the atomic unit of granularity by which updates are performed. Any information and data, such as a phone number, name and address, associated with a particular business entity is considered to be part of one logical piece of information or record. Thus, in the descriptions that follow, updates are made with regard to the information associated with one particular business listing or entity.
The techniques which will be described regarding the foreign source update generally assume that an existing database and update records are provided, and that each originates from a different or foreign source. It should generally be noted that since the sources are different, there is no general assumption made as to particular data fields or the structure of the foreign records as compared to the existing database. It is first determined whether there is a matching entry in the existing database for an entry in the updated version of the database. If no match is found in the existing database for an entry or business listing which appears in the updated version of the database, this new entry is added and integrated into the existing database. The techniques which will be described in paragraphs that follow may be adaptable, as known to those skilled in the art, to update situations in which an implementation uses something other than two complete sets of data when performing a system update.
In this embodiment, this process of foreign source update is performed in the BackOffice component 818 in which the existing database to be updated is generally in normalized form. The updated version of the database may be in normalized or denormalized form. Depending on the form, additional processing steps, as known to those skilled in the art, may be needed to retrieve and update the actual files that include the data, for example, associated with a particular business entity or record. In the description below, the described technique assumes that each business listing generally includes the following data items: business name, zip code, and at least one of a primary phone number or toll-free phone number. Generally, the foreign source integration technique is based on the premise that a phone number and zip code of a business are sufficiently unique to significantly reduce the matching problem to comparisons of a few listings. In paragraphs that follow, a determination is made as to whether entries in the update and the existing database match, to further determine if update records are to be added, or if existing database records are to be deleted or modified.
Generally, the matching technique described for foreign source update determines a correspondence between the foreign source update records 1510 and the records in the existing working database 1508. The matching technique generally includes: 1) determining which records in the existing working database match which update records; 2) if more than one record in the existing database corresponds to the same update record, determining which record in the existing database is the closest match for the update record; and 3) if the foreign source update records include duplicate records such that multiple update records correspond to the same set of one or more existing database records, collapsing the duplicate foreign source update records into a single update record that is matched to a single record in the existing database.
After determining which records in the foreign source update correspond to which records in the existing working database, operations are determined and applied to the existing working database. Generally, as will be described, transactions with respect to the existing working database are determined. Generally, an update to an existing record is performed so as not to lose any existing information while also incorporating the new additional information or updated information. For example, an existing listing includes a business name and address, and phone number, but no e-mail address. A foreign source update record includes a business name and address, e-mail address, and phone number.
The information from the foreign source update record is included in the existing database in union with the fields that are blank in the update record such that the e-mail address in the existing database is not removed when the updated information from the update record is applied. It should be noted that in this embodiment, no delete operations are performed with the foreign source update data integration due to the nature of combining data originating from different sources. However, other embodiments may include delete operations in addition to update and modify operations in foreign source data integration.
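The non-destructive update described above, in which blank fields in an update record never erase existing data, can be sketched as a field-wise union. The function name `merge_listing`, the field names, and the sample values are hypothetical.

```python
def merge_listing(existing, update):
    """Merge an update record into an existing listing without losing data."""
    merged = dict(existing)
    for field, value in update.items():
        # A blank or missing value in the update never overwrites existing data.
        if value not in (None, ""):
            merged[field] = value
    return merged


# Illustrative records only.
existing = {"name": "Joe's Restaurant", "phone": "617-555-0100",
            "email": "joe@example.com"}
update = {"name": "Joe's Restaurant and Bowling Alley",
          "phone": "617-555-0100", "email": ""}
merged = merge_listing(existing, update)
```

The merged record takes the new business name from the update while keeping the e-mail address that only the existing record carried; no field is ever deleted, consistent with the absence of delete operations in this embodiment.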
Referring to Figure 46, at step 1000 a comparison is made between the phone number of an update record and the phone number field of each entry in the existing database. At step 1000, a determination is made as to whether or not the phone number of the record in the latest version of the database copy is an 800 phone number. If a determination is made at step 1000 that the phone number of the current update entry is not an 800 number, control proceeds to step 1008. At step 1008, the procedure "match phone number" is performed to produce a subset of one or more entries of the existing database which match the existing phone number. Control proceeds to step 1010 where the procedure "name match" is performed. Generally, "name match", which will be described in paragraphs that follow, determines whether there is a business name match for a particular entry. Control proceeds to step 1012 where "derive score" is performed based on the zip code and the name match score. Generally, the result of step 1012 is a score representing a statistic relative to determining whether two entries in a particular database and an updated version of the database match.
After performing step 1012, control proceeds to step 1020 of Figure 47 where a determination is made as to whether or not the derived score is greater than 50%. If the derived score is greater than 50%, control proceeds to step 1034 where a determination is made whether there is only one matching entry in the database for an update record. If a determination is made at step 1034 that there is only one matching entry in the database, control proceeds to step 1042, where a determination is made that a match has been found. Alternatively, if at step 1034 there is more than one matching entry in the database for a record in the current updated version of the database, control proceeds to step 1036, where a determination is made whether there is only one entry with a maximum score. If there is only one entry with a maximum score, control proceeds to step 1046, where this maximum scoring entry in the existing database is determined to be the matching entry for the updated version. If at step 1036 there are multiple entries with the same maximum score, control proceeds to step 1038 where additional processing is required to determine which is the matching entry, if any.
It should generally be noted that the score threshold of 50% may be tuned and varied for each particular implementation and embodiment. This value is generally a configurable threshold value that may be defined heuristically, for example, by examining data samples. The processing of step 1038 is generally performed off-line. It may be done manually or in an automated fashion in accordance with the types of data in the existing database. For example, at step 1038, having multiple entries with the same maximum score may indicate that there is an error or corruption in data. For example, in one embodiment, an alternate technique is used where if any record has the same zip code, that record is considered as being a matching record.
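The selection logic of steps 1020 through 1038 can be approximated in a few lines of Python. This is a simplified sketch: `select_match` is a hypothetical name, the scores are assumed to be fractions between 0 and 1, and the way "derive score" combines the zip code and name match is not modeled here because the text does not specify it at this point.

```python
def select_match(scored_entries, threshold=0.5):
    """Return (entry, status) following the decision logic of steps 1020-1038."""
    above = [(e, s) for e, s in scored_entries if s > threshold]
    if not above:
        return None, "below_threshold"   # falls through to the name-length test
    best = max(s for _, s in above)
    winners = [e for e, s in above if s == best]
    if len(winners) == 1:
        return winners[0], "match"
    return None, "needs_review"          # tie: resolved off-line (step 1038)
```

Keeping the threshold as a parameter reflects the statement that the 50% value is configurable and may be tuned per implementation.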
If at step 1020 a determination is made that the score is less than or equal to 50%, control proceeds to step 1022. At step 1022, a determination is made as to whether or not the difference in the name length is less than or equal to three. If the difference in the name length field is not less than or equal to three, control proceeds to step 1028 where a determination is made that no matching entry exists in the database. It should generally be noted that the decision process and the comparison process performed in steps 1020 and 1022 are performed for each matching entry in the subset as produced from step 1008. It should generally be noted that the threshold length of three for the name length used in step 1022 may be varied and tuned for each particular embodiment and implementation.
At step 1022, if a determination is made that there is at least one entry in the existing database with a name length difference less than or equal to three, control proceeds to step 1024, where the name edit distance heuristic may be used to compute the name distance. Generally, the name edit distance is the minimum number of insertions, deletions, and substitutions at the character level to turn one name entry or string into a second name entry or string. The number of states that string A must pass through to be transformed into string B is an entry or quantity referred to herein as the name edit distance. For example, the textbook entitled "Text Algorithms", by Maxime Crochemore and Wojciech Rytter, generally describes a technique for the name edit distance heuristic. At step 1024, the name edit distance is computed, for example, using dynamic programming techniques known to those skilled in the art, such as using a finite state machine, for each matching entry in the subset produced by step 1008. At step 1026, if a determination is made that there are one or more entries with a distance less than 10% of the length of the update name string, then control proceeds to step 1100 of Figure 52, where a determination is made as to whether or not there is only one matching entry in the subset as derived from the step 1008.
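The name edit distance described above corresponds to what is commonly known as the Levenshtein distance. A minimal dynamic programming sketch in Python follows; the function names and the combined filter `close_names` are assumptions for illustration, with the length difference of three and the 10% ratio taken from the text as tunable defaults.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_names(update_name, candidates, max_len_diff=3, ratio=0.10):
    """Candidates within the length and distance thresholds from the text."""
    limit = ratio * len(update_name)
    return [c for c in candidates
            if abs(len(c) - len(update_name)) <= max_len_diff
            and edit_distance(update_name, c) < limit]
```

Checking the length difference first is inexpensive and avoids computing the full distance for names that cannot be close, which appears to be the purpose of ordering step 1022 before step 1024.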
Referring now to Figure 52, if a determination is made at step 1100 that there is only one matching entry, control proceeds to step 1112, where a determination is made that a matching entry has been found. If at step 1100 a determination is made that there is more than one matching entry in the existing database for a foreign source update record, control proceeds to step 1102, where a determination is made as to whether or not there is only one matching entry with a minimum distance. If a determination is made that there is only one matching entry with a minimum distance, control proceeds to step 1108 where it is determined that the entry in the existing database with the minimum distance is considered a match to the update record in the foreign source update. If at step 1102 a determination is made that there is more than one matching entry with a minimum distance, control proceeds to step 1104 where additional processing may be required in accordance with the types of data included in the database. The additional processing required is generally the same type of processing that may be performed in accordance with the previously described step 1038 of Figure 47.
Referring back to Figure 46, if at step 1000 a determination is made that the phone number of the updated record is an 800 phone number, control proceeds to step 1002 where a determination is made as to whether or not the phone number, including the area code, and the zip code match one or more entries in the existing database. At step 1002, if there is a determination that one or more entries in the existing database match the phone number and zip code of the update record, control proceeds to step 1006 where a subset of one or more matching entries is found. Control then proceeds to point B indicated at step 1010 in Figure 46 where execution continues.
If a determination is made at step 1002 that the phone number and zip code do not match any entries in the existing database, a determination is made at step 1004 that no match exists in the database for the current update record.
Referring now to Figure 48, shown is a flow chart of an embodiment for the "match phone number" routine as performed at step 1008. At step 1050, a table is used with old and new area codes and exchanges to determine if there are one or more matching entries in the existing database which match the phone number of the current update entry.
Generally, the processing of step 1050 and the decision made at step 1052 may be used, for example, where area codes have changed due to an increased volume of phone numbers which requires additional area codes to be added for a particular locality. For example, the 508 area code may be expanded to include the 781 area code. Thus, an existing phone number may be included in the database with either the 781 or the 508 area code depending on the age of the data in the database. If a determination is made at step 1052 that either an old area code and exchange, or a new area code and exchange match, control proceeds to step 1054 where a subset of one or more matching entries is formed. Control proceeds to step 1056 where control returns to the calling procedure. In this instance, control returns to step 1008 where subsequent control proceeds to step 1010 of Figure 46.
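The use of a table of old and new area codes and exchanges might be sketched as follows. The table contents, the hyphenated phone format, the exchange value, and the function names are all hypothetical; only the 508 and 781 area codes come from the example in the text.

```python
# Hypothetical table of area-code changes: (old code, exchange) -> new code.
AREA_CODE_CHANGES = {("508", "555"): "781"}


def phone_variants(phone):
    """Return the phone number under both its old and new area codes."""
    area, exchange, line = phone.split("-")
    variants = {phone}
    for (old, exch), new in AREA_CODE_CHANGES.items():
        if exch == exchange and area in (old, new):
            variants.add(f"{old}-{exchange}-{line}")
            variants.add(f"{new}-{exchange}-{line}")
    return variants


def match_phone_number(update_phone, existing_records):
    """Subset of existing records whose phone matches under either area code."""
    wanted = phone_variants(update_phone)
    return [r for r in existing_records if r["phone"] in wanted]
```

An update record carrying the newer 781 area code thereby still matches an older database entry stored under 508, provided the exchange and line number agree.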
If at step 1052 a determination is made that there is no old or new area code and exchange in the existing database which match the current entry in the updated version of the database, control proceeds to node C of the "secondary search" in Figure 51 at step 1086. Generally, the processing which occurs in the steps of Figure 51 attempts to find semantic equivalents of the name fields indicating a possible match. At step 1086, the name of the update record is tokenized. At step 1088, "stop words" are removed from the name field. Generally, stop words are words which may be ignored when doing a name comparison. For example, in this particular embodiment, the words "and", "or", "the", "a", "an", "to", "in", and "at" are considered "stop words": a matching entry may contain any number or combination of these and the match should still succeed. Thus, at step 1088, these words are removed and not considered when performing a name comparison.
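Steps 1086 and 1088 may be sketched as follows. The stop word list is the one given for this embodiment; the rule of splitting on non-alphanumeric characters and the function names are assumptions made for illustration.

```python
import re

# Stop words listed for this embodiment.
STOP_WORDS = {"and", "or", "the", "a", "an", "to", "in", "at"}

def tokenize_name(name):
    """Step 1086 analogue: split a name into lowercase word tokens.

    Splitting on non-alphanumeric characters is an assumption.
    """
    return re.findall(r"[a-z0-9]+", name.lower())

def remove_stop_words(tokens):
    """Step 1088 analogue: drop tokens that are ignored in a name comparison."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize_name("The Law Offices of Smith and Jones"))
```

Because the stop words are removed from both sides before comparison, two names that differ only in these words reduce to the same token list.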
At step 1090, a search of the existing database is performed on the conjunction of the tokenized name field components and the zip code. Generally, the search is being performed for entries in the existing database which match the zip code and the different components of the name field. At step 1092, a determination is made as to whether or not there are more than five matching entries in the existing database for the current update record. If at step 1092 a determination is made that there are more than five matching entries in the existing database, control proceeds to step 1094 where a determination is made that no match has been found. If at step 1092 a determination is made that there are not more than five matching entries, control proceeds to point B in the processing which is shown in Figure 46, step 1010, where these name matching entries are used as the subset upon which subsequent processing is performed.
Referring now to Figure 49, shown is a flow chart of the steps of one embodiment for performing a "name match" routine as invoked from step 1010 of Figure 46. Generally, the steps of Figure 49 attempt to find semantic equivalents of the names of a business in this particular instance. At step 1060, for each entry in the subset formed by step 1008, the name entries are canonized. Generally, canonization rules are a set of transformations, for example, transforming abbreviations and the like to semantic equivalents, allowing for a common denominator of terms to be searched for. For example, if all entries in a database use the entire word "incorporated" to indicate an incorporated business, then if a name entry includes the abbreviation "inc", this is expanded to the full name "incorporated" prior to being compared. Generally, the precise canonization rules or transformations depend upon the particular data being examined in a particular application.
Control proceeds to step 1062 where the name field is tokenized into components. At step 1064, a setwise contents comparison of the name components of each entry is determined against the current update entry. At step 1066, a score is computed for each name comparison of the existing database entry with a record of the updated version of the database. The score is computed as one point per matching component. At step 1068, control returns to step 1010 where subsequent processing resumes with step 1012. Generally, the processing steps of Figure 49 attempt to formulate a numeric quantity or metric for determining whether two name entries match. This weighted value or concatenation is used in further comparison in combination with other fields, such as the zip code, in arriving at a final quantity for determining whether or not the name fields of an existing database entry and an update record match.
Referring now to Figure 50, shown is a flow chart of the steps of one embodiment for performing the routine "derive score", as performed from step 1012 of Figure 46. Generally, derive score attempts to produce a normalized metric or score based on the name field and the zip code. At step 1080, the score previously derived from name match for each entry is incremented by one if the zip code of an existing database entry matches that of an updated entry. At step 1082, this score is normalized by taking the score computed thus far and dividing it by the number of tokens in the foreign source entry name field. It should be noted that other techniques may be used to produce a normalized score as in step 1082.
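The "name match" and "derive score" routines of Figures 49 and 50 may be sketched together as follows. The canonization table, the tokenization rule, the record layout, and the function names are assumptions for illustration; only the scoring rule (one point per matching name component, one point for a matching zip code, division by the number of tokens in the update record's name) is taken from the description above. Tokens are held in sets, so repeated words in a name count once.

```python
import re

# Hypothetical canonization rules; the real rules depend on the data set.
CANONICAL = {"inc": "incorporated", "corp": "corporation"}

def canonize_tokens(name):
    """Steps 1060 and 1062 analogue: canonize a name and return its set of tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {CANONICAL.get(t, t) for t in tokens}

def derive_score(existing, update):
    """Steps 1064 through 1082 analogue: normalized name and zip code score."""
    existing_tokens = canonize_tokens(existing["name"])
    update_tokens = canonize_tokens(update["name"])
    if not update_tokens:
        return 0.0
    score = len(existing_tokens & update_tokens)  # one point per matching component
    if existing["zip"] == update["zip"]:
        score += 1                                # one point for a matching zip code
    return score / len(update_tokens)

same = derive_score({"name": "Acme Widgets Inc.", "zip": "02451"},
                    {"name": "Acme Widgets Incorporated", "zip": "02451"})
partial = derive_score({"name": "Acme Tools Inc.", "zip": "02453"},
                       {"name": "Acme Widgets Incorporated", "zip": "02451"})
```

Note that with this normalization a full name match plus a zip code match yields a value greater than one, so any threshold applied to the score would need to take that into account.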
At step 1084, control returns to the point of call. In this particular instance, control returns to step 1012 where processing resumes with step 1020 of Figure 47.
Just described with regard to Figures 46 through 52 are processing techniques for determining matching entries for foreign data. What will now be described are techniques which provide for data enhancements where the two databases or two data sources being integrated are from the same source. Generally, where there is this native source processing, there will be fewer differences between the data entries due to the fact that both data sets come from the same source. Thus, the techniques which are described in paragraphs that follow may generally be referred to as data enhancements. However, similar to the processing just described with regard to foreign source integration and processing, the concepts and processing steps which will be described may be readily adaptable to other types of data updates in accordance with other particular implementations and data sets.
The update techniques for native source assume that two full sets of data are used: the updated database version, and an unfiltered or raw version 1504 of the existing working database. Generally, the techniques that are described below with regard to native source processing are data enhancement techniques applied to the unfiltered database 1504 to produce the working database 1508 of Figure 45.
Referring now to Figure 53, at step 1400, the computation of the data update is performed using two complete sets of data from native sources. Generally, at step 1400, the latest set of data received, such as from a data provider, is submitted into the database and compared against the set that is in the existing database. All of the records in the data set are loaded in the following form: a distinct record ID followed by a string, where the string is all the fields from the record concatenated together for comparison purposes in the steps that follow. In this particular instance, record IDs are unique within the set and indexed. As a result of processing at step 1400, the delta or difference between the two data sets is produced. Each entry in this delta or difference is classified as an insert, delete, or update operation. A record is inserted into the existing database if its identifier is in the new version of the data set but not in the existing database. All records which have identifiers in the existing database, but not in the new version, are slated for deletion from the existing database. Records whose identifiers are in both sets, but whose associated strings differ, are considered update records, with the string contents being updated for the corresponding identifiers. At step 1402, the update records, which include insert and update transactions, are applied to the existing database. At step 1404, certain data post-processing is performed as will be described further in the paragraphs that follow.
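The delta computation of step 1400 may be sketched as follows, with each data set held as a mapping from record ID to the concatenated field string. The "|" separator, the field names, and the function names are assumptions for illustration.

```python
def concat_fields(record, field_order):
    """Concatenate a record's fields into one comparison string (separator assumed)."""
    return "|".join(str(record[f]) for f in field_order)

def compute_delta(existing, updated):
    """Classify differences between two {record_id: string} data sets.

    Returns (inserts, deletes, updates); records that match in both
    identifier and string need no transaction and are omitted.
    """
    inserts = {rid: s for rid, s in updated.items() if rid not in existing}
    deletes = [rid for rid in existing if rid not in updated]
    updates = {rid: s for rid, s in updated.items()
               if rid in existing and existing[rid] != s}
    return inserts, deletes, updates

fields = ["name", "street", "zip"]
existing = {
    1: concat_fields({"name": "Acme", "street": "12 Main St", "zip": "02451"}, fields),
    2: concat_fields({"name": "Bolt Co", "street": "5 Elm St", "zip": "02452"}, fields),
    3: concat_fields({"name": "Cafe Uno", "street": "9 Oak St", "zip": "02453"}, fields),
}
updated = {
    1: existing[1],                                                # unchanged
    2: concat_fields({"name": "Bolt Co", "street": "7 Elm St", "zip": "02452"}, fields),
    4: concat_fields({"name": "Deli Duo", "street": "1 Ash St", "zip": "02454"}, fields),
}
inserts, deletes, updates = compute_delta(existing, updated)
```

Comparing one concatenated string per record keeps the comparison to a single equality test per identifier, at the cost of not reporting which field changed.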
Figures 46-^4 generally describe data integration of the native source updates which are applied to the database of business listings and categories. In summary, for both business listings and categories, comparisons are made between records of the native source unfiltered database and the native source update.
Referring now to Figure 54, shown are more detailed steps of one embodiment of step 1400 involving the computation of the data update as pertaining to the native source business listings previously described. At step 1406, a comparison is made between the existing database copy and the updated database copy by comparing the record identifiers and the string concatenation which represents the remainder of the records. At step 1410, each update record is classified as one of a matching entry, an insertion, a deletion, or an update with respect to the existing database. At step 1416, a record is determined to be matching if the record identifier and string field in the existing and updated database copies match. At step 1420, a record is classified as one to be inserted if there is a record with a record identifier in the update database which is not in the existing database. Subsequently, at step 1418, data enhancements are performed and the record is integrated into the working database. It should be noted that the data enhancements, which are also performed in step 1428, are described in more detail in paragraphs that follow. At step 1424, a record is classified as one to be deleted from the existing database if there is a record with the record identifier in the existing database that is not in the updated database.
Subsequently, at step 1422, the data operation is performed integrating the data updates into the existing working database. At step 1430, a record is considered an update transaction to an existing record in the existing database if the record identifiers match, but the remainder of the record represented as a string does not match. Subsequently, at step 1426, the longitude and latitude of a record may be updated if the address has been modified. At step 1428, data enhancements may be performed to the record, and the data update is applied to the existing working database as well as the unfiltered database. In the case of step 1416 where matching entries are found, no further processing may be required for the existing database or the updated database record. However, at steps 1420, 1424, and 1430, update records or transactions are generated to modify the existing database.
It should generally be noted that any of the foregoing operations which are modifications, including updates and deletions, to the existing working database records may be conditionally performed in an embodiment of the invention. A protection or locking technique may be included in the database, for example, which prevents a deletion or modification of a particular business listing included in the database regardless of the processing classifications of Figure 54.
The data enhancements, as performed at steps 1418 and 1428, are generally data filtering steps prior to integrating the data update into the working database 1508. The data filtering techniques generally facilitate matching corresponding records when performing updates. Data enhancements may include, for example, upper/lower case justification, detection of synonyms and/or acronyms, and transformation of abbreviations as may be used in business names (e.g., corp., inc.), street addresses (e.g., St., Pl.), and city and state names.
Other embodiments may include other enhancements in accordance with the type of data and the various applications.
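Two of the data enhancements named above, case justification and transformation of abbreviations, may be sketched as follows. The abbreviation tables, the choice of capitalizing each word, and the function name are assumptions for illustration; an actual embodiment would use rules suited to its data.

```python
# Hypothetical abbreviation tables keyed by lowercase abbreviation.
BUSINESS_ABBREVIATIONS = {"corp": "Corporation", "inc": "Incorporated"}
STREET_ABBREVIATIONS = {"st": "Street", "pl": "Place"}

def enhance_field(text, abbreviations):
    """Expand known abbreviations and justify the case of the remaining words."""
    words = []
    for word in text.split():
        key = word.rstrip(".,").lower()   # ignore trailing punctuation on lookup
        words.append(abbreviations.get(key, word.capitalize()))
    return " ".join(words)

name = enhance_field("ACME widgets inc.", BUSINESS_ABBREVIATIONS)
street = enhance_field("12 main st.", STREET_ABBREVIATIONS)
```

Applying the same filter to both the unfiltered record and the update record before comparison is what lets records that differ only in case or abbreviation be treated as corresponding.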
Referring now to Figure 55, shown is an embodiment of a method for performing the update computation of step 1400 as applied to the category file. Recall that the category file in one embodiment includes a category identifier and a corresponding header that is a text description of the associated category identifier. It should generally be noted that these updates are applied in a model similar to that of the business listing files for native source updates. The updates are first applied to a "raw" or unfiltered version of the category file, followed by data enhancements as appropriate, and then integration of the data updates into a working copy of the category file included in the working database 1508.
At step 1460, the current and updated category files are compared in terms of identifiers and associated headers. At step 1462, each update record is classified as one of several types of transactions.
At step 1464, a record in the updated category file is considered matching if the record identifier and the associated header match an entry in the current category file. At step 1466, a record is inserted into the existing unfiltered database and working database if the record identifier is not in the existing unfiltered database copy of the categories. At step 1468, data enhancements may be performed and the resulting filtered data further integrated into the existing category file in the working database 1508. The data enhancements, as included in steps 1468 and 1476, are described in more detail in paragraphs that follow.
At step 1470, a record in the existing category file is deleted if the record identifier of an existing record is not in the updated version. At step 1472, this deletion operation may be performed to the working copy of categories included in the working database 1508.
At step 1474, an update record is used to update the database copies if the record identifiers of an existing record and an update record match, but the heading names differ. At step 1476, data enhancements are performed and the update operation is integrated into the working copy of the categories included in working database 1508. The data enhancements, as performed at steps 1468 and 1476, upon the category listings may include processing of the headings. For example, the processing to enhance the text of the headings may include text transformations such as upper/lower case justification, consolidation of abbreviations, and removal of idiosyncratic and slang terminology. The function of these data enhancements is generally to filter the data to provide more accurate determination of matching or corresponding categories.
Referring now to Figure 56, shown are general post-processing steps of one embodiment, expanding step 1404 of Figure 53 into more detailed steps. Generally, these steps may be performed to the category file as included in the working database 1508. At step 1440, new categories may be added. Generally, a data vendor may not provide an integrated version of all business categories. It may be possible to enhance some record categories as additional data is added. For example, a restaurant may be a particular type of category and there may be other subdata organized in the structure of the record indicating that there is a particular type of restaurant in accordance with the various ethnic cuisines, such as French or Italian. Post-processing as in step 1440 may be written to search the data file in accordance with a recognized structural format and add additional categories in accordance with any categories and subcategories. For example, if a determination is made that there is a large number of restaurants with a subcategory of French, a new record category may be added which is "French restaurant". Similarly, an Italian restaurant category may be added. This is generally performed in accordance with the data organization and categories of the particular data being examined in each implementation.
At step 1442, redundant categories as stored by business are detected and collapsed by removing the equivalent categories. Generally, at step 1442, semantically equivalent categories are determined. Generally, this includes locating equivalent categories for which the spelling might be slightly different, or those fields which may be subsets or equivalents of other fields. For example, "animal doctor" may be interpreted as a semantic equivalent for "vet" or "veterinarian". Generally, this step may be done in an automated fashion using any programming language which is commercially available and may be used with the existing database. The technique involves dropping or not including special non-alphanumeric characters or other words, similar to the stop words. White space may be compressed and the comparison may be done in a case-insensitive manner. The comparison may further be done by requiring an exact character match or with some at-a-distance technique similar to those previously described with other data processing.
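The text normalization used when comparing category headings may be sketched as follows. The sketch covers only the mechanical part described above (dropping non-alphanumeric characters and stop-like words, compressing white space, ignoring case) under an exact-match comparison; the stop word list and function name are assumptions, and synonym pairs such as "animal doctor" and "vet" would need an additional table that is not shown.

```python
import re

# Assumed list of ignorable words, similar to the stop words used in name matching.
IGNORED_WORDS = frozenset({"and", "or", "the", "a", "an", "to", "in", "at"})

def category_key(heading):
    """Reduce a heading to a comparison key for exact matching."""
    text = re.sub(r"[^a-z0-9\s]", "", heading.lower())   # drop special characters
    tokens = [t for t in text.split() if t not in IGNORED_WORDS]
    return " ".join(tokens)                               # split/join compresses white space

equivalent = category_key("Restaurants -  French") == category_key("restaurants french")
```

Headings that reduce to the same key can be treated as candidates for collapsing; an at-a-distance comparison could be applied to the keys instead of strict equality.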
At step 1444, the duplicate categories and records may be removed from the existing version as stored in the working database 1508.
It should be noted that, in the processing of step 1442 where redundant categories are collapsed by detecting and removing equivalent categories, different rules may be used to decide which of several duplicate categories is the one to keep. For example, the rule may keep the longest name, the shortest name, or simply the first name.
Referring now to Figure 57, shown is a flowchart of one embodiment of a method of more detailed processing steps of step 1442 for collapsing redundant categories. At step 1520, duplicate categories are determined. A technique for determining duplicate categories is described in paragraphs that follow in conjunction with Figure 58. At step 1530, duplicate categories in the unfiltered database may be examined as a group and one of the category names or headings is chosen to be the heading included in the collapsed category record. One technique for choosing the heading is by determining which category name is most frequently used, such as by examining the business listing files for frequency determination. At step 1534, the business listing files, as included in the unfiltered database, may be patched with the new heading and identifier corresponding to the collapsed resulting record. At step 1536, the category file is also updated to reflect the collapsed entry. It should be noted that these changes are made to the existing working database.
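Choosing the surviving heading by frequency of use and patching the listings, as in steps 1530 and 1534, may be sketched as follows. The listing layout with a single "category" field and the function names are assumptions for illustration; ties are resolved in favor of the heading that appears first in the duplicate group.

```python
from collections import Counter

def choose_heading(duplicate_headings, listing_headings):
    """Pick the duplicate heading used most often in the business listings."""
    dupes = set(duplicate_headings)
    counts = Counter(h for h in listing_headings if h in dupes)
    return max(duplicate_headings, key=lambda h: counts[h])

def collapse(duplicate_headings, listings):
    """Return the chosen heading and a copy of the listings patched to use it."""
    chosen = choose_heading(duplicate_headings, [l["category"] for l in listings])
    dupes = set(duplicate_headings)
    patched = [dict(l, category=chosen) if l["category"] in dupes else l
               for l in listings]
    return chosen, patched

listings = [
    {"name": "Paws Clinic", "category": "Vet"},
    {"name": "Happy Tails", "category": "Veterinarian"},
    {"name": "Dr. Fur", "category": "Veterinarian"},
    {"name": "Rose Shop", "category": "Florist"},
]
chosen, patched = collapse(["Vet", "Veterinarian", "Animal Doctor"], listings)
```

Swapping the `key` function for `len` (or its negation) would give the longest-name or shortest-name rules mentioned above.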
Referring now to Figure 58, shown is a flowchart of an embodiment of method steps for detecting duplicates in the category file. Generally, these steps are more detailed processing steps of step 1520 of Figure 57. At step 1500, a first category name in the category file of the unfiltered database is tokenized. In other words, each word included in the heading or category name is associated with a token. Similarly, in step 1504, the next record of a category is examined and also tokenized. At step 1506, a comparison of the two tokenized names is performed to derive a score in accordance with the number of matching name components. This may also be normalized, as described in accordance with the foreign source update processing techniques. At step 1508, a determination is made as to whether or not the score is greater than a predetermined threshold. In this instance, the threshold is 75%. If the score is greater than the threshold, control proceeds to step 1512 where the categories are tagged as duplicates, propagating any previous matching identifier tag. In other words, a transitive matching technique is used in marking matching categories. For example, if ID1 = ID2 and it is then determined that ID2 = ID5, ID5 is also marked as having ID1 as a matching identifier. Similarly, subsequent matches to ID5 further propagate the value ID1. Subsequently, control proceeds to step 1510 for advancement to the next record. If it is determined at step 1508 that the score is not greater than the threshold, no match is found and control proceeds to step 1510 where the next category is advanced to. At step 1514, a determination is made as to whether all the categories have been processed in the category file. If they have, control proceeds to step 1516 where processing stops. Otherwise, control proceeds to step 1504 for further comparisons and determinations of equivalent categories.
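The duplicate tagging with transitive propagation may be sketched as follows. The 75% threshold is from the description above; the normalization (shared tokens divided by the larger token count), the all-pairs iteration order, and the function names are assumptions for illustration.

```python
def overlap_score(name_a, name_b):
    """Fraction of shared tokens, normalized by the larger token set (assumed rule)."""
    tokens_a, tokens_b = set(name_a.lower().split()), set(name_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / max(len(tokens_a), len(tokens_b))

def tag_duplicates(categories, threshold=0.75):
    """Map each duplicate category ID to the ID it is (transitively) a duplicate of.

    categories is a list of (category_id, heading) pairs in file order.
    """
    tags = {}
    for i, (id_a, name_a) in enumerate(categories):
        for id_b, name_b in categories[i + 1:]:
            if overlap_score(name_a, name_b) > threshold:
                # Propagate any tag already carried by id_a instead of id_a itself.
                tags.setdefault(id_b, tags.get(id_a, id_a))
    return tags

categories = [
    (1, "auto body repair paint shops"),
    (2, "auto body repair paint services"),      # 4 of 5 tokens shared with ID 1
    (5, "auto body repair detailing services"),  # 4 of 5 shared with ID 2, 3 of 5 with ID 1
    (7, "florists"),
]
tags = tag_duplicates(categories)
```

In this example ID 5 does not exceed the threshold against ID 1 directly, but it is still tagged with ID 1 because it matches ID 2, which already carries that tag.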
It should generally be noted that various percentages and lengths used in the foregoing data integration techniques may be tuned or varied for each particular embodiment in accordance with, for example, the data type and record lengths. Adaptive tuning of values used in making determinations may be automated, for example, by adjusting thresholds in accordance with actual data values to filter out extreme data values. It should also be noted that the category table or file may be used by the query engine when processing a data query. For example, the category file may be used to identify valid categories specified in a user query. It may also be used to categorize information displayed to a user. In other words, a resulting data set may be partitioned in accordance with the categories as included in business listings for the resulting query. For example, if a resulting data set includes 10 listings, these listings may be categorized or grouped in accordance with whether or not particular categories are associated with each listing. The information displayed to the user for these 10 listings may be 5 listings included in category A, and 5 listings included in category B. Thus, when the category table or file is updated, the table is propagated as part of the update data to the Front End Server and, subsequently, further to the query engine.
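Partitioning a result set by category may be sketched as follows. The listing layout (a name plus a list of category headings) and the function name are assumptions for illustration; a listing associated with several valid categories appears in each of the corresponding groups.

```python
from collections import defaultdict

def group_by_category(listings, valid_categories):
    """Group listing names under each valid category they are associated with."""
    groups = defaultdict(list)
    for listing in listings:
        for category in listing["categories"]:
            if category in valid_categories:   # the category file defines validity
                groups[category].append(listing["name"])
    return dict(groups)

results = [
    {"name": "Chez Paul", "categories": ["French restaurant"]},
    {"name": "Luigi's", "categories": ["Italian restaurant"]},
    {"name": "Bistro 9", "categories": ["French restaurant", "Unknown"]},
]
groups = group_by_category(results, {"French restaurant", "Italian restaurant"})
```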
[Multi-media Data Transfer]
An efficient data transfer technique is used to transfer data between databases, such as between the BackOffice component 818 and the Primary Database 812 of Figure 4. In this particular embodiment, the types of data that are transferred generally relate to advertisements such as those displayed to the user 800 of Figure 2. Generally, advertisement data includes text data and non-text data. The non-text data may be referred to as "blob" data which includes, for example, image and audio data, as well as machine-executable programs, JAVA bytecode, and the like. The technique, which will be described in paragraphs that follow, generally uses different data channels depending on the type of data. For example, text data is transferred from the BackOffice component to the Front End Server 804 using a different data channel than blob data that is also transferred between the two components. A sending component may be located within the BackOffice component 818 which includes software that decides the type of data, the channel used to transfer the data, and how to break up the data into portions which are transferred to a receiving component located in the Front End Server 804, such as the Primary Database 812. Located on the receiving component, as may be included in the Primary Database 812, is software which decides how to synchronize or assemble data received from the BackOffice component 818. In this particular embodiment, the advertisement data is generally data that is displayed in response to a user query. Generally, the text data included in this data transfer may be characterized as structured data, as included in text which is displayed to the user.
The second type of data generally transferred is denoted as "blob" data, which generally cannot be decomposed or operated upon in different portions. For example, blob data may include a machine-executable program, which is generally a binary data type. Generally, the technique uses two separate data channels in which each channel transfers a different type of data. In this particular embodiment, one data channel is used to transfer the text data, and Database Link™ software, as included in the commercially available Oracle™ database, is used to facilitate database communication of text data. Therefore the database routines, such as those included in the Database Link software, may be used in transferring text data between databases. In this particular embodiment, the Oracle database does not support direct non-text manipulation, such as for transferring data of different types, such as blob data. Therefore, a second, different data channel is used to transfer the blob data from one database to another. The second channel is external to the database since the version of the Oracle database software used in this embodiment does not provide the needed support for direct non-text data manipulation. The blob data, which may also generally be characterized as multi-media data, is transferred asynchronously from the text data between databases.
As will be described in paragraphs that follow, the blob data in this embodiment is copied from one database to another using a C++ program with calls to vendor-supplied library routines. This is in contrast to the text data transfer which is done by a separate data channel, and the software used performs remote database copies as if they were local. In this embodiment, the text data transfer may be performed by calls to the Oracle procedures executed under the control of the Oracle database software. Generally, the data channels used to transfer both the text and the blob or multi-media data may be network connections between the databases. Other types of connections between the databases may also be possible, such as a dedicated hard line to facilitate database communication, as known to those skilled in the art. As will be described in paragraphs that follow, data is organized and associated with a particular advertisement that may be displayed to a user.
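The sender-side decision of which channel carries which data may be sketched as follows. This is an illustration only: it does not model the Database Link software or the external C++ program, and the function name, the payload layout, and the use of Python `bytes` values to stand for blob data are assumptions.

```python
def route_record(record_id, fields):
    """Split one record's fields into a text-channel payload and a blob-channel payload."""
    text_fields, blob_fields = {}, {}
    for name, value in fields.items():
        if isinstance(value, (bytes, bytearray)):
            blob_fields[name] = bytes(value)   # non-text data: external channel
        else:
            text_fields[name] = value          # structured text: database channel
    # The record identifier travels on both channels so that the receiver can
    # reassemble the two parts, which may arrive at different times.
    return {"text_channel": (record_id, text_fields),
            "blob_channel": (record_id, blob_fields)}

payloads = route_record(104, {"title": "Grand Opening", "logo": b"GIF89a"})
```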
Figure 59 is a block diagram of two tables in a preferred embodiment depicting one technique for storing the advertisement data. In this particular embodiment, the advertisement data and the relation between the different components of the advertisement data are described in two tables stored in the sending databases. Table 1200 is a relational mapping table which generally describes the relation between the various data entities as included in a particular advertisement page. In this particular embodiment, as will be described in an example, the relational mapping data describes a parent/child relationship between various data entities of an advertisement page forming a tree-like structure. The data table 1220 includes the actual data as described by the relational mapping table 1200. The data included in the data table 1220 includes a variety of data types as may be displayed with regard to an advertisement. For example, the data included in table 1220 may be text data, machine-executable code, or a JAVA program. In this particular embodiment which uses the Oracle database software, one restriction is that each row of the data table 1220 may contain at most one field of blob data. Thus, if an advertisement, in this particular embodiment, requires the use of multiple blob files, they must be stored in different rows of the data table 1220. Other implementations and embodiments may have similar or other restrictions that may affect the particular organization of the data as required for advertisements or other data displayed to the user. It should generally be noted that the structure of the tables depicted in Figure 59 is particular to this implementation and embodiment of the invention. Other embodiments of the invention may include different table structures in accordance with various implementation restrictions.
The relational mapping table 1200 includes two columns of data. The first column 1204 is the record ID of the child data entity. The second column 1206 is the record ID of the parent data entity. The data table 1220 generally includes multiple columns depending on how many data fields are required for a particular implementation. In this particular embodiment, a record identifier 1208 is used to uniquely identify a particular data entity in a table. Also included are data fields data-1 1210 through data-n 1214 in which each of these data fields includes one particular type of data entity as may be displayed to the user in response to a data query.
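Recovering the tree of one advertisement page from rows of (child ID, parent ID) may be sketched as follows. The row values are invented for illustration, as is the assumption that a root node appears only in the parent column; the function names are likewise hypothetical.

```python
from collections import defaultdict

def build_children(mapping_rows):
    """Index (child_id, parent_id) rows as parent_id -> list of child ids."""
    children = defaultdict(list)
    for child_id, parent_id in mapping_rows:
        children[parent_id].append(child_id)
    return children

def ad_page_ids(mapping_rows, root_id):
    """Return the identifiers of one advertisement page, root first, in depth-first order."""
    children = build_children(mapping_rows)
    ordered, stack = [], [root_id]
    while stack:
        node = stack.pop()
        ordered.append(node)
        # Reverse so that children are visited in table order.
        stack.extend(reversed(children.get(node, [])))
    return ordered

# Illustrative rows describing two pages with roots 101 and 104.
rows = [(102, 101), (103, 101), (105, 104), (106, 104)]
page_ids = ad_page_ids(rows, 104)
```

The resulting identifier list is the set of rows that would need to be read from the data table for that page.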
Referring now to Figure 60, shown is a more detailed diagram of the tables as used in a data transfer on a sending and receiving side using this data transfer technique. Shown in Figure 60 is an example of a relational mapping table 1200 which includes multiple advertisement pages. In this particular embodiment, one tree-like structure is used to represent one advertisement page. As shown in Figure 60, two tree structures may be produced using the data described in the relational mapping table 1200. What will be described in paragraphs that follow is the data transfer of the advertisement page associated with the root node with the identifier 104, which includes identifiers 104, 105 and 106 in its tree-like structure.
Referring now to Figure 61, shown is the tree-like structure described by the relational mapping table 1200 for the advertisement page with the root node identifier 104 shown in Figure 60. Referring back to Figure 60, on the receiver side of the data transfer, shown are two tables, temporary table 1216 and ad page table 1218. In this particular embodiment these two tables are created on the receiver side for each advertisement transferred from the sender. In the snapshot of Figure 60, the two tables of data on the receiver side depict tables after the transfer of the ad page with the root node of the identifier 101 and prior to the transfer of the data associated with the advertisement page with the root node of identifier 104. Generally created on the receiver side for each advertisement page is a separate ad page table 1218. The temporary table 1216 is filled with data during the data transfer and, after the data is properly assembled on the receiver side, the temporary table 1216 is not used until the next data transfer operation. In this particular embodiment, the table ends in a state such that no data from the data transfer having just occurred is located in the table 1216.
Referring now to Figure 62, shown is a block diagram of the data on the sender side and the receiver side as associated with the data table 1220 previously discussed in Figure 59. In the example which will be described in paragraphs that follow involving the data transfer of identifiers 104-106, each identifier is associated with only blob data. It should be noted that this general technique and the data included in the data table 1220 may additionally include text data associated with each identifier or row in the table. An entry in the table 1220 may also include only text data. As previously described in this embodiment, the limitation is that only one field entry of blob data may be associated with each row in table 1220. On the receiving side, three tables are associated with transferring data which is blob data from the data table 1220. These three tables include a blob temporary table 1222, a blob table 1224, and a repository table 1226. It should generally be noted that any text data included in table 1220 on the sender side may be transferred using the data transfer channel. What is described in Figure 62 is that portion of the data included in the data table 1220 which is blob data. In this example, only blob data is included in the advertisement page with the root node 104 which will be described.
The blob temporary table 1222 is a temporary table used in the transfer of text information associated with blobs from the sending node to the receiving node. The blob table 1224, in this particular embodiment, is an aggregate blob table which includes the blob data for multiple advertisement pages. In other words, the snapshot of the data tables of Figure 62 shows the data associated with one advertisement page, that with the root node identifier 101. After the completion of the transfer of the advertisement page with the root node identifier 104 on the receiving side, the blob table 1224 will also include information to retrieve the blob data associated with identifiers 104 through 106. It should be noted that the contents of the blob table 1224 do not include the actual blob data itself. Rather, as will be noted in the description that follows, the fields included in the blob table 1224 point to and further describe the actual blob data which is contained in the repository table 1226. The blob table 1224 in this embodiment includes three fields for each entry associated with a blob data entity: a sending record identifier 1228, a size 1230, and a pointer 1232 to the actual blob data. The sending record identifier 1228 identifies a particular blob uniquely within a particular table or advertising page in this particular embodiment. Thus, each of the entries in the record identifier column 1228 may not be unique across all of the advertisement pages or data. Rather, the purpose of the record identifier is to map or identify the particular blob pointer associated with a unique record identifier from the sending database.
The size 1230 indicates the size in bytes of the blob described by the blob pointer field 1232. In other embodiments, the size field may include other units to identify the size of the particular blob data. The blob pointer field 1232 acts as an identifier or pointer into the repository 1226 to uniquely identify within the repository a particular piece of blob data. It should be noted that other embodiments or implementations may include additional fields in the blob table 1224 as well as in the repository 1226 in accordance with other pieces of data that may be required in order to enable the transfer to occur in a particular implementation.
FIGS. 62 through 66 show block diagrams of an embodiment of transferring the data associated with an advertisement from the sending side to the receiving side. Figure 63 depicts a snapshot of the tables associated with the text or Database Link transfer channel as included in the sending and receiving sides. The data table 1200 on the sending side has no modifications from the previously described initial table as depicted in Figure 60. However, the tables on the receiving side have been modified from those previously described in Figure 60. In particular, the temporary table 1216 serves as a temporary placeholder for the data involved in the data transfer of the particular ad page described beginning with root node identifier 104. Generally, the data associated with a particular advertisement page is extracted from the relational mapping table 1200 and is temporarily copied to and stored in the temporary table 1216 on the receiving side.
Shown in Figure 64 are the tables associated with transferring the actual data from the sending side to the receiving side. The data included in the data table 1220 is segregated into text data and non-text data. The text data is transferred using the text channel. The non-text, multimedia data, or blob data, is transferred using an external process which creates a second multimedia data transfer channel in order to send data from the sending side to the receiving side. In this particular embodiment of the data table 1220, the id and the size fields are copied to the blob temporary table 1222. Additionally, a global id (Gid) is generated on the sending side prior to transmitting these fields to the receiving side. This global id is transferred to the receiving side and included in each associated entry of the temporary table 1222. Generally, the Gid is a unique identifier associated with each record, uniquely identifying the record among all tables associated with database information. The blob data from table 1202 and the associated information in table 1242 are transferred to an external process 1240 located on the sending side. In this particular embodiment, an Oracle™ pipe is the communication means used to transfer the data from the data table 1220 to the external process 1240. The external process 1240 further transmits the data via a multimedia data channel to the receiving side. Table 1242 may also be viewed as a temporary table which serves as a placeholder for that data which is transferred by the external process 1240 to the receiving side. Located in temporary table 1242 are four pieces of information including a table name, a field name, an identifier, and a global identifier associated with each blob data entity. The table name generally describes or identifies the particular table within which a piece of blob data is located or associated.
In this particular embodiment, each table is associated with a particular advertisement or advertisement name. The field name identifies the type of non-text data. In this particular embodiment the field name is "Blob", referring to blob or multi-media data. The identifier field (Id) of table 1242 is the unique record identifier copied from table 1220. The global identifier (Gid) is a unique global identifier, identical to that which is produced on the sending side prior to sending the text data to the temporary blob table 1222. This information is passed or transferred to the external process 1240, which copies the actual blob data to the receiving side as well as the additional information described in temporary table 1222. It should be noted that in this particular embodiment, the external process 1240 is a
C++ program with library calls to facilitate the transfer of data between the databases. However, it should be noted that this is an external process with regard to the database. In other words, in this particular embodiment the facilities used to transfer the data from the sending side to the receiving side are external with respect to the database. In this particular embodiment, "external" generally refers to the fact that the external process 1240 executes outside of the Oracle process space. Certain tasks must be performed by the external process in order to transfer the data from the sending side to the receiving side. For example, the external process must connect to each of the databases in order to access and transfer the data. This is in contrast to the Database Link or text channel, which is internal to the database and for which no such connections are implied. In other words, the routines
which perform the data transfer of the text are internal to the database, and data copying, for example, in this embodiment, is performed between remote databases as if they were local copies. The precise way in which both the text and blob data transfers are performed within other preferred embodiments may vary with implementation and facilities available for communication and data transfer.
It should also generally be noted that the external process may copy blob data from multiple tables in which the associated field name may differ with each table. Therefore, the field name may also be included in table 1242. The external process uses this field name to retrieve blob data to be copied. Other embodiments may communicate this field name using other mechanisms.
The external process 1240 uses the data included in the temporary table 1242 to locate the blob data: the table name and field name identify where the blob data resides, and the identifier is then used to index into the named table to extract the actual blob data. This blob data is copied to the repository table 1226 on the receiving node by process 1240. In Figure 64, the repository table 1226 includes the blob data associated with advertisement identifier 104. This data is appended to already existing data in the repository 1226.
It should generally be noted that the transfer of the text data through a first data channel and the transfer of the blob data through an alternate or second multi-media data channel are performed asynchronously. When the receiving side has determined that all of the necessary data entities associated with a particular table or advertisement have been transferred successfully to the receiving side, the process of assembling the data into the advertisement page begins. It should also generally be noted that the data described in tables 1224 and 1226 are functionally equivalent to the data stored in table 1220. For example, table 1224 includes a blob pointer field which acts as an index into the repository table 1226, whereas table 1220 includes the actual blob data in a field. Thus, the use of the blob pointer field in table 1224 which acts as an index into the repository table 1226 performs the same function as the actual data in the blob data field of the data table 1220. What will be described in conjunction with FIGS. 65 and 66 is the integration process of the tables of the text and the blob data for the advertisement page identified by the sending identifier 104.
Referring now to Figure 65, shown is a block diagram of an embodiment of the tables resulting from the text data integration. In particular, table 1200 on the sending side remains the same as in previously described figures. On the receiving side, table 1216 data has been integrated and copied into the table 1218. The function of temporary table 1216 is generally to hold that text data associated with the relational mapping table which is transferred from the sending side to the receiving side until all of the data entities associated with the particular advertising page or table being transferred have arrived on the receiving side. At this point, the data integration on the receiving node begins.
The software on the receiving side performs a state integration process. The previously described task of integrating the data from temporary table 1216 into table 1218 is one such task performed by this integration software.
Referring now to Figure 66, shown is a block diagram of an embodiment of the data table 1220 whose contents have been transferred to the receiving side. The assembling software on the receiver side integrates the data from temporary table 1222 into table 1224. Additionally, a link is established in table 1224 to the data in table 1226 and the associated global identifier is removed. Each entry in table 1222 is copied into table 1224. In particular, the Id and Size fields are copied into table 1224 for identifiers 104, 105, and 106. The integration software then uses the global Id obtained from temporary table 1222 to index into the repository 1226 in search of a matching global identifier entry. When a matching global identifier is found in table 1226, the repository Id from table 1226 is copied into the blob pointer field (Blob Ptr) of table 1224. Subsequently, the global Id in table 1226 for the corresponding entry is reinitialized to an empty field. The resulting table 1226 shows this process as repeated for each entry in the previously described table 1222 from Figure 64.
Referring now to Figure 67, shown are method steps of one embodiment for assembling the blob data into the repository table. The steps described in Figure 67 generalize the method previously described in conjunction with FIGS. 64 and 66, where the data shown in Figure 64 is integrated and assembled into the tables on the receiving side, resulting in those displayed in Figure 66. Generally, at step 1250, the record identifier and table size are copied from the temporary blob table to the blob table.
At step 1254, the global identifier from the temporary blob table is used as an index into the repository table to find a matching global identifier. For this matching entry, as in step 1256, the repository identifier is copied from the repository table to the blob pointer field of the blob table. At step 1258, the global identifier field of the repository table is reinitialized. Performing the steps as described in Figure 67 results in the tables as displayed in Figure 66, representing the integrated or assembled blob table in which the blob data is integrated into the repository table 1226 as further described by the blob table 1224.
It should generally be noted that the files resulting from the copying of the text and the blob data as described in FIGS. 65 and 66 have a particular relationship. Generally, the sending and receiving sides for the text data have mirrored files. In this particular example, table 1200 and table 1218 are "mirror images" of each other. The temporary table 1216 is used in performing the transfer as a temporary table until all of the data for this particular data transfer has arrived on the receiving side. At that point, the data is integrated from the temporary table into the final resulting table 1218, so that table 1218 on the receiving side mirrors table 1200 on the sending side.
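The assembly steps of Figure 67 (steps 1250 through 1258) can be illustrated with a minimal Python sketch. This is not the implementation of the embodiment, which uses database tables; the class names, field names, and the error raised when no matching global identifier exists are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TempBlobRow:
    """One row of the blob temporary table (1222); names are hypothetical."""
    record_id: int
    size: int
    gid: str


@dataclass
class RepositoryRow:
    """One row of the repository table (1226); names are hypothetical."""
    repo_id: int
    gid: Optional[str]  # transient global identifier, emptied after linking
    data: bytes


@dataclass
class BlobRow:
    """One row of the blob table (1224); names are hypothetical."""
    record_id: int
    size: int
    blob_ptr: int  # repository identifier of the actual blob data


def assemble_blob_table(temp_rows: List[TempBlobRow],
                        repository: List[RepositoryRow],
                        blob_table: List[BlobRow]) -> List[BlobRow]:
    """Integrate temporary blob rows into the blob table."""
    # Step 1254: index the repository by its (non-empty) global identifiers.
    by_gid = {row.gid: row for row in repository if row.gid is not None}
    for temp in temp_rows:
        repo_row = by_gid.pop(temp.gid, None)
        if repo_row is None:
            # Assumption: a missing match is treated as an error.
            raise LookupError(f"no repository entry for global id {temp.gid!r}")
        # Steps 1250 and 1256: copy id and size, and store the repository
        # identifier in the blob pointer field.
        blob_table.append(BlobRow(temp.record_id, temp.size, repo_row.repo_id))
        # Step 1258: reinitialize the global identifier in the repository.
        repo_row.gid = None
    return blob_table
```

After the call, each blob table row points at its repository entry, and the repository rows that were linked no longer carry a global identifier.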
Regarding the multi-media or blob data on the sending side and the receiving side, the resulting tables 1224 and 1226, in combination, are functionally equivalent to the data described in the sending side in table 1220. In this particular embodiment, one of the reasons for not further merging the data of tables 1224 and 1226 is due to the fact that transferring blob data, including a copy of the blob data from table 1226 to be integrated into table 1224, requires the use of an external program in order to compress the tables further. This is due to the fact that in order to perform any transfer of data which is not text, an external program, similar to external program 1240, is generally used since a version of the database software, as in this embodiment, may not be capable of copying and directly manipulating non-text data as needed in performing data operations.
The tables which are described in the preceding figures and associated descriptions may have a different number of entries and fields particular to each implementation of the concepts which have been described herein. What has been described is a flexible and efficient technique for performing data transfers. In this particular embodiment, the data transfer is between two databases. The techniques described may be adapted and used within other applications and a variety of environments.
The overall technique is generally to copy the text and blob or multi-media data asynchronously on two separate channels. This data is copied from a first database to a second database. Initially, the data is located on the second database in a temporary location until all of the portions of the data associated with a particular data transfer arrive at the second database. When it has been determined that all portions of the data have successfully arrived on the second database, the assembly process of copying the data from the temporary locations and merging the information into other data tables is performed on the second database.
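The staging behavior described above, in which data from both channels waits in a temporary location until every portion has arrived, can be sketched in Python as follows. The class `TransferStaging`, the row layout, and the assumption that the receiver knows in advance how many text rows to expect are all hypothetical; the description does not state how completeness is determined.

```python
class TransferStaging:
    """Receiving-side temporary location for one data transfer (hypothetical)."""

    def __init__(self, expected_text_rows):
        # Assumption: the number of text rows in the transfer is known up front.
        self.expected_text_rows = expected_text_rows
        self.text_rows = []  # rows delivered by the text channel
        self.blobs = {}      # global id -> bytes delivered by the multimedia channel

    def on_text_row(self, row):
        # A text row may name a blob through its transient global id ("gid").
        self.text_rows.append(row)

    def on_blob(self, gid, data):
        # Blobs may arrive before or after the text rows that reference them.
        self.blobs[gid] = data

    def is_complete(self):
        if len(self.text_rows) < self.expected_text_rows:
            return False
        return all(r["gid"] in self.blobs for r in self.text_rows if r.get("gid"))

    def assemble(self):
        # Merge staged text and blob data only once everything has arrived.
        if not self.is_complete():
            raise RuntimeError("transfer incomplete; data stays in staging")
        return [
            {"id": r["id"], "text": r["text"], "blob": self.blobs.get(r.get("gid"))}
            for r in self.text_rows
        ]
```

Because neither channel waits on the other, the order of the `on_text_row` and `on_blob` calls does not matter; only `assemble` is gated on completeness.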
Generally, the foregoing technique for data transfer may be used in a variety of applications, such as for the data transfer between databases. In one embodiment, this technique is included in a system for online Interactive Yellow Pages, GTE Superpages, for the publication of multimedia advertisement content of GTE Superpages business customers. Generally, the GTE Superpages system includes two major components: (1) the server component, which serves versatile user requests for the information of more than 11 million businesses in the United States, and (2) the BackOffice component, which facilitates advertisement content creation, management and publication. Both these subsystems include databases where advertisement business information is persistently stored. The advertisement content produced or modified in the back office is published in the
Superpages by virtue of its transfer from the persistent storage in the back office to the persistent storage in the server. Generally, the business advertisement includes an integrated set of structured textual information, such as business name, address, and multimedia or blob data, such as graphics, video, audio, job applets. The data transfer technique described is generally a technique for transferring data using two data links between two databases. One of these data links is an internal data link with respect to the database; the second data link is an external data link with respect to the database. The internal data link is optimized for the structured text data transfer while the external one is optimized for the multimedia data transfer, such as the transference of data stored in binary objects in the database. This technique for data transfer generally alleviates the limitations of the existing database technology, which does not provide for the transferring of multimedia objects using the internal data link. Moreover, by using the two data links to transfer the various data types, performance and stability are improved over an alternative prior art approach which uses only the external link for transferring both text and multimedia or blob data.
Generally, the transfer technique includes four collaborative processes: a process on a sending component which decomposes data structures and the like into text and non-text components, assigning transient tags to the non-text components; two asynchronous transfer processes, one per data type, that each transfer, respectively, text and non-text components to a receiving component; and a process on the receiving component that reassembles transferred data and replaces transient tags with persistent unique tags.
This technique uses a multimedia data repository table which is created and maintained in the receiving component, such as the receiving database in this embodiment. Once the data is transferred, the non-text or multimedia data items are stored in this repository with transient tags. Using the transient tags, the reassembly process correlates the text tables with the multimedia objects and replaces them with persistent unique tags, thus leading to the reintegration of the transferred data.
The previously described technique includes features which provide for efficient decomposition and reassembly of data for efficient data transfer, as between two databases. Additionally, the multimedia repository serves as a vehicle for the reassembly of decomposed data items which are reassembled on a receiving component, such as a receiving database.
[Incremental Update]
In paragraphs that follow, a description is provided of an incremental update procedure as performed upon the various databases included in the Front End Server component 804. The data in the BackOffice component 818 may be updated, for example, on a daily basis. These deltas or changes to this database in the BackOffice component are subsequently also applied to the copy of the database in the Front End Server component. It should generally be noted that in this application, as in the GTE Superpages online system, the number of transactions or updates to a database ranges from 30,000 to a half a million on a daily basis in accordance with the required data updates for the existing database. However, the techniques which will be described in paragraphs that follow may be applied to different systems with different transaction throughput and tuned in accordance with each particular implementation. Generally, this update technique is used to provide data updates for both native and foreign sources, and on-line updates, as described in accordance with data processing techniques in other sections of this application.
Generally, data updates to the databases included in the Front End Server may first be integrated into the BackOffice component. Subsequently, these data modifications may be "pushed" to the Front End Server and integrated into the various data stores included therein, as will be further described in more detail in following sections. Generally, in this embodiment, data updates may originate from several sources, including native and foreign source updates, and on-line data entry, such as through an Internet connection via a browser. The native and foreign source updates may generally be characterized as larger updates or data integration efforts. These are generally described in other sections of this application. The on-line data entry technique for updating information that may be included in the BackOffice component may be performed as previously described through the menus initially displayed to a user, such as at the GTE Superpages Internet site, that provide access to the BackOffice component data information. The data integration techniques, as related to the foreign and native source updates to integrate the data updates into the BackOffice component, are generally more detailed and involved than the integration of the on-line specified modifications. In the former case, the data updates may generally be a large number of data modifications requiring more computer resources than in the latter case. Thus, for example, the on-line modifications may be incorporated on a daily or other predetermined time period using some data enhancement techniques as described in other sections of this application. Other data updates may require additional time and computer resources and may not be able to be completed, for example, during non-peak usage, such as overnight on a daily basis. Thus, additional planning and different processing techniques may be used with the various types and volume of data updates as included in each embodiment.
Once the data modifications are incorporated into the BackOffice component, the data updates, including the updates to advertisement data and other data associated with each business listing, may be propagated to the Front End Server component. The non-text or multimedia data, for example, as included in advertisements with image files, may be transferred to the Front End Server from the BackOffice using multimedia transfer techniques, as generally described in other sections of this description. The updates to the Primary Database included in the Front End Server may be communicated as a table of commands created in the BackOffice component and transferred, as by a network connection, to the Front End Server. Generally, in this embodiment, the table created in the BackOffice includes an application developed command language corresponding to the various types of record updates and modifications that may be included in this particular embodiment. Each of these commands may be further translated in the Front End Server into one or more actual database commands that perform the table operation. For example, an entry in the table of database update commands may be specified as follows:
COMMAND     RECORD #     OPTIONAL DATA
DELETE      1-5
In the above example table, three fields of data may be included. The Command field specifies the type of data command. The Record # field identifies the records in the Primary Database to which this command applies. The Optional Data field includes data that may be related to the specified command. For example, if the command were an update, the data field may specify the data which is to be included in the records specified. In the above example, the command is to delete records 1-5. This single table command may be translated, for example, by software included in the Primary Database, into 5 database commands in accordance with the particular database software. The software which builds the table in the BackOffice and translates the commands into one or more database commands may be developed using a commercially available software system that is capable of communicating with the underlying database to perform the required operations. It should be noted also that the entire table may be transferred from the BackOffice to the Front End Server, or it may be divided into sections and updates performed for each section. Additionally, each command may be sent as a separate message in other embodiments in accordance with the number of updates and other associated computer resources and costs for each data transaction. This may vary with implementation.
Referring to Figure 31, shown is an embodiment of a dependency graph for performing the various processes in an incremental update. At step 1600, the BackOffice data transfer must complete prior to beginning the update to the database in the Front End Server component. The BackOffice data transfer is complete when multimedia and text data have been transferred from the BackOffice component, such as data required when updating an advertisement page. Additionally, other information from the BackOffice component is transferred to the Front End Server component 804, such as in the form of an operational table.
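The translation of a single table command such as "DELETE 1-5" into several database commands can be sketched in Python as follows. The function name, the table and column names (`listing`, `record_id`, `data`), and the bind-parameter style are hypothetical; the embodiment's actual command language and generated database commands are not given in this description.

```python
def expand_command(command, record_spec, data=None):
    """Translate one command-table entry into per-record statements.

    Returns a list of (statement, parameters) pairs. The statement text
    and the table/column names are illustrative assumptions only.
    """
    # A record spec is either a single number ("7") or a range ("1-5").
    first, _, last = record_spec.partition("-")
    record_ids = range(int(first), int(last or first) + 1)

    if command == "DELETE":
        return [("DELETE FROM listing WHERE record_id = :id", {"id": rid})
                for rid in record_ids]
    if command == "UPDATE":
        return [("UPDATE listing SET data = :data WHERE record_id = :id",
                 {"id": rid, "data": data})
                for rid in record_ids]
    raise ValueError(f"unsupported command: {command}")
```

With this sketch, `expand_command("DELETE", "1-5")` yields five statements, one per record, matching the example in which one table command becomes 5 database commands.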
The operational table may include information about the updated normalized data, which has been applied to the BackOffice component, and which is now to be applied in this incremental update procedure to the Primary Database copy of the normalized data. At step 1602, an initialization procedure may be executed to synchronize the beginning of the update procedure for the steps that will be described in paragraphs that follow. As indicated by Figure 31, steps 1604, 1606, and 1608 may be performed independently and at the same time as steps 1610 through 1620. The coordinating point labeled DB Prep at step 1622 serves as the coordinating point for the different procedures performed in updating the database on the Primary Database, and the local copies of necessary files, such as the
Term list identifiers, located on each of the server nodes. At step 1604, the various advertisements are extracted from the data tables, such as those transferred from the BackOffice component in the multimedia and text data transfer. At step 1606, the various advertisement pages are packaged and made into a complete advertisement page to be stored in the Constructed Ad Repository 842. At step 1608, the constructed ads are transferred and included in the Constructed Ad Repository. It should be noted that in this embodiment the existing copy of the Constructed Ad Repository is updated in accordance with those particular ads which have changed. Thus, the Constructed Ad Repository is updated on a delta or change basis. Simultaneously, steps 1610 through 1620 may be performed in conjunction with steps 1604 through 1608. This may be done, for example, in a parallel fashion. Steps 1610 through 1620 indicate the process by which the various identifiers and other files associated with the Primary and Secondary databases are updated. Steps 1604 through 1608 reflect the updating of the Constructed Ad Repository 842 on an as-needed basis in accordance with changes which have occurred in the advertisements.
At step 1610, various changes to the Term list identifiers are extracted. In other words, it is determined at step 1610 which identifiers in the Term lists need to be updated in accordance with the changes transferred from the BackOffice component. This is described in more detail in paragraphs that follow. At step 1612, these various identifier updates are packaged. At step 1614, these various identifier changes are transferred to each of the server nodes. In this embodiment, the actual data transferred at step 1614 are the raw operational commands as may be supplied by the BackOffice component to be applied to the existing Term lists. At step 1616, at each node, a working copy is made of the existing Term lists. At step 1618, on each of the server nodes, the changes are made to the working copy local to each server node. At step 1620, the updated term list is installed. At this point, the updated term list is not yet available for public use in the sense that it is published. However, a new version of the Term lists has been created which includes the updated information as supplied in the transfer step 1614.
At step 1622, database preparation steps are performed. Step 1622 serves several purposes. First, it is a coordination point for the updates of the various ads, as well as the various term list identifiers. Secondly, step 1622 serves as a step within which the normalized Primary Database information is propagated from the normalized copy of the Primary Database to a denormalized form in the Primary Database and the denormalized form in the Secondary Database. In other words, the changes which are transmitted from the BackOffice component and reflected in the normalized Primary Database copy are now further propagated to the denormalized Primary Database and the denormalized Secondary Database copy. Additionally, at step 1622, as part of the database preparation, the validity of the transactions and updates is verified such that at step 1626 the database knows it may fully commit to performing the update to the denormalized copies as used in performing user queries. Steps 1624, 1630, and 1626 may be performed in parallel. After the database preparation of step 1622, the ads may actually be published as in step 1624, in which the updated copies of the Constructed Ad Repository are actually made available for use. Additionally, any updated images as stored in the Image Repository are also available for use. At step 1630, the identifiers included in the Term lists, as previously installed in step 1620, are published. The publication of the various identifiers included in the Term lists generally means that the Term lists are available for use, as by the Query Engine. At step 1626, which may be performed in parallel with the steps of publishing the ads and publishing the identifiers, the database commits to performing the update.
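The dependency structure of Figure 31, as described above, can be expressed as a small graph and grouped into sets of steps that may run in parallel. The following Python sketch uses the standard library `graphlib` module (Python 3.9 or later). The edges are a reading of the text rather than a reproduction of the figure; in particular, treating steps 1610 through 1620 as strictly sequential is an assumption.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it may start.
DEPENDENCIES = {
    1602: {1600},                                  # init after BackOffice transfer
    1604: {1602}, 1606: {1604}, 1608: {1606},      # ad extract, package, transfer
    1610: {1602}, 1612: {1610}, 1614: {1612},      # identifier extract, package, transfer
    1616: {1614}, 1618: {1616}, 1620: {1618},      # working copy, apply, install
    1622: {1608, 1620},                            # DB Prep coordination point
    1624: {1622}, 1626: {1622}, 1630: {1622},      # publish ads, commit, publish ids
}


def execution_waves(dependencies):
    """Group steps into waves; steps in the same wave have no mutual dependency."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Under these assumed edges, the ad steps and the identifier steps advance side by side after step 1602, step 1622 waits for both branches, and steps 1624, 1626, and 1630 form the final wave.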
It should generally be noted that steps 1614 through 1620 are performed independently for each server node in this embodiment. Additionally, the actual amount of processing performed on the Term lists varies in accordance with the number of updates or transactions, as will be described in conjunction with Figure 32.
Referring now to Figure 32, shown is one embodiment of the various method steps for performing update steps in accordance with a particular number of update transactions as sent from the BackOffice component 818. At step 1634, a determination is made as to the number of update transactions. This determination involves a comparison with two threshold values, each describing a particular threshold number of transactions. Generally, THRESHOLD 1 describes a relatively small number of transactions. In this particular embodiment, a relatively small number of updates generally refers to less than 30,000 update transactions. Also specified is a THRESHOLD 2 value which generally represents a second, larger number of transactions. In this particular embodiment, THRESHOLD 2 represents approximately half a million transactions or update entries, which corresponds to approximately five to ten percent of the number of records included in the Primary Database. Generally, as described in conjunction with Figure 32, one of three update techniques may be applied. If the number of update transactions as determined at step 1634 is less than THRESHOLD 1, or a relatively small number of updates, steps 1636 and 1638 are executed. In step 1636, the normalized Primary Database is updated. Generally, this is performed at step 1602 of Figure 31, in which the copy of the normalized Primary Database is updated in accordance with the operational table as transferred from the BackOffice component indicating the actual database update operations. At step 1638, due to the relatively small number of transactions required, the actual identifiers of the Term lists are updated. In other words, the Term lists are updated as opposed to being rebuilt. At step 1634, if a determination is made that the number of transactions is greater than or equal to THRESHOLD 1, and also less than the greater threshold, THRESHOLD 2, steps 1640 and 1642 are executed. At step 1640, the Primary Database is updated, as previously described in conjunction with step 1602, in which the normalized copy of the Primary Database is updated. At step 1642, all of the identifiers as included in the Term lists are rebuilt. In this particular embodiment, both identifiers and markup files are rebuilt due to the use of the markup files by the Verity Information Retrieval software. As previously described in conjunction with Figure 25, the Extraction Routines are executed to again produce the markup language files and various update records needed to update the denormalized data of the Primary Database. In step 1642, the Information Retrieval software is executed to produce entire new sets of the Term lists. Step 1642 is in contrast to step 1638: the Term lists are rebuilt in step 1642, whereas they are updated in place in step 1638.
If a determination is made at step 1634 that the number of update transactions is greater than or equal to the larger threshold, THRESHOLD 2, step 1644 is executed. At this point, a determination has been made that the number of update transactions is so large that it has been deemed more efficient to rebuild the entire database and associated files, rather than update or patch the existing database and associated files, as in updating the identifiers of the Term lists of step 1638.
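The three-way decision of Figure 32 reduces to two comparisons, sketched below in Python. The threshold values are those given for this embodiment (about 30,000 and about half a million transactions); the function name and the returned labels are hypothetical.

```python
THRESHOLD_1 = 30_000    # "relatively small" number of update transactions
THRESHOLD_2 = 500_000   # approximately half a million update transactions


def choose_update_strategy(num_transactions,
                           threshold_1=THRESHOLD_1,
                           threshold_2=THRESHOLD_2):
    """Pick the update technique for a batch (step 1634)."""
    if num_transactions < threshold_1:
        # Steps 1636 and 1638: update the database, patch the Term lists.
        return "update_term_lists"
    if num_transactions < threshold_2:
        # Steps 1640 and 1642: update the database, rebuild the Term lists.
        return "rebuild_term_lists"
    # Step 1644: rebuild the entire database and associated files.
    return "rebuild_database"
```

Keeping the thresholds as parameters reflects the statement that the actual values may be tuned for a particular system.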
The previously described procedure of performing a multimedia data transfer is used to transfer, for example, the multimedia and text data associated with ads, as may be included in the Constructed Ad Repository 642 and Image Repository 842 of Figure 4. The granularity of change which requires an entire advertisement page to be replaced in the Constructed Ad Repository is a single component within an ad page. If a single component has changed, the entire ad page is reconstructed and replaced in the Constructed Ad Repository 842. For other systems, a different granularity of change may be used. Generally, as previously described, the various markup files and Term lists are built as needed in accordance with the number of transactions as described in conjunction with Figure 32. The actual threshold values may be determined in accordance with tuning of a particular system and the size of the database and the number of transactions in each particular system. In this particular embodiment, the databases included in both the Front End Server and the BackOffice component are Oracle™ databases. The Oracle™ procedural language, PL/SQL, may be used to read the operational table and perform the updates as needed to the normalized form of the data as stored in the Primary Database included in the Front End Server component. Similarly, the same procedural language may also be used to update the denormalized Primary Database copy and the denormalized form of the data as stored in the Secondary Database. Other embodiments may employ other techniques to update both the Primary and Secondary databases in accordance with a particular implementation. In this particular embodiment, the previously described incremental update procedure is one that is generally used to perform daily updates. However, in other embodiments, the same procedure may be used on a larger time period of transactions or updates.
Due to the volume and size of the previously described embodiment, this procedure is one which performs well when performed on a daily basis. For other systems which may perform a similar number of transactions for a larger time period, the previously described techniques may also be used.
In this particular embodiment, as may be included in the BackOffice component, the various updates to a particular record or for a particular business or service may be collapsed before actually issuing the various database commands to perform the updates. In other words, within a certain amount of time, such as within five hours, a single record may be inserted, deleted and modified dozens of times. The end result of these modifications for the small time interval may be no net modification or amendment to a particular record. Thus, one optimization, as may be included in the BackOffice component in a preferred embodiment, may collapse various updates associated with a particular record or business before actually issuing commands which perform a database update as applied to the copies in the BackOffice 818 and Front End Server 804 components. Generally, this may be determined by using a finite state machine with the states of "insert", "delete", and "modify". If the same record, for example, is modified twice and then deleted, the net result is that only a "delete" database command should be issued, rather than two updates followed by a delete.
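By way of illustration only, the collapsing of updates described above may be sketched as a small state machine. The transition table and the names `TRANSITIONS` and `collapse` are hypothetical; only the case of two modifications followed by a delete is taken from the description, and the remaining transitions are assumptions about one reasonable behavior.

```python
from typing import Iterable, Optional

# Net-effect transition table: (pending net operation, next operation) -> new
# net operation. None means no database command is pending for the record.
# Only the "modify, modify, delete -> delete" case is stated in the text; the
# remaining transitions are illustrative assumptions.
TRANSITIONS = {
    (None, "insert"): "insert",
    (None, "modify"): "modify",
    (None, "delete"): "delete",
    ("insert", "modify"): "insert",  # a new record, carrying the latest values
    ("insert", "delete"): None,      # created then removed: nothing to issue
    ("modify", "modify"): "modify",
    ("modify", "delete"): "delete",
    ("delete", "insert"): "modify",  # an existing record replaced in place
}


def collapse(operations: Iterable[str]) -> Optional[str]:
    """Return the single command to issue for one record, or None for a no-op."""
    state: Optional[str] = None
    for op in operations:
        key = (state, op)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected operation {op!r} after {state!r}")
        state = TRANSITIONS[key]
    return state
```

In such a sketch, the sequence "modify", "modify", "delete" collapses to a single "delete", and an "insert" followed by a "delete" collapses to no command at all.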
Also, in this particular embodiment, the contents of the Page Cache 848 and the Query Cache 850 are reinitialized when an update is performed, as in performing the incremental update procedures described in conjunction with Figures 31 and 32. The data included in the PHTML execution tree is also reinitialized.
A failure may occur when performing any of the steps associated with Figures 31 and 32. If a failure occurs when performing certain steps, then a recovery procedure may be performed. In this particular embodiment, a failure may occur, for example, when using the Information Retrieval software, as depicted in conjunction with Figure 25. This may be due, for example, to a problem, such as a software bug, with the Information Retrieval software 908. For example, an error may occur when extracting the identifiers associated with step 1610. Generally, step 1610 as previously described includes building the Term lists as determined in accordance with the number of update transactions in accordance with Figure 32. If an error occurs, for example, when producing or rebuilding the identifiers in the Term lists as in performing step 1642 and step 1644, it may be a recoverable error if another node has successfully built the identifier files, for example. In this instance, where there has been a successful build of the various identifiers on another server node, a recovery procedure may be to copy the updated version of the Term lists from that node to another node which has been unsuccessful in building the Term lists. This copy may occur, for example, after a predetermined number of builds of the Term lists on a particular node have failed. In this particular embodiment, this has been determined to be a recoverable error for which an alternative step or technique may be applied to also achieve the end result of the updated Term lists. Other embodiments of the invention may also include other alternative techniques in accordance with those steps associated with a particular system which it determines to be recoverable. In the previously described embodiment, the update techniques may be included in a distributed computing system having multiple data representations as stored in a plurality of server nodes.
The foregoing techniques provide for synchronized updates of the various data stores in the plurality of server nodes.
[Targeted Banner Advertisements]
User query information may be used to influence the displays shown to the user by the browser 824. In addition to displaying matching categories or business listings, as depicted in Figure 44, the information retrieval software 908 can be used to assist in selecting other information to be displayed to the user, based on the nature of the user's query.
In an embodiment of the invention, a banner ad 50 can be displayed to the user. Based on the user's query, the banner ad 50 may be targeted to characteristics of the user that are inferred from the user's query. For example, an advertiser might conclude that a user who has entered a query with the category "art supplies" is interested in art, so that an advertisement for an art show or related matter would be an appropriate banner ad 50.
Banner ads 50 can also be targeted geographically, so that ads for businesses from a selected geographical area can be associated with search queries that include that geographical area as a search term. It should be understood that a system for targeting banner ads using user queries can use a range of information retrieval techniques, such as the Verity techniques described above in connection with processing of information retrieval requests using the term lists 836. However, in an embodiment, a separate banner ad retrieval program 909 is part of the query engine 862.
Initialization steps that permit execution of a banner ad retrieval program 909 are set forth in a flow chart 52 on Figure 68. Upon initialization, at a step 54, the system initiates the banner ad retrieval software 909. At a step 56, the banner ad retrieval software 909, in a manner similar to the information retrieval software 908, uses extraction routines to access markup language files and extract data. The banner ad retrieval software then generates banner ad term lists 837. At a step 66, the banner ad retrieval software retrieves a list of all yellow pages categories. In an embodiment, the categories are all of the available categories of business listings, such as all available yellow pages categories. Next, at a step 68, the system establishes a set of super-categories. The super-categories may consist of a sub-set of the categories or other categories. The super-categories are preferably smaller in number than the categories, as the super-categories will be used to simplify assignment of targeted banner ads to particular user queries and results of the queries. Next, the system may map categories to super-categories in a step 70. The mapping at the step 70 may be a many-to-many mapping. A variety of techniques may be used to map categories to super-categories. One such technique uses a combination of automatic and manual mapping. Steps for accomplishing such a technique are set forth in a flow chart 73 depicted in Figure 69. First, at a step 104, it is determined for a first yellow pages category whether the category is to be manually assigned. If so, then at a step 106 the category is assigned to a super-category. This may be accomplished by user input in a conventional form. Next, at a step 108, it is determined whether any unassigned categories remain. If at the step 108 additional categories remain, then control returns to the step 104, where it is determined whether the next category is to be manually assigned. If at the step 108 no categories remain to be assigned, then control is returned, as represented by off-page connector B, to the flow chart 52 of Figure 68.
If at the step 104 it is determined that the category will not be assigned manually, then it is determined, at a step 110, whether there remain any additional categories to be assigned. If so, then at a step 112, the category is skipped and processing proceeds to the next category at the step 104. Thus, all categories that are to be assigned manually may be assigned prior to automatic assignment of categories.
If at the step 110 it is determined that no additional categories exist, then all categories to be assigned manually have been assigned, and control proceeds to a step 114, where the system returns to the first category that was not manually assigned, and it is determined whether the category will be assigned automatically based on the manual assignments. If at the step 114 it is determined that the category will be assigned automatically based on the manual assignments, then, at a step 116, the system may compare terms that appear in the category to terms that appear in each of the manually assigned categories. The system may thus obtain a ranking of the manually assigned categories in order of the degree of co-occurrence of terms. Next, at a step 118, the system may assign the same super-category as was assigned to the highest-ranked of the manually assigned categories. Next, at a step 120, the system may determine whether there are any additional categories. If not, then control passes, as depicted by off-page connector B, to the flow chart 52 of Figure 68. If additional categories remain, then control proceeds to the step 114 for the next category.
If at the step 114 for a particular category it is determined that a category will not be automatically assigned based on the manual assignments, then at a step 122 a determination is made whether additional categories remain to be assigned. If so, then at a step 124 processing skips to the next category and control is returned to the step 114 for the next category. Thus, after manual assignment of all categories that are to be manually assigned is complete at the steps 104 through 106, all categories that are to be automatically assigned based on the manual assignments may be completed at the steps 114 through 118 before control proceeds to the step 126. At the step 126, processing returns to the first remaining category that was not previously assigned. At a step 128 the system may determine certain statistics regarding the co-occurrence of terms between the category and one of the super-categories (perhaps also including the terms in the categories assigned to the super-categories). A variety of co-occurrence techniques can be used. At a step 130 the system may assign the category to the super-category for which the highest co-occurrence is found. At a step 132 it is determined whether additional categories remain to be assigned. If not, then control proceeds, as represented by off-page connector B, to the flow chart 52 of Figure 68. If so, then control proceeds to the step 126 for processing of the next un-assigned category. Although an embodiment of a technique for mapping categories to super-categories is disclosed herein, it should be understood that other techniques are available. For example, manual mapping could be executed after all automatic mapping is completed, or the system could rely entirely on automatic mapping.
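The automatic assignment of the steps 114 through 118 may be illustrated by the following minimal sketch. It assumes that the degree of co-occurrence is measured simply as the number of terms shared between two category names; the description leaves the co-occurrence technique open, so this measure, the tokenization, and the names `tokenize` and `assign_automatically` are hypothetical.

```python
from typing import Dict, Optional, Set


def tokenize(name: str) -> Set[str]:
    """Lower-case a category name and split it into a set of terms."""
    return set(name.lower().replace("&", " ").split())


def assign_automatically(category: str, manual: Dict[str, str]) -> Optional[str]:
    """Pick the super-category of the manually assigned category sharing the
    most terms with `category`; return None when no terms are shared.

    `manual` maps a manually assigned category name to its super-category.
    Shared-term count is an assumed stand-in for the co-occurrence statistic.
    """
    terms = tokenize(category)
    best_name, best_overlap = None, 0
    for name in manual:
        overlap = len(terms & tokenize(name))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return manual[best_name] if best_name is not None else None
```

A category for which this sketch returns None would correspond to one left for the later processing beginning at the step 126.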
Once control has returned to the flow chart 52 of Figure 68, meaning that all yellow pages categories have been mapped to a super-category, at a step 77 the banner ad retrieval software 909 may index the various super-categories in a banner ad term list 837. The banner ad term list 837 may take the form of a linked list of the super-categories, with each element in the list consisting of all of the terms that appear in the super-category, as well as all of the terms that appear in each of the categories that was matched to the super-category. It should be understood that these terms may be expanded, as described in connection with Figure 40 above, so that synonyms and related terms are also stored with each super-category element. Storage of these terms may be in a hierarchical structure that is capable of execution using PHTML scripts or similar techniques.
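As a hedged illustration of the indexing at the step 77, the sketch below builds one list per term, each entry naming a super-category and the number of times the term appears in that super-category's document (its own name plus the names of the categories mapped to it). Python lists stand in for the linked lists, term expansion with synonyms is omitted, and the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(name: str) -> List[str]:
    """Lower-case a name and split it into terms, keeping repeats."""
    return name.lower().replace("&", " ").split()


def build_term_lists(
    super_categories: Dict[str, List[str]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each term to a list of (super-category, term frequency) entries.

    `super_categories` maps a super-category name to the category names
    that were mapped to it.
    """
    term_lists: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for super_category, categories in super_categories.items():
        words = tokenize(super_category)
        for category in categories:
            words.extend(tokenize(category))
        for term, frequency in Counter(words).items():
            term_lists[term].append((super_category, frequency))
    return dict(term_lists)
```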
Next, at a step 72 the system may match one or more banner advertisements to each super-category. Thus, if that super-category is found to be the appropriate super-category, the matching banner ad or ads will be displayed.
At any time after initialization of the system, the system may generate a banner ad for display to the user. The banner ads may be stored on a server, which in an embodiment is a separate banner ad server 809. Depending on the desires of the host, the banner ads may be either conventional banner ads or targeted banner ads. In the case of conventional banner ads, the banner ad server 809 may store the banner ads in a conventional manner and cycle between different ads according to a predetermined routine, such as a round-robin routine, so that when the system calls for a banner ad (such as via an appropriate URL for the banner ad server), the current banner ad is sent to the front end server 804 for further processing and display to the user in a banner on the user's browser 824. If a targeted banner ad is desired, then the banner ad retrieval software 909 may be initiated. Steps that may be accomplished by an embodiment of the banner ad retrieval software 909 are depicted in a flow chart 132 as shown in Figure 70. First, at a step 60, the banner ad retrieval software 909 obtains the user's query. Next, at a step 62, the banner ad retrieval software obtains the categories that match the user's query. These categories may be the categories that are obtained by the information retrieval software 909 in response to a user query. For example, if the user enters a query for "art supplies," as depicted in Figure 43, the user might retrieve a list of matching categories, such as the eight matching categories depicted in Figure 44. In an embodiment, the categories are those that were displayed as a results page in the flow chart 88 at the step 102 in Figure 41. That is, the categories are yellow pages categories of each of the business listings retrieved in the information retrieval query that was executed by the system.
Once a list of categories is obtained at the step 62, a variety of techniques could in theory be used to identify a banner ad for the category. For example, an advertisement could be assigned to each category. Thus, referring to Figure 44, the category "Arts & Crafts" could be assigned a particular banner ad (or set of scrolling banner ads), while the category "Artists Materials & Supplies" could be assigned a different banner ad or ads. This approach presents a number of problems. First, the number of actual yellow pages categories is very large, more than seventeen thousand in an embodiment of the system disclosed herein, so that the process of assigning ads to categories on a one-to-one basis would be extremely time consuming and laborious. Also, because advertisements often include time-sensitive material, they are changed frequently, meaning that the ongoing process of assigning ads to categories could be very difficult. Since many of the categories are quite similar to each other, as in the above example of "Arts & Crafts" and "Artists Materials & Supplies," it is instead preferable to assign ads to super-categories, as was disclosed in connection with Figure 68.
Another problem with an approach of matching advertisements directly to categories is that additional information about the user's preferences may be available from the user query. A system that relies only on the categories ignores any information from the user query that might permit further refinement of the advertisement selection. Referring to Figure 70, once the banner ad retrieval software 909 has obtained the terms in the user query and the terms in each of the matching categories, the terms may be weighted or normalized by the number of occurrences of the terms and the number of listings in which a term occurs in a step 74.
Next, at a step 79, the banner ad retrieval software 909 may locate the particular terms that appear in the user query and in the categories obtained at the steps 60 and 62 in the banner ad term lists 837. Location of a relevant term list 837 may be accomplished through use of a table of pointers or other conventional technique. In the case of use of a table, the argument of the table may consist of a tokenized version of the term and the table may point to the location of the linked term list 837 for that term in the database that stores the banner ad term lists 837.
Referring to Figure 71, a structure for a linked banner ad term list 837 is depicted, in which a linked list of super-categories is depicted. One linked list may be established for each term that appears in a user's query or in a category, such as a yellow pages category, retrieved by the information retrieval software 909. Thus, for a given term, such as "restaurant," a linked list 837 of super-categories was established at the initialization step 77 depicted in the flow chart 52 of Figure 68. The linked list may link elements 74, with each element 74 corresponding to a document (a document in this case consisting of all of the words in a particular super-category, plus all words in the categories mapped to the super-category) that includes the term. The elements 74 may include sub-elements, including a document identifier 76 for identifying the category and certain statistics regarding the document, including the term frequency 78, TF, which indicates the number of times the term appears in the document, and the inverse document frequency 80, IDF, which indicates the inverse of the number of times the term appears in the entire set of documents that are being searched. From the table of linked lists of super-category terms established in the step 77, the banner ad retrieval software 909 may at a step 81 rank the super-categories. In particular, the system at the step 81 may rank the documents, i.e., the super-categories, according to the appearance of the words occurring in the user query and in the categories.
The ranking may be performed by a variety of techniques. One such technique obtains a number for each term that appears in the user query and in the categories that consists of the product of the term frequency for that term and the inverse document frequency for that term. The sum of all the resulting numbers may be calculated for all super-categories, and the super-category with the highest sum may be the highest ranked document. The banner ad that was assigned to that highest ranked super-category at the step 72 of the flow chart 52 can then be displayed upon completion of the ranking step 81 of the flow chart 132.
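The summation described above may be sketched as follows. The sketch assumes term lists in the form of a mapping from each term to (super-category, term frequency) entries, and takes the raw inverse document frequency to be the inverse of the total number of times the term appears across all documents, following the description of Figure 71. The function name `rank_super_categories` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TermLists = Dict[str, List[Tuple[str, int]]]


def rank_super_categories(
    query_terms: Iterable[str], term_lists: TermLists
) -> List[Tuple[str, float]]:
    """Score each super-category by the sum over query terms of TF x IDF and
    return (super-category, score) pairs, highest score first."""
    scores: Dict[str, float] = defaultdict(float)
    for term in query_terms:
        postings = term_lists.get(term, [])
        total_occurrences = sum(tf for _, tf in postings)
        if total_occurrences == 0:
            continue  # the term appears in no super-category document
        idf = 1.0 / total_occurrences  # raw IDF as described for Figure 71
        for super_category, tf in postings:
            scores[super_category] += tf * idf
    # Ties are broken alphabetically so that the ordering is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The first entry of the returned list would identify the super-category whose assigned banner ad is displayed.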
Other techniques for weighting may also be used. For example, if a term is a high frequency term, it may not make much difference in logical significance whether the term occurs, for example, one thousand times in the search, or whether the term occurs one million times. In order to collapse the significance of such high frequency terms, it may be desirable to use a logarithm or related measure of the term frequency and the inverse document frequency, rather than the raw numbers. Thus, the inverse document frequency may be defined as:
IDF = log(N - IDF)/log(N) where N is the number of documents in the document set and the IDF inside the logarithm is the raw inverse document frequency number. Similarly, a statistic can be used to determine the term frequency, TF. A statistic known as Robertson's term frequency for a document is defined as follows:
RTF = TF/(TF + 0.5 + 1.5(DL/ADL)) where TF is the raw frequency of a term in a document, DL is the length of the document, and ADL is the average length of a document in the search.
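For illustration, Robertson's term frequency as defined above may be computed as in the following sketch, which also shows how the statistic flattens the difference between very large raw frequencies. The function name `robertson_tf` is hypothetical.

```python
def robertson_tf(tf: float, doc_length: float, avg_doc_length: float) -> float:
    """Robertson's term frequency: TF / (TF + 0.5 + 1.5 * (DL / ADL))."""
    if avg_doc_length <= 0:
        raise ValueError("average document length must be positive")
    return tf / (tf + 0.5 + 1.5 * (doc_length / avg_doc_length))


# For a document of average length, a raw frequency of one thousand and a raw
# frequency of one million both yield values just below 1.
high = robertson_tf(1_000, 10, 10)
very_high = robertson_tf(1_000_000, 10, 10)
```

Because the statistic approaches but never reaches 1, a term that occurs one million times contributes only marginally more than a term that occurs one thousand times.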
These statistics may be further improved by weighting other factors. For example, it is possible to weight each term that appears in one of the categories that is retrieved upon execution of a user query and to normalize the IDF and RTF statistics over the weights. Thus, if a particular category deserves a higher weight, then it might be accorded higher weight in ranking super-categories. For example, a category that is manually mapped to a super-category might be given a higher weight than a category that is automatically mapped. The user query might be given a higher or lower weight than other information. Categories with a large number of listings may be given higher weight. In an embodiment, each category is given a weight corresponding to the number of listings that are associated with the category, normalized by dividing by the total number of listings. In an embodiment, the user query terms are each given a weight of one. In the weighting process, the weight may be multiplied by the term element in performing the sum of the product of term frequency and inverse document frequency over all terms for all documents in the super-category linked list. Thus, with the weights, a normalized version of the Robertson's term frequency statistic can be obtained, permitting improved tuning of search queries beyond what is accomplished with use of the conventional Robertson's term frequency.
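One possible reading of the weighting embodiment above is sketched below: each user query term receives a weight of one, and each term from a retrieved category receives that category's number of listings divided by the total number of listings. The sketch assumes that the total is taken over the retrieved categories only, which the description does not state, and the names `weighted_query` and `category_listings` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def weighted_query(
    query_terms: Iterable[str], category_listings: Dict[str, int]
) -> Dict[str, float]:
    """Build the modified query as a mapping from term to accumulated weight.

    `category_listings` maps each retrieved category name to its number of
    listings. Normalizing over the retrieved categories is an assumption.
    """
    total = sum(category_listings.values())
    weights: Dict[str, float] = defaultdict(float)
    for term in query_terms:
        weights[term.lower()] += 1.0  # user query terms: weight of one
    for category, listings in category_listings.items():
        weight = listings / total if total else 0.0
        for term in category.lower().replace("&", " ").split():
            weights[term] += weight
    return dict(weights)
```

Each resulting weight could then multiply the corresponding product of term frequency and inverse document frequency when the super-categories are ranked.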
Upon completion of the ranking step 81, the highest ranked super-category is selected, and a banner ad that was assigned to that super-category at the step 72 of the flow chart 52 of Figure 68 is selected. The banner ad may be retrieved, such as via a URL, from the banner ad server 809, for display to the user via the browser 824.

Claims

1. A method executed in a computer system for ranking super-categories used in performing a data query, the method comprising: establishing a super-category term list for each term appearing in one of a super-category and a category of document to be searched, each element of said super-category term list including terms in the super-category and terms in categories associated with that super-category; obtaining terms in a data query; obtaining terms in categories that are retrieved in response to the data query; forming a modified query consisting of said terms in the data query and said terms in the categories; weighting terms of the modified query; and ranking the super-category term lists by applying the modified query to the super-category term lists to determine the most relevant super-category to the data query.
2. The method of Claim 1, wherein ranking the super-category term lists comprises: for the super-category term list corresponding to each term in the modified query, calculating a sum of all terms in the modified query of a product of a term frequency of each term in that super-category term list and an inverse document frequency of each term as related to all super-category term lists.
3. The method of Claim 2, wherein the term frequency is calculated based on a logarithm of the number of terms from the modified query appearing in the super-category term list.
4. The method of Claim 2, wherein the inverse document frequency is calculated based on Robertson's term frequency.
5. The method of Claim 2, wherein the ranking of documents according to an occurrence of terms is weighted according to information about the terms.
6. The method of Claim 5, where the weighting is based on a number of documents associated with a category.
7. The method of Claim 5, wherein the term frequency and the inverse document frequency are weighted for each term in the modified query according to whether a term appears in a category that was manually mapped to a super-category or a category that was automatically mapped to a super-category.
8. The method of Claim 1, further comprising, performing for all super-categories: calculating a product associated with each term appearing in said modified query that includes terms from said data query and terms from said categories retrieved in response to said data query, said product being formed by multiplying a term frequency for said each term by an inverse document frequency for said each term as related to a frequency of occurrence of said term amongst all other terms associated with all super-category term lists; calculating a sum of said each product for each of said all super-categories; and ranking said all super-categories in accordance with said sum associated with each super-category.
9. The method of Claim 2, wherein said inverse document frequency is calculated based on a logarithm of actual inverse document frequency defined as: log(N-actual inverse document frequency/log(N)) in which "N" represents a total number of documents over which said data query is being performed.
10. The method of Claim 14, wherein said Robertson's term frequency is defined as: TF/(TF + 0.5 + 1.5(DL/ADL)) wherein "TF" represents an actual term frequency associated with a document, "DL" represents a length of said document, and "ADL" represents an average length of any document in a set of documents being searched.
11. The method of Claim 17, wherein a category that is manually mapped is given a greater weight of significance than another category that is automatically mapped.
12. A computer program product for ranking super-categories used in performing a data query comprising: means for establishing a super-category term list for each term appearing in one of a super-category and a category of a document to be searched, each element of said super-category term list including terms in the super-category and terms in categories associated with that super-category; means for obtaining terms in a data query; means for obtaining terms in categories retrieved in response to the data query; means for forming a modified query consisting of said terms in the data query and said terms in the categories; means for weighting terms of the modified query; and means for ranking the super-category term lists by applying the modified query to the super-category term lists to determine the most relevant super-category to the data query.
13. A computer program product for ranking super-categories used in performing a data query comprising: machine executable code for establishing a super-category term list for each term appearing in one of a super-category and a category of a document to be searched, each element of said super-category term list including terms in the super-category and terms in categories associated with that super-category; machine executable code for obtaining terms in a data query; machine executable code for obtaining terms in categories retrieved in response to the data query; machine executable code for forming a modified query consisting of said terms in the data query and said terms in the categories; machine executable code for weighting terms of the modified query; and machine executable code for ranking the super-category term lists by applying the modified query to the super-category term lists to determine the most relevant super-category to the data query.
14. The computer program product of Claim 13, wherein said machine executable code for ranking the super-category term lists further comprises: machine executable code for calculating, for the super-category term list corresponding to each term in the modified query, a sum of all terms in the modified query of a product of a term frequency of each term in that super-category term list and an inverse document frequency of each term as related to all super-category term lists.
15. The computer program product of Claim 14, further including machine executable code for calculating the term frequency based on a logarithm of the number of terms from the modified query appearing in the super-category term list.
16. The computer program product of Claim 14, further including machine executable code for calculating the inverse document frequency based on Robertson's term frequency.
17. The computer program product of Claim 14, further including machine executable code for ranking documents according to an occurrence of terms weighted according to information about the terms.
18. The computer program product of Claim 17, further including machine executable code for weighting documents based on a number of documents associated with a category.
19. The computer program product of Claim 17, further including machine executable code for weighting the term frequency and the inverse document frequency for each term in the modified query according to whether said each term appears in a category that was manually mapped to a super-category or a category that was automatically mapped to a super-category.
20. The computer program product of Claim 13, further comprising: machine executable code for calculating, for all super-categories, a product associated with each term appearing in said modified query that includes terms from said data query and terms from said categories retrieved in response to said data query, said product being formed by multiplying a term frequency for said each term by an inverse document frequency for said each term as related to a frequency of occurrence of said term amongst all other terms associated with all super-category term lists; machine executable code for calculating, for all super-categories, a sum of said each product for each of said all super-categories; and machine executable code for ranking said all super-categories in accordance with said sum associated with said each super-category.
21. The computer program product of Claim 14, further including: machine executable code for calculating said inverse document frequency based on a logarithm of actual inverse document frequency defined as: log(N-actual inverse document frequency/log(N)) in which "N" represents a total number of documents over which said data query is being performed.
22. The computer program product of Claim 16, further including: machine executable code for calculating said Robertson's term frequency that is defined as:
TF/(TF + 0.5 + 1.5(DL/ADL)) wherein "TF" represents an actual term frequency associated with a document, "DL" represents a length of said document, and "ADL" represents an average length of any document in a set of documents being searched.
23. The computer program product of Claim 19, wherein a category that is manually mapped is given a greater weight of significance than another category that is automatically mapped.
24. A method executed in a computer system for searching a first set of documents comprising: forming a modified query conesponding to terms included in a data query and terms included in categories associated w ith a second set of documents obtained by searching the data query; and ranking said first set documents using a weighting factor associated with terms of the modified query, wherem the weighting factor varies in accordance with an occunence of each term appeanng in each of said first set of documents to determine a most relevant one of said second documents to said data query.
25. The method of Claim 24, wherein said ranking said first set of documents includes performing, for each of said documents included in said first set: calculating a product associated with each term appearing in said set of terms of the modified query, said product being formed by multiplying a term frequency for said each term by an inverse document frequency for said each term as related to a frequency of occunence of said term amongst all other terms associated with said first set of documents; calculating a sum of said each product for each of said first set of documents; and ranking said first set of documents in accordance with said sum associated with said each document of said first set.
26. The method of Claim 25, wherem the term frequency is calculated based on a logaπthm of the number of terms from the set of terms appeanng in each of said first set of documents.
27 The method of Claim 25, wheiein the mveise document fiequency is calculated based on Robertson's teim frequency
28 The method of Claim 25, where the ranking of said fust set of documents according to an occunence of terms is w eighted accoiding to information about the teims
29 The method of Claim 28, " wherein the wei CghtinOg is based on a number of documents in said fust set associated with a categoiy
30 The method of Claim 28, wherein the frequency and the inverse document frequency are weighted for each term in the set of terms according to whether said each terms appears in a category that was manually mapped to one of said documents or a category that was automatically mapped to one of said documents included in said first set
31. A computer program product for searching a first set of documents compnsmg machine executable code for forming a modified query conesponding to terms included m a data query and terms included in categones associated with a second set of documents obtained by searching the data query; and machine executable code for ranking said first set documents using a weighting factor associated with terms of the modified query, wherein the weighting factor vanes in accordance with an occunence of each term appeanng in each of said first set of documents to determine a most relevant one of said second documents to said data query
32. The computer program product of Claim 31, wherein said machine executable code for ranking said first set of documents includes: machine executable code for calculating, for each of said documents included in said first set, a product associated with each term appearing in said set of terms of the modified query, said product being formed by multiplying a term frequency for said each term by an inverse document frequency for said each term as related to a frequency of occurrence of said term amongst all other terms associated with said first set of documents; machine executable code for calculating, for each of said documents included in said first set, a sum of said each product for each of said first set of documents; and machine executable code for ranking said first set of documents in accordance with said sum associated with said each document of said first set.
33. The computer program product of Claim 32, wherein the term frequency is calculated based on a logarithm of the number of terms from the set of terms appearing in each of said documents included in said first set.
34. A method executed in a computer system for establishing super-category term lists for use in performing a data query, comprising: obtaining categories of documents that may be retrieved in accordance with said data query, each of the categories having at least one term; establishing super-categories for the documents; mapping each of the categories to a super-category, wherein at least one of said categories is mapped to a super-category automatically in accordance with one or more previously determined mappings of categories to super-categories; and establishing a super-category term list for each term appearing in a super-category or a category, each element of a list including the terms in the super-category and the terms in the categories that are mapped to that super-category.
35. The method of Claim 34, further comprising: matching advertisements to the super-categories.
36. The method of Claim 34, wherein mapping the categories to the super-categories includes manually assigning at least one of the categories to a super-category.
37. The method of Claim 36, further comprising: assigning at least one category to a super-category based on a ranking of the co-occurrence frequency of terms in the category with terms that appear in a manually assigned category.
38. The method of Claim 34, further comprising assigning at least one category to a super-category based upon co-occurrence of terms appearing in the category and the super-category.
39. The method of Claim 34, further including: mapping one category to more than one super-category.
40. The method of Claim 34, further including: mapping one super-category to more than one category.
41. The method of Claim 34, wherein said mapping each of the categories to a super-category includes a combination of automatic and manual mapping.
42. The method of Claim 41, further comprising: automatically assigning a category to a super-category in accordance with previous manual assignments by comparing terms appearing in said category to terms that appear in each of the other categories that was previously manually assigned to a super-category.
43. The method of Claim 38, where the super-category to which said at least one category is assigned is the super-category having the highest co-occurrence.
44. A computer program product for establishing super-category term lists for use in performing a data query, the computer program product comprising: machine executable code for obtaining categories of documents that may be retrieved in accordance with said data query, each of said categories having at least one term; machine executable code for establishing super-categories for the documents; machine executable code for mapping each of the categories to a super-category, wherein at least one of said categories is mapped to a super-category automatically in accordance with one or more previously determined mappings of categories to super-categories; and machine executable code for establishing a super-category term list for each term appearing in one of a super-category and a category, each element of a list including the terms in the super-category and the terms in the categories that are mapped to that super-category.
45. The computer program product of Claim 44, further comprising: machine executable code for matching advertisements to the super-categories.
46. The computer program product of Claim 44, wherein said machine executable code for mapping the categories to super-categories includes manually assigning at least one of the categories to a super-category.
47. The computer program product of Claim 46, further comprising: machine executable code for assigning at least one category to a super-category based on a ranking of the co-occurrence frequency of terms in the category with terms that appear in a manually assigned category.
48. The computer program product of Claim 44, further comprising: machine executable code for assigning at least one category to a super-category based upon co-occurrence of terms appearing in the category and the super-category.
49. The computer program product of Claim 44, further comprising: machine executable code for mapping one category to more than one super-category.
50. The computer program product of Claim 44, further comprising: machine executable code for mapping one super-category to more than one category.
51. The computer program product of Claim 44, wherein said machine executable code for mapping each of the categories to a super-category includes a combination of automatic and manual mapping.
52. The computer program product of Claim 51, further comprising: machine executable code for automatically assigning a category to a super-category in accordance with previous manual assignments by comparing terms appearing in said category to terms that appear in each of the other categories that was previously manually assigned to a super-category.
53. The computer program product of Claim 48, wherein the super-category to which said at least one category is assigned is the super-category having the highest co-occurrence.
54. An apparatus for establishing super-category term lists for use in performing a data query, the computer program product comprising: means for obtaining categories of documents that may be retrieved in accordance with said data query, each of said categories having at least one term; means for establishing super-categories for the documents; means for mapping each of the categories to a super-category, where at least one of said categories is mapped to a super-category automatically in accordance with one or more previously determined mappings of categories to super-categories; and means for establishing a super-category term list for each term appearing in one of a super-category and a category, each element of a list including the terms in the super-category and the terms in the categories that are mapped to that super-category.
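By way of illustration only, and without limiting the claims, the hybrid category mapping of Claims 34 to 43 and the term-weighted ranking of Claims 25 and 26 might be sketched in Python as follows. The function names, the dictionary layouts, the use of shared-term counts as the co-occurrence measure, the `log(1 + count)` term frequency and the `log(N / df)` inverse document frequency are all assumptions made for this sketch; the claims do not fix these details, and the Robertson variant of Claim 27 is not shown.

```python
import math
from collections import Counter


def auto_assign(category, category_terms, manual_mapping):
    """Pick a super-category for an unmapped category (cf. Claims 37, 42, 43).

    Assumption: co-occurrence is measured as the number of terms shared with
    categories that were already assigned by hand.
    """
    terms = set(category_terms[category])
    scores = Counter()
    for other, supers in manual_mapping.items():
        overlap = len(terms & set(category_terms[other]))
        for s in supers:
            scores[s] += overlap
    if not scores or max(scores.values()) == 0:
        return None  # no evidence; leave for manual assignment
    # Highest co-occurrence wins; sorting first makes ties deterministic.
    return max(sorted(scores), key=scores.get)


def build_term_lists(category_terms, super_terms, mapping):
    """Pool each super-category's terms with those of its mapped categories
    (cf. Claim 34). A category may map to several super-categories."""
    pooled = {s: set(terms) for s, terms in super_terms.items()}
    for category, supers in mapping.items():
        for s in supers:
            pooled[s] |= set(category_terms[category])
    return pooled


def rank(documents, query_terms):
    """Order document ids by the sum of tf * idf over the modified query's
    terms (cf. Claims 25 and 26). `documents` maps an id to a list of words."""
    terms = set(query_terms)
    n = len(documents)
    df = Counter()  # number of documents containing each query term
    for words in documents.values():
        df.update(set(words) & terms)
    scores = {}
    for doc_id, words in documents.items():
        counts = Counter(words)
        scores[doc_id] = sum(
            math.log(1 + counts[t]) * math.log(n / df[t])
            for t in terms
            if df[t]
        )
    return sorted(scores, key=scores.get, reverse=True)
```

In use, a category with no manual mapping would first be passed to `auto_assign`, the result merged into the mapping, and `build_term_lists` then run over the combined manual and automatic mapping. The pooled terms for the matching super-category could be appended to the user's query before calling `rank`.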
PCT/US2000/008450 1999-03-31 2000-03-30 Techniques for performing a data query in a computer system WO2000058863A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU43280/00A AU4328000A (en) 1999-03-31 2000-03-30 Techniques for performing a data query in a computer system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US09/283,268 1999-03-31
US09/283,268 US6826559B1 (en) 1999-03-31 1999-03-31 Hybrid category mapping for on-line query tool
US09/282,730 US7047242B1 (en) 1999-03-31 1999-03-31 Weighted term ranking for on-line query tool
US09/282,730 1999-03-31

Publications (1)

Publication Number Publication Date
WO2000058863A1 (en) 2000-10-05

Family

ID=26961643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/008450 WO2000058863A1 (en) 1999-03-31 2000-03-30 Techniques for performing a data query in a computer system

Country Status (3)

Country Link
US (5) US6507839B1 (en)
AU (1) AU4328000A (en)
WO (1) WO2000058863A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1665093A1 (en) * 2003-08-21 2006-06-07 Idilia Inc. System and method for associating documents with contextual advertisements
EP1808787A1 (en) 2006-01-17 2007-07-18 Sap Ag Deep enterprise search
US7853607B2 (en) 2006-08-25 2010-12-14 Sap Ag Related actions server
US8117072B2 (en) 2001-11-13 2012-02-14 International Business Machines Corporation Promoting strategic documents by bias ranking of search results on a web browser

Families Citing this family (93)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7617508B2 (en) * 2003-12-12 2009-11-10 At&T Intellectual Property I, L.P. Methods and systems for collaborative capture of television viewer generated clickstreams
US7587323B2 (en) 2001-12-14 2009-09-08 At&T Intellectual Property I, L.P. System and method for developing tailored content
BR9807467B1 (en) 1997-01-06 2010-11-16 method and system for monitoring the use of television media distribution network.
US20100257037A1 (en) * 2001-12-14 2010-10-07 Matz William R Method and system for targeted incentives
US7802276B2 (en) 1997-01-06 2010-09-21 At&T Intellectual Property I, L.P. Systems, methods and products for assessing subscriber content access
US6983478B1 (en) * 2000-02-01 2006-01-03 Bellsouth Intellectual Property Corporation Method and system for tracking network use
US8640160B2 (en) 1997-01-06 2014-01-28 At&T Intellectual Property I, L.P. Method and system for providing targeted advertisements
US8677384B2 (en) 2003-12-12 2014-03-18 At&T Intellectual Property I, L.P. Methods and systems for network based capture of television viewer generated clickstreams
US20060075456A1 (en) * 1997-01-06 2006-04-06 Gray James Harold Methods and systems for collaborative capture of television viewer generated clickstreams
US20060031882A1 (en) * 1997-01-06 2006-02-09 Swix Scott R Systems, methods, and devices for customizing content-access lists
US20060253884A1 (en) * 1997-01-06 2006-11-09 Gray James H Methods and systems for network based capture of television viewer generated clickstreams
US7020652B2 (en) 2001-12-21 2006-03-28 Bellsouth Intellectual Property Corp. System and method for customizing content-access lists
AU4328000A (en) 1999-03-31 2000-10-16 Verizon Laboratories Inc. Techniques for performing a data query in a computer system
US8572069B2 (en) * 1999-03-31 2013-10-29 Apple Inc. Semi-automatic index term augmentation in document retrieval
US8275661B1 (en) 1999-03-31 2012-09-25 Verizon Corporate Services Group Inc. Targeted banner advertisements
US6718363B1 (en) 1999-07-30 2004-04-06 Verizon Laboratories, Inc. Page aggregation for web sites
JP3855551B2 (en) * 1999-08-25 2006-12-13 株式会社日立製作所 Search method and search system
US20020059223A1 (en) * 1999-11-30 2002-05-16 Nash Paul R. Locator based assisted information browsing
US7941468B2 (en) * 1999-12-30 2011-05-10 At&T Intellectual Property I, L.P. Infringer finder
US7389239B1 (en) * 1999-12-30 2008-06-17 At&T Delaware Intellectual Property, Inc. System and method for managing intellectual property
US7127405B1 (en) * 1999-12-30 2006-10-24 Bellsouth Intellectual Property Corp. System and method for selecting and protecting intellectual property assets
US7346518B1 (en) * 1999-12-30 2008-03-18 At&T Bls Intellectual Property, Inc. System and method for determining the marketability of intellectual property assets
US7801830B1 (en) 1999-12-30 2010-09-21 At&T Intellectual Property I, L.P. System and method for marketing, managing, and maintaining intellectual property
US6845369B1 (en) * 2000-01-14 2005-01-18 Relevant Software Inc. System, apparatus and method for using and managing digital information
US7356604B1 (en) * 2000-04-18 2008-04-08 Claritech Corporation Method and apparatus for comparing scores in a vector space retrieval process
US6912525B1 (en) 2000-05-08 2005-06-28 Verizon Laboratories, Inc. Techniques for web site integration
US7249121B1 (en) * 2000-10-04 2007-07-24 Google Inc. Identification of semantic units from within a search query
US20020124011A1 (en) * 2001-03-01 2002-09-05 Baxter Robert W. Methods, systems, and computer program products for communicating with a controller using a database interface
US20020124056A1 (en) * 2001-03-01 2002-09-05 International Business Machines Corporation Method and apparatus for modifying a web page
US7627588B1 (en) 2001-05-07 2009-12-01 Ixreveal, Inc. System and method for concept based analysis of unstructured data
US7536413B1 (en) 2001-05-07 2009-05-19 Ixreveal, Inc. Concept-based categorization of unstructured objects
USRE46973E1 (en) 2001-05-07 2018-07-31 Ureveal, Inc. Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information
US7194483B1 (en) 2001-05-07 2007-03-20 Intelligenxia, Inc. Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information
US7212979B1 (en) 2001-12-14 2007-05-01 Bellsouth Intellectuall Property Corporation System and method for identifying desirable subscribers
US9967633B1 (en) 2001-12-14 2018-05-08 At&T Intellectual Property I, L.P. System and method for utilizing television viewing patterns
US20110178877A1 (en) * 2001-12-14 2011-07-21 Swix Scott R Advertising and content management systems and methods
US7086075B2 (en) 2001-12-21 2006-08-01 Bellsouth Intellectual Property Corporation Method and system for managing timed responses to A/V events in television programming
US20050071863A1 (en) * 2001-12-21 2005-03-31 Matz William R. System and method for storing and distributing television viewing patterns form a clearinghouse
US8086491B1 (en) 2001-12-31 2011-12-27 At&T Intellectual Property I, L. P. Method and system for targeted content distribution using tagged data streams
US8589413B1 (en) 2002-03-01 2013-11-19 Ixreveal, Inc. Concept-based method and system for dynamically analyzing results from search engines
US8352499B2 (en) * 2003-06-02 2013-01-08 Google Inc. Serving advertisements using user request information and user information
US7209915B1 (en) * 2002-06-28 2007-04-24 Microsoft Corporation Method, system and apparatus for routing a query to one or more providers
US7152059B2 (en) * 2002-08-30 2006-12-19 Emergency24, Inc. System and method for predicting additional search results of a computerized database search user based on an initial search query
US7076497B2 (en) * 2002-10-11 2006-07-11 Emergency24, Inc. Method for providing and exchanging search terms between internet site promoters
US20030088553A1 (en) * 2002-11-23 2003-05-08 Emergency 24, Inc. Method for providing relevant search results based on an initial online search query
US6882356B2 (en) * 2003-02-11 2005-04-19 Eastman Kodak Company Method and apparatus for watermarking film
US10475116B2 (en) * 2003-06-03 2019-11-12 Ebay Inc. Method to identify a suggested location for storing a data entry in a database
US9341485B1 (en) * 2003-06-19 2016-05-17 Here Global B.V. Method and apparatus for representing road intersections
US9547994B2 (en) * 2003-10-01 2017-01-17 Kenneth Nathaniel Sherman Progressive reference system, method and apparatus
US6917758B1 (en) * 2003-12-19 2005-07-12 Eastman Kodak Company Method of image compensation for watermarked film
US7814105B2 (en) * 2004-10-27 2010-10-12 Harris Corporation Method for domain identification of documents in a document database
US7788590B2 (en) * 2005-09-26 2010-08-31 Microsoft Corporation Lightweight reference user interface
US7992085B2 (en) * 2005-09-26 2011-08-02 Microsoft Corporation Lightweight reference user interface
EP1952280B8 (en) 2005-10-11 2016-11-30 Ureveal, Inc. System, method&computer program product for concept based searching&analysis
US20070130153A1 (en) * 2005-12-02 2007-06-07 Palm, Inc. Techniques to communicate and process location information from communications networks on a mobile computing device
US7870031B2 (en) * 2005-12-22 2011-01-11 Ebay Inc. Suggested item category systems and methods
US7676485B2 (en) 2006-01-20 2010-03-09 Ixreveal, Inc. Method and computer program product for converting ontologies into concept semantic networks
US9129252B2 (en) * 2006-03-31 2015-09-08 At&T Intellectual Property I, L.P. Potential realization system with electronic communication processing for conditional resource incrementation
US8884972B2 (en) * 2006-05-25 2014-11-11 Qualcomm Incorporated Graphics processor with arithmetic and elementary function units
US20080071864A1 (en) * 2006-09-14 2008-03-20 International Business Machines Corporation System and method for user interest based search index optimization
US8380175B2 (en) * 2006-11-22 2013-02-19 Bindu Rama Rao System for providing interactive advertisements to user of mobile devices
US11256386B2 (en) 2006-11-22 2022-02-22 Qualtrics, Llc Media management system supporting a plurality of mobile devices
US8478250B2 (en) 2007-07-30 2013-07-02 Bindu Rama Rao Interactive media management server
US8700014B2 (en) 2006-11-22 2014-04-15 Bindu Rama Rao Audio guided system for providing guidance to user of mobile device on multi-step activities
US10803474B2 (en) 2006-11-22 2020-10-13 Qualtrics, Llc System for creating and distributing interactive advertisements to mobile devices
US20080133498A1 (en) * 2006-12-05 2008-06-05 Yahoo! Inc. Search Category Commercialization Index
US20080148311A1 (en) * 2006-12-13 2008-06-19 Tischer Steven N Advertising and content management systems and methods
US20080167943A1 (en) * 2007-01-05 2008-07-10 O'neil Douglas R Real time pricing, purchasing and auctioning of advertising time slots based on real time viewership, viewer demographics, and content characteristics
US20090083141A1 (en) * 2007-09-25 2009-03-26 Ari Craine Methods, systems, and computer program products for detecting and predicting user content interest
US7958136B1 (en) 2008-03-18 2011-06-07 Google Inc. Systems and methods for identifying similar documents
US8359302B2 (en) * 2008-07-02 2013-01-22 Adobe Systems Incorporated Systems and methods for providing hi-fidelity contextual search results
WO2010101671A1 (en) 2009-01-16 2010-09-10 New York University Automated real-time particle characterization and three-dimensional velocimetry with holographic video microscopy
US8849790B2 (en) * 2008-12-24 2014-09-30 Yahoo! Inc. Rapid iterative development of classifiers
US9245243B2 (en) 2009-04-14 2016-01-26 Ureveal, Inc. Concept-based analysis of structured and unstructured data using concept inheritance
WO2010140195A1 (en) * 2009-06-05 2010-12-09 株式会社 東芝 Video editing device
US9185163B2 (en) 2011-04-08 2015-11-10 Microsoft Technology Licensing, Llc Receiving individual documents to serve
US8990612B2 (en) 2011-04-08 2015-03-24 Microsoft Technology Licensing, Llc Recovery of a document serving environment
US9158767B2 (en) * 2011-04-08 2015-10-13 Microsoft Technology Licensing, Llc Lock-free indexing of documents
US20140046945A1 (en) * 2011-05-08 2014-02-13 Vinay Deolalikar Indicating documents in a thread reaching a threshold
CN103324633A (en) * 2012-03-22 2013-09-25 阿里巴巴集团控股有限公司 Information publishing method and device
KR101636902B1 (en) * 2012-08-23 2016-07-06 에스케이텔레콤 주식회사 Method for detecting a grammatical error and apparatus thereof
US8862609B2 (en) 2012-09-28 2014-10-14 International Business Machines Corporation Expanding high level queries
US9600529B2 (en) * 2013-03-14 2017-03-21 Wal-Mart Stores, Inc. Attribute-based document searching
US10438254B2 (en) 2013-03-15 2019-10-08 Ebay Inc. Using plain text to list an item on a publication system
RU2610280C2 (en) 2014-10-31 2017-02-08 Общество С Ограниченной Ответственностью "Яндекс" Method for user authorization in a network and server used therein
RU2580432C1 (en) 2014-10-31 2016-04-10 Общество С Ограниченной Ответственностью "Яндекс" Method for processing a request from a potential unauthorised user to access resource and server used therein
US10060749B2 (en) 2015-02-19 2018-08-28 Here Global B.V. Method and apparatus for creating a clothoid road geometry
US9858487B2 (en) 2015-02-19 2018-01-02 Here Global B.V. Method and apparatus for converting from an analytical curve road geometry to a clothoid road geometry
DK3414517T3 (en) 2016-02-08 2021-12-13 Univ New York HOLOGRAPHIC CHARACTERIZATION OF PROTEIN UNITS
US11360958B2 (en) * 2017-09-29 2022-06-14 Apple Inc. Techniques for indexing and querying a set of documents at a computing device
US10949391B2 (en) * 2018-08-30 2021-03-16 International Business Machines Corporation Automatically identifying source code relevant to a task
US11543338B2 (en) 2019-10-25 2023-01-03 New York University Holographic characterization of irregular particles
US11948302B2 (en) 2020-03-09 2024-04-02 New York University Automated holographic video microscopy assay

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5206949A (en) * 1986-09-19 1993-04-27 Nancy P. Cochran Database search and record retrieval system which continuously displays category names during scrolling and selection of individually displayed search terms
US5321833A (en) * 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
US5371807A (en) * 1992-03-20 1994-12-06 Digital Equipment Corporation Method and apparatus for text classification
US5544360A (en) * 1992-11-23 1996-08-06 Paragon Concepts, Inc. Method for accessing computer files and data, using linked categories assigned to each data file record on entry of the data file record
US5835087A (en) * 1994-11-29 1998-11-10 Herz; Frederick S. M. System for generation of object profiles for a system for customized electronic identification of desirable objects

Family Cites Families (174)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4003024A (en) 1975-10-14 1977-01-11 Rockwell International Corporation Two-dimensional binary data enhancement system
IL58119A (en) 1979-08-27 1983-03-31 Yeda Res & Dev Histogram image enhancement system
US4719642A (en) 1985-02-27 1988-01-12 Scientific Atlanta, Inc. Error detection and concealment using predicted signal values
US5187747A (en) 1986-01-07 1993-02-16 Capello Richard D Method and apparatus for contextual data enhancement
US5408655A (en) * 1989-02-27 1995-04-18 Apple Computer, Inc. User interface system and method for traversing a database
US5181162A (en) 1989-12-06 1993-01-19 Eastman Kodak Company Document management and production system
US5404514A (en) 1989-12-26 1995-04-04 Kageneck; Karl-Erbo G. Method of indexing and retrieval of electronically-stored documents
US5369761A (en) 1990-03-30 1994-11-29 Conley; John D. Automatic and transparent denormalization support, wherein denormalization is achieved through appending of fields to base relations of a normalized database
US5274802A (en) 1991-02-22 1993-12-28 Gte Mobilnet Incorporated Method for restoring lost databases by comparing existing database and generic database, and generating cellular switch commands to update the generic database
US5398335A (en) 1992-09-01 1995-03-14 Lewis; Eric R. Virtually updating data records by assigning the update fractional addresses to maintain an ordinal relationship without renumbering original records
JP3174168B2 (en) 1992-10-01 2001-06-11 富士通株式会社 Variable replacement processor
US5418961A (en) * 1993-01-12 1995-05-23 International Business Machines Corporation Parallel tables for data model with inheritance
US5497491A (en) 1993-01-26 1996-03-05 International Business Machines Corporation System and method for importing and exporting data between an object oriented computing environment and an external computing environment
JP2583386B2 (en) 1993-03-29 1997-02-19 日本電気株式会社 Keyword automatic extraction device
US7251637B1 (en) * 1993-09-20 2007-07-31 Fair Isaac Corporation Context vector generation and retrieval
US5619709A (en) * 1993-09-20 1997-04-08 Hnc, Inc. System and method of context vector generation and retrieval
US5832476A (en) 1994-06-29 1998-11-03 Hitachi, Ltd. Document searching method using forward and backward citation tables
US6061515A (en) 1994-07-18 2000-05-09 International Business Machines Corporation System and method for providing a high level language for mapping and accessing objects in data stores
US5715443A (en) 1994-07-25 1998-02-03 Apple Computer, Inc. Method and apparatus for searching for information in a data processing system and for providing scheduled search reports in a summary format
EP0792493B1 (en) 1994-11-08 1999-08-11 Vermeer Technologies, Inc. An online service development tool with fee setting capabilities
US6029195A (en) 1994-11-29 2000-02-22 Herz; Frederick S. M. System for customized electronic identification of desirable objects
US5831606A (en) 1994-12-13 1998-11-03 Microsoft Corporation Shell extensions for an operating system
US6110228A (en) 1994-12-28 2000-08-29 International Business Machines Corporation Method and apparatus for software maintenance at remote nodes
US5625767A (en) 1995-03-13 1997-04-29 Bartell; Brian Method and system for two-dimensional visualization of an information taxonomy and of text documents based on topical content of the documents
US5855015A (en) 1995-03-20 1998-12-29 Interval Research Corporation System and method for retrieval of hyperlinked information resources
US5659732A (en) 1995-05-17 1997-08-19 Infoseek Corporation Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents
US6181867B1 (en) 1995-06-07 2001-01-30 Intervu, Inc. Video storage and retrieval system
US5724571A (en) 1995-07-07 1998-03-03 Sun Microsystems, Inc. Method and apparatus for generating query responses in a computer-based document retrieval system
US5717924A (en) 1995-07-07 1998-02-10 Wall Data Incorporated Method and apparatus for modifying existing relational database schemas to reflect changes made in a corresponding object model
US6199082B1 (en) 1995-07-17 2001-03-06 Microsoft Corporation Method for delivering separate design and content in a multimedia publishing system
US5907837A (en) 1995-07-17 1999-05-25 Microsoft Corporation Information retrieval system in an on-line network including separate content and layout of published titles
US6650761B1 (en) 1999-05-19 2003-11-18 Digimarc Corporation Watermarked business cards and methods
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US5941947A (en) 1995-08-18 1999-08-24 Microsoft Corporation System and method for controlling access to data entities in a computer network
US6067552A (en) * 1995-08-21 2000-05-23 Cnet, Inc. User interface system and method for browsing a hypertext database
US5659742A (en) * 1995-09-15 1997-08-19 Infonautics Corporation Method for storing multi-media information in an information retrieval system
JPH0991358A (en) * 1995-09-28 1997-04-04 Fujitsu Ltd Device and method for providing information
US5734887A (en) 1995-09-29 1998-03-31 International Business Machines Corporation Method and apparatus for logical data access to a physical relational database
US5764906A (en) 1995-11-07 1998-06-09 Netword Llc Universal electronic resource denotation, request and delivery system
US5682484A (en) 1995-11-20 1997-10-28 Advanced Micro Devices, Inc. System and method for transferring data streams simultaneously on multiple buses in a computer system
JP3040945B2 (en) 1995-11-29 2000-05-15 松下電器産業株式会社 Document search device
US5754840A (en) * 1996-01-23 1998-05-19 Smartpatents, Inc. System, method, and computer program product for developing and maintaining documents which includes analyzing a patent application with regards to the specification and claims
AT403491B (en) 1996-02-15 1998-02-25 Wimmer Alois Ing CONCRETE CRUSHERS with CUTTING SHEARS
US5926811A (en) * 1996-03-15 1999-07-20 Lexis-Nexis Statistical thesaurus, method of forming same, and use thereof in query expansion in automated text searching
US6035330A (en) 1996-03-29 2000-03-07 British Telecommunications World wide web navigational mapping system and method
US5721897A (en) 1996-04-09 1998-02-24 Rubinstein; Seymour I. Browse by prompted keyword phrases with an improved user interface
US6182083B1 (en) 1997-11-17 2001-01-30 Sun Microsystems, Inc. Method and system for multi-entry and multi-template matching in a database
US5768581A (en) 1996-05-07 1998-06-16 Cochran; Nancy Pauline Apparatus and method for selecting records from a computer database by repeatedly displaying search terms from multiple list identifiers before either a list identifier or a search term is selected
US5826261A (en) * 1996-05-10 1998-10-20 Spencer; Graham System and method for querying multiple, distributed databases by selective sharing of local relative significance information for terms related to the query
US6148289A (en) 1996-05-10 2000-11-14 Localeyes Corporation System and method for geographically organizing and classifying businesses on the world-wide web
US5898780A (en) 1996-05-21 1999-04-27 Gric Communications, Inc. Method and apparatus for authorizing remote internet access
US6101515A (en) 1996-05-31 2000-08-08 Oracle Corporation Learning system for classification of terminology
CA2257309C (en) 1996-06-07 2002-06-11 At&T Corp. Internet file system
US5915249A (en) * 1996-06-14 1999-06-22 Excite, Inc. System and method for accelerated query evaluation of very large full-text databases
US5920859A (en) * 1997-02-05 1999-07-06 Idd Enterprises, L.P. Hypertext document retrieval system and method
US6581056B1 (en) 1996-06-27 2003-06-17 Xerox Corporation Information retrieval system providing secondary content analysis on collections of information objects
US5809502A (en) 1996-08-09 1998-09-15 Digital Equipment Corporation Object-oriented interface for an index
US6189019B1 (en) 1996-08-14 2001-02-13 Microsoft Corporation Computer system and computer-implemented process for presenting document connectivity
US5920854A (en) * 1996-08-14 1999-07-06 Infoseek Corporation Real-time document collection search engine with phrase indexing
US6353822B1 (en) 1996-08-22 2002-03-05 Massachusetts Institute Of Technology Program-listing appendix
US5819291A (en) 1996-08-23 1998-10-06 General Electric Company Matching new customer records to existing customer records in a large business database using hash key
US5870740A (en) 1996-09-30 1999-02-09 Apple Computer, Inc. System and method for improving the ranking of information retrieval results for short queries
US6038561A (en) * 1996-10-15 2000-03-14 Manning & Napier Information Services Management and analysis of document information text
US5899999A (en) 1996-10-16 1999-05-04 Microsoft Corporation Iterative convolution filter particularly suited for use in an image classification and retrieval system
JP3598742B2 (en) 1996-11-25 2004-12-08 富士ゼロックス株式会社 Document search device and document search method
US5802527A (en) 1996-12-31 1998-09-01 Mci Communications Corporation Data enhancement engine
US6009459A (en) 1997-01-10 1999-12-28 Microsoft Corporation Intelligent automatic searching for resources in a distributed environment
US6006230A (en) 1997-01-15 1999-12-21 Sybase, Inc. Database application development system with improved methods for distributing and executing objects across multiple tiers
US5924105A (en) 1997-01-27 1999-07-13 Michigan State University Method and product for determining salient features for use in information searching
GB9701866D0 (en) 1997-01-30 1997-03-19 British Telecomm Information retrieval
JP3499105B2 (en) * 1997-03-03 2004-02-23 株式会社東芝 Information search method and information search device
US5950198A (en) 1997-03-24 1999-09-07 Novell, Inc. Processes and apparatuses for generating file correspondency through replication and synchronization between target and source computers
US5895470A (en) 1997-04-09 1999-04-20 Xerox Corporation System for categorizing documents in a linked collection of documents
US5933849A (en) 1997-04-10 1999-08-03 At&T Corp Scalable distributed caching system and method
JP3001460B2 (en) 1997-05-21 2000-01-24 株式会社エヌイーシー情報システムズ Document classification device
US6578113B2 (en) 1997-06-02 2003-06-10 At&T Corp. Method for cache validation for proxy caches
US6098066A (en) 1997-06-13 2000-08-01 Sun Microsystems, Inc. Method and apparatus for searching for documents stored within a document directory hierarchy
JPH113307A (en) 1997-06-13 1999-01-06 Canon Inc Information processor and its method
US6415250B1 (en) 1997-06-18 2002-07-02 Novell, Inc. System and method for identifying language using morphologically-based techniques
US5937402A (en) 1997-06-19 1999-08-10 Ontos, Inc. System for enabling access to a relational database from an object oriented program
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US6128613A (en) 1997-06-26 2000-10-03 The Chinese University Of Hong Kong Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words
JP3607462B2 (en) 1997-07-02 2005-01-05 松下電器産業株式会社 Related keyword automatic extraction device and document search system using the same
US5933822A (en) 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision
US6055528A (en) 1997-07-25 2000-04-25 Claritech Corporation Method for cross-linguistic document retrieval
US5956039A (en) 1997-07-25 1999-09-21 Platinum Technology Ip, Inc. System and method for increasing performance by efficient use of limited resources via incremental fetching, loading and unloading of data assets of three-dimensional worlds based on transient asset priorities
US5937392A (en) 1997-07-28 1999-08-10 Switchboard Incorporated Banner advertising display system and method with frequency of advertisement control
US6073140A (en) 1997-07-29 2000-06-06 Acxiom Corporation Method and system for the creation, enhancement and update of remote data using persistent keys
US6167404A (en) 1997-07-31 2000-12-26 Avid Technology, Inc. Multimedia plug-in using dynamic objects
US6078916A (en) * 1997-08-01 2000-06-20 Culliss; Gary Method for organizing information
US6282542B1 (en) 1997-08-06 2001-08-28 Tachyon, Inc. Distributed system and method for prefetching objects
US6092061A (en) 1997-08-15 2000-07-18 International Business Machines Corporation Data partitioning by co-locating referenced and referencing records
US6081774A (en) 1997-08-22 2000-06-27 Novell, Inc. Natural language information retrieval system and method
US5983216A (en) * 1997-09-12 1999-11-09 Infoseek Corporation Performing automated document collection and selection by providing a meta-index with meta-index values indentifying corresponding document collections
US6018733A (en) 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US5845278A (en) * 1997-09-12 1998-12-01 Infoseek Corporation Method for automatically selecting collections to search in full text searches
US5956722A (en) * 1997-09-23 1999-09-21 At&T Corp. Method for effective indexing of partially dynamic documents
US6009410A (en) 1997-10-16 1999-12-28 At&T Corporation Method and system for presenting customized advertising to a user on the world wide web
US6269368B1 (en) * 1997-10-17 2001-07-31 Textwise Llc Information retrieval using dynamic evidence combination
US5987457A (en) 1997-11-25 1999-11-16 Acceleration Software International Corporation Query refinement method for searching documents
US6094649A (en) 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
EP1389013A1 (en) * 1997-12-26 2004-02-11 Matsushita Electric Industrial Co., Ltd. Video clip identification system unusable for commercial cutting
US6298356B1 (en) 1998-01-16 2001-10-02 Aspect Communications Corp. Methods and apparatus for enabling dynamic resource collaboration
IL123129A (en) 1998-01-30 2010-12-30 Aviv Refuah Www addressing
US6028605A (en) 1998-02-03 2000-02-22 Documentum, Inc. Multi-dimensional analysis of objects by manipulating discovered semantic properties
US6182133B1 (en) 1998-02-06 2001-01-30 Microsoft Corporation Method and apparatus for display of information prefetching and cache status having variable visual indication based on a period of time since prefetching
JPH11327717A (en) 1998-03-16 1999-11-30 Digital Vision Laboratories:Kk Information output device and information offering system
US6032145A (en) 1998-04-10 2000-02-29 Requisite Technology, Inc. Method and system for database manipulation
US6122647A (en) 1998-05-19 2000-09-19 Perspecta, Inc. Dynamic generation of contextual links in hypertext documents
US6098064A (en) 1998-05-22 2000-08-01 Xerox Corporation Prefetching and caching documents according to probability ranked need S list
US6334145B1 (en) 1998-06-30 2001-12-25 International Business Machines Corporation Method of storing and classifying selectable web page links and sublinks thereof to a predetermined depth in response to a single user input
US6327574B1 (en) 1998-07-07 2001-12-04 Encirq Corporation Hierarchical models of consumer attributes for targeting content in a privacy-preserving manner
US6178418B1 (en) 1998-07-28 2001-01-23 Noetix Corporation Distributed data warehouse query and resource management system
US6405188B1 (en) 1998-07-31 2002-06-11 Genuity Inc. Information retrieval system
US6356899B1 (en) 1998-08-29 2002-03-12 International Business Machines Corporation Method for interactively creating an information database including preferred information elements, such as preferred-authority, world wide web pages
US6334131B2 (en) 1998-08-29 2001-12-25 International Business Machines Corporation Method for cataloging, filtering, and relevance ranking frame-based hierarchical information structures
US6047210A (en) 1998-09-03 2000-04-04 Cardiac Pacemakers, Inc. Cardioverter and method for cardioverting an atrial tachyarrhythmia while maintaining atrial pacing
US6157930A (en) 1998-09-24 2000-12-05 Acceleration Software International Corporation Accelerating access to wide area network information in mode for showing document then verifying validity
US6363373B1 (en) 1998-10-01 2002-03-26 Microsoft Corporation Method and apparatus for concept searching using a Boolean or keyword search engine
JP2000112990A (en) 1998-10-08 2000-04-21 Canon Inc Text retrieval device, effective word frequency preparation device, text retrieval method, effective word frequency preparation method and recording medium
US6363378B1 (en) 1998-10-13 2002-03-26 Oracle Corporation Ranking of query feedback terms in an information retrieval system
US6360215B1 (en) * 1998-11-03 2002-03-19 Inktomi Corporation Method and apparatus for retrieving documents based on information other than document content
US6347312B1 (en) 1998-11-05 2002-02-12 International Business Machines Corporation Lightweight directory access protocol (LDAP) directory server cache mechanism and method
EP1135723A4 (en) 1998-11-30 2005-02-16 Siebel Systems Inc Development tool, method, and system for client server applications
US6286000B1 (en) 1998-12-01 2001-09-04 International Business Machines Corporation Light weight document matcher
US6338059B1 (en) 1998-12-17 2002-01-08 International Business Machines Corporation Hyperlinked search interface for distributed database
US6513031B1 (en) 1998-12-23 2003-01-28 Microsoft Corporation System for improving search area selection
US6295529B1 (en) * 1998-12-24 2001-09-25 Microsoft Corporation Method and apparatus for indentifying clauses having predetermined characteristics indicative of usefulness in determining relationships between different texts
US6389412B1 (en) 1998-12-31 2002-05-14 Intel Corporation Method and system for constructing integrated metadata
US6209038B1 (en) 1999-01-13 2001-03-27 International Business Machines Corporation Technique for aggregate transaction scope across multiple independent web requests
US6308168B1 (en) 1999-02-09 2001-10-23 Knowledge Discovery One, Inc. Metadata-driven data presentation module for database system
US6581038B1 (en) * 1999-03-15 2003-06-17 Nexcura, Inc. Automated profiler system for providing medical information to patients
US6393427B1 (en) * 1999-03-22 2002-05-21 Nec Usa, Inc. Personalized navigation trees
US6631496B1 (en) 1999-03-22 2003-10-07 Nec Corporation System for personalizing, organizing and managing web information
US6421683B1 (en) 1999-03-31 2002-07-16 Verizon Laboratories Inc. Method and product for performing data transfer in a computer system
US6408294B1 (en) 1999-03-31 2002-06-18 Verizon Laboratories Inc. Common term optimization
US6393415B1 (en) 1999-03-31 2002-05-21 Verizon Laboratories Inc. Adaptive partitioning techniques in performing query requests and request routing
US7047242B1 (en) 1999-03-31 2006-05-16 Verizon Laboratories Inc. Weighted term ranking for on-line query tool
US6374241B1 (en) 1999-03-31 2002-04-16 Verizon Laboratories Inc. Data merging techniques
US6397228B1 (en) 1999-03-31 2002-05-28 Verizon Laboratories Inc. Data enhancement techniques
AU4328000A (en) 1999-03-31 2000-10-16 Verizon Laboratories Inc. Techniques for performing a data query in a computer system
US6493721B1 (en) 1999-03-31 2002-12-10 Verizon Laboratories Inc. Techniques for performing incremental data updates
US6484161B1 (en) 1999-03-31 2002-11-19 Verizon Laboratories Inc. Method and system for performing online data queries in a distributed computer system
US6496843B1 (en) 1999-03-31 2002-12-17 Verizon Laboratories Inc. Generic object for rapid integration of data changes
US6826559B1 (en) 1999-03-31 2004-11-30 Verizon Laboratories Inc. Hybrid category mapping for on-line query tool
US6578078B1 (en) 1999-04-02 2003-06-10 Microsoft Corporation Method for preserving referential integrity within web sites
US6269361B1 (en) 1999-05-28 2001-07-31 Goto.Com System and method for influencing a position on a search result list generated by a computer network search engine
US6490719B1 (en) 1999-07-26 2002-12-03 Gary Thomas System and method for configuring and executing a flexible computer program comprising component structures
US6718363B1 (en) 1999-07-30 2004-04-06 Verizon Laboratories, Inc. Page aggregation for web sites
US6353825B1 (en) 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques
US6567854B1 (en) 1999-10-21 2003-05-20 Genuity Inc. Internet service delivery via server pushed personalized advertising dashboard
US6574631B1 (en) 2000-08-09 2003-06-03 Oracle International Corporation Methods and systems for runtime optimization and customization of database applications and application entities
EP1195676A3 (en) 2000-10-03 2007-03-28 Microsoft Corporation Architecture for customizable applications
US6978419B1 (en) 2000-11-15 2005-12-20 Justsystem Corporation Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments
US20020099583A1 (en) 2001-01-24 2002-07-25 Matusek Lawrence W. Architecture and techniques for providing product configurations to an enterprise resource planner
US20030149578A1 (en) 2001-06-01 2003-08-07 Vientity Private Limited Intelligent procurement agent
US20030093433A1 (en) 2001-11-14 2003-05-15 Exegesys, Inc. Method and system for software application development and customizible runtime environment
US6954901B1 (en) 2001-12-13 2005-10-11 Oracle International Corporation Method and system for tracking a user flow of web pages of a web site to enable efficient updating of the hyperlinks of the web site
JP3791908B2 (en) 2002-02-22 2006-06-28 インターナショナル・ビジネス・マシーンズ・コーポレーション SEARCH SYSTEM, SYSTEM, SEARCH METHOD, AND PROGRAM
US6737994B2 (en) 2002-05-13 2004-05-18 International Business Machines Corporation Binary-ordered compression for unicode
US6886010B2 (en) 2002-09-30 2005-04-26 The United States Of America As Represented By The Secretary Of The Navy Method for data and text mining and literature-based discovery
JP4828091B2 (en) 2003-03-05 2011-11-30 ヒューレット・パッカード・カンパニー Clustering method program and apparatus
US7080089B2 (en) 2003-03-12 2006-07-18 Microsoft Corporation Customization of process logic in a software system
GB2399427A (en) 2003-03-12 2004-09-15 Canon Kk Apparatus for and method of summarising text
US7409422B2 (en) 2003-08-21 2008-08-05 Microsoft Corporation Declarative page view and click tracking systems and methods
US7801887B2 (en) 2004-10-27 2010-09-21 Harris Corporation Method for re-ranking documents retrieved from a document database
KR20070084004A (en) 2004-11-05 2007-08-24 가부시키가이샤 아이.피.비. Keyword extracting device
US7467349B1 (en) 2004-12-15 2008-12-16 Amazon Technologies, Inc. Method and system for displaying a hyperlink at multiple levels of prominence based on user interaction
US7630980B2 (en) 2005-01-21 2009-12-08 Prashant Parikh Automatic dynamic contextual data entry completion system
US7451124B2 (en) 2005-05-12 2008-11-11 Xerox Corporation Method of analyzing documents
JP4756953B2 (en) 2005-08-26 2011-08-24 富士通株式会社 Information search apparatus and information search method
US7657506B2 (en) 2006-01-03 2010-02-02 Microsoft International Holdings B.V. Methods and apparatus for automated matching and classification of data
US7720837B2 (en) 2007-03-15 2010-05-18 International Business Machines Corporation System and method for multi-dimensional aggregation over large text corpora
US20110004588A1 (en) * 2009-05-11 2011-01-06 iMedix Inc. Method for enhancing the performance of a medical search engine based on semantic analysis and user feedback

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5206949A (en) * 1986-09-19 1993-04-27 Nancy P. Cochran Database search and record retrieval system which continuously displays category names during scrolling and selection of individually displayed search terms
US5321833A (en) * 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
US5371807A (en) * 1992-03-20 1994-12-06 Digital Equipment Corporation Method and apparatus for text classification
US5544360A (en) * 1992-11-23 1996-08-06 Paragon Concepts, Inc. Method for accessing computer files and data, using linked categories assigned to each data file record on entry of the data file record
US5835087A (en) * 1994-11-29 1998-11-10 Herz; Frederick S. M. System for generation of object profiles for a system for customized electronic identification of desirable objects

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8117072B2 (en) 2001-11-13 2012-02-14 International Business Machines Corporation Promoting strategic documents by bias ranking of search results on a web browser
EP1665093A1 (en) * 2003-08-21 2006-06-07 Idilia Inc. System and method for associating documents with contextual advertisements
EP1665093A4 (en) * 2003-08-21 2006-12-06 Idilia Inc System and method for associating documents with contextual advertisements
US7774333B2 (en) 2003-08-21 2010-08-10 Idilia Inc. System and method for associating queries and documents with contextual advertisements
US8024345B2 (en) 2003-08-21 2011-09-20 Idilia Inc. System and method for associating queries and documents with contextual advertisements
EP2397954A1 (en) * 2003-08-21 2011-12-21 Idilia Inc. System and method for associating queries and documents with contextual advertisements
EP1808787A1 (en) 2006-01-17 2007-07-18 Sap Ag Deep enterprise search
US7853607B2 (en) 2006-08-25 2010-12-14 Sap Ag Related actions server

Also Published As

Publication number Publication date
AU4328000A (en) 2000-10-16
US6507839B1 (en) 2003-01-14
US7725424B1 (en) 2010-05-25
US6496818B1 (en) 2002-12-17
US6850935B1 (en) 2005-02-01
US8095533B1 (en) 2012-01-10

Similar Documents

Publication Publication Date Title
US7047242B1 (en) Weighted term ranking for on-line query tool
US6421683B1 (en) Method and product for performing data transfer in a computer system
US6826559B1 (en) Hybrid category mapping for on-line query tool
US6643640B1 (en) Method for performing a data query
US6496843B1 (en) Generic object for rapid integration of data changes
US6493721B1 (en) Techniques for performing incremental data updates
US6408294B1 (en) Common term optimization
US6374241B1 (en) Data merging techniques
US6484161B1 (en) Method and system for performing online data queries in a distributed computer system
US6397228B1 (en) Data enhancement techniques
WO2000058863A1 (en) Techniques for performing a data query in a computer system
US6615209B1 (en) Detecting query-specific duplicate documents
US7783626B2 (en) Pipelined architecture for global analysis and index building
US8849693B1 (en) Techniques for advertising in electronic commerce
US7424486B2 (en) Selection of search phrases to suggest to users in view of actions performed by prior users
US6694323B2 (en) System and methodology for providing compact B-Tree
Mandhani et al. Query caching and view selection for XML databases
US11163802B1 (en) Local search using restriction specification
US7373341B2 (en) Computer readable medium, method and apparatus for preserving filtering conditions to query multilingual data sources at various locales when regenerating a report
US7987165B2 (en) Indexing system and method
US8332422B2 (en) Using text search engine for parametric search
US7533136B2 (en) Efficient implementation of multiple work areas in a file system like repository that supports file versioning
EP1590745A2 (en) A system and method for providing content warehouse
US8275661B1 (en) Targeted banner advertisements
US7024405B2 (en) Method and apparatus for improved internet searching

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP