WO2005081133A2 - Improved processor for a markup based metalanguage, such as xml - Google Patents

Improved processor for a markup based metalanguage, such as XML

Info

Publication number
WO2005081133A2
Authority
WO
WIPO (PCT)
Prior art keywords
metalanguage
sequence
lexicon
file
data
Prior art date
Application number
PCT/IB2005/000197
Other languages
French (fr)
Other versions
WO2005081133A3 (en)
Inventor
Robert Cheslow
Harry W. Loveland
Original Assignee
Agilience
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agilience filed Critical Agilience
Publication of WO2005081133A2 publication Critical patent/WO2005081133A2/en
Publication of WO2005081133A3 publication Critical patent/WO2005081133A3/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 - Information retrieval of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 - Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G06F16/83 - Querying
    • G06F16/835 - Query processing
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/958 - Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 - Document structures and storage, e.g. HTML extensions

Definitions

  • Metalanguages like XML are very powerful, and their use is constantly expanding. Thus, it is now contemplated to create XML-based databases.
  • the present invention provides advances towards better solutions.
  • a metalanguage processor comprising:
  • a parser capable of decomposing a metalanguage file into source file segments, in accordance with the syntax of the metalanguage, - a lexicon, comprising a representation of a set of strings, said representation being searchable, using a search string, to get a unique identifier for that search string, and
  • a converter capable of deriving a search string from a source file segment, and of interacting with the lexicon for determining an identifier for that search string
  • said converter building a representation of the metalanguage file as a stored sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for the search string derived from the corresponding source file segment.
  • the invention also encompasses the converter or tokenizer for use with such a parser, while constructing the lexicon and sequence in memory and/or in a disk or other mass storage device.
  • a method of processing a metalanguage file comprising : a. parsing the metalanguage file, for identifying therein successive source file segments in accordance with the syntax of the metalanguage, b. maintaining a lexicon, forming a directly searchable representation of strings, in correspondence with a unique identifier for each string, c. converting a search string derived from a source file segment into a corresponding identifier, using the lexicon, and d. progressively building a sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for a search string derived from the corresponding source file segment.
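  • By way of illustration, the following C++ sketch shows the overall flow just described: a parser delivers (type, string) segments, a lexicon returns a unique identifier per segment, and the converter appends those identifiers to a stored sequence. All names (Token, Lexicon, convert) are illustrative, not the patent's own identifiers.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Token { int type; std::string cleaned; };   // the (X, Y) pair of a segment

class Lexicon {
    std::map<std::pair<int, std::string>, uint32_t> ids_;
public:
    // Return the unique identifier for (type, string), creating one if new.
    uint32_t idFor(int type, const std::string& s) {
        auto key = std::make_pair(type, s);
        auto it = ids_.find(key);
        if (it != ids_.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(ids_.size());
        ids_[key] = id;
        return id;
    }
};

// The converter: for each source file segment delivered by the parser,
// derive a search string, look it up in the lexicon, and append the
// resulting identifier to the stored sequence of data blocks.
std::vector<uint32_t> convert(const std::vector<Token>& parsed, Lexicon& lex) {
    std::vector<uint32_t> sequence;
    for (const auto& t : parsed)
        sequence.push_back(lex.idFor(t.type, t.cleaned));
    return sequence;
}
```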
  • a database for storing data representing a metalanguage file comprising:
  • a lexicon comprising a representation of a set of strings, said representation being directly searchable, using an identifier, to obtain a unique corresponding string, the set of strings in the lexicon covering substantially all the meaningful string data in the metalanguage file, - a sequence of data blocks individually based on identifiers being searchable in the lexicon, the order of the data blocks in said sequence being related to the order of corresponding strings in the metalanguage file.
  • FIG. 1 is a block diagram of a computer station in which this invention may be performed;
  • FIG. 2 is a block diagram of an embodiment of an XML source document processor;
  • FIG. 3 is a tree representation of an exemplary XML source document (shown in detail in Exhibit E2);
  • FIGS. 4 and 4A are exemplary flow charts of the conversion of a text XML source document into compact storage data representing that document;
  • FIG. 5 is an exemplary flow chart of the determination of node levels from the compact storage data;
  • FIG. 6 is an exemplary flow chart of the determination of the "scope" of a node from the compact storage data;
  • FIG. 7 is an exemplary flow chart of the determination of the scopes for a plurality of nodes, from the compact storage data;
  • FIG. 8 is an exemplary flow chart of an optional operation of posting which may be performed on the compact storage data;
  • FIG. 9 is a block diagram of an embodiment of a query processor;
  • FIGS. 10 and 10A show simplified data structures forming a so-called "ternary trie";
  • FIG. 11 is a schematic diagram illustrating an example of inverted index and posting file;
  • FIG. 12 is a diagram illustrating a priority queue and priority queue nodes;
  • FIGS. 13, 14 and 15 are diagrams illustrating the operation of functions of an API included in an embodiment of the query processor.
  • a portion of the disclosure of this patent document contains material which may be subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright and/or author's rights whatsoever.
  • Exhibit El contains a basic XML expression, and developments thereon ;
  • Exhibit E2 contains an exemplary XML document ;
  • Exhibit E3 contains tables illustrating certain aspects of the scanning of the exemplary XML document;
  • Exhibit E4 contains tables illustrating conditions for use in regenerating an XML document.
  • Exhibit E5 describes design and implementation details of an exemplary embodiment.
  • references to the Exhibits may be made directly by the Exhibit or Exhibit section identifier: for example, E1.1 directly designates section 1.1 in Exhibit E1 (the prefix designating the Exhibit may be omitted if there is no ambiguity).
  • the Exhibits are placed apart for the purpose of clarifying the detailed description, and of enabling easier reference. They nevertheless form an integral part of the description of the present invention.
  • the Exhibits comprise tables.
  • the tables may have a leftmost column headed Ref. This column is intended only to receive references for facilitating discussion in the written specification.
  • FIG. 1 represents an example of the hardware of such computer systems.
  • the hardware comprises:
  • processor or CPU 11 e.g. an Intel or AMD model
  • program memory 12 e.g. an EPROM, a RAM, or Flash memory
  • working memory 13 e.g. a RAM of any suitable technology
  • mass memory 14 e.g. one or more hard disks
  • a display 15 e.g. a monitor
  • a user input device e.g. a keyboard and/or a mouse
  • Network interface device 21 connected to a communication medium 20, which is in communication with other computers.
  • Network interface device 21 may be of the Ethernet type, or of the ATM type.
  • Medium 20 may be based on wire cables, fiber optics, or radio-communications, for example.
  • Bus systems may include a processor bus, e.g. PCI, connected via appropriate bridges to, e.g. an ISA or a SCSI bus.
  • the system of Figure 1 may be provided with a programming environment having the capability to process objects, e.g. C++, C# or Java, and also to support a parser of the metalanguage being used, e.g. XML.
  • the markup tag is the string 'price'.
  • the head mark-up may comprise the attribute name 'currency', with its attribute value here being the string "euro".
  • the string data is the value 22.5, which is related to the markup tag 'price'.
  • XML documents are based on character strings. Their content is delimited by mark-ups, recognizable from a <string ...> ... </string> construction, where:
  • the XML document may be cut-off into "words".
  • Each such word is called a "token string" in this specification.
  • a given XML document may be biunivocally associated with one or more tree structure topologies.
  • an XML document normally has a main markup section (a container looking like "<begindoc> {substance of document} </enddoc>").
  • Such a document may thus be associated with a single tree structure.
  • Consider the exemplary node expression Ni in Exhibit E1.
  • the tag is the object name, contained in the "element", i.e. at the beginning of the head markup ("price", in the example).
  • Associated with the object or node are (if any):
  • Table El.2 in Exhibit El shows an exemplary table of XML types. This table is a significant ingredient of this invention, in that the types contained therein make it possible to deal with the commonly available XML documents.
  • the table shows the type names, starting with "TT_" (for token type). It also shows a corresponding unique code (typeId), here an integer (valid both in decimal and hexadecimal, since the maximum is 9).
  • The last line shows "TT_CDATAPWS", in which PWS stands for punctuation and white space.
  • Ten (10) types are currently defined in table E1.2. This is based on the common XML standard: apart from the types reserved for the XML-specific constructions, all raw data are in character string form ("CDATA"). Should further types be introduced in the future, a higher number of types may be defined, e.g. if it were desired to differentiate the raw data between usual software data types, like "string", "integer", "boolean", etc. Further types might also be used to handle the comment structures of XML, having the format "<!-- ... -->". However, such comment structures are ignored in the exemplary embodiment described hereinafter. Additionally, it may be interesting to add one or more types covering a string followed by a space; this is very helpful, both in terms of memory occupation and search efficiency.
  • the second line of section E1.1 in Exhibit E1 shows the type name (in accordance with table E1.2) for each significant token in expression Ni.
  • the type may be directly derived from the syntax rules of XML.
  • the XML expressions are thus decomposed in their significant tokens.
  • Each token is associated with a pair of data (X, Y), where X is the token type, and Y is the token string.
  • Y is the "cleaned" token string, i.e. the portion of the token string obtained after removing the XML symbols, and other non-meaningful symbols, as will be explained.
  • the cleaned string for "</price>" is "price".
  • the token type X is converted into a token type code g(X), named typeId.
  • This function g(X) may be simply viewed as an inspection (lookup) of table E1.2, based on the type as determined by an XML parser, in accordance with the XML syntax rules, some of which are also recalled in table E1.2.
  • Note certain XML rules are not reflected in Table E1.2, e.g. the following optional notation for an empty node: <emptynode/> instead of <emptynode></emptynode>. Such a case may be dealt with by adding appropriate additional conditions in the type determination.
  • the cleaned token string Y is converted into a token string code h(Y), named stringId.
  • This function h(Y) is "unique". "Unique" means here that the stringId shall correspond to any occurrence of the string Y, but to no different string, within the context of interest.
  • the function h(Y) may be viewed as follows:
  • if the token string Y is new, h(Y) is the stringId delivered by the lexicon for that new token string Y;
  • if the token string Y already exists in the lexicon, h(Y) is the existing corresponding stringId.
  • Since the lexicon may be viewed as a table, one may use known techniques to generate unique identifiers in a table. The uniqueness may be simply obtained by checking that the calculated identifier for a new Y is different from all existing stringIds in the lexicon.
  • The LexId identifier shall correspond to any occurrence of the pair of data (X, Y), but to no different pair of data, within the context of interest. Pairs of data are "different" if they differ by the token type X, or by the token string Y, or both.
  • Function f() combines g(X) and h(Y). Any function making such a combination while preserving the unique character of the result may be used. A mere concatenation of the strings g(X) and h(Y) is a simple and very convenient way to implement function f().
  • Section E1.4 shows a linear sequence of LexIds, noted seq(Ni), which entirely represents expression Ni (the punctuation in E1.4 is for clarity only, and may not exist in the data).
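  • A minimal C++ sketch of the functions g(X), h(Y) and f() described above follows. The binary 4-bit/28-bit packing used for f() anticipates the LexID layout given in Exhibit E5; the abbreviated type list and all names are assumptions for illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Abbreviated token types; values 1, 2, 8 and 9 follow the text,
// the enumerator names are assumed.
enum TokenType { TT_ELEMENT = 1, TT_ATTRNAME = 2, TT_CDATA = 8, TT_CDATAPWS = 9 };

// g(X): the token type code (typeId) is a direct lookup of table E1.2.
uint32_t g(TokenType x) { return static_cast<uint32_t>(x); }

// h(Y): unique stringId per cleaned token string, generated on first sight.
class StringLexicon {
    std::unordered_map<std::string, uint32_t> ids_;
public:
    uint32_t h(const std::string& y) {
        auto it = ids_.find(y);
        if (it != ids_.end()) return it->second;          // existing stringId
        uint32_t id = static_cast<uint32_t>(ids_.size()); // new unique stringId
        ids_.emplace(y, id);
        return id;
    }
};

// f(): combine g(X) and h(Y) while preserving uniqueness. Here the
// "concatenation" is done in binary: 4 high bits of type, 28 bits of string.
uint32_t f(uint32_t typeId, uint32_t stringId) {
    return (typeId << 28) | (stringId & 0x0fffffff);
}
```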
  • an XML source document 20 is submitted to a data processor 30, which will perform various functions, comprising a native and compact storage of the XML data in a very efficient and convenient form.
  • Data processor 30 comprises an XML parser 31.
  • an XML parser is capable of scanning an XML document, and finding therein the specific symbols of the XML metalanguage.
  • Data processor 30 also comprises a tokenizer 32, a lexicon 33, a sequencer 34, and a sequence store 35.
  • The calculation of a unique identifier stringId from a string Y is hereinafter termed "tokenization".
  • the corresponding calculation unit or process is called a “tokenizer”.
  • each stringId is stored in lexicon 33.
  • the tokenizer may be viewed as creating and maintaining a lexicon mapping, associating a unique integer stringId to each unique cleaned token string Y.
  • the associated typeId (derived from X) may not need to be stored in lexicon 33, since it is necessary only in the sequencer 34, to build a sequence stored in sequence store 35. However, it may be useful that lexicon 33 physically stores the whole LexId, i.e. both the stringId and the typeId. This makes it possible to search in the lexicon for both the cleaned string Y and its type X.
  • Data processor 30 may further comprise an indexer 36, and index tables 37.
  • indexer 36 comprises a posting mechanism
  • index tables 37 comprise posting lists (both will be described hereinafter).
  • an XML document may be decomposed into a tree structure of XML objects or nodes, each of which is similar to the above described node expression Ni.
  • Figure 3 shows a tree structure corresponding to the nodes of the exemplary XML document of Exhibit E2.
  • Each node circle in Figure 3 contains only the "element" tag of the node (the attributes and CDATA, not shown, may be viewed as attached to the node object). Also, each circle is labeled with the "ref" of its "element" tag, as it appears in Table E3.0 of Exhibit E3 (to be described).
  • Table E3.0 shows the parsing of the XML document.
  • the parser is capable, in known fashion, of isolating individual XML tokens, based on the syntax rules and the typical XML dedicated symbols.
  • the tokens individually appear one by one as reflected in the second column of table E3.0.
  • the head and tail XML symbols of each token, if any, are separated (at least logically) as noted in the third and fourth columns of table E3.0, respectively.
  • the string contained within the token cleaned by removal of its possible head and tail XML symbols, is reflected in the fifth column.
  • the possible indentations in the source file may be kept, thus making it possible to store and regenerate the source file exactly.
  • each and every character is considered, including white spaces.
  • the white spaces and other punctuation signs are processed one by one. Note, however, that the first and last white space of a block of raw data (CDATA), if any, have not been processed in this illustration, to avoid unduly long tables in the Exhibits.
  • the sixth column indicates the presence and nature of such punctuation and whitespace characters, where appropriate.
  • the typeIds 1, 4, 5, 6 and 7 may be readily determined from the XML head symbols of the token, if any.
  • the typeId 2 may be determined from the fact that the immediately preceding token, with typeId 1, opens a head markup which is not yet closed.
  • the rest is raw data or CDATA, and basically receives typeId 8, as reflected in column CND4 of Table E3.0. Preferably, the non-alphanumeric characters, i.e. punctuation and white spaces, are specially marked by receiving typeId 9.
  • Column CND5 reflects this by indicating 1 to be added to the integer 8 in column CND4, where such non-alphanumeric characters are met.
  • the XML file heading (refs 1-6) may be processed the same way. However, since its location and syntax are predetermined, its processing may be simply directly hard coded.
  • the whole process may be easily hard coded, using e.g. IF ... ELSE ... ENDIF statements, or similar programming structures, e.g. CASE structures. This is faster than looking up in a table.
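  • A hedged sketch of such hard-coded type determination follows. The tests on the XML head symbols reflect table E1.2 only approximately (the table itself is not reproduced here); the helper name and the insideHeadMarkup flag are illustrative, and attribute values and the remaining types are omitted.

```cpp
#include <cctype>
#include <string>

// A sketch only: real code would track the parser state more carefully.
int typeIdFor(const std::string& rawToken, bool insideHeadMarkup) {
    if (rawToken.rfind("</", 0) == 0) return 4;   // end markup, e.g. "</price>"
    if (rawToken.rfind("<", 0) == 0)  return 1;   // element head, e.g. "<price"
    if (insideHeadMarkup)             return 2;   // attribute name (follows typeId 1)
    bool alnum = !rawToken.empty()
              && std::isalnum(static_cast<unsigned char>(rawToken[0]));
    return alnum ? 8 : 9;   // CDATA (8), or punctuation / white space (9)
}
```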
  • table E3.1 shows an exemplary way of how the lexicon may be progressively constructed.
  • the lexicon itself comprises the two rightmost columns of table E3.1, as shown in bold characters.
  • The LexId is the concatenation of the typeId and the stringId, both in hexadecimal.
  • the format (e.g. the number of "0"s in the stringId) is illustrative only. It is intended to form the LexId as an integer of fixed length. Also, the space between the 1st digit and the rest is for illustration only, and will not exist in the computer's memory.
  • Table E3.1 only shows the first occurrence of a given (string, type) pair. This is why Table E3.1 ends at ref 43, while Table E3.0 ends at ref 55.
  • table E3.2 shows how the sequence is progressively constructed for the XML source file.
  • each token is associated with a unique sequence identifier or seqId.
  • a sequence is constructed, in which each token is represented by its LexId, and associated with a corresponding seqId.
  • A LexId may be repeated in the sequence, as appears e.g. for seqIds 8 and 31.
  • the sequence as stored is simply an ordered list of LexIds of fixed length, or, more simply, a concatenation of the LexIds.
  • the seqId of each LexId in the sequence is virtual: it is just the offset of the LexId from the beginning of the ordered list of LexIds.
  • the seqIds may be stored as integers, e.g. in native hexadecimal format.
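  • A minimal sketch of the sequence store as just described: a concatenation of fixed-length LexIds in which the seqId is simply the offset. Names are illustrative.

```cpp
#include <cstdint>
#include <vector>

class SequenceStore {
    std::vector<uint32_t> lexIds_;   // fixed-length LexIds, concatenated
public:
    // Appending a LexId returns its (virtual) seqId: the offset from the
    // beginning of the ordered list.
    std::size_t append(uint32_t lexId) {
        lexIds_.push_back(lexId);
        return lexIds_.size() - 1;
    }
    uint32_t at(std::size_t seqId) const { return lexIds_[seqId]; }
    std::size_t size() const { return lexIds_.size(); }
};
```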
  • the flow chart of figure 4 shows how the above operations may be performed upon scanning (or parsing) an XML source file, for converting it into a corresponding coded compact "native format".
  • Parsing begins at 400, and a first token, determined using current XML file parsing techniques, is considered as the current token Ti at 402. Its type X(Ti) is determined. If not meaningful (e.g. a comment), this token is ignored and the next meaningful token becomes the current token Ti (operation 480). Else, token Ti is processed to obtain a string Y(Ti), cleaned from the XML syntax symbols, using known techniques, like the ones described in connection with Table E1.2.
  • the lexicon contains the whole LexIds, i.e. both the correspondence between the cleaned strings Y(Ti) and the stringIds, and the correspondence between the token type X(Ti) and the typeId.
  • the Lexicon 33 (Figure 2) is searched for finding a LexId which corresponds to the pair Pi of X(Ti) and Y(Ti).
  • If found (410), the corresponding LexId in Lexicon 33 is retained for the pair Pi, i.e. token Ti. Else (412), a new entry is added in Lexicon 33, with a new unique LexId. In fact, the tokenizer 32 interacts with Lexicon 33 to build:
  • a stringId corresponding to h[Y(Ti)], where function h() is a function delivering a unique new stringId, e.g. from inspecting the Lexicon, and
  • a typeId corresponding to g[X(Ti)], with function g() being implemented by code derived from table E1.2. (In fact, the typeId may be determined at any time between operation 402 and operation 414.)
  • operation block 420 may be performed to carry out one or more other "in-line processing" operations. Examples of this will be described hereinafter.
  • Figure 4A shows an alternative embodiment, in which the lexicon contains the stringIds only.
  • the string Y(Ti) is searched in the lexicon 33 (Figure 2). If the string Y(Ti) is not found in the lexicon 33, then (412, Figure 4A) the tokenizer 32 interacts with Lexicon 33 to build a new stringId corresponding to h[Y(Ti)], as described above.
  • the typeId may be determined at any time between operation 402 and operation 413.
  • Metalanguages like XML use sections, called "elements" or "nodes" or also "XML objects" in the case of XML. They may also have strict rules of nesting sections within each other, i.e. such sections must be either strictly embedded within each other, or juxtaposed. XML goes even further, imposing that the "begin" and "end" markups of the elements are also themselves strictly embedded, while HTML is more tolerant.
  • a tree structure may be associated with a given metalanguage file.
  • the nodes or elements may be defined using a path notation, like "/bib/book” .
  • the path separator here is "/", and, when cited in this specification, a path is provided with string delimiters (") for clarity.
  • the node tag indicates where the beginning element ("+") of a node is seen, and where the end of a node is seen ("-").
  • the flow-chart of figure 5 illustrates this mechanism.
  • The scanning starts with a node level NL_0, which may be the true node level of the element (absolute level); alternatively, NL_0 may be 0, in which case relative node levels will be obtained.
  • NL_X is set to NL_0.
  • Block 520 is the place where to perform a node-level-related operation, if desired, for example creating a software object for a new node if a typeId "1" has just been met. (If the operation depends upon the previous node level, it would be placed between 504 and 506, for example filling in a software object with an attribute name and attribute value.)
  • operation 530 detects if a given end condition is met (e.g. having descended the tree by two levels); if yes, this is the end 540; else, control is returned to operation 504. Note the end condition may depend upon the operation made at 520, if any. It will be understood that this makes it possible to use any portion of the sequence as a basis for virtually any software processing.
  • the software processing will operate the same as with the XML source file, however using the more compact and easily searchable representation being proposed.
  • The operations of FIG. 5 may be directly embedded within those of any process scanning the sequence, whether totally or partially. Note also that the sequence might be scanned backwards. If so, the starting point 502 should preferably be the end tag of a node.
  • operation 502 may look for the next correct starting point, in either direction.
  • CurPath is concatenated with "/", plus the cleaned string of the current item (done at 508 or 520);
  • CurPath is reduced by removing the last concatenated item (done at 512 or 520).
  • the above path construction mechanism may give absolute paths, e.g. when the whole file or the whole sequence is scanned, or the path of the starting node is known. It may be used to get relative paths as well, e.g. when starting from an intermediate node whose path is already known for other reasons.
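  • The following sketch combines the node-level tracking of figure 5 with the CurPath construction just described. It assumes the 4-bit/28-bit LexId packing used elsewhere in this description; stringOf() is a placeholder for the lexicon lookup, and the end condition is one example among many.

```cpp
#include <cstdint>
#include <string>
#include <vector>

inline uint32_t typeIdOf(uint32_t lexId) { return lexId >> 28; }

// Placeholder for the lexicon lookup (id -> cleaned string).
std::string stringOf(uint32_t lexId) {
    return "s" + std::to_string(lexId & 0x0fffffff);
}

void trackLevels(const std::vector<uint32_t>& seq, std::size_t start, int NL_0) {
    int level = NL_0;                    // absolute level, or 0 for relative levels
    std::string curPath;
    for (std::size_t i = start; i < seq.size(); ++i) {
        uint32_t t = typeIdOf(seq[i]);
        if (t == 1) {                    // element head: descend one level
            ++level;
            curPath += "/" + stringOf(seq[i]);   // extend CurPath (508/520)
        } else if (t == 4) {             // end markup: ascend one level
            if (--level < NL_0) break;   // example end condition (block 530)
            auto pos = curPath.rfind('/');
            if (pos != std::string::npos)
                curPath.erase(pos);      // truncate CurPath (512/520)
        }
        // block 520: a node-level-related operation would go here
    }
}
```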
  • Table E3.2 shows the paths, using the tag names as they appear in the XML document. This is for facilitating the understanding. In fact, in the compact storage as described, the representation is different, as it uses the LexIds. This should be kept in mind when reading this specification. For example: "/bib/book" stands for '1 0000007 / 1 0000008' (the white spaces and slash are for clarity only, and are not stored). Note also that the leading '1' of the typeId is not necessary to define the path unambiguously, to the extent the string refers to an element node.
  • The constructions of tables E3.0, E3.1, E3.2 and E3.3 may be conducted in parallel, when initially scanning an XML source file, or when scanning the sequence.
  • operation 624 scans down the sequence until a LexId having typeId 4 is found (here 40000008) at the same node level (the mechanism of figure 5 may be used to track the node level during scanning). This stops here at seqId 30.
  • the scope of that occurrence of "/bib/book" is from 8 to 30 in terms of seqIds.
  • This function, called Scan1Scope() in box 620, may be used for several purposes, e.g. when processing queries.
  • The scope determined in accordance with figure 6 may be stored as a scope length, associated with the node.
  • the simplest is to associate the scope length with the heading seqId (or seqOffset) of the node element.
  • scope storage is in the form of pairs (seqId, scopeLen), or (seqOffset, scopeLen). If desired, such node lengths or scopes might be stored during the XML file scanning of Figure 4, any time after operation 420, where both the LexId and seqId of a node are known. Deciding scope storage when parsing the XML file is a matter of compromise between memory occupation and utility. Thus, in most cases, scope storage will not be performed initially for all nodes.
  • scope storage may instead be performed later, when processing queries. Whenever a node length is determined during the processing of a query, one may keep ("cache") a representation of this node in the form of the seqId of its head element, and of its length, i.e. the pair (seqId, scopeLen) or (seqOffset, scopeLen).
  • this flow chart defines a function which may be called Scope(LexId).
  • In this case, the node element is simply defined by its stringId.
  • the LexId is built by combining the typeId ("1") with the stringId, according to function f(). Then the function Scope(LexId) is called. This works basically with a stringId referring to an element node; however, in other cases, one may scan the sequence, upwards or downwards, to reach the next element node.
  • Operation 730 of figure 7 searches for the LexId in the sequence (subject to additional conditions, if desired). If a seqId is found (732), the situation of operation 620 in figure 6 is met again, and the Scan1Scope() process of figure 6 may be executed, as described above. By returning control to operation 730, this may be repeated to find further node elements having the LexId, if desired. If nothing is found, or if it is not desired to continue the search, this is the end 740.
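  • A hedged sketch of this Scope(LexId) / Scan1Scope() mechanism: find an occurrence of the element's LexId, then scan down to the matching end markup (typeId 4) at the same node level. For simplicity, the sketch matches on node level only, relying on well-formed nesting rather than on comparing the end markup's stringId.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

inline uint32_t typeOf(uint32_t lexId) { return lexId >> 28; }

// Returns {headSeqId, scopeLen} for the first occurrence of elementLexId
// at or after 'from', or {npos, 0} if none is found.
std::pair<std::size_t, std::size_t>
scope(const std::vector<uint32_t>& seq, uint32_t elementLexId, std::size_t from) {
    const std::size_t npos = static_cast<std::size_t>(-1);
    for (std::size_t i = from; i < seq.size(); ++i) {
        if (seq[i] != elementLexId) continue;       // operation 730 / 620
        int level = 0;
        for (std::size_t j = i + 1; j < seq.size(); ++j) {   // operation 624
            if (typeOf(seq[j]) == 1) ++level;                // nested element
            else if (typeOf(seq[j]) == 4) {
                if (level == 0) return {i, j - i};  // matching end, same level
                --level;
            }
        }
    }
    return {npos, 0};
}
```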
  • a Scope(xmlString) function may be used.
  • Assume a query refers to "author". It will arrive at operation 700 in figure 7.
  • operation 710 will look in the lexicon for "author" and find stringId "1 0000010" (with the typeId 1 added).
  • the rest of the process is executed from operation 720, as described above (however with "author" instead of "book" in the illustration).
  • the correspondence between the pairs of data X (token type), Y (cleaned token string) and their respective typeId, stringId may be stored in the lexicon.
  • the lexicon uses a memory space, and preferably has indexing mechanisms enabling a fast implementation of at least some of the following functions, which comprise:
  • Conceptually, the Lexicon looks like a table. Although a plain table may be used where small XML files are considered, accelerating mechanisms are desirable to improve performance in terms of response time.
  • An indexing mechanism producing index tables may be used, e.g. with at least one index for the strings and the types (for the XML-to-Id group of functions), and at least one index for the LexIds (for the Id-to-XML group of functions).
  • a new stringId entered in the Lexicon is numbered in sequence with the existing stringIds. This is mainly for explanation purposes. More sophisticated data representations may be used to facilitate fast search of strings in the Lexicon.
  • Tries are a general-purpose data structure of the dictionary type, that is, supporting the three main operations of search, insertion and deletion.
  • the bst-trie can be represented as a ternary tree where the search on letters is conducted as in a standard binary search tree, while the tree descent is performed by following an escape pointer upon equality of letters.
  • the performance of tries depends upon the probabilistic properties of the strings processed. More precisely, two types of models may be considered, including models for the infinite strings inserted in the tries.
  • the "list-trie” may be used when storing data in memory (for example when processing requests, as it will be described hereinafter).
  • the second mentioned article describes a C++ implementation, which may be used in accordance with this invention.
  • the ternary trie may be applied here to build a bi-directional fast lexicon.
  • Bi-directional means that the lexicon is searchable both: using an input string, to determine whether it is in the lexicon, and, if so, to recover its stringId or LexId; and using an identifier (stringId or LexId), to recover the corresponding string.
  • a ternary trie may be viewed as a ternary tree, arranged in a specific manner.
  • a ternary trie is based on an alphabet (set of characters, including symbols), and a predefined way of sorting that alphabet. This might be e.g. the ASCII characters, sorted according to their ASCII binary code.
  • the principle of a ternary trie is as follows:
  • each of the cells may point to a respective node at the immediately lower level;
  • the left-hand cell may point to a lower level node, or not (in which case it is shown shaded); similarly, the right-hand cell may point to a lower level node, or not (in which case it is shown shaded).
  • Each leaf node may be (virtually) attached with an id corresponding uniquely to the string it represents, as shown in figure 10 using formal expressions like id(book).
  • Here, id is a shortened notation for a lexicon identifier (whether it is a stringId, a typeId, or a concatenation of both, i.e. a LexId).
  • the id of a leaf node may be defined as a digital description of the path through the ternary tree to reach that leaf node. This has the advantage that, starting from an id, one simply has to follow the path defined by that id through the tree, concatenating the characters met at the central cells where appropriate, until the leaf node, inclusive, whose character(s) is (are) also concatenated.
  • Figure 10A differs from Figure 10 in that the above described process continues until there is at most one character in the leaf node. Also, path identifying digits (to be described) are shown on the links between the nodes. For example, "book" is now obtained by "b" + "o" + "o" + "k", instead of "b" + "ook".
  • The leaf nodes in Figure 10A are, further preferably, provided with a three-cell structure. This has the advantage that, when a previously converted file is converted again after having been updated, the original ids may be left unchanged.
  • the structure of the tree depends upon the selection of the character being put in each central cell. This pertains to the art of "balancing trees", which is well known to those skilled in the art.
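  • A minimal C++ sketch of the three-cell node and of the search rule quoted earlier (comparison on letters as in a binary search tree, descent through the escape pointer upon equality), with one character per node as in Figure 10A. Structure and names are illustrative.

```cpp
#include <string>

struct TstNode {
    char c;                     // character held in the central cell
    TstNode* left  = nullptr;   // characters ordered before c
    TstNode* mid   = nullptr;   // next character of the string (escape pointer)
    TstNode* right = nullptr;   // characters ordered after c
    int id = -1;                // lexicon identifier attached where a string ends
    explicit TstNode(char ch) : c(ch) {}
};

// Search: compare on letters like a binary search tree; upon equality,
// consume the character and follow the escape pointer downwards.
int lookup(const TstNode* node, const std::string& s) {
    std::size_t i = 0;
    while (node && i < s.size()) {
        if (s[i] < node->c)      node = node->left;
        else if (s[i] > node->c) node = node->right;
        else {
            if (++i == s.size()) return node->id;   // whole string consumed
            node = node->mid;
        }
    }
    return -1;   // not found
}
```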
  • Index tables may be added, if desired.
  • a total of 8 bits is sufficient (as a "three-cell node data") to represent a three-cell node, or a leaf node (which is the same or simpler).
  • the tree storage should also define the link between each of the three cells and the corresponding next cells in the lower level. This may be facilitated to some extent by using predefined rules for the arrangement of the tree and node representations in memory and/or in a file. For the rest, additional information may be stored in each of the three cells, to form a pointer to the memory location of the next node, as required.
  • the id simply has to describe the path within the ternary tree.
  • Three possibilities exist at each level; however, only the central one is a progression, consuming the current character.
  • the three possibilities may be coded on 2 bits, e.g. 1 (binary 01) for left, 2 (binary 10) for central, 3 (binary 11) for right, with binary "00" being reserved.
  • ids of 28 bytes may be sufficient for the large majority of the situations encountered (only very long strings may raise problems). Since the depth in the tree may vary, the ids may be completed with void characters, e.g. "00", until the predefined length of 28 bytes is reached. Where 28 bytes would not be sufficient, an "escape mechanism" may be used in known fashion to expand the representation of an id over two words of 28 bytes, or more.
  • the ternary trie just stores the node cells (not the id(xx) shown in italics). Then:
  • when building the ternary tree, nodes may be added as necessary to match a new string being entered; when executing a query, a found/not found result is obtained;
  • the 2-bit path digits in the id are used one by one to follow a path in the tree; the central characters being met are concatenated one by one until the leaf node is met, and its central character, if any, is also concatenated.
  • the result is the string corresponding to the id.
  • the filling void characters used to give the id a fixed length are ignored.
  • As a search string, "book" may be submitted to the tree of Figure 10A. It follows the path 2,3,2,2. Thus, the id for "book" may be the integer string 2322. In fact, the id is preferably 23222, with the last 2 reflecting the fact that the final "k" is in the central cell of the leaf node. By so doing, the id will not change if e.g. the right cell of that leaf node is later "opened" to include e.g. the string "boolean" in the lexicon.
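  • The 2-bit path coding may be sketched as follows, with 1/2/3 for left/central/right, binary 00 reserved as the filling void value, and the most significant digits first. Packing into a single 64-bit integer (at most 32 steps) stands in for the fixed-length id of the text.

```cpp
#include <cstdint>
#include <vector>

// Encode a path such as {2,3,2,2,2} ("book") into a fixed-length id;
// unused trailing 2-bit slots stay 00 (the filling void characters).
uint64_t encodePath(const std::vector<unsigned>& steps) {
    uint64_t id = 0;
    int shift = 62;
    for (unsigned s : steps) {            // each s is 1, 2 or 3
        id |= static_cast<uint64_t>(s & 0x3) << shift;
        shift -= 2;
    }
    return id;
}

// Decode an id back into its path digits, ignoring the void filling.
std::vector<unsigned> decodePath(uint64_t id) {
    std::vector<unsigned> steps;
    for (int shift = 62; shift >= 0; shift -= 2) {
        unsigned s = static_cast<unsigned>((id >> shift) & 0x3);
        if (s == 0) break;                // filling void characters are ignored
        steps.push_back(s);
    }
    return steps;
}
```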
  • Posting relies on the preselection of certain node paths, as being of particular importance or significance, e.g. because they are more likely than other node paths to be involved in a majority of queries.
  • the selection may be made by the user (e.g. the author of an XML file, or a reader), or according to predefined conditions relating to the structure of the XML file, or both.
  • the storage of the posting information may conveniently be done by attaching a list of seqIds (or seqOffsets) to each LexId being met.
  • Figure 8 illustrates the posting of a given path, e.g. for "/bib/book".
  • each list Li is attached to a respective LexId in the lexicon.
  • a list Li may be formed of concatenated character strings, separated with a suitable delimiter, with the string between delimiters representing a seqId (more generally a pointer to a seqId). Since the seqIds or seqOffsets have a fixed length, a plain concatenated list with no delimiters may operate.
  • For each position in the sequence (i.e. each seqId) being seen during that scanning, operation 812 will take the LexId being there. Then, operation 812 determines if a list already exists for the current LexId_X (at seqId_X). If not, a new list is created in the set of lists for the LexId_X, at 813. Then, in either case (814), the list for the LexId is appended with the seqId_P (here 8) of the head element, as it was stored at operation 808. Operation 816 skips to the next LexId_X in the sequence.
  • This inner loop stops when operation 810 sees a typeId "4" at node level NL_0. Node level tracking is applied in this flow chart, in accordance with the process of Figure 5. This has not been shown to avoid an unduly complicated drawing. In fact, steps 506 - 512 of figure 5 are inserted between steps 816 and 818 of figure 8.
  • operation 820 marks the end of the scope of the current node occurrence, and control is given back to operation 804, searching for another occurrence of LexId_P at node level NL_0 in the rest of the sequence (outer loop). Node level tracking is applied during this search.
  • the "posting" associates several seqld to a given Lexld.
  • this may be conveniently stored as a list Li, in which the seqlds are concatenated (with suitable separators, as necessary).
  • the machine representation of the path may be the concatenation of the Lexlds without separator, since the Lexlds have a fixed length. This results in : ⁇ LexId(bib)LexId(boo , set SI ⁇
  • the plurality of set of lists may be stored as a posting file, having adequate indexes for obtaining a fast response. It is currently preferred to arrange the posting file as a balanced b-tree.
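  • A minimal sketch of this posting storage: each posted LexId carries its list Li of head seqIds, and an ordered map stands in for the balanced B-tree of the posting file. The type and function names are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using SeqId = uint32_t;
using LexId = uint32_t;

class PostingStore {
    std::map<LexId, std::vector<SeqId>> lists_;   // ordered, B-tree-like index
public:
    // Operations 812-814: create the list for a LexId if needed, then
    // append the seqId of the head element.
    void post(LexId lexId, SeqId headSeqId) {
        lists_[lexId].push_back(headSeqId);
    }
    // Retrieve the list Li for a given LexId, if it exists.
    const std::vector<SeqId>* listFor(LexId lexId) const {
        auto it = lists_.find(lexId);
        return it == lists_.end() ? nullptr : &it->second;
    }
};
```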
  • indexes may be created with reference to attribute names in the posting list, and also with implicit reference to the corresponding attribute values. An example of this will now be briefly described.
  • a "sorted index" associated with the path '/bib/book/@year' is constructed as follows:
  • a 'sorted index' may comprise a list of offsets (or seqIds) associated with a path and an attribute name.
  • the list of offsets reflects the presence of values for the attribute being considered, and is sorted in accordance with the attribute values (which are accessible from the offsets). This is not limited to XML attributes and may also be used for other XML types, e.g. CDATA.
  • step 420 of Figure 4 comprises: storing the current seqId as a seqId_P, when a posted node is met at its posted level NL_0,
  • seqId may be an integer whose size (number of bytes) is related to the size of the count of the tokens in the XML file.
  • scopeLen may be another integer whose size is related to the maximum length of a node scope in the whole XML file. This is again related to the size of the count of the tokens in the XML file, if we assume that there is a "root" node covering the whole file. However, the scope of this root node is known and need not be stored. Only the scopes of lower level nodes need be stored. Thus, the size of scopeLen is in fact significantly lower than the size of the file.
  • Such information may be easily stored as a cache, using adequate data structures.
  • The sequence store 34 and the lexicon 33 provide for a native and compact storage of the XML data contained in the source file: there is a biunivocal and direct correspondence between it and the original XML document.
  • the XML source document may be regenerated, totally or in part, from the native storage as described. "Totally” means here as far as the useful content is concerned.
  • the comments possibly contained in the XML source document are not transferred in the native storage, and thus cannot be regenerated, in the embodiment being described.
  • adding a "comment" type in the table of types E2.2 would permit to also include comments in the native storage, and accordingly, to regenerate them.
  • a suitable XML heading is entered in the file, using the beginning of the sequence file (refs 1-6 in table E3.2), if desired (no heading is required in many cases, e.g. when answering a query);
  • the source string corresponding to the stringId is found in the Lexicon 33; this source string may be noted Conv(stringId), or, in short, Conv().
  • a token heading may be defined as shown in table E4.1, to be inserted before the Conv(). For example, if typeId is 1, the token heading is "<", and the concatenated string "<" + Conv() is appended to the regen.xml file.
  • the token may have to be modified depending upon the type of the next token in the sequence. This is shown in Table E4.2. For example, if a node token (typeId 1) is not immediately followed by an attribute name (typeId 2), then a ">" is immediately added. This will work e.g. for "<author>" in the source file.
  • tables E4.1 and E4.2 might be used as look-up tables, for the current and next seqIds (and corresponding LexIds), when scanning the sequence.
  • Since the rules are predetermined, they are preferably hard coded, using e.g. IF-ELSE-ENDIF statements and/or "CASE" structures, for faster operation.
  • the XML source file may be entirely regenerated, by entirely scanning the sequence store 34, and progressively appending strings in the regen.xml file (or a precursor of it in RAM).
  • the corresponding Conv() is appended, preceded by a head depending upon the conditions shown in table E4.1, and followed by a tail depending upon the conditions shown in table E4.1 and/or upon the conditions shown in table E4.2.
  • Tables E4.1 and E4.2 may be viewed as a state diagram, which may be implemented in an XML state machine: the XML state machine may walk through the XML object sequence in accordance with the authorized transitions and actions of Tables E4.1 and E4.2.
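  • A hedged sketch of such a regeneration state machine follows, reduced to the rules quoted above (a "<" head for typeId 1, a ">" added when an element token is not immediately followed by an attribute name, and a "</...>" form for the end markup). Conv() is stubbed, and the complete rule set of tables E4.1/E4.2 is not reproduced here.

```cpp
#include <cstdint>
#include <string>
#include <vector>

inline uint32_t typeIdOf(uint32_t lexId) { return lexId >> 28; }

// Placeholder for the Lexicon lookup (LexId -> source string).
std::string Conv(uint32_t lexId) {
    return "tok" + std::to_string(lexId & 0x0fffffff);
}

std::string regenerate(const std::vector<uint32_t>& seq) {
    std::string out;   // the regen.xml content (or a precursor of it in RAM)
    for (std::size_t i = 0; i < seq.size(); ++i) {
        uint32_t t = typeIdOf(seq[i]);
        uint32_t next = (i + 1 < seq.size()) ? typeIdOf(seq[i + 1]) : 0;
        switch (t) {                       // hard-coded subset of table E4.1
        case 1:                            // element head
            out += "<" + Conv(seq[i]);
            if (next != 2) out += ">";     // table E4.2 condition
            break;
        case 4:                            // end markup
            out += "</" + Conv(seq[i]) + ">";
            break;
        default:                           // attributes, CDATA, ...
            out += Conv(seq[i]);           // their heads/tails omitted here
        }
    }
    return out;
}
```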
  • the segments of interest are generally delimited by the node mark-ups, e.g. from "<book" to "</book>". As noted, this is called an XML object or element.
  • the mechanism as proposed will mostly be used to compare XML objects of any size, i.e. having any desired level of embedded objects.
  • Two popular standards of queries for XML files are XPATH and XQUERY. These are commonly processed by a "query parser" whose role is to analyze the query, and to convert it into standard query expressions.
  • Where XML data are stored in a relational database, a query in XPATH or XQUERY form would be converted into an SQL expression, for application to the relational database.
  • the Query Processing comprises, for an input XML query 90:
  • the preprocessor will search LID1 in the sequence. This will output the seqIds (positions) 8 and 31, and node level 1.
  • posting lists and associated sorting indexes may be built to match the logic of the various possible constructs of requests existing e.g. in XPATH or other query languages.
  • LexId(book), LexId(title), etc., which the lexicon converts back to the original XML language: book, title, etc.
  • the Lexicon will be called to convert the LexIds having refs 16-24 in Table E3.1 back into their original strings.
  • Exhibit E5 taken in conjunction with figures 11, 12, 13, 14 and 15, contains a description of the design and implementation details of an exemplary embodiment of this invention.
  • the QUERY parser basically creates an internal representation of the query, using the Lexicon 33.
  • the parser may support the XPATH syntax, XQUERY, and other query formats.
  • An exemplary XQUERY may be as follows:
  • This query is similar to the above XPATH query, and may be processed the same way.
  • the XQUERY request language may be viewed as similar to XPATH, plus the possibility of processing additional statements like those in the so-called "FLWOR" group, i.e.: FOR, LET, WHERE, ORDER BY, RETURN.
  • Such statements may be processed using the sequence, the lexicon, and the posting lists with their sorting indexes.
  • the processing may use dedicated functions, which may be built into the system, user-defined, or even plugged in, as desired. It must be emphasized that the sequence, the lexicon, and the posting lists with their sorting indexes (or a portion of them) are a powerful tool enabling the system to be adapted to various extensions of existing query languages, and/or to new query languages, as appropriate.
  • a query normally has several statements or components involving various nodes. Together, they define conditions (a filter) in the set of information contained in the XML data. The query may also define the nature and format of the results, unless a default format for the results is accepted.
  • All that may be obtained from posting lists will be considered first.
  • the query is then ordered, i.e. rearranged so that the most restrictive condition is considered first. As illustrated above, this can be done by considering the most constrained one of the tokens in the posting lists (or using other index tables, if any).
  • the Query Resolution may use conventional techniques of result combinations in query resolution engines.
  • a query may be decomposed into a plurality of alternative sub-queries (combined together by a logical "OR"), while the query conditions in each sub-query are conjunctive.
  • the sub-queries may be ordered, with a view to begin with the most restrictive (“constrained”) ones of the conditions. Then, the conditions may be resolved one by one, so that the set of compact data to be finally examined in detail becomes smaller and smaller. This is done within each conjunctive sub-query. Then, the results are gathered together ("ORed”), and possible duplicates are eliminated.
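  • A minimal sketch of this most-restrictive-first ordering: the conditions of a conjunctive sub-query are sorted by the size of their posting lists before being resolved one by one, so the candidate set shrinks as fast as possible. The Condition type and the resolver callback are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct Condition { uint32_t lexId; std::size_t postingCount; };

// Per-condition resolution (e.g. intersecting posting lists); supplied by
// the caller in this sketch.
using Resolver = std::function<std::vector<uint32_t>(
    const Condition&, const std::vector<uint32_t>&)>;

std::vector<uint32_t> resolveConjunctive(std::vector<Condition> conds,
                                         std::vector<uint32_t> candidates,
                                         const Resolver& resolveOne) {
    // Most constrained token first: smallest posting list first.
    std::sort(conds.begin(), conds.end(),
              [](const Condition& a, const Condition& b) {
                  return a.postingCount < b.postingCount;
              });
    for (const auto& c : conds)       // each pass narrows the candidate set
        candidates = resolveOne(c, candidates);
    return candidates;
}
```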
  • the above mechanisms may apply to all metalanguages having tags or markups that convey some particular meaning or syntax information. They have particular interest with metalanguages whose rules result in a metalanguage file having one or more associated tree structures. Examples are XML, SGML and their derived languages. However, this description is not limited to these languages. For example, the query results may be delivered in a different language, e.g. HTML, as well.
  • Meta-languages like XML may be viewed in different ways. Initially, they have been presented as texts or documents, which are supplemented with add-ons (tagged mark-ups), for being easily dealt with by a computer in a standard fashion. However, since it is readily understandable by a computer, a metalanguage may be presented as a programming language as well, at least partially. This does not preclude the application of the teachings described in this specification.
  • an inverted index is a data structure (and file structure) that is used to support efficient querying on a large data set.
  • the data set itself is assumed to contain a large number of documents each containing some number of tokens.
  • the query operation returns the set of documents that contain a given token.
  • tokens are identified by a unique LexID and documents by a unique value called the PostingsID.
  • the PostingsID is a 32 bit quantity and can be generically interpreted as a location.
  • the PostingsID can represent a document ID, an element ID (in the XML case) or a file offset, but the meaning of the PostingsID is opaque to the inverted index implementation; it only assumes that a given PostingsID "contains" some number of LexIDs.
  • the inverted index was designed to index XML data and it supports that by using 4 bits of the LexID as token type.
  • token types could be element start, element end, CDATA, etc.
  • the remaining 28 bits allow indexing 268,435,455 unique tokens.
  • For non-XML data or non-typed data, the type can be set to 0.
  • Fast index build time: hard to quantify, but something linear based on data set size.
  • Fast access time: querying the postings data for a given token requires one or no disk accesses.
  • the LexID is a 32 bit quantity that is represented as a 4 bit token type field and a 28 bit index field.
  • the LexID is assigned to a token by the Lexicon when the token is inserted into the Lexicon and the index value itself is unique across all LexIDs.
  • the type field is normally used to represent XML types and isn't really used by the Inverted Index, which uses only the index field to access LexIDs.
  • the Lexicon assigns the index values of a LexID sequentially, starting with 0, as tokens are inserted into the Lexicon; except in cases where a token has been removed from the Lexicon/Inverted Index, it assumes that they are contiguous numbers, with the highest index value belonging to the last token inserted.
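  • The stated LexID layout (a 4-bit token type field and a 28-bit index field) can be sketched with simple masks; the helper names are illustrative.

```cpp
#include <cstdint>

constexpr uint32_t kIndexBits = 28;
constexpr uint32_t kIndexMask = (1u << kIndexBits) - 1;   // 0x0fffffff

// Pack a 4-bit token type and a 28-bit index into one 32-bit LexID.
inline uint32_t makeLexID(uint32_t type, uint32_t index) {
    return (type << kIndexBits) | (index & kIndexMask);
}
inline uint32_t lexIDType(uint32_t lexID)  { return lexID >> kIndexBits; }
inline uint32_t lexIDIndex(uint32_t lexID) { return lexID & kIndexMask; }
```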
  • the Inverted Index Array is logically a memory resident linear array that contains references to postings data for each LexID either on disk, in memory, or both.
  • the array is represented by an ExpandableArray object which grows dynamically without reallocating or copying memory.
  • One of the rationales for using a B-Tree index is that it is relatively easy to swap the external nodes (the ones that actually contain data) to and from disk, and that only the internal nodes have to be memory resident. Performing queries on an existing B-Tree index then requires at most one disk access. However, that query only returns the disk location of the postings data, and it may require another disk access to retrieve the actual postings data. The difficulty is in building the B-Tree index. Experience shows that building a large B-Tree index in a relatively small memory footprint requires a lot of disk swapping.
  • Each element of the Index Array is 16 bytes long and is represented by the InvertedIndexNode object described below. So, an Inverted Index containing one million unique tokens requires 16 MB of memory (or disk space when saved to disk). The index is always memory resident in its entirety while running, and is saved to disk if modified, in response to a save request or when the InvertedIndex object is closed.
  • An ExpandableArray object is used to represent the InvertedIndex Array, and it logically appears to be a linear array. It starts with no memory allocated and, as items are inserted, it "grows" in 64KB increments. It doesn't reallocate and copy memory as it grows, but instead allocates a new buffer which is added to a list.
  • the API maps an index value to the appropriate buffer location.
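  • A sketch of the ExpandableArray idea under the stated constraints: growth in 64KB increments by adding buffers to a list, no reallocation or copying, and an API mapping an index value to the appropriate buffer location. The 16-byte element size matches the InvertedIndexNode; everything else is an assumption.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

class ExpandableArray {
    static constexpr std::size_t kBufBytes = 64 * 1024;   // 64KB increments
    std::size_t elemSize_;
    std::size_t perBuf_;
    std::vector<char*> bufs_;   // list of fixed-size buffers, never moved
public:
    explicit ExpandableArray(std::size_t elemSize = 16)
        : elemSize_(elemSize), perBuf_(kBufBytes / elemSize) {}
    ~ExpandableArray() { for (char* b : bufs_) delete[] b; }

    // Map an index value to its buffer location, growing as needed.
    void* at(std::size_t index) {
        std::size_t buf = index / perBuf_;
        while (buf >= bufs_.size()) {        // "grow" by one 64KB buffer
            char* b = new char[kBufBytes];
            std::memset(b, 0, kBufBytes);
            bufs_.push_back(b);
        }
        return bufs_[buf] + (index % perBuf_) * elemSize_;
    }
};
```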
  • The InvertedIndexNode structure is used both when the Inverted Index is memory resident and when it has been saved to disk.
  • the fields can have somewhat different meanings when accessing an existing index as opposed to building a new one. It is also slightly different if the postings data is stored in the postings file or if the postings are resident in the index itself.
  • the postingsCnt is a 31 bit quantity specifying the number of PostingsIDs that are stored on disk for this node.
  • the high order bit, the location bit, is cleared to indicate that the postings data is stored on the disk rather than in the index itself.
  • RecCnt is the number of records allocated for the postings data and RecAddr is the record address of the postings data on disk.
  • PriorityQueueNode is a pointer to a PriorityQueueNode struct and is the location of the postings if they are also resident in memory. When this InvertedIndexNode is written to disk, PriorityQueueNode is set to NULL.
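  • One possible 16-byte packing of InvertedIndexNode consistent with the fields described above; the exact field layout is an assumption (the original is a 32-bit Win32-era design, so the node reference fits in 4 bytes).

```cpp
#include <cstdint>

struct InvertedIndexNode {
    uint32_t locationBit : 1;    // cleared: postings data stored on disk
    uint32_t postingsCnt : 31;   // number of PostingsIDs for this node
    uint32_t recCnt;             // records allocated for the postings data
    uint32_t recAddr;            // record address of the postings on disk
    uint32_t pqNode;             // 32-bit reference to a PriorityQueueNode,
                                 // NULL (0) once the node is written to disk
};
static_assert(sizeof(InvertedIndexNode) == 16, "16-byte index element");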
  • High frequency LexIDs are tokens that occur in a large number of locations.
  • high frequency LexIDs are just "noise" and do not provide any useful information when searching data.
  • the six highest frequency tokens can be expected to account for 20% of all token occurrences. In English text these are usually the words: the, of, to, and, in, is.
  • Postings data is written to the Postings file using large, sequential, buffered writes when creating the Postings file and accessed using random access reads when querying postings data.
  • Space is allocated in the postings file in records of 16 bytes (4 PostingsIDs); the primary reason for this is to allow file sizes larger than the 2GB normally supported by Windows. The goal was to be able to support 64 bit disk addresses, and thus larger file sizes, without having the requirement to store 64 bit addresses in memory resident structures. So, when using a 16 byte record as the allocation unit, the 32 bit RecAddr value stored in the InvertedIndexNode structure actually allows accessing a file up to 64GB in size.
  • PostingsIDs within the Postings file and in the Inverted Index are stored contiguously and are sorted in ascending order for each LexID.
  • the sample Inverted Index and Postings shown in Figure 11 show postings stored both in the Inverted Index and in the Postings file.
  • Index 1 has three postings stored in the index itself. Note that the high bit of PostingsCnt (the location bit) is set, consistent with the postings residing in the index rather than on disk.
  • Index 3 is a case of a high frequency LexID; it has one posting stored in the index, and that posting has the value 0xfffffff, the NullPostingsID.
  • When postings data is loaded from disk to perform queries or updates, the memory used is allocated from a Priority Heap object.
  • the goal of the Priority Heap is to maintain only the most recently used postings data in a finite, relatively small amount of memory.
  • When it is necessary to allocate memory from a "full" heap, enough of the least recently used memory allocations are freed so that there is enough space available to make the new allocation.
  • When freeing memory from the heap, postings data that has been modified is written back to disk.
  • the current implementation allocates memory from the system heap and tracks the amount of memory allocated.
  • a limit or maximum memory value is passed as a parameter to the Postings object when it is created and once the limit is reached memory is deallocated as needed.
  • Previous implementations used a "private heap", which in Win32 is created using a HeapCreate() call. In Win32 there is a limit to the largest memory buffer that can be allocated from a "private heap", and that limit is slightly less than 512KB. That would mean a maximum of approx. 128,000 postings entries for a given LexID; while it isn't likely that this limit would be hit, it is conceivable in some situations, so the implementation was changed to use the system heap.
  • the Priority Heap contains a Priority Queue which is a linked list of Priority Queue Nodes.
  • the Priority Queue Node is used as a header to the allocated memory buffer and contains pointers to the other nodes in the Priority Queue. Each time a node is accessed, either for queries or updates, it is moved to the front of the Priority Queue, so that at any given time the most recently used nodes are at the front of the list.
  • the Priority Queue itself consists of pointers to the first and last nodes in the queue and an API for moving nodes within the queue and removing nodes from the queue.
  • Each PriorityQueueNode contains previous and next pointers. It also contains the index number of the InvertedlndexNode that it belongs to and a size field which specifies the number of bytes allocated for postings data.
  • The dirtyFlag, if set, indicates that the postings data has been modified, either by updates or deletions, since it was loaded into memory. If it is set, then the postings data needs to be written back to disk and the InvertedIndexNode updated when the node is deallocated. Otherwise the node can simply be deallocated without any other action.
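  • A minimal sketch of the Priority Queue Node header and the move-to-front policy described above: each access moves a node to the front, so the least recently used nodes accumulate at the back, where deallocation (with a write-back when the dirty flag is set) takes place. The struct layout and list handling are illustrative.

```cpp
#include <cstdint>

struct PriorityQueueNode {
    PriorityQueueNode* prev = nullptr;
    PriorityQueueNode* next = nullptr;
    uint32_t indexNumber = 0;   // index of the owning InvertedIndexNode
    uint32_t size = 0;          // bytes allocated for postings data
    bool dirtyFlag = false;     // postings modified since loaded from disk
    // ... the postings data follows this header in the allocation
};

struct PriorityQueue {
    PriorityQueueNode* first = nullptr;   // most recently used
    PriorityQueueNode* last = nullptr;    // least recently used

    // Move an accessed node to the front of the queue.
    void moveToFront(PriorityQueueNode* n) {
        if (n == first) return;
        if (n->prev) n->prev->next = n->next;   // unlink
        if (n->next) n->next->prev = n->prev;
        if (n == last) last = n->prev;
        n->prev = nullptr;                      // relink at front
        n->next = first;
        if (first) first->prev = n;
        first = n;
        if (!last) last = n;
    }
};
```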
  • FIG. 12 shows elements 0, 3, and 5 of the inverted index with allocated Priority Queue Nodes.
  • the next node to be deallocated (if necessary) is the one associated with element 5, the last one in the queue. If the dirty flag is set for that node, then a Store request is made to the InvertedIndexNode for element 5, and it will write the updated postings to disk before the memory for the Priority Queue Node is freed.
  • the Postings API supports three primary types of operations: index creation, index updates, and index queries.
  • Index Creation refers to the process of building a new index for a data set and the functions to create it are invoked by the parser while the data set is being processed.
  • Earlier versions of the API took as input a sequence file which represented the entire data set as tokenized XML data.
  • the format of this sequence file was proprietary, and using it forced some interdependencies between the Inverted Index and the parsing process. To remove dependence on the file format and to achieve independence from the parsing process, the API has been modified. However, it might be desirable, for performance reasons, to provide alternate APIs that accept some type of sequence file as input.
  • the dwMaxMemory parameter specifies the amount of memory buffer space to use while processing postings data. This is space used in addition to the Inverted Index Array and is deallocated once the postings file is complete.
  • the saveThreshold parameter is used to remove high frequency postings from the index. If a LexID occurs in over a specified percentage of PostingsIDs (the default is 80%) then the postings for that LexID are not included in the postings file.
  • the products of the BuildIndexAddPosting() phase are PostingsID frequency counts for each LexID and a temporary, intermediate sequence file.
  • the frequency counts are stored in a LexID's InvertedIndexNode struct and reflect the number of PostingsIDs that are associated with that LexID, or in other words the number of PostingsIDs that contain a given LexID.
  • the frequency counts are used to determine memory allocation requirements when later processing the postings data.
  • the intermediate sequence file contains a series of PostingsIDs each followed by the set of LexIDs "contained" by that PostingsID. Duplicate LexIDs for a given PostingsID are not saved so the file only contains the set of unique LexIDs associated with a PostingsID.
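A hedged C++ sketch of this collapsing step follows (the container choices and the name collapseAndCount are illustrative; in the real code the counters live in the InvertedIndexNode structs):

    #include <cstdint>
    #include <map>
    #include <set>
    #include <vector>

    // For one PostingsID, keep only the set of unique LexIDs and bump each
    // LexID's PostingsID frequency exactly once.
    std::set<uint32_t> collapseAndCount(const std::vector<uint32_t>& lexIds,
                                        std::map<uint32_t, uint32_t>& freq) {
        std::set<uint32_t> unique(lexIds.begin(), lexIds.end());
        for (uint32_t id : unique)
            ++freq[id];   // one increment per PostingsID, not per occurrence
        return unique;    // written after the PostingsID in the intermediate
                          // sequence file
    }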
  • Figure 13 shows a sample BuildIndexAddPosting() phase, the generated frequency counts, and the generated intermediate sequence file. What is important to note is how duplicate LexIDs for a given PostingsID are "collapsed", both in the frequency counts and in the intermediate sequence file. This is potentially a significant amount of data. A test case containing 115 million tokens (508,210 of them unique) having a Zipf-like distribution and spread across 115,000 PostingsIDs had 54 million duplicate LexIDs which would have been removed in this step.
E5.4.1.2.2 BuildIndexEnd Phase
  • the first step is to create a number of segmented sequence files where each segmented file contains all of the postings data for a range of LexIDs that can be processed completely using the memory buffer which was specified in the dwMaxMemory parameter.
  • the process is to iterate through the InvertedIndexArray, adding frequency values to determine sequential series of LexIDs that can be processed with the buffer.
  • the result is a temporary data structure containing a list of segments each designated by a starting and ending LexID.
  • the first segment isn't actually written to a segmented file; instead, the data for the first segment is processed and written to the postings file while the other segmented files are being written.
  • the contents of the segmented files will be similar to the contents of the Intermediate Sequence File shown in Figure 13 but each one will contain only the PostingsIDs and LexIDs that belong to the segment.
  • the intermediate sequence file is then deleted.
  • Each segmented file is then read, singly, and the postings contained in the file are written to the postings file.
  • First, space is allocated from the postings buffer for the segment; then the PostingsIDs are moved into a LexIDs buffer.
  • the postings data is written to the postings file once the complete segment has been processed, and the segmented file is deleted.
  • the first step is to write the intermediate sequence file (n blocks), then read the intermediate sequence file (n blocks) while writing n-1 segmented files (each one block in size), and then to read n-1 segmented files.
  • the segmented file for the first segment is never actually created; the data for it is processed while reading the intermediate sequence file. So the total amount of disk I/O is (4n)-2 blocks.
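The segmentation itself can be sketched as follows; planSegments and bytesPerPosting are assumed names, and the real buffer accounting may differ:

    #include <cstdint>
    #include <vector>

    struct Segment { uint32_t firstLexId, lastLexId; };

    // Split the LexID range into sequential segments whose postings fit in
    // the dwMaxMemory buffer. freq[lexId] is the PostingsID frequency count
    // produced by the BuildIndexAddPosting() phase.
    std::vector<Segment> planSegments(const std::vector<uint32_t>& freq,
                                      uint64_t dwMaxMemory,
                                      uint64_t bytesPerPosting) {
        std::vector<Segment> segments;
        uint32_t start = 0;
        uint64_t used = 0;
        for (uint32_t lexId = 0; lexId < freq.size(); ++lexId) {
            uint64_t need = uint64_t(freq[lexId]) * bytesPerPosting;
            if (used + need > dwMaxMemory && used > 0) {
                segments.push_back({start, lexId - 1});  // close current segment
                start = lexId;
                used = 0;
            }
            used += need;
        }
        if (!freq.empty())
            segments.push_back({start, uint32_t(freq.size()) - 1});
        return segments;
    }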
  • Index Updates are operations that allow adding postings to or deleting postings from an existing inverted index.
  • the AddPosting() call can also be used to create a new index, but is considerably slower than using the Index Creation routines: BuildIndexStart(), BuildIndexAddPosting(), and BuildIndexEnd().
  • the AddPosting() call first checks the InvertedIndexNode for the given LexID. If it is a low frequency node and there is space in the InvertedIndexNode, then the PostingsID is simply inserted (in order) into the InvertedIndexNode.
  • otherwise, a PriorityQueueNode structure is allocated and the PriorityQueueNode is placed at the head of the PriorityQueue.
  • the space allocated is large enough to accommodate the new postingsID plus the existing postings.
  • the new PostingsID is inserted into the PriorityQueueNode. Since the PostingsIDs are stored contiguously in order, this may require a memory move to place it in the correct location.
  • if the existing postings data is currently memory resident, then its PriorityQueueNode is moved to the head of the PriorityQueue and the PostingsID is inserted as above. If the PriorityQueueNode is full, then the existing one is reallocated, which requires a memory allocation and a copy. Currently the size is increased in increments of 4 PostingsIDs. Previously, we allocated using a power of two (doubling each time), but at that time AddPosting() was being used to build the index. Since we are adding to an existing index, it probably isn't appropriate to increase the size in such large chunks.
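The in-memory part of that insertion can be sketched as below; a std::vector stands in for the PriorityQueueNode buffer, which the real code reallocates by hand:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Keep the PostingsIDs of one LexID contiguous and sorted, growing
    // capacity in increments of 4 entries rather than doubling.
    void insertPosting(std::vector<uint32_t>& postings, uint32_t postingsId) {
        if (postings.size() == postings.capacity())
            postings.reserve(postings.capacity() + 4);   // grow by 4, not 2x
        auto pos = std::lower_bound(postings.begin(), postings.end(), postingsId);
        if (pos != postings.end() && *pos == postingsId)
            return;                        // already present for this LexID
        postings.insert(pos, postingsId);  // may shift the tail (memory move)
    }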
  • GetPostingsCount() simply returns the number of PostingsIDs currently associated with a given LexID. This value is returned from the Inverted Index Array, so it doesn't require a disk access.
  • the function GetPostings() returns a pointer to a list of the PostingsIDs associated with a given LexID.
  • the PostingsIDs are returned in ascending order and the variable referenced by the pCount pointer will contain the number of postings in the list.
  • FreePostings() should be called to free the memory allocated by the call to GetPostings().
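A typical query-side call sequence might look as follows; the signatures are assumptions inferred from the description above, not the actual declarations:

    #include <cstdint>
    #include <cstdio>

    // Assumed declarations (the real API is not reproduced in this document).
    uint32_t  GetPostingsCount(uint32_t lexId);
    uint32_t* GetPostings(uint32_t lexId, uint32_t* pCount);
    void      FreePostings(uint32_t* postings);

    void dumpPostings(uint32_t lexId) {
        // The count comes from the Inverted Index Array: no disk access.
        printf("LexID %u occurs in %u postings\n",
               lexId, GetPostingsCount(lexId));
        uint32_t count = 0;
        uint32_t* ids = GetPostings(lexId, &count);  // ascending PostingsIDs
        for (uint32_t i = 0; i < count; ++i)
            printf("  PostingsID %u\n", ids[i]);
        FreePostings(ids);  // release the buffer allocated by GetPostings()
    }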
  • Heap's Law can be used to predict the vocabulary size of a text collection. It defines vocabulary growth (the number of unique tokens) as a function of the overall text size (the total tokens), and is defined as:
  • V = K·n^β
  • V is the size of the vocabulary, or number of unique tokens
  • n is the size of the text or data set in tokens
  • K and β are empirically derived for a given data set. Typical values for K are 10 - 100 and for β are 0.4 - 0.6.
  • Table E5.2 shows calculated V values for some sub-collections of the Text Retrieval Conference (TREC) database. These are sets of English language text collections that have had the values of K and β calculated.
  • the row with n = 1,000,000,000 is equivalent to a dataset containing 10GB of XML data and shows an expected range of unique tokens from 232,016 to 1,178,418.
  • the current implementation can handle a 64 GB file size, so the file size itself shouldn't be a problem. It is also configurable (at compile time) to support larger files if necessary. If the dwMaxMemory buffer (see BuildIndexEnd Phase above) is set to 64MB, and given a 25 GB intermediate sequence file size, then we would need 400 segmented sequence files open simultaneously, which is within operating system (Linux and Win32) limits. There is also some memory overhead: if we assume a 64KB buffer per file, then there is an additional 25 MB of file buffer space required.

Abstract

A metalanguage processor, comprising a parser capable of decomposing a metalanguage file into source file segments, in accordance with the syntax of the metalanguage, a lexicon, comprising a representation of a set of strings, said representation being searchable, using a search string, to get a unique identifier for that search string, and a converter, capable of deriving a search string from a source file segment, and of interacting with the lexicon for determining an identifier for that search string, said converter building a representation of the metalanguage file as a stored sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for the search string derived from the corresponding source file segment.

Description

Improved processor for a markup based metalanguage, such as XML
Computer technology now uses metalanguages, examples of which are XML (eXtensible Markup Language), SGML, and their precursors or extensions.
The reader will be assumed to be familiar with the eXtensible Markup Language (XML) specification, available from W3C at http://www.w3.org/TR, and/or from the booklet "XML Pocket Reference", Robert ECKSTEIN, O'REILLY, U.S.A., Second edition, April
2001.
Metalanguages like XML are very powerful, and have a constantly increasing development. Thus, it is now contemplated to create XML-based databases.
However, the counterpart of the power of such metalanguages is an inherent slowness in processing. Despite various attempts and proposals concerning databases, no adequate solution has been found other than:
- either constructing the database from portions of an XML document, and coping with the corresponding slowness in the database functions, which have to "understand" the XML document;
- or constructing the database from XML language converted into the conventional data representations of a relational database. This keeps the usual database processing velocity for whatever is done in the database itself. However, the slowness is again present whenever the database representation has to be converted into XML, and/or conversely.
The present invention provides advances towards better solutions.
In accordance with a first aspect of this invention, there is proposed a metalanguage processor, comprising :
- a parser capable of decomposing a metalanguage file into source file segments, in accordance with the syntax of the metalanguage, - a lexicon, comprising a representation of a set of strings, said representation being searchable, using a search string, to get a unique identifier for that search string, and
- a converter (or "tokenizer"), capable of deriving a search string from a source file segment, and of interacting with the lexicon for determining an identifier for that search string,
- said converter building a representation of the metalanguage file as a stored sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for the search string derived from the corresponding source file segment.
It will be appreciated that metalanguage parsers already exist, e.g. for XML. Thus, the invention also encompasses the converter or tokenizer for use with such a parser, while constructing the lexicon and sequence in memory and/or in a disk or other mass storage device.
In accordance with a second aspect of this invention, there is proposed a method of processing a metalanguage file, the method comprising : a. parsing the metalanguage file, for identifying therein successive source file segments in accordance with the syntax of the metalanguage, b. maintaining a lexicon, forming a directly searchable representation of strings, in correspondence with a unique identifier for each string, c. converting a search string derived from a source file segment into a corresponding identifier, using the lexicon, and d. progressively building a sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for a search string derived from the corresponding source file segment.
In accordance with another aspect of this invention, there is proposed a database for storing data representing a metalanguage file, said database comprising:
- a lexicon, comprising a representation of a set of strings, said representation being directly searchable, using an identifier, to obtain a unique corresponding string, the set of strings in the lexicon covering substantially all the meaningful string data in the metalanguage file, - a sequence of data blocks individually based on identifiers being searchable in the lexicon, the order of the data blocks in said sequence being related to the order of corresponding strings in the metalanguage file.
Still other features and combinations thereof may be found hereinafter, as well as in the appended claims.
Other alternative features and advantages of the invention will appear in the detailed description below and in the appended drawings, in which :
- Figure 1 is a block diagram of a computer station in which this invention may be performed ;
- Figure 2 is a block diagram of an embodiment of an XML source document processor; - Figure 3 is a tree representation of an exemplary XML source document (shown in detail in Exhibit E2) ;
- Figures 4 and 4A are exemplary flow charts of the conversion of a text XML source document into compact storage data representing that document ;
- Figure 5 is an exemplary flow chart of the determination of node levels from the compact storage data ;
- Figure 6 is an exemplary flow chart of the determination of the "scope" of a node from the compact storage data ;
- Figure 7 is an exemplary flow chart of the determination of the scopes for a plurality of nodes, from the compact storage data ; - Figure 8 is an exemplary flow chart of an optional operation of posting which may be performed on the compact storage data ;
- Figure 9 is a block diagram of an embodiment of a query processor;
- Figures 10 and 10A show simplified data structures forming a so-called "ternary trie";
- Figure 11 is a schematic diagram illustrating an example of inverted index and posting file; - Figure 12 is a diagram illustrating priority queue and priority queue nodes;
- Figures 13, 14 and 15 are diagrams illustrating operation of functions of an API included in an embodiment of query processor. A portion of the disclosure of this patent document contains material which may be subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright and/or author's rights whatsoever.
Additionally, the detailed description is supplemented with the following Exhibits:
- Exhibit E1 (in short E1) contains a basic XML expression, and developments thereon ;
- Exhibit E2 contains an exemplary XML document ; - Exhibit E3 contains tables illustrating certain aspects of the scanning of the exemplary
XML document, for converting it into compact storage data ;
- Exhibit E4 contains tables illustrating conditions for use in regenerating an XML document.
- Exhibit E5 describes design and implementation details of an exemplary embodiment.
In the foregoing description, references to the Exhibits may be made directly by the Exhibit or Exhibit section identifier: for example, E1.1 directly designates section 1.1 in Exhibit E1 (the prefix designating the Exhibit may be omitted if there is no ambiguity). The Exhibits are placed apart for the purpose of clarifying the detailed description, and of enabling easier reference. They nevertheless form an integral part of the description of the present invention.
This applies to the drawings as well.
Now, making reference to software entities imposes certain conventions in notation. For example, in the detailed description, Italics (or the quote sign ") may be used when deemed necessary for clarity. However, in code examples:
- quote symbols (' or ", depending upon the context) are used when required in accordance with the rules of writing code, i.e. for string values. They do not form part of the string.
- when used on strings, the "+" operator designates the concatenation of the strings.
- an expression framed with square brackets, e.g. [property=value]*, is optional and may be repeated if followed by *. The Exhibits comprise tables. The tables may have a leftmost column headed Ref. This column is intended only to receive references for facilitating discussion in the written specification.
This invention may be implemented in a computer system, or in a network comprising computer systems. Figure 1 represents an example of the hardware of such computer systems. The hardware comprises :
- a processor or CPU 11, e.g. an Intel or AMD model;
- a program memory 12, e.g. an EPROM, a RAM, or Flash memory; - a working memory 13, e.g. a RAM of any suitable technology;
- a mass memory 14, e.g. one or more hard disks;
- a display 15, e.g. a monitor;
- a user input device 15, e.g. a keyboard and/or a mouse;
- optionally, a network interface device 21 connected to a communication medium 20, which is in communication with other computers. Network interface device 21 may be of the type of Ethernet, or of the type of ATM. Medium 20 may be based on wire cables, fiber optics, or radio-communications, for example.
Data may be exchanged between the components of figure 1 through a bus system 10, represented as a single bus for simplification of the drawing. Bus systems may include a processor bus, e.g. PCI, connected via appropriate bridges to, e.g. an ISA or a SCSI bus.
For performing this invention, the system of Figure 1 may be provided with a programming environment having the capability to process objects, e.g. C++, C# or Java, and also to support a parser of the metalanguage being used, e.g. XML.
Although this invention may be applicable to various metalanguages, the foregoing discussion will focus on the application to XML. This, of course, is not limiting.
In Exhibit El , the first line of section El .1 shows a typical simple XML node expression Ni
(the concept of node will be described hereinafter). Note that the spaces in expression Ni are for convenience in representation only, and do not exist in an XML file. The structure of the expression Ni is as follows : - '<price ...>' is the head markup, also termed "element tag";
- '</price>' is the end markup, or endtag ;
- the markup tag is the string 'price'.
- within the head markup, i.e. between its "<" and ">", one may insert one or more statements having the form attributeName="attributeValue". In expression Ni, the head mark-up comprises the attribute name 'currency', with its attribute value being here the string "euro".
Between the head and end markups, one may insert :
- any sequence of string data ("raw data"), associated to the markup tag ("price"). Here, the string data is the value 22.5. It is related to the markup tag 'price', and/or
- another node expression having the same format as Ni.
Generally, XML documents are based on character strings. Their content is delimited by mark-ups, recognizable from a <string ... > ... </string> construction, where:
- string represents the markup tag (name),
- <string ... > is the beginning of a marked up section, associated to the tag string, - as noted, <string (or <string>, if there is no attribute) is the "element",
- </string ... > is the end of that marked up section.
The contents of XML documents are thus delimited by :
- reserved XML symbols, like "<", "</", ">", - software language symbols, like "=" (equals), or the string delimiters (" or '), and
- the usual white spaces and punctuation symbols used in regular text documents.
Based on these delimiters, the XML document may be cut up into "words". Such a word is called a "token string" in this specification.
Certain characters may be considered as not significant, generally because they are here for syntax purposes and/or for user's comfort only. An example of this is the "=" sign between 'currency' and "euro" : it is not a significant token, since equality is implicitly expected here. In the languages of interest, e.g. XML, the rules governing the imbrication of the mark-ups are rather strict. A "begin" mark-up and the corresponding "end" mark-up define an "object" in the whole document. The rule is as follows : such "objects" must be either strictly embedded within each other, or juxtaposed.
This results in the following consequence, known to those skilled in the art : a given XML document may be biunivocally associated with one or more tree structure topologies. In fact, an XML document normally has a main markup section (a container looking like "<begindoc> {substance of document} </enddoc>"). Such a document may thus be associated with a single tree structure.
Since a tree structure has nodes, it is often convenient to consider the XML document as a tree structure topology, whose "nodes" are the tagged objects. The above discussed expression Ni (Exhibit E1) is an example of an XML object or node. The tag is the object name, contained in the "element", i.e. at the beginning of the head markup ("price", in the example). Associated to the object or node are (if any) :
- the attribute expression(s) contained within the head markup,
- the raw data contained within the marked up section, i.e. between the head markup and the end markup - a subobject or subnode (child).
The applicant company has observed that the tokens may belong to a restricted number of types. Table E1.2 in Exhibit E1 shows an exemplary table of XML types. This table is a significant ingredient of this invention, in that the types contained therein make it possible to deal with the commonly available XML documents. The table shows the type names, starting with "TT_" (for token type). It also shows a corresponding unique code (typeId), here an integer (valid both in decimal and hexadecimal, since the maximum is 9). The last line shows "TT_CDATAPWS", in which PWS stands for punctuation and white space.
Ten (10) types are currently defined in table E1.2. This is based on the common XML standard: apart from the types being reserved for the XML specific constructions, all raw data are in character string form ("CDATA"). Assuming further types are introduced in the future, then a higher number of types may be defined, e.g. if it would be desired to differentiate the raw data between usual software data types, like "string", "integer", "boolean", etc. Further types might also be used to handle the comment structures of XML, having the format "<!-- ... -->". However, such comment structures are ignored in the exemplary embodiment described hereinafter. Additionally, it may be interesting to add one or more types covering a string followed by a space; this is very helpful, both in terms of memory occupation and search efficiency.
The second line of section E1.1 in Exhibit E1 shows the type name (in accordance with table E1.2) for each significant token in expression Ni. The type may be directly derived from the syntax rules of XML.
In accordance with a first aspect of this invention, the XML expressions are thus decomposed into their significant tokens. Each token is associated with a pair of data (X, Y), where X is the token type, and Y is the token string. Preferably, Y is the "cleaned" token string, i.e. the portion of the token string obtained after removing the XML symbols, and other non-meaningful symbols, as will be explained. For example, the cleaned string for "</price>" is "price".
Coding a token
The token type X is converted into a token type code g(X), named typeId. This function g(X) may be simply viewed as an inspection (lookup) of table E1.2, based on the type as determined by an XML parser, in accordance with the XML syntax rules, some of which are also recalled in table E1.2. Note certain XML rules are not reflected in Table E1.2, e.g. the following optional notation for an empty node: <emptynode/> instead of <emptynode></emptynode>. Such a case may be dealt with by adding appropriate additional conditions in the type determination.
The cleaned token string Y is converted into a token string code h(Y), named stringId. This function h(Y) is "unique". "Unique" means here that the stringId shall correspond to any occurrence of the string Y, however to no different string, within the context of interest. The function h(Y) may be viewed as follows :
- any newly encountered cleaned string is entered in a Lexicon, where it receives a new unique identifier stringId. Thus, h(Y) is the stringId delivered by the lexicon for the new token string Y. - if string Y is already present in the Lexicon, then h(Y) is the existing corresponding stringId.
Since the lexicon may be viewed as a table, one may use the known techniques to generate unique identifiers in a table. The uniqueness may be simply obtained by checking that the calculated identifier for a new Y is different from all existing stringIds in the lexicon.
Otherwise, a different identifier is tried, until its uniqueness is ensured. Another interesting embodiment will be described hereinafter.
In the examples, it is assumed that the new stringIds are delivered sequentially by the lexicon. This is for explanation, since more sophisticated techniques may be used, as discussed hereinafter.
Now, a unique identifier LexId is calculated for the pair of data X and Y. This may be expressed by the formula : LexId(X, Y) = f(g(X), h(Y)) = f(typeId, stringId)
"Unique" again means here that the LexId identifier shall correspond to any occurrence of the pair of data (X, Y), however to no different pair of data, within the context of interest. Pairs of data are "different" if they differ by the token type X, or by the token string Y, or both.
Function f() combines g(X) and h(Y). Any function making such a combination while preserving the unique character of the result may be used. A mere concatenation of the strings g(X) and h(Y) is a simple and very convenient way to implement function f().
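As a minimal sketch (assuming an in-memory map handing out sequential stringIds; the patent's preferred lexicon structure, the ternary trie, is described further below), f() by concatenation may look like:

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    class Lexicon {
        std::unordered_map<std::string, uint32_t> ids_;
    public:
        uint32_t stringId(const std::string& cleaned) {   // h(Y)
            auto it = ids_.find(cleaned);
            if (it != ids_.end()) return it->second;      // existing string
            uint32_t id = uint32_t(ids_.size()) + 1;      // new sequential id
            ids_.emplace(cleaned, id);
            return id;
        }
    };

    // f(): concatenate the one-digit typeId with a fixed-width stringId,
    // e.g. lexId(1, 7) yields "10000007".
    std::string lexId(int typeId, uint32_t stringId) {
        char buf[16];
        snprintf(buf, sizeof buf, "%d%07u", typeId, stringId);
        return buf;
    }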
Reverting to the expression or node Ni of Exhibit E1, the calculation of LexIds Ui for each token is illustrated at section E1.3. Section E1.4 shows a linear sequence of LexIds, noted seq(Ni), which entirely represents expression Ni (the punctuation in E1.4 is for clarity only, and may not exist in the data). The block diagram of an embodiment converting XML source data into a code representation in accordance with this invention will now be described with reference to Figure 2.
In Figure 2, an XML source document 20 is submitted to a data processor 30, which will perform various functions, comprising a native and compact storage of the XML data in a very efficient and convenient form.
Data processor 30 comprises an XML parser 31. As known, an XML parser is capable of scanning an XML document, and finding therein the specific symbols of the XML metalanguage. Data processor 30 also comprises a tokenizer 32, a lexicon 33, a sequencer
34, and a sequence store 35.
The calculation of a unique identifier stringId from a string Y is hereinafter termed "tokenization". The corresponding calculation unit or process is called a "tokenizer".
Once determined, each stringId is stored in lexicon 33. In short, the tokenizer may be viewed as creating and maintaining a lexicon mapping associating a unique integer stringId to each unique cleaned token string Y.
The associated typeId (derived from X) may not need to be stored in lexicon 33, since it is necessary only in the sequencer 34, to build a sequence stored in sequence store 35. However, it may be useful that lexicon 33 physically stores the whole LexId, i.e. both the stringId and the typeId. This makes it possible to search in the lexicon for both the cleaned string Y and its type X.
Data processor 30 may further comprise an indexer 36, and index tables 37. In a currently preferred embodiment, indexer 36 comprises a posting mechanism, and index tables 37 comprise posting lists (both will be described hereinafter).
The native storage of XML data will now be shown in more detail in the case of an XML sample document (named for convenience sample.xml), reproduced in Exhibit E2. The sample document has been selected to represent the main possible situations encountered in XML documents. However, to avoid redundancies, it is considerably simplified. Also, it is far from reflecting the usual size of XML documents.
In Exhibit E2, note the <![CDATA[ <Web> ]]>. The pair in italics is a (TT_CDATASTART, TT_CDATAEND).
By framing the expression "<web>", it forces it to status CDATA, instead of being interpreted as an "element" tag.
It will be understood that an XML document may be decomposed into a tree structure of XML objects or nodes, each of which is similar to the above described node expression Ni
(Exhibit E1.1). Figure 3 shows a tree structure corresponding to the nodes of the exemplary XML document of Exhibit E2. Each node circle in Figure 3 contains only the "element" tag of the node (the attributes and CDATA, not shown, may be viewed as attached to the node object). Also, each circle is labeled with the "ref" of its "element" tag, as it appears in Table E3.0 of Exhibit E3 (to be described).
The principles of processing the exemplary XML document will now be described with reference to Exhibit E3.
Token isolation
Table E3.0 shows the parsing of the XML document. As noted, the parser is capable, in known fashion, of isolating individual XML tokens, based on the syntax rules and the typical XML dedicated symbols. After the possible indentations in the source XML document have been removed, the tokens individually appear one by one as reflected in the second column of table E3.0. The head and tail XML symbols of each token, if any, are separated (at least logically) as noted in the third and fourth columns of table E3.0, respectively. The string contained within the token, cleaned by removal of its possible head and tail XML symbols, is reflected in the fifth column.
If desired, the possible indentations in the source file may be kept, thus making it possible to store and regenerate the source file exactly. Within the raw data, each and every digit is considered, including white spaces. In fact, the white spaces and other punctuation signs are processed one by one. Note, however, that the first and last white space of a block of raw data (CDATA), if any, have not been processed in this illustration, to avoid unduly long tables in the Exhibits.
The sixth column indicates the presence and nature of such punctuation and whitespace characters, where appropriate.
Token type determination
Now comes the determination of the type of each token. By way of illustration, it is described by application of five conditions, CND1 through CND5, as noted in the corresponding columns.
The table of types E1.2 further shows how the head and tail XML symbols are specific to each type:
- as reflected in column CND1 of table E3.0, the typeIds 1, 4, 5, 6 and 7 may be readily determined from the XML head symbols of the token, if any. - The typeId 2 may be determined from the fact that the immediately preceding token with typeId 1 (a node element token) has been left "open", i.e. its trailing ">" has not yet been seen. The condition that a node token is left open is reflected by a "1" in the "open head" column of table E3.0. If so, the token in the row following the "open node" is an attribute name, and has typeId 2. (An attribute name may also be identified by the fact it is followed by "="). Otherwise, the sign "=", if any, is not stored. This is reflected in column CND2 of Table E3.0.
- A token immediately following an attribute name is an attribute value, and thus receives typeId 3. The string delimiters are ignored, i.e. cleaned out. This is reflected in column CND3 of Table E3.0.
- The rest is raw data or CDATA, and basically receives typeId 8, as reflected in column CND4 of Table E3.0. - preferably, the non-alphanumeric characters, i.e. punctuation and white spaces, are specially marked by receiving typeId 9. Column CND5 reflects this by indicating 1 to be added to the integer 8 in column CND4, where such non-alphanumeric characters are met.
This makes it possible to determine the typeId in all cases, as reflected in the last column of table E3.0.
The XML file heading (refs 1-6) may be processed the same way. However, since its location and syntax are predetermined, its processing may be simply directly hard coded.
More generally, the determination of typeIds follows simple conditions on the current token, the preceding one, and the fact the current node "element" has been left open.
Since all these conditions are predetermined by the XML syntax, the whole process may be easily hardcoded, using e.g. IF ... ELSE ... ENDIF statements, or similar programming structures, e.g. CASE structures. This is faster than looking up a table.
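A hardcoded sketch of this decision chain (the Token fields are simplified stand-ins for the parser's actual state) could be:

    #include <string>

    struct Token {
        std::string headSymbol;           // "<", "</", ... (empty if none)
        int typeId = 0;
        bool isPunctOrWhitespace = false;
    };

    // Returns the typeId, following conditions CND1 through CND5.
    int determineTypeId(const Token& t, const Token& prev,
                        bool elementLeftOpen) {
        if (t.headSymbol == "<")   return 1;   // element         (CND1)
        if (t.headSymbol == "</")  return 4;   // endtag          (CND1)
        // ... typeIds 5, 6 and 7 follow from the other reserved
        //     head symbols in the same way                       (CND1)
        if (prev.typeId == 2)      return 3;   // attribute value (CND3)
        if (elementLeftOpen)       return 2;   // attribute name  (CND2)
        if (t.isPunctOrWhitespace) return 9;   // CDATA punct/WS  (CND5)
        return 8;                              // raw CDATA       (CND4)
    }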
Lexicon
Reference is now made to table E3.1, which shows an exemplary way of how the lexicon may be progressively constructed. The lexicon itself comprises the two rightmost columns of table E3.1, as shown in bold characters.
After the type of a token has been determined, its "cleaned string" is searched in the lexicon.
Assuming first the string is not found, it is given a new unique stringId. The two penultimate columns of table E3.1 show e.g. that stringIds 7 and 8 are given to strings bib and book, respectively (the XML file heading is shown as processed the same way, although it may be treated separately and/or differently, as already noted).
Then, a LexId is constructed. In the example, the LexId is the concatenation of the typeId and the stringId, both in hexadecimal. The format (e.g. the number of "0" in the stringId) is illustrative only. It is intended to form the LexId as an integer of fixed length. Also the space between the 1st digit and the rest is for illustration only, and will not exist in the computer's memory.
When the currently searched string is found in the lexicon, it receives the existing stringId. Accordingly, Table E3.1 only shows the first occurrence of a given (string, type) pair. This is why Table E3.1 ends at ref 43, while Table E3.0 ends at ref 55.
It will be appreciated that the operations performed in tables E3.0 and E3.1 may be conducted in parallel, when scanning an XML source file.
Sequencer
Now, table E3.2 shows how the sequence is progressively constructed for the XML source file.
Starting from the beginning of the source file, each token is associated to a unique sequence identifier or seqId. Thus, a sequence is constructed, in which each token is represented by its LexId, and associated with a corresponding seqId.
Note that a given LexId may be repeated in the sequence, as it appears e.g. for seqIds 8 and 31. By contrast, the associated seqId is unique. The second occurrence (or more) of a LexId is reflected in the column "Dupl." of table E3.2. This is for illustration only, and may not be reflected in the computer's memory.
Preferably, the sequence as stored is simply an ordered list of LexIds of fixed length, again more simply a concatenation of the LexIds. In this case, the seqId of each LexId in the sequence is virtual : it is just the offset of the LexId, from the beginning of the ordered list of LexIds. Thus, there is no need to store the seqIds themselves in the sequence. If required (e.g. for posting), the seqIds may be stored as integers, e.g. in native hexadecimal format.
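Such a sequence store can be sketched in a few lines (the class and the 8-character LexId width are illustrative):

    #include <cstddef>
    #include <string>

    // Fixed-length LexIds concatenated in document order; the seqId is
    // virtual and recovered from the byte offset.
    class Sequence {
        std::string data_;                  // concatenated LexIds
        static const size_t kLexIdLen = 8;  // e.g. "10000007"
    public:
        size_t append(const std::string& lexId) {  // returns the seqId
            size_t seqId = size();
            data_ += lexId;       // seqOffset = seqId * kLexIdLen
            return seqId;
        }
        std::string at(size_t seqId) const {
            return data_.substr(seqId * kLexIdLen, kLexIdLen);
        }
        size_t size() const { return data_.size() / kLexIdLen; }
    };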
Appropriate data compression techniques may be used to optimize the storage of the sequence.
Parsing an XML source file
The flow chart of figure 4 shows how the above operations may be performed upon scanning (or parsing) an XML source file, for converting it into a corresponding coded compact "native format".
Parsing begins at 400, and a first token, determined using current XML file parsing techniques, is considered as the current token Ti at 402. Its type X(Ti) is determined. If not meaningful (e.g. a comment), this token is ignored and the next meaningful token becomes the current token Ti (operation 480). Else, token Ti is processed to obtain a string Y(Ti), cleaned from the XML syntax symbols, using known techniques, like the ones described in connection with Table E1.2.
Now, X(Ti) and Y(Ti) form a pair Pi (406).
In Figure 4, the lexicon contains the whole LexIds, i.e. both the correspondence between the cleaned strings Y(Ti) and the stringIds, and the correspondence between the token type X(Ti) and the typeId. At 408 (Figure 4), the Lexicon 33 (Figure 2) is searched for finding a LexId which corresponds to the pair Pi of X(Ti) and Y(Ti).
If found (410), the corresponding LexId in Lexicon 33 is retained for the pair Pi, i.e. token Ti. Else (412), a new entry is added in Lexicon 33, with a new unique LexId. In fact, the tokenizer 32 interacts with Lexicon 33 to build:
- a stringId corresponding to h[Y(Ti)], as described above. As noted, function h() is a function delivering a unique new stringId, e.g. from inspecting the Lexicon.
- a typeId corresponding to g[X(Ti)], with function g() being implemented by code derived from table E1.2. (In fact, the typeId may be determined at any time between operation 402 and operation 414).
- the LexId from : LexId = f { g[X(Ti)], h[Y(Ti)] } = f(typeId, stringId)
As described above, function f() may be the concatenation of g() with h(), or conversely (in fact any character combination of g() with h() permitting fast recovery of g() and/or h() from the resulting LexId may be suitable). Then occurs the operation of the sequencer at 416. It simply adds the LexId to the sequence. This associates the LexId being added to a new sequence number seqId. As noted, the seqId may be virtual, i.e. not physically stored in the sequence. If so, the seqId may be viewed (and conveniently represented) as a physical offset (seqOffset) of the current LexId, from the beginning of the sequence. With LexIdLen denoting the fixed length of the LexIds, one has: seqOffset = seqId × LexIdLen
Although the physical offset (seqOffset) is more practical, the corresponding, possibly virtual, seqId will be used in this description, to enhance its intelligibility.
Optionally, operation block 420 may be performed to make one or more other "in-line processing". Examples of this will be described hereinafter.
Then, the next token is processed (480), unless the end of the XML source file is reached (460), in which case this is the end (490) of the conversion of the XML file.
Figure 4A shows an alternative embodiment, in which the lexicon contains the stringIds only. In this case, at 408, the string Y(Ti) is searched in the lexicon 33 (Figure 2). If the string Y(Ti) is not found in the lexicon 33, then (412, Figure 4A) the tokenizer 32 interacts with Lexicon 33 to build a new stringId corresponding to h[Y(Ti)], as described above.
In either case, operation 413 (figure 4A) builds a LexId for token Ti, according to : LexId = f { g[X(Ti)], h[Y(Ti)] } = f(typeId, stringId). As noted, the typeId may be determined at any time between operation 402 and operation 413.
The rest of the flow-chart of figure 4A is the same as in Figure 4.
Level tracking in the sequence
Metalanguages like XML use sections, called "elements" or "nodes" or also "XML objects" in the case of XML. They may also have strict rules of nesting sections within each other, i.e. such sections must be either strictly embedded within each other, or juxtaposed. XML goes even farther, in imposing that the "begin" and "end" markups of the elements are also themselves strictly embedded, while HTML is more tolerant.
Then, as noted, a tree structure may be associated with a given metalanguage file. Also the nodes or elements may be defined using a path notation, like "/bib/book" . The path separator here is "/", and, when cited in this specification, a path is provided with string delimiters (") for clarity.
Thus, it is of interest to be able to track the node (or nesting) level of the data blocks contained in the sequence. This tracking may be done relatively, or absolutely. It will now be described in connection with table E3.3 in Exhibit E3.
The column "node tag" shows where the beginning element ("+") of a node is seen, and where the end of a node is seen ("-"). When scanning the source file to establish the seqlds (or later when scanning the seqlds), the path to the item having the current seqld is easily defined, as shown in column "node path".
The flow-chart of figure 5 illustrates this mechanism. At initial operation 500, one stands somewhere in the sequence, at an element, i.e. the beginning of a node. The node level is NL_0, which may be the true node level of the element (absolute level) ; alternatively, NL_0 may be 0, in which case relative node levels will be obtained. At 502, NL_X is set to NL_0.
Then the sequence is scanned down from the current position (504). Whenever a LexId having typeId 1 is met (506), NL_X is incremented (508). Whenever a LexId having typeId 4 is met (510), NL_X is decremented (512).
Block 520 is the place where to perform a node level related operation, if desired, for example creating a software object for a new node if a typeId "1" has just been met. (If the operation depends upon the previous node level, it would be placed between 504 and 506, for example filling-in a software object with an attribute name, and attribute value, or
CDATA). This flow chart may also be used in a posting mechanism, to be described below. Then operation 530 detects if a given end condition is met (e.g. having descended the tree by two levels); if yes, this is the end 540; else, control is returned to operation 504. Note the end condition may depend upon the operation made at 520, if any. It will be understood that this makes it possible to use any portion of the sequence as a basis for virtually any software processing. The software processing will operate the same as with the XML source file, however using the more compact and easily searchable representation being proposed.
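Building on the Sequence sketch above, the core of the Figure 5 loop reduces to a few lines (the typeId here is read as the leading digit of each LexId):

    // Walk the sequence from startSeqId, tracking the nesting level from
    // the typeIds (1 = node element, 4 = endtag). With level starting at 0,
    // relative node levels are obtained, as with NL_0 = 0 above.
    int trackLevel(const Sequence& seq, size_t startSeqId, size_t endSeqId) {
        int level = 0;
        for (size_t s = startSeqId; s < endSeqId; ++s) {
            char t = seq.at(s)[0];
            if (t == '1') ++level;        // beginning of a node (506/508)
            else if (t == '4') --level;   // end of a node (510/512)
            // node-level related processing would go here (block 520)
        }
        return level;
    }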
The operations of figure 5 may be directly embedded within those of any process scanning the sequence, whether totally or partially. Note also that the sequence might be scanned backwards. If so, the starting point 502 should preferably be the end tag of a node.
Generally, if the starting point is not correct, operation 502 may look for the next correct starting point, in either direction.
Use of the sequence
Turning back to Exhibit E3.3 (rightmost columns), the above mechanism is illustrated to reconstruct the actual paths of each node in the tree. This may be obtained as follows: - an empty string CurPath is built at the beginning of the scanning (operation 502);
- whenever a beginning node is seen ("+1"), CurPath is concatenated with "/", plus the cleaned string of the current item (done at 508 or 520);
- whenever an end node is seen ("-1"), CurPath is reduced by removing the last concatenated item (done at 512 or 520).
Note that the above path construction mechanism may give absolute paths, e.g. when the whole file or the whole sequence is scanned, or the path of the starting node is known. It may be used to get relative paths as well, e.g. when starting from an intermediate node whose path is already known for other reasons.
Note also that the Table E3.2 shows the paths, using the tag names as they appear in the XML document. This is for facilitating the understanding. In fact, in the compact storage as described, the representation is different, as it uses the LexIds. This should be kept in mind when reading this specification. For example : "/bib/book" stands for '1 0000007 / 1 0000008' (the white spaces and slash are for clarity only, and are not stored). Note also that the leading '1' of the typeId is not necessary to define the path unambiguously, to the extent the string refers to an element node.
Again, the operations performed in tables E3.0, E3.1, E3.2 and E3.3 may be conducted in parallel, when initially scanning an XML source file, or when scanning the sequence.
Scope of a node
Reference is now made to the flow chart of figure 6.
Assume first that the system has an element node defined by its seqOffset or seqId in the sequence (620), for example "/bib/book" defined by seqId 8. Operation 622 goes to that seqId, where it finds LexId 10000008 in the example.
Now, starting from seqId 8, operation 624 scans down the sequence until a LexId having typeId 4 is found (here 40000008) at the same node level (the mechanism of figure 5 may be used to track the node level during scanning). This stops here at seqId 30. Thus, the scope of that occurrence of "/bib/book" is from 8 to 30 in terms of seqIds.
Whether the end position, i.e. 30, is or is not within the scope is a pure matter of convention. It will not affect the system operation, provided the convention is consistently enforced.
This function, called Scan1Scope() in box 620, may be used for several purposes, e.g. when processing queries.
Where applicable, the scope determined in accordance with figure 6 may be stored as a scope length, associated to the node. The simplest is to associate the scope length with the heading seqId (or seqOffset) of the node element.
In the example, for the "/bib/book" which has heading seqId 8, the scope length scopeLen is 30-8 = 22 (or 23 if the endtag at seqId 30 were included). Thus, scope storage is in the form of pairs (seqId, scopeLen), or (seqOffset, scopeLen). If desired, such node lengths or scopes might be stored during the XML file scanning of Figure 4, any time after operation 420, where both the LexId and seqId of a node are known. Deciding scope storage when parsing the XML file is a matter of compromise between memory occupation and utility. Thus, in most cases, scope storage will not be performed initially for all nodes.
In accordance with the currently preferred embodiment, scope storage is performed later, when processing queries. Whenever a node length is determined during the processing of a query, one may keep ("cache") a representation of this node in the form of the seqId of its head element, and of its length, i.e. the pair (seqId, scopeLen) or (seqOffset, scopeLen).
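A sketch of Scan1Scope(), again using the Sequence sketch above, follows; it returns the cacheable pair (seqId, scopeLen) for the element starting at startSeqId:

    #include <utility>

    // Scan down from the element at startSeqId until the matching endtag
    // (typeId 4) at the same node level, and return (seqId, scopeLen).
    // For "/bib/book" at seqId 8 this yields (8, 22), as in the example.
    std::pair<size_t, size_t> scan1Scope(const Sequence& seq,
                                         size_t startSeqId) {
        int level = 0;
        for (size_t s = startSeqId; s < seq.size(); ++s) {
            char t = seq.at(s)[0];
            if (t == '1') ++level;
            else if (t == '4' && --level == 0)
                return {startSeqId, s - startSeqId};
        }
        return {startSeqId, 0};   // malformed input: no matching endtag
    }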
Scopes of a plurality of nodes
Reference is now made to the flow chart of figure 7.
When starting from operation 720, this flow chart defines a function which may be called Scope(LexId). In addition, one may want to use a function Scope(stringId). In this case, the node element is simply defined by its stringId. The LexId is built by combining the typeId ("1") with the stringId, according to function f(). Then the function Scope(LexId) is called. This works basically with a stringId referring to an element node; however, in other cases, one may scan the sequence, upwards or downwards, to reach the next element node.
Turning back to operation 720, a node element is assumed to be defined here by its LexId, not knowing where it is in the sequence. Operation 730 of figure 7 searches for the LexId in the sequence (subject to additional conditions, if desired). If a seqId is found (732), the situation of operation 620 in figure 6 is met again, and the Scan1Scope() process of figure 6 may be executed, as described above. By returning control to operation 730, this may be repeated to find further node elements having the LexId, if desired. If nothing is found, or if it is not desired to continue the search, this is the end 740.
Assume now that a node element is more precisely defined by its path, expressed in terms of LexIds (or stringIds). Now, the node level has to be considered. Thus the process described in connection with Figure 5 and Table E3.2 is used to reach the element at the desired path, e.g. "/bib/book". Then, the process of figure 7 is used, calling the process of figure 6 as necessary, while tracking the node level.
Considering now a query, there, the tag or path is input in XML language. A Scope(xmlString) function may be used. Suppose a query refers to "author". It will arrive at operation 700 in figure 7. Then operation 710 will look in the lexicon for "author" and build the LexId "1 0000010" (adding the typeId 1 to the stringId found). Then, the rest of the process is executed from operation 720, as described above (however with "author" instead of "book" in the illustration). The process of figure 7 will call the process of figure 6 at 620, as necessary. It will find that the scope of the first occurrence of "author" is 24-16 = 8.
The process is the same if the input query defines a path "/bib/book/author", instead of the tag "author" alone, except that the search at 710 will follow the path: search for "bib", then for "book" within scope of "bib", then for "author" within scope of "book". This is an application of the scope determination.
Implementation
As indicated, the correspondence between the pairs of data X (token type), Y (cleaned token string) and their respective typeld, stringld may be stored in the lexicon. The lexicon uses a memory space, and preferably has indexing mechanisms enabling a fast implementation of at least some of the following functions, which comprise :
* in an XML-to-Id group:
- stringId(cleanedString) - typeId(tokenType)
- LexId(tokenType, cleanedString)
* and/or, conversely, in an Id-to-XML group:
- cleanedString(stringId)
- tokenType(typeId) - cleanedString(LexId)
- tokenType(LexId) As shown in Table E3.1, the Lexicon looks like a table. Although a table may be used where small sizes of XML files are considered, accelerating mechanisms are desirable to improve performance in terms of response time. An indexing mechanism producing index tables may be used, e.g. with at least one index for the strings and the types (for the XML-to-Id group of functions), and at least one index for the LexIds (for the Id-to-XML group of functions).
In the above illustrative examples, a new stringId entered in the Lexicon is numbered in sequence with the existing stringIds. This is mainly for explanation purposes. More sophisticated data representations may be used to facilitate fast search of strings in the Lexicon.
However, a significant improvement of performance may be obtained by storing the lexicon in a form enabling powerful and fast string-based searches, including those using wildcards. Currently preferred is the so-called "ternary trie", which offers a highly efficient dynamic dictionary structure for strings and textual data. The principles have been described inter alia in the following publications:
- Jon L. Bentley and Robert Sedgewick, "Algorithms for Sorting and Searching Strings", Proc. 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1997.
- Jon Bentley and Robert Sedgewick, "Ternary Search Trees", Dr. Dobb's Journal, April 1998.
- Julien Clement, Philippe Flajolet and Brigitte Vallee, "The analysis of Hybrid trie structures", Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1998.
The last mentioned article explains the possibilities as follows :
Tries are a general-purpose data structure of the dictionary type, that is, supporting the three main operations Insert, Delete and Query. To see how they are defined, let A = {a_1, ..., a_r} be an alphabet and S be a set of strings defined over A. The trie associated to S is recursively defined by the rule
trie(S) = ⟨trie(S \ a_1), trie(S \ a_2), ..., trie(S \ a_r)⟩
where S \ a refers to the contents of S consisting of strings that start with a and stripped of their initial letter, and the recursion stops as soon as S contains one element.
Searching a trie T for a key w just requires tracing a path down the trie as follows: at depth i, the i-th digit of w is used to orientate the branching. (Insertions and deletions are handled in the same way.) To complete the description, we need to specify which search structure is used to choose the correct sub-trie within a node. The main possibilities are: 1. the "array-trie" which uses an array of pointers to sub-tries. This solution is relevant for small alphabets only, otherwise too many empty pointers are created; 2. the "list-trie" that remedies the high-storage requirement of the "array-trie" by using a linked list of sub-tries instead. The drawback is a higher cost for the traversal; 3. the "bst-trie" which uses binary-search trees (bst) as a trade-off between the time efficiency of arrays and the space efficiency of lists.
In particular, the bst-trie can be represented as a ternary tree where the search on letters is conducted like in a standard binary search tree, while the tree descent is performed by following an escape pointer upon equality of letters. We shall refer to this data structure as a ternary search trie or tst. An example trie with its basic representation and the equivalent ternary search trie over the alphabet A = {a,b,c} is represented on fig. 1 and 2. As is well known, the performances of tries depend upon the probabilistic properties of the strings processed. More precisely, we shall work with two types of models: The models for the infinite strings inserted in the tries. These models depend upon the number of strings inserted (either a fixed number n or the output of a Poisson random variable P(n)) and also on the way the characters are emitted after one another (either independently or with some memory scheme such as a Markov process or a continued fraction). The models for the finite keys inserted in the tst nodes. Examples of such models are the multiset model, the Poisson model P(n, p_j) or the Bernoulli model B(n, p_j). It should be emphasized that since the infinite strings are drawn independently, their i-th letters are also independent, which is matched by the previous models.
At present, the applicant finds it appropriate to use the "bst-trie" when the data are to be stored in a disk or other mass storage medium, e.g. when processing the lexicon. However, the "list-trie" may be used when storing data in memory (for example when processing requests, as it will be described hereinafter).
The second mentioned article describes a C++ implementation, which may be used in accordance with this invention.
Other implementing code may also be found on the site of the National Institute of Standards and Technology, at http://www.nist.gov/dads/HTML/ternarySearchTree.html
The ternary trie may be applied here to build a bi-directional fast lexicon. Bi-directional means that the lexicon is searchable both : - using an input string, to determine whether it is in the lexicon, and, if so, to recover its stringId or LexId, and
- using an id, to re-build the corresponding string.
This will be illustrated with reference to Figures 10 and 10A. With a view to avoid Figures 10 and 10A being too complicated, the lexicon has been restricted to a small subset of the strings existing in the example of Exhibit E2, namely (delimiters added): exSubSet = {"1", "1992", "1994", "bib", "book"}
A ternary trie may be viewed as a ternary tree, arranged in a specific manner. In short, a ternary trie is based on an alphabet (set of characters, including symbols), and a predefined way of sorting that alphabet. This might be e.g. the ASCII characters, sorted according to their ASCII binary code. The principle of a ternary trie is as follows:
* the root and the nexus (intermediary nodes) of the tree have three cells :
- a central cell containing a character ;
- a lefthand cell dedicated to other characters which are sorted before the central character; - a righthand cell dedicated to other characters which are sorted after the central character;
- each of the cell has the possibility of pointing to a respective node at the immediately lower level ;
- the central cell must point to a lower level node ;
- the lefthand cell may point to a lower level node, or not (in which case it is shown shaded); - similarly the right-hand cell may point to a lower level node, or not (in which case it is shown shaded).
* this is repeated on a plurality of node levels, while removing the current head character of a string being processed, when it matches the character of the current central cell. * the process may stop with a leaf node when there is no more ambiguity.
Referring to figure 10, right-hand portion:
- after "b" and "i" have been seen, one meets a leaf node with "b", which results in "b" + "i"+ "b" = "bib" ( + is the concatenation of strings)
- more on the right, "b" + "" + "ook" results in "book".
- on the left, one has :
"" + '1' = "1"
"" + '1' + '9' + '9' + '2' = "1992"
"" + '1' + '9' + '9' + '4' = "1994"
Thus, the above defined exSubSet is entirely covered.
Each leaf node may be (virtually) attached with an id corresponding uniquely to the string it represents, as shown in figure 10 using formal expressions like id(book). Note that, in Figures 10 and 10A and in their description, "id" is a shortened notation for a lexicon identifier (whether it is a stringId, a typeId, or a concatenation of both, i.e. a LexId).
Now, the id of a leaf node may be defined as a digital description of the path through the ternary tree to reach that leaf node. This has the advantage that, starting from an id, one simply has to follow the path defined by that id through the tree, while concatenating the characters met at the central cell where appropriate, until the leaf node, inclusive, whose character(s) is (are) also concatenated.
Figure 10A differs from Figure 10 by the fact, that the above described process continues until there is at most one character in the leaf node. Also, path identifying digits (to be described) are shown on the links between the nodes. For example, "book" is now obtained by: "b" + "" + "o"+"o"+ "k" instead of "b" + "" + "ook"
As shown, the leaf nodes in Figure 10A are, further preferably, provided with a three cell structure. This has the advantage that, when a previously converted file is converted again after having been updated, the original ids may be left unchanged.
The structure of the tree depends upon the selection of the character being put in each central cell. This pertains to the art of "balancing trees", which is well known to those skilled in the art.
Index tables may be added, if desired.
In each three-cell node:
- 2 bits are enough to represent the status (used, or not) of the lefthand and righthand cells;
- representing the central character may necessitate a number of bits depending upon the size of the alphabet. Considering the ASCII table, and eliminating the computer dedicated bytes (decimal 0-31), as well as the higher portion (decimal 128-255), there remains a maximum alphabet of 96 characters. In practice, 64 characters, i.e. 6 bits, are largely sufficient, in the case where the punctuation and white spaces are separately coded.
Thus, a total of 8 bits is sufficient (as a "three-cell node data") to represent a three cell node, or a leaf node (which is the same or simpler).
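By way of illustration only, such an 8-bit packing may be sketched in C++ as follows (the exact bit layout and the 6-bit character code are assumptions, not imposed by the above):

#include <cstdint>

// Hypothetical 8-bit packing of a three-cell node:
// bit 7 = lefthand cell used, bit 6 = righthand cell used,
// bits 5..0 = 6-bit code of the central character (64-character alphabet).
struct PackedNode {
    uint8_t data;
    bool hasLeft() const { return (data & 0x80) != 0; }
    bool hasRight() const { return (data & 0x40) != 0; }
    uint8_t centralCode() const { return data & 0x3F; }
};

inline PackedNode packNode(bool left, bool right, uint8_t code) {
    PackedNode n;
    n.data = static_cast<uint8_t>((left ? 0x80 : 0x00) | (right ? 0x40 : 0x00) | (code & 0x3F));
    return n;
}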
The tree storage should also define the link between each of the three cells and the corresponding next cells at the lower level. This may be facilitated to some extent by using predefined rules for the arrangement of the tree and node representations in memory and/or in a file. For the rest, additional information may be stored in each of the three cells, to form a pointer to the memory location of the next node, as required.
Now, the id simply has to describe the path within the ternary tree. Three possibilities exist at each level; however, only the central one is a progress, processing the first character. The three possibilities may be coded on 2 bits, e.g. 1 (binary 01) for left, 2 (binary 10) for central, 3 (binary 11) for right, with binary "00" being reserved.
It has been found that ids of 28 bits may be sufficient for the large majority of the situations encountered (only very long strings may raise problems). Since the depth in the tree may vary, the ids may be completed with void codes, e.g. binary "00", until the predefined length of 28 bits is reached. Where 28 bits would not be sufficient, an "escape mechanism" may be used in known fashion to expand the representation of an id over two words of 28 bits, or more.
In short, the ternary trie just stores the node cells (not the id(xx) shown in italics). Then:
- when a string is searched, the correct path is easily found by comparing the current first character of the string with the central character in the current node; the next node is defined by this comparison, and the current first character of the string is removed if it matches the central character of the node;
- when building the ternary tree, nodes may be added as necessary to match a new string being entered;
- when executing a query, a result found/not found is obtained;
- when it is desired to regenerate a string from the id, the path codes in the id are used one by one to follow a path in the tree, the central characters being met are concatenated one by one until the leaf node is met, and its central character, if any, is also concatenated. The result is the string corresponding to the id. Of course, the filling void codes used to give the id a fixed length are ignored.
Consider for example "book". As a search string, "book" may be submitted to the tree of Figure 10 A. It follows the path : 2,3,2,2. Thus, the id for book may be the integer string 2322. In fact, the id is preferably 23222, with the last 2 reflecting the fact the final "k" is in the central cell of the leaf node. By so doing, the id will not change if e.g. the right cell of that leaf node is later "opened" to include e.g. the string "boolean" in the lexicon.
Posting
The "posting" will now be described in more detail with reference to Table E3.4.
The so-called "Posting" relies on the preselection of certain node paths, as being of particular importance or significance, e.g. because they are more likely than other node paths to be involved in a majority of queries. The selection may be made by the user (e.g. the author of an XML file, or a reader), or according to predefined conditions relating to the structure of the XML file, or both.
In the example, this will be described for the node "book" in the path "/bib/book":
Primarily, the posting of "/bib/book" may be described as follows:
- the process of figure 7 is implemented to determine the respective scopes of all occurrences of "/bib/book" in the sequence (two occurrences in the example). The processes of Figures 5 and 6 are called as necessary as subroutines.
- by so doing, one meets a plurality of LexIds.
- for each LexId being met, one stores the seqId of a reference token in each occurrence of the node path in which that LexId appears. The simplest reference token is the head element (here "<book"). However, another token consistently met in any node, e.g. its end tag, might also be used.
The storage of the posting information may conveniently be done by attaching a list of seqIds (or seqOffsets) to each LexId being met. Thus, the posting of a given path (e.g. for "/bib/book") will result in an ordered set S1 of lists Li, according to the following principles:
- each list Li is attached to a respective LexId in the lexicon.
- that list Li represents the seqIds attached to the nodes having the given path (e.g. "/bib/book") which actually contain the string having that LexId.
A list Li may be formed of concatenated character strings, separated with a suitable delimiter, with the string between delimiters representing a seqId (more generally a pointer to a seqId). Since the seqIds or seqOffsets have a fixed length, a plain concatenated list with no delimiters may also operate.
To better understand how the lists Li are built, one must first consider:
- that a (virtual) window is opened in the sequence from the point (seqId 8) where a node token "/bib/book" of typeId 1 is met ("<book" under "<bib"), to the point where the corresponding XML end mark ("</book>") is met, exclusively;
- that, for each token in that window, including "<book", the seqId of "<book", here 8, is concatenated to the individual list attached to the LexId of that token.
This is reflected in Table E3.4. To facilitate understanding, all tokens have been shown in table E3.4. Those whose LexId is a second occurrence (within the scope of "/bib/book") are marked with "[bis]". Their own seqId is in fact already represented in the list attached to the first occurrence of their LexId. The LexIds which have no list (or an empty list) are marked with "[nil]". In the machine, the actual posting list need not comprise the lines marked with "[bis]" or "[nil]".
The posting will now be described more formally, with reference to the flow chart of Figure 8. In this example, it is assumed that all first level nodes in the tree are posted (the root "bib" is level 0). Thus, the path of the nodes being posted may be left implicit: here, "book" only. It is assumed (800) that posting is initiated for a given path, e.g. "/bib/book", that has:
- parents in the path, here "bib" only, whose LexId is 1 0000007,
- a head tag or element with LexId_P = 1 0000008.
This corresponds to a node level NL_0 = 1 (level 0 being for "bib").
If the software environment requires it, an empty set of lists is declared at 802. In the example, this set of lists will be dedicated to nodes having the path "/bib/book".
Then, operation 804 searches for LexId_P in the sequence. When found (806), the corresponding seqId_P (here 8) is stored (808). At 810, a loop variable seqId_X is set equal to seqId_P. At this time, the corresponding loop variable for the LexId is LexId_X = 1 0000008.
Thereafter, operations 812-818 form a loop (inner loop) which will scan the sequence stepwise until a typeId = "4" is met at node level NL_0. This marks the end of the scope of the current node.
For each position in the sequence (i.e. each seqId) being seen during that scanning, operation 812 will take the LexId being there. Then, operation 812 determines if a list already exists for the current LexId_X (at seqId_X). If not, a new list is created in the set of lists for that LexId_X, at 813. Then, in either case (814), the list for the LexId is appended with the seqId_P (here 8) of the head element, as it was stored at operation 808. Operation 816 skips to the next LexId_X in the sequence.
This inner loop stops when operation 818 sees a typeId "4" at node level NL_0. Node level tracking is applied in this flow chart, in accordance with the process of Figure 5. This has not been shown, to avoid an unduly complicated drawing. In fact, steps 506-512 of figure 5 are inserted between steps 816 and 818 of figure 8. When operation 818 sees a typeId "4" at node level NL_0, operation 820 marks the end of the scope of the current node occurrence, and control is given back to operation 804, searching for another occurrence of LexId_P at node level NL_0 in the rest of the sequence (outer loop). Node level tracking is applied during this search.
The next occurrence will be found at seqId 31. This will result in all the LexIds which have been seen twice having "8, 31" in their list, as shown in the example of table E3.4.
When no further occurrence of LexId_P at node level NL_0 is found at 806, the posting of "/bib/book" is terminated (830).
More generally, the "posting" associates several seqld to a given Lexld. Thus, for a given Lexld[ϊ\, this may be conveniently stored as a list Li, in which the seqlds are concatenated (with suitable separators, as necessary).
Now, the above described posting for a given node path ("/bib/book") results in a set S1 of lists Li of seqIds, one list per LexId. This may be viewed as an association or pair having the form:
- XML expression: { /bib/book , set S1 }
- equivalent coded form: { /LexId(bib)/LexId(book) , set S1 }
In fact, the machine representation of the path may be the concatenation of the LexIds without separator, since the LexIds have a fixed length. This results in: { LexId(bib)LexId(book) , set S1 }
As seen, a given set of lists like S1 is associated to a given path. Assume there were another book occurrence, e.g. a path /bib/book/book, where the first "book" is the above cited one, while the second is of a different nature, e.g. a "book" being cited in the "book". Then, if posted, that other path "/bib/book/book" would be associated to a different set of lists. As a consequence, where several postings are made, as will commonly occur, the result is a plurality of sets of lists.
The plurality of sets of lists may be stored as a posting file, having adequate indexes for obtaining a fast response. It is currently preferred to arrange the posting file as a balanced B-tree.
Advantageously, indexes may be created with reference to attribute names in the posting list, and also with implicit reference to the corresponding attribute values. An example of this will now be briefly described.
Consider a path '/bib/book/@year'. The attribute "year" may appear as indicated in the following table, where the column "offset" has arbitrary values:
[Table image omitted: it lists occurrences of the attribute "year" with arbitrary offsets, e.g. year 1992 at offset 200 and year 1994 at offset 100.]
A "sorted index" associated to the path ' /bib/book/@yea is constructed as follows:
"Sorted index" 200 100 300
This makes it possible:
i. to accelerate the processing of a request: assuming year '1993' is searched, the process may be stopped after scanning offsets 200 and 100, since all subsequent offsets have year values higher than 1994 (seen at offset 100); this avoids scanning all year values in the "posted" portion of the sequence corresponding to the path '/bib/book'. Note this also works when sorting strings and other data types adapted for sorting.
ii. to process comparisons: assuming years lower than '1994' are searched, the process may be stopped after scanning offsets 200 and 100, since all subsequent offsets have year values higher than 1994 (seen at offset 100).
iii. to similarly improve all searches that may be helped by sorting.
In other words, a 'sorted index' may comprise a list of offsets (or seqIds) associated with a path and an attribute name. The list of offsets reflects the presence of values for the attribute being considered, and is sorted in accordance with the attribute values (which are accessible from the offsets). This is not limited to XML attributes and may also be used for other XML types, e.g. CDATA.
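As an illustration, building and exploiting such a sorted index may be sketched as follows in C++ (the valueAt callback, which retrieves the attribute value reachable from an offset, is an assumption):

#include <algorithm>
#include <functional>
#include <vector>

// Builds the sorted index: offsets rearranged by ascending attribute value.
std::vector<int> buildSortedIndex(std::vector<int> offsets, const std::function<int(int)> &valueAt) {
    std::sort(offsets.begin(), offsets.end(),
              [&valueAt](int a, int b) { return valueAt(a) < valueAt(b); });
    return offsets;                       // e.g. {200, 100, 300} in the example above
}

// Early-stopping scan: offsets whose value is strictly lower than `limit`.
std::vector<int> offsetsBelow(const std::vector<int> &sortedIndex, int limit,
                              const std::function<int(int)> &valueAt) {
    std::vector<int> result;
    for (int off : sortedIndex) {
        if (valueAt(off) >= limit) break; // all subsequent values are at least as high: stop
        result.push_back(off);
    }
    return result;
}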
It will be appreciated that the process of Figure 8 is a mere scanning of the whole sequence, while storing a few variables:
* before and during the outer loop:
- the LexIds of the path being posted, or, alternatively, the LexId of the last node in that path, plus its node level NL_0,
- a current node level variable NL_X,
* just before and during the inner loop:
- the seqId (or seqOffset) of a reference element in the node, which is the head element in the examples.
This means that, when scanning the whole sequence, any desired number of different postings (for different paths) may be performed simultaneously, provided that corresponding sets of the above variables are maintained (note that the current node level variable NL_X is common).
A good place to do the postings is during the initial scanning of the XML file, e.g. as shown in Figure 4 (enhanced with the node level tracking of Figure 5). Then, step 420 of Figure 4 (or Figure 4A) comprises:
- storing the current seqId as a seqId_P, when a posted node is met at its posted level NL_0,
- executing steps 812-814 of Figure 8 at each step in the sequence, until an end tag typeId "4" is met.
The compact native storage
As described until now, this invention offers very significant advantages :
1. by entirely representing an XML document (content and tree structure, i.e. mark-ups) using pairs of integers (LexId, seqId), subsequent searches are rendered faster, since they rely only on comparisons of integers.
2. by simply scanning the sequence, it is readily feasible to store the "length of scope" of the or each occurrence of a given node element path. This may be done for part or all of the nodes, either during the construction of the sequence, or later. Additionally, this information is simply in the form of a further pair (seqId, scopeLen), or (seqOffset, scopeLen).
As noted, seqId may be an integer whose size (number of bytes) is related to the size of the count of the tokens in the XML file. Here, scopeLen may be another integer whose size is related to the maximum length of a node scope in the whole XML file. This is again related to the size of the count of the tokens in the XML file, if we assume that there is a "root" node covering the whole file. However, the scope of this root node is known and need not be stored. Only the scopes of lower level nodes need be stored. Thus, the size of scopeLen is in fact significantly lower than the size of the file.
Such information may be easily stored as a cache, using adequate data structures.
3. Further additionally, nodes of particular interest may be "posted" as described above. Together, the sequence store 34 and the lexicon 33 provide for a native and compact storage of the XML data contained in the source file: there is a biunivocal and direct correspondence between it and the original XML document.
It is worthwhile to stress that this native storage is performed without any need to refer to the actual meaning of the tokens. As known, the syntax and form requirements of an XML document may be defined in a DTD (Document Type Definition) associated to that document, or in an XML schema. Parsers often use this information of syntax, when scanning an XML source document. By contrast, the mechanisms proposed in this specification may operate without any connection to a DTD or an XML schema or other mechanism defining the syntax and form requirements, as used in the XML source document.
Conversely, the XML source document may be regenerated, totally or in part, from the native storage as described. "Totally" means here as far as the useful content is concerned.
For example, as noted, the comments possibly contained in the XML source document are not transferred in the native storage, and thus cannot be regenerated, in the embodiment being described. However, adding a "comment" type in the table of types E2.2 would make it possible to also include comments in the native storage, and accordingly, to regenerate them.
Regeneration
Reference is made to Tables E4.1 and E4.2 in Exhibit E4 (in these tables, the string delimiters are omitted to clarify the representation). The regeneration may be made as follows:
- an empty file is opened; the empty file is named regen.xml for convenience;
- a suitable XML heading is entered in the file, using the beginning of the sequence file (refs 1-6 in table E3.2), if desired (no heading is required in many cases, e.g. when answering a query);
- the sequence is scanned in its order, from ref 7;
- for each sequence item, the corresponding LexId is obtained in sequence store 34;
- the LexId is separated into its first digit (typeId) and the rest (stringId);
- the source string corresponding to the stringId is found in the Lexicon 33; this source string may be noted Conv(stringId), or, in short, Conv();
- then, depending upon the typeId, a token heading may be defined as shown in table E4.1, to be inserted before the Conv(). For example, if typeId is 1, the token heading is "<", and the concatenated string "<" + Conv() is appended to the regen.xml file;
- for certain typeIds, the token may have to be modified depending upon the type of the next token in the sequence. This is shown in Table E4.2. For example, if a node token (typeId 1) is not immediately followed by an attribute name (typeId 2), then a ">" is immediately added. This will work e.g. for "<author>" in the source file.
The contents of tables E4.1 and E4.2 might be used as look-up tables, for the current and next seqIds (and corresponding LexIds), when scanning the sequence. However, to the extent the rules are predetermined, they are preferably hard coded, using e.g. IF-ELSE-ENDIF statements and/or "CASE" structures, for faster operation.
Using the above defined mechanism, the XML source file may be entirely regenerated, by entirely scanning the sequence store 34, and progressively appending strings to the regen.xml file (or a precursor of it in RAM). At each seqId, the corresponding Conv() is appended, preceded by a head depending upon the conditions shown in table E4.1, and followed by a tail depending upon the conditions shown in table E4.1 and/or upon the conditions shown in table E4.2.
Table E4.2 further indicates "error" conditions, which should not occur. For example, an AttrName (typeId = 2) may not be immediately followed by another AttrName.
The conditions given in Tables E4.1 and E4.2 will equally work if only a "node section" cut from the compact code of an XML file is being considered. In fact, Tables E4.1 and E4.2 may be viewed as a state diagram, which may be implemented in an XML state machine: the XML state machine may walk through the XML object sequence in accordance with the authorized transitions and actions of Tables E4.1 and E4.2.
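A skeletal C++ rendering of such a state machine may look as follows (typeId codes as in Exhibit E2; the heads and tails of Tables E4.1 and E4.2 are abridged to the few rules quoted above, so this is a sketch rather than the full tables):

#include <string>
#include <vector>

struct SeqItem { int typeId; std::string conv; };  // conv = Conv(stringId), already resolved in the lexicon

std::string regenerate(const std::vector<SeqItem> &seq) {
    std::string out;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const SeqItem &cur = seq[i];
        switch (cur.typeId) {                             // head per Table E4.1 (abridged)
            case 1: out += "<" + cur.conv; break;         // node head tag
            case 2: out += " " + cur.conv + "="; break;   // attribute name
            case 3: out += "\"" + cur.conv + "\""; break; // attribute value
            case 4: out += "</" + cur.conv + ">"; break;  // end tag
            default: out += cur.conv; break;              // CDATA, punctuation and white space, ...
        }
        // Look-ahead per Table E4.2 (abridged): close a head tag with ">"
        // unless an attribute name immediately follows.
        bool openHead = (cur.typeId == 1 || cur.typeId == 3);
        bool nextIsAttrName = (i + 1 < seq.size() && seq[i + 1].typeId == 2);
        if (openHead && !nextIsAttrName) out += ">";
    }
    return out;
}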
Usage of the native storage
It will be observed that individual XML expressions may be efficiently compared.
Firstly, if two expressions (e.g. E1.1) contained in an XML file result in two identical sequences of the pairs (LexId, seqId), then they are identical.
This is applicable to portions of any size in the XML file: if two different portions of an XML file result in two identical sequences of the pairs (LexId, seqId) in the sequence 34, then they are identical.
Although this might apply to arbitrary sections in an XML file, the segments of interest are generally delimited by the node mark-ups, e.g. from "<book" to "</book>". As noted, this is called an XML object or element. Thus, the mechanism as proposed will mostly be used to compare XML objects of any size, i.e. having any desired level of embedded objects.
In this discussion on the comparison, reference has been made to the contents of the source file. However, this is only to facilitate understanding. Comparisons only involve the sequence, and generally do not even necessitate reverting to the lexicon to find the strings associated with each LexId.
Now, one major usage of storing XML data is to provide fast responses to various kinds of queries on these data. Two popular standards of queries for XML files are XPATH and XQUERY. These are commonly processed by a "query parser" whose role is to analyze the query and convert it into standard query expressions.
For example, in the prior art where XML data are stored in a relational database, a query expressed in XPATH or XQUERY form would be converted into an SQL expression, for application to the relational database.
The proposed system operates differently. With reference to figure 9, the Query Processing comprises, for an input XML query 90:
- parsing the query to determine its components; this is done at block 91, using lexicon 33;
- pre-processing the query components; this is done at block 92, using posting lists 37, and/or other indexes;
- the query component resolution; this is done at block 95, using sequence store 34;
- then the query result is made available in compact code at 97;
- the query results in XML are obtained at 99, by converting the results at 97, using the regenerate function at 98.
The phases shown in figure 9 are logical, and thus given for explanatory purposes. As known to those skilled in the art, they may largely overlap each other, or even be merged, partially or totally.
This will first be described using a very simple example.
Parsing a Query (example)
Assume for example an XPATH query of the form /bib/book[@year="1994"], which means: "find the books published in 1994".
The query parser will first search in the lexicon the LexId for "book", as a node. It will find LID1 = LexId(TT_element, "book") = 1 0000008. It will also search the LexId for "year" as an attrName, and find LID2 = LexId(TT_attrname, "year") = 2 0000009. It will further search the LexId for "1994" as an attrValue, and find LID3 = LexId(TT_attrvalue, "1994") = 3 000001F.
In this simple example of query, had one of the searches failed, this would mean that the result is "nil".
Query Preprocessing and resolution without posting (example)
This is made using the sequence. The preprocessor will search LID1 in the sequence. This will output the seqIds (positions) 8 and 31, and node level 1.
Now that the seqIds for both occurrences of "book" are known, their respective scopes may be determined using the mechanism of figure 7, starting from operation 720. Both scopes are scanned for possibly finding an attrName "year" followed by the attrValue "1994" (in fact, this may be done within the scanning loop of figure 6). Only the second occurrence of "book" meets that condition.
Query Preprocessing and resolution using posting (example)
Searching the LID1 in the sequence takes a time directly proportional to the length of the sequence, i.e. the length of the source XML file (in terms of number of strings). Posting high level nodes considerably reduces this time.
Now considering in the above example that "/bib/book" is posted, the posting list for "/bib/book" directly indicates (see table E3.4):
- seqIds 8 and 31 for "/bib/book" itself (two occurrences);
- seqIds 8 and 31 for "year" within any occurrence of "/bib/book";
- seqId 31 for "1994" within any occurrence of "/bib/book".
By simply intersecting the lists of seqIds, the machine directly concludes that the only correct result is at seqId 31. The gain in response time may be considerable, especially where the size of the file is large.
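The intersection itself is a classical linear merge of ascending lists, e.g. (a sketch):

#include <vector>

// Intersects two ascending lists of seqIds in linear time.
std::vector<int> intersect(const std::vector<int> &a, const std::vector<int> &b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (a[i] > b[j]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}
// With the lists of Table E3.4: intersect({8, 31}, intersect({8, 31}, {31})) yields {31}.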
Still more time may be gained where a "sorted index" on attribute "year" has been used: then, the searches for LID2 and LID3 are replaced with scanning the sorted index for the path '/bib/book/@year' and exploring the offsets (or seqIds) in that sorted index, until a value equal to or greater than 1994 is met.
Various posting lists and associated sorting indexes may be built to match the logic of the various possible constructs of requests existing e.g. in XPATH or other query languages.
Output of Query result
Then, the result will be available, in terms of LexIds in the sequence. The response will have the format: LexId(book), LexId(title), etc., which the lexicon converts back to the original XML language: book, title, etc.
Assuming now that the author's name is desired, the lexicon will be called to convert the LexIds having refs 16-24 in Table E3.1 back into their original strings.
Exhibit E5, taken in conjunction with figures 11, 12, 13, 14 and 15, contains a description of the design and implementation details of an exemplary embodiment of this invention.
The processing of queries will now be considered more generally.
Parsing Queries (general)
The QUERY parser basically creates an internal representation of the query, using the Lexicon 33. The parser may support the XPATH syntax, XQUERY, and other query formats.
An exemplary XQUERY may be as follows:
<xquery>
for $b in document('SAMPLE.xml')/bib/book
where $b[@year="1994"]
return { $b/../title }
</xquery>
This query is similar to the above Xpath query, and may be processed the same way.
The XQUERY request language may be viewed as similar to XPATH, plus the possibility of processing additional statements like those in the so-called "FLWOR" group, i.e. : FOR , LET, WHERE, ORDER BY, RETURN
These additional statements may be processed using the sequence, the lexicon, and the posting lists with their sorting indexes. The processing may use dedicated functions, which may be built into the system, user-defined, or even plugged in, as desired. It must be emphasized that the sequence, the lexicon, and the posting lists with their sorting indexes (or a portion of them) are a powerful tool enabling the system to be adapted to various extensions of existing query languages, and/or to new query languages, as appropriate.
Query Preprocessing (general)
A query normally has several statements or components involving various nodes. Together, they define conditions (a filter) in the set of information contained in the XML data. The query may also define the nature and format of the results, unless a default format for the results is accepted.
All that may be obtained from the posting lists is considered first. The query is then ordered, i.e. rearranged so that the most restrictive condition is considered first. As illustrated above, this may be done by considering the most constrained one of the tokens in the posting lists (or using other index tables, if any).
With large size XML files, it will likely be necessary to also consider lower level nodes, which are not posted. Then:
a. the scanning for these lower level nodes is restricted to the scope of the higher level posted nodes, because these lower nodes will have their own posting list in the higher level path being posted;
b. the scope of the lower level node then has to be determined, using the process of Figures 6 and/or 7; once obtained, the scope may be "cached", as described;
c. then the conditions (if any) on the attribute(s) of the lower level node(s) are applied within the scope of the node.
Query Resolution
The Query Resolution may use conventional techniques of result combinations in query resolution engines.
Generally, a query may be decomposed into a plurality of alternative sub-queries (combined together by a logical "OR"), while the query conditions in each sub-query are conjunctive (combined together by a logical "AND"). The sub-queries may be ordered, beginning with the most restrictive ("constrained") ones of the conditions. Then, the conditions may be resolved one by one, so that the set of compact data to be finally examined in detail becomes smaller and smaller. This is done within each conjunctive sub-query. Then, the results are gathered together ("ORed"), and possible duplicates are eliminated.
This gives the final results in the native code. These are converted back to XML, as explained above.
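For illustration, resolving one conjunctive sub-query whose conditions have each been turned into an ascending list of seqIds (e.g. from the posting lists) might be sketched as:

#include <algorithm>
#include <iterator>
#include <vector>

// Intersects the condition lists, shortest (most restrictive) first,
// so the intermediate result shrinks as fast as possible.
std::vector<int> resolveAnd(std::vector<std::vector<int> > lists) {
    if (lists.empty()) return std::vector<int>();
    std::sort(lists.begin(), lists.end(),
              [](const std::vector<int> &a, const std::vector<int> &b) { return a.size() < b.size(); });
    std::vector<int> result = lists[0];
    for (std::size_t k = 1; k < lists.size() && !result.empty(); ++k) {
        std::vector<int> next;
        std::set_intersection(result.begin(), result.end(),
                              lists[k].begin(), lists[k].end(),
                              std::back_inserter(next));
        result.swap(next);
    }
    return result;   // the "OR" step then unions such results and removes duplicates
}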
The above mechanisms may apply to all metalanguages having tags or markups that convey some particular meaning or syntax information. They have particular interest with metalanguages whose rules result in a metalanguage file having one or more associated tree structures. Examples are XML, SGML and their derived languages. However, this description is not limited to these languages. For example, the query results may be delivered in a different language, e.g. HTML, as well.
The above description has presented various functions applicable to the sequence, and/or to the lexicon. These functions may be applicable beyond the context in which they have been described. In particular, as noted, all functions involving a partial or total scan of the sequence may be combined to cooperate together. The correct place to put the functions in a scanning process, to combine them as desired, is considered as accessible to those skilled in the art on the basis of the information given in this specification. Also, the arrangement of the functions may depend upon the particular metalanguage being considered. More generally, many of the above mentioned features may be used in different combinations from those being described in detail.
Meta-languages like XML may be viewed in different ways. Initially, they have been presented as texts or documents, which are supplemented with add-ons (tagged mark-ups), for being easily dealt with by a computer in a standard fashion. However, since it is readily understandable by a computer, a metalanguage may be presented as a programming language as well, at least partially. This does not preclude the application of the teachings described in this specification.
Exhibit E1 - General
E1.1 - A basic XML node expression (or object) Ni
[Figure image omitted; from the coding in E1.3, the expression Ni may be read as: <price currency="euro">22.5</price>]
Table E1.2 - determination of type and type code (typeId)
[Table image omitted: it maps each token type (element, attribute name, attribute value, end tag, CDATA, punctuation/white space, etc.) to its typeId code.]
E1.3 - coding expression Ni
U1 = LexId("TT_ELEMENT", "price")
U2 = LexId("TT_ATTNAME", "currency")
U3 = LexId("TT_ATTVAL", "euro")
U4 = LexId("TT_CDATA", "22")
U5 = LexId("TT_CDATAPWS", ".")
U6 = LexId("TT_CDATA", "5")
E1.4 - a sequence for expression Ni
Seq(Ni) = {U1, U2, U3, U4, U5, U6}

Exhibit E2 - XML Sample Document: "sample.xml"
<?xml version="l.0" standalone="yes" ?> <bib> <book year="1992"> <title>TCP/IP</title> <author> <last>Stevens</last> <first>w. </first> </aut or> <publisher>Addison esley</publisher> </book> <book year="1994"> <title> <! [ CDATA [<Web>] ]> </title> <author> <last>Abitboul</last> <first>Serge</first> </author> <price>24.95</price> </book> </bib>
Exhibit E3 Table E3.0 Parsing - Determining types
[Table image omitted: refs 1-55, listing each parsed token of sample.xml with its determined type and typeId.]
Exhibit E3 Table E3.1 Lexicon
[Table image omitted: refs 1-43, listing the lexicon entries, i.e. the strings with their LexIds.]
Exhibit E3 Table E3.2 Sequence
[Table image omitted: refs 1-55, listing the sequence of LexIds representing sample.xml.]
Exhibit E3 Table E3.3 Using Sequence & Lexicon
[Table partly image, partly flattened text: for each token of the sequence (bib, book, year, 1992, title, TCP, /, IP, ..., price, 24, ., 95), the posting column for "/bib/book" gives the seqId of the head element of the enclosing book node: 8 for the tokens of the first book, 31 for those of the second.]
Exhibit E3 Table E3.4 POSTING
Posted path: /bib/book, LexId_P = 1 0000008

Ref  String      LexId (hexa)  Posting list
1    bib         1 0000007     [nil]
8    book        1 0000008     8,31
9    year        2 0000009     8,31
10   1992        3 000000A     8
11   title       1 000000B     8,31
12   TCP         8 000000C     8
13   /           9 000000D     8
14   IP          8 000000E     8
15   title       4 000000F     8,31
16   author      1 0000010     8,31
17   last        1 0000011     8,31
18   Stevens     8 0000012     8
19   last        4 0000013     8,31
20   first       1 0000014     8,31
21   w           8 0000015     8
22   .           9 0000016     8,31
23   first       4 0000017     8,31
24   author      4 0000018     8,31
25   publisher   1 0000019     8
26   Addison     8 000001A     8
27   (space)     9 000001B     8
28   Wesley      8 000001C     8
29   publisher   4 000001D     8
30   book        4 000001E     [nil]
31   book        1 0000008     [bis]
32   year        2 0000009     [bis]
33   1994        3 000001F     31
34               5 0000020     31
35   (see 34)    5 0000020     31
36   <           9 0000021     31
37   Web         8 0000022     31
38   >           9 0000023     31
39               6 0000024     31
40   title       4 000000F     [bis]
41   author      1 0000010     [bis]
42   last        1 0000011     [bis]
43   Abitboul    8 0000025     31
44   last        4 0000013     [bis]
45   first       1 0000014     [bis]
46   Serge       8 0000026     31
47   first       4 0000017     [bis]
48   author      4 0000018     [bis]
49   price       1 0000027     31
50   24          8 0000028     31
51   .           9 0000016     [bis]
52   95          8 0000029     31
53   price       4 000002A     31
54   book        4 000001E     [nil]
55   bib         4 000002B     [nil]
(Ref 34 is "title", 1 000000B, [bis]; refs 35 and 39 are CDATA section markers with no string.)

Exhibit E4 Tools for regeneration
Table E4.1 Response to the current LexID itself
[Table image omitted: for each typeId, the head string to emit before Conv() when regenerating.]
Table E4.2 Additional response (if any) considering the next LexID
[Table image omitted: for each pair (current typeId, next typeId), the additional string to emit, or an "error" indication.]
Exhibit E5 - Design and Implementation of the XPeerion Inverted Index
E5.1 Introduction
This document contains a description of the design and implementation details of an inverted index, or postings index, that was developed for the Agilience Group's XPeerion XML database. In general, an inverted index is a data structure (and file structure) that is used to support efficient querying on a large data set. The data set itself is assumed to contain a large number of documents, each containing some number of tokens. The query operation returns the set of documents that contain a given token. In this implementation, tokens are identified by a unique LexID and documents by a unique value called the PostingsID. The PostingsID is a 32 bit quantity and can be generically interpreted as a location. The PostingsID can represent a document ID, an element ID (in the XML case) or a file offset, but the meaning of the PostingsID is opaque to the inverted index implementation; it only assumes that a given PostingsID "contains" some number of LexIDs.
E5.2 Design Goals
The inverted index was designed to index XML data and it supports that by using 4 bits of the LexID as token type. In the XML case token types could be element start, element end, CDATA, etc. The remaining 28 bits allow indexing 268,435,455 unique tokens. For non XML data or non typed data the type can be set to 0.
Scalability. One goal was to support large sets of data, up to 10 GB of XML data, which roughly maps to a data set containing 1 billion tokens. According to Heap's Law this means a unique token set of probably 750,000 to 2 million tokens.
Fast index build time. Hard to quantify, but something linear based on data set size.
Fast access time. Querying the postings data for a given token requiring one or no disk accesses.
The ability to dynamically update the inverted index. Being able to add postings to or delete postings from the previously built index with good performance.
Small index size. Maintaining a small footprint for both in memory and on disk data structures.
E5.3 Data and File Structures
E5.3.1 LexID
The LexID is a 32 bit quantity that is represented as a 4 bit token type field and a 28 bit index field. The LexID is assigned to a token by the Lexicon when the token is inserted into the Lexicon and the index value itself is unique across all LexIDs. The type field is normally used to represent XML types and really isn't used by the Inverted Index. It uses only the index field to access LexIDs in the Inverted Index.
LexID: | type (4 bits) | index (28 bits) |
The Lexicon assigns the index values of a LexID sequentially, starting with 0, as tokens are inserted into the Lexicon. Except in cases where a token has been removed from the Lexicon/Inverted Index, it assumes that they are contiguous numbers, with the highest index value belonging to the last token inserted.
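For illustration, the 4/28-bit split may be manipulated as follows (a C++ sketch; the shift layout matches the hexadecimal notation of Table E3.4, where the type is the leading hex digit):

#include <cstdint>

typedef uint32_t LexID;

// Top 4 bits: token type; low 28 bits: index assigned by the Lexicon.
inline LexID makeLexID(uint32_t type, uint32_t index) {
    return (type << 28) | (index & 0x0FFFFFFFu);
}
inline uint32_t lexIDType(LexID id) { return id >> 28; }
inline uint32_t lexIDIndex(LexID id) { return id & 0x0FFFFFFFu; }

// e.g. makeLexID(1, 8) == 0x10000008, the LexId "1 0000008" of the element "book" in Table E3.4.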
E5.3.2 Inverted Index Array
The Inverted Index Array is logically a memory resident linear array that contains references to postings data for each LexID either on disk, in memory, or both. The array is represented by an ExpandableArray object which grows dynamically without reallocating or copying memory.
Earlier implementations used a B-Tree to represent the Inverted Index but at that time LexIDs were being assigned by the parsing process (not the Lexicon) and the distribution of index values was spread widely across the index space. The array implementation is two or three times more memory efficient than the B-Tree if the LexID index values are fairly contiguous and it has faster access times.
One of the rationales for using a B-Tree is that it is relatively easy to swap the external nodes (the ones that actually contain data) to and from disk, and that only the internal nodes have to be memory resident. Performing queries on an existing B-Tree index then requires at most one disk access. However, that query only returns the disk location of the postings data, and it may require another disk access to retrieve the actual postings data. The difficulty is in building the B-Tree index. Experience shows that building a large B-Tree index in a relatively small memory footprint requires a lot (need some quantity here) of disk swapping.
However, if the range of LexID index values are not contiguous and allocated across a large set of values then a hashing index scheme is probably more appropriate. In either case though, the array or hashing index is probably a better choice than a B-Tree.
Each element of the Index Array is 16 bytes long and is represented by the InvertedIndexNode object described below. So, an Inverted Index containing one million unique tokens requires 16 MB of memory (or disk space when saved to disk). The index is always memory resident in its entirety while running, and is saved to disk, if modified, in response to a save request or when the InvertedIndex object is closed.
E5.3.2.1 ExpandableArray
An ExpandableArray object is used to represent the Invertedlndex Array and it logically appears to be a linear array. It starts with no memory allocated and as items are inserted it "grows" in 64KB increments. It doesn't reallocate and copy memory as it grows but instead allocates a new buffer which is added to a list. The API maps an index value to the appropriate buffer location.
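A minimal C++ sketch of this behaviour, i.e. chunked growth without reallocating or copying (names and details are assumptions):

#include <cstddef>
#include <vector>

template <typename T>
class ExpandableArray {
    static const std::size_t kChunkBytes = 64 * 1024;     // grows in 64KB increments
    static const std::size_t kPerChunk = kChunkBytes / sizeof(T);
    std::vector<T*> chunks;                               // existing buffers are never moved
public:
    ~ExpandableArray() { for (std::size_t i = 0; i < chunks.size(); ++i) delete[] chunks[i]; }
    T &at(std::size_t index) {                            // maps an index to its buffer location
        std::size_t chunk = index / kPerChunk;
        while (chunk >= chunks.size()) chunks.push_back(new T[kPerChunk]());
        return chunks[chunk][index % kPerChunk];
    }
};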
E5.3.2.2 Inverted Index Node
The same InvertedIndexNode structure is used both when the Inverted Index is memory resident and when it has been saved to disk. However, through the use of unions, the fields can have somewhat different meanings when accessing an existing index as opposed to building a new one. It is also slightly different if the postings data is stored in the postings file or if the postings are resident in the index itself.
For example, the InvertedIndexNode for a LexID that has postings data stored in the postings file would look like this:
InvertedIndexNode: | PostingsCnt | RecCnt | RecAddr | *PriorityQueueNode |
The PostingsCnt is a 31 bit quantity specifying the number of PostingsIDs that are stored on disk for this node. The high order bit, the location bit, is cleared to indicate that the postings data is stored on the disk rather than in the index itself.
RecCnt is the number of records allocated for the postings data, and RecAddr is the record address of the postings data on disk. PriorityQueueNode is a pointer to a PriorityQueueNode struct and is the location of the postings if they are also resident in memory. When this InvertedIndexNode is written to disk, PriorityQueueNode is set to NULL.
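A C++ sketch of the two forms of the node (the union arm is selected by the location bit; field widths follow the text, and the 16-byte size assumes 32-bit pointers as on the Win32 platform described):

#include <cstdint>

typedef uint32_t PostingsID;
struct PriorityQueueNode;           // described with the Priority Heap, below

struct InvertedIndexNode {          // 16 bytes with 32-bit pointers
    uint32_t postingsCnt;           // bit 31 = location bit (set: postings held inline), bits 30..0 = count
    union Payload {
        struct OnDisk {
            uint32_t recCnt;        // 16-byte records allocated in the postings file
            uint32_t recAddr;       // record address within the postings file
            PriorityQueueNode *pqNode;  // in-memory copy of the postings, or NULL
        } onDisk;                   // location bit cleared
        PostingsID inlinePostings[3];   // location bit set: up to three postings in the node
    } u;
};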
E5.3.2.2.1 Low Frequency LexIDs
In practice, the distribution of tokens often follows power laws such as Zipf's law, which show that a few tokens occur with a very high frequency and that many occur with an extremely low frequency. The design of this Inverted Index leverages that by storing the large number of low frequency tokens directly in the InvertedIndexNode. Specifically, when a LexID has three or fewer postings, those postings are stored in the InvertedIndexNode.
To indicate this, the location bit of PostingsCnt is set and the InvertedIndexNode looks like this:
InvertedIndexNode: | PostingsCnt | Postings1 | Postings2 | Postings3 |
In this case there are no postings data for the LexID stored on disk, and querying the postings for the LexID returns the data directly from the index itself, so there are no disk accesses required to retrieve the data. This potentially provides a substantial performance enhancement if the frequency distribution of the tokens across PostingsIDs models a Zipfian distribution. According to Zipf's Law, one half of the unique tokens should have a frequency of one, two-thirds a frequency of two or less, and three-quarters a frequency of 3 or less. So, if the actual frequency distribution is Zipf-like, then queries for postings on three-quarters of the unique tokens would require no disk access. When a postings entry is added to a full node, one with PostingsCnt = 3, then the existing postings data along with the new entry are stored to disk, the location bit is cleared, and RecAddr and RecCnt will reference the postings on disk.
E5.3.2.2.2 High Frequency LexIDs
One special case is for high frequency LexIDs, tokens that occur in a large number of locations. In some cases such high frequency LexIDs are just "noise" and do not provide any useful information when searching data. In a Zipf-like distribution, the six highest frequency tokens can be expected to account for 20% of all token occurrences. In English text these are usually the words: the, of, to, and, in, is.
It's possible to specify a threshold percentage at the time the index is built (see the BuildIndexEnd Phase below) so that LexIDs which occur in over some percentage of locations are ignored. By specifying a threshold of 80%, for example, the postings for the top six words are eliminated from the system. To indicate that in the InvertedIndexNode, the location bit is set to 1, the PostingsCnt is set to one, and the PostingsID stored in the index is set to the NullPostingsID value (0xffffffff).
E5.3.3 Postings File
Postings data is written to the Postings file using large, sequential, buffered writes when creating the Postings file, and accessed using random access reads when querying postings data. Space is allocated in the postings file in records of 16 bytes (4 PostingsIDs); the primary reason for this is to allow file sizes larger than the 2GB normally supported by Windows. The goal was to be able to support 64 bit disk addresses, and thus larger file sizes, without having the requirement to store 64 bit addresses in memory resident structures. So, when using a 16 byte record as the allocation unit, the 32 bit RecAddr value stored in the InvertedIndexNode structure actually allows accessing a file up to 64GB in size. PostingsIDs within the Postings file and in the Inverted Index are stored contiguously and are sorted in ascending order for each LexID. The Sample Inverted Index and Postings shown in Figure 11 shows postings stored both in the Inverted Index and in the Postings file.
The index entry for index 0 has PostingsCnt=4 which are stored starting at RecAddr=0 in the postings file and it has one record allocated in the postings file.
Index 1 has three postings stored in the index itself. Note that the high bit of PostingsCnt is set.
Index 2 has 7 postings stored in 2 records in the postings file starting at RecAddr=l.
Index 3 is a case of a high frequency LexID; it has one posting stored in the index, and that posting has the value 0xffffffff, the NullPostingsID.
Index 4 has four postings stored in the postings file starting at RecAddr=3. It has two records allocated. Possibly it had 5 postings and one was deleted.
Entry 5 shows two postings stored in the postings file at RecAddr=6. It probably had four postings and two were deleted. Even though there are fewer than four postings, they aren't moved back to the index after deletions.
The record at RecAddr=5 is unused, unallocated and can't be reclaimed. This would occur if the index entry that initially allocated this record later grew beyond 4 postings, so that the data was relocated to the then-current end of the postings file. The other scenario would be that two of the postings for the index entry that originally allocated this record have been deleted.
E5.3.4 Priority Heap
When postings data is loaded from disk to perform queries or updates, the memory used is allocated from a Priority Heap object. The goal of the Priority Heap is to maintain only the most recently used postings data in a finite, relatively small amount of memory. When it is necessary to allocate memory from a "full" heap, enough of the least recently used memory allocations are freed so that there is enough space available to make the new allocation. When freeing memory from the heap, postings data that has been modified is written back to disk.
The current implementation allocates memory from the system heap and tracks the amount of memory allocated. A limit or maximum memory value is passed as a parameter to the Postings object when it is created, and once the limit is reached memory is deallocated as needed. Previous implementations used a "private heap", which in Win32 is created using a HeapCreate() call. In Win32 there is a limit to the largest memory buffer that can be allocated from a "private heap", and that limit is slightly less than 512KB. That would mean a maximum of approx. 128,000 postings entries for a given LexID; while it isn't likely to hit that limit, it is conceivable in some situations, so the implementation was changed to use the system heap.
The Priority Heap contains a Priority Queue which is a linked list of Priority Queue Nodes. The Priority Queue Node is used as a header to the allocated memory buffer and contains pointers to the other nodes in the Priority Queue. Each time a node is accessed, either for queries or updates, it is moved to the front of Priority Queue so that at any given time the most recently used nodes are at the front of the list.
E5.3.4.1 Priority Queue
The Priority Queue itself consists of pointers to the first and last nodes in the queue and an API for moving nodes within the queue and removing nodes from the queue.
E5.3.4.2 Priority Queue Node
[Figure image omitted: the Priority Queue Node layout, with previous/next pointers, the index number, the size field, the dirtyFlag, and the postings data buffer.]
Each PriorityQueueNode contains previous and next pointers. It also contains the index number of the InvertedIndexNode that it belongs to, and a size field which specifies the number of bytes allocated for postings data. The dirtyFlag, if set, indicates that the postings data has been modified, either by updates or deletions, since it was loaded into memory. If it is set, then the postings data needs to be written back to disk and the InvertedIndexNode updated when the node is deallocated. Otherwise the node can simply be deallocated without any other action.
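A compact C++ sketch of this most-recently-used discipline (std::list stands in for the intrusive linked list; the write-back is left as a stub):

#include <cstddef>
#include <cstdint>
#include <list>

struct PQEntry {               // illustrative stand-in for a Priority Queue Node
    uint32_t indexNo;          // owning InvertedIndexNode
    std::size_t size;          // bytes allocated for postings data
    bool dirtyFlag;            // postings modified since loaded
};

class PriorityHeap {
    std::list<PQEntry> queue;  // front = most recently used
    std::size_t used;
    std::size_t limit;
public:
    explicit PriorityHeap(std::size_t maxBytes) : used(0), limit(maxBytes) {}
    // Each access moves the node to the front of the queue.
    void touch(std::list<PQEntry>::iterator it) { queue.splice(queue.begin(), queue, it); }
    // Frees least recently used nodes until `need` bytes fit under the limit.
    void reserve(std::size_t need) {
        while (used + need > limit && !queue.empty()) {
            PQEntry &victim = queue.back();
            if (victim.dirtyFlag) { /* write the postings back to disk here */ }
            used -= victim.size;
            queue.pop_back();
        }
        used += need;
    }
};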
The example in Figure 12 shows elements 0, 3, and 5 of the inverted index with allocated Priority Queue Nodes. The next node to be deallocated (if necessary) is the one associated with element 5, the last one in the queue. If the dirty flag is set for that node, then a Store request is made to the InvertedIndexNode for element 5, and it will write the updated postings to disk before the memory for the Priority Queue Node is freed.
E5.4 Operations
The Postings API supports three primary types of operations: index creation, index updates, and index queries.
E5.4.1 Index Creation
Index Creation refers to the process of building a new index for a data set; the functions to create it are invoked by the parser while the data set is being processed. Earlier implementations accepted as input a "sequence file" which represented the entire data set as tokenized XML data. The format of this sequence file was proprietary, and using it forced some interdependencies between the Inverted Index and the parsing process. To remove dependence on the file format and to achieve independence from the parsing process, the API has been modified. However, it might be desirable, for performance reasons, to provide alternate APIs that accept some type of sequence file as input.
E5.4.1.1 Index Creation API
bool BuildIndexStart();
bool BuildIndexAddPosting(LexID lexID, PostingsID postingsID);
bool BuildIndexEnd(DWORD dwMaxMemory, double saveThreshold = 0.80);
After the Postings object is created by calling the CPostings() constructor, BuildIndexStart() is invoked once to initialize some data structures. The function BuildIndexAddPosting() is then called repeatedly, once for each LexID found in the data set. A requirement while building the index is that all LexIDs for a given PostingsID are added before any are added from the next PostingsID, and that the PostingsIDs are in order. This means that all LexIDs for postingsID=0 are added before adding any for postingsID=1. The postingsIDs don't have to be sequential but do have to be in ascending order. The function BuildIndexEnd() is called after all postings have been added, and the result is the creation of the postings file. The dwMaxMemory parameter, given in bytes, specifies the amount of memory buffer space to use while processing postings data. This is space used in addition to the Inverted Index Array and is deallocated once the postings file is complete. The saveThreshold parameter is used to remove high frequency postings from the index. If a LexID occurs in over a specified percentage of PostingsIDs (the default is 80%), then the postings for that LexID are not included in the postings file.
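A usage sketch of this calling convention (the CPostings declaration below is a minimal stand-in with stub bodies, not the actual header):

#include <cstdint>

typedef uint32_t LexID;
typedef uint32_t PostingsID;
typedef unsigned long DWORD;   // as in the Win32 environment assumed by the API

struct CPostings {             // minimal stand-in; the real implementation is per this Exhibit
    bool BuildIndexStart() { return true; }                                  // stub
    bool BuildIndexAddPosting(LexID, PostingsID) { return true; }            // stub
    bool BuildIndexEnd(DWORD, double saveThreshold = 0.80) { return true; }  // stub
};

void buildIndexExample(CPostings &postings) {
    postings.BuildIndexStart();
    // All LexIDs of a given PostingsID before the next; PostingsIDs ascending:
    postings.BuildIndexAddPosting(0x10000008, 0);
    postings.BuildIndexAddPosting(0x20000009, 0);
    postings.BuildIndexAddPosting(0x10000008, 1);
    // 64MB working buffer; drop LexIDs occurring in over 80% of locations.
    postings.BuildIndexEnd(64UL * 1024 * 1024, 0.80);
}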
E5.4.1.2 Index Creation Processing Steps
E5.4.1.2.1 BuildlndexAddPosting Phase
The products of the BuildIndexAddPosting() phase are PostingsID frequency counts for each LexID and a temporary, intermediate sequence file. The frequency counts are stored in a LexID's InvertedIndexNode struct and reflect the number of PostingsIDs that are associated with that LexID, or in other words the number of PostingsIDs that contain a given LexID. The frequency counts are used to determine memory allocation requirements when later processing the postings data. The intermediate sequence file contains a series of PostingsIDs, each followed by the set of LexIDs "contained" by that PostingsID. Duplicate LexIDs for a given PostingsID are not saved, so the file only contains the set of unique LexIDs associated with a PostingsID.
Figure 13 shows a sample BuildIndexAddPosting() phase, the generated frequency counts, and the generated intermediate sequence file. What is important to note is how duplicate LexIDs for a given PostingsID are "collapsed", both in the frequency counts and in the intermediate sequence file. This is potentially a significant amount of data. A test case containing 115 million tokens (508,210 of them unique) having a Zipf-like distribution and spread across 115,000 PostingsIDs had 54 million duplicate LexIDs which would have been removed in this step.
E5.4.1.2.2 BuildIndexEnd Phase
During the BuildIndexEnd() phase, the first step is to create a number of segmented sequence files, where each segmented file contains all of the postings data for a range of LexIDs that can be processed completely using the memory buffer which was specified in the dwMaxMemory parameter. The process is to iterate through the InvertedIndexArray, adding frequency values to determine sequential series of LexIDs that can be processed with the buffer. The result is a temporary data structure containing a list of segments, each designated by a starting and ending LexID.
Once the segments have been determined, a segmented file is created and opened for each segment. The Intermediate Sequence file is then read. The LexIDs that belong to a segment, along with the "containing" PostingsID, are written to the appropriate segmented sequence file. See Figure 14, which shows the generating of sequences and segmented sequence files.
The first segment isn't actually written to a segmented file; instead, the data for the first segment is processed and written to the postings file while the other segmented files are being written. The contents of the segmented files will be similar to the contents of the Intermediate Sequence File shown in Figure 13, but each one will contain only the PostingsIDs and LexIDs that belong to the segment.
It is also during this phase that the high frequency LexIDs are removed. While calculating the segments, the high frequency LexIDs are ignored, and their counts aren't used in the memory usage calculations. Later, when writing the segmented files, those LexIDs are not written to the segmented files. So the segmented files contain only the data that will be processed for that segment.
After the segmented files are written, the intermediate sequence file is deleted. Each segmented file is then read, singly, and the postings contained in the file are written to the postings file. First, space is allocated from the postings buffer for the segment, then the PostingsIDs are moved into the per-LexID buffers. The postings data is written to the postings file once the complete segment has been processed, and the segmented file is then deleted.
E5.4.1.3 Index Creation Performance
During Index Creation, all disk I/Os are performed using large, buffered, sequential reads and writes, which provide very good performance compared to small direct access I/Os; overall performance is driven by the amount of data read and written. Since we process postings data in chunks that equal the dwMaxMemory parameter passed in with the BuildIndexEnd() function call, we can use that value to measure the amount of disk data being accessed. If we assume that we have an x sized buffer to process data (dwMaxMemory = x), then we can measure a file in terms of x and say that a file has n x-sized blocks.
In the process described above, the first step is to write the intermediate sequence file (n blocks), then to read the intermediate sequence file (n blocks) while writing n-1 segmented files (each one block in size), and then to read the n-1 segmented files. Note: the segmented file for the first segment is never actually created; the data for it is processed while reading the intermediate sequence file. So the total amount of disk I/O is (4n)-2 blocks.
Initially it would seem that performance could be improved to (3n)-2 by using an existing sequence file as input to the Index Creation process. The reality is that the initial sequence file would have had to be created by the parsing process, and whether it was constructed there or while indexing, the overall calculation is still (4n)-2.
An earlier implementation of this Inverted Index accepted as input a sequence file generated by the parser, and then read the file multiple times, processing x amount of data on each pass. So on each pass n blocks are read but only one block is processed, and it requires n passes to process all of the data. The total disk I/O would be n² + n blocks (including writing the initial sequence file) for that method. Table E5.1 compares the two methods for different values of n and shows the total number of blocks processed as disk I/O. It's obvious that as n gets larger (the buffer gets smaller in relation to the file size), the segmented file method scales better for larger file sizes. It also isn't affected significantly by the size of the buffer x used to process the postings data.
As an example, assume that x = 64MB and n = 16 which means that the sequence file is one GB (approx. 250,000,000 LexIDs). Using 12MB/sec. as the disk I/O speed it would take 330 seconds to create the index using segmented files as opposed to 1450 seconds when rereading the source sequence file. Reducing the buffer size x to 32 MB for the same size file (n = 32) requires 336 seconds versus 2816 seconds for the earlier method.
Table E5.1 Comparison of File Processing Methods for different values of n
[Table image omitted; from the formulas above, total disk I/O is (4n)-2 blocks for the segmented file method versus n²+n blocks for the rereading method, e.g. 62 versus 272 blocks for n=16, and 126 versus 1056 blocks for n=32.]
E5.4.2 Index Updates
Index Updates are operations that allow adding postings to or deleting postings from an existing inverted index. The AddPosting() call can also be used to create a new index, but is considerably slower than using the Index Creation routines: BuildIndexStart(), BuildIndexAddPosting(), and BuildIndexEnd().
E5.4.2.1 Index Updates API
bool AddPosting(LexID lexID, PostingsID postingsID);
void DeletePosting(LexID lexID, PostingsID postingsID);
void DeletePostings(LexID lexID);
The AddPosting() call first checks the InvertedIndexNode for the given LexID. If it is a low frequency node and there is space in the InvertedIndexNode, then it simply inserts the posting (in order) into the InvertedIndexNode.
If the postings are currently stored on disk and they aren't currently memory resident, then a PriorityQueueNode structure is allocated and placed at the head of the PriorityQueue. The space allocated is large enough to accommodate the new PostingsID plus the existing postings. After the existing postings data is loaded from disk into the PriorityQueueNode, the new PostingsID is inserted into the PriorityQueueNode. Since the PostingsIDs are stored contiguously in order, this may require a memory move to place it in the correct location.
If the existing postings data is already memory resident, then its PriorityQueueNode is moved to the head of the PriorityQueue and the PostingsID is inserted as above. If the PriorityQueueNode is full, the existing one is reallocated, which requires a memory allocation and a copy. Currently the size is increased in increments of 4 PostingsIDs. Previously, we allocated using a power of two (doubling each time), but at that time AddPosting() was being used to build the index. Since we are now adding to an existing index, it probably isn't appropriate to increase the size in such large chunks.
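The difference between the two growth policies can be made concrete. A sketch of both strategies (illustrative only; the function below is not part of the actual API):

// Next capacity for a postings array that has become full.
// Doubling minimizes reallocations during bulk index building;
// a fixed increment of 4 wastes less space during incremental updates.
int nextCapacity(int current, bool bulkBuild) {
    if (bulkBuild)
        return current * 2;   // earlier policy: power-of-two growth
    return current + 4;       // current policy: grow by 4 PostingsIDs
}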
The DeletePosting() call is similar to AddPosting() in that it will load postings from disk if necessary, and it requires a memory move if the deleted posting is somewhere other than the end of the list. It also moves the PriorityQueueNode to the head of the PriorityQueue. DeletePostings() deletes all postings associated with a LexID; it only has to set postingsCnt = 0 in the InvertedIndexNode and free the PriorityQueueNode if one is allocated.
E5.4.3 Index Queries
E5.4.3.1 Index Queries API
The following functions retrieve the postings, in ascending order, for a given LexID.
int GetPostingsCount(LexID lexID);
PostingsID *GetPostings(LexID lexID, int *pCount);
void FreePostings(PostingsID *postingsList);
The function GetPostingsCount() simply returns the number of PostingsIDs currently associated with a given LexID. This value is returned from the Inverted Index Array, so it doesn't require a disk access.
The function GetPostings() returns a pointer to a list of the PostingsIDs associated with a given LexID. The PostingsIDs are returned in ascending order, and the variable referenced by the pCount pointer will contain the number of postings in the list. FreePostings() should be called to free the memory allocated by the call to GetPostings().
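A typical call sequence for this API might look as follows (a sketch only; it assumes the declarations above together with suitable LexID and PostingsID definitions):

#include <cstdio>

// Print all postings for a LexID using the query API declared above.
void printPostings(LexID lexID) {
    if (GetPostingsCount(lexID) == 0)   // answered from the Inverted
        return;                         // Index Array, no disk access
    int count = 0;
    PostingsID *list = GetPostings(lexID, &count);   // ascending order
    for (int i = 0; i < count; i++)
        std::printf("%u\n", (unsigned) list[i]);
    FreePostings(list);   // release the memory allocated by GetPostings()
}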
E5.5 Scalability
The original design goal of this Inverted Index implementation was to support the indexing of 10 GB of XML data, and it appears to meet that goal. The question now is whether larger data sets, up to 100 GB, can also be supported using the same algorithms. The conclusion is that they probably can, as long as the larger data set does not contain too many unique tokens. From a performance standpoint, the algorithms described here either scale linearly with data set size (Index Creation) or are indifferent to it (Updates and Queries). The open questions are whether, as data sets get larger, the memory-resident data structures become too large for a reasonable memory configuration, and whether adequate disk space is available for the intermediate and segmented sequence files.
E5.5.1 Memory Requirements
Some assumptions: 10 GB of XML data is roughly equivalent to one billion tokens (assuming 10 bytes of XML data per token), and therefore 100 GB of data contains roughly 10 billion tokens. The Inverted Index Array, which is always memory resident in its entirety, requires 16 MB of memory per one million unique tokens (LexIDs) in the system; an index containing 10 million unique tokens therefore requires 160 MB of memory. It is not clear what the acceptable upper limit for memory usage is: 160 MB may be acceptable for a server-class machine, but possibly not for a user's desktop or laptop machine. The problem then is estimating the number of unique tokens that might exist for a given data set size.
E5.5.1.1 Heaps' Law
Heaps' Law can be used to predict the vocabulary size of a text collection; it defines vocabulary growth (the number of unique tokens) as a function of the overall text size (the total number of tokens):
V = K·n^β
where V is the size of the vocabulary (the number of unique tokens), n is the size of the text or data set in tokens, and K and β are empirically derived for a given data set. Typical values are 10-100 for K and 0.4-0.6 for β. Table E5.2 shows calculated V values for some sub-collections of the Text Retrieval Conference (TREC) database; these are English-language text collections for which the values of K and β have been calculated. The row with n = 1,000,000,000 is equivalent to a data set containing 10 GB of XML data and shows an expected range of unique tokens from 232,016 to 1,178,418. The row with n = 10,000,000,000 is equivalent to a 100 GB data set and has a range of 639,024 to 3,813,285 unique tokens.
Given the worst-case value of fewer than 4 million unique tokens for 100 GB of data, it is reasonable to expect this amount of data to be manageable, because the Inverted Index Array would only require 64 MB of memory. It is also important to note that these values are for very specific data sets; K and β for other data could vary significantly.
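Combining Heaps' Law with the 16 MB per million LexIDs figure from section E5.5.1 gives a quick way to estimate the Inverted Index Array's memory needs for a planned data set. A sketch under those assumptions (K and β must be fitted to the actual collection; the function name is illustrative):

#include <cmath>
#include <cstdio>

// Estimate the vocabulary via Heaps' Law, V = K * n^beta, and the
// resulting Inverted Index Array size at 16 MB per million LexIDs.
void estimateIndexMemory(double n, double K, double beta) {
    double V = K * std::pow(n, beta);        // expected unique tokens
    double megabytes = (V / 1.0e6) * 16.0;   // 16 MB per million LexIDs
    std::printf("n=%.0f tokens: V=%.0f unique, ~%.0f MB resident\n",
                n, V, megabytes);
}
// e.g. estimateIndexMemory(1.0e10, K, beta) with fitted K and beta.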
It has also been shown that the number of new tokens occurring in web pages does not taper off as the volume of data increases, as it does for text; even after gigabytes of data have been processed, the vocabulary continues to grow. One study found 2 billion word occurrences in 45 GB of web data, containing 9.74 million unique tokens in 5.5 million documents. This suggests that the current implementation could not support large data sets (greater than 50 GB) of web-based data.

Table E5.2 Calculated V for some values of n on various TREC data sets
(Table E5.2 lists the calculated V values for several TREC sub-collections with their empirically fitted K and β; the ranges quoted above for n = 1,000,000,000 and n = 10,000,000,000 are taken from that table.)
E5.5.2 Disk Requirements
We also have to look at the amount of disk space, the file sizes, and the number of open files required to process larger data sets. Assume that 100 GB of data consists of 10 billion tokens and that those tokens are distributed across a large set of documents or locations, say 1,000 tokens per document or location. The intermediate sequence file would then require 40 GB for token space (10 billion tokens at 4 bytes per token) plus an additional 40 MB for postings IDs (10 billion / 1,000 = 10 million postings IDs at 4 bytes each). The token space would actually be smaller, because duplicate tokens within a document are not written to the intermediate file; perhaps it would be 20 to 25 GB.
The current implementation can handle a 64 GB file size, so the file size itself should not be a problem; it is also configurable (at compile time) to support larger files if necessary. If the dwMaxMemory buffer (see the BuildIndexEnd phase above) is set to 64 MB, then a 25 GB intermediate sequence file would require 400 segmented sequence files to be open simultaneously, which is within operating system (Linux and Win32) limits. There is also some memory overhead: assuming a 64 KB buffer per file, an additional 25 MB of file buffer space is required.
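The file-count and buffer-overhead arithmetic generalizes directly. A sketch (the 64 KB per-file buffer is the assumption used in the text; the function name is illustrative):

#include <cstdio>

// Estimate the number of simultaneously open segmented files and the
// file-buffer overhead, given the intermediate sequence file size and
// the dwMaxMemory setting (both in MB).
void estimateFileOverhead(double intermediateMB, double dwMaxMemoryMB) {
    int files = (int)((intermediateMB + dwMaxMemoryMB - 1.0) / dwMaxMemoryMB);
    double bufferMB = files * 64.0 / 1024.0;   // one 64 KB buffer per file
    std::printf("%d segmented files, ~%.0f MB of file buffers\n",
                files, bufferMB);
}
// e.g. estimateFileOverhead(25.0 * 1024.0, 64.0) -> 400 files, ~25 MB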
It looks like 100GB of data could be supported as long as 50 to 60 GB of temporary file space is available for intermediate and segmented files.
E5.5.3 Scaling Down
Although the focus of this Inverted Index design has been support for very large data sets, it will also handle small data sets with a very small memory footprint. The memory-resident Inverted Index Array grows in chunks of 64 KB, so a small index supporting, for example, 4,096 unique tokens would only require 64 KB plus a small fixed-size Priority Heap buffer.

E5.6 References
[1] Allan, James. Information Retrieval: Statistics of Text. University of Massachusetts Amherst, 2003. http://ciir.cs.umass.edu/cmpsci646/Slides/irlO%20text%20stats.pdf
[2] Li, Wentian. Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845 (1992).
[3] Li, Wentian. Zipf's Law Everywhere. Glottometrics 5, 14-21, 2002.
[4] Williams, H.E. and Zobel, J. Searchable Words on the Web. 2001.

Claims

1. A metalanguage processor, comprising :
- a parser capable of decomposing a metalanguage file into source file segments, in accordance with the syntax of the metalanguage,
- a lexicon, comprising a representation of a set of strings, said representation being searchable, using a search string, to get a unique identifier for that search string, and
- a converter, capable of deriving a search string from a source file segment, and of interacting with the lexicon for determining an identifier for that search string,
- said converter building a representation of the metalanguage file as a stored sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for the search string derived from the corresponding source file segment.
2. The metalanguage processor of claim 1, wherein said converter is arranged to interact with the lexicon, for a given source file segment, by :
- searching the search string derived from the source file segment in the lexicon, and
- if the search is unsuccessful, entering the search string in said representation of a set of strings, in the lexicon.
3. The metalanguage processor of claim 1, wherein said converter comprises an analyzer capable of deriving a string and a type from a given source file segment, and wherein each data block in the sequence distinctively represents the string and the type of the corresponding source file segment.
4. The metalanguage processor of claim 3, wherein the search string represents both the string and the type derived from the source file segment.
5. The metalanguage processor of claim 3, wherein each data block comprises a concatenation of a type related integer with an integer related to the string.
6. The metalanguage processor of claim 5, wherein the search string comprises a concatenation of the type related integer with the string derived from the source file segment.
7. The metalanguage processor of claim 1, wherein said lexicon has a search tree structure, and the identifier for a search string comprises a digital description of the path corresponding to the search string in the search tree structure.
8. The metalanguage processor of claim 7, wherein the lexicon has a ternary trie structure.
9. The metalanguage processor of claim 1, wherein said data blocks have a fixed length.
10. The metalanguage processor of claim 1, for use with a metalanguage file having sections, with said stored sequence of data blocks comprising sections corresponding to the sections in the metalanguage file, wherein the converter is further capable of building a stored representation of a correspondence between :
- an identifier, and
- the position in the sequence of one or more sections containing a data block based on that identifier.
11. The metalanguage processor of claim 10, wherein sections are definable by a section class in the metalanguage file, and said converter is operable, upon designation of a section class, for building a stored representation of a correspondence between :
- an identifier, and
- the position in the sequence of one or more sections, belonging to the section class, and containing a data block based on that identifier.
12. The metalanguage processor of claim 11, wherein said converter is operable, upon designation of a section class, for building a plurality of said stored representations, corresponding to a plurality of identifiers, respectively, said plurality of said stored representations covering substantially all the data blocks met in the sections corresponding to the section class.
13. The metalanguage processor of claim 12, for use with a metalanguage having strict rules of nesting sections within each other, wherein said section class is definable by a path.
14. The metalanguage processor of claim 13, further comprising a nesting level tracking mechanism, operable upon scanning the sequence, for maintaining a level information based on the type indicated in the successive data blocks.
15. The metalanguage processor of claim 1, for use with a metalanguage file having sections, said metalanguage processor comprising a scope mechanism, responsive to the designation of a section for evaluating the length devoted to that section in the sequence.
16. The metalanguage processor of claim 1, further comprising a reverse converter, capable of receiving a data block, and of interacting with the lexicon for building a source file segment in the metalanguage, corresponding to the data block.
17. The metalanguage processor of claim 16, wherein said reverse converter is operable on the whole sequence, for substantially re-constructing the original metalanguage file.
18. A method of processing a metalanguage file, the method comprising : a. parsing the metalanguage file, for identifying therein successive source file segments in accordance with the syntax of the metalanguage, b. maintaining a lexicon, forming a directly searchable representation of strings, in correspondence with a unique identifier for each string, c. converting a search string derived from a source file segment into a corresponding identifier, using the lexicon, and d. progressively building a sequence of data blocks, each data block corresponding to a respective one of the source file segments, and each data block being based on the unique identifier determined in the lexicon for a search string derived from the corresponding source file segment.
19. The method of claim 18, wherein
- step c. comprises creating a new lexicon entry with a new identifier if no identifier is found.
20. The method of claim 18, wherein
- step c. comprises analyzing each source file segment into a string and a type, and
- step d. comprises building a data block as a distinct representation of the string and of the type of the corresponding source file segment.
21. The method of claim 20, wherein
- step d. comprises building a data block as a concatenation of a type related integer with an integer related to the string.
22. The method of claim 20, wherein each string in the lexicon comprises both a string derived from a source file segment, and a type attached to that string.
23. The method of claim 18, wherein
- step b. comprises maintaining the lexicon as a search tree structure, with an identifier representing a digital description of the path in the search tree to a corresponding string.
24. The method of claim 23, wherein
- step b. comprises maintaining the lexicon as a ternary trie structure.
25. The method of claim 24, wherein said data blocks have a fixed length.
26. The method of claim 18, for use with a metalanguage file having sections, said stored sequence of data blocks comprising sections corresponding to the sections in the metalanguage file, said method further comprising the step of : e. maintaining a representation of a correspondence between an identifier and the position in the sequence of a section containing that identifier.
27. The method of claim 18, for use with a metalanguage file having sections, said stored sequence of data blocks comprising sections corresponding to the sections in the metalanguage file,
said method further comprising the step of : e. maintaining, for selected ones of the sections, lists respectively attached to the identifiers being present in the selected sections, each list containing a representation of the locations in the sequence of the sections containing a data block based on the identifier.
28. The method of claim 18, for use with a metalanguage file having sections, said metalanguage having strict rules of nesting sections within each other, said stored sequence of data blocks comprising sections corresponding to the sections in the metalanguage file, said method further comprising the steps of :
d1. while progressively building the sequence, maintaining a nesting level information based on the type in the data blocks,
d2. maintaining a representation of a correspondence between an identifier and the positions in the sequence of sections containing data blocks based on that identifier and being at the same nesting level.
29. The method of claim 18, further comprising the step of : f. regenerating a metalanguage file using the sequence and lexicon.
30. The method of claim 18, further comprising the step of : f. regenerating a portion of a metalanguage file using the sequence and lexicon.
31. A database for storing data representing a metalanguage file, said database comprising:
- a lexicon, comprising a representation of a set of strings, said representation being directly searchable, using an identifier, to obtain a unique corresponding string, the set of strings in the lexicon covering substantially all the meaningful string data in the metalanguage file,
- a sequence of data blocks individually based on identifiers being searchable in the lexicon, the order of the data blocks in said sequence being related to the order of corresponding strings in the metalanguage file.
32. The database of claim 31, wherein said lexicon has a search tree structure, and the identifier for a string comprises a digital description of the path towards the string in the search tree structure.
33. The database of claim 32, wherein the lexicon has a ternary trie structure.
34. The database of claim 31, wherein said data blocks have a fixed length.
35. The database of claim 31, further comprising :
- database functions operable on the sequence of data blocks.
36. The database of claim 35, wherein said database functions comprise :
- a level tracking function, operable upon scanning the sequence for maintaining a level information based on a type indicated in the successive data blocks.
37. The database of claim 35, wherein said database functions comprise :
- a scope function, responsive to the indication of a position in the sequence, for evaluating the length devoted in the sequence to a section containing said sequence position.
38. The database of claim 31, further comprising :
- further stored data, associated with the designation of a class of sections in the sequence,
- said further stored data comprising a subset of data associated to an identifier, and said subset comprising a representation of the position in the sequence of one or more sections, belonging to the class, and containing data blocks having that identifier.
39. The database of claim 38, wherein said further stored data, comprises a subset of data for each data block being met in a section belonging to the class.
40. The database of claim 31, further comprising a query processor, said query processor having :
- a query parser, capable of cooperating with the lexicon for converting a query directed to the metalanguage into query components using the corresponding identifiers in the lexicon,
- a query resolver, capable of resolving query components, for designating sections in the sequence, using the identifiers in the sequence,
- a query combiner, capable of reaching the designated sections, for producing query component results therefrom, and combining such results of the query components into a coded overall result, and
- a reverse converter, capable of cooperating with the lexicon for converting the coded overall result back into a metalanguage.
41. The database of claim 40, further comprising :
- further stored data, associated with the designation of a class of sections in the sequence,
- said further stored data comprising a subset of data associated to an identifier, and said subset comprising a representation of the position in the sequence of one or more sections, belonging to the class, and containing data blocks having that identifier, said query resolver being arranged for searching an identifier in said further stored data.
42. The database of claim 41, wherein said query resolver is arranged to solve primarily those of the query components which have been found in said further stored data.
43. The database of claim 41, wherein said query resolver is capable of scanning the sequence only for those of the sections designated in relevant ones of the further stored data, after said resolution.
44. The database of claim 40, wherein said query combiner is arranged for delivering results extracted from the sequence, in sections determined by said query resolver.
45. The database of claim 40, wherein said reverse converter is capable of converting the coded overall result back into a metalanguage which is different from the source metalanguage.