WO1988001774A1 - Programmable, character reduction lexical search circuit - Google Patents

Programmable, character reduction lexical search circuit

Info

Publication number
WO1988001774A1
WO1988001774A1 PCT/US1987/002286 US8702286W WO8801774A1 WO 1988001774 A1 WO1988001774 A1 WO 1988001774A1 US 8702286 W US8702286 W US 8702286W WO 8801774 A1 WO8801774 A1 WO 8801774A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
key
state
value
character
Prior art date
Application number
PCT/US1987/002286
Other languages
French (fr)
Inventor
David W. Packard
Original Assignee
Packard David W
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Packard David W
Publication of WO1988001774A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Definitions

  • the present invention is generally related to computer systems utilized for the storage and retrieval of textual information, and in particular, to a programmable, lexical search circuit that provides for the sequential searching of a stream of computer data to locate the occurrence of any of several programmably predefined data strings.
  • Computer applications, especially those involving natural language texts, very often need to search a large amount of textual data in order to locate the occurrence of one or more target patterns, such as words or phrases.
  • an inverted index is utilized, which enumerates every word in the text together with an indication of its location.
  • the inverted index is normally maintained in sorted order. While such an index can facilitate rapid searching, the index data itself requires a significant amount of storage space, and it must be updated whenever the text itself is changed. Moreover, it is sometimes necessary to search for parts of words (roots, suffixes) that do not occur in the simple alphabetic index.
  • Finite state automaton algorithms can be realized in software as indicated in the above cited article. There is, however, a significant execution speed penalty associated with a software implementation.
  • a purpose of the present invention is to provide an efficient, highly flexible manner of realizing in hardware the implementation of a deterministic finite automaton algorithm for the searching of data for data strings.
  • a lexical search apparatus for detecting the occurrence of any one of a plurality of predetermined strings of data values within an input stream of data values.
  • the apparatus utilizes an input data stream translating vector to convert input stream data values into corresponding key data values.
  • a state transition table, constructed based on the respective sequences of key data values corresponding to the predetermined data value strings, is used to successively accumulate the input data value stream by successive state transitions representing sequences of key data values.
  • a state transition to a state corresponding to a terminal data value of a predetermined data value string is detected.
  • a sequencer coordinates the conversion of the input data value stream by the translate vector and the sequential accumulation of key data values using the state transition table until a terminal key data value state is reached.
  • the transition to a terminal key data value state is reported along with the particular state reached for determining the particular one of the predetermined data value strings found.
  • An advantage of the present invention is that it utilizes a translation step to reduce greatly the size of the memory needed to represent the possible state transitions. Additionally, the use of the translation step in accordance with the present invention allows for the substantial simplification of the handling of sets of input characters that should be treated as identical (such as upper-case and lower-case), of sets of characters that should be ignored (such as font shifts or punctuation), and of sets of characters that require special processing (such as end-of-block or end-of-file markers).
  • a further advantage of the present invention is that it is completely programmable and thus retains the full flexibility of software implementations of deterministic finite automata.
  • the number and maximum length of target data strings are limited only by the amount of programmable memory utilized in any specific implementation.
  • Still another advantage of the present invention is that it provides for the recognition of target data string matches, unique identification reporting of the particular target string matches, and the handling of significant exception conditions. Additionally, the present invention provides for the substantial minimization of required host computer processing overhead during the performance of the data string searching and upon occurrence of a target string match or significant exception condition.
  • a still further advantage of the present invention is that it can be constructed out of readily available, low-cost components, and thus provides an economical as well as reliable and fast means and method of implementing a deterministic finite automaton for searching data strings. It can be interfaced easily to small personal or professional computers and can provide an improvement in searching speed of at least an order of magnitude over methods based on software alone.
  • Figure 1 is a schematic diagram of a preferred embodiment of the present invention.
  • Figure 2 is a schematic diagram of another preferred embodiment of the present invention.
  • the present invention is a lexical search apparatus for detecting the occurrence of one or more target data strings within an input data string.
  • the apparatus is intended normally to function in concert with a host processor.
  • the apparatus appears to a host processor as a memory mapped I/O device with an eight-bit, bidirectional data bus and two address lines.
  • the host processor can write to this device at four different addresses with four different functions. At one designated address the host processor writes, character by character, the input data string to be searched. The remaining three addresses are used for initializing and updating the contents of registers and memory within the apparatus.
  • the host processor can also read status information from the apparatus.
  • When operating in search mode, the host processor writes data bytes to the apparatus at a designated address. The host processor does not need to read status information after each byte is sent. So long as there is no target data string match, the transfer occurs in a continuous stream, with the apparatus providing the normal acknowledgment for each byte transferred (DTACK in the case of a 68000-series host processor).
  • When a match (or other terminal condition) occurs, the apparatus withholds its acknowledgment of the character that caused the match (or other terminal condition). This intentional failure of handshaking is sensed by the host processor or its support circuit and triggers an exception interrupt (BUS error in the case of a 68000-series host processor), thereby interrupting the transfer at this point.
  • the system software in the host processor recognizes that this BUS error occurred at the port address of the apparatus and transfers control to a routine that interrogates the apparatus.
  • the host processor address register or other circuitry that serves as a pointer for the data transfer (in an auto-increment mode) will be left pointing at the first character after the exception interrupt.
  • the host processor can now read the status port to determine whether a match indeed occurred and, if so, the index of the match (five bits) and several other status bits.
  • the apparatus converts each input character into a corresponding translated input character, thereby making it possible to treat sets of input characters as equivalent.
  • the resulting set of translated input characters can normally be much smaller than the set of all possible input characters.
  • the apparatus provides a highly compact storage scheme for the two-dimensional array used to implement the deterministic finite automaton to determine the next state for each combination of current state and translated input character.
  • the apparatus causes an exception interrupt with the match status bit set.
  • a search sequencer coordinates the translation of input characters, the state transitions performed by the deterministic finite automaton, and the reporting of the particular target data string found or other exceptional condition.
  • the present invention may be applied to the searching of substantially any data string for the occurrence of particularly patterned data. Therefore, the following discussion of the present invention in terms of its preferred lexical search application is not to be taken as limiting the present invention in any way.
  • the preferred embodiment of the present invention is shown in Figure 1. The preferred usage of this embodiment 10 is for the searching of textual strings of data characters as provided from a computer stored document.
  • the illustrated embodiment of the lexical search circuit 10 includes a memory (MEM) 14 and a sequencer (SEQ) 60.
  • the lexical search circuit 10 is substantially complete with the inclusion of index register (IR) 16, base register (BR) 18, and data buffers 12, 20, 54.
  • the lexical search circuit 10 relies on a host processor, such as a Motorola 68000, to sequentially provide the individual characters of a data string on a data I/O bus 22. With each character, the host processor preferably also provides a character available control signal, identified as lex, on the lex control line 62 to the sequencer 60.
  • the sequencer 60 provides an acknowledge control signal, identified as lexack, via the lexack control line 64 to indicate that the lexical search circuit 10 has appropriately handled this input character and is ready to accept the next.
  • the memory (MEM) 14 is a 2048 x 8-bit static random access memory device. Its address space (2048 bytes) is treated as 64 segments of 32 bytes each. The eight top segments (256 bytes) are reserved for the translation table, or vector. The remaining 56 segments can be viewed as a two-dimensional array, with the first array index ranging over the 56 possible states (numbered 0 to 55) of the deterministic finite automaton implemented by the apparatus, and the second index ranging over the 32 possible translated input characters (numbered from 0 to 31).
  • This array table stores data identifying the next state to be selected for each combination of a current state and a translated input character.
  • In order to access the translation vector, address lines A10, A9, and A8 must be held high, with the remaining address lines selecting one of the 256 elements of the translation vector.
  • Address lines A10 through A5, in this preferred embodiment, select the current array table segment (corresponding to the state).
  • Address lines A4 through A0 select one of the 32 bytes within the segment, as determined by the translated input character.
  • the optimum size of the memory, and its division into segments, depends on the width of the input character (in this case 8 bits), the maximum number of distinct translated input characters to be allowed (in this case 32), and the maximum number of states to be allowed (in this case 56). Larger memory would allow the handling of longer target strings, or the recognition of strings containing more than 32 distinct characters, or both.
  • the memory size chosen for the preferred embodiment allows the apparatus to be constructed from a minimum number of low-cost components, while still providing sufficient flexibility for its anticipated uses.
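The memory division described above can be illustrated with a short address calculation. This is only a sketch of the arithmetic stated in the text (2048 bytes, 32-byte segments, translation vector in the top 256 bytes); the function and constant names are illustrative assumptions and do not appear in the source.

```python
# Sketch of the memory map described above; all names are illustrative.
MEM_SIZE = 2048                               # 2048 x 8-bit memory
SEGMENT_SIZE = 32                             # one byte per translated input character
NUM_SEGMENTS = MEM_SIZE // SEGMENT_SIZE       # 64 segments in total
TRANSLATE_BASE = MEM_SIZE - 256               # top eight segments; A10, A9, A8 held high
NUM_STATES = TRANSLATE_BASE // SEGMENT_SIZE   # 56 segments left for states

def translate_address(input_char):
    """Address of the translation vector entry for an 8-bit input character."""
    assert 0 <= input_char < 256
    return TRANSLATE_BASE | input_char

def transition_address(state, key):
    """Address of the next-state entry: A10..A5 carry the state, A4..A0 the key."""
    assert 0 <= state < NUM_STATES and 0 <= key < SEGMENT_SIZE
    return (state << 5) | key
```

With these figures the highest state-table address, `transition_address(55, 31)`, sits immediately below the first translation-vector address, `translate_address(0)`.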
  • the operation of the lexical search circuit 10 follows from the operation of the sequencer 60 acting in concert with the host processor.
  • the lexical search circuit 10 sequentially operates on input data as received from the data I/O bus 22 by way of the input character buffer 12.
  • this input data is an eight-bit wide data character.
  • the input character buffer 12 drives each input data character on to the eight least significant address inputs of the memory 14 via the input character bus 24.
  • the three most significant bit lines of the address inputs of the memory 14 are pulled to a high logic state by a passive, resistive pull-up device 56, when they are not being driven by the enabled outputs of base register 18.
  • the eight least significant bits provided via the input character buffer 12 select one of the top 256 addresses within the memory 14. These 256 locations contain the translation vector, which contains the translated input character value associated with each of the 256 possible input character values. The contents of the particular memory location selected is provided onto the memory data bus 30. The now effectively translated input character is then latched into the index register 16.
  • the present invention provides for the reduction of the input character to one of a set of translated input character values numbering preferably 32 or less.
  • the translated and reduced input character is preferably represented as a five-bit value.
  • of the remaining three bits stored by the index register 16, preferably two bits provide status information regarding the input character that will be routed to the sequencer 60 and to a status output buffer 20.
  • the translated input character stored by the index register 16 is provided via a five-bit wide index register output bus 26 to the five least significant address inputs of the memory 14.
  • the most significant six bits required to address the memory 14 are provided from the base register 18 via its base register output bus 28.
  • the index and base registers 16, 18 together provide an eleven-bit composite address.
  • the base register 18 stores a previously established eight-bit value including a six-bit value identifying the current state of the automaton. This six-bit value can be regarded as the first of the two array indexes into the two-dimensional state transition array stored in memory 14.
  • the five least significant bits of the composite eleven-bit-wide address can be regarded as the second index into this two-dimensional array.
  • the first index value reflects the current state of the automaton
  • the second index value reflects the translated input character.
  • the eight-bit data from the selected location within the state transition array of the memory 14 is provided onto the memory data bus 30 whereupon it is latched into the base register 18.
  • this eight-bit data value normally includes a six-bit value identifying the next state of the automaton, and up to two bits of status information for the status output buffer 20.
  • the data value provided from the memory 14 may alternately contain a five-bit target string index to uniquely identify the particular target string matched. This target string index substitutes for the normal next state identifier in cases where the next state corresponds to a target string match.
  • the five output lines from the base register that would carry the five-bit value of the target string index, along with the line carrying the match status indicator, or bit, are provided to the status output buffer 20.
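The two forms of state transition entry described above (a six-bit next state, or a five-bit target string index with a match status bit) can be sketched as a pair of encode and decode helpers. The text does not give the bit position of the match status bit within the byte for this embodiment, so placing it in bit 7 is an assumption made only for illustration, as are all the names used.

```python
# Sketch only: MATCH_BIT in bit 7 is an assumed position, not taken from the source.
MATCH_BIT = 0x80

def encode_next_state(state):
    """Entry naming the next state (six-bit value, states 0 to 55)."""
    assert 0 <= state < 56
    return state

def encode_match(target_index):
    """Entry flagged as a match, carrying a five-bit target string index."""
    assert 0 <= target_index < 32
    return MATCH_BIT | target_index

def decode_entry(entry):
    """Return ('match', index) or ('state', next_state) for an 8-bit entry."""
    if entry & MATCH_BIT:
        return ("match", entry & 0x1F)
    return ("state", entry & 0x3F)
```

For instance, `decode_entry(encode_match(6))` yields `("match", 6)`, whereas an unflagged entry decodes to a next state.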
  • the sequencer 60 of the lexical search circuit 10 is a programmable array logic device.
  • the inputs to the sequencer 60 include the lex signal on the lex control line 62, a two-bit-wide address provided on the address lines 68, a read/write control signal provided on the read/write control line 70, an exception status report signal on control line 32, a match status report signal on control line 34, and three timing signals (T1, T2, and T3) on the timing control lines 66. These three timing signals are referenced to the lex signal on the lex control line 62 and occur at fixed times after the assertion of the lex signal.
  • these timing signals, which are used to delimit the necessary setup and hold delay times required by the circuit, may be generated by a shift register or by a digital delay line driven by the lex signal.
  • In response to the input signals described above, the sequencer 60 generates the lexack signal on the lexack control line 64, and a variety of output enable and register clock signals on sequencer output lines 36, 38, 40, 42, 44, 46, and 50.
  • the operation of the sequencer 60 begins with the provision of the lex signal on lex control line 62 to the sequencer 60 and the provision of the corresponding input character on the data I/O bus 22.
  • the sequencer 60 immediately provides a control signal on sequencer output line 36 to enable the input character buffer 12 such that the input character is driven onto the input character bus 24.
  • An enable control signal is not provided at this time on the sequencer output line 38, so that the logic state of the base register output bus 28 is controlled by the input character buffer 12 in combination with the passive resistive pull-up device 56.
  • the sequencer 60 also provides a read/write and an output enable signal, via its respective sequencer output lines 42, 40, to the memory 14.
  • the translated input character will contain a data bit at the index register storage position corresponding to the connection of the index register 16 to the exception status reporting line 32.
  • the sequencer 60 provides an output enable signal on the sequencer output line 38 to both the index and base registers 16, 18.
  • the translated input character value is provided by the index register 16 onto the index register output bus 26.
  • the previously provided current state number, as now stored by the base register 18, is provided on the base register output bus 28.
  • the composite eleven-bit-wide address is thus formed and provided on the address inputs to the memory 14.
  • the sequencer 60 again provides the read and output enable control signals on the sequencer output lines 42, 40, respectively, with the result that the corresponding addressed location within the state transition array of the memory 14 is accessed.
  • the contents of this memory location include the numeric value, or number, of the next state.
  • the delay delimiting assertion of timing signal T2 on the timing control lines 66 causes the sequencer 60 to provide a base register clock signal on sequencer output line 46 to latch the next state number into the base register 18. If the current input character is the terminal character in a target data string, then a data bit at the base register storage location corresponding to the match status reporting line 34 is present and a five-bit index uniquely identifying the data string to which the terminal input character belongs is provided in place of the six-bit next-state number. In either case, the sequencer 60 further preferably removes the output enable control signals from the memory 14 upon the latching of the base register 18.
  • the assertion of timing signal T3 on the timing control lines 66 causes the sequencer 60 to determine whether any exception or match condition has occurred. If neither has occurred, the sequencer 60 simply provides the lexack signal on the lexack control line 64. In response to the lexack signal, the host processor releases the lex signal and, in turn, the sequencer 60 withdraws the index and base register 16, 18 output enable signals. If, however, an exception or match condition has occurred, the operation of the sequencer 60 differs specifically in that the lexack control signal is withheld. For a Motorola 68000 host processor, this results in a bus exception condition interrupt that draws the attention of the host processor. The corresponding interrupt handling routine can examine the address at which each bus exception occurs. Assuming that this address corresponds to the lexical search circuit 10, a service routine can be executed or marked for later execution to service the lexical search circuit 10.
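The per-character cycle described above (translate, look up the next state, then give or withhold lexack) can be modelled in software roughly as follows. This is a behavioural sketch, not a description of the actual circuit timing; the class name, method name, and the choice of bit 7 for both the exception and match flags are assumptions made for illustration.

```python
class LexicalSearchModel:
    """Behavioural sketch of one search cycle; flag bit positions are assumed."""
    EXC_BIT = 0x80     # assumed exception flag in a translation vector entry
    MATCH_BIT = 0x80   # assumed match flag in a state transition entry

    def __init__(self):
        self.mem = bytearray(2048)   # translation vector in the top 256 bytes
        self.index_reg = 0           # holds the translated input character
        self.base_reg = 0            # holds the current state (or a match index)

    def write_char(self, ch):
        """Process one input byte; return True if lexack would be provided."""
        # First access: upper address lines pulled high select the translation vector.
        self.index_reg = self.mem[0x700 | ch]
        # Second access: base register (state) and index register (key) form the address.
        addr = ((self.base_reg & 0x3F) << 5) | (self.index_reg & 0x1F)
        self.base_reg = self.mem[addr]
        # Final check: lexack is withheld on an exception or a match.
        exception = bool(self.index_reg & self.EXC_BIT)
        match = bool(self.base_reg & self.MATCH_BIT)
        return not (exception or match)

# Hypothetical programming: 'a' translates to key 2, and key 2 in state 0
# is flagged as a match for target index 1.
dev = LexicalSearchModel()
dev.mem[0x700 + ord("a")] = 2
dev.mem[(0 << 5) | 2] = 0x80 | 1
acks = [dev.write_char(b) for b in b"\x00a"]
```

In this sketch the first byte is acknowledged and the second is not, after which the low five bits of `base_reg` hold the target index that a host service routine would read back.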
  • the data strings identified in Table I are preselected as the search target strings.
  • a translation vector, as indicated in Table II, is constructed to reduce the full range of 256 possible input character values to the minimum desired set of 12 translated input characters (numbered from 0 to 11).
  • upper-case and lower-case letters are treated as equivalent, so that the input characters 'a' and 'A' are equally translated to the value 3.
  • the hyphen is translated to the value 0, which illustrates the treatment of any character that should be ignored whenever it appears as an input character. All other possible input characters that do not appear in any target string are translated to the value 1.
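A translation vector of the kind described above could be built along the following lines. The sketch uses only the key values that the text itself states (hyphen to 0, all other characters to 1, and the letters C, A, F, X, N to 2, 3, 5, 10, 11); the remaining entries of Table II are not reproduced here and are deliberately left out. The function name and its parameters are illustrative assumptions.

```python
def build_translation_vector(key_of, ignored=b"-", ignore_key=0, other_key=1):
    """Return a 256-entry list mapping each input byte to a translated key.

    key_of maps a letter to its key; both cases of the letter receive that key.
    """
    vec = [other_key] * 256          # default: character appears in no target string
    for byte in ignored:
        vec[byte] = ignore_key       # characters to be skipped, such as the hyphen
    for letter, key in key_of.items():
        vec[ord(letter.lower())] = key
        vec[ord(letter.upper())] = key
    return vec

# Only the letter keys mentioned in the text; other letters of Table II are omitted.
vec = build_translation_vector({"c": 2, "a": 3, "f": 5, "x": 10, "n": 11})
```

Because every unlisted byte collapses to the single key 1, the number of distinct keys stays small regardless of how many of the 256 input values occur in the data.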
  • the state transition array, shown in Table III, describes a deterministic finite automaton that recognizes the six target strings in Table I. This array can be coded by hand, or it can be constructed by a computer program. An algorithm for constructing such arrays automatically is described in the Aho article.
  • the values in the state transition array identify the next state that should be entered in response to each possible combination of current state and translated input character. Some of the values are specially flagged (marked by an asterisk in Table III) to show that the combination of this current state and this translated input character indicates that a target string has been matched. That is, the terminal data character of a data string has been received.
  • the flagged value is the index of the matched target string (as given in Table I), as provided in the state transition array, rather than a next state number.
  • the current state number as held by the base register 18 is initialized to S0 (state 0).
  • the initial state will switch from S0 to S1 on receipt of translated input character 2 ('C' or 'c').
  • the initial state will switch from S0 to S7 upon receipt of translated input character 5 ('F' or 'f').
  • Table III shows that, when the current state is S0, all other translated input characters result in transitions to S0 (a null transition).
  • the receipt of the translated input character 10 ('X', 'x') will cause a transition to a specially flagged terminal state, where a target data string index of 6, corresponding to the target data string "fix", will be provided together with a status bit indicating that a match has occurred.
  • the receipt of the translated input character 11 ('N', 'n') will cause a transition to a specially flagged terminal state, where a target data string index of 3, corresponding to the target data string "fin", will be provided.
  • the receipt of other translated input characters will yield state transitions as indicated in Table III.
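A state transition array with flagged match entries can also be generated by program. The following is a minimal sketch in the general style of an Aho-Corasick construction; whether it matches the algorithm of the cited article in detail is not confirmed here. The key numbering (F=2, I=3, X=4, N=5), the target indexes, the match flag in bit 7, and the conventions that key 0 leaves the state unchanged and key 1 returns to state 0 are all assumptions for illustration and are not the contents of Tables I to III.

```python
from collections import deque

MATCH = 0x80  # assumed flag bit marking an entry as a target string index

def build_table(targets, num_keys):
    """targets: list of key sequences (keys >= 2). Returns one row per state."""
    goto = [{}]                       # trie: goto[state][key] -> state
    final = {}                        # trie node -> 1-based target index
    for index, keys in enumerate(targets, start=1):
        s = 0
        for k in keys:
            if k not in goto[s]:
                goto.append({})
                goto[s][k] = len(goto) - 1
            s = goto[s][k]
        final[s] = index
    fail = [0] * len(goto)
    dfa = [[0] * num_keys for _ in goto]
    queue = deque()
    for k in range(2, num_keys):      # transitions out of the start state
        t = goto[0].get(k, 0)
        dfa[0][k] = t
        if t:
            queue.append(t)
    while queue:                      # breadth-first completion of the automaton
        s = queue.popleft()
        for k in range(2, num_keys):
            if k in goto[s]:
                t = goto[s][k]
                fail[t] = dfa[fail[s]][k]
                if t not in final and fail[t] in final:
                    final[t] = final[fail[t]]   # a target ends inside a longer path
                dfa[s][k] = t
                queue.append(t)
            else:
                dfa[s][k] = dfa[fail[s]][k]
    table = []
    for s, row in enumerate(dfa):
        out = [(MATCH | final[t]) if t in final else t for t in row]
        out[0] = s                    # assumed: key 0 (ignored character) keeps the state
        table.append(out)             # key 1 (other character) already maps to state 0
    return table

# Hypothetical keys F=2, I=3, X=4, N=5; targets "fix" (index 1) and "fin" (index 2).
table = build_table([[2, 3, 4], [2, 3, 5]], num_keys=6)
```

In this sketch the rows for terminal trie nodes are kept but never entered, since every transition into them is replaced by a flagged index; a more compact generator could renumber the states to omit them.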
  • If the host processor wishes to recognize overlapping target strings in the input data, it must maintain a list showing the state in which the automaton should be initialized (the number that must be written to the base register 18) when the search resumes after the processing of each match (as opposed to reinitializing to zero). This is necessary because the present embodiment of the memory 14 does not have sufficient space in each state transition array data location to contain both a target match index and a next state number.
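The resume-state list mentioned above can be sketched as a small host-side loop. This models the circuit in software for clarity; the function name, the `resume_state` mapping, and the match flag in bit 7 are assumptions for illustration only.

```python
def host_search(data, translate, table, resume_state, match_bit=0x80):
    """Return (position, target index) pairs, resuming from the host's list."""
    found = []
    state = 0
    for pos, byte in enumerate(data):
        entry = table[state][translate[byte]]
        if entry & match_bit:
            index = entry & 0x1F
            found.append((pos, index))
            # Value the host would write to the base register before resuming.
            state = resume_state.get(index, 0)
        else:
            state = entry
    return found

# Hypothetical example: key 1 is 'a', and the single target "aa" has index 1.
translate = [0] * 256
translate[ord("a")] = 1
table = [
    [0, 1],          # state 0: 'a' moves to state 1
    [0, 0x80 | 1],   # state 1: a second 'a' is flagged as a match of target 1
]
overlapping = host_search(b"aaa", translate, table, {1: 1})
restart_only = host_search(b"aaa", translate, table, {})
```

With the resume state set to 1, both overlapping occurrences of "aa" in "aaa" are reported; restarting from state 0 after each match reports only the first.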
  • An exception input character is preferably any input character whose occurrence in the input data string signals a circumstance, other than a target data string match, that requires intervention by the host processor.
  • the data character representing an end of data file marker is such an exception data character.
  • Another exception data character may be a control data character indicating a change in the data character representation of the characters in the input data string. An example of this would be a change to an extended or alternate character set. Therefore, an exception condition is invoked to allow the host processor to appropriately alter or replace the current translation vector.
  • the size of the memory 14 constrains the number, length, and complexity of the target strings that can be used as search targets. These limitations, however, can largely be overcome by appropriate software support. If it is desired to search for a very long target string, for example, the first segment of the target string can be encoded for searching by the hardware apparatus. If the apparatus successfully matches the first segment of a target string, a software search routine can take over the search by using a larger state transition array stored in main memory. Alternatively, the software can load a new state transition array into the apparatus.
  • the exception status bit is preferably sensed by the sequencer 60 immediately following each latching of the base register 18.
  • In response to a true exception status bit, the sequencer 60 withholds the lexack control signal, thereby halting the provision of subsequent input characters by the host processor.
  • the exception status signal is further provided to the status output buffer 20 via exception status reporting line 32.
  • a second, user definable, status bit may be provided via the index register 16 on the exception status lines 34 to further identify the particular exception that occurred.
  • the host processor executes its service routine for the lexical search circuit.
  • this routine can perform a read operation to obtain status information via the status output buffer 20.
  • the sequencer 60 recognizes the port select address provided onto the address lines 68 and the read signal on read/write control line 70.
  • the sequencer 60 preferably responds by providing an output enable control signal to the status output buffer 20 via sequencer output line 44, and an output enable control signal to the index and base registers 16, 18, via sequencer output line 38.
  • the state of the exception status and exception identification bits, along with the five low order bits stored in the base register, are therefore driven onto the data I/O bus 22.
  • the sequencer 60 also provides the lexack signal on control line 64. At the conclusion of the read operation, indicated by the withdrawal of the host processor lex control signal from line 62, the sequencer 60 withdraws the output enable signals.
  • a target string match condition is handled in a similar manner.
  • the match status data bit is provided onto the match status reporting line 34 to the sequencer 60.
  • the sequencer again withholds the lexack control signal.
  • the target string index value and the match status bit are driven onto the data I/O bus 22 via the status output buffer 20.
  • the host processor is able to determine not only that a match has occurred, but to also explicitly identify the target data string by the index value provided by the lexical search circuit 10.
  • the present invention provides that the data in the memory 14 is fully programmable.
  • the host processor can readily download the translation vector and the state transition array into the memory 14. In the preferred embodiment shown in Figure 1, any location in the memory 14 can be written to by the host processor.
  • the relevant port select addresses are given in Table IV below.
  • the sequencer 60 enables the output of the download buffer 54 so as to transfer data from the data I/O bus 22 to the memory data bus 30. Once the data has settled, the sequencer 60 provides a register clock signal on the sequencer output line 36 to the index register 16 to latch the data into the index register. This index register value includes the least significant five bits of the memory address.
  • the download buffer is again enabled by distinguishing a different port select address to transfer data from the data I/O bus to the memory data bus 30. The sequencer 60 then provides a register clock signal on the sequencer output line 46 to the base register 18.
  • the data is latched into the base register 18.
  • This base register value includes the most significant six bits of the memory address.
  • the data provided in the third write operation is the actual data to be written to the memory 14 at the composite memory address. The sequencer 60 again distinguishes the port select address and provides a write control signal to the memory 14 while the data is present on the memory data bus 30 as provided via the download buffer 54.
  • the memory address is provided by the output enabling of both the index and base register 16, 18.
  • the sequencer 60 withdraws the output enable and write control signals from the download buffer 54, index and base register 16, 18 and the memory 14.
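The three-write download sequence described above can be sketched from the host's point of view. The port numbers below are hypothetical, since Table IV is not reproduced in this text, and the class and function names are likewise illustrative; the model only mirrors the order of operations (index register, base register, then data).

```python
# Hypothetical port select values; the real assignments are those of Table IV.
PORT_INDEX, PORT_BASE, PORT_DATA = 1, 2, 3

class DownloadPort:
    """Software stand-in for the registers and memory reached through the ports."""

    def __init__(self):
        self.mem = bytearray(2048)
        self.index_reg = 0
        self.base_reg = 0

    def write(self, port, value):
        if port == PORT_INDEX:
            self.index_reg = value       # supplies the low five address bits
        elif port == PORT_BASE:
            self.base_reg = value        # supplies the high six address bits
        elif port == PORT_DATA:
            addr = ((self.base_reg & 0x3F) << 5) | (self.index_reg & 0x1F)
            self.mem[addr] = value       # third write stores the data byte

def download(port, address, value):
    """Store one byte at an 11-bit address using three host writes."""
    port.write(PORT_INDEX, address & 0x1F)
    port.write(PORT_BASE, (address >> 5) & 0x3F)
    port.write(PORT_DATA, value)

dev = DownloadPort()
download(dev, 0x700 + ord("a"), 3)   # translation vector entry for 'a'
```

A host filling consecutive locations within one 32-byte segment could skip the base register write after the first byte, since only the index register value changes.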
  • an alternative embodiment of the present invention may include an eleven-bit-wide address buffer to drive the address inputs of memory 14 directly from the host processor address bus. Only a slight modification to the sequencer need be made to provide enable signals to this buffer. In this case, only one processor cycle would be required to write each memory location.
  • A significantly higher performance lexical search circuit, constructed in accordance with the present invention and generally indicated by the reference numeral 80, is shown in Fig. 2. Perhaps the most significant aspect of this lexical search circuit 80 is that a translation vector memory (TVM) 84 is implemented separately from a state transition memory (STM) 86. As with the embodiment illustrated in Figure 1, a logic state sequencer 82 is utilized to control the detailed operation of the lexical search circuit 80. The interface of the sequencer 82 to the host processor is substantially as before, except that three address lines are used.
  • the sequencer 82 is directed by the host processor to configure the lexical search circuit 80 for the pending receipt of input characters on the data I/O bus 22.
  • Table V provides the relevant port select addressing of the lexical search circuit 80.
  • the lex signal provided on the lex control line 62 again is used to indicate the availability of each input data character.
  • the sequencer 82 provides appropriate signals onto the sequencer control output bus 100 to enable the output of the translation vector memory 84 and the state transition memory 86.
  • the data I/O bus 22 drives the address inputs of the translation vector memory 84.
  • the data from the addressed memory location is driven onto data bus 104, which provides the five-bit low order address inputs to the state transition memory 86.
  • the base register 88 is enabled to provide its six-bit wide stored current state number onto the base register output bus 106 so that, in combination with the translated data character as provided on bus 104, a complete eleven-bit address is provided to the state transition memory 86.
  • the data from the addressed memory location is provided onto data bus 108.
  • the next state number thereby provided onto data bus 108 is then latched into the base register 88.
  • the lexical search circuit 80 is substantially prepared for receipt of the next search data character.
  • An exception condition, if present, is reported via the most significant bit connected data line 105 to the sequencer 82, via exception status bus 102.
  • a match condition, if present, is reported via the most significant bit connected data line 109 to the sequencer 82, via exception status bus 102.
  • the matched target data string index is also provided from the state transition memory 86 onto its data bus 108, and is latched into the base register 88.
  • the lexack control signal is conditioned on the logic state of the exception and match status bits. If either is true, then the sequencer 82 withholds provision of the lexack control signal to indicate the need to be serviced by the host processor.
  • an alternative preferred embodiment of the present invention may again further include an eleven-bit-wide address buffer to drive the address inputs of the state transition memory 86 directly from the host processor address bus. Only a slight modification to the sequencer need be made to provide enable signals to this buffer. In this case, only one host processor cycle would be required to write each memory location.
  • the lexack control signal may alternately be provided early by the sequencer 60 given that the previous input character did not result in a match or other exceptional condition. That is, the sequencer 60 will preferably determine from the currently provided input character whether or not the lexack control signal will be provided after receipt of the next succeeding input character. Thus, any exception condition or target data string match condition will pertain to the immediately preceding processed input character.
  • the innate data processing efficiency of the host processor is thereby substantially retained while only incurring the minor requirement that a dummy search data character be passed to the lexical search circuit at the termination of a data string search to determine whether an exception or target data string match condition is pending.
  • the lexical search circuit 10, in particular, as shown in Fig. 1 is adapted to operate with a host processor that can transfer several (typically two or four) bytes of data over a data bus wider than eight bits.
  • an input multiplexer circuit is provided to route each input character in sequence to the input character bus.
  • the sequencer supplies control signals to the input multiplexer and repeats the appropriate sequence of control signals for each input character to the remainder of the circuit as necessary to process each of the individual input characters.
  • the preferred embodiments of the present invention described above have assumed that the host processor communicates with the lex circuit by means of an asynchronous handshaking protocol, in which the absence of acknowledgement on the part of the lex circuit provides a convenient means of signaling an exceptional condition to the host processor.
  • the lack of acknowledgement causes an exception interrupt (often termed a BUS error) which can be handled by software.
  • the sequencer could, for example, provide an explicit interrupt signal to the host processor.
  • the lex circuit could be driven by a DMA controller, with the DMA residual count at the time of interrupt indicating which character in the input string had caused the match or other exceptional condition.
  • a programmable lexical search circuit, exemplified by a variety of specific embodiments thereof, has been described to illustrate its various advantages, including simplicity of design, low parts count, extreme flexibility, high search throughput capability and adaptability to a wide variety of host processor configurations.
  • the input data string for searching may come from a wide variety of sources and at various stages of processing within an overall data processing system.
  • the data string may be provided directly from the read circuitry of a high capacity data storage device or from the output of an analog to digital converter operating from an analog data source such as a CATV transmission. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

Abstract

A lexical search apparatus (10) for detecting the occurrence of predetermined data strings within an input data string, and the methods of its operation. Input characters are collapsed into a smaller set of translated input characters. A state transition array stored in memory, TBL (14), guides a deterministic finite automaton in recognizing the occurrence of the target strings in the input data. Transition to a state corresponding to the terminal character of a target string is detected, and an index identifying the found target data string is provided. A search sequencer (60) provides for the coordination of translating the input characters, guiding the state transitions, and allowing identification of the particular target data string found.

Description

PROGRAMMABLE, CHARACTER REDUCTION LEXICAL SEARCH CIRCUIT
FIELD OF THE INVENTION
The present invention is generally related to computer systems utilized for the storage and retrieval of textual information, and in particular, to a programmable, lexical search circuit that provides for the sequential searching of a stream of computer data to locate the occurrence of any of several programmably predefined data strings.
BACKGROUND OF THE INVENTION
Computer applications, especially those involving natural language texts, very often need to search a large amount of textual data in order to locate the occurrence of one or more target patterns, such as words or phrases.
In many cases, an inverted index is utilized, which enumerates every word in the text together with an indication of its location. The inverted index is normally maintained in sorted order. While such an index can facilitate rapid searching, the index data itself requires a significant amount of storage space, and it must be updated whenever the text itself is changed. Moreover, it is sometimes necessary to search for parts of words (roots, suffixes) that do not occur in the simple alphabetic index.
It is desirable, therefore, to have an efficient manner of searching text that is stored in its natural, unindexed format. This can be accomplished by "brute force" methods, involving the comparison of each target pattern with the entire text; but this requires a tremendous amount of processing, directly proportional to the number and length of the target patterns, and the sheer size of the document being searched.
In order to improve on the effectiveness of textual string searches, various specialized software algorithms have been developed. In particular, software implementations of finite state automata have been used for searching textual data. Examples of such algorithmic automata are disclosed in "Efficient String Matching: An Aid to Bibliographic Search," A. V. Aho and M. J. Corasick, Communications of the ACM, Vol. 18, 1975, No. 10, pp. 333-340.
Finite state automaton algorithms can be realized in software as indicated in the above cited article. There is, however, a significant execution speed penalty associated with a software implementation.
A purpose of the present invention, therefore, is to provide an efficient, highly flexible manner of realizing in hardware the implementation of a deterministic finite automaton algorithm for the searching of data for data strings.
SUMMARY
This is accomplished in the present invention by use of a lexical search apparatus for detecting the occurrence of any one of a plurality of predetermined strings of data values within an input stream of data values. The apparatus utilizes an input data stream translating vector to convert input stream data values into corresponding key data values. A state transition table, constructed based on the respective sequences of key data values corresponding to the predetermined data value strings, is used to successively accumulate the input data value stream by successive state transitions representing sequences of key data values. A state transition to a state corresponding to a terminal data value of a predetermined data value string is detected. A sequencer coordinates the conversion of the input data value stream by the translate vector and the sequential accumulation of key data values using the state transition table until a terminal key data value state is reached. The transition to a terminal key data value state is reported along with the particular state reached for determining the particular one of the predetermined data value strings found.
An advantage of the present invention is that it utilizes a translation step to reduce greatly the size of the memory needed to represent the possible state transitions. Additionally, the use of the translation step in accordance with the present invention allows for the substantial simplification of the handling of sets of input characters that should be treated as identical (such as upper-case and lower-case), of sets of characters that should be ignored (such as font shifts or punctuation), and of sets of characters that require special processing (such as end-of-block or end-of-file markers).
A further advantage of the present invention is that it is completely programmable and thus retains the full flexibility of software implementations of deterministic finite automata. The number and maximum length of target data strings are limited only by the amount of programmable memory utilized in any specific implementation.
Still another advantage of the present invention is that it provides for the recognition of target data string matches, unique identification reporting of the particular target string matches, and the handling of significant exception conditions. Additionally, the present invention provides for the substantial minimization of required host computer processing overhead during the performance of the data string searching and upon occurrence of a target string match or significant exception condition.
A still further advantage of the present invention is that it can be constructed out of readily available, low-cost components, and thus provides an economical as well as reliable and fast means and method of implementing a deterministic finite automaton for searching data strings. It can be interfaced easily to small personal or professional computers and can provide an improvement in searching speed of at least an order of magnitude over methods based on software alone.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other attendant advantages of the present invention will become apparent and readily appreciated as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein like reference numerals designate like parts throughout the figures thereof, and wherein:
Figure 1 is a schematic diagram of a preferred embodiment of the present invention;
Figure 2 is a schematic diagram of another preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
A. Overview
The present invention is a lexical search apparatus for detecting the occurrence of one or more target data strings within an input data string. The apparatus is intended normally to function in concert with a host processor.
In the preferred embodiment, the apparatus appears to a host processor as a memory mapped I/O device with an eight-bit, bidirectional data bus and two address lines. The host processor can write to this device at four different addresses with four different functions. At one designated address the host processor writes, character by character, the input data string to be searched. The remaining three addresses are used for initializing and updating the contents of registers and memory within the apparatus. The host processor can also read status information from the apparatus.
When operating in search mode, the host processor writes data bytes to the apparatus at a designated address. The host processor does not need to read status information after each byte is sent. So long as there is no target data string match, the transfer occurs in a continuous stream, with the apparatus providing the normal acknowledgment for each byte transferred (DTACK in the case of a 68000-series host processor).
When a match (or other terminal condition) occurs, the apparatus withholds its acknowledgement of the character that caused the match (or other terminal condition). This intentional failure of handshaking is sensed by the host processor or its support circuit and triggers an exception interrupt (BUS error in the case of a 68000-series host processor), thereby interrupting the transfer at this point.
The system software in the host processor recognizes that this BUS error occurred at the port address of the apparatus and transfers control to a routine that interrogates the apparatus. The host processor address register or other circuitry that serves as a pointer for the data transfer (in an auto-increment mode) will be left pointing at the first character after the exception interrupt. The host processor can now read the status port to determine whether a match indeed occurred and, if so, the index of the match (five bits) and several other status bits.
The apparatus converts each input character into a corresponding translated input character, thereby making it possible to treat sets of input characters as equivalent. The resulting set of translated input characters can normally be much smaller than the set of all possible input characters.
The apparatus provides a highly compact storage scheme for the two-dimensional array used to implement the deterministic finite automaton to determine the next state for each combination of current state and translated input character. When a state is reached that corresponds to the terminal character of a target data string, the apparatus causes an exception interrupt with the match status bit set.
A search sequencer coordinates the translation of input characters, the state transitions performed by the deterministic finite automaton, and the reporting of the particular target data string found or other exceptional condition.
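By way of illustration only, the host-side flow described in this overview can be sketched in software. The sketch below is not part of the disclosed circuit: `BusError`, `StubLexDevice` and `search` are hypothetical names, the stub device merely flags a match on one chosen byte, and the status layout (match flag in bit 7, five-bit index in bits 4 to 0) is an assumption, since the text states only that the index occupies five bits.

```python
class BusError(Exception):
    """Stands in for the exception interrupt raised when acknowledgement is withheld."""


class StubLexDevice:
    """Hypothetical stand-in for the apparatus: reports a match on one byte value."""

    def __init__(self, trigger, match_index):
        self.trigger = trigger
        self.match_index = match_index
        self.status = 0

    def write_char(self, byte):
        if byte == self.trigger:
            # Assumed status layout: bit 7 = match, bits 4..0 = target index.
            self.status = 0x80 | (self.match_index & 0x1F)
            raise BusError  # acknowledgement withheld
        self.status = 0

    def read_status(self):
        return self.status


def search(device, data):
    """Stream bytes to the device; yield (offset after match, index) per match."""
    pos = 0
    while pos < len(data):
        try:
            while pos < len(data):
                byte = data[pos]
                pos += 1  # the auto-incrementing pointer moves with each transfer
                device.write_char(byte)
        except BusError:
            # Status is read only after an exception, never per byte.
            status = device.read_status()
            if status & 0x80:
                yield pos, status & 0x1F
```

As in the description, the pointer `pos` is left at the first character after the one that caused the exception, so the transfer can simply resume from there.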
B. Detailed Description of the Preferred Embodiments
As will become readily apparent, the present invention may be applied to the searching of substantially any data string for the occurrence of particularly patterned data. Therefore, the following discussion of the present invention in terms of its preferred lexical search application is not to be taken as limiting the present invention in any way. The preferred embodiment of the present invention, generally indicated by the reference numeral 10, is shown in Figure 1. The preferred usage of this embodiment 10 is for the searching of textual strings of data characters as provided from a computer stored document.
The illustrated embodiment of the lexical search circuit 10 includes a memory (MEM) 14 and a sequencer (SEQ) 60. The lexical search circuit 10 is substantially complete with the inclusion of index register (IR) 16, base register (BR) 18, and data buffers 12, 20, 54. In its preferred embodiment, the lexical search circuit 10 relies on a host processor, such as a Motorola 68000, to sequentially provide the individual characters of a data string on a data I/O bus 22. With each character, the host processor preferably also provides a character available control signal, identified as lex, on the lex control line 62 to the sequencer 60. Once a data character is processed by the lexical search circuit 10, the sequencer 60 provides an acknowledge control signal, identified as lexack, via the lexack control line 64 to indicate that the lexical search circuit 10 has appropriately handled this input character and is ready to accept the next.
In the preferred embodiment, the memory (MEM) 14 is a 2048 x 8-bit static random access memory device. Its address space (2048 bytes) is treated as 64 segments of 32 bytes each. The eight top segments (256 bytes) are reserved for the translation table, or vector. The remaining 56 segments can be viewed as a two-dimensional array, with the first array index ranging over the 56 possible states (numbered 0 to 55) of the deterministic finite automaton implemented by the apparatus, and the second index ranging over the 32 possible translated input characters (numbered from 0 to 31). This array table stores data identifying the next state to be selected for each combination of a current state and a translated input character.
In order to access the translation vector, address lines A10, A9, and A8 must be held high, with the remaining address lines selecting one of the 256 elements of the translation vector. Address lines A10 through A5, in this preferred embodiment, select the current array table segment (corresponding to the state). Address lines A4 through A0 select one of the 32 bytes within the segment, as determined by the translated input character.
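The address arithmetic implied by this memory map can be written out as a short sketch. The function and constant names below are illustrative assumptions; only the sizes and address line assignments come from the description.

```python
SEGMENT_SIZE = 32     # bytes per segment, selected by A4..A0
NUM_STATES = 56       # segments 0..55 hold the state transition array
VECTOR_BASE = 0x700   # A10, A9 and A8 high: the top 256 bytes


def translation_address(input_char):
    """Address of the translation vector entry for an eight-bit input character."""
    if not 0 <= input_char <= 0xFF:
        raise ValueError("input character must fit in eight bits")
    return VECTOR_BASE | input_char


def transition_address(state, translated_char):
    """Address of the array entry for (current state, translated input character)."""
    if not 0 <= state < NUM_STATES:
        raise ValueError("state must be in 0..55")
    if not 0 <= translated_char < SEGMENT_SIZE:
        raise ValueError("translated input character must be in 0..31")
    return (state << 5) | translated_char  # state on A10..A5, character on A4..A0
```

Since 56 segments of 32 bytes occupy addresses 0 through 1791 (0x6FF), the state transition array ends exactly where the translation vector begins at 0x700.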
The optimum size of the memory, and its division into segments, depends on the width of the input character (in this case 8 bits), the maximum number of distinct translated input characters to be allowed (in this case 32), and the maximum number of states to be allowed (in this case 56). Larger memory would allow the handling of longer target strings, or the recognition of strings containing more than 32 distinct characters, or both. The memory size chosen for the preferred embodiment allows the apparatus to be constructed from a minimum number of low-cost components, while still providing sufficient flexibility for its anticipated uses.
Operation of the lexical search circuit 10 follows from the operation of the sequencer 60 acting in concert with the host processor. In general terms, the lexical search circuit 10 sequentially operates on input data as received from the data I/O bus 22 by way of the input character buffer 12. In the preferred embodiment of the present invention, this input data is an eight-bit wide data character. The input character buffer 12 drives each input data character onto the eight least significant address inputs of the memory 14 via the input character bus 24. The three most significant bit lines of the address inputs of the memory 14 are pulled to a high logic state by a passive, resistive pull-up device 56, when they are not being driven by the enabled outputs of base register 18. With the outputs of index register 16 and base register 18 disabled, and with the three most significant address inputs of the memory 14 held in a high logic state by the pull-up device 56, the eight least significant bits provided via the input character buffer 12 select one of the top 256 addresses within the memory 14. These 256 locations contain the translation vector, which contains the translated input character value associated with each of the 256 possible input character values. The contents of the particular memory location selected are provided onto the memory data bus 30. The now effectively translated input character is then latched into the index register 16. By this translation of the input character, the present invention provides for the reduction of the input character to one of a set of translated input character values numbering preferably 32 or fewer. In any case, the translated and reduced input character is preferably represented as a five-bit value. Of the remaining three bits stored by the index register 16, preferably two bits provide status information regarding the input character that will be routed to the sequencer 60 and to a status output buffer 20.
The translated input character stored by the index register 16 is provided via a five-bit wide index register output bus 26 to the five least significant address inputs of the memory 14. The most significant six bits required to address the memory 14 are provided from the base register 18 via its base register output bus 28. Thus, the index and base registers 16, 18 together provide an eleven-bit composite address. During operation, the base register 18 stores a previously established eight-bit value including a six-bit value identifying the current state of the automaton. This six-bit value can be regarded as the first of the two array indexes into the two-dimensional state transition array stored in memory 14. The five least significant bits of the composite eleven-bit-wide address can be regarded as the second index into this two-dimensional array. The first index value reflects the current state of the automaton; the second index value reflects the translated input character. Again, the eight-bit data from the selected location within the state transition array of the memory 14 is provided onto the memory data bus 30, whereupon it is latched into the base register 18. Preferably, this eight-bit data value normally includes a six-bit value identifying the next state of the automaton, and up to two bits of status for the status output buffer 20. In the preferred embodiment of the present invention the data value provided from the memory 14 may alternately contain a five-bit target string index to uniquely identify the particular target string matched. This target string index substitutes for the normal next state identifier in cases where the next state corresponds to a target string match. The five output lines from the base register that would carry the five-bit value of the target string index, along with the line carrying the match status indicator, or bit, are provided to the status output buffer 20.
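The two successive memory accesses described above may be modeled, purely as a sketch, by the following functions. The names `step` and `decode` are hypothetical, and placing the match flag in the most significant bit is an assumption for the Figure 1 embodiment; the description fixes only the field widths. The small memory image is likewise invented for illustration and recognizes the single target "ab".

```python
VECTOR_BASE = 0x700    # top 256 bytes: A10, A9 and A8 held high by the pull-ups
MATCH_BIT = 0x80       # assumption: match flag in the most significant bit


def step(memory, base_register, input_char):
    """Process one input character; return (new_base_register, index_register)."""
    # First access: the input character alone selects a translation vector entry.
    index_register = memory[VECTOR_BASE | (input_char & 0xFF)]
    # Second access: six-bit state on A10..A5, five-bit translated character on A4..A0.
    address = ((base_register & 0x3F) << 5) | (index_register & 0x1F)
    return memory[address], index_register


def decode(base_register):
    """Split a latched value into (is_match, next state or target string index)."""
    if base_register & MATCH_BIT:
        return True, base_register & 0x1F
    return False, base_register & 0x3F


# Hypothetical memory image that recognizes the single target "ab" as index 1.
memory = bytearray(2048)
memory[VECTOR_BASE:] = bytes([1]) * 256        # all other characters -> 1
memory[VECTOR_BASE | ord("a")] = 2
memory[VECTOR_BASE | ord("b")] = 3
memory[(0 << 5) | 2] = 1                       # state 0, 'a' -> state 1
memory[(1 << 5) | 2] = 1                       # state 1, 'a' -> state 1
memory[(1 << 5) | 3] = MATCH_BIT | 1           # state 1, 'b' -> match, index 1
```

Feeding `ord("a")` and then `ord("b")` through `step`, starting from a base register of 0, leaves a value that `decode` reports as a match with index 1.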
In greater detail, the sequencer 60 of the lexical search circuit 10 is a programmable array logic device. The inputs to the sequencer 60 include the lex signal on the lex control line 62, a two-bit-wide address provided on the address lines 68, a read/write control signal provided on the read/write control line 70, an exception status report signal on control line 32, a match status report signal on control line 34, and three timing signals (T1, T2, and T3) on the timing control lines 66. These three timing signals are referenced to the lex signal on the lex control line 62 and occur at fixed times after the assertion of the lex signal. These timing signals, which are used to delimit the necessary setup and hold delay times required by the circuit, may be generated by a shift register or by a digital delay line driven by the lex signal. In response to the input signals described above, the sequencer 60 generates the lexack signal on the lexack control line 64, and a variety of output enable and register clock signals on sequencer output lines 36, 38, 40, 42, 44, 46, and 50.
The logic configuration of sequencer 60 is best described in terms of its detailed operation. Accordingly, a data character search cycle generally begins with the provision of the lex signal on lex control line 62 to the sequencer 60 and the provision of the corresponding input character on the data I/O bus 22. The sequencer 60 immediately provides a control signal on sequencer output line 36 to enable the input character buffer 12 such that the input character is driven onto the input character bus 24. An enable control signal is not provided at this time on the sequencer output line 38, so that the logic state of the base register output bus 28 is controlled by the input character buffer 12 in combination with the passive resistive pull-up device 56. The sequencer 60 also provides a read/write and an output enable signal, via its respective sequencer output lines 42, 40, to the memory 14. Consequently, a location within the translation vector stored in the memory 14, as selected by the input character provided via the input character buffer 12, is accessed and the corresponding translated input character value is placed onto the memory data bus 30. After a predetermined delay designed to allow the memory output to become stable on the memory data bus 30, the delay delimiting assertion of timing signal T1 on the timing control lines 66 causes the sequencer 60 to provide a clock signal to index register 16 and simultaneously to withdraw the enable signal from the input character buffer 12. In this preferred embodiment, both actions are accomplished by the high to low transition on sequencer output line 36. This results in the translated data character being latched into the index register 16. At the same time, the input character buffer 12 is disabled to isolate the data I/O bus 22 from the input character bus 24.
If this input character belongs to a set of characters that have been programmatically predefined to raise an exception condition, then the translated input character will contain a data bit at the index register storage position corresponding to the connection of the index register 16 to the exception status reporting line 32.
Also in response to the assertion of timing signal T1 the sequencer 60 provides an output enable signal on the sequencer output line 38 to both the index and base registers 16, 18. The translated input character value is provided by the index register 16 onto the index register output bus 26. The previously provided current state number, as now stored by the base register 18, is provided on the base register output bus 28. The composite eleven-bit-wide address is thus formed and provided on the address inputs to the memory 14. The sequencer 60 again provides the read and output enable control signals on the sequencer output lines 42, 40, respectively, with the result that the corresponding addressed location within the state transition array of the memory 14 is accessed. The contents of this memory location include the numeric value, or number, of the next state.
After a predetermined delay designed to allow the next state number to become stable on the memory data bus 30, the delay delimiting assertion of timing signal T2 on the timing control lines 66 causes the sequencer 60 to provide a base register clock signal on sequencer output line 46 to latch the next state number into the base register 18. If the current input character is the terminal character in a target data string, then a data bit at the base register storage location corresponding to the match status reporting line 34 is present and a five-bit index uniquely identifying the data string to which the terminal input character belongs is provided in place of the six-bit next-state number. In either case, the sequencer 60 further preferably removes the output enable control signals from the memory 14 upon the latching of the base register 18.
After another predetermined delay designed to allow the new contents of the base register to become stable (particularly on the match status reporting line 34) the delay delimiting assertion of timing signal T3 on the timing control lines 66 causes the sequencer 60 to determine whether any exception or match condition has occurred. If neither has occurred, the sequencer 60 simply provides the lexack signal on the lexack control line 64. In response to the lexack signal, the host processor releases the lex signal and, in turn, the sequencer 60 withdraws the index and base register 16, 18 output enable signals. If, however, an exception or match condition has occurred, the operation of the sequencer 60 differs specifically in that the lexack control signal is withheld. For a Motorola 68000 host processor, this results in a bus exception condition interrupt that draws the attention of the host processor. The corresponding interrupt handling routine can examine the address at which each bus exception occurs. Assuming that this address corresponds to the lexical search circuit 10, a service routine can be executed or marked for later execution to service the lexical search circuit 10.
To better understand the utilization of the translation vector and the state transition array, both stored in the memory 14, reference is made to Tables I, II, and III below.
TABLE I
Example Target Strings

Match Index    Target
    1          catfish
    2          fox
    3          fin
    4          can
    5          cats
    6          fix
TABLE II
Example Translation Vector

Input Character           Translated Input Character
C                          2
A                          3
T                          4
F                          5
I                          6
S                          7
H                          8
O                          9
X                         10
N                         11
c                          2
a                          3
t                          4
f                          5
i                          6
s                          7
h                          8
o                          9
x                         10
n                         11
- (hyphen)                 0
(all other characters)     1

TABLE III
Example State Transition Array

Translated                          State
Input Character     S0  S1  S2  S3  S4  S5  S6  S7  S8  S9
 0 (hyphen)          0   1   2   3   4   5   6   7   8   9
 1 (all others)      0   0   0   0   0   0   0   0   0   0
 2 (C,c)             1   1   1   1   1   1   1   1   1   1
 3 (A,a)             0   2   0   0   0   0   0   0   0   0
 4 (T,t)             0   0   3   0   0   0   0   0   0   0
 5 (F,f)            (row supplied as Figure imgf000022_0001 in the source)
 6 (I,i)             0   0   0   0   5   0   0   9   0   0
 7 (S,s)             0   0   0   5*  0   6   0   0   0   0
 8 (H,h)             0   0   0   0   0   0   1*  0   0   0
 9 (O,o)             0   0   0   0   8   0   0   8   0   0
10 (X,x)             0   0   0   0   0   6*  0   0   2*  6*
11 (N,n)             0   0   4*  0   0   3*  0   0   0   3*
For purposes of example, the data strings identified in Table I are preselected as the search target strings. In accordance with the present invention, a translation vector, as indicated in Table II, is constructed to reduce the full range of 256 possible input character values to the minimum desired set of 12 translated input characters (numbered from 0 to 11). In this example, upper-case and lower-case letters are treated as equivalent, so that the input characters 'a' and 'A' are equally translated to the value 3. The hyphen is translated to the value 0, which illustrates the treatment of any character that should be ignored whenever it appears as an input character. All other possible input characters that do not appear in any target string are translated to the value 1.
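One way such a translation vector could be generated from the target strings is sketched below. The function name is hypothetical, and the sketch assumes single-byte (ASCII) characters and assigns values in order of first appearance, which happens to reproduce the values of Table II for the targets of Table I.

```python
TARGETS = ["catfish", "fox", "fin", "can", "cats", "fix"]   # Table I


def build_translation_vector(targets, ignored="-"):
    """Return a 256-entry list mapping input characters to translated values."""
    keys = {}                              # lower-case letter -> translated value
    for target in targets:
        for ch in target.lower():
            if ch not in keys:
                keys[ch] = len(keys) + 2   # values 0 and 1 are reserved
    if len(keys) + 2 > 32:
        raise ValueError("more than 32 translated input characters")
    vector = [1] * 256                     # 1: characters in no target string
    for ch in ignored:
        vector[ord(ch)] = 0                # 0: characters to be ignored
    for ch, key in keys.items():
        vector[ord(ch)] = key              # lower-case and upper-case forms
        vector[ord(ch.upper())] = key      # share the same translated value
    return vector
```

With the six targets above, `build_translation_vector(TARGETS)` maps both 'a' and 'A' to 3, the hyphen to 0, and any letter outside the targets to 1.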
The state transition array, shown in Table III, describes a deterministic finite automaton that recognizes the six target strings in Table I. This array can be coded by hand, or it can be constructed by a computer program. An algorithm for constructing such arrays automatically is described in the Aho article.
The values in the state transition array identify the next state that should be entered in response to each possible combination of current state and translated input character. Some of the values are specially flagged (marked by an asterisk in Table III) to show that the combination of this current state and this translated input character indicates that a target string has been matched. That is, the terminal data character of a data string has been received. The flagged value is the index of the matched target string (as given in Table I), stored in the state transition array in place of a next state number.
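A possible construction of such an array from the target strings is sketched below. It is not the algorithm of the cited article reproduced verbatim, but a simplified suffix-based variant written for clarity; the function name and the tuple representation `(value, is_match)` are assumptions. States are numbered in order of first appearance of each proper prefix, and the sketch assumes that no target is a prefix of another.

```python
TARGETS = ["catfish", "fox", "fin", "can", "cats", "fix"]   # Table I, indexes 1..6
ALPHABET = "catfishoxn"                                      # values 2..11, Table II


def build_transition_array(targets, alphabet):
    """Return {(state, translated char): (next state or match index, is_match)}."""
    index_of = {t: i for i, t in enumerate(targets, start=1)}
    state_of = {"": 0}                       # one state per proper prefix
    for t in targets:
        for n in range(1, len(t)):
            if t[:n] in index_of:
                raise ValueError("a target may not be a prefix of another target")
            state_of.setdefault(t[:n], len(state_of))
    table = {}
    for prefix, state in state_of.items():
        table[state, 0] = (state, False)     # ignored character: stay in place
        table[state, 1] = (0, False)         # any other character: back to S0
        for key, ch in enumerate(alphabet, start=2):
            seen = prefix + ch
            suffixes = [seen[i:] for i in range(len(seen))]   # longest first
            hit = next((s for s in suffixes if s in index_of), None)
            if hit is not None:
                table[state, key] = (index_of[hit], True)     # flagged entry
            else:
                longest = next((s for s in suffixes if s in state_of), "")
                table[state, key] = (state_of[longest], False)
    return table
```

For the targets of Table I this yields ten states, S0 through S9, and reproduces the flagged entries described in the text, for example index 6 for translated character 10 in state S9. For translated character 5 (F, f) it yields state 7 from every state except S3, from which it yields state 4.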
At the start of the search the current state number as held by the base register 18 is initialized to S0 (state 0). The initial state will switch from S0 to S1 on receipt of translated input character 2 ('C' or 'c'). Similarly, the initial state will switch from S0 to S7 upon receipt of translated input character 5 ('F' or 'f'). Table III shows that, when the current state is S0, all other translated input characters result in transitions to S0 (a null transition).
When the current state is S7, it will switch from S7 to S8 upon receipt of translated input character 9 ('O', 'o'); it will switch from S7 to S9 upon receipt of translated input character 6 ('I', 'i'); it will switch from S7 to S1 upon receipt of translated input character 2 ('C', 'c'); it will remain in S7 upon receipt of translated input character 5 ('F', 'f') or translated input character 0 (hyphen); and it will switch from S7 to S0 for all other translated input character values.
When the current state is S9, the receipt of the translated input character 10 ('X', 'x') will cause a transition to a specially flagged terminal state, where a target data string index of 6, corresponding to the target data string "fix", will be provided together with a status bit indicating that a match has occurred. Similarly, in state S9, the receipt of the translated input character 11 ('N', 'n') will cause a transition to a specially flagged terminal state, where a target data string index of 3, corresponding to the target data string "fin", will be provided. The receipt of other translated input characters will yield state transitions as indicated in Table III.
In this example, if a hyphen occurs as an input character, it will be converted into the translated input character 0. In this case no state transition will occur, since each corresponding element in the state transition array indicates that the next state is the same as the current state. By this means, the apparatus effectively ignores the hyphen entirely, and any number of embedded hyphens can occur in the input data without preventing the apparatus from matching the target strings. In this example, the input strings "cat-fish" or "c-at-fish" or "c-a-t-f-i-s-h" would be recognized as "catfish". This same mechanism can be used to ignore embedded accents and diacritical signs, font shifts, punctuation marks, brackets, or any other characters that may optionally occur within the target string.
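The transitions just described can be sketched in software as follows. This is a minimal illustration only, not the circuit itself: it fills in just the rows for S0, S7 and S9 that the text spells out, and it assumes that every row not described falls back to S0, which the full Table III need not do. The key code `OTHER`, the function names and the reset to S0 after a match are assumptions made for the sketch.

```python
# Translated key codes named in the text; OTHER (code 1) is a hypothetical
# catch-all for every character the text does not assign a code to.
HYPHEN, OTHER, C, F, I, O, X, N = 0, 1, 2, 5, 6, 9, 10, 11

S0, S1, S7, S8, S9 = 0, 1, 7, 8, 9


def translate(ch):
    """Translation vector: fold case and reduce a character to a key code."""
    table = {"-": HYPHEN, "c": C, "f": F, "i": I, "o": O, "x": X, "n": N}
    return table.get(ch.lower(), OTHER)


# (current state, key) -> ("next", state) or ("match", target string index).
TRANSITIONS = {
    (S0, C): ("next", S1), (S0, F): ("next", S7),
    (S7, O): ("next", S8), (S7, I): ("next", S9),
    (S7, C): ("next", S1), (S7, F): ("next", S7),
    (S9, X): ("match", 6),  # "fix"
    (S9, N): ("match", 3),  # "fin"
}


def step(state, key):
    """One lookup in the state transition array."""
    if key == HYPHEN:
        return ("next", state)  # hyphen entries point back at the current state
    # Assumption: rows not described in the text return to S0.
    return TRANSITIONS.get((state, key), ("next", S0))


def search(text):
    """Return (position, target index) for each match found in text."""
    state, hits = S0, []
    for pos, ch in enumerate(text):
        kind, value = step(state, translate(ch))
        if kind == "match":
            hits.append((pos, value))
            state = S0  # assumption: the host reinitializes to state 0
        else:
            state = value
    return hits
```

Under these assumptions `search("f-i-x")` reports target index 6 at the position of the final 'x', because the hyphens leave the state unchanged.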
If the host processor wishes to recognize overlapping target strings in the input data, it must maintain a list showing the state in which the automaton should be initialized (the number that must be written to the base register 18) when the search resumes after the processing of each match (as opposed to reinitializing to zero). This is necessary because the present embodiment of the memory 14 does not have sufficient space in each state transition array data location to contain both a target match index and a next state number.
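A host-side resume list of this kind might look like the sketch below. The two targets ("ab" with index 1 and "abc" with index 2), the state numbers and all names are hypothetical and are not taken from the tables of this description. State 2 stands for "ab has been seen"; it can only be reached through the resume list, because the array entry for the final 'b' holds the match index rather than a next state number.

```python
# (current state, character) -> (match flag, next state or match index)
TABLE = {
    (0, "a"): (False, 1),
    (1, "a"): (False, 1),
    (1, "b"): (True, 1),   # terminal character of "ab": entry holds index 1
    (2, "a"): (False, 1),
    (2, "c"): (True, 2),   # terminal character of "abc": entry holds index 2
}

# Host-maintained list: match index -> state to write to the base register.
RESUME = {1: 2, 2: 0}


def search(text, resume):
    """Return the match indices reported while scanning text."""
    state, hits = 0, []
    for ch in text:
        matched, value = TABLE.get((state, ch), (False, 0))
        if matched:
            hits.append(value)
            # Without a resume entry the host falls back to state 0.
            state = resume.get(value, 0)
        else:
            state = value
    return hits
```

With the resume list, scanning "abc" reports both targets; with an empty list (always reinitializing to zero) only "ab" is reported.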
Special handling is provided for exception input characters. An exception input character is preferably any input character whose occurrence in the input data string signals a circumstance, other than a target data string match, that requires intervention by the host processor. Preferably the data character representing an end of data file marker is such an exception data character. Another exception data character may be a control data character indicating a change in the data character representation of the characters in the input data string. An example of this would be a change to an extended or alternate character set. Therefore, an exception condition is invoked to allow the host processor to appropriately alter or replace the current translation vector.
The size of the memory 14 constrains the number, length, and complexity of the target strings that can be used as search targets. These limitations, however, can largely be overcome by appropriate software support. If it is desired to search for a very long target string, for example, the first segment of the target string can be encoded for searching by the hardware apparatus. If the apparatus successfully matches the first segment of a target string, a software search routine can take over the search by using a larger state transition array stored in main memory. Alternatively, the software can load a new state transition array into the apparatus.
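The division of labor between hardware and software might be sketched as follows. This is an illustration under stated assumptions: `str.find` merely stands in for the hardware reporting a match on the first segment, and the software stage is a plain string comparison rather than the larger state transition array mentioned above. The function name and the `prefix_len` parameter are hypothetical.

```python
def find_long_target(text, target, prefix_len):
    """Return start positions of target, using a two-stage search."""
    prefix, remainder = target[:prefix_len], target[prefix_len:]
    hits = []
    start = text.find(prefix)  # stand-in for the hardware first-segment match
    while start != -1:
        # Software takes over and checks the rest of the long target.
        if text.startswith(remainder, start + prefix_len):
            hits.append(start)
        start = text.find(prefix, start + 1)
    return hits
```

A first-segment match that is not followed by the remainder (for example "lexicon" when searching for "lexical search") is simply discarded by the software stage.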
Considering again the lexical search circuit 10 of Figure 1, the exception status bit is preferably sensed by the sequencer 60 immediately following each latching of the base register 18. In response to a true exception status bit, the sequencer 60 withholds the lexack control signal, thereby halting the provision of subsequent input characters by the host processor. The exception status signal is further provided to the status output buffer 20 via exception status reporting line 32. A second, user definable, status bit may be provided via the index register 16 on the exception status lines 34 to further identify the particular exception that occurred. Thus, in response to an exception condition, the host processor executes its service routine for the lexical search circuit. This routine can perform a read operation to obtain status information via the status output buffer 20.
That is, from the performance of the read operation the sequencer 60 recognizes the port select address provided onto the address lines 68 and the read signal on read/write control line 70. The sequencer 60 preferably responds by providing an output enable control signal to the status output buffer 20 via sequencer output line 44, and an output enable control signal to the index and base registers 16, 18, via sequencer output line 34. The state of the exception status and exception identification bits, along with the five low order bits stored in the base register, are therefore driven onto the data I/O bus 22. The sequencer 60 also provides the lexack signal on control line 64. At the conclusion of the read operation, indicated by the withdrawal of the host processor lex control signal from line 62, the sequencer 60 withdraws the output enable signals.
A target string match condition is handled in a similar manner. Upon latching the new value into the base register 18, the match status data bit is provided onto the match status reporting line 34 to the sequencer 60. Upon sensing this signal as true, indicating a match, the sequencer again withholds the lexack control signal. When the lexical search circuit is subsequently serviced by the host processor, the target string index value and the match status bit, as latched in the base register 18, are driven onto the data I/O bus 22 via the status output buffer 20. Thus, the host processor is able not only to determine that a match has occurred, but also to explicitly identify the target data string by the index value provided by the lexical search circuit 10.
In order to obtain flexibility, the present invention provides that the data in the memory 14 is fully programmable. The host processor can readily download the translation vector and the state transition array into the memory 14. In the preferred embodiment shown in Figure 1, any location in the memory 14 can be written to by the host processor. The relevant port select addresses are given in Table IV below.
TABLE IV
R/W   Addrx   Addry   Operation
0     0       0       write input character
1     0       0       read status
0     0       1       write index register
0     1       0       write base register
0     1       1       write memory data
To write a single location in memory 14, a series of three host processor write cycles is generally required. With the first write operation, the sequencer 60 enables the output of the download buffer 54 so as to transfer data from the data I/O bus 22 to the memory data bus 30. Once the data has settled, the sequencer 60 provides a register clock signal on the sequencer output line 36 to the index register 16 to latch the data into the index register. This index register value includes the least significant five bits of the memory address. In the second write operation, the download buffer is again enabled by distinguishing a different port select address to transfer data from the data I/O bus to the memory data bus 30. The sequencer 60 then provides a register clock signal on the sequencer output line 46 to the base register 18. Thus, the data is latched into the base register 18. This base register value includes the most significant six bits of the memory address. Finally, the data provided in the third write operation is the actual data to be written to the memory 14 at the composite memory address. The sequencer 60 again distinguishes the port select address and provides a write control signal to the memory 14 while the data is present on the memory data bus 30 as provided via the download buffer 54. At the same time, the memory address is provided by the output enabling of both the index and base registers 16, 18. At the conclusion of the third write operation, with the data written to the memory 14, the sequencer 60 withdraws the output enable and write control signals from the download buffer 54, the index and base registers 16, 18 and the memory 14.
Consequently, a generally three-step process of providing a two-part memory address followed by the actual memory data is utilized. Although three separate write cycles are used to write a memory location, such downloading of memory data is relatively infrequent in comparison to the number of input characters transferred in the searching for target data strings.
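From the host processor's side, the three-step download can be modelled as in the following sketch. It is a toy software model, not a description of the circuit: the class and function names are hypothetical, only the three write ports of Table IV are modelled, and the memory is assumed to hold 2048 locations because the composite address is eleven bits wide (six from the base register, five from the index register).

```python
class LexCircuitModel:
    """Toy model of the download ports of Table IV, keyed by (Addrx, Addry)."""

    WRITE_INDEX = (0, 1)
    WRITE_BASE = (1, 0)
    WRITE_MEMORY = (1, 1)

    def __init__(self):
        self.index = 0            # least significant five address bits
        self.base = 0             # most significant six address bits
        self.memory = [0] * 2048  # 2**11 locations

    def write(self, port, data):
        if port == self.WRITE_INDEX:
            self.index = data & 0x1F
        elif port == self.WRITE_BASE:
            self.base = data & 0x3F
        elif port == self.WRITE_MEMORY:
            # Composite address formed from both registers.
            self.memory[(self.base << 5) | self.index] = data
        else:
            raise ValueError("port not modelled in this sketch")


def download(circuit, address, value):
    """Write one memory location using three host write cycles."""
    circuit.write(LexCircuitModel.WRITE_INDEX, address & 0x1F)
    circuit.write(LexCircuitModel.WRITE_BASE, (address >> 5) & 0x3F)
    circuit.write(LexCircuitModel.WRITE_MEMORY, value)
```

For example, address 101 (binary 00001100101) splits into base value 3 and index value 5.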
If speed of downloading is considered important, an alternative embodiment of the present invention may include an eleven-bit-wide address buffer to drive the address inputs of memory 14 directly from the host processor address bus. Only a slight modification to the sequencer need be made to provide enable signals to this buffer. In this case, only one processor cycle would be required to write each memory location.
A significantly higher performance lexical search circuit, constructed in accordance with the present invention and generally indicated by the reference numeral 80, is shown in Fig. 2. Perhaps the most significant aspect of this lexical search circuit 80 is that a translation vector memory (TVM) 84 is implemented separately from a state transition memory (STM) 86. As with the embodiment illustrated in Figure 1, a logic state sequencer 82 is utilized to control the detailed operation of the lexical search circuit 80. The interface of the sequencer 82 to the host processor is substantially as before, except that three address lines are used. By providing an appropriate address on lines 68 and a write signal on the read/write control line 70, the sequencer 82 is directed by the host processor to configure the lexical search circuit 80 for the pending receipt of input characters on the data I/O bus 22. Table V provides the relevant port select addressing of the lexical search circuit 80.
TABLE V
R/W Addrx Addry Addrz
0 x x write input character
x x x read status
1 0 0 latch data for TVM
1 0 1 write latched data to TVM at data I/O address
1 1 0 write to base register
1 1 1 write data to STM
The lex signal provided on the lex control line 62 again is used to indicate the availability of each input data character. Thus, on receipt of the lex control signal, the sequencer 82 provides appropriate signals onto the sequencer control output bus 100 to enable the output of the translation vector memory 84 and the state transition memory 86. The data I/O bus 22 drives the address inputs of the translation vector memory 84. The data from the addressed memory location is driven onto data bus 104, which provides the five-bit low order address inputs to the state transition memory 86. Additionally, the base register 88 is enabled to provide its six-bit wide stored current state number onto the base register output bus 106 so that, in combination with the translated data character as provided on bus 104, a complete eleven-bit address is provided to the state transition memory 86. The data from the addressed memory location is provided onto data bus 108. The next state number thereby provided onto data bus 108 is then latched into the base register 88. Thus, the lexical search circuit 80 is substantially prepared for receipt of the next search data character.
An exception condition, if present, is reported via the most significant bit connected data line 105 to the sequencer 82, via exception status bus 102. Similarly, a match condition, if present, is reported via the most significant bit connected data line 109 to the sequencer 82, via exception status bus 102. In a match status condition the matched target data string index is also provided from the state transition memory 86 onto its data bus 108, and is latched into the base register 88. In accordance with the present invention and similar to the embodiment thereof illustrated in Figure 1, the lexack control signal is conditioned on the logic state of the exception and match status bits. If either is true, then the sequencer 82 withholds provision of the lexack control signal to indicate the need to be serviced by the host processor.
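The data path of Fig. 2 for a single input character can be approximated in software as below. The sketch assumes eight-bit memory words in which bit 7 carries the exception status (TVM) or the match status (STM); the text states only that these are the most significant bits, so the exact bit position, the word width and all names here are assumptions.

```python
STATUS_BIT = 0x80  # assumed position of the most significant (status) bit


def search_step(tvm, stm, base, char_code):
    """Process one input character; return (new_base, lexack, exception, match)."""
    tv_word = tvm[char_code]              # data I/O bus addresses the TVM
    exception = bool(tv_word & STATUS_BIT)
    key = tv_word & 0x1F                  # five-bit translated character

    st_word = stm[(base << 5) | key]      # eleven-bit composite STM address
    match = bool(st_word & STATUS_BIT)
    new_base = st_word & 0x3F             # next state, or target string index

    lexack = not (exception or match)     # withheld when service is needed
    return new_base, lexack, exception, match
```

When `match` is true, the value returned as `new_base` is the matched target string index rather than a next state number, which mirrors the latching of the index into the base register 88.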
As with the embodiment illustrated in Figure 1, a series of host processor write cycles are required to update a single location in the translation vector memory or the state transition memory. If speed of downloading is considered important, an alternative preferred embodiment of the present invention may again further include an eleven-bit-wide address buffer to drive the address inputs of the state transition memory 86 directly from the host processor address bus. Only a slight modification to the sequencer need be made to provide enable signals to this buffer. In this case, only one host processor cycle would be required to write each memory location.
Depending on the processing speed of the host processor, a delay of one or more system clock cycles may be required to allow sufficient time for the sequencer 60 to determine whether to provide the lexack control signal. This delay may result in the insertion of one or more "wait states" by the host processor, thus reducing the processor speed. However, in accordance with a preferred embodiment of the present invention, the lexack control signal may alternatively be provided early by the sequencer 60 given that the previous input character did not result in a match or other exceptional condition. That is, the sequencer 60 will preferably determine from the currently provided input character whether or not the lexack control signal will be provided after receipt of the next succeeding input character. Thus, any exception condition or target data string match condition will pertain to the immediately preceding processed input character. The innate data processing efficiency of the host processor is thereby substantially retained while only incurring the minor requirement that a dummy search data character be passed to the lexical search circuit at the termination of a data string search to determine whether an exception or target data string match condition is pending.
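The one-character lag of the early acknowledgement, and the need for a trailing dummy character, can be illustrated with the following sketch. It is a behavioural model only. The predicate `is_event`, the `service` method and the assumption that the host re-sends a character whose acknowledgement was withheld are all hypothetical details that the text does not specify.

```python
class EarlyAckCircuit:
    """Model in which lexack for a character depends on the PREVIOUS one."""

    def __init__(self, is_event):
        self.is_event = is_event  # hypothetical: True if a character raises
        self.pending = False      # a match or exception condition

    def write(self, ch):
        """Return True if acknowledged; False reports the previous character."""
        if self.pending:
            return False          # lexack withheld; ch is not accepted
        self.pending = self.is_event(ch)
        return True

    def service(self):
        """Host service routine clears the pending condition."""
        self.pending = False


def scan(circuit, data, dummy="\0"):
    """Return positions of characters that raised a condition."""
    events = []
    for pos, ch in enumerate(data + dummy):  # dummy flushes the last status
        if not circuit.write(ch):
            events.append(pos - 1)           # condition belongs to prior char
            circuit.service()
            circuit.write(ch)                # assumption: host retries ch
    return events
```

Without the dummy character, a condition raised by the final character of the data would never be reported in this model.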
Also in accordance with the present invention, the lexical search circuit 10 in particular, as shown in Fig. 1, is adapted to operate with a host processor that can transfer several (typically two or four) bytes of data over a data bus wider than eight bits. To provide for this case, an input multiplexer circuit is provided to route each input character in sequence to the input character bus. In this case the sequencer supplies control signals to the input multiplexer and repeats the appropriate sequence of control signals for each input character to the remainder of the circuit as necessary to process each of the individual input characters.
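In software terms the input multiplexer amounts to splitting each wide bus word into its individual characters, as in this brief sketch. The function name is hypothetical, and the byte order is an assumption, since the text does not say which byte of the word is routed first.

```python
def split_word(word, width_bytes, big_endian=True):
    """Return the input characters of one bus word in processing order."""
    order = "big" if big_endian else "little"
    return list(word.to_bytes(width_bytes, order))
```

For instance, the 32-bit word 0x63617466 yields the four character codes for "catf" when the most significant byte is taken first.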
The preferred embodiments of the present invention described above have assumed that the host processor communicates with the lex circuit by means of an asynchronous handshaking protocol, in which the absence of acknowledgement on the part of the lex circuit provides a convenient means of signaling an exceptional condition to the host processor. In such systems, the lack of acknowledgement causes an exception interrupt (often termed a BUS error) which can be handled by software. It is understood that other methods of interfacing the lex circuit to a host processor are equally possible. The sequencer could, for example, provide an explicit interrupt signal to the host processor. In this case, the lex circuit could be driven by a DMA controller, with the DMA residual count at the time of interrupt indicating which character in the input string had caused the match or other exceptional condition.
Thus, a programmable lexical search circuit, exemplified by a variety of specific embodiments thereof, has been described to illustrate its various advantages, including simplicity of design, low parts count, extreme flexibility, high search throughput capability and adaptability to a wide variety of host processor configurations.
Accordingly, it is to be understood that many modifications and variations of the present invention are possible in light of the above teachings. In particular, it should be recognized that the input data string for searching may come from a wide variety of sources and at various stages of processing within an overall data processing system. For example, the data string may be provided directly from the read circuitry of a high capacity data storage device or from the output of an analog to digital converter operating from an analog data source such as a CATV transmission. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.

Claims

1. Apparatus for finding a data pattern in a data stream, said apparatus comprising: a) means for translating the data values of said data stream to respective key data values; b) means for accumulating the sequence of said key data values; and c) means for detecting the accumulation of the key data sequence corresponding to said data pattern.
2. The apparatus of Claim 1 wherein the data values of said data stream comprise a set of data and wherein said translating means converts said set of data values to a reduced set of key data values.
3. The apparatus of Claim 2 wherein said translating means converts the data values of said set of data values to corresponding key data values of said reduced set of key data values and wherein the number of unique said key data values is less than the number of unique said data values.
4. The apparatus of Claim 3 wherein said apparatus provides for finding any of a plurality of data patterns, said detecting means including means for identifying the data pattern detected.
5. The apparatus of Claim 4 wherein said detecting means further detects the occurrence of an exception condition in the accumulation of said key data values.
6. The apparatus of Claim 5 wherein said apparatus is responsive to a host processor system for the provision of said data stream, and wherein said detecting means includes means for alerting said host processor system to the detection of any said data pattern or said exception condition and for transferring any data pattern identification provided by said identifying means to said host processor system when any data pattern is detected.
7. The apparatus of Claim 6 wherein said apparatus nominally returns an acceptance control signal to said host processor system upon acceptance of each data of said data stream, and wherein said detecting means includes means for withholding said acceptance control signal from being provided to said host processor system when said detecting means detects any said data pattern or said exception condition.
8. Apparatus for finding the occurrence of any one of a plurality of predetermined data patterns in a data stream, said apparatus comprising: a) translate vector means for reducing data received from said data stream to respective key data; b) state array means for maintaining a current state identifier with respect to said key data values as received from said translate vector means, said state array means including a state-transition array representation of said predetermined data patterns; and c) control means for monitoring said current state identifier to detect predetermined state identifiers.
9. The apparatus of Claim 8 wherein said state array means accumulates the current state identifier represented order of said key data values as provided by said translate vector means to said state array means.
10. The apparatus of Claim 9 wherein said data stream comprises a set of data values and wherein said translate vector means reduces said set of data values to a numerically smaller set of key data values.
11. The apparatus of Claim 10 wherein said state array means utilizes said key data value and said current state identifier to select a corresponding state transition from said state-transition array, the selected state transition providing a new current state identifier.
12. The apparatus of Claim 11 wherein said predetermined state values include match and exception state identifiers and wherein said match state identifier is provided as said current state identifier when said state array means accumulates the terminal key data value of a sequence of key data values corresponding to any of said predetermined data patterns.
13. The apparatus of Claim 12 wherein said match state identifier includes an identification value corresponding to the key data value sequence currently accumulated.
14. A lexical search circuit implementing a deterministic finite automaton for detecting the occurrence of any of one or more predetermined strings of data characters in an input stream of said data characters, said circuit comprising: a) means for translating data characters into corresponding key characters, the set of said key characters being numerically smaller than the set of said data characters; b) means for sequentially tracing the order of said key characters as translated from the data character sequence of said input stream, said tracing means providing a string traced signal upon tracing a key character sequence corresponding to any of said predetermined strings; c) means for sequentially controlling the acceptance of data characters from said input stream by said translating means, said translating means and said tracing means, said controlling means receiving said string traced signal from said tracing means.
15. The lexical search circuit of Claim 14 wherein said tracing means further provides an identification of the particular key character sequence traced when said string traced signal is provided and wherein said controlling means is responsive to said string traced signal for reporting the identification of the particular key character sequence traced.
16. The lexical search circuit of Claim 15 wherein said translating means includes means for detecting an exception data character as received from said input stream, said translating means providing an exception character detected signal, and wherein said controlling means receives said exception character detected signal.
17. The lexical search apparatus of Claim 16 wherein said controlling means reports the receipt of said string traced signal along with the identification of the particular key character sequence traced, said controlling means being further responsive to said exception character detected signal for reporting the receipt of said exception character detected signal.
18. The lexical search circuit of Claim 17 wherein said translating means translates each data character received to a corresponding key character, each said key character including a translate status value and a key character value, said translate status value specifying whether said exception character detected signal is to be provided in response to the receipt of any corresponding data character.
19. The lexical search circuit of Claim 18 wherein said translating means includes a programmable data storage means for the address accessible storage and retrieval of said key character values and respective translate status values, each such address corresponding to a respective data character.
20. The lexical search circuit of Claim 19 wherein said tracing means includes a programmable data storage means for the address accessible storage and retrieval of state transition data, each state transition data including a state transition status value and at least either a next state value or a key character sequence index value, each said next state value being a predetermined value establishing the tracing paths of key character sequences corresponding to said strings of data characters, each said key character sequence index value being a predetermined, unique value associated with the terminal key character of a respective key character sequence for identifying that key character sequence, and said state transition value specifying whether said string traced signal is to be provided in response to the state transition accessing of its respective state transition data.
21. The lexical search circuit of Claim 20 wherein said controlling means further provides for the selective addressing and writing of data to said programmable data storage means of said translating means and said programmable data storage means of said tracing means.
22. A programmable lexical search circuit for implementing a deterministic finite automaton to detect the occurrence of any of one or more predetermined strings of data characters from an input stream of data characters sequentially received from a host processor, said circuit comprising: a) translating means for the translation of said input string of data characters to a corresponding stream of key characters, said translating means translating subsets of said data characters to respective ones of said key characters, each said key character including an index value and an exception status value; b) table means, coupled to said translating means to receive said stream of key characters as respective partial table addresses, for storing state data representing a deterministic finite state transition diagram corresponding to said predetermined strings of data characters, said table means including a previous base value register coupled to said table means to provide a previous base value as a partial table address, each said state data including a base value and a match status value, said table means being responsive to the composite address formed by a key character value and said previous base value for selecting a table location within said table means to access, the base value of the location accessed being provided to said base value register to become the new previous base value; and c) sequencer means for controlling the acceptance of data characters by said translating means and for enabling the location accessing within said table means for obtaining the next said state data value from said table means, said sequencer means receiving said exception status and match status values from said translating and table means.
23. The circuit of Claim 22 wherein said base value includes a key character sequence identifier, identifying the key character sequence most recently received where the most recently received key character sequence corresponds to one of said predetermined strings of data characters, said circuit further comprising means, coupled to said host processor, for providing said key character sequence identifier, match status value and exception status value to said host processor. wherein said sequencer means interrupts said host processor on receipt of either a true match status or true exception status value.
PCT/US1987/002286 1986-09-05 1987-09-04 Programmable, character reduction lexical search circuit WO1988001774A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US90452186A 1986-09-05 1986-09-05
US904,521 1986-09-05

Publications (1)

Publication Number Publication Date
WO1988001774A1 true WO1988001774A1 (en) 1988-03-10

Family

ID=25419296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1987/002286 WO1988001774A1 (en) 1986-09-05 1987-09-04 Programmable, character reduction lexical search circuit

Country Status (2)

Country Link
AU (1) AU8032487A (en)
WO (1) WO1988001774A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7240040B2 (en) * 2001-09-12 2007-07-03 Safenet, Inc. Method of generating of DFA state machine that groups transitions into classes in order to conserve memory

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4241402A (en) * 1978-10-12 1980-12-23 Operating Systems, Inc. Finite state automaton with multiple state types
US4285049A (en) * 1978-10-11 1981-08-18 Operating Systems, Inc. Apparatus and method for selecting finite success states by indexing
US4450520A (en) * 1981-03-11 1984-05-22 University Of Illinois Foundation Method and system for matching encoded characters
US4555796A (en) * 1981-12-10 1985-11-26 Nippon Electric Co., Ltd. DP Matching system for recognizing a string of words connected according to a regular grammar

Also Published As

Publication number Publication date
AU8032487A (en) 1988-03-24


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BR DK FI JP KR NO

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE