WO1995023407A1 - Speech recognition - Google Patents

Speech recognition

Info

Publication number
WO1995023407A1
WO1995023407A1 (PCT/GB1995/000374)
Authority
WO
WIPO (PCT)
Prior art keywords
speech, patterns, sub-pattern, context
Application number
PCT/GB1995/000374
Other languages
French (fr)
Inventor
David Gareth Ollason
Original Assignee
British Telecommunications Public Limited Company
Application filed by British Telecommunications Public Limited Company
Priority to AU17139/95A
Publication of WO1995023407A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Abstract

Speech recognition apparatus comprising input means (31) for receiving a speech signal; recognition processing means (341) for comparing the received speech signal with each of a plurality of predetermined sub-patterns, for generating signals indicative of the similarity of a portion of the speech signal to each of a plurality of patterns, each pattern similarity signal depending upon the results of a predetermined set of said sub-pattern comparisons, and for outputting a recognition result signal utilising said similarity signals; and store means (342, 343) for storing parameter data representing said sub-patterns; in which each said set of sub-patterns corresponds uniquely to one of a plurality of context-independent common speech segments (i.e. monophones or phonemes), and the number of said sub-patterns in a said set differs for different said context-independent speech segments.

Description

SPEECH RECOGNITION
This invention relates to speech recognition. Speech recognition apparatus is used as an input means for the control of machines. Such apparatus commonly comprises an input means, for receiving a speech signal directly or indirectly (e.g. via a communications link or network) from a microphone, and recognition processing means which typically divides the speech into successive time segments, extracts some limited set or vector of features from each time segment, and compares the feature vectors with a plurality of patterns, representing or relating in some way to the phonemes, words or phrases to be recognised.
Particularly, this invention is concerned with speech recognition at the sub-word level, in which a relatively small number (on the order of 40-150) of commonly occurring speech segments ("phonemes"), forming building blocks out of which words are constructed, are recognised. This is particularly useful for tasks where a large vocabulary of words is to be recognised; for example, where the speech recogniser is to act as the input means for a telephone directory enquiries database. However, phoneme recognition is complicated by the fact that a series of phonemes are continually articulated in speech, so that the pronunciation of each depends upon its temporal context; that is, the beginning of each phoneme is affected by the end of the preceding phoneme, and its end is affected by the following phoneme. In principle, the total number of such context-dependent versions of phonemes ("allophones") could be extremely large; on the order of (40-150) cubed. However, many combinations are simply physiologically impossible to pronounce, and many others are excluded by rules of syntax, and hence never occur in practice. Of those which do occur, some are much rarer than others (e.g. those involving the letter X). There are many different types of recognition processor. One type employs a so-called "neural network", in which the pattern comparison takes place in the form of some combination operation (e.g. taking the scalar product) between the feature vector and a plurality of weight vectors, which correspond in some manner to the predetermined patterns.
Another type is the so-called "Hidden Markov Model" (HMM) recogniser (described generally in British Telecom Technology Journal, April 1988, vol. 6, no. 2, page 105: Cox, "Hidden Markov models for automatic speech recognition: theory and application"). In HMM recognisers, each word, phoneme, or other speech segment to be recognised is represented as a model which consists of a time-series of "states", to which each feature vector is compared, and "transition probabilities" which describe the probability of a transition from one state to the next, or to itself (i.e. a repetition of the state). The likelihood that the time series of feature vectors representing the speech signal corresponds to a given segment is derived from the level of similarity of the feature vectors to the states and the transition probabilities, using for example the Viterbi algorithm.
In a so-called "continuous density" (CD) HMM recogniser, each state is represented by data defining at least one "mode", or multidimensional Gaussian distribution of feature values. Greater flexibility to model speech segments is given by permitting multi-modal states, represented by data defining multiple such modes. Incoming feature vectors are compared with each mode, to assess the probability that they correspond to each mode. Another type of HMM recogniser is the so-called "Vector Quantisation" recogniser, in which each state is represented by one, or a plurality, of reference feature vectors to which input feature vectors are compared. In the above-described types of recogniser, and other types also, the predetermined patterns (states or weight vectors) are derived from an initial training pool of real speech, the identity of which, segment by segment, is already known, before the recogniser can be used.
In 1993 Eurospeech Conference on Speech Communications & Technology (Berlin), pp. 1575-1578; S. Downey et al, "Experiments in Vocabulary Independent Speech Recognition using phoneme decision trees", the segments of the training pool are initially grouped into sets each corresponding to a phoneme. The sets are then split recursively, until each subset approaches a threshold number (50). Each subset is then processed to yield a HMM state. The effect of providing variable numbers of modes for different states was tested, setting the number of modes to be the greater of a constant or a multiple of the number of members in the subset, so as to be able to provide more modes for subsets where plentiful training data was available. However, it was found that this made the performance worse. In DARPA Speech and Natural Language Workshop, February 1989, pp. 160-166, Philadelphia (US); D. Paul, "The Lincoln Continuous Speech Recognition System: Recent Developments and Results", in one proposal, to save memory storage, a group of modes (all groups having the same size) is provided for each phoneme, and states are provided for each allophone, the states for the allophones of each phoneme being represented merely by different mixtures of the same phoneme set of modes. However, an improvement in performance was only found when the number of modes in the sets was made large, and the training process in this case was extremely computationally expensive.
In the same paper, it is proposed instead to provide a number of modes for each allophone state, and to vary the number of modes per state in dependence upon the larger of a constant or the square root of the number of examples of the allophone in the training pool, so as to be able to provide more modes where plentiful training data was available. This was stated to provide a small improvement in performance.
In Proc. DARPA Workshop, September 1992, Stanford (US); P. Woodland and S. Young, "Benchmark DARPA RM results with the HTK portable HMM toolkit", referencing the above Paul paper, it was again proposed to provide allophone state models, each state having a variable number of modes in dependence upon the (cube root of the) amount of training data for each allophone. However, it is reported that the result of this was degraded performance.
In the same paper, it is proposed instead to provide a recogniser using phoneme states, each having a relatively large number of modes (the same for every state).
None of the above recognisers is entirely satisfactory. In each paper there is a suggestion to match the number of modes for a state to its number of occurrences in the training data. However, in the Downey et al device, rarely-occurring allophones are swallowed up in the state sets, and in the Paul and Woodland & Young devices, there are too few occurrences of rare allophones in the training data to permit the modes for those allophones to be accurately derived.
In one aspect, the present invention provides speech recognition apparatus comprising input means for receiving a speech signal; recognition processing means for comparing the received speech signal with each of a plurality of predetermined sub-patterns, for generating signals indicative of the similarity of a portion of the speech signal to each of a plurality of patterns, each pattern similarity signal depending upon the results of a predetermined set of said sub-pattern comparisons, and for outputting a recognition result signal utilising said similarity signals; and store means for storing parameter data representing said sub-patterns; in which each said set of sub-patterns corresponds uniquely to one of a plurality of context-independent common speech segments (i.e. monophones or phonemes), and the number
of said sub-patterns in a said set differs for different said context-independent speech segments.
By using phoneme-level sub-pattern sets, it is ensured that sufficient data are available to derive each sub-pattern, and by allowing the size of the sets to vary, it is possible to provide sufficient modes to accurately model multimodal phonemes where plentiful training data is available, without either wasting storage space or overcomplicating the models for rare phonemes. We have found that this solution gives a substantial increase in accuracy, for a given total number of sub-patterns.
Each state may itself comprise a phoneme. In a preferred embodiment, however, each state corresponds to an allophone to be recognised, and allophones of a single phoneme share the sub-patterns of that phoneme.
Preferably, the number of sub-patterns allocated to each phoneme relates to the number of contexts (i.e. allophones) in which that phoneme occurs in the training data. The number preferably also relates to the number of occurrences of the phoneme in the training data.
The invention is applicable not only to CD HMM recognisers, but also to other types of recogniser (for example, those described above).
Other aspects, preferred features and embodiments will be apparent from the following description and claims.
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows schematically an application of a recognition system according to the present invention;
Figure 2 is a block diagram showing schematically the elements of a recognition processor forming part of Figure 1 according to an embodiment of the invention;
Figure 3 is a block diagram indicating schematically the components of a classifier forming part of Figure 2 in a first embodiment; Figure 4 is a flow diagram showing schematically the operation of the classifier of Figure 3;
Figure 5 is a block diagram showing schematically the structure of a sequence parser forming part of the embodiment of Figure 2;
Figure 6 shows schematically the content of a field within a store forming part of Figure 5;
Figure 7 is a flow diagram showing schematically the operation of the sequence parser of Figure 5; Figure 8 is a block diagram indicating schematically the components of a classifier forming part of Figure 2 in a second embodiment;
Figure 9 is a flow diagram showing schematically a process for deriving the contents of memories forming part of the classifiers of Figures 3 or 8;
Figure 10 is a data flow diagram indicating schematically the relationship between the contents of the memories of the classifier of Figure 3;
Figure 11 corresponds to Figure 10 for the classifier of figure 8;
Figure 12 is a block diagram showing schematically the structure of a classifier according to a third embodiment of the invention; and
Figure 13 corresponds to Figure 10 for the classifier of the third embodiment.
FIRST EMBODIMENT
Referring to Figure 1, a telecommunications system including speech recognition generally comprises a microphone 1 typically forming part of a telephone handset, a telecommunications network (typically a public switched telecommunications network (PSTN) ) 2, a recognition processor 3, connected to receive a voice signal from the network 2, and a utilising apparatus 4 connected to the recognition processor 3 and arranged to receive therefrom a speech recognition signal, indicating recognition or otherwise of particular words or phrases, and to take action in response thereto. For example, the utilising apparatus 4 may be a directory enquiry database for supplying telephone numbers. In many cases, the utilising apparatus 4 will generate an auditory response to the speaker, transmitted via the network 2 to a loudspeaker 5 typically forming a part of the subscriber handset.
In operation, a speaker speaks into the microphone 1 and an analog speech signal is transmitted from the microphone 1 into the network 2 to the recognition processor 3, where the speech signal is analysed and a signal indicating identification or otherwise of a particular word or phrase is generated and transmitted to the utilising apparatus 4, which then takes appropriate action in the event of recognition of an expected string of speech segments making up a word or phrase.
For example, the recognition processor 3 may be arranged to recognise a telephone subscriber name and, in response, to read out a telephone number
Referring to Figure 2, the recognition processor 3 comprises an input 31 for receiving speech in digital form (either from a digital network or from an analog to digital converter), a frame processor 32 for partitioning the succession of digital samples into a succession of frames of contiguous samples; a feature extractor 33 for generating from a frame of samples a corresponding feature vector; a classifier 34 receiving the succession of feature vectors and operating on each with a plurality of model states, to generate recognition results; a sequencer 35 which is arranged to receive the classification results from the classifier 34 and to determine the predetermined utterance to which the sequence of classifier outputs indicates the
greatest similarity; and an output port 38 at which a recognition signal is supplied indicating the speech utterance which has been recognised.
Frame Generator 32
The frame generator 32 is arranged to receive speech samples at a rate of, for example, 8,000 samples per second, and to form frames comprising 256 contiguous samples, at a frame rate of 1 frame every 16ms. Preferably, each frame is windowed (i.e. the samples towards the edge of the frame are multiplied by predetermined weighting constants) using, for example, a Hamming window to reduce spurious artifacts generated by the frame edges. In a preferred embodiment, the frames are overlapping (for example by 50%) so as to ameliorate the effects of the windowing.
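As an informal illustration of this stage (not part of the patent), a framing and windowing routine along these lines might be sketched as follows; the function name and the use of NumPy are assumptions made purely for the example:

```python
import numpy as np

def frame_generator(samples, frame_len=256, hop=128):
    """Split a speech signal into overlapping, Hamming-windowed frames.

    At 8 kHz, a 128-sample hop yields one 256-sample frame every 16 ms,
    with each frame overlapping its neighbours by 50% as described above.
    """
    window = np.hamming(frame_len)   # tapers the frame edges to reduce spurious artifacts
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len] * window)
    return np.array(frames)
```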
Feature Extractor 33
The feature extractor 33 receives frames from the frame generator 32 and generates, in each case, a set or vector of features. The features may, for example, comprise cepstral coefficients (for example, LPC cepstral coefficients or mel frequency cepstral coefficients as described in "On the Evaluation of Speech Recognisers and Databases using a Reference System", Chollet & Gagnoulet, 1982 Proc. IEEE, p. 2026), or differential values of such coefficients comprising, for each coefficient, the difference between the coefficient and the corresponding coefficient value in the preceding vector, as described in "On the use of Instantaneous and Transitional Spectral Information in Speaker Recognition", Soong & Rosenberg, 1988 IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 36, No. 6, p. 871. Equally, a mixture of several types of feature coefficient may be used.
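The differential coefficients described here are straightforward to compute; the following sketch (the function name and the zero-valued delta for the first frame are assumptions) appends them to the static cepstral vectors:

```python
import numpy as np

def add_delta_features(cepstra):
    """Append differential (delta) coefficients to per-frame cepstral vectors.

    Each delta is the difference between a coefficient and the corresponding
    coefficient in the preceding frame's vector; the first frame, having no
    predecessor, is given zero deltas here.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    deltas = np.zeros_like(cepstra)
    deltas[1:] = cepstra[1:] - cepstra[:-1]
    return np.hstack([cepstra, deltas])   # feature vector = static + differential coefficients
```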
Finally, the feature extractor 33 outputs a frame number, incremented for each successive frame. The frame generator 32 and feature extractor 33 are, in this embodiment, provided by a single suitably programmed digital signal processor (DSP) device (such as the Motorola DSP 56000, or the Texas Instruments TMS C 320) or similar device.
Classifier 34
Referring to Figure 3, in this embodiment, the classifier 34 comprises a classifying processor 341, a state memory 342, and a mode memory 343.
The state memory 342 comprises a state field 3421, 3422, .... , for each of the plurality of speech states. For example, each allophone to be recognised by the recognition processor comprises 3 states, and accordingly 3 state fields are provided in the state memory 342 for each allophone. There may also be provided a state field for noise/silence. Each state field in the state memory 342 includes a pointer field 3421b, 3422b, .... storing a pointer address to a mode set field 361, 362, .... in mode memory 343. Each mode set field comprises a plurality of mode fields 3611, 3612... each comprising data defining a multidimensional Gaussian distribution of feature coefficient values which characterise the state in question.
For example, if there are d different feature coefficients, the data stored in each mode field 3611, 3612... characterising each mode is: a constant C, a set of d feature mean values μ and a set of d feature deviations, σ; in other words, a total of 2d + 1 numbers.
The number Ni of mode fields 3611, 3612, ... in each mode set field 361, 362, ... is variable.
The data stored in each state field 3421, 3422, ... comprises also a weight field 3421a, 3422a, storing a set of Ni weighting values, where Ni is the number of mode fields in the mode set field 361, 362, ... to which the pointer stored in the state field refers.
The classification processor 341 is arranged to read each state field within the memory 342 in turn, and calculate for each, using the current input feature coefficient set, the probability that the input feature set or vector corresponds to the corresponding state. To do so, as shown in Figure 4, the processor 341 is arranged to read the pointer in the state field; to access the mode set field in the mode memory 343 to which it points; and, for each mode field j within the mode set field, to calculate a modal probability Pj as follows:
Pj = Cj exp( -Σi=1..d (xi - μji)² / (2σji²) )

where xi is the ith input feature coefficient and Cj, μji and σji are the constant, mean and deviation stored in mode field j.
Next, the processor 341 calculates the state probability by reading each of the weight values stored in the state field and multiplying each calculated modal probability by the corresponding weight value, and summing the weighted modal probabilities Pj, as follows:
P = Σj Wj Pj

Accordingly, the output of the classification processor 341 is a plurality of state probabilities P, one for each state in the state memory 342, indicating the likelihood that the input feature vector corresponds to each state.
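A minimal sketch of this calculation, assuming the per-mode data are held as (C, mu, sigma) tuples and the weights as an array (neither of which is the patent's actual storage layout), might be:

```python
import numpy as np

def state_probability(x, modes, weights):
    """Weighted multi-modal state probability for one feature vector x.

    Each element of `modes` is a (C, mu, sigma) tuple holding the 2d + 1
    numbers of one mode field; `weights` holds the state's Wj values.
    """
    modal_p = np.array([
        C * np.exp(-np.sum((x - mu) ** 2 / (2.0 * sigma ** 2)))   # Pj = Cj exp(-sum((xi - mu)^2 / 2 sigma^2))
        for (C, mu, sigma) in modes
    ])
    return float(np.dot(weights, modal_p))                        # P = sum_j Wj Pj

# Toy usage with d = 2 feature coefficients and two modes
x = np.array([0.3, -1.2])
modes = [(1.0, np.array([0.0, -1.0]), np.array([1.0, 0.5])),
         (0.8, np.array([1.5, 0.0]), np.array([0.7, 0.7]))]
print(state_probability(x, modes, np.array([0.6, 0.4])))
```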
There are, in this embodiment, three mode set fields for each phoneme (one for each of the first, second and third states for each allophone of the phoneme); a total of, for example, 132 mode set fields. The number of mode fields per mode set field is different for different mode set fields, as discussed below.
There is, in this embodiment, one state field for each of the three states of each allophone. Accordingly, the number of state fields is much larger than the number of mode set fields; for example, 4,752. The pointer data for each state field, corresponding to one state of an allophone, points to the corresponding mode set field relating to the phoneme of which that allophone is a version. Thus, the pointers for the first states of allophones for the vowel "A" preceded by the consonant "C", the consonant "T" or any other speech contexts will all point to the first mode set field for the phoneme "A". Thus, all allophone state models relating to one phoneme are based on the same set of modes, but they differ in that the weighting constants applied to the modal probabilities differ between the allophones, to give different mixtures of the same modes for each.
It will be understood that Figure 4 is merely illustrative of the operation of the classifier processor 341. In practice, the mode probabilities may each be calculated once, and temporarily stored, to be used in the calculation of all the allophone state probabilities relating to the phoneme to which the modes correspond. The classifying processor 341 may be a suitably programmed digital signal processing (DSP) device, and may in particular be the same digital signal processing device as the feature extractor 33.
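The sharing and caching just described can be pictured with a toy data layout; every identifier below is an assumption made for illustration, not the patent's actual record format:

```python
import numpy as np

def modal_probability(x, mode):
    C, mu, sigma = mode
    return C * np.exp(-np.sum((x - mu) ** 2 / (2.0 * sigma ** 2)))

# Mode memory: one mode set per phoneme state (here two modes over 2-D features).
mode_memory = {
    "A_state1": [(1.0, np.array([0.0, 1.0]), np.array([1.0, 1.0])),
                 (1.0, np.array([2.0, -1.0]), np.array([0.5, 1.5]))],
}

# State memory: allophone state fields hold only a pointer into the mode memory
# plus their own weights, so "C-A" and "T-A" share the phoneme "A" modes.
state_memory = {
    "C-A_state1": {"modes": "A_state1", "weights": np.array([0.7, 0.3])},
    "T-A_state1": {"modes": "A_state1", "weights": np.array([0.2, 0.8])},
}

def all_state_probabilities(x):
    cache = {}    # modal probabilities computed once per mode set and reused
    out = {}
    for name, field in state_memory.items():
        key = field["modes"]
        if key not in cache:
            cache[key] = np.array([modal_probability(x, m) for m in mode_memory[key]])
        out[name] = float(field["weights"] @ cache[key])
    return out

print(all_state_probabilities(np.array([0.5, 0.5])))
```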
Sequencer 35
Referring to Figure 5, the sequencer 35 in this embodiment comprises a state sequence memory 352, a parsing processor 351, and a sequencer output buffer 354. Also provided is a state probability memory 353 which stores, for each frame processed, the state probabilities output by the classifier processor 341. The state sequence memory 352 comprises a plurality of state sequence fields 3521, 3522, ..., each corresponding to a word or phrase sequence to be recognised consisting of a string of allophones.
Each state sequence in the state sequence memory 352 comprises, as illustrated in Figure 6, a number of states P1, P2, ... PN (where N is a multiple of 3) and, for each state, two probabilities; a repeat probability (Pi1) and a transition probability to the following state (Pi2). The states of the sequence are a plurality of groups of three states each relating to a single allophone. The observed sequence of states associated with a series of frames may therefore comprise several repetitions of each state Pi in each state sequence model 3521 etc; for example:
Frame Number:  1   2   3   4   5   6   7   8   9   ...  Z   Z+1
State:         P1  P1  P1  P2  P2  P2  P2  P2  P2  ...  Pn  Pn
As shown in Figure 7, the sequencing processor 351 is arranged to read, at each frame, the state probabilities output by the classifier processor 341, and the previous stored state probabilities in the state probability memory 353, and to calculate the most likely path of states to date over time, and to compare this with each of the state sequences stored in the state sequence memory 352.
The calculation employs the well known Hidden Markov Model method described in the above referenced Cox paper. Conveniently, the HMM processing performed by the parsing processor 351 uses the well known Viterbi algorithm. The parsing processor 351 may, for example, be a microprocessor such as the Intel i-486 (TM) microprocessor or the Motorola (TM) 68000 microprocessor, or may alternatively be a DSP device (for example, the same DSP device as is employed for any of the preceding processors).
Accordingly, for each state sequence (corresponding to a word, phrase or other speech sequence to be recognised) a probability score is output by the parser processor 351 at each frame of input speech. For example, the state sequences may comprise the names in a telephone directory. When the end of the utterance is detected, a label signal indicating the most probable state sequence is output from the parsing processor 351 to the output port 38, to indicate that the corresponding name, word or phrase has been recognised.
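For readers unfamiliar with the Viterbi step, the following sketch scores one left-to-right state sequence against a run of frames; the use of log probabilities and the specific array layout are implementation assumptions rather than details taken from the patent:

```python
import numpy as np

def score_sequence(obs_probs, repeat_p, advance_p):
    """Viterbi score of one left-to-right state sequence.

    obs_probs[t, i] is the classifier's probability of state i at frame t;
    repeat_p[i] and advance_p[i] are the self-loop and next-state transition
    probabilities held in the state sequence field for state i.
    """
    T, N = obs_probs.shape
    log_obs = np.log(obs_probs + 1e-300)
    log_rep = np.log(np.maximum(repeat_p, 1e-300))
    log_adv = np.log(np.maximum(advance_p, 1e-300))
    score = np.full(N, -np.inf)
    score[0] = log_obs[0, 0]                      # the path must start in the first state
    for t in range(1, T):
        new = np.full(N, -np.inf)
        for i in range(N):
            best = score[i] + log_rep[i]          # repeat the current state
            if i > 0:
                best = max(best, score[i - 1] + log_adv[i - 1])   # advance from the previous state
            new[i] = best + log_obs[t, i]
        score = new
    return score[-1]                              # the path must end in the final state
```

Scoring every stored sequence this way and taking the highest score at the end of the utterance corresponds to the label selection described above.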
This embodiment could be varied to provide a constant number of mode fields for every mode set field in the mode memory 343. We have tested the performance of this embodiment with and without this variation. By comparison with a recogniser using phoneme models each comprising 3 states each having 9 modes, we found that the first embodiment with all mode set fields comprising a fixed number (9) of mode fields achieved an improvement in accuracy of 6%, and the first embodiment using variable numbers of mode fields achieved a substantially higher improvement in accuracy of 8.6%.
SECOND EMBODIMENT
Referring to Figure 8, the second embodiment of the invention corresponds generally to the first, and further discussion of identical elements is omitted for brevity. In the second embodiment, each state field in the state memory is associated with a phoneme (three states per phoneme), rather than an allophone as in the first embodiment, and there is thus a one-to-one correspondence between state memory fields and mode set fields, rather than a many-to-one correspondence as in the first embodiment. Each state field therefore need not store weighting data, since the mode data stored in the mode fields can instead be stored in appropriately normalised form to achieve the same effect. As in the first embodiment, the number of mode fields differs for different phoneme mode set fields.
The necessary modifications to Figure 4 will be apparent; the state probability is simply the sum, rather than a weighted sum, of the modal probabilities. In fact, since there is a one-to-one correspondence between states and mode sets in this embodiment, it may be preferable to merge the mode set fields with the state fields in single records in a single store or memory device.
We found that, relative to the same reference recogniser as in the first embodiment (a phoneme recogniser in which each phoneme model comprises 3 states each having 9 modes), the second embodiment (with the same total number of mode fields in the mode memory as the reference, but variable numbers in each mode set field) achieved an improvement in recognition accuracy of 2%.
Deriving the contents of the mode and state fields
The process of deriving the contents of the mode and state fields (referred to in this document as "training") will now be explained. General aspects of training a HMM recogniser are well known to the skilled person (see, for example, Huang, Ariki and Jack; "HMMs for Speech Recognition", Edinburgh University Press (1990)), and form no part of the present invention; accordingly, they will be omitted. Further information may be found in, for example, "HTK: Hidden Markov Model Toolkit Reference Manual - Version 1.4"; S. J. Young
(1992), Cambridge University Engineering Department, Cambridge, England. The HTK Toolkit itself, available from Entropic Research Lab. Inc., 600 Pennsylvania Ave. S.E., Suite 202, Washington D.C. 20003, U.S. (the subject of the Manual, and referred to in the above-cited Woodland & Young paper), may be used to perform the training.
Referring to Figure 9, initially a large database comprising a training pool of prerecorded speech segments, separately labelled to indicate correspondence to allophones, is required. Firstly, the speech segments for all allophones having a common phoneme are taken and used to train a 3 state model for that phoneme.
The number of modes making up the states of each phoneme is set in accordance with, firstly, the number T of speech segments relating to that phoneme in the training pool and, secondly, the number V of allophones of that phoneme in the training pool.
For example, the number of modes is set as:

M = k · V · √T,

where k is a constant.
The exact form of the function of T is not critical; we have found that the cube root could also be used. The function is, in any case, nonlinear and such as to increase less than linearly with increases in T, so as to compress somewhat the range of set sizes.
The number V may be taken as the number of different earlier phonemes after which the phoneme in question is found in the training pool ("Left biphones"), or as the number of successor phonemes before which it occurs ("Right biphones"), or as the total number of three-phoneme combinations of which it forms the central element ("Triphones"), or some other measure of the variability of the context in which it occurs.
For example, it was found that, using one training pool of speech data, the phoneme "T" occurred 28,060 times, in the context of 1,028 triphones, whereas the phoneme "XX" occurred twelve times in the context of 8 triphones; values of T and V for other phonemes were spread between these wide extremes. Having trained 3 state models each comprising M modes for a first phoneme, the process is repeated for the other phonemes with correspondingly different numbers of modes. The models may be refined using iterative re-training, using, for example, the known Baum-Welch re-training algorithm. This process is sufficient to define the data stored in memory in the second embodiment.
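The allocation rule is simple to express in code; in the sketch below the scaling constant k and the minimum of one mode are assumptions chosen only so that the two extreme phonemes quoted above give plausible numbers:

```python
import math

def modes_for_phoneme(T, V, k=2e-4, root=2):
    """Number of modes allocated to a phoneme: M = k * V * T**(1/root).

    T is the number of training segments for the phoneme and V the number of
    contexts (allophones) in which it occurs; root=2 gives the square root of
    T, root=3 the cube root mentioned as an alternative.
    """
    return max(1, round(k * V * T ** (1.0 / root)))

print(modes_for_phoneme(28060, 1028))   # frequent phoneme "T": many modes
print(modes_for_phoneme(12, 8))         # rare phoneme "XX": a single mode
```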
For the first embodiment, the next step is, for each phoneme, to copy the phoneme state data V times, so as to create individual allophone state data for each allophone of the phoneme. The next step is to cause the allophone states (all initially identical to the phoneme) each to individually represent their respective allophones, and to achieve this an iterative re-training step (e. g. Baum-Welch retraining) is used to recalculate the mode weights W for each allophone and, optionally, also the transition probabilities. The modes calculated for the phoneme are kept common to all allophones; the mode parameters may be kept constant during the retraining, or they may be readjusted with or after the derivation of the allophone mode weights. The mode data is then stored in the mode memory 343, the weight data in the state memory 342, and the state transition probability data in the state sequence memory 352.
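As a rough illustration of the copy step (the record format and names below are assumptions, and the Baum-Welch retraining pass itself is not shown):

```python
import copy

def make_allophone_states(phoneme_states, allophone_names):
    """Create per-allophone state data by copying trained phoneme state data.

    `phoneme_states` is taken to be a list of per-state records holding the
    mode weights; the modes themselves stay in the shared mode memory and are
    not copied. Every allophone starts identical to the phoneme, and a later
    retraining pass adjusts each copy's weights to fit its own context.
    """
    return {name: copy.deepcopy(phoneme_states) for name in allophone_names}

# e.g. the contexts of phoneme "A" observed in the training pool
allophone_models = make_allophone_states(
    [{"weights": [0.5, 0.3, 0.2]} for _ in range(3)],   # three states per phoneme
    ["C-A", "T-A", "S-A"],
)
```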
Figure 10 illustrates the relationship between the states and modes derived in the first embodiment, for a single phoneme. Figure 11 correspondingly illustrates the relationship for the second embodiment.
THIRD EMBODIMENT
In the third embodiment the recogniser is similar to that of the first embodiment, but the number of allophone states (having separate transition probabilities) is reduced during the training process by using a clustering algorithm to replace several similar states by a single state.
Referring to Figure 12, in this embodiment, the classifier comprises an allophone state memory 342, a clustered state memory 347, and a mode memory 343. Each state field in the allophone state memory 342 comprises merely a pointer to a clustered state field in the clustered state memory, each of which comprises a set of weight values and a pointer to a mode set field in the mode memory. The content of the mode set fields in this embodiment is the same as in the first embodiment.
In operation, to calculate the state probability for a given allophone state, the classifier processor 341 reads the clustered state field pointed to from the allophone state field, reads the mode fields pointed to from the clustered state field, and calculates the state probability for the clustered state, which is then used for the allophone state probability.
In this embodiment, three state fields are provided for each allophone, but the number of clustered state fields per phoneme is less than this. The number of mode set fields may, as before, be three per phoneme, and the number of mode fields varies for different mode set fields.
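The extra level of indirection can be illustrated as follows; every name here is hypothetical and chosen only to show that several allophone state fields can point at one clustered state field:

```python
# Allophone state fields hold only a pointer to a clustered state field ...
allophone_state_memory = {
    "C-A_state1": "A_cluster_0",
    "T-A_state1": "A_cluster_0",      # two similar allophone states merged into one cluster
    "S-A_state1": "A_cluster_1",
}
# ... and each clustered state field holds the weights plus a pointer to a mode set field.
clustered_state_memory = {
    "A_cluster_0": {"weights": [0.6, 0.4], "mode_set": "A_state1_modes"},
    "A_cluster_1": {"weights": [0.1, 0.9], "mode_set": "A_state1_modes"},
}

def lookup(allophone_state):
    """Follow the pointer chain from an allophone state to its weights and mode set."""
    cluster = clustered_state_memory[allophone_state_memory[allophone_state]]
    return cluster["weights"], cluster["mode_set"]

print(lookup("C-A_state1"))   # ([0.6, 0.4], 'A_state1_modes')
print(lookup("T-A_state1"))   # identical: the clustered state's probability is computed once
```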
To derive the clustered state data and transition probabilities in this embodiment, the initial steps of the training process of the first embodiment are performed, to create phoneme models, copy these a plurality of times to create identical allophone models, and then individually train the weights of the allophone models to recognise their respective allophones, using the Baum-Welch retraining algorithm.
Next, the mode weights and transition probabilities of all allophones of one phoneme are clustered to form a smaller number of clustered states, using for example the clustering tool provided in the above-mentioned HMM Toolkit.
Figure 13 illustrates the relationship between the states and modes in this embodiment. The mode data are then stored in the mode memory 343 and the weight data in the clustered state memory 347, and appropriate pointers are entered in the allophone state memory, (to refer each allophone state field to the clustered state field which replaces it) and clustered state memory (to refer to the corresponding mode set field).
This embodiment has been found capable of achieving almost as high a degree of recognition accuracy as the first, and with almost twice the processing speed, depending upon the level of clustering (the greater the reduction in the number of states, the faster the recognition, at some cost of recognition accuracy). We have found that reductions of 25-50% in the number of states still achieve some advantages of the first embodiment, with a reduction in processing time of 20-40%.
Other Aspects and Embodiments of the Invention
It will be clear from the foregoing that the described embodiments are merely examples of the invention, which is accordingly not limited thereby. In particular, various novel features of the described embodiments each have separate advantages, whether explicitly described above or clear to the skilled person herefrom, and protection is sought for each such advantageous feature in isolation and for any advantageous combination of such features. The use of a Gaussian, continuous density classifier has been described here, but a classifier using vector quantisation could equally be employed. Similarly, other types of sequence processing (e.g. Dynamic Time Warp) could be employed. Whilst only a 'repeat' probability and a 'transition' probability have been discussed, probabilities for transitions to next-but-one and next-but-two (etc) states (skipping transitions) are well known and could equally be employed. Likewise, the numbers of states mentioned above are purely exemplary. Whilst certain arrangements of separate memory devices and processors have been illustrated, it will be realised that other arrangements are possible; for example, a single memory device (suitably partitioned) and a single processor (operating different stored programs) could be used.
Whilst particular embodiments have been described in detail, it will be realised that other embodiments are realisable using digital or analog hardware, suitably constructed or programmed. Although speech recognition has been described, use of the same techniques in relation to other types of recognition (for example speaker recognition or verification) is not excluded.
The scope of protection is intended to encompass all constructions within the scope of the claims appended hereto, together with any equivalent constructions achieving substantially the same result or achieving a substantially different result using the same principle or operation.

Claims

CLAIMS
1. Speech recognition apparatus comprising input means (31) for receiving a speech signal; recognition processing means (341) for comparing the received speech signal with each of a plurality of predetermined sub-patterns, for generating signals indicative of the similarity of a portion of the speech signal to each of a plurality of patterns, each pattern similarity signal depending upon the results of a predetermined set of said sub-pattern comparisons, and for outputting a recognition result signal utilising said similarity signals; and store means (342, 343) for storing parameter data representing said sub-patterns; in which each said set of sub-patterns corresponds uniquely to one of a plurality of context-independent common speech segments (i. e. monophones or phonemes), and the number of said sub-patterns in a said set differs for different said context-independent speech segments.
2. Apparatus according to claim 1, in which each said pattern relates uniquely to one of said context-independent common speech segments.
3. Apparatus according to claim 1, in which each of said patterns relates to a context-dependent speech segment (e. g. a biphone, triphone or allophone) comprising a version of one of said context-independent speech segments, and in which said recognition processing means (341) is arranged to generate a plurality of said pattern similarity signals from each said set, each of said plurality depending differently upon the results of said sub-pattern comparisons.
4. Speech recognition apparatus comprising input means (31) for receiving a speech signal; recognition processing means (341) for comparing the received speech signal with each of a plurality of predetermined sub-patterns, for generating signals indicative of the similarity of a portion of the speech signal to each of a plurality of patterns, each pattern similarity signal depending upon the results of a predetermined set of said sub-pattern comparisons, and for outputting a recognition result signal utilising said similarity signals; and store means (343) for storing parameter data representing said sub-patterns; in which the number of said sub-patterns is different in different said sets, and said recognition processing means (341) is arranged to generate a plurality of said pattern similarity signals from each said set, each of said plurality depending differently upon the results of said sub-pattern comparisons.
5. Apparatus according to claim 3 or claim 4, in which said recognition processing means (341) is arranged to generate a plurality of clustered similarity signals, greater in number than said sets but smaller in number than said patterns, from said sets, and to form a plurality of said pattern similarity signals from at least some of said clustered similarity signals.
6. Apparatus according to claim 4 or claim 5, in which each said set of sub-patterns corresponds uniquely to one of a plurality of context-independent common speech segments (i. e. monophones or phonemes) and each of said patterns relates to a context-dependent speech segment (e. g. a biphone, triphone or allophone) comprising a version of one of said context-independent speech segments.
7. Apparatus according to any of claims 1, 2, 3 or 6, in which the size of the sets is related to the number of contexts in which the context-independent speech segment occurs in common speech.
8. Apparatus according to any of claims 1, 2, 3, 6 or 7, in which the size of the sets is related to the number of contexts in which the context-independent speech segment occurs in the speech data from which the sub-patterns were derived.
9. Apparatus according to any preceding claim, in which said recognition processing means (341) comprises a Hidden Markov Model speech recogniser.
10. Apparatus according to claim 9, in which said patterns comprise observation states of a state sequence.
11. Apparatus according to any preceding claim, in which said sub-patterns comprise continuous probability distribution modes.
12. A Hidden Markov Model speech recogniser, in which sets of modes each correspond uniquely to monophones, and the number of modes differs between different sets.
13. A Hidden Markov Model speech recogniser, in which stored sets of varying numbers of modes are each accessed to calculate multiple state probabilities.
14. A method of manufacturing a speech recogniser, for example according to claim 7 or claim 8, comprising a step of deriving, from pre-stored speech data, a set of sub-patterns for each context-independent pattern to be recognised, the number of said sub-patterns in the sets being determined in dependence upon the number of different contexts in which the respective context-independent patterns occur in said speech data.
15. A method according to claim 14, in which the number of sub-patterns in said sets is set in dependence also upon the number of occurrences of the pattern in said speech data.
16. A method of manufacturing a speech recogniser, for example according to claim 5, comprising deriving a plurality of sets of sub-patterns, and, for each set, a plurality of pattern output weights for weighting said sub-patterns to enable the generation of a plurality of different pattern output signals from each set; and deriving, from said weights, a set of clustered pattern weights smaller in number than said pattern output weights.
PCT/GB1995/000374 1994-02-23 1995-02-23 Speech recognition WO1995023407A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU17139/95A AU1713995A (en) 1994-02-23 1995-02-23 Speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP94301265 1994-02-23
EP94301265.8 1994-02-23

Publications (1)

Publication Number Publication Date
WO1995023407A1 (en) 1995-08-31

Family

ID=8217578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1995/000374 WO1995023407A1 (en) 1994-02-23 1995-02-23 Speech recognition

Country Status (2)

Country Link
AU (1) AU1713995A (en)
WO (1) WO1995023407A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7453891B2 (en) 2001-04-24 2008-11-18 Cisco Technology Inc. Clusters of devices, softwares and methods for improved handling of a gatekeeper load in VoIP communication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A.KANNAN ET AL.: "A comparison of trajectory and mixture modelling in segment-based word recognition", ICASSP-93, vol. II, 27 April 1993 (1993-04-27), MINNEAPOLIS, pages 327 - 330 *
S.DOWNEY ET AL.: "Experiments in vocabulary independent speech recognition using phoneme decision trees", EUROSPEECH 93, 1993, pages 1575 - 1578 *

Also Published As

Publication number Publication date
AU1713995A (en) 1995-09-11

Similar Documents

Publication Publication Date Title
US5033087A (en) Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system
US4977599A (en) Speech recognition employing a set of Markov models that includes Markov models representing transitions to and from silence
US4980918A (en) Speech recognition system with efficient storage and rapid assembly of phonological graphs
EP1012827B1 (en) Speech recognition system for recognizing continuous and isolated speech
US4759068A (en) Constructing Markov models of words from multiple utterances
US5581655A (en) Method for recognizing speech using linguistically-motivated hidden Markov models
US5333236A (en) Speech recognizer having a speech coder for an acoustic match based on context-dependent speech-transition acoustic models
EP0619911B1 (en) Children's speech training aid
US4833712A (en) Automatic generation of simple Markov model stunted baseforms for words in a vocabulary
Hon et al. Automatic generation of synthesis units for trainable text-to-speech systems
US6553342B1 (en) Tone based speech recognition
JPH11272291A (en) Phonetic modeling method using acoustic decision tree
JPH05341797A (en) Device and method for contex-dependent type speech recognition
EP1668628A1 (en) Method for synthesizing speech
JPH06332495A (en) Equipment and method for speech recognition
Kubala et al. Comparative experiments on large vocabulary speech recognition
Hansen et al. Robust speech recognition training via duration and spectral-based stress token generation
Schmid et al. Automatically generated word pronunciations from phoneme classifier output
US10643600B1 (en) Modifying syllable durations for personalizing Chinese Mandarin TTS using small corpus
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
WO1995023407A1 (en) Speech recognition
Bamberg et al. Phoneme-in-context modeling for Dragon’s continuous speech recognizer
Rose et al. Task independent wordspotting using decision tree based allophone clustering
Jouvet et al. Hypothesis dependent threshold setting for improved out-of-vocabulary data rejection
Blomberg et al. Creation of unseen triphones from diphones and monophones using a speech production approach

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA CN JP KR NZ

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase