CN100538701C

CN100538701C - Method for identifying a media entity from a media sample

Info

Publication number: CN100538701C
Application number: CNB2006101725723A
Authority: CN
Inventors: Avery L. C. Wang; Julius O. Smith III
Original assignee: Landmark Digital Services LLC
Current assignee: Shazam Investments Ltd
Priority date: 2000-07-31
Filing date: 2001-07-26
Publication date: 2009-09-09
Anticipated expiration: 2021-07-26
Also published as: CN1996307A

Abstract

The invention provides a method for identifying a media entity from a media sample. The method comprises: generating correspondences between landmarks of the media sample and corresponding landmarks of the media entity to be identified, wherein the landmarks of the media sample and the corresponding landmarks of the media entity have equivalent fingerprints; and identifying the media sample and the media file if a plurality of the correspondences have a linear relationship defined by $\mathrm{landmark}^{*}_{n} = m \cdot \mathrm{landmark}_{n} + \mathrm{offset}$, where $\mathrm{landmark}_{n}$ is a sample landmark, $\mathrm{landmark}^{*}_{n}$ is the file landmark corresponding to $\mathrm{landmark}_{n}$, and $m$ is the slope.

Description

Method for identifying a media entity from a media sample

This application is a divisional of invention patent application No. 01813565.X, filed on July 26, 2001 and entitled "System and method for recognizing sound and music signals in high noise and distortion".

Technical field

The present invention relates generally to content-based information retrieval. More particularly, it relates to the recognition of audio signals, including sound or music, that are highly distorted or contain a high level of noise. It also relates in particular to a method for identifying a media entity from a media sample.

Background art

There is a growing need for automatic recognition of music or other audio signals generated from a variety of sources. For example, owners of copyrighted works or advertisers are interested in obtaining data on how often their material is broadcast. Music tracking services provide playlists of major radio stations in large markets. Consumers would like to identify songs or advertisements heard in a broadcast so that they can purchase new and interesting music or other products and services. Any kind of continuous or on-demand sound recognition is inefficient and labor-intensive when performed by humans. An automated method of recognizing music or sound would therefore provide significant benefit to consumers, artists, and a variety of industries. As the music distribution model shifts from store purchases to downloading over the Internet, it is quite feasible to link computer-implemented music recognition directly with Internet purchasing and other Internet-based services.

Traditionally, songs played on the radio have been identified by matching the radio station and the time at which a song was played against playlists provided by the radio stations or by third-party sources. This method is inherently limited to radio stations for which such information is available. Other methods rely on embedding inaudible codes in the broadcast signal. The embedded signals are decoded at the receiver to extract identifying information about the broadcast signal. The disadvantage of this approach is that special decoding equipment is required to identify the signals, and only songs carrying embedded codes can be identified.

Any large-scale audio recognition requires some form of content-based audio retrieval, in which an unidentified broadcast signal is compared with a database of known signals to identify similar or identical database signals. Note that content-based audio retrieval differs from existing audio retrieval by web search engines, in which only the metadata text surrounding or associated with audio files is searched. Also note that although speech recognition is useful for converting voiced signals into text that can then be indexed and searched using well-known techniques, it is not applicable to the large majority of audio signals that contain music and sounds. In some ways, audio information retrieval is analogous to the text-based information retrieval provided by search engines. In other ways, the analogy fails: audio signals lack easily identifiable entities, such as words, that provide identifiers for searching and indexing. As such, current audio retrieval schemes index audio signals by computed perceptual characteristics that represent various qualities or features of the signal.

Content-based audio retrieval is typically performed by analyzing a query signal to obtain a number of representative characteristics, and then applying a similarity measure to the derived characteristics to locate the database files most similar to the query signal. The similarity of the retrieved objects is necessarily a reflection of the perceptual characteristics selected. A number of content-based retrieval methods are available in the art. For example, U.S. Pat. No. 5,210,820, issued to Kenyon, discloses a signal recognition method in which received signals are processed and sampled to obtain signal values at each sampling point. Statistical moments of the sampled values are then computed to generate a feature vector that can be compared with identifiers of stored signals to retrieve similar signals. U.S. Pat. Nos. 4,450,531 and 4,843,562, both issued to Kenyon et al., disclose similar broadcast information classification methods in which cross-correlations are computed between unidentified signals and stored reference signals.

A system for retrieving audio documents by acoustic similarity is disclosed in J. T. Foote, "Content-Based Retrieval of Music and Audio," in C.-C. J. Kuo et al., editors, Multimedia Storage and Archiving Systems II, Proc. of SPIE, volume 3229, pages 138-147, 1997. Feature vectors are calculated by parameterizing each audio file into mel-scaled cepstral coefficients, and a quantization tree is grown from the parameterization data. To perform a query, an unknown signal is parameterized to obtain feature vectors, which are then sorted into leaf nodes of the tree. A histogram is collected for each leaf node, thereby generating an N-dimensional vector representing the unknown signal. The distance between two such vectors indicates the similarity between two audio files. In this method, a supervised quantization scheme learns distinguishing audio features, while ignoring unimportant variations, based on the classes into which the training data have been assigned by a human. Depending on the classification system, different acoustic features are chosen to be important. Thus, rather than merely recognizing music, this method is better suited to finding similarities between songs and sorting music into classes.

U.S. Pat. No. 5,918,223, issued to Blum et al., discloses a method for content-based analysis, storage, retrieval, and segmentation of audio information. In this method, a number of acoustic features, such as loudness, bass, pitch, brightness, bandwidth, and mel-frequency cepstral coefficients, are measured at periodic intervals of each file. Statistical measurements of the features are taken and combined to form a feature vector. Audio data files in the database are retrieved based on the similarity of their feature vectors to the feature vector of an unidentified file.

A key problem with all of the above prior-art audio recognition methods is that they tend to fail when the signals to be recognized are subject to linear and nonlinear distortion caused by, for example, background noise, transmission errors and dropouts, interference, band-limited filtering, quantization, time-warping, and voice-quality digital compression. In prior-art methods, when a distorted sound sample is processed to obtain acoustic features, only a fraction of the features derived for the original recording are found. The resulting feature vector is therefore not very similar to the feature vector of the original recording, and correct recognition is unlikely. There remains a need for a sound recognition system that performs well under conditions of high noise and distortion.

Another problem with prior-art methods is that they are computationally intensive and do not scale well. Real-time recognition is therefore not possible using prior-art methods with large databases. In such systems, it is not feasible to have a database of more than a few hundred or a few thousand recordings. Search time in prior-art methods tends to grow linearly with the size of the database, making scaling to millions of sound recordings economically infeasible. The methods of Kenyon also require large amounts of specialized digital signal processing hardware.

Existing commercial methods often place strict requirements on the input sample before recognition can be performed. For example, they require the entire song, or at least 30 seconds of the song, to be sampled, or they require the song to be sampled from the beginning. They also have difficulty recognizing multiple songs mixed together in a single stream. All of these disadvantages make prior-art methods infeasible for use in many practical applications.

Summary of the invention

Accordingly, it is a primary object of the present invention to provide a method for recognizing an audio signal subject to a high level of noise and distortion.

It is a further object of the invention to provide a recognition method that can be performed in real time based on only a few seconds of the signal to be identified.

It is another object of the invention to provide a recognition method that can recognize sounds based on samples taken from almost anywhere within the sound, not just at the beginning.

It is an additional object of the invention to provide a recognition method that does not require sound samples to be coded or correlated with particular radio stations or playlists.

It is a further object of the invention to provide a recognition method that can recognize each of multiple songs mixed together in a single stream.

It is another object of the invention to provide a sound recognition system in which the unknown sound can be provided to the system from any environment by virtually any known method.

According to one aspect of the present invention, there is provided a method for identifying a media entity from a media sample, comprising: generating correspondences between landmarks of the media sample and corresponding landmarks of the media entity to be identified, wherein the landmarks of the media sample and the corresponding landmarks of the media entity have equivalent fingerprints; and identifying the media sample and the media file if a plurality of the correspondences have a linear relationship defined by $\mathrm{landmark}^{*}_{n} = m \cdot \mathrm{landmark}_{n} + \mathrm{offset}$, where $\mathrm{landmark}_{n}$ is a sample landmark, $\mathrm{landmark}^{*}_{n}$ is the file landmark corresponding to $\mathrm{landmark}_{n}$, and $m$ is the slope.
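As an illustrative worked example (the numbers are hypothetical and do not come from the patent), suppose three sample landmarks at 1.0 s, 2.5 s and 4.0 s have equivalent fingerprints at file landmarks 31.0 s, 32.5 s and 34.0 s. All three correspondences then satisfy the relation above with $m = 1$ and $\mathrm{offset} = 30$ s:

```latex
\begin{aligned}
\mathrm{landmark}^{*}_{1} &= 1 \cdot 1.0 + 30 = 31.0 \\
\mathrm{landmark}^{*}_{2} &= 1 \cdot 2.5 + 30 = 32.5 \\
\mathrm{landmark}^{*}_{3} &= 1 \cdot 4.0 + 30 = 34.0
\end{aligned}
```

Here the offset can be read as the position in the file at which the sample begins. A sample that has been uniformly sped up or slowed down would still fall on a straight line, but with $m \neq 1$.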

These objects and advantages are attained by a method for recognizing a media sample, such as an audio sample, given a database index of a large number of known media files. The database index contains fingerprints representing features at particular locations of the indexed media files. The unknown media sample is identified with the media file in the database (the winning media file) whose relative locations of fingerprints most closely match the relative locations of fingerprints of the sample. In the case of audio files, the time evolution of fingerprints of the winning file matches the time evolution of fingerprints in the sample.

The method is preferably implemented in a distributed computer system and comprises the following steps: computing a set of fingerprints at particular locations of the sample; locating matching fingerprints in the database index; generating correspondences between locations in the sample and locations in the file that have equivalent fingerprints; and identifying media files for which a significant number of the correspondences are substantially linearly related. The file with the largest number of linearly related correspondences is deemed the winning media file. One method of identifying files with a large number of correspondences is to perform the equivalent of scanning for a diagonal line in the scatter plot generated from the pairs of correspondences. In one embodiment, identifying the media files with a large number of linear correspondences involves searching only a first subset of the media files. Files in the first subset have a higher probability of being identified than files not in the first subset. The probability of identification is preferably based on empirical frequency or recency measures of previous identifications, along with a priori projections of identification frequency. If no media file is identified in the first subset, a second subset containing the remaining files is searched. Alternatively, the files can be ranked by probability and searched in order of the ranking. The search is terminated when a file is located.
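The subset-by-subset search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `search_by_priority`, the `score_file` callback (assumed to return the number of linearly related correspondences for a file) and the `threshold` parameter are all hypothetical.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence


def search_by_priority(
    subsets: Sequence[Iterable[Hashable]],
    score_file: Callable[[Hashable], int],
    threshold: int,
) -> Optional[Hashable]:
    """Search subsets in order and stop at the first one that yields a match."""
    for subset in subsets:
        best_id, best_score = None, -1
        for file_id in subset:
            score = score_file(file_id)
            if score > best_score:
                best_id, best_score = file_id, score
        if best_id is not None and best_score >= threshold:
            return best_id  # a file was located: terminate the search
    return None  # nothing in any subset reached the threshold


# Hypothetical scores: linearly related correspondences found per file.
scores = {"hit_single": 3, "recent_release": 42, "rare_b_side": 57}
likely = ["hit_single", "recent_release"]  # first subset: higher prior probability
remaining = ["rare_b_side"]                # second subset: all other files
winner = search_by_priority([likely, remaining], scores.get, threshold=20)
print(winner)
```

Note the trade-off this design makes: once a file in the first subset passes the threshold, files in later subsets are never scored, even if one of them would have scored higher.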

Preferably, the particular locations within the sample are computed reproducibly in dependence on the sample. Such reproducibly computable locations are called "landmarks." Fingerprints are preferably numerical values. In one embodiment, each fingerprint represents a number of features of the media sample at each location, or offset slightly from the location.

The method is particularly useful for recognizing audio samples, in which case the particular locations are time points within the audio sample. These time points occur at, for example, local maxima of spectral Lp norms of the audio sample. Fingerprints can be computed by any analysis of the audio sample, and are preferably invariant to time stretching of the sample. Examples of fingerprints include spectral slice fingerprints, multi-slice fingerprints, linear predictive coding (LPC) coefficients, cepstral coefficients, and frequency components of spectrogram peaks.
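A minimal sketch of picking landmarks at local maxima of a spectral Lp norm is given below. It assumes the magnitude spectrum of each analysis frame has already been computed; the function names, the toy spectra and the exact local-maximum rule are illustrative assumptions, while the default p = 4 follows the L4 norm mentioned for Fig. 5.

```python
from typing import List, Sequence


def lp_norm(values: Sequence[float], p: float) -> float:
    """Lp norm of one frame's magnitude spectrum."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def landmark_frames(spectra: Sequence[Sequence[float]], p: float = 4.0) -> List[int]:
    """Return indices of frames whose Lp norm is a local maximum."""
    norms = [lp_norm(frame, p) for frame in spectra]
    return [
        i
        for i in range(1, len(norms) - 1)
        if norms[i] > norms[i - 1] and norms[i] >= norms[i + 1]
    ]


# Toy magnitude spectra, one list per analysis frame (hypothetical values).
spectra = [
    [0.1, 0.1, 0.1],
    [0.9, 0.2, 0.1],  # energetic frame: local maximum of the norm
    [0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1],
    [0.3, 0.8, 0.4],  # second local maximum
    [0.1, 0.2, 0.1],
]
landmarks = landmark_frames(spectra)
print(landmarks)  # [1, 4]
```

Multiplying each frame index by the frame hop duration would turn these indices into time points. Because the maxima depend only on the signal content, repeating the analysis on the same signal yields the same landmarks.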

The present invention also provides a system for implementing the above method, comprising: a landmarking object for computing the particular locations; a fingerprinting object for computing the fingerprints; a database index containing the file locations and fingerprints for the media files; and an analysis object. The analysis object implements the method by locating matching fingerprints in the database index, generating correspondences, and analyzing the correspondences to select the winning media file.

Also provided is a program storage device accessible by a computer, tangibly embodying a program of instructions executable by the computer to perform the steps of the above method.

Additionally, the invention provides a method for creating an index of a number of audio files in a database, comprising the following steps: computing a set of fingerprints at particular locations of each file; and storing the fingerprints, locations, and identifiers of the files in a memory. A corresponding fingerprint, location, and identifier are associated in the memory to form a triplet. Preferably, the locations, which can be time points within the audio file, are computed in dependence on the file and are reproducible. For example, the time points can occur at local maxima of spectral Lp norms of the audio file. In some cases, each fingerprint, which is preferably a numerical value, represents a number of features of the file near the particular location. Fingerprints can be computed from any analysis or digital signal processing of the audio file. Examples of fingerprints include spectral slice fingerprints, multi-slice fingerprints, linear predictive coding coefficients, cepstral coefficients, frequency components of spectrogram peaks, and linked spectrogram peaks.
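The triplet storage described above might look like the following sketch. The in-memory layout (a dictionary keyed by fingerprint), the name `build_index` and the sample data are assumptions made for illustration; the text only requires that fingerprint, location and identifier be associated with one another.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (fingerprint, landmark time in seconds, file identifier)
Triplet = Tuple[int, float, str]


def build_index(
    files: Dict[str, List[Tuple[float, int]]]
) -> Dict[int, List[Tuple[float, str]]]:
    """Turn per-file (landmark, fingerprint) pairs into a fingerprint-keyed index."""
    triplets: List[Triplet] = []
    for sound_id, pairs in files.items():
        for landmark, fingerprint in pairs:
            triplets.append((fingerprint, landmark, sound_id))
    index: Dict[int, List[Tuple[float, str]]] = defaultdict(list)
    for fingerprint, landmark, sound_id in sorted(triplets):
        index[fingerprint].append((landmark, sound_id))
    return dict(index)


# Hypothetical landmark/fingerprint pairs for two files.
files = {
    "song_a": [(0.5, 0xA1), (1.2, 0xB2), (2.0, 0xA1)],
    "song_b": [(0.3, 0xB2), (1.7, 0xC3)],
}
index = build_index(files)
print(index[0xB2])  # [(0.3, 'song_b'), (1.2, 'song_a')]
```

Keying the index by fingerprint means that a single lookup returns every file location at which that fingerprint was computed, which is the association the matching step needs.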

Finally, the invention provides methods for identifying audio samples that incorporate time-stretch-invariant fingerprints, and methods for various hierarchical searches.

Description of drawings

Fig. 1 is a flow diagram of a method of the invention for recognizing a sound sample.

Fig. 2 is a block diagram of an exemplary distributed computer system for implementing the method of Fig. 1.

Fig. 3 is a flow diagram of a method for constructing a database index of sound files used in the method of Fig. 1.

Fig. 4 schematically illustrates landmarks and fingerprints computed for a sound sample.

Fig. 5 is a graph of L4 norms for a sound sample, illustrating the selection of landmarks.

Fig. 6 is a flow diagram of an alternative embodiment for constructing a database index of sound files used in the method of Fig. 1.

Figs. 7A-7C show a spectrogram with salient points and linked salient points indicated.

Figs. 8A-8C illustrate index sets, an index list, and a master index list of the method of Fig. 3.

Figs. 9A-9C illustrate an index list, a candidate list, and a scatter list of the method of Fig. 1.

Figs. 10A-10B are scatter plots illustrating correct identification and lack of identification, respectively, of an unknown sound sample.

Detailed description

The present invention provides a method for recognizing an exogenous media sample given a database containing a large number of known media files. It also provides a method for generating a database index that allows efficient searching using the recognition method of the invention. While the following discussion refers primarily to audio data, it is to be understood that the method of the present invention can be applied to any type of media sample and media file, including, but not limited to, text, audio, video, images, and any multimedia combination of individual media types. In the case of audio, the invention is particularly useful for recognizing samples that contain high levels of linear and nonlinear distortion caused by, for example, background noise, transmission errors and dropouts, interference, band-limited filtering, quantization, time-warping, and voice-quality digital compression. As will become apparent from the description below, the invention works under such conditions because it can correctly recognize a distorted signal even if only a small fraction of the computed characteristics survive the distortion. Any type of audio, including sound, voice, music, or combinations of types, can be recognized by the present invention. Example audio samples include recorded music, radio broadcast programs, and advertisements.

As used herein, an exogenous media sample is a segment of media data of any size obtained from a variety of sources, as described below. For recognition to be performed, the sample must be a rendition of part of a media file indexed in the database used by the invention. The indexed media file can be thought of as an original recording, and the sample as a distorted and/or abridged version or rendition of the original recording. Typically, the sample corresponds to only a small portion of the indexed file. For example, recognition can be performed on a ten-second segment of a five-minute song indexed in the database. Although the term "file" is used to describe the indexed entity, the entity can be in any format for which the necessary values (described below) can be obtained. Furthermore, there is no need to store or have access to the file after the values are obtained.

Fig. 1 is a block diagram conceptually illustrating the overall steps of a method 10 of the invention. The individual steps are described in more detail below. The method identifies a winning media file, that is, a media file whose relative locations of characteristic fingerprints most closely match the relative locations of the same fingerprints of the exogenous sample. After an exogenous sample is captured in step 12, landmarks and fingerprints are computed in step 14. Landmarks occur at particular locations, for example time points, within the sample. The location of a landmark within the sample is preferably determined by the sample itself, that is, it depends on sample qualities, and is reproducible. In other words, the same landmarks are computed for the same signal each time the process is repeated. For each landmark, a fingerprint characterizing one or more features of the sample at or near the landmark is obtained. The nearness of a feature to a landmark is defined by the fingerprinting method used. In some cases, a feature is considered near a landmark if it clearly corresponds to that landmark and not to a previous or subsequent landmark. In other cases, features correspond to multiple adjacent landmarks. For example, text fingerprints can be word strings, audio fingerprints can be spectral components, and image fingerprints can be pixel RGB values. Two general embodiments of step 14 are described below: one in which landmarks and fingerprints are computed sequentially, and one in which they are computed simultaneously.

In step 16, the sample fingerprints are used to retrieve sets of matching fingerprints stored in a database index 18, in which the matching fingerprints are associated with landmarks and identifiers of a set of media files. The set of retrieved file identifiers and landmark values is then used to generate correspondence pairs (step 20), each containing a sample landmark (computed in step 14) and a retrieved file landmark at which the same fingerprint was computed. The resulting correspondence pairs are then sorted by song identifier, generating a set of correspondences between sample landmarks and file landmarks for each applicable file. Each set is scanned for alignment between the file landmarks and the sample landmarks. That is, linear correspondences in the pairs of landmarks are identified, and the set is scored according to the number of pairs that are linearly related. A linear correspondence occurs when a large number of corresponding sample locations and file locations can be described by substantially the same linear equation, within an allowed tolerance. For example, if the slopes of a number of equations describing a set of correspondence pairs vary within ±5%, the entire set of correspondences is considered to be linearly related. Of course, any suitable tolerance can be selected. The identifier of the set with the highest score, that is, with the largest number of linearly related correspondences, is the winning file identifier, which is located and returned in step 22.
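Steps 16 through 22 can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the patented procedure: it assumes a slope close to 1 (no time stretching), so that linearly related pairs share nearly the same offset and can be counted with a histogram of offsets. The names `identify`, `bin_width` and `min_score`, and all data values, are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# fingerprint -> [(file landmark in seconds, file identifier)]
Index = Dict[int, List[Tuple[float, str]]]


def identify(
    sample: List[Tuple[float, int]],
    index: Index,
    bin_width: float = 0.1,
    min_score: int = 3,
) -> Optional[str]:
    """Return the identifier of the best-aligned file, or None if no file qualifies."""
    # Steps 16 and 20: look up each sample fingerprint and group the
    # resulting (sample landmark, file landmark) pairs by file identifier.
    pairs_by_file: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for sample_lm, fingerprint in sample:
        for file_lm, sound_id in index.get(fingerprint, []):
            pairs_by_file[sound_id].append((sample_lm, file_lm))

    # Scoring: with slope assumed to be 1, aligned pairs share one offset,
    # so the score is the size of the most populated offset bin.
    best_id, best_score = None, 0
    for sound_id, pairs in pairs_by_file.items():
        offsets = Counter(round((f - s) / bin_width) for s, f in pairs)
        score = max(offsets.values())
        if score > best_score:
            best_id, best_score = sound_id, score

    # Step 22: return the winner only if enough pairs line up.
    return best_id if best_score >= min_score else None


index: Index = {
    1: [(30.0, "song_a"), (5.0, "song_b")],
    2: [(31.5, "song_a")],
    3: [(33.0, "song_a"), (9.0, "song_b")],
    4: [(34.5, "song_a")],
}
sample = [(0.0, 1), (1.5, 2), (3.0, 3), (4.5, 4)]
print(identify(sample, index))  # song_a
```

In this toy data, all four pairs for `song_a` share an offset of 30 s, whereas the two chance matches in `song_b` have different offsets and therefore never accumulate in one bin. Handling a slope that differs from 1 would require estimating the slope as well, which this sketch does not attempt.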

As described further below, recognition can be performed with a time component proportional to the logarithm of the number of entries in the database. Recognition can be performed essentially in real time, even with a very large database. That is, a sample can be recognized as it is being obtained, with a small time lag. The method can identify a sound based on segments of 5-10 seconds, and even as short as 1-3 seconds. In a preferred embodiment, the landmarking and fingerprinting analysis, step 14, is carried out in real time as the sample is being captured in step 12. Database queries (step 16) are carried out as sample fingerprints become available, and the correspondence results are accumulated and periodically scanned for linear correspondences. Thus all of the method steps occur simultaneously, and not in the sequential linear fashion suggested in Fig. 1. Note that the method is in part analogous to a text search engine: a user submits a query sample, and a matching file indexed in the sound database is returned.
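One common way to obtain lookup time that grows with the logarithm of the number of entries is binary search over a list sorted by fingerprint. The sketch below uses Python's standard `bisect` module; the list layout and the data are assumptions for illustration, not the patent's stated storage format.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Master list sorted by fingerprint: (fingerprint, landmark, file identifier).
master: List[Tuple[int, float, str]] = sorted(
    [
        (0xC3, 1.7, "song_b"),
        (0xA1, 0.5, "song_a"),
        (0xB2, 1.2, "song_a"),
        (0xB2, 0.3, "song_b"),
        (0xA1, 2.0, "song_a"),
    ]
)
keys = [fp for fp, _, _ in master]  # parallel list holding only the sort keys


def lookup(fingerprint: int) -> List[Tuple[float, str]]:
    """Binary-search the sorted keys and return every matching entry."""
    lo = bisect_left(keys, fingerprint)
    hi = bisect_right(keys, fingerprint)
    return [(lm, sid) for _, lm, sid in master[lo:hi]]


print(lookup(0xB2))  # [(0.3, 'song_b'), (1.2, 'song_a')]
print(lookup(0xFF))  # []
```

Each lookup performs two binary searches, so its cost grows logarithmically with the length of the list plus the number of matches returned.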

The method is typically implemented as software running on a computer system, with individual steps most efficiently implemented as independent software modules. Thus a system implementing the present invention can be considered to consist of a landmarking and fingerprinting object, an indexed database, and an analysis object for searching the database index, computing correspondences, and identifying the winning file. In the case of sequential landmarking and fingerprinting, the landmarking and fingerprinting object can be considered to be distinct landmarking and fingerprinting objects. Computer instruction code for the different objects is stored in a memory of one or more computers and executed by one or more computer processors. In one embodiment, the code objects are clustered together in a single computer system, such as an Intel-based personal computer or other workstation. In a preferred embodiment, the method is implemented by a networked cluster of central processing units (CPUs), in which different software objects are executed by different processors in order to distribute the computational load. Alternatively, each CPU can have a copy of all software objects, allowing for a homogeneous network of identically configured elements. In this latter configuration, each CPU has a subset of the database index and is responsible for searching its own subset of media files.

Although the invention is not limited to any particular hardware system, an example of a preferred embodiment of a distributed computer system 30 is illustrated schematically in Fig. 2. System 30 contains a cluster of Linux-based processors 32a-32f connected by a multiprocessing bus architecture 34 or a networking protocol such as the Beowulf cluster computing protocol, or a mixture of the two. In such an arrangement, the database index is preferably stored in random access memory (RAM) on at least one node 32a in the cluster, ensuring that fingerprint searching occurs very rapidly. The computational nodes corresponding to the other objects, such as landmarking nodes 32c and 32f, fingerprinting nodes 32b and 32e, and alignment scanning node 32d, do not require as much RAM as the node or nodes 32a supporting the database index. The number of computational nodes assigned to each object can thus be scaled according to need, so that no single object becomes a bottleneck. The computational network is therefore highly parallelizable and can additionally process multiple simultaneous signal recognition queries that are distributed among the available computational resources. Note that this makes possible applications in which large numbers of users can request recognition and receive results in near real time.

In an alternative embodiment, certain of the functional objects are more tightly coupled together, while remaining less tightly coupled to other objects. For example, the landmarking and fingerprinting object can reside in a physically separate location from the rest of the computational objects. One example of this is a tight association of the landmarking and fingerprinting objects with the signal capturing process. In this arrangement, the landmarking and fingerprinting object can be incorporated as additional hardware or software embedded in, for example, a mobile phone, Wireless Application Protocol (WAP) browser, personal digital assistant (PDA), or other remote terminal, such as the client end of an audio search engine. In an Internet-based audio search service, such as a content identification service, the landmarking and fingerprinting object can be incorporated into the client browser application as a linked set of software instructions or a software plug-in module such as a Microsoft dynamic link library (DLL). In these embodiments, the combined signal capture, landmarking, and fingerprinting object constitutes the client end of the service. The client end sends a feature-extracted summary of the captured signal sample, containing landmark and fingerprint pairs, to the server end, which performs the recognition. Sending this feature-extracted summary to the server, instead of the raw captured signal, is advantageous because the amount of data is greatly reduced, often by a factor of 500 or more. Such information can be sent in real time over a low-bandwidth side channel along with, or instead of, for example, an audio stream transmitted to the server. This enables the invention to be performed over public communications networks, which offer relatively small bandwidth to each user.
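To give a rough feel for the data reduction, the sketch below packs landmark and fingerprint pairs into a compact binary summary and compares its size with raw audio. Every figure here is an assumption chosen for illustration (22,050 Hz 16-bit mono capture, 5 pairs per second, 8 bytes per pair), and the wire format is hypothetical; the actual ratio depends on the capture format and landmark density.

```python
import struct

SAMPLE_RATE_HZ = 22050   # assumed capture rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
DURATION_S = 10          # assumed sample length in seconds
PAIRS_PER_SECOND = 5     # assumed landmark density


def pack_summary(pairs):
    """Pack (time in seconds, fingerprint) pairs as two little-endian uint32 each."""
    return b"".join(
        struct.pack("<II", int(round(t * 1000)), fp)  # landmark stored in ms
        for t, fp in pairs
    )


# Hypothetical landmark/fingerprint pairs, evenly spaced for simplicity.
pairs = [(i / PAIRS_PER_SECOND, 0x1000 + i) for i in range(DURATION_S * PAIRS_PER_SECOND)]
summary = pack_summary(pairs)
raw_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * DURATION_S

print(len(summary))              # 400 bytes of summary
print(raw_bytes)                 # 441000 bytes of raw audio
print(raw_bytes / len(summary))  # 1102.5
```

Under these assumptions the summary is about a thousand times smaller than the raw capture, which is the same order of magnitude as the reduction stated above.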

The method will now be described in detail with reference to audio samples and audio files indexed in a sound database. The method consists of two broad components: sound database index construction and sample recognition.

Database index construction

Before sound recognition can be performed, a searchable sound database index must be constructed. As used herein, a database is any indexed collection of data, and is not limited to commercially available databases. In the database index, related elements of data are associated with one another, and individual elements can be used to retrieve associated data. The sound database index contains an index set for each file or recording in the selected collection or library of recordings, which may include speech, music, advertisements, sonar signatures, or other sounds. Each recording also has a unique identifier, sound_ID. The sound database itself does not necessarily store the audio files for each recording, but the sound_IDs can be used to retrieve the audio files from elsewhere. The sound database index is expected to be very large, containing indices for millions or even hundreds of millions of files. New recordings are preferably added incrementally to the database index.

A block diagram of a preferred method 40 for constructing the searchable sound database index according to a first embodiment is shown in Fig. 3. In this embodiment, landmarks are computed first, and fingerprints are then computed at or near the landmarks. As will be apparent to one skilled in the art, alternative methods may be devised for constructing the database index. In particular, many of the steps listed below are optional but serve to generate a database index that can be searched more efficiently. While search efficiency is important for real-time sound recognition from large databases, small databases can be searched relatively quickly even if they have not been optimally sorted.

To index the sound database, each recording in the collection is subjected to a landmarking and fingerprinting analysis that generates an index set for each audio file. Fig. 4 schematically illustrates a segment of a sound recording for which landmarks (LM) and fingerprints (FP) have been computed. Landmarks occur at specific time points of the sound and have values in time units offset from the beginning of the file, while fingerprints characterize the sound at or near a particular landmark. Thus, in this embodiment, each landmark for a particular file is unique, while the same fingerprint can occur many times within a single file or across multiple files.

In step 42, each recording is landmarked using a method that finds distinctive and reproducible locations within the recording. A preferred landmarking algorithm is able to mark the same time points within a sound recording despite the presence of noise and other linear and nonlinear distortion. Some landmarking methods are conceptually independent of the fingerprinting process described below, but can be chosen to optimize its performance. Landmarking results in a list of time points {landmark_k} within the sound recording at which fingerprints are subsequently computed. A good landmarking scheme marks about 5-10 landmarks per second of sound recording; of course, landmark density depends on the amount of activity within the recording.

A variety of techniques can be used to compute landmarks, all of which are within the scope of the present invention. The specific technical processes used to implement the landmarking schemes of the invention are known in the art and are not discussed in detail. A simple landmarking technique, known as Power Norm, is to calculate the instantaneous power at every possible time point in the recording and to select local maxima. One way of doing this is to calculate the envelope by rectifying and filtering the waveform directly. Another way is to calculate the Hilbert transform (quadrature) of the signal and use the sum of the squared magnitudes of the Hilbert transform and the original signal.
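
The Power Norm idea can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name, the moving-average filter, and the window length are assumptions chosen only to show the "rectify, filter, pick local maxima" sequence.

```python
def power_norm_landmarks(samples, window=5):
    """Return sample indices where the smoothed instantaneous power peaks.

    `window` (a hypothetical parameter) is the length of a moving-average
    filter that stands in for the envelope filter mentioned in the text.
    """
    # Squaring each sample gives instantaneous power and also rectifies it.
    power = [s * s for s in samples]
    half = window // 2
    envelope = []
    for i in range(len(power)):
        lo, hi = max(0, i - half), min(len(power), i + half + 1)
        envelope.append(sum(power[lo:hi]) / (hi - lo))
    # A landmark is a strict local maximum of the envelope.
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] > envelope[i + 1]]
```

For a signal with two bursts, the function returns one index per burst; a real system would convert these indices to time units offset from the start of the file.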

The Power Norm method of landmarking is good at finding transients in the sound signal. Power Norm is actually a special case, with p=2, of the more general Spectral Lp Norm. The general Spectral Lp Norm is calculated at each time along the sound signal by computing a short-time spectrum, for example via a Hanning-windowed fast Fourier transform (FFT). A preferred embodiment uses a sampling rate of 8000 Hz, an FFT frame size of 1024 samples, and a stride of 64 samples per time slice. The Lp norm for each time slice is then calculated as the sum of the p-th powers of the absolute values of the spectral components, optionally followed by taking the p-th root. As before, the landmarks are chosen as the local maxima of the resulting values over time. An example of the Spectral Lp Norm method is shown in Fig. 5, a graph of the L4 norm as a function of time for a particular sound signal. Dashed lines at the local maxima indicate the locations of the chosen landmarks.
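
As an illustration of the Spectral Lp Norm, the following Python sketch computes the norm of one Hann-windowed frame and a track of norms over a signal. It is a sketch under stated assumptions: the function names are hypothetical, and a naive discrete Fourier transform replaces the FFT so that the example needs only the standard library (it is therefore practical only for small frames, not the 1024-sample frames of the preferred embodiment).

```python
import cmath
import math

def hann(n_samples):
    # Hann window coefficients; requires n_samples >= 2.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]

def spectral_lp_norm(frame, p):
    """Lp norm of the magnitude spectrum of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    mags = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    if math.isinf(p):
        return max(mags)  # L-infinity: the largest spectral component
    return sum(m ** p for m in mags) ** (1.0 / p)

def lp_norm_track(signal, frame_size, hop, p):
    """Norm value for each time slice; landmarks are its local maxima."""
    return [spectral_lp_norm(signal[i:i + frame_size], p)
            for i in range(0, len(signal) - frame_size + 1, hop)]
```

Passing `p=math.inf` gives the maximum norm; landmark selection would then pick local maxima of the returned track, as with Power Norm.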

When p=∞, the L∞ norm is effectively the maximum norm. That is, the value of the norm is the absolute value of the largest spectral component in the spectral slice. This norm results in robust landmarks and good overall recognition performance, and is preferred for tonal music.

Alternatively, "multi-slice" spectral landmarks can be computed by summing the p-th powers of the absolute values of the spectral components over multiple time slices at fixed or variable offsets from one another, rather than over a single slice. Finding the local maxima of this extended sum allows the placement of the multi-slice fingerprints, described below, to be optimized.

Once the landmarks have been computed, a fingerprint is computed at each landmark time point in the recording in step 44. The fingerprint is generally a value or a set of values that summarizes a set of features in the recording at or near the time point. In a currently preferred embodiment, each fingerprint is a single numerical value that is a hashed function of multiple features. Possible types of fingerprints include spectral slice fingerprints, multi-slice fingerprints, linear predictive coding (LPC) coefficients, and cepstral coefficients. Of course, any type of fingerprint that characterizes the signal, or features of the signal near a landmark, is within the scope of the present invention. Fingerprints can be computed by any type of digital signal processing or frequency analysis of the signal.

To generate spectral slice fingerprints, a frequency analysis is performed in the neighborhood of each landmark time point to extract the top several spectral peaks. A simple fingerprint value is just the single frequency value of the strongest spectral peak. The use of such a simple peak results in surprisingly good recognition in the presence of noise; however, single-frequency spectral slice fingerprints tend to generate more false positives than other fingerprinting schemes because they are not unique. The number of false positives can be reduced by using fingerprints consisting of a function of the two or three strongest spectral peaks. However, there may be a higher susceptibility to noise if the second-strongest spectral peak is not strong enough to distinguish it from its competitors in the presence of noise. That is, the calculated fingerprint value may not be robust enough to be reliably reproducible. Despite this, the performance of this case is also good.
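
A spectral slice fingerprint built from the strongest components can be sketched as follows. The sketch assumes the magnitude spectrum near the landmark is already available as a list indexed by frequency bin; the function name, the bit width, and the simplification of taking the strongest bins (rather than true spectral peaks) are illustrative assumptions.

```python
def slice_fingerprint(magnitudes, n_peaks=2, bits_per_peak=10):
    """Pack the bin indices of the n strongest spectral bins into one integer.

    `magnitudes[k]` is the magnitude at frequency bin k. With n_peaks=1 the
    fingerprint is simply the bin index of the strongest component.
    """
    strongest = sorted(range(len(magnitudes)),
                       key=lambda k: magnitudes[k], reverse=True)[:n_peaks]
    mask = (1 << bits_per_peak) - 1
    fingerprint = 0
    for k in strongest:  # strongest first, so the ordering is part of the value
        fingerprint = (fingerprint << bits_per_peak) | (k & mask)
    return fingerprint
```

Because the order of the bins depends on their relative strength, noise that swaps the two strongest components changes the value, which mirrors the robustness caveat in the text.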

In order to take advantage of the time evolution of many sounds, a set of time slices is determined by adding a set of time offsets to a landmark time point. A spectral slice fingerprint is calculated at each resulting time slice. The resulting set of fingerprint information is then combined to form one multitone or multi-slice fingerprint. Each multi-slice fingerprint is more unique than a single spectral slice fingerprint because it tracks temporal evolution, resulting in fewer false matches in the database index search described below. Experiments indicate that, because of their increased uniqueness, multi-slice fingerprints computed from the single strongest spectral peak in each of two time slices result in much faster computation (about 100 times faster) in the subsequent database index search, but with some degradation in recognition percentage in the presence of significant noise.

Alternatively, instead of using a fixed offset or offsets from a given time slice to calculate a multi-slice fingerprint, variable offsets can be used. The variable offset to the chosen slice is the offset to the next landmark, or to a landmark within a certain offset range, from the "anchor" landmark for the fingerprint. In this case, the time difference between the landmarks is also encoded into the fingerprint, along with the multi-frequency information. By adding more dimensions to the fingerprints, they become more unique and have a lower chance of false match.

In addition to spectral components, other spectral features can be extracted and used as fingerprints. Linear predictive coding (LPC) analysis extracts the linearly predictable features of a signal, such as spectral peaks and spectral shape. LPC is well known in the art of digital signal processing. For the present invention, the LPC coefficients of waveform slices anchored at landmark positions can be used as fingerprints by hashing the quantized LPC coefficients into an index value.

Cepstral coefficients are useful as a measure of periodicity and can be used to characterize signals that are harmonic, such as voices or many musical instruments. Cepstral analysis is well known in the art of digital signal processing. For the present invention, a number of cepstral coefficients are hashed together into an index and used as a fingerprint.

An alternative embodiment 50, in which landmarks and fingerprints are computed simultaneously, is shown in Fig. 6. Steps 42 and 44 of Fig. 3 are replaced by steps 52, 54, and 56. As described below, a multidimensional function is computed from the sound recording in step 52, and landmarks (54) and fingerprints (56) are extracted from the function.

In one implementation of the embodiment of Fig. 6, landmarks and fingerprints are computed from a spectrogram of the sound recording. A spectrogram is a time-frequency analysis of a sound recording in which windowed and overlapped frames of sound samples are spectrally analyzed, typically using a fast Fourier transform (FFT). As before, a preferred embodiment uses a sampling rate of 8000 Hz, an FFT frame size of 1024 samples, and a stride of 64 samples per time slice. An example of a spectrogram is shown in Fig. 7A. Time is on the horizontal axis, and frequency is on the vertical axis. Each sequential FFT frame is stacked vertically at corresponding evenly spaced intervals along the time axis. A spectrogram depicts the energy density at each time-frequency point; darker areas of the plot represent higher energy density. Spectrograms are well known in the art of audio signal processing. For the present invention, landmarks and fingerprints can be obtained from salient points, such as the local maxima of the spectrogram circled in Fig. 7B. For example, the time and frequency coordinates of each peak are obtained, with the time taken as the landmark and the frequency used to compute the corresponding fingerprint. This spectrogram peak landmark is similar to the L∞ norm, in which the maximum absolute value of the norm determines the landmark location. In the spectrogram, however, the local maximum search is taken over patches of the time-frequency plane, rather than over an entire time slice.

In this context, the set of salient points resulting from the point extraction analysis of a sound recording is referred to as a constellation. For a constellation consisting of local maxima, a preferred analysis is to select points that are energy maxima of the time-frequency plane over a neighborhood around each selected point. For example, a point at coordinates (t₀, f₀) is selected if it is the maximum-energy point within a rectangle with corners (t₀-T, f₀-F), (t₀-T, f₀+F), (t₀+T, f₀-F), and (t₀+T, f₀+F), i.e., a rectangle with sides of length 2T and 2F, with T and F chosen to provide a suitable number of constellation points. The bounds of the rectangle can also vary in size according to frequency value. Of course, any region shape can be used. The maximum-energy criterion can also be weighted such that a competing time-frequency energy peak is inversely weighted according to a distance metric in the time-frequency plane, i.e., more distant points have lower weighting. For example, the energy can be weighted as:

\frac{S (t, f)}{1 + C_{t} {(t - t_{0})}^{2} + C_{f} {(f - f_{0})}^{2}},

where S(t, f) is the squared magnitude of the spectrogram at point (t, f), and C_t and C_f are positive values (not necessarily constants). Other distance-weighting functions are also possible. Local maxima selection constraints can be applied to other (non-maximum) salient point feature extraction schemes, and are within the scope of the invention.
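
The unweighted rectangle criterion can be sketched in Python as follows. The function name and the list-of-lists spectrogram layout are assumptions, ties are resolved by requiring a strict maximum, and the distance weighting given by the formula above is deliberately left out to keep the example short.

```python
def constellation(spec, T, F):
    """Select (t, f) points that are the strict energy maximum in their rectangle.

    `spec[t][f]` holds the spectrogram energy at time index t and frequency
    index f. The rectangle spans t0-T..t0+T and f0-F..f0+F, clipped at the
    borders of the spectrogram.
    """
    n_t, n_f = len(spec), len(spec[0])
    points = []
    for t0 in range(n_t):
        for f0 in range(n_f):
            energy = spec[t0][f0]
            neighbours = [spec[t][f]
                          for t in range(max(0, t0 - T), min(n_t, t0 + T + 1))
                          for f in range(max(0, f0 - F), min(n_f, f0 + F + 1))
                          if (t, f) != (t0, f0)]
            if all(energy > v for v in neighbours):
                points.append((t0, f0))
    return points
```

Larger values of T and F yield sparser constellations, which is the tuning knob the text refers to when it says T and F are chosen to provide a suitable number of points.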

This method results in pairs of values that are very similar to the single-frequency spectral fingerprint described above, with many of the same properties. The spectrogram time-frequency method generates more landmark/fingerprint pairs than the single-frequency method, but can also yield many false matches in the matching stage described below. However, it provides more robust landmarking and fingerprinting than the single-frequency spectral fingerprint, because dominant noise in the sound sample may not extend to all parts of the spectrum in each slice. That is, there are most likely some landmark and fingerprint pairs in parts of the spectrum that are not affected by the dominant noise.

This spectrogram landmarking and fingerprinting method is a special case of feature analysis methods that compute a multidimensional function of the sound signal, in which one of the dimensions is time, and locate salient points in the function values. Salient points can be local maxima, local minima, zero crossings, or other distinctive features. The landmarks are taken to be the time coordinates of the salient points, and the corresponding fingerprints are computed from at least one of the remaining coordinates. For example, the non-time coordinate(s) of the multidimensional salient point can be hashed together to form a multidimensional function fingerprint.

The variable offset method described above for multi-slice spectral fingerprints can be applied to spectrogram or other multidimensional function fingerprints. In this case, points in a constellation are linked together to form linked points, as illustrated in the spectrogram shown in Fig. 7C. Each point in the constellation serves as an anchor point defining the landmark time, and the remaining coordinate values of the other points are combined to form the linked fingerprint. Points that are near each other, for example, as defined below, are linked together to form more complex aggregate feature fingerprints that may be more easily distinguished and searched. As with the multi-slice spectral fingerprints, the goal of combining information from multiple linked salient points into a single fingerprint is to create more diversity of possible fingerprint values, thereby decreasing the probability of false match, i.e., decreasing the probability that the same fingerprint describes two different music samples.

In principle, each of N salient points can be linked to every other point in a two-point linkage scheme, producing about N²/2 combinations. Similarly, for a K-point linkage, the number of possible combinations resulting from a constellation is of order N^K. In order to avoid such a combinatorial explosion, it is desirable to constrain the neighborhood of points that are linked together. One way to accomplish such a constraint is to define a "target zone" for each anchor point. An anchor point is then linked with points in its target zone. It is also possible to select a subset of the points within the target zone to link to; not every point needs to be linked. For example, only the points associated with the strongest peaks in the target zone may be linked. A target zone can have a fixed shape or can vary according to characteristics of the anchor point. A simple example of a target zone for an anchor point (t₀, f₀) of a spectrogram peak constellation is the set of points (t, f) in the spectrogram strip such that t is in the interval [t₀+L, t₀+L+W], where L is the lead into the future and W is the width of the target zone. In this scheme, all frequencies are allowed in the target zone. L or W can be variable, for example, if a rate control mechanism is used to modulate the number of linkage combinations being produced. Alternatively, frequency restrictions can be implemented, for example, by constraining the target zone such that the frequency f is in the interval [f₀-F, f₀+F], where F is a bounding parameter. An advantage of a frequency constraint is that, in psychoacoustics, melodies are known to cohere better when sequences of notes have frequencies that are near each other. Such a constraint may enable more "psychoacoustically realistic" recognition performance, although modeling psychoacoustics is not necessarily a goal of the present invention. It is also possible to consider the opposite rule, in which f is chosen outside of the region [f₀-F, f₀+F]. This forces the linkage of points that differ from each other in frequency, possibly avoiding cases in which constellation extraction artifacts produce stuttering sequences of time-frequency points that are close in time and have the same frequency. As with other locality parameters, F is not necessarily constant and can, for example, be a function of f₀.
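
A two-point linkage with a target zone can be sketched as below. The function name and the tuple representation of constellation points are hypothetical; L, W, and F carry the meanings given in the text, and the optional frequency constraint is applied only when F is supplied.

```python
def link_points(points, L, W, F=None):
    """Pair each anchor point with the points inside its target zone.

    `points` is a list of (t, f) constellation points. A point (t, f) is in
    the target zone of anchor (t0, f0) when t0+L <= t <= t0+L+W and, if F is
    given, |f - f0| <= F.
    """
    links = []
    for t0, f0 in points:
        for t, f in sorted(points):
            in_time = t0 + L <= t <= t0 + L + W
            in_freq = F is None or abs(f - f0) <= F
            if in_time and in_freq:
                links.append(((t0, f0), (t, f)))
    return links
```

With a positive lead L, an anchor is never linked to itself, and the number of links grows with the width W rather than with the square of the constellation size.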

When time coordinates of non-anchor salient points are included in fingerprint values, relative time values must be used to allow the fingerprints to be time invariant. For example, the fingerprint can be a function of (i) non-time coordinate values and/or (ii) difference(s) of the corresponding time coordinate values of the salient points. The time difference(s) can be taken, for example, with respect to the anchor point or as successive differences between sequential salient points in the linked set. The coordinate and difference values can be packed into concatenated bit fields to form the hashed fingerprint. As will be apparent to one of ordinary skill in the art, many other ways of mapping sets of coordinate values into a fingerprint value exist and are within the scope of the present invention.

A concrete instantiation of this scheme uses N>1 linked spectrogram peaks with coordinates (t_k, f_k), k=1, ..., N. Then, (i) the time t₁ of the first peak is taken as the landmark time, and (ii) the time differences Δt_k = t_k - t₁, k=2, ..., N, plus the frequencies f_k, k=1, ..., N, of the linked peaks are hashed together to form a fingerprint value. The fingerprint can be computed from all or from a subset of the available Δt_k and f_k coordinates. For example, some or all of the time difference coordinates can be omitted if desired.
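
The following Python sketch shows one way the frequencies and time differences of linked peaks could be packed into concatenated bit fields. The function name and the field widths are assumptions made for illustration; coordinates are taken to be non-negative integers (for example, frame and bin indices).

```python
def linked_fingerprint(peaks, freq_bits=10, dt_bits=12):
    """Return (landmark, fingerprint) for linked peaks [(t1, f1), ..., (tN, fN)].

    The landmark is t1. The fingerprint concatenates f1..fN followed by the
    time differences t2-t1 .. tN-t1, each truncated to its bit-field width.
    """
    t1 = peaks[0][0]
    fingerprint = 0
    for _, f in peaks:
        fingerprint = (fingerprint << freq_bits) | (f & ((1 << freq_bits) - 1))
    for t, _ in peaks[1:]:
        fingerprint = (fingerprint << dt_bits) | ((t - t1) & ((1 << dt_bits) - 1))
    return t1, fingerprint
```

Because only differences relative to t1 enter the value, shifting every peak by the same amount changes the landmark but leaves the fingerprint unchanged, which is the time invariance required above.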

Another advantage of using multiple points to form the fingerprint is that the fingerprint encoding can be made invariant with respect to time stretching, e.g., when a sound recording is played back at a speed different from the original recording speed. This advantage applies to both the spectrogram and the time slice methods. Note that in a time-stretched signal, time differences and frequencies have a reciprocal relationship (e.g., decreasing the time difference between two points by a factor of two doubles the frequency). This method takes advantage of that fact by combining time differences and frequencies in a way that removes the time stretching from the fingerprint.

For example, in an N-point spectrogram peak case with coordinate values (t_k, f_k), k=1, ..., N, the available intermediate values to hash into a fingerprint are Δt_k = t_k - t₁, k=2, ..., N, and f_k, k=1, ..., N. The intermediate values can then be made invariant with respect to time stretching by taking one of the frequencies, say f₁, as a reference frequency and forming (i) quotients with the remaining frequencies and (ii) products with the time differences. For example, the intermediate values can be g_k = f_k/f₁, k=2, ..., N, and s_k = Δt_k f₁, k=2, ..., N. If the sample is sped up by a factor of α, then the frequency f_k becomes α f_k and the time difference Δt_k becomes Δt_k/α, so that g_k = α f_k/(α f₁) = f_k/f₁, and s_k = (Δt_k/α)(α f₁) = Δt_k f₁. These new intermediate values are then combined using a function to form a hashed fingerprint value that is independent of time stretching. For example, the g_k and s_k values may be hashed by packing them into concatenated bit fields.
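
The invariance of g_k and s_k can be checked numerically with a short Python sketch. The function name is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used only so that the comparison is not disturbed by rounding; integer coordinates are assumed.

```python
from fractions import Fraction

def stretch_invariant_values(peaks):
    """Compute g_k = f_k / f_1 and s_k = (t_k - t_1) * f_1 for k = 2..N.

    `peaks` is a list of (t_k, f_k) with integer coordinates; the first peak
    supplies the reference frequency f_1 and the landmark time t_1.
    """
    t1, f1 = peaks[0]
    g = [Fraction(f, f1) for _, f in peaks[1:]]
    s = [(t - t1) * f1 for t, _ in peaks[1:]]
    return g, s
```

For instance, the peaks (0, 100), (8, 150), (20, 300) and their sped-up counterparts with α = 2, namely (0, 200), (4, 300), (10, 600), produce the same g and s values, so a hash of those values does not depend on playback speed.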

Alternatively, instead of a reference frequency, a reference time difference may be used, e.g., Δt₂. In this case, the new intermediate values are calculated as (i) the quotients Δt_k/Δt₂ with the remaining time differences and (ii) the products Δt₂ f_k with the frequencies. This case is equivalent to using a reference frequency, because the resulting values can be formed from products and quotients of the g_k and s_k values above. Reciprocals of the frequency ratios can be used equally effectively; sums and differences of logarithmic values of the original intermediate values can also be substituted for products and quotients, respectively. Any time-stretch-independent fingerprint value obtained by such commutations, substitutions, and permutations of mathematical operations is within the scope of the invention. Additionally, multiple reference frequencies or reference time differences, which also relativize time differences, can be used. The use of multiple reference frequencies or reference time differences is equivalent to the use of a single reference, because the same result can be achieved by arithmetic manipulation of the g_k and s_k values.

Returning to Figs. 3 and 6, landmarking and fingerprinting analysis by any of the above methods results in an index set for each sound_ID, as shown in Fig. 8A. An index set for a given sound recording is a list of pairs of values (fingerprint, landmark). Each indexed recording typically has on the order of one thousand (fingerprint, landmark) pairs in its index set. In the first embodiment described above, the landmarking and fingerprinting techniques are essentially independent, and can be treated as separate and interchangeable modules. Depending on the system, signal quality, or type of sound to be recognized, one of a number of different landmarking or fingerprinting modules can be employed. In fact, because the index set is composed simply of pairs of values, it is possible, and often preferable, to use multiple landmarking and fingerprinting schemes simultaneously. For example, one landmarking and fingerprinting scheme may be good at detecting unique tonal patterns but poor at identifying percussion, whereas a different algorithm may have the opposite attributes. The use of multiple landmarking/fingerprinting strategies results in a more robust and richer range of recognition performance. Different fingerprinting techniques can be used together by reserving certain ranges of fingerprint values for certain kinds of fingerprints. For example, in a 32-bit fingerprint value, the first 3 bits can be used to specify which of 8 fingerprinting schemes the following 29 bits are encoding.
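
The 3-bit scheme selector described in the last example can be sketched in Python as follows; the constant and function names are hypothetical.

```python
SCHEME_BITS = 3   # selects one of 8 fingerprinting schemes
VALUE_BITS = 29   # scheme-specific fingerprint payload

def pack_fingerprint(scheme, value):
    """Combine a scheme number (0..7) and a 29-bit payload into 32 bits."""
    if not 0 <= scheme < (1 << SCHEME_BITS):
        raise ValueError("scheme must fit in 3 bits")
    return (scheme << VALUE_BITS) | (value & ((1 << VALUE_BITS) - 1))

def unpack_fingerprint(fingerprint):
    """Split a 32-bit fingerprint back into (scheme, payload)."""
    return fingerprint >> VALUE_BITS, fingerprint & ((1 << VALUE_BITS) - 1)
```

Because the scheme number occupies the most significant bits, fingerprints from different schemes fall into disjoint value ranges and can share a single sorted index without colliding.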

After index sets are generated for each sound recording to be indexed in the sound database, a searchable database index is constructed in such a way as to allow fast (i.e., log time) searching. This is accomplished in step 46 by constructing a list of triplets (fingerprint, landmark, sound_ID), obtained by appending the corresponding sound_ID to each doublet within each index set. All such triplets for all sound recordings are collected into a large index list, an example of which is shown in Fig. 8B. In order to optimize the subsequent search process, the list of triplets is then sorted according to fingerprint. Fast sorting algorithms are well known in the art and are extensively discussed in D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Reading, Massachusetts: Addison-Wesley, 1998, herein incorporated by reference. High-performance sorting algorithms can be used to sort the list in N log N time, where N is the number of entries in the list.

Once the index list is sorted, it is further processed in step 48 by segmenting such that each unique fingerprint in the list is collected into a new master index list, an example of which is shown in Fig. 8C. Each entry in the master index list contains a fingerprint value and a pointer to a list of (landmark, sound_ID) pairs. Depending on the number and character of the recordings indexed, a given fingerprint can appear hundreds of times or more within the entire collection. Rearranging the index list into a master index list is optional but saves memory, because each fingerprint value appears only once. It also speeds up the subsequent database search, because the effective number of entries in the list is greatly reduced to a list of unique values. Alternatively, the master index list can be constructed by inserting each triplet into a B-tree. Other possibilities exist for constructing the master index list, as known to those of ordinary skill in the art. The master index list is preferably held in system memory, such as dynamic random access memory (DRAM), for fast access during signal recognition. The master index list can be held in the memory of a single node within the system, as illustrated in Fig. 2. Alternatively, the master index list can be broken up into pieces distributed among multiple computational nodes. Preferably, the sound database index referred to above is the master index list illustrated in Fig. 8C.
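
Steps 46 and 48 can be illustrated with a small in-memory Python sketch. The function name and the two-parallel-lists representation of the master index are assumptions; a production system might use a B-tree or a distributed structure instead, as the text notes.

```python
from itertools import groupby
from operator import itemgetter

def build_master_index(index_sets):
    """Build a master index from {sound_id: [(fingerprint, landmark), ...]}.

    Returns two parallel lists: the sorted unique fingerprints, and for each
    one its list of (landmark, sound_id) pairs.
    """
    # Step 46: form (fingerprint, landmark, sound_id) triplets, sort by fingerprint.
    triplets = [(fp, lm, sid)
                for sid, pairs in index_sets.items()
                for fp, lm in pairs]
    triplets.sort(key=itemgetter(0))
    # Step 48: segment so that each unique fingerprint appears only once.
    fingerprints, postings = [], []
    for fp, group in groupby(triplets, key=itemgetter(0)):
        fingerprints.append(fp)
        postings.append([(lm, sid) for _, lm, sid in group])
    return fingerprints, postings
```

Keeping the unique fingerprints in a sorted list allows a binary search at recognition time, which corresponds to the log-time lookup the text calls for.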

The sound database index is preferably constructed offline and updated incrementally as new sounds are incorporated into the recognition system. To update the list, new fingerprints can be inserted into the appropriate location in the master list. If new recordings contain existing fingerprints, the corresponding (landmark, sound_ID) pairs are added to the existing lists for those fingerprints.

Recognition System

Using the master index list generated as described above, sound recognition is performed on an exogenous sound sample, typically supplied by a user interested in identifying the sample. For example, the user hears a new song on the radio and would like to know the artist and title of the song. The sample can originate from any type of environment, such as a radio broadcast, disco, pub, submarine, sound file, segment of streaming audio, or stereo system, and may contain background noise, dropouts, or talking voices. The user may store the audio sample in a storage device, such as an answering machine, computer file, tape recorder, or telephone or mobile phone voicemail system, before providing it to the system for recognition. Based on system setup and user constraints, the audio sample is provided to the recognition system of the present invention from any number of analog or digital sources, such as a stereo system, television, compact disc player, radio broadcast, answering machine, telephone, mobile telephone, Internet streaming broadcast, File Transfer Protocol (FTP), computer file as an email attachment, or any other suitable means of transmitting such recorded material. Depending on the source, the sample can be in the form of acoustic waves, radio waves, a digital audio pulse code modulation (PCM) stream, a compressed digital audio stream (such as Dolby Digital or MP3), or an Internet streaming broadcast. A user interacts with the recognition system through a standard interface, such as a telephone, mobile telephone, web browser, or email. The sample can be captured by the system and processed in real time, or it can be reproduced for processing from a previously captured sound (e.g., a sound file). During capture, the audio sample is sampled digitally and sent to the system by a sampling device such as a microphone. Depending on the capture method, the sample is likely subjected to further degradation due to limitations of the channel or sound capture device.

Once the sound signal has been converted into digital form, it is processed for recognition. As with the construction of index sets for database files, landmarks and fingerprints are calculated for the sample using the same algorithm that was used for processing the sound recording database. The method works optimally if the processing of a highly distorted rendition of an original sound file yields a set of landmark and fingerprint pairs identical or similar to the set obtained for the original recording. The resulting index set for the sound sample is a set of pairs of analyzed values (fingerprint, landmark), as shown in Fig. 9A.

Given the pairs for the sound sample, the database index is searched to locate potentially matching files. Searching is carried out as follows: each (fingerprint_k, landmark_k) pair in the index set of the unknown sample is processed by searching for fingerprint_k in the master index list. Fast searching algorithms on an ordered list are well known in the art and are extensively discussed in D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Reading, Massachusetts: Addison-Wesley, 1998. If fingerprint_k is found in the master index list, then its corresponding list of matching (landmark*_j, sound_ID_j) pairs is copied and augmented with landmark_k to form a set of triplets of the form (landmark_k, landmark*_j, sound_ID_j). In this notation, an asterisk (*) indicates a landmark of one of the indexed files in the database, while a landmark without an asterisk refers to the sample. In some cases, it is preferable that the matching fingerprints are not necessarily identical, but only similar; for example, they may differ by no more than a predetermined threshold. Matching fingerprints, whether identical or similar, are referred to as equivalent. The sound_ID_j in the triplet corresponds to the file having the asterisked landmark. Thus, each triplet contains two distinct landmarks, one in the database index and one in the sample, at which equivalent fingerprints were computed. This process is repeated for all k ranging over the index set of the input sample. All resulting triplets are collected into a large candidate list, as illustrated in Fig. 9B. The candidate list is so called because it contains the sound_IDs of sound files that, by virtue of their matching fingerprints, are candidates for identification with the exogenous sound sample.
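
The lookup step can be sketched in Python with a binary search over a sorted list of unique fingerprints. The function name and the parallel-list index layout (`fingerprints`, `postings`) are assumptions for this illustration, and only exact fingerprint matches are handled; matching within a similarity threshold is not shown.

```python
from bisect import bisect_left

def find_candidates(sample_pairs, fingerprints, postings):
    """Build the candidate list of (landmark_k, landmark*_j, sound_ID_j) triplets.

    `sample_pairs` is the sample's [(fingerprint, landmark), ...]; `fingerprints`
    is a sorted list of unique values and `postings[i]` holds the
    (landmark*, sound_ID) pairs for `fingerprints[i]`.
    """
    candidates = []
    for fp, lm in sample_pairs:
        i = bisect_left(fingerprints, fp)  # log-time search in the ordered list
        if i < len(fingerprints) and fingerprints[i] == fp:
            candidates.extend((lm, lm_star, sid) for lm_star, sid in postings[i])
    return candidates
```

Sample fingerprints that do not occur in the index simply contribute no triplets, so noise-corrupted fingerprints are dropped at this stage rather than causing an error.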

After the candidate list is compiled, it is further processed by segmenting according to sound_ID. A convenient way of doing this is to sort the candidate list by sound_ID or to insert it into a B-tree. A large number of sorting algorithms are available in the art, as discussed above. The result of this process is a list of candidate sound_IDs, each of which has a scatter list of pairs of sample and file landmark time points, (landmark_k, landmark*_j), with the sound_ID optionally stripped off, as shown in Fig. 9C. Each scatter list thus contains a set of corresponding landmarks, corresponding by virtue of their being characterized by equivalent fingerprint values.

The scatter list for each candidate sound_ID is then analyzed to determine whether that sound_ID matches the sample. An optional thresholding step may be used first to eliminate a potentially large number of candidates that have very small scatter lists. Clearly, a candidate with only one entry in its scatter list, that is, with only one fingerprint in common with the sample, does not match the sample. Any suitable threshold number greater than or equal to one can be used.
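The optional thresholding step can be sketched as a simple filter. The function name prune_scatter_lists is hypothetical, and the choice to keep only lists strictly longer than the threshold is one reading of the text, under which a threshold of one removes single-entry candidates.

```python
def prune_scatter_lists(scatter_lists, threshold=1):
    """Drop candidates whose scatter list has `threshold` entries or fewer."""
    if threshold < 1:
        raise ValueError("threshold must be greater than or equal to one")
    return {
        sound_id: pairs
        for sound_id, pairs in scatter_lists.items()
        if len(pairs) > threshold
    }


# Toy scatter lists (hypothetical values).
scatter_lists = {
    "song_1": [(0.5, 10.0), (1.5, 11.0)],
    "song_2": [(0.5, 42.5)],
}
remaining = prune_scatter_lists(scatter_lists, threshold=1)
```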

Once the final set of candidates has been determined, the winning candidate is located. If the following algorithm does not locate a winning candidate, a failure message is returned. A key insight into the matching process is that the time evolution in matching sounds must follow a linear correspondence, assuming that the timebases on both sides are steady. This is almost always true unless one of the sounds has been intentionally warped nonlinearly or has been played on defective equipment, such as a tape deck with a warbling speed problem. Thus, the correct landmark pairs (landmark_n, landmark*_n) in the scatter list of a given sound_ID must have a linear correspondence of the following form:

landmark*_n = m · landmark_n + offset,

where m is the slope, which should be close to one; landmark_n is a time point within the external sample; landmark*_n is the corresponding time point within the sound recording indexed by sound_ID; and offset is the time offset into the sound recording that corresponds to the beginning of the external sound sample. Landmark pairs that satisfy the above equation for particular values of m and offset are said to be linearly related. Obviously, the notion of being linearly related is meaningful only for more than one pair of corresponding landmarks. Note that this linear relationship identifies the correct sound file with high probability while excluding outlying landmark pairs that have no significance. Although two different signals may contain a number of identical fingerprints, it is very unlikely that these fingerprints have the same relative time evolution. The requirement for linear correspondence is a key feature of the present invention and provides significantly better recognition than techniques that simply count the number of features in common or measure the similarity between features. In fact, because of this aspect of the invention, a sound can be recognized even if fewer than 1% of the fingerprints of the original recording appear in the external sound sample, that is, even if the sound sample is very short or significantly distorted.

The problem of determining whether there is a match for the external sample is thus reduced to the equivalent of finding a diagonal line with slope close to one in a scatter plot of the landmark points of a given scatter list. Two sample scatter plots are shown in Fig. 10A and Fig. 10B, with sound file landmarks on the horizontal axis and external sound sample landmarks on the vertical axis. In Fig. 10A, a diagonal line with slope approximately equal to one is identified, indicating that the song does match the sample, that is, that this sound file is the winning file. The intercept on the horizontal axis indicates the offset into the audio file at which the sample begins. In the scatter plot of Fig. 10B, no statistically significant diagonal line is found, indicating that this sound file does not match the external sample.

There are many ways of finding a diagonal line in a scatter plot, all of which are within the scope of the present invention. The phrase "locating a diagonal line" is to be understood as referring to all methods that are equivalent to locating a diagonal line without explicitly producing one. A preferred method starts by subtracting m · landmark_n from both sides of the above equation, which yields:

(landmark*_n − m · landmark_n) = offset.

Assuming that m is approximately equal to one, that is, assuming no time stretching, we obtain:

(landmark*_n − landmark_n) = offset.

The diagonal-finding problem is then reduced to finding, for a given sound_ID, multiple landmark pairs that cluster near the same offset value. This can be done easily by subtracting one landmark from the other and collecting a histogram of the resulting offset values. The histogram can be prepared by sorting the resulting offset values with a fast sorting algorithm, or by creating bin entries with counters and inserting them into a B-tree. The winning offset bin in the histogram is the one containing the largest number of points. This bin is referred to herein as the peak of the histogram. Because the offset must be positive if the external sound signal is fully contained within the correct library sound file, landmark pairs that yield a negative offset can be excluded. Similarly, offsets beyond the end of the file can also be excluded. The number of points in the winning offset bin of the histogram is recorded for each qualifying sound_ID. This number becomes the score for each sound recording. The sound recording in the candidate list with the highest score is selected as the winner. The winning sound_ID is reported to the user, as described below, to signal that the identification succeeded. To prevent false identifications, a minimum threshold score can be used to gate the success of the identification process. If no library sound has a score exceeding the threshold, no recognition is made, and the user is so informed.
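The offset histogram and scoring described above can be sketched as follows. This is a simplified illustration, not the patented implementation: landmarks are assumed to be integer milliseconds, the bin width of 10 ms and the minimum score of 3 are arbitrary example values, and the names score_scatter_list and pick_winner are hypothetical.

```python
from collections import Counter


def score_scatter_list(pairs, bin_width_ms=10, file_duration_ms=None):
    """Return (score, peak_offset_ms) for one scatter list.

    pairs: list of (sample_landmark_ms, file_landmark_ms), integers.
    The score is the number of points in the fullest offset bin.
    """
    histogram = Counter()
    for sample_lm, file_lm in pairs:
        offset = file_lm - sample_lm          # assumes slope m is about one
        if offset < 0:
            continue                          # sample cannot start before the file
        if file_duration_ms is not None and offset > file_duration_ms:
            continue                          # offset beyond the end of the file
        histogram[offset // bin_width_ms] += 1
    if not histogram:
        return 0, None
    peak_bin, score = max(histogram.items(), key=lambda item: item[1])
    return score, peak_bin * bin_width_ms


def pick_winner(scatter_lists, min_score=3):
    """Return the sound_ID with the highest score, or None if below min_score."""
    best_id, best_score = None, 0
    for sound_id, pairs in scatter_lists.items():
        score, _ = score_scatter_list(pairs)
        if score > best_score:
            best_id, best_score = sound_id, score
    return best_id if best_score >= min_score else None


# Toy data (hypothetical): three pairs of song_1 agree on an offset near 10 s.
scatter_lists = {
    "song_1": [(500, 10500), (1500, 11500), (2500, 12503), (3000, 40000)],
    "song_2": [(500, 42500), (1500, 7000)],
}
winner = pick_winner(scatter_lists)
```

Fixed bin edges mean that a cluster straddling a bin boundary is split across two bins; a real system would need to account for that, for example by also checking neighbouring bins.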

If the external sound signal contains multiple sounds, each individual sound can be recognized. In this case, multiple winners are located in the alignment scan. It is not necessary to know in advance that the sound signal contains multiple winners, because the alignment scan will locate more than one sound_ID whose score is far above the remaining scores. The fingerprinting method used preferably exhibits good linear superposition, so that the individual fingerprints can be extracted. For example, a spectrogram fingerprinting method exhibits linear superposition.

If the sound sample has been subjected to time stretching, the slope is not identically equal to one. The result of assuming a unity slope for a time-stretched sample (assuming that the fingerprints are time-stretch invariant) is that the computed offset values are not equal. One way to address this and accommodate moderate time stretching is to increase the size of the offset bins, that is, to treat a range of offsets as equal. In general, if the points do not fall on a straight line, the computed offset values differ significantly, and a slight increase in the size of the offset bins does not produce a significant number of false positives.
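The effect of bin size under time stretching can be illustrated with a small numerical sketch. The 2% stretch, the 10 ms and 250 ms bin widths and the function name peak_count are all assumptions chosen for this example.

```python
from collections import Counter


def peak_count(pairs, bin_width_ms):
    """Number of points in the fullest offset bin, assuming unity slope."""
    histogram = Counter(
        (file_lm - sample_lm) // bin_width_ms
        for sample_lm, file_lm in pairs
        if file_lm >= sample_lm
    )
    return max(histogram.values()) if histogram else 0


# Hypothetical sample: ten landmarks, one per second, matching a file that
# plays 2% slower (m = 1.02) starting 30 s into the recording.
sample_landmarks_ms = range(0, 10_000, 1_000)
pairs = [(s, round(1.02 * s) + 30_000) for s in sample_landmarks_ms]

narrow = peak_count(pairs, bin_width_ms=10)    # offsets drift 20 ms per pair
wide = peak_count(pairs, bin_width_ms=250)     # the whole drift fits in one bin
```

With narrow bins every pair lands in its own bin, so the peak is lost; with the wider bin all ten pairs are counted together.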

Other line-finding strategies are possible. For example, a Radon or Hough transform may be used, as described in T. Risse, "Hough Transform for Line Recognition", Computer Vision and Image Processing, 46, 327-345, 1989; these transforms are well known in the fields of machine vision and graphics research. In the Hough transform, each point in the scatter plot projects to a line in (slope, offset) space. The set of points in the scatter plot is thus projected into the dual space of lines in the Hough transform. Peaks in the Hough transform correspond to intersections of the parameter lines. The global peak of such a transform of a given scatter plot indicates the largest number of intersecting lines in the Hough transform, and hence the largest number of co-linear points. To allow a speed variation of 5%, for example, the construction of the Hough transform can be restricted to the region in which the slope parameter varies between 0.95 and 1.05, which saves some computation.
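A restricted Hough-style accumulation over (slope, offset) can be sketched in plain Python as follows. The slope grid step of 0.01, the 20 ms offset bin and the function name hough_peak are assumptions for illustration; a discretised accumulator like this is only one possible way to realise the transform.

```python
from collections import Counter


def hough_peak(pairs, slopes, offset_bin_ms=20):
    """Vote in (slope, offset) space and return (slope, offset_ms, votes).

    Each (sample_landmark, file_landmark) point votes once per candidate
    slope m, for the offset bin containing file_landmark - m * sample_landmark.
    """
    accumulator = Counter()
    for sample_lm, file_lm in pairs:
        for m in slopes:
            offset_bin = round((file_lm - m * sample_lm) / offset_bin_ms)
            accumulator[(m, offset_bin)] += 1
    (m, offset_bin), votes = max(accumulator.items(), key=lambda item: item[1])
    return m, offset_bin * offset_bin_ms, votes


# Slope restricted to 0.95 .. 1.05, i.e. a speed variation of about 5%.
slopes = [round(0.95 + 0.01 * i, 2) for i in range(11)]

# Hypothetical data: ten points on the line file = 1.02 * sample + 30000 ms,
# plus two unrelated outliers.
pairs = [(s, round(1.02 * s) + 30_000) for s in range(0, 10_000, 1_000)]
pairs += [(500, 5_000), (7_300, 91_000)]

best_slope, best_offset_ms, votes = hough_peak(pairs, slopes)
```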

Hierarchical search

In addition to the thresholding step that eliminates candidates with very small scatter lists, further improvements in efficiency can be made. In one such improvement, the database index is segmented into at least two parts according to probability of occurrence, and only the sound files with the highest probability of matching the sample are searched initially. This division can occur at various stages of the process. For example, the master index list (Fig. 8C) can be segmented into two or more parts, so that steps 16 and 20 are first performed on one of the segments. That is, files corresponding to matching fingerprints are retrieved from only a portion of the database index, and a scatter list is generated from that portion. If a winning sound file is not located, the process is repeated on the remainder of the database index. In another implementation, all files are retrieved from the database index, but the diagonal line scan is performed separately on the different segments.

Using this technique, the diagonal line scan, which is the computationally intensive part of the method, is performed first on a small subset of the sound files in the database index. Because the diagonal line scan has a time component that is approximately linear in the number of sound files scanned, performing such a hierarchical search is highly advantageous. For example, assume that the sound database index contains fingerprints representing 1,000,000 sound files, but that only about 1000 files match the sample queries with high frequency; for example, 95% of the queries are for those 1000 files, while only 5% of the queries are for the remaining 999,000 files. Assuming that the computational cost depends linearly on the number of files, the cost is proportional to 1000 for 95% of the time and proportional to 999,000 for only 5% of the time. The average cost is therefore proportional to approximately 50,900. A hierarchical search thus reduces the computational load by a factor of about 20. Of course, the database index can also be divided into more than two levels, for example a group of newly released songs, a group of recently released songs, and a group of older, less popular songs.
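The arithmetic behind this estimate can be checked with a few lines of Python. The variable names are illustrative; the calculation follows the text's simplification that a query missing the first tier is charged only for the remaining 999,000 files.

```python
total_files = 1_000_000
hot_files = 1_000          # files that receive most of the queries
p_hot = 0.95               # fraction of queries answered by the hot subset

# Cost of scanning everything for every query (proportional to file count).
flat_cost = total_files

# Expected cost of the two-tier search, as computed in the text.
tiered_cost = p_hot * hot_files + (1 - p_hot) * (total_files - hot_files)

# Ratio between the two, i.e. the approximate saving factor.
saving_factor = flat_cost / tiered_cost
```

Here tiered_cost evaluates to about 50,900 and saving_factor to a little under 20, in line with the figures quoted above.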

As described above, the search is first performed on a first subset of sound files, namely the high-probability files, and then, only if the first search fails, on a second subset containing the remaining files. The diagonal line scan fails if the number of points in each offset bin does not reach a predetermined threshold. Alternatively, the two searches can be carried out in parallel (simultaneously). If the correct sound file is located in the search of the first subset, a signal is sent to terminate the search of the second subset. If the correct sound file is not located in the first search, the second search continues until a winning file is located. These two implementations involve a tradeoff between computational effort and time. The first implementation is more computationally efficient but introduces a slight latency if the first search fails. The second implementation wastes computational effort if the winning file is in the first subset, but minimizes latency when the winning file is not in the first subset.
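The sequential variant of the two-stage search can be sketched as a loop over tiers. The function tiered_search, the score_file callback and the toy scores are hypothetical; in a real system score_file would run the diagonal line scan for one sound_ID.

```python
def tiered_search(tiers, score_file, min_score):
    """Search tiers in order; stop at the first tier that yields a winner.

    tiers: list of lists of sound_IDs, most probable tier first.
    score_file: callable returning the histogram-peak score for a sound_ID.
    Returns (winning_sound_id, tier_number) or (None, None) on failure.
    """
    for tier_number, tier in enumerate(tiers, start=1):
        scores = {sound_id: score_file(sound_id) for sound_id in tier}
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            return best, tier_number      # later tiers are never scanned
    return None, None


# Toy scores (hypothetical) standing in for real diagonal-scan results.
toy_scores = {"hit_a": 2, "hit_b": 1, "rare_c": 40, "rare_d": 3}
tiers = [["hit_a", "hit_b"], ["rare_c", "rare_d"]]
winner, tier_used = tiered_search(tiers, toy_scores.get, min_score=10)
```

In this toy run the first tier fails the threshold, so the second tier is searched and supplies the winner.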

The purpose of segmenting the list is to estimate the probability that a sound file is the target of a query and to limit the search to those files most likely to match the query sample. There are various possible ways of assigning probabilities and sorting the sounds in the database, all of which are within the scope of the present invention. Preferably, probabilities are assigned on the basis of recency and of the frequency with which a sound file has been identified as the winner. Recency is a useful measure, particularly for popular songs, because musical interest changes quite rapidly over time as new songs are released. After the probability scores have been computed, rankings are assigned to the files, and the list self-sorts by ranking. The sorted list is then segmented into two or more subsets for searching. The smaller subset can contain a predetermined number of files. For example, if the ranking places a file within the top, say, 1000 files, the file is placed in the smaller, faster search. Alternatively, the cut-off point between the two subsets can be adjusted dynamically. For example, all files with a score exceeding a particular threshold can be placed in the first subset, so that the number of files in each subset changes continually.

One particular way of computing the probability is to increment a sound file's score by one each time it is identified as a match for a query sample. To take recency into account, all scores are reduced periodically, so that newer queries have a stronger effect on the ranking than older queries. For example, all scores can be ratcheted downward by applying a constant multiplicative factor at each query, so that a score that is not updated decays exponentially. Depending on the number of files in the database, which can easily reach one million, this method requires updating a large number of scores at every query, which makes it potentially undesirable. Alternatively, the scores can be adjusted downward at relatively infrequent intervals, such as once a day. The ordering that results from less frequent adjustment is effectively similar, though not identical, to the ordering that results from adjusting at every query. However, the computational load of updating the rankings is much lower.
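A minimal sketch of this scoring scheme, with the decay applied as a separate periodic step, might look as follows. The class name PopularityScores and the decay factor of 0.9 are assumptions for illustration.

```python
class PopularityScores:
    """Win counts with periodic multiplicative decay (illustrative only)."""

    def __init__(self, decay=0.9):
        self.decay = decay          # constant factor below one
        self.scores = {}

    def record_win(self, sound_id):
        # Add one each time the file is identified as the match for a query.
        self.scores[sound_id] = self.scores.get(sound_id, 0.0) + 1.0

    def global_decay(self):
        # Run at infrequent intervals (for example daily), not at every query.
        for sound_id in self.scores:
            self.scores[sound_id] *= self.decay

    def ranking(self):
        return sorted(self.scores, key=self.scores.get, reverse=True)


# Hypothetical history: an old favourite fades, a new song wins once.
scores = PopularityScores(decay=0.9)
for _ in range(3):
    scores.record_win("old_song")
for _ in range(20):
    scores.global_decay()
scores.record_win("new_song")
order = scores.ranking()
```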

A slight variation of this recency adjustment, which preserves the recency score more exactly, is to add an exponentially growing score update a^t to the winning sound file at each query, where t is the time elapsed since the last global update. At each global update, all scores are then adjusted downward by dividing them by a^T, where T is the total time elapsed since the last global update. In this variation, a is a recency factor greater than one.
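This variation can be sketched as follows. After a global update, a win recorded at time t contributes a^(t − T), so wins closer to the update retain more weight. The class name RecencyScores, the value a = 2 and the time unit of days are illustrative assumptions.

```python
class RecencyScores:
    """Scores updated by a**t per win and rescaled by a**T at global updates."""

    def __init__(self, a=2.0):
        if a <= 1:
            raise ValueError("the recency factor a must be greater than one")
        self.a = a
        self.scores = {}

    def record_win(self, sound_id, t):
        # t: time elapsed since the last global update.
        self.scores[sound_id] = self.scores.get(sound_id, 0.0) + self.a ** t

    def global_update(self, T):
        # T: total time elapsed since the last global update.
        factor = self.a ** T
        for sound_id in self.scores:
            self.scores[sound_id] /= factor


# Hypothetical day: "x" wins right after the last update, "y" one day later,
# and the global update then runs at T = 1 day.
recency = RecencyScores(a=2.0)
recency.record_win("x", t=0.0)
recency.record_win("y", t=1.0)
recency.global_update(T=1.0)
```

After the update, the older win is worth 0.5 and the newer win 1.0, without any score having been touched between global updates.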

In addition to the ranking described above, some a priori knowledge can be introduced to help seed the list. For example, newly released songs are likely to receive more queries than older songs. Newly released songs can therefore be placed automatically in the first subset, which contains the songs with a high probability of matching queries. This can be done independently of the self-ranking algorithm described above. If the self-ranking feature is also used, newly released songs can be assigned initial rankings that place them somewhere within the first subset. The new releases can be seeded at the very top of the list, at the bottom of the list of high-probability songs, or somewhere in between. For the purposes of the search, the initial position does not matter, because the rankings converge over time to reflect the true level of interest.

In an alternative embodiment, the search is performed in order of recency ranking and is terminated when the score of a sound_ID exceeds a predetermined threshold. This is equivalent to the method described above with each segment containing only one sound_ID. Experiments show that the score of a winning sound is much higher than the scores of all other sound files, so a suitable threshold can be chosen with minimal experimentation. One way to implement this embodiment is to rank all sound_IDs in the database index by recency, with arbitrary tie-breaking in the case of identical scores. Because each recency ranking is unique, there is a one-to-one mapping between recency rankings and sound_IDs. The ranking can then be used in place of the sound_ID when sorting by sound_ID to form the list of candidate sound_IDs and the associated scatter lists (Fig. 9C). The ranking numbers can be bound to the index when the index list of (fingerprint, landmark, sound_ID) triples is generated and before the index list is sorted into the master index list. The ranking then takes the place of the sound_ID. Alternatively, a search-and-replace function can be used to replace the sound_ID with the ranking. As rankings are updated, new rankings are mapped onto the old rankings, provided that the integrity of the mapping is maintained.

Alternatively, the rankings can be bound later in the process. Once the scatter lists have been created, a ranking can be associated with each sound_ID. The sets are then sorted by ranking. In this implementation, only the pointers to the scatter lists need to be modified; the grouping into scatter lists does not need to be repeated. The advantage of late binding is that the entire database index does not need to be rebuilt each time the rankings are updated.

Note that the popularity ranking may itself be an object of economic value. That is, the ranking reflects the demand of users to obtain an identification of an unknown sound sample. In many cases, the query is prompted by a desire to purchase the song. In fact, if demographic information about the users is known, alternative ranking schemes can be implemented for each desired demographic group. A user's demographic group can be obtained from the profile information requested when the user signs up for the recognition service. It can also be determined dynamically by standard collaborative filtering techniques.

In a real-time system, the sound is provided to the recognition system incrementally over time, which enables pipelined recognition. In this case, the incoming data can be processed in segments and the sample index set updated incrementally. After each update period, the newly augmented index set is used to retrieve candidate library sound recordings using the searching and scanning steps described above. The database index is searched for fingerprints that match the newly obtained sample fingerprints, and new (landmark_k, landmark*_j, sound_ID_j) triples are generated. The new pairs are added to the scatter lists, and the histograms are augmented. The advantage of this approach is that, once sufficient data has been collected to identify the sound recording unambiguously, for example, once the number of points in an offset bin of one of the sound files exceeds a high threshold or exceeds the score of the next-highest sound file, data acquisition can be stopped and the result announced.
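Pipelined recognition can be sketched as a matcher that keeps per-file offset histograms between updates. The class IncrementalMatcher, the dictionary-based index, the integer-millisecond landmarks and the single stop threshold are assumptions for this sketch; the alternative stopping rule based on the next-highest score is not shown.

```python
from collections import Counter, defaultdict


class IncrementalMatcher:
    """Accumulates offset histograms as new sample fingerprints arrive."""

    def __init__(self, master_index, stop_score, bin_width_ms=10):
        self.master_index = master_index      # fingerprint -> [(landmark*, sound_ID)]
        self.stop_score = stop_score          # high threshold for announcing a result
        self.bin_width_ms = bin_width_ms
        self.histograms = defaultdict(Counter)

    def feed(self, new_sample_pairs):
        """Add newly obtained (fingerprint, landmark) pairs; return winner or None."""
        for fingerprint, landmark in new_sample_pairs:
            for file_landmark, sound_id in self.master_index.get(fingerprint, ()):
                offset = file_landmark - landmark
                if offset >= 0:
                    self.histograms[sound_id][offset // self.bin_width_ms] += 1
        return self.winner()

    def winner(self):
        best_id, best_score = None, 0
        for sound_id, histogram in self.histograms.items():
            peak = max(histogram.values())
            if peak > best_score:
                best_id, best_score = sound_id, peak
        return best_id if best_score >= self.stop_score else None


# Toy index (hypothetical): three fingerprints of song_1, one second apart.
master_index = {1: [(10_000, "song_1")], 2: [(11_000, "song_1")], 3: [(12_000, "song_1")]}
matcher = IncrementalMatcher(master_index, stop_score=3)
first = matcher.feed([(1, 0)])                 # not enough evidence yet
second = matcher.feed([(2, 1_000), (3, 2_000)])  # threshold reached
```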

Once the correct sound has been identified, the result is reported to a user or a system by any suitable method. For example, the result can be reported by computer printout, email, a web search results page, an SMS (Short Message Service) text message sent to a mobile phone, a computer-generated voice announcement over the telephone, or by posting the result to a website or Internet account that the user can access later. The reported result can include identifying information for the sound, such as the title and artist of a song; the composer, title, and recording attributes (for example, performers, conductor, venue) of a classical piece; the company and product of an advertisement; or any other suitable identifier. In addition, biographical information, information about nearby concerts, and other information of interest to fans can be provided; hyperlinks to such data can also be provided. The reported result can also include the absolute score of the sound file or its score in comparison with the next-highest-scoring file.

One useful outcome of this recognition method is that it does not confuse two different renditions of the same sound. For example, different performances of the same piece of classical music are not considered to be the same, even if a human cannot perceive the difference between them. This is because it is highly unlikely that the landmark/fingerprint pairs and their time evolution match exactly for two different performances. In a current embodiment, the landmark/fingerprint pairs must be within 10 milliseconds of one another for a linear correspondence to be identified. As a result, the automatic recognition performed by the present invention makes it possible to credit the proper performance/soundtrack and artist/label in all cases.

Example implementation

A preferred implementation of the invention, continuous sliding-window audio recognition, is described below. A microphone or other sound source is continually sampled into a buffer to obtain a record of the previous N seconds of sound. The contents of the sound buffer are analyzed periodically to determine the identity of the sound content. The sound buffer can have a fixed size, or it can grow as the sound is sampled, which is referred to herein as sequentially growing segments of the audio sample. A report is made to indicate the presence of identified sound recordings. For example, a log file can be collected, or a display on the device can show information about the music, such as title, artist, album cover art, lyrics, or purchase information. To avoid redundancy, a report can be made only when the identity of the recognized sound changes, for example after a program change on a jukebox. Such a device can be used to create a list of the music played from any sound stream (radio broadcast, Internet streaming, hidden microphone, telephone call, and so on). In addition to the identity of the music, information such as the time of recognition can be logged. If location information is available (for example, from GPS), it can also be logged.

To accomplish the identification, each buffer can be identified anew. Alternatively, sound parameters can be extracted, for example into fingerprints or other intermediate feature-extracted forms, and stored in a second buffer. New fingerprints can be added to the front of the second buffer, and old fingerprints discarded from its end. The advantage of such a rolling buffer scheme is that the same analysis does not need to be performed redundantly on overlapping old segments of the sound sample, which saves computational effort. The identification process is carried out periodically on the contents of the rolling fingerprint buffer. In the case of a small portable device, the fingerprint analysis can be carried out in the device and the results sent to the recognition server over a relatively low-bandwidth data channel, because the fingerprint stream is not very data-intensive. The rolling fingerprint buffer can be held on the portable device and transferred to the recognition server each time, or it can be held on the recognition server, in which case a continuing recognition session is cached on the server.
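A rolling fingerprint buffer of this kind can be sketched with a double-ended queue. The class RollingFingerprintBuffer, the 15-second window and the integer-millisecond landmarks are assumptions for illustration; the sketch appends at one end and discards from the other, which matches the front/end arrangement described above up to orientation.

```python
from collections import deque


class RollingFingerprintBuffer:
    """Keeps only the fingerprints from the most recent window of sound."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.items = deque()          # (fingerprint, landmark_ms), oldest first

    def add(self, fingerprint, landmark_ms):
        self.items.append((fingerprint, landmark_ms))
        # Discard fingerprints that have fallen out of the window.
        while self.items and landmark_ms - self.items[0][1] > self.window_ms:
            self.items.popleft()

    def snapshot(self):
        """Current contents, ready to be handed to the identification step."""
        return list(self.items)


# Hypothetical stream: one fingerprint every 8 s, 15 s window.
buffer = RollingFingerprintBuffer(window_ms=15_000)
for fingerprint, landmark_ms in [(1, 0), (2, 8_000), (3, 16_000), (4, 24_000)]:
    buffer.add(fingerprint, landmark_ms)
current = buffer.snapshot()
```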

In such a rolling buffer recognition system, a new sound recording can be recognized as soon as sufficient information is available for recognition. Sufficient information may occupy less than the length of the buffer. For example, if a distinctive song can be recognized uniquely after one second of play, and the system has a recognition period of one second, the song can be recognized immediately, even though the buffer may be 15 to 30 seconds long. Conversely, if a less distinctive song requires more seconds of sample to be recognized, the system must wait longer before declaring the identity of the song. In this sliding-window recognition scheme, sounds are recognized as soon as they can be identified.

It is important to note that, although the present invention has been described in the context of a fully functional recognition system and method, those skilled in the art will appreciate that the mechanism of the present invention can be distributed in the form of a computer-readable medium of instructions in a variety of forms, and that the present invention applies equally regardless of the particular type of signal-bearing media used to actually carry out the distribution. Examples of such computer-accessible devices include computer memory (random access memory (RAM) or read-only memory (ROM)), floppy disks, and CD-ROMs, as well as transmission-type media such as digital and analog communication links.

Claims

1. A method for identifying a media entity from a media sample, comprising:

generating correspondences between landmarks of the media sample and corresponding landmarks of a media entity to be identified, wherein the landmarks of the media sample and the corresponding landmarks of the media entity have equivalent fingerprints; and

identifying the media sample and the media file if a plurality of the correspondences have a linear relationship defined by the following formula:

landmark*_n = m · landmark_n + offset,

where landmark_n is a reproducible location in the media sample, landmark*_n is a reproducible location in the media entity corresponding to landmark_n, m represents the slope, and offset is the offset relative to the starting position of the media sample.

2. The method of claim 1, wherein each sample fingerprint represents one or more features of the media sample at or near the sample landmark, and each entity fingerprint represents one or more features of the media entity at or near the entity landmark.

3. The method of claim 1, wherein the media sample is an audio sample.

4. The method of claim 3, wherein the entity fingerprint characterizes one or more features of the media entity at or near the entity landmark in at least one dimension that includes time.

5. The method of claim 3, wherein the sample fingerprint is invariant to time stretching of the audio sample.

6. The method of claim 5, wherein the sample fingerprint comprises a quotient of frequency components of the audio sample.

7. The method of claim 3, wherein the sample landmarks are time points within the audio sample.

8. The method of claim 7, wherein the time points occur at local maxima of a spectral Lp norm of the audio sample.

9. The method of claim 3, wherein the sample fingerprint is computed from a frequency analysis of the audio sample.

10. The method of claim 3, wherein the sample fingerprint is selected from the group consisting of spectral slice fingerprints, linear predictive coding coefficients, and cepstral coefficients.

11. The method of claim 3, wherein the sample fingerprint is computed from a spectrogram of the audio sample.

12. The method of claim 11, wherein the spectrogram comprises an anchor point and linked salient points.

13. The method of claim 12, wherein the sample fingerprint is computed from frequency coordinates of the anchor point and at least one of the linked salient points.

14. The method of claim 13, wherein the linked salient points fall within a target zone.

15. The method of claim 14, wherein the target zone is defined by a time range.

16. The method of claim 14, wherein the target zone is defined by a frequency range.

17. The method of claim 14, wherein the target zone is variable.

18. The method of claim 13, wherein the fingerprint is computed from a quotient between two of the frequency coordinates of the linked salient points and the anchor point, whereby the fingerprint is time-stretch invariant.

19. The method of claim 13, wherein the fingerprint is further computed from at least one time difference between a time coordinate of the anchor point and a time coordinate of one of the linked salient points.

20. The method of claim 19, wherein the fingerprint is further computed from a product of one of the time differences and one of the frequency coordinates of the linked salient points and the anchor point, whereby the fingerprint is time-stretch invariant.

21. The method of claim 13, wherein the anchor point and the linked salient points are selected from the group consisting of local maxima, local minima, and zero crossings of the spectrogram.