CN105279252A - Related word mining method, search method and search system - Google Patents

Related word mining method, search method and search system

Info

Publication number
CN105279252A
CN105279252A (application CN201510657691.7A)
Authority
CN
China
Prior art keywords
term
word
related term
count
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510657691.7A
Other languages
Chinese (zh)
Other versions
CN105279252B (en)
Inventor
韩增新
蒋冠军
董良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510657691.7A (CN105279252B)
Publication of CN105279252A
Priority to PCT/CN2016/101700 (WO2017063538A1)
Application granted
Publication of CN105279252B
Active legal status
Anticipated expiration legal status

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/95 — Retrieval from the web
    • G06F16/951 — Indexing; Web crawling techniques
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a related word mining method. The method comprises: obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; performing word segmentation on each parallel sentence pair; performing word alignment on the segmented sentence pairs to obtain first aligned word pairs; calculating the co-occurrence frequency of the first aligned word pairs; and determining the first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold to be related words. With this method, related words of high relevance can be mined, the search scope of a query term can be expanded, and the probability of finding better search results is increased. The invention further discloses a search method and a search system.

Description

Related term mining method, search method, and search system
Technical field
The present invention relates to the field of information retrieval, and in particular to a method for mining related terms, a search method, and a search system.
Background technology
A search engine is an essential feature that makes a website convenient to use, and is also an effective tool for studying the behavior of the site's users. An efficient site search lets users find target information quickly and accurately, which solves user problems effectively and promotes the sale of products and services more efficiently; in-depth analysis of visitors' search behavior is also valuable for formulating more effective online marketing strategies.
When searching, a user enters keywords on the search engine's search page; the engine retrieves and returns results. A typical search engine searches directly with the literal keywords the user entered, or with synonyms of the query terms.
However, when only the original query terms or their synonyms are used, the results are limited. There are often good results whose wording differs from the query terms but is semantically highly relevant to them; pages of this kind cannot be recalled.
Summary of the invention
The technical problem addressed by the present invention is that traditional search engines retrieve only by the original query terms or their synonyms, which limits the results obtained. The invention therefore provides a method for mining related terms, a search method, and a search system.
According to one aspect of the present invention, a method for mining related terms is provided.
A method for mining related terms comprises:
obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
performing word segmentation on each parallel sentence pair;
performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
calculating the co-occurrence frequency of the first aligned word pairs; and
determining the first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold to be related terms.
In this way, the method can mine related terms of high relevance, expand the search scope of a query term, and increase the probability of finding better search results.
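As a rough illustration, the mining steps claimed above can be sketched in Python. Everything here is hypothetical scaffolding — the segmentation, alignment function, example data, and threshold are invented for the sketch and are not the patent's implementation:

```python
from collections import Counter

def mine_related_terms(parallel_pairs, segment, align, threshold):
    """Sketch of the claimed pipeline: segment each parallel sentence
    pair, align words, count alignments, keep the frequent pairs."""
    counts = Counter()
    for s1, s2 in parallel_pairs:          # sentences with the same meaning
        w1, w2 = segment(s1), segment(s2)  # word segmentation
        for a, b in align(w1, w2):         # word alignment -> aligned pairs
            counts[(a, b)] += 1
    total = sum(counts.values())
    # co-occurrence frequency above the predetermined threshold -> related
    return {pair for pair, c in counts.items() if c / total > threshold}

# toy example with identity-based alignment of shared words
segment = str.split
align = lambda w1, w2: [(a, a) for a in w1 if a in w2]
pairs = [("baby neck red spot", "baby neck birthmark"),
         ("baby neck red spot", "infant neck red mark")]
related = mine_related_terms(pairs, segment, align, 0.3)
```

With these toy inputs only the pair ("neck", "neck") clears the 0.3 frequency threshold; in practice the alignment function would be one of the rule-based or statistical aligners described later.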
Preferably, the step of obtaining the parallel sentence pairs comprises:
filtering out sentence pairs with different meanings according to the literal similarity of the two sentences.
Filtering by literal similarity removes pairs whose meanings differ, leaving parallel sentence pairs that express the same meaning in different wording.
Preferably, the method further comprises recording the context words of the related terms.
Recording a related term's context makes it possible to judge whether the contexts of two related terms are identical or similar, which helps to further assess the degree of relevance between related terms.
Preferably, the word alignment comprises rule-based word alignment and/or statistical word alignment.
Preferably, the rule-based word alignment comprises at least one of: aligning literally identical words, aligning literally partially identical words, and aligning synonyms.
In this way, related terms of different degrees of relevance can be mined.
Preferably, the statistical word alignment is performed with the GIZA++ tool.
Preferably, the method further comprises:
filtering the large-scale user search behavior data with a linear model to obtain second aligned word pairs;
obtaining statistical features that reflect the degree of relevance between related terms; and
taking the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, and training on the positive and negative samples with the gradient boosting decision tree (GBDT) algorithm, based on the statistical features, to obtain a related term confidence calculation model.
In this way, by establishing the related term confidence calculation model, the degree of relevance between related terms can be distinguished by the model.
Preferably, the related term confidence calculation model is a GBDT nonlinear regression model.
According to another aspect of the present invention, a search method is also disclosed.
A search method comprises the following steps:
obtaining related terms of a query term based on a related term dictionary;
calculating a confidence between the query term and each related term based on a confidence calculation model; and
ranking the results retrieved with the query term and the related terms according to the corresponding confidences.
In this way, the search method finds the related terms corresponding to a query term, expands the search scope and the result set, and avoids the situation where results whose wording differs from the query term, yet is semantically very close to it, cannot be recalled.
Preferably, the related term dictionary is built by the related term mining method described above.
With that mining method, related terms of high relevance can be mined, the search scope of a query term can be expanded, and the probability of finding better search results is increased.
Preferably, the method further comprises performing word segmentation on the query sentence to obtain the query terms.
When a user enters a query sentence, segmenting it yields several query terms, so the search method retrieves results relevant to all of those terms and expands the search scope further.
Preferably, the step of calculating the confidence between the query term and each related term based on the confidence calculation model comprises:
obtaining feature values between each query term and each corresponding related term; and
feeding the feature values to the confidence calculation model as input, and calculating the confidence with that model.
Preferably, the feature values comprise:
relevance information, which measures the degree of relevance between each query term and each corresponding related term; and/or
replaceability information, which measures, in the context of the related term, the degree to which the related term can replace the query term; and/or
co-occurrence information, which measures the co-occurrence relation between query terms; and/or
language model score information, which indicates the language model score of the query sentence when the related term replaces the query term; and/or
weight information, which represents the weight of the related term.
Preferably, the relevance information comprises a first translation probability P_1 and/or a second translation probability P_2:

$$P_1(A, A') = \frac{\mathrm{count}_1(A, A')}{\mathrm{count}_1(A, \cdot)}, \qquad P_2(A, A') = \frac{\mathrm{count}_1(A, A')}{\mathrm{count}_1(\cdot, A')};$$

$$\mathrm{count}_1(A, \cdot) = \sum_j \mathrm{count}_1(A, w_j), \qquad \mathrm{count}_1(\cdot, A') = \sum_i \mathrm{count}_1(w_i, A');$$

where the query term A and the related term A' form a first word pair (A, A'); count_1(A, A') is the number of times the first word pair (A, A') is aligned across the parallel sentence pairs; count_1(A, ·) is the total number of times the query term A is aligned; count_1(·, A') is the total number of times the related term A' is aligned; w_j is the j-th of all words aligned with A in the parallel sentence pairs; w_i is the i-th of all words aligned with A'; count_1(A, w_j) is the number of times A is aligned with w_j; count_1(w_i, A') is the number of times w_i is aligned with A'; and i and j are natural numbers.
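A minimal sketch of these two translation probabilities, assuming the alignment counts have already been gathered into a dictionary keyed by word pair (the names and data below are illustrative, not from the patent):

```python
def translation_probs(count1, A, A_prime):
    """P1 = count1(A,A') / count1(A,.);  P2 = count1(A,A') / count1(.,A')."""
    row = sum(c for (a, _), c in count1.items() if a == A)        # count1(A,.)
    col = sum(c for (_, b), c in count1.items() if b == A_prime)  # count1(.,A')
    joint = count1.get((A, A_prime), 0)                           # count1(A,A')
    p1 = joint / row if row else 0.0
    p2 = joint / col if col else 0.0
    return p1, p2

# toy alignment counts over parallel sentence pairs
count1 = {("baby", "infant"): 8, ("baby", "child"): 2, ("toddler", "infant"): 2}
p1, p2 = translation_probs(count1, "baby", "infant")
```

Here p1 = 8/10 (the share of "baby"'s alignments that go to "infant") and p2 = 8/10 (the share of "infant"'s alignments that come from "baby").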
Preferably, the replaceability information comprises a first replaceability score score(D, Q) and/or a second replaceability score score(D, Q'):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)};$$

$$\mathrm{score}(D, Q') = \sum_{j=1}^{m} \mathrm{IDF}(q'_j) \cdot \frac{f(q'_j, D)\,(k_1 + 1)}{f(q'_j, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)};$$

where the query term A and the related term A' form a first word pair (A, A');
all context words of A and A' form the document D, and |D| is the length of D;
Q is the query sentence, q_i is the i-th term of Q, and n is the total number of terms in Q;
Q' is the combination of the m terms nearest to A, with m < n, and q'_j is the j-th term of Q';
avgdl is the average length of the documents formed by the contexts of all related terms of A;
k_1 is a first constant and b is a second constant;
f(q_i, D) is the frequency of q_i in the document D; and
f(q'_j, D) is the frequency of q'_j in the document D.
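The two scores above are the classical Okapi BM25 formula applied to the context document D. A compact sketch follows; the k1 and b values are the conventional defaults rather than values the patent specifies, and the flat IDF function is a stand-in:

```python
def bm25(query_terms, doc, avgdl, idf, k1=1.2, b=0.75):
    """score(D, Q): sum over query terms of
    IDF(q) * f(q,D)*(k1+1) / (f(q,D) + k1*(1 - b + b*|D|/avgdl))."""
    score = 0.0
    for q in query_terms:
        f = doc.count(q)                               # term frequency f(q, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q) * f * (k1 + 1) / denom
    return score

doc = ["baby", "neck", "red", "spot", "birthmark"]     # context words as D
idf = lambda q: 1.0                                    # flat IDF stand-in
s_full = bms = bm25(["baby", "red", "spot"], doc, avgdl=5.0, idf=idf)
s_near = bm25(["red", "spot"], doc, avgdl=5.0, idf=idf)  # the m terms near A
```

With |D| = avgdl and every term appearing once, each term contributes exactly IDF(q), so the two scores are 3.0 and 2.0 here; comparing them indicates how well the related term's context covers the query sentence versus only the terms near A.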
Preferably, the co-occurrence information comprises first co-occurrence information and/or second co-occurrence information obtained from the co-occurrence index PMI, where

$$\mathrm{pmi}(A, B) = \log \frac{\mathrm{count}_2(A, \cdot) \times \mathrm{count}_2(\cdot, B)}{\mathrm{count}_2(A, B) \times \mathrm{count}_2(\cdot, \cdot)} \Big/ \log \frac{\mathrm{count}_2(A, B)}{\mathrm{count}_2(\cdot, \cdot)};$$

$$\mathrm{count}_2(A, \cdot) = \sum_j \mathrm{count}_2(A, w_j); \qquad \mathrm{count}_2(\cdot, B) = \sum_i \mathrm{count}_2(w_i, B); \qquad \mathrm{count}_2(\cdot, \cdot) = \sum_{i,j} \mathrm{count}_2(w_i, w_j);$$

count_2(A, ·) is the total number of times the term A occurs together with other terms in the retrieval resources; count_2(·, B) is the total number of times the term B occurs together with other terms; count_2(A, B) is the number of times the two terms A and B occur together; w_j is the j-th of all words that occur together with A, and w_i is the i-th of all words that occur together with B; count_2(A, w_j), count_2(w_i, B), and count_2(w_i, w_j) are the numbers of times the respective pairs occur together in the retrieval resources; i and j are natural numbers.
The first co-occurrence information is the mean of the co-occurrence index PMI between the query term and the other words of the query sentence.
The second co-occurrence information is the mean of the co-occurrence index PMI between the related term and the other words of the query sentence.
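Because of the ratio of logarithms, the index as written is the normalized variant of pointwise mutual information, which keeps values in a bounded range. A sketch with invented counts:

```python
import math

def pmi_index(count2, A, B):
    """pmi(A,B) = log(count2(A,.)*count2(.,B) / (count2(A,B)*count2(.,.)))
                  / log(count2(A,B) / count2(.,.))   (normalized PMI)."""
    row = sum(c for (a, _), c in count2.items() if a == A)  # count2(A,.)
    col = sum(c for (_, b), c in count2.items() if b == B)  # count2(.,B)
    total = sum(count2.values())                            # count2(.,.)
    joint = count2[(A, B)]                                  # count2(A,B)
    num = math.log(row * col / (joint * total))
    den = math.log(joint / total)
    return num / den

# toy co-occurrence counts from the retrieval resources
count2 = {("baby", "milk"): 50, ("baby", "car"): 5,
          ("adult", "milk"): 10, ("adult", "car"): 35}
score = pmi_index(count2, "baby", "milk")
```

For this toy data "baby" and "milk" co-occur more often than chance, so the score is positive (roughly 0.6); independent terms would score near zero and terms that avoid each other below zero.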
Preferably, the method further comprises training an N-gram language model on the large-scale user search behavior data to obtain the language model.
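The language model score feature — the related term substituted into the query sentence and scored by an N-gram model trained on search logs — could be sketched with a toy bigram model. The add-one smoothing and the training data here are placeholders, not the patent's setup:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Toy bigram language model with add-one smoothing."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>"] + s.split()
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)
    def logprob(sentence):
        toks = ["<s>"] + sentence.split()
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

logs = ["baby drinks milk", "baby eats fruit", "infant drinks milk"]
lm = train_bigram(logs)
# substituting a plausible related term keeps the query fluent,
# while an implausible substitution lowers the language model score
good = lm("infant drinks milk") > lm("infant car milk")
```

The comparison illustrates the feature's purpose: a related term that fits the surrounding query words yields a higher language model score than one that breaks the sentence.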
Preferably, the step of ranking the results retrieved with the query term and the related terms according to the corresponding confidences is performed by a ranking model.
Preferably, the method further comprises a step in which the ranking model performs an initial ranking of the retrieval resources according to the query sentence and the page information of the retrieval resources.
Preferably, the retrieval resources are web page resources and/or document resources.
According to a further aspect of the present invention, a search system is also provided.
A search system comprises:
a related term dictionary storage device;
a related term acquisition device, for obtaining the related terms of a query term based on the related term dictionary stored in the related term dictionary storage device;
a confidence calculation device, for calculating the confidence between the query term and each related term based on a related term confidence calculation model; and
a ranking device, for ranking the results retrieved with the query term and the related terms according to the corresponding confidences.
Preferably, the search system further comprises a related term dictionary building device for building the related term dictionary, comprising:
a parallel sentence acquisition module, for obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
a word segmenter, for performing word segmentation on each parallel sentence pair;
a word alignment module, for performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, for calculating the co-occurrence frequency of the first aligned word pairs; and
a related term determination module, for determining the first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold to be related terms.
Preferably, the related term dictionary building device further comprises:
a context acquisition module, for obtaining the context words of the related terms.
Preferably, the search system further comprises a related term confidence calculation model building device for building the related term confidence calculation model, comprising:
a linear model filtering module, for filtering the large-scale user search behavior data with a linear model to obtain second aligned word pairs; and
a training module, for training with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples using the GBDT algorithm, to obtain the related term confidence calculation model.
Preferably, the related term confidence calculation model is a GBDT nonlinear regression model.
Preferably, the word segmenter also performs word segmentation on the query sentence to obtain the query terms.
Preferably, the confidence calculation device comprises:
a feature extraction module, for extracting the feature values between each query term and each corresponding related term; and
a confidence calculation module, for feeding the feature values to the related term confidence calculation model as input and calculating the confidence with that model.
Preferably, the feature extraction module comprises:
a relevance information acquisition unit, for obtaining relevance information that measures the degree of relevance between each query term and each corresponding related term; and/or
a replaceability information acquisition unit, for obtaining replaceability information that measures, in the context of the related term, the degree to which the related term can replace the query term; and/or
a co-occurrence information acquisition unit, for obtaining co-occurrence information that measures the co-occurrence relation between query terms; and/or
a language model score information acquisition unit, for obtaining language model score information that indicates the language model score of the query sentence when the related term replaces the query term; and/or
a weight information acquisition unit, for obtaining weight information that represents the weight of the related term.
Preferably, the feature extraction module further comprises:
a language model acquisition unit, for training an N-gram language model on the large-scale user search behavior data to obtain the language model.
Preferably, the ranking device ranks the results retrieved with the query term and the related terms according to the corresponding confidences by means of a ranking model.
Preferably, the ranking device also performs, by means of the ranking model, an initial ranking of the retrieval resources according to the query sentence and the page information of the retrieval resources.
In this way, the mining method, search method, and search system described above find the related terms corresponding to a query term and retrieve with the query term and its related terms together, expanding the search scope and the result set, and avoiding the situation where results whose wording differs from the query term, yet is semantically very close to it, cannot be recalled.
Brief description of the drawings
The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description of exemplary embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which the same reference numbers generally denote the same components.
Fig. 1 shows a flow chart of a method for mining related terms according to an embodiment of the invention;
Fig. 2 shows a flow chart of a method for mining related terms according to another embodiment of the invention;
Fig. 3 shows a flow chart of a search method according to an embodiment of the invention;
Fig. 4 shows a flow chart of a search method according to another embodiment of the invention;
Fig. 5 shows a flow chart of step S240 of the embodiment shown in Fig. 4;
Fig. 6 shows a schematic diagram of a search system according to an embodiment of the invention;
Fig. 7 shows a schematic diagram of a search system according to another embodiment of the invention;
Fig. 8 shows a schematic diagram of the related term dictionary building device 310 of the embodiment shown in Fig. 7;
Fig. 9 shows a schematic diagram of the related term confidence calculation model building device 350 of the embodiment shown in Fig. 7;
Fig. 10 shows a schematic diagram of the confidence calculation device 390 of the embodiment shown in Fig. 7;
Fig. 11 shows a schematic diagram of the feature extraction module 394 of the embodiment shown in Fig. 10.
Detailed description
Preferred embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
A method for mining related terms from large-scale user search behavior data according to an embodiment of the invention is described below with reference to Fig. 1.
Fig. 1 shows a flow chart of the method for mining related terms according to an embodiment of the invention.
In step S110, parallel sentence pairs that express the same meaning in different forms are obtained based on large-scale user search behavior data.
The parallel sentence pairs are obtained from data such as the users' query logs and/or clicked-title logs. A parallel sentence pair is a pair of sentences that express the same meaning in different forms — for example, "the baby has a red birthmark on its neck" and "the baby has a red spot on its neck".
In the large-scale user search behavior data — for example, the query logs and/or title logs — there are many sentence pairs whose meanings are identical but whose wording differs. Further, sentence pairs with different meanings can be filtered out according to the literal similarity of the two sentences.
In step S120, word segmentation is performed on each parallel sentence pair.
Each sentence of every parallel sentence pair is segmented into words using a word segmentation technique.
In step S130, word alignment is performed on the segmented parallel sentence pairs to obtain first aligned word pairs.
Word alignment finds the words that express the same meaning.
The word alignment may comprise rule-based word alignment and/or statistical word alignment. The rule-based word alignment comprises at least one of: aligning literally identical words, aligning literally partially identical words, and aligning synonyms. The statistical word alignment is performed with the GIZA++ tool.
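The three rule-based alignments could look roughly like the sketch below; the partial-match rule (substring overlap) and the synonym table are invented for illustration, and a statistical aligner such as GIZA++ would replace this entirely in the statistical case:

```python
def rule_align(words1, words2, synonyms=None):
    """Align literally identical words, literally partially identical
    words (substring overlap), and dictionary synonyms across a
    segmented sentence pair."""
    synonyms = synonyms or {}
    pairs = []
    for a in words1:
        for b in words2:
            identical = a == b
            partial = a != b and (a in b or b in a)  # partial literal match
            synonym = synonyms.get(a) == b           # synonym-table match
            if identical or partial or synonym:
                pairs.append((a, b))
    return pairs

syn = {"baby": "infant"}                             # hypothetical synonym table
out = rule_align(["baby", "birthmark"], ["infant", "mark"], syn)
```

Here "baby" aligns to "infant" via the synonym rule and "birthmark" to "mark" via the partial-match rule, yielding two first aligned word pairs.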
In step S140, the co-occurrence frequency of the first aligned word pairs is calculated.
The co-occurrence frequency can be evaluated by the first translation probability P_1 and/or the second translation probability P_2, computed as follows:

$$P_1(A, A') = \frac{\mathrm{count}_1(A, A')}{\mathrm{count}_1(A, \cdot)}, \qquad P_2(A, A') = \frac{\mathrm{count}_1(A, A')}{\mathrm{count}_1(\cdot, A')};$$

$$\mathrm{count}_1(A, \cdot) = \sum_j \mathrm{count}_1(A, w_j), \qquad \mathrm{count}_1(\cdot, A') = \sum_i \mathrm{count}_1(w_i, A');$$

where the query term A and the related term A' form a first word pair (A, A'); count_1(A, A') is the number of times the first word pair (A, A') is aligned across the parallel sentence pairs; count_1(A, ·) is the total number of times A is aligned; count_1(·, A') is the total number of times A' is aligned; w_j is the j-th of all words aligned with A; w_i is the i-th of all words aligned with A'; count_1(A, w_j) is the number of times A is aligned with w_j; count_1(w_i, A') is the number of times w_i is aligned with A'; and i and j are natural numbers.
Note that count_1(A, A') does not depend on the order of A and A': count_1(A, A') and count_1(A', A) are identical.
P_1 is the fraction of all alignments of the query term A that align it with the related term A'; P_2 is the fraction of all alignments of the related term A' that align it with A.
Here, the alignment count is the number of times two words are aligned across the different parallel sentence pairs, while the co-occurrence count is the number of times two words occur together in the same corpus.
In step S150, the first aligned word pairs whose co-occurrence frequency exceeds a predetermined threshold are determined to be related terms.
The predetermined threshold can be set according to the required degree of relevance between related terms. In one embodiment, the threshold can be 1.0e−99.
In this way, the method mines related terms of high relevance, further expands the search scope of a query term, and improves the probability of finding better search results. Moreover, related terms of different similarity can be obtained by varying the threshold.
A method for mining related terms from large-scale user search behavior data according to another embodiment of the invention is described below with reference to Fig. 2.
Referring to Fig. 2, the mining method described above further comprises the following steps.
In step S160, the context words of the related terms are recorded.
Recording the context words of a related term makes its context known. By judging whether the contexts of two related terms are identical or similar, the degree of relevance between them can be assessed further, which helps obtain related terms of higher similarity.
How the context words of a related term are obtained may be restricted differently depending on the length of the parallel sentences. In the present embodiment, because parallel sentence pairs are generally not long, no restriction on length or form is applied. In other embodiments, depending on the required degree of relevance between related terms or on other criteria, the length or the way the context words are obtained can be restricted differently.
In step S170, use linear model to filter described large-scale consumer search behavior data acquisition second and to align word pair.
Wherein, above-mentioned linear model can be simple linear model.Further, this simple linear model can be employment work mark a small amount of (can be ten thousand ranks) word pair, use predicate between statistical nature, with the linear model of simple linear regression model matching.Wherein, above-mentioned matching can refer to linear regression fit modeling.
The word of above-mentioned artificial mark is to negligible amounts, and model is simple, and the confidence score therefore using this model to export is not high.Above-mentioned large-scale consumer search behavior data are filtered by this linear model, confidence score is less than the result of specific threshold as above-mentioned second alignment word pair, the word gone out because using this model filter is not high to confidence score, and therefore this second alignment word is to as poor word pair.Concrete, above-mentioned specific threshold is close or be less than zero.
The word of above-mentioned " manually marking " is to referring to: under certain query statement (query), the former word in a query forms a word pair to related term, and whether this word, to through mark, is suitable as a related term.Above-mentioned notation methods can be, " what does within eight months, baby eat? " in this query, this related term centering of baby-> baby, " baby " is former word, " baby " is related term, and this related term can mark 1 point, and representative can as a related term; Under this query, " baby "-> " dotey " marks 0 point, and representative can not as a related term.
The above poor word pairs refer to erroneous word pairs that should not occur under the context of the current query words; in other words, word pairs that violate the user's intent. For example, when a user searches "baby drinking milk", "infant drinking milk" is a good word pair (i.e., a related term labeled 1); but rewriting "what fruit is good to eat" as "what fruit is good to drink" produces a meaning-drifted erroneous word pair, i.e., a poor word pair. Furthermore, poor word pairs can take many more forms and are not limited to this example.
In step S180, statistical features that can reflect the degree of correlation between related terms are obtained.
The above statistical features are statistical features, over the context words, of whether a word pair is suitable under the current query context. They include at least one of: degree-of-correlation information between every two related terms, replaceability information, co-occurrence relationship information, language model score information, and weight value information.
In step S190, with the above first alignment word pairs as positive samples and the second alignment word pairs as negative samples, a gradient boosting decision tree (GBDT) algorithm is used, based on the above statistical features, to train the positive and negative samples and obtain the related term confidence calculation model.
The related term confidence calculation model may be a GBDT nonlinear regression model.
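The training of step S190 can be illustrated with a small, self-contained sketch of gradient boosting over depth-1 regression trees (stumps) under squared loss. This is a generic GBDT toy, not the patent's actual trainer: the two feature values per sample and the sample data are invented, and a production system would use a full GBDT library.

```python
def fit_stump(X, residuals):
    """Find the single (feature, threshold) split minimizing squared error."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[f] <= t]
            right = [r for x, r in zip(X, residuals) if x[f] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lv, rv)
    return best[1:]

def fit_gbdt(X, y, rounds=20, lr=0.3):
    """Boost stumps against the residuals of the running prediction."""
    base = sum(y) / len(y)
    pred = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f, t, lv, rv = fit_stump(X, residuals)
        stumps.append((f, t, lv, rv))
        pred = [p + lr * (lv if x[f] <= t else rv) for x, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lv if x[f] <= t else rv) for f, t, lv, rv in stumps)

# Positive samples (first alignment pairs) vs. negative samples (poor pairs),
# each described by two invented statistical feature values.
X = [(0.9, 0.8), (0.8, 0.7), (0.7, 0.9), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
model = fit_gbdt(X, y)
```

The fitted model's output can then serve as the related term confidence score for unseen word pairs.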
A searching method according to an embodiment of the present invention is described below with reference to Fig. 3.
Fig. 3 shows a flowchart of the searching method according to an embodiment of the present invention.
The searching method comprises the following steps.
In step S220, the related terms of a search term are obtained based on a related term dictionary.
Here, the related term dictionary is established by the above method of mining related terms. In this way, all related terms of the search term can be obtained; these include not only synonyms of the search term (both strong synonyms and contextual synonyms) but also related terms of wider coverage. By the above mining method, related terms with a higher degree of correlation can be mined, further expanding the scope of the search and improving the probability of finding better search results.
In step S240, the confidence between the search term and each related term is calculated based on a confidence calculation model.
In step S260, the results retrieved using the search term and its related terms are sorted according to the corresponding confidences.
In this step, a ranking model sorts the results retrieved using the search term and its related terms according to the corresponding confidences. The ranking model may be a quicksort model that sorts according to an existing quicksort algorithm; it will be appreciated that other existing models may also be used.
Searching by related terms not only covers high-frequency synonyms but also gives more attention to medium- and low-frequency related terms. Especially when retrieval resources are scarce, searching with related terms retrieves information to the greatest extent.
In this way, with this searching method, the related terms corresponding to a search term can be found, and the search term and its related terms are used together for retrieval, expanding the scope of the search and the search results. This prevents the situation where a good result cannot be recalled because its wording differs from the search term even though it is semantically very close to it.
In another embodiment, the ranking model may also perform, before step S260, a step of preliminarily ranking the retrieval resources according to the retrieval statement and the page information of the retrieval resources.
This preliminary ranking step is an ordinary retrieval process; it may also be constrained by a set retrieval threshold, so that only retrieval results reaching a predetermined score enter the re-ranking of step S260. In this way, when there are many preliminary results, the amount of re-ranking can be reduced. This two-stage ranking may also be used when the user requires that only high-accuracy results be displayed.
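The two-stage ranking described above can be sketched as a coarse cutoff followed by a confidence-based re-sort. The scores, cutoff value, and document names below are invented for illustration.

```python
# Candidate results: (doc_id, first_pass_retrieval_score, related_term_confidence)
results = [
    ("doc1", 0.9, 0.40),
    ("doc2", 0.7, 0.95),
    ("doc3", 0.3, 0.99),   # fails the first-pass cutoff despite high confidence
    ("doc4", 0.8, 0.60),
]

FIRST_PASS_CUTOFF = 0.5

# Stage 1: keep only results whose coarse retrieval score reaches the cutoff.
survivors = [r for r in results if r[1] >= FIRST_PASS_CUTOFF]

# Stage 2: re-rank the survivors by related-term confidence, highest first.
reranked = sorted(survivors, key=lambda r: r[2], reverse=True)
ranking = [doc for doc, _, _ in reranked]
```

Because only survivors of stage 1 are re-ranked, the expensive confidence-based sort runs over a much smaller set when there are many preliminary results.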
Here, the retrieval resources may be web page resources and/or document resources: a piece of text, the title of a web page, a query statement, or even a long document.
A searching method according to another embodiment of the present invention is described below with reference to Fig. 4.
Fig. 4 shows a flowchart of the searching method according to another embodiment of the present invention.
The above searching method may further comprise step S210 before step S220. In step S210, a retrieval statement is segmented into words to obtain the search terms.
When the user inputs a retrieval statement, the statement is segmented to obtain several search terms, so that retrieval results relevant to those terms can be retrieved by this searching method, further expanding the scope of the search. The segmentation may include Chinese word segmentation and/or English tokenization, and may also include segmentation of other languages; the corresponding segmentation may use existing segmentation techniques of various forms.
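As one illustration of such an existing technique, a greedy forward maximum-match segmenter over a fixed vocabulary can be sketched as below. This is a classical segmentation approach, not the patent's specific segmenter, and the vocabulary and the unspaced example string are invented.

```python
def max_match(sentence, vocab, max_len=4):
    """Greedy forward maximum-match word segmentation over a fixed vocabulary.

    At each position, try the longest candidate substring first; fall back to
    a single character when no vocabulary entry matches.
    """
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

vocab = {"baby", "milk", "drinks"}
segmented = max_match("babydrinksmilk", vocab, max_len=6)
```

Real Chinese segmenters combine dictionaries with statistical models, but the dictionary maximum-match idea above is the simplest member of the family.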
Step S240 of the embodiment shown in Fig. 4 is described below with reference to Fig. 5.
Fig. 5 shows a flowchart of step S240 of the embodiment shown in Fig. 4.
In step S242, feature values between each search term and each corresponding related term are obtained.
The retrieved content differs each time, so the corresponding search terms differ, and therefore the above feature values also differ.
In step S244, the above feature values are used as input to the confidence calculation model, and the confidence is calculated based on this model.
Here, the feature values may include at least one of degree-of-correlation information, replaceability information, co-occurrence relationship information, language model score information, and weight value information.
The degree-of-correlation information measures the degree of correlation between each search term and each corresponding related term.
The degree-of-correlation information may include a first translation probability P1 and/or a second translation probability P2, expressed respectively as:

P1(A, A′) = count1(A, A′) / count1(A, ·), P2(A, A′) = count1(A, A′) / count1(·, A′);

count1(A, ·) = Σj count1(A, wj), count1(·, A′) = Σi count1(wi, A′);
Here, the search term A and a related term A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned across the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related term A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with the search term A in the parallel sentence pairs; wi is the i-th of all words aligned with the related term A′ in the parallel sentence pairs; count1(A, wj) is the number of times A and wj are aligned; count1(wi, A′) is the number of times wi and A′ are aligned; and i and j are natural numbers.
It will be appreciated that the value of count1(A, A′) is independent of the order of A and A′, i.e., count1(A, A′) and count1(A′, A) are identical.
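The two translation probabilities follow directly from the alignment counts, as the following sketch shows. The alignment counts themselves are invented toy values.

```python
from collections import Counter

# count1[(a, b)]: times word a was aligned to word b across parallel sentence pairs.
count1 = Counter({
    ("baby", "infant"): 8,
    ("baby", "darling"): 2,
    ("milk", "infant"): 1,
})

def p1(a, b):
    """P1(A, A') = count1(A, A') / count1(A, .)."""
    row = sum(c for (x, _), c in count1.items() if x == a)
    return count1[(a, b)] / row

def p2(a, b):
    """P2(A, A') = count1(A, A') / count1(., A')."""
    col = sum(c for (_, y), c in count1.items() if y == b)
    return count1[(a, b)] / col
```

For the toy counts above, p1("baby", "infant") is 8/10 and p2("baby", "infant") is 8/9: the same joint count normalized by the row marginal and the column marginal respectively.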
The replaceability information measures, within the context of the related term, the replaceability between the search term and the related term.
The replaceability information includes a first replaceability score score(D, Q) and/or a second replaceability score score(D, Q′), expressed as:

score(D, Q) = Σ_{i=1}^{n} IDF(q_i) · f(q_i, D) · (k1 + 1) / ( f(q_i, D) + k1 · (1 − b + b · |D| / avgdl) );

score(D, Q′) = Σ_{j=1}^{m} IDF(q′_j) · f(q′_j, D) · (k1 + 1) / ( f(q′_j, D) + k1 · (1 − b + b · |D| / avgdl) );
Here, the search term A and the related term A′ form the first word pair (A, A′);
the context words of the search term A together with the context words of the related term A′ serve as a document D, and |D| is the length of D (the context words of A and A′ mostly come from the same sentence pairs, with some individual differences, and are all recorded as one context);
Q is the retrieval statement, q_i is the i-th search term of Q, and n is the total number of search terms in Q;
Q′ is the combination of the m search terms nearest to the search term A, m < n, and q′_j is the j-th search term of the combination Q′;
avgdl is the average length of the documents formed by the contexts of all related terms of the search term A;
k1 is a first constant and b is a second constant;
f(q_i, D) is the frequency of occurrence of q_i in the document D;
f(q′_j, D) is the frequency of occurrence of q′_j in the document D.
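The replaceability score above has the shape of a BM25-style ranking function, and can be sketched as follows. Note an assumption: the patent does not define IDF, so this sketch uses a common BM25 IDF variant; the token lists, k1 = 1.2, and b = 0.75 are likewise illustrative defaults, not values from the source.

```python
import math

def bm25_score(query_terms, doc_terms, docs, k1=1.2, b=0.75):
    """BM25-style score of document D (a related term's context words) for query Q."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                   # f(q_i, D)
        n_q = sum(1 for d in docs if q in d)     # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # assumed IDF form
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# Toy corpus: each document is the context-word list of one related term.
docs = [
    ["eight", "month", "baby", "eat", "what"],
    ["baby", "drink", "milk"],
    ["fruit", "tasty"],
]
D = docs[0]  # context document of the related term under consideration
```

Comparing bm25_score(D, Q) with bm25_score(D, Q′) then indicates how well the related term's context supports the full retrieval statement versus just the terms nearest A.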
The co-occurrence relationship information measures the co-occurrence relationship between search terms; it refers to statistics of two terms appearing simultaneously in a query corpus (retrieval resources, web pages and/or documents).
The co-occurrence relationship information includes first co-occurrence relationship information and/or second co-occurrence relationship information obtained based on the co-occurrence relationship index PMI:

pmi(A, B) = log( count2(A, ·) × count2(·, B) / ( count2(A, B) × count2(·, ·) ) ) / log( count2(A, B) / count2(·, ·) );

count2(A, ·) = Σj count2(A, wj);
count2(·, B) = Σi count2(wi, B);
count2(·, ·) = Σi,j count2(wi, wj);
count2(A, ·) is the total number of times the search term A appears simultaneously with other search terms in the retrieval resources; count2(·, B) is the total number of times the search term B appears simultaneously with other search terms in the retrieval resources; count2(A, B) is the number of times the two search terms A and B appear simultaneously in the retrieval resources; wj is the j-th of all words appearing simultaneously with the search term A in the retrieval resources; wi is the i-th of all words appearing simultaneously with the related term B in the retrieval resources; count2(A, wj) is the number of times A and wj appear simultaneously; count2(wi, B) is the number of times wi and B appear simultaneously; count2(wi, wj) is the number of times wi and wj appear simultaneously; and i and j are natural numbers.
It will be appreciated that the value of count2(A, B) is independent of the order of A and B, i.e., count2(A, B) and count2(B, A) are identical.
The first co-occurrence relationship information is the mean of the co-occurrence relationship index PMI between the search term and the other words in the retrieval statement.
The second co-occurrence relationship information is the mean of the co-occurrence relationship index PMI between the related term and the other search terms in the retrieval statement (excluding the search term corresponding to this related term).
When calculating the first co-occurrence relationship information, the above formula can be used directly and the mean taken; when calculating the second, the search term A in the formula is replaced by its related term A′.
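The PMI formula above is a normalized variant (its value falls in [-1, 1] and grows with association strength). It can be sketched from symmetric co-occurrence counts as follows; the toy co-occurrence pairs are invented.

```python
import math
from collections import Counter

# Observed co-occurrences of terms in a toy query corpus.
pairs = [("baby", "milk"), ("baby", "milk"), ("baby", "eat"), ("fruit", "eat")]
count2 = Counter()
for a, b in pairs:
    count2[(a, b)] += 1
    count2[(b, a)] += 1  # order-independent, as the text notes

def marginal(a):
    """count2(A, .): total co-occurrences of term a with any other term."""
    return sum(c for (x, _), c in count2.items() if x == a)

total = sum(count2.values())  # count2(., .)

def pmi(a, b):
    """Normalized PMI as given in the text: log(p(A)p(B)/p(A,B)) / log p(A,B)."""
    joint = count2[(a, b)] / total
    pa, pb = marginal(a) / total, marginal(b) / total
    return math.log(pa * pb / joint) / math.log(joint)
```

Because both logarithms are negative for associated pairs, the ratio comes out positive, and more strongly associated pairs score higher.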
The language model score information represents the language model score of the retrieval statement after the related term replaces the search term, given the words before and after it. The method also comprises training an N-gram language model based on the large-scale user search behavior data to obtain the above language model.
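The language model score can be sketched with a tiny bigram (2-gram) model trained on a toy query log standing in for the large-scale user search behavior data. The add-one smoothing and the example queries are illustrative assumptions, not details from the source.

```python
import math
from collections import Counter

# Toy query log standing in for large-scale user search behavior data.
queries = [
    "what does baby eat",
    "what does infant eat",
    "baby drinks milk",
    "infant drinks milk",
]

bigrams, unigrams = Counter(), Counter()
for q in queries:
    toks = ["<s>"] + q.split() + ["</s>"]
    unigrams.update(toks[:-1])          # contexts: every token that precedes another
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)  # vocabulary size for add-one smoothing

def sentence_logprob(sentence):
    """Add-one-smoothed bigram log-probability of a candidate query."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(toks, toks[1:])
    )
```

Replacing "baby" with a fluent related term such as "infant" keeps the query's language model score high, while a bad substitution lowers it, which is what the feature is meant to capture.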
The weight value information represents the weight of the related term.
The statistical features between the related terms are calculated in the same manner as described for step S180.
A search system according to an embodiment of the present invention is described below with reference to Fig. 6.
Fig. 6 shows a schematic diagram of the search system according to an embodiment of the present invention.
A search system 300 comprises a related term dictionary storage device 320, a related term acquisition device 340, a search device 360, a sorting device 380, and a confidence calculation device 390.
The related term acquisition device 340 is connected to the related term dictionary storage device 320 and obtains the related terms of a search term based on it. The search device 360 performs retrieval based on the search term and its related terms. The confidence calculation device 390 calculates, based on a confidence calculation model, the confidence between the search term and each of its corresponding related terms. The sorting device 380 sorts the results retrieved by the search device 360 according to the corresponding confidences calculated by the confidence calculation device 390.
In this way, with the search system 300, the related terms corresponding to a search term can be found and retrieval performed using them, expanding the scope of the search, further expanding the search results, and improving the probability of retrieving the target file. This prevents the phenomenon where a good result cannot be recalled because its wording differs from the search term even though it is semantically very close to it.
A search system according to another embodiment of the present invention is described below with reference to Fig. 7.
Fig. 7 shows a schematic diagram of the search system according to another embodiment of the present invention.
The search system 300 may further comprise a related term dictionary establishing device 310 and a related term confidence calculation model establishing device 350.
The related term dictionary establishing device 310 is connected to the related term dictionary storage device 320, and establishes the above related term dictionary by the above method of mining related terms.
The related term dictionary establishing device 310 of the embodiment shown in Fig. 7, used for establishing the related term dictionary, is described with reference to Fig. 8.
Fig. 8 shows a schematic diagram of the related term dictionary establishing device 310 of the embodiment shown in Fig. 7.
The related term dictionary establishing device 310 may comprise: a parallel sentence acquisition module 311, a segmenter 313, a word alignment module 315, a co-occurrence frequency acquisition module 317, a related term determination module 319, and a context acquisition module 318.
The parallel sentence acquisition module 311 acquires, based on the large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms; the segmenter 313 performs word segmentation on each parallel sentence pair; the word alignment module 315 performs word alignment on the segmented parallel sentence pairs to obtain first alignment word pairs; the co-occurrence frequency acquisition module 317 calculates the co-occurrence frequency of the first alignment word pairs; and the related term determination module 319 determines the first alignment word pairs whose co-occurrence frequency exceeds a predetermined threshold as related terms, forming the related term dictionary.
In this way, the related term dictionary establishing device 310 can mine related terms with a higher degree of correlation, expand the scope of term search, and improve the probability of finding better search results; related terms of different similarity can also be obtained by varying the predetermined threshold.
By establishing the related term dictionary, all related terms of a search term can be obtained; these include not only synonyms of the search term (both strong synonyms and contextual synonyms) but also related terms of wider coverage. By the above mining method, related terms with a higher degree of correlation can be mined, the scope of term search can be expanded, and the probability of finding better search results improved.
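The pipeline these modules implement can be sketched end to end. Two simplifying assumptions: the parallel query pairs are given pre-segmented, and word alignment is reduced to literal position alignment, whereas the text describes rule-based alignment and statistical alignment with a tool such as GIZA++. All data values are invented.

```python
from collections import Counter

# Parallel query pairs (same meaning, different wording), already segmented.
parallel_pairs = [
    (["baby", "drinks", "milk"], ["infant", "drinks", "milk"]),
    (["baby", "eats", "what"], ["infant", "eats", "what"]),
    (["baby", "sleeps"], ["darling", "sleeps"]),
]

# Word alignment stub: positional alignment; differing aligned words form
# candidate first alignment word pairs.
aligned = Counter()
for left, right in parallel_pairs:
    for a, b in zip(left, right):
        if a != b:
            aligned[(a, b)] += 1

# Keep pairs whose co-occurrence frequency reaches a predetermined threshold.
PREDETERMINED_THRESHOLD = 2
related_terms = {pair for pair, freq in aligned.items() if freq >= PREDETERMINED_THRESHOLD}
```

Raising the threshold keeps only pairs seen consistently across many query pairs, i.e., higher-similarity related terms, as the text notes.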
In addition, the segmenter 313 is also used to segment the retrieval statement to obtain the search terms. When the user inputs a retrieval statement, the statement is segmented to obtain several search terms, so that relevant retrieval results can be retrieved, further expanding the scope of the search.
Further, the related term dictionary establishing device 310 also comprises the context acquisition module 318, which obtains the context words of the related terms.
By recording the context words of a related term, the context of the related term can be known. By judging whether the contexts of two related terms are identical or close, the degree of correlation between them can be further judged, which helps obtain related terms of higher similarity.
When obtaining the context words of a related term, restrictions of differing degrees may be placed on their length, depending on the length of the parallel sentences. In the present embodiment, because the length of a parallel sentence pair is generally not long, no restriction on length, or of any other form, is applied. In other embodiments, depending on the required degree of correlation of the related terms or on other criteria, different restrictions may be placed on the length or on the manner of obtaining the context words.
The related term confidence calculation model establishing device 350 of the embodiment shown in Fig. 7 is described below with reference to Fig. 9.
Fig. 9 shows a schematic diagram of the related term confidence calculation model establishing device 350 of the embodiment shown in Fig. 7.
The related term confidence calculation model establishing device 350 may comprise a linear model filtering module 352 and a training module 354.
The linear model filtering module 352 uses a linear model to filter the large-scale user search behavior data to obtain the second alignment word pairs.
The above linear model may be a simple linear model; specifically, it may be fitted by simple linear regression on a small number (on the order of tens of thousands) of manually labeled word pairs, using the statistical features between the words. Because the labeled pairs are few and the model simple, the confidence output by this model is not highly precise. The large-scale user search behavior data are filtered through this linear model to obtain the second alignment word pairs, which are poor word pairs: erroneous word pairs that should not occur under the context of the current query words, in other words word pairs that violate the user's intent. For example, when a user searches "baby drinking milk", "infant drinking milk" is a good word pair; but rewriting "what fruit is good to eat" as "what fruit is good to drink" produces a meaning-drifted erroneous word pair, i.e., a poor word pair.
The training module 354 is connected to the related term dictionary establishing device 310 and the linear model filtering module 352 respectively; with the first alignment word pairs as positive samples and the second alignment word pairs as negative samples, it trains the positive and negative samples based on the GBDT algorithm to obtain the related term confidence calculation model.
The related term confidence calculation model may be a GBDT nonlinear regression model.
Referring to Fig. 10, the confidence calculation device 390 of the embodiment shown in Fig. 7 may comprise a confidence calculation module 392 and a feature value extraction module 394.
The feature value extraction module 394 extracts the feature values between each search term and each of its corresponding related terms; the confidence calculation module 392 uses these feature values as input to the confidence calculation model and calculates the confidence based on it.
Fig. 11 shows a schematic diagram of the feature value extraction module 394 of the embodiment shown in Fig. 10.
The feature value extraction module 394 may comprise at least one of a degree-of-correlation information acquisition unit 3941, a replaceability information acquisition unit 3942, a co-occurrence relationship information acquisition unit 3943, a language model score information acquisition unit 3944, a weight value information acquisition unit 3945, and a language model acquisition unit 3946.
The degree-of-correlation information acquisition unit 3941 obtains the degree-of-correlation information, which measures the degree of correlation between each search term and each corresponding related term.
The replaceability information acquisition unit 3942 obtains the replaceability information, which measures the replaceability between the search term and the related term within the context of the related term.
The co-occurrence relationship information acquisition unit 3943 obtains the co-occurrence relationship information, which measures the co-occurrence relationship between search terms.
The language model score information acquisition unit 3944 obtains the language model score information, which represents the language model score of the retrieval statement after the related term replaces the search term.
The weight value information acquisition unit 3945 obtains the weight value information, which represents the weight of the related term.
Further, the feature value extraction module 394 may also comprise the language model acquisition unit 3946, which trains an N-gram language model based on the large-scale user search behavior data to obtain the above language model.
The sorting device 380 sorts, by means of a ranking model, the results retrieved using the search term and its corresponding related terms according to the corresponding confidence information. The ranking model may be a quicksort model that sorts according to an existing quicksort algorithm.
Further, the sorting device 380 may also use the ranking model to preliminarily rank the retrieval resources according to the retrieval statement and the page information of the retrieval resources. This preliminary ranking is an ordinary search process; it may also be constrained by a set retrieval threshold, so that only retrieval results reaching a predetermined score enter the re-ranking. When there are many preliminary results, the workload of re-ranking can be reduced. This two-stage ranking may also be used when the user requires that only high-accuracy results be displayed.
Searching by related terms not only covers high-frequency synonyms but also gives more attention to medium- and low-frequency related terms; especially when retrieval resources are scarce, searching with related terms retrieves information to the greatest extent. In this way, with this search system, the related terms corresponding to a search term can be found and used together with it for retrieval, expanding the scope of the search and the search results, and preventing the situation where a good result cannot be recalled because its wording differs from the search term even though it is semantically very close to it.
The method of mining related terms, the searching method, and the search system according to the present invention have been described above in detail with reference to the drawings.
In addition, the method according to the present invention may also be implemented as a computer program product comprising a computer-readable medium on which is stored a computer program for performing the above-defined functions of the method of the present invention. Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show possible architectures, functions, and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by combinations of dedicated hardware and computer instructions.
The embodiments of the present invention have been described above; the description is exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application or improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (29)

1. A method of mining related terms, comprising:
acquiring, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
performing word segmentation on each of said parallel sentence pairs;
performing word alignment on the segmented parallel sentence pairs to obtain first alignment word pairs;
calculating the co-occurrence frequency of said first alignment word pairs;
determining the first alignment word pairs whose co-occurrence frequency exceeds a predetermined threshold as related terms.
2. The method according to claim 1, wherein the step of acquiring the parallel sentence pairs comprises:
filtering out parallel sentence pairs with different meanings according to the literal similarity of the two sentences.
3. The method according to claim 1, further comprising:
recording the context words of said related terms.
4. The method according to claim 1, wherein:
said word alignment processing comprises rule-based word alignment processing and/or statistical word alignment processing;
said rule-based word alignment processing comprises at least one of literally identical word alignment processing, literally partially identical word alignment processing, and synonym word alignment processing;
said statistical word alignment processing uses the GIZA++ tool to perform statistical word alignment.
5. The method according to claim 1, further comprising:
filtering said large-scale user search behavior data using a linear model to obtain second alignment word pairs;
acquiring statistical features that can reflect the degree of correlation between said related terms;
with said first alignment word pairs as positive samples and said second alignment word pairs as negative samples, training said positive samples and said negative samples based on said statistical features using a gradient boosting decision tree (GBDT) algorithm, to obtain a related term confidence calculation model.
6. The method according to claim 5, wherein said related term confidence calculation model is a GBDT nonlinear regression model.
7. A searching method, comprising the steps of:
obtaining related terms of a search term based on a related term dictionary;
calculating the confidence between said search term and each said related term based on a confidence calculation model;
sorting the results retrieved using said search term and said related terms according to the corresponding confidences.
8. The method according to claim 7, wherein said related term dictionary is established by the method according to any one of claims 1 to 6.
9. The method according to claim 7, further comprising:
performing word segmentation on a retrieval statement to obtain said search terms.
10. The method according to claim 9, wherein the step of calculating the confidence between said search term and each said related term based on the confidence calculation model comprises:
obtaining feature values between each said search term and each corresponding said related term;
using said feature values as input to said confidence calculation model, and calculating said confidence based on said confidence calculation model.
11. The method according to claim 10, wherein said feature values comprise:
degree-of-correlation information, for measuring the degree of correlation between each said search term and each corresponding related term; and/or
replaceability information, for measuring, within the context of said related term, the replaceability between said search term and said related term; and/or
co-occurrence relationship information, for measuring the co-occurrence relationship between said search terms; and/or
language model score information, for representing the language model score of the retrieval statement after said related term replaces said search term; and/or
weight value information, for representing the weight of said related term.
12. The method according to claim 11, wherein said degree-of-correlation information comprises a first translation probability P1 and/or a second translation probability P2:

P1(A, A′) = count1(A, A′) / count1(A, ·), P2(A, A′) = count1(A, A′) / count1(·, A′);

count1(A, ·) = Σj count1(A, wj), count1(·, A′) = Σi count1(wi, A′);

wherein the search term A and the related term A′ form a first word pair (A, A′); count1(A, A′) is the number of times the first word pair (A, A′) is aligned in the parallel sentence pairs; count1(A, ·) is the total number of times the search term A is aligned in the parallel sentence pairs; count1(·, A′) is the total number of times the related term A′ is aligned in the parallel sentence pairs; wj is the j-th of all words aligned with the search term A in the parallel sentence pairs; wi is the i-th of all words aligned with the related term A′ in the parallel sentence pairs; count1(A, wj) is the number of times A and wj are aligned; count1(wi, A′) is the number of times wi and A′ are aligned; and i and j are natural numbers.
13. The method according to claim 11, wherein the substitutability information comprises a first substitutability score score(D, Q) and/or a second substitutability score score(D, Q'):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)};$$

$$\mathrm{score}(D, Q') = \sum_{j=1}^{m} \mathrm{IDF}(q'_j) \cdot \frac{f(q'_j, D) \cdot (k_1 + 1)}{f(q'_j, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)};$$

wherein the search term A and the related word A' form a first word pair (A, A'); all context words of A and A' serve as the document D, and |D| is the length of D; Q is the query, q_i is the i-th search term of Q, and n is the total number of search terms in Q; Q' is the combination of the m search terms near A, with m < n, and q'_j is the j-th search term of Q'; avgdl is the average length of the documents formed by the contexts of all related words of A; k_1 is a first constant and b is a second constant; f(q_i, D) is the frequency of occurrence of q_i in D; and f(q'_j, D) is the frequency of occurrence of q'_j in D.
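The scoring function above is the BM25 formula. A minimal Python sketch follows; since the claim leaves IDF undefined, the common Robertson form is assumed here, and the calling convention (`df` as a document-frequency map) is hypothetical:

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """score(D, Q): BM25 over the context-word document D.

    query_terms: the terms q_i of Q (or of the nearby combination Q')
    doc_terms:   the context words treated as document D
    df:          term -> document frequency, for the (assumed) IDF
    n_docs:      number of context documents
    k1, b:       the first and second constants of the claim
    """
    score = 0.0
    dl = len(doc_terms)
    for q in query_terms:
        f = doc_terms.count(q)  # f(q_i, D): occurrences of q_i in D
        idf = math.log((n_docs - df.get(q, 0) + 0.5) / (df.get(q, 0) + 0.5) + 1)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

A higher score over the related word's context document suggests the query (or its local window Q') fits that context well, i.e. the substitution is plausible.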
14. The method according to claim 11, wherein the co-occurrence relation information comprises first co-occurrence relation information and/or second co-occurrence relation information obtained based on a co-occurrence index PMI, wherein

$$\mathrm{pmi}(A, B) = \log\frac{\mathrm{count}_2(A, \cdot) \times \mathrm{count}_2(\cdot, B)}{\mathrm{count}_2(A, B) \times \mathrm{count}_2(\cdot, \cdot)} \bigg/ \log\frac{\mathrm{count}_2(A, B)}{\mathrm{count}_2(\cdot, \cdot)};$$

$$\mathrm{count}_2(A, \cdot) = \sum_j \mathrm{count}_2(A, w_j); \qquad \mathrm{count}_2(\cdot, B) = \sum_i \mathrm{count}_2(w_i, B); \qquad \mathrm{count}_2(\cdot, \cdot) = \sum_{i,j} \mathrm{count}_2(w_i, w_j);$$

count_2(A, ·) is the total number of times the search term A co-occurs with other search terms in the search resources; count_2(·, B) is the total number of times the search term B co-occurs with other search terms in the search resources; count_2(A, B) is the number of times the two search terms A and B co-occur in the search resources; w_j is the j-th word among all words co-occurring with the search term A in the search resources; w_i is the i-th word among all words co-occurring with the related word B in the search resources; count_2(A, w_j) is the number of times A and w_j co-occur in the search resources; count_2(w_i, B) is the number of times w_i and B co-occur in the search resources; count_2(w_i, w_j) is the number of times w_i and w_j co-occur in the search resources; and i and j are natural numbers;
the first co-occurrence relation information is the mean of the co-occurrence index PMI between the search term and the other words in the query; and
the second co-occurrence relation information is the mean of the co-occurrence index PMI between the related word and the other words in the query.
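The PMI index as written is the normalized PMI, i.e. PMI divided by $-\log p(A,B)$, which is bounded in $[-1, 1]$. A direct transcription in Python, with the counts supplied as plain integers (an assumed calling convention):

```python
import math

def npmi(count_ab, count_a, count_b, total):
    """Normalized PMI as in claim 14:
    log((count(A,.) * count(.,B)) / (count(A,B) * count(.,.)))
      / log(count(A,B) / count(.,.)).
    Equals 1 when A and B always co-occur, 0 when independent."""
    num = math.log((count_a * count_b) / (count_ab * total))
    den = math.log(count_ab / total)
    return num / den
```

Note that both logarithms are negative for co-occurring pairs, so their ratio is positive for positively associated word pairs.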
15. The method according to claim 11, further comprising training an N-gram language model on the large-scale user search behavior data to obtain the language model.
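Purely for illustration, a toy add-one-smoothed bigram model over segmented queries might look like the following; the class name, training corpus, and smoothing choice are all assumptions, and a production system would train a full N-gram model on large-scale search logs:

```python
from collections import defaultdict
import math

class BigramLM:
    """Tiny add-one-smoothed bigram model standing in for the claimed
    N-gram language model; scores a query before/after substitution."""
    def __init__(self, queries):
        self.uni = defaultdict(int)   # count(w1)
        self.bi = defaultdict(int)    # count(w1, w2)
        self.vocab = set()
        for q in queries:
            toks = ["<s>"] + q.split() + ["</s>"]
            self.vocab.update(toks)
            for w1, w2 in zip(toks, toks[1:]):
                self.uni[w1] += 1
                self.bi[(w1, w2)] += 1

    def logprob(self, query):
        toks = ["<s>"] + query.split() + ["</s>"]
        v = len(self.vocab)
        return sum(math.log((self.bi[(w1, w2)] + 1) / (self.uni[w1] + v))
                   for w1, w2 in zip(toks, toks[1:]))
```

If substituting a related word into the query raises the model's log-probability relative to an implausible rewrite, that is evidence the substitution preserves fluency.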
16. The method according to claim 7 or 9, wherein the step of ranking, according to the corresponding confidence, the results retrieved using the search term and the related word comprises ranking, by a ranking model, the results retrieved using the search term and the related word according to the corresponding confidence.
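A minimal sketch of this ranking step, assuming each retrieved result remembers which (search term, related word) pair produced it; the `source_pair` field is hypothetical, and the claimed ranking model would combine confidence with further signals:

```python
def rank_results(results, confidence):
    """Order retrieved results by the confidence of the (search term,
    related word) pair that produced each result. A simplification of
    the claimed ranking model, for illustration only."""
    return sorted(results,
                  key=lambda r: confidence.get(r["source_pair"], 0.0),
                  reverse=True)
```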
17. The method according to claim 16, further comprising the step of the ranking model performing a preliminary ranking of the search resources according to the query and search-resource page information.
18. The method according to claim 17, wherein the search resources are web page resources and/or document resources.
19. A search system, comprising:
a related word dictionary storage device;
a related word acquisition device, for obtaining related words of a search term based on the related word dictionary stored in the related word dictionary storage device;
a confidence calculation device, for calculating the confidence between the search term and each related word based on a related word confidence calculation model; and
a ranking device, for ranking, according to the corresponding confidence, the results retrieved using the search term and the related words.
20. The search system according to claim 19, further comprising a related word dictionary establishing device for establishing the related word dictionary, comprising:
a parallel sentence acquisition module, for obtaining, based on large-scale user search behavior data, parallel sentence pairs that express the same meaning in different forms;
a word segmenter, for performing word segmentation on each parallel sentence pair;
a word alignment module, for performing word alignment on the segmented parallel sentence pairs to obtain first aligned word pairs;
a co-occurrence frequency acquisition module, for calculating the co-occurrence frequency of the first aligned word pairs; and
a related word determination module, for determining, as related words, the first aligned word pairs whose co-occurrence frequency is higher than a predetermined threshold.
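The final thresholding step of claim 20, which turns first aligned word pairs into dictionary entries, can be sketched as follows; the input is assumed to already be segmented and word-aligned pairs, which is a simplification of the full pipeline:

```python
from collections import Counter

def build_related_word_dict(aligned_pairs, threshold):
    """Count how often each first aligned word pair co-occurs and keep,
    as related words, the pairs above the predetermined threshold."""
    freq = Counter(aligned_pairs)  # co-occurrence frequency per word pair
    related = {}
    for (term, rel), c in freq.items():
        if c > threshold and term != rel:
            related.setdefault(term, []).append(rel)
    return related
```

The threshold trades recall for precision: a higher value keeps only word pairs that recur across many parallel sentence pairs.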
21. The search system according to claim 20, wherein the related word dictionary establishing device further comprises:
a context acquisition module, for obtaining the context words of the related words.
22. The search system according to claim 20, further comprising a related word confidence calculation model establishing device for establishing the related word confidence calculation model, comprising:
a linear model filtering module, for filtering the large-scale user search behavior data using a linear model to obtain second aligned word pairs; and
a training module, for training, based on a GBDT algorithm, with the first aligned word pairs as positive samples and the second aligned word pairs as negative samples, to obtain the related word confidence calculation model.
23. The search system according to claim 22, wherein the related word confidence calculation model is a GBDT nonlinear regression model.
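The patent names GBDT but does not fix an implementation; in practice one would use a library such as XGBoost or scikit-learn. Purely as an illustration of the technique, a toy gradient-boosting loop over regression stumps with squared loss, trained with positive-sample label 1 and negative-sample label 0:

```python
class Stump:
    """Depth-1 regression tree: one feature, one threshold, two leaf values."""
    def __init__(self, feat, thresh, left, right):
        self.feat, self.thresh, self.left, self.right = feat, thresh, left, right

    def predict(self, x):
        return self.left if x[self.feat] <= self.thresh else self.right


def fit_stump(X, residuals):
    """Least-squares best stump on the current residuals."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            l = [r for x, r in zip(X, residuals) if x[f] <= t]
            r = [r for x, r in zip(X, residuals) if x[f] > t]
            lm = sum(l) / len(l) if l else 0.0
            rm = sum(r) / len(r) if r else 0.0
            err = sum((v - lm) ** 2 for v in l) + sum((v - rm) ** 2 for v in r)
            if err < best_err:
                best_err, best = err, Stump(f, t, lm, rm)
    return best


class ToyGBDT:
    """Gradient boosting with squared loss: each round fits a stump to the
    residuals and adds it with a learning rate. Feature rows would hold the
    claim-11 feature values (translation probabilities, BM25 scores, PMI,
    language model score, weight)."""
    def __init__(self, n_rounds=20, lr=0.5):
        self.n_rounds, self.lr, self.trees, self.base = n_rounds, lr, [], 0.0

    def fit(self, X, y):
        self.base = sum(y) / len(y)
        pred = [self.base] * len(y)
        for _ in range(self.n_rounds):
            resid = [yi - pi for yi, pi in zip(y, pred)]
            tree = fit_stump(X, resid)
            self.trees.append(tree)
            pred = [p + self.lr * tree.predict(x) for p, x in zip(pred, X)]
        return self

    def predict(self, x):
        return self.base + self.lr * sum(t.predict(x) for t in self.trees)
```

The regressor's output on a candidate word pair's feature vector plays the role of the confidence score.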
24. The search system according to claim 20, wherein the word segmenter is further configured to perform word segmentation on a query to obtain search terms.
25. The search system according to claim 24, wherein the confidence calculation device comprises:
a feature extraction module, for extracting the feature values between each search term and each corresponding related word; and
a confidence calculation module, for using the feature values as the input of the related word confidence calculation model and calculating the confidence based on the related word confidence calculation model.
26. The search system according to claim 25, wherein the feature extraction module comprises:
a correlation degree information acquisition unit, for obtaining correlation degree information, which measures the degree of correlation between each search term and each corresponding related word; and/or
a substitutability information acquisition unit, for obtaining substitutability information, which measures the degree to which a related word can replace the search term in the context of that related word; and/or
a co-occurrence relation information acquisition unit, for obtaining co-occurrence relation information, which measures the co-occurrence relations among the search terms; and/or
a language model score information acquisition unit, for obtaining language model score information, which indicates the language model scores of the query before and after the related word replaces the search term; and/or
a weight value information acquisition unit, for obtaining weight value information, which represents the weight of the related word.
27. The search system according to claim 26, wherein the feature extraction module further comprises:
a language model acquisition unit, for training an N-gram language model on the large-scale user search behavior data to obtain the language model.
28. The search system according to claim 19, wherein the ranking device ranks, by a ranking model, the results retrieved using the search term and the related words according to the corresponding confidence.
29. The search system according to claim 28, wherein the ranking device is further configured to perform, by the ranking model, a preliminary ranking of the search resources according to the query and search-resource page information.
CN201510657691.7A 2015-10-12 2015-10-12 Related word mining method, search method, and search system Active CN105279252B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510657691.7A CN105279252B (en) 2015-10-12 2015-10-12 Related word mining method, search method, and search system
PCT/CN2016/101700 WO2017063538A1 (en) 2015-10-12 2016-10-10 Method for mining related words, search method, search system

Publications (2)

Publication Number Publication Date
CN105279252A true CN105279252A (en) 2016-01-27
CN105279252B CN105279252B (en) 2017-12-26

Family

ID=55148266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510657691.7A Active CN105279252B (en) 2015-10-12 2015-10-12 Excavate method, searching method, the search system of related term

Country Status (2)

Country Link
CN (1) CN105279252B (en)
WO (1) WO2017063538A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909088B (en) * 2017-09-27 2022-06-28 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for obtaining training samples
CN110795613B (en) * 2018-07-17 2023-04-28 阿里巴巴集团控股有限公司 Commodity searching method, device and system and electronic equipment
CN111400577B (en) * 2018-12-14 2023-06-30 阿里巴巴集团控股有限公司 Search recall method and device
CN110851584B (en) * 2019-11-13 2023-12-15 成都华律网络服务有限公司 Legal provision accurate recommendation system and method
CN111241319B (en) * 2020-01-22 2023-10-03 北京搜狐新媒体信息技术有限公司 Image-text conversion method and system
CN113496411A (en) * 2020-03-18 2021-10-12 北京沃东天骏信息技术有限公司 Page pushing method, device and system, storage medium and electronic equipment
CN112835923A (en) * 2021-02-02 2021-05-25 中国工商银行股份有限公司 Correlation retrieval method, device and equipment
CN114969310B (en) * 2022-06-07 2024-04-05 南京云问网络技术有限公司 Multi-dimensional data-oriented sectional search ordering system design method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633868B1 (en) * 2000-07-28 2003-10-14 Shermann Loyall Min System and method for context-based document retrieval
CN102033955A (en) * 2010-12-24 2011-04-27 常华 Method for expanding user search results and server
CN103514150A (en) * 2012-06-21 2014-01-15 富士通株式会社 Method and device for recognizing ambiguous words with combinatorial ambiguities
CN104063454A (en) * 2014-06-24 2014-09-24 北京奇虎科技有限公司 Search push method and device for mining user demands

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819578B (en) * 2010-01-25 2012-05-23 青岛普加智能信息有限公司 Retrieval method, method and device for establishing index and retrieval system
CN102591862A (en) * 2011-01-05 2012-07-18 华东师范大学 Control method and device of Chinese entity relationship extraction based on word co-occurrence
CN104239286A (en) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 Method and device for mining synonymous phrases and method and device for searching related contents
CN103942339B (en) * 2014-05-08 2017-06-09 深圳市宜搜科技发展有限公司 Synonym method for digging and device
CN105279252B (en) * 2015-10-12 2017-12-26 广州神马移动信息科技有限公司 Related word mining method, search method, and search system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017063538A1 (en) * 2015-10-12 2017-04-20 广州神马移动信息科技有限公司 Method for mining related words, search method, search system
CN105868847A (en) * 2016-03-24 2016-08-17 车智互联(北京)科技有限公司 Shopping behavior prediction method and device
CN105955993A (en) * 2016-04-19 2016-09-21 北京百度网讯科技有限公司 Method and device for sequencing search results
CN108205757A (en) * 2016-12-19 2018-06-26 阿里巴巴集团控股有限公司 The method of calibration and device of e-payment rightness of business
CN108205757B (en) * 2016-12-19 2022-05-27 创新先进技术有限公司 Method and device for verifying legality of electronic payment service
CN107168958A (en) * 2017-05-15 2017-09-15 北京搜狗科技发展有限公司 A kind of interpretation method and device
CN108171570A (en) * 2017-12-15 2018-06-15 北京小度信息科技有限公司 A kind of data screening method, apparatus and terminal
CN108733766A (en) * 2018-04-17 2018-11-02 腾讯科技(深圳)有限公司 A kind of data query method, apparatus and readable medium
WO2019214365A1 (en) * 2018-05-10 2019-11-14 腾讯科技(深圳)有限公司 Translation model training method, sentence translation method and apparatus, and storage medium
US11900069B2 (en) 2018-05-10 2024-02-13 Tencent Technology (Shenzhen) Company Limited Translation model training method, sentence translation method, device, and storage medium
CN109241356A (en) * 2018-06-22 2019-01-18 腾讯科技(深圳)有限公司 A kind of data processing method, device and storage medium
CN109298796B (en) * 2018-07-24 2022-05-24 北京捷通华声科技股份有限公司 Word association method and device
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN109151599A (en) * 2018-08-30 2019-01-04 百度在线网络技术(北京)有限公司 Method for processing video frequency and device
CN109885696A (en) * 2019-02-01 2019-06-14 杭州晶一智能科技有限公司 A kind of foreign language word library construction method based on self study
CN109918661B (en) * 2019-03-04 2023-05-30 腾讯科技(深圳)有限公司 Synonym acquisition method and device
CN109918661A (en) * 2019-03-04 2019-06-21 腾讯科技(深圳)有限公司 Synonym acquisition methods and device
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
WO2021174923A1 (en) * 2020-09-30 2021-09-10 平安科技(深圳)有限公司 Concept word sequence generation method, apparatus, computer device, and storage medium
CN112541076A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Method and device for generating extended corpus of target field and electronic equipment
CN112541076B (en) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 Method and device for generating expanded corpus in target field and electronic equipment
CN112307198A (en) * 2020-11-24 2021-02-02 腾讯科技(深圳)有限公司 Method for determining abstract of single text and related device
CN112307198B (en) * 2020-11-24 2024-03-12 腾讯科技(深圳)有限公司 Method and related device for determining abstract of single text
CN113609843A (en) * 2021-10-12 2021-11-05 京华信息科技股份有限公司 Sentence and word probability calculation method and system based on gradient lifting decision tree

Also Published As

Publication number Publication date
CN105279252B (en) 2017-12-26
WO2017063538A1 (en) 2017-04-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: Room 508, 5th Floor, Building 4, No. 699 Wangshang Road, Changhe Street, Binjiang District, Hangzhou, Zhejiang 310052

Patentee after: Alibaba (China) Co.,Ltd.

Address before: Unit 01 (self-numbered), 12th Floor, Tower B, Yunping Square, No. 163 Pingyun Road, Huangpu Avenue West, Tianhe District, Guangzhou, Guangdong 510627

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.