CN102622405B

CN102622405B - Method for computing text distance between short texts based on language content unit number evaluation

Info

Publication number: CN102622405B
Application number: CN 201210012475
Authority: CN
Inventors: 杨震; 王来涛; 赖英旭; 高凯明; 张龙伯; 段立娟; 范科峰
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2012-01-16
Filing date: 2012-01-16
Publication date: 2013-08-21
Anticipated expiration: 2032-01-16
Also published as: CN102622405A

Abstract

A method for computing the text distance between short texts based on estimating the number of linguistic content units. The method belongs to the field of Chinese short-text information processing and is used for clustering short online comments. It comprises: removing web-page markup; normalizing the short texts; performing word segmentation to convert each text into a word string; computing the edit distance between two sentences on the word strings, with the word as the unit; defining the number of words that carry content meaning in a sentence as its number of content units; estimating the number of content units in each sentence by means of Heaps' law; selecting the larger of the two sentences' content-unit counts; and applying a length penalty to the edit-distance-based text distance using that larger count, which yields the text distance penalized by the number of content units. The method avoids the errors that conventional methods incur by penalizing with the raw sentence length.

Description

Method for computing the text distance between short texts based on estimation of the number of linguistic content units

Technical field

The present invention relates to a method and system for computing the text distance between short texts based on estimation of linguistic content units, and belongs to the field of text information processing.

Background art

In recent years, with the spread of the Internet and the rapid development of information technology, the Internet has become the main medium through which the public obtains information. Web 2.0 technology has made online information easier to obtain and allows every user to become a source of information, so the volume of online information keeps growing. By analyzing the content of online information, especially user-published content, one can learn about the hot topics in society and people's views and positions on various social phenomena.

Online comments usually arise from a public event or hot topic. Although their content is highly subjective, they reflect the public's attitude toward the event. Their main sources are microblogs, forum comments, and news comments. With the rise of microblogs and forums, online comments have become the main way for the public to express opinions. Online comments spread quickly and have wide influence: they represent the commenter's own view and can also influence the views of other participants, so analyzing them is an important part of online public-opinion analysis. Governments monitor public opinion in order to guide it correctly and maintain social stability; enterprises analyze product reviews to obtain the latest feedback on their products. The analysis of online comments is therefore significant for the state, society, and enterprises, and has attracted great attention from government, academia, and industry.

Texts produced by applications such as SMS, microblogs, forum comments, and news comments are generally within 100 Chinese characters; texts of this length are called short texts (Short Text). To meet growing user demand, many information-filtering systems for short texts have appeared, including public-opinion monitoring systems, personalized information recommendation systems, and product-quality survey systems. Whatever the system, it must solve one basic problem: text clustering. The basic process is to compute the similarity between short texts and gather texts with high similarity (small text distance) under one topic. Text similarity computation is the most critical technical problem in text clustering. Text distance and text similarity express the same concept in text computation and are numerically inverse to each other, so the following discussion does not distinguish them and refers to both as text distance. Traditional methods for computing the distance between short texts mostly measure the difference between sentences from the perspective of syntactic structure, for example methods based on semantic dependency or on edit distance. Because these methods are strongly affected by text length, their results contain large errors for texts of different lengths. To reduce the error caused by length differences between short texts, the present invention introduces a way of penalizing text length, which overcomes the error of traditional methods in computing short-text distance.

Summary of the invention

The object of the invention is to provide a method and system for computing the text distance between short texts on the Internet. On top of traditional text distance computation, the invention introduces an effective way of estimating the number of content units (Distinct Words Length) in a text and uses that number to penalize the text distance. This overcomes the errors that arise in traditional short-text distance methods, which either do not account for sentence length or penalize with the raw sentence length.

A method for computing the text distance between short texts based on estimation of the number of linguistic content units, characterized in that it is carried out in a computer in the following steps:

Step (1) Computer initialization

Input: two classes of short online comments obtained from the Internet, each class consisting of a number of sentences;

the word-segmentation software module of the Chinese lexical analysis system ICTCLAS;

the function-fitting tool Curve Fitting Tool in the Matlab toolbox;

Step (2) Text preprocessing

Step (2.1): remove the HTML markup, such as <html>, <body>, and <div>, contained in the two classes of short online comments;

Step (2.2): process variant short text in the two classes of short online comments from which HTML markup was removed in step (2.1): remove non-standard spellings, normalize traditional Chinese characters, remove non-standard symbols used to represent emoticons, and normalize the use of digits and punctuation marks;

Step (3): compute the text distance between the two classes of short online comments preprocessed in step (2), as follows:

Step (3.1): segment the preprocessed short online comments with the ICTCLAS word-segmentation algorithm, converting each short-text sentence into a word string;

Step (3.2): taking the word as the unit, compute the edit distance between the short online comments of step (3.1) with the edit-distance algorithm and use it as the text distance, that is, the minimum number of word-level edit operations needed to convert one sentence into the other. Compute the edit-distance matrix of the two sentences; the value of the last cell of this matrix is the text distance between the two sentences, dis(S₁, S₂), where "dis" denotes the text distance and S₁, S₂ denote the two sentences;

Step (4): using the number of content units in the short online comments of step (1), penalize the text distance dis(S₁, S₂) between the two sentences obtained in step (3.2) according to the following steps, obtaining the text distance dis'(S₁, S₂) penalized by the number of content units;

Step (4.1): count word frequencies in each of the two classes of short online comments of step (1) and sort the words in descending order of frequency, obtaining a descending word-frequency table for each class;

Step (4.2): import the descending word-frequency table of each class into the Matlab toolbox as a data set, and fit it against the relation between word frequency f and rank r given by Zipf's law:

$$f(r) = f_{\max} \cdot r^{-\alpha} \qquad (1)$$

where f_max is a coefficient greater than 0, and α is the Zipf exponent, greater than 0;

Select a·x^b under the Power option of the Matlab toolbox as the objective function and fit the data to obtain b, with b < 0; the Zipf exponent of each class of short online comments is then α = |b|;

Step (4.3): obtain the number of content units N(t) of every sentence in each of the two classes of short online comments by the following formula:

$$N(t) = \begin{cases} (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}, & \alpha > 1 \\ (1 - \alpha)\, t, & 0 < \alpha \leq 1 \end{cases} \qquad (2)$$

where the number of content units is the number of items in a sentence that carry real content meaning, excluding text that contributes nothing to expressing the content of the text; t is the length of the sentence, with the word as the unit;

Step (4.4): compute the final text distance dis'(S₁, S₂) from the original text distance dis(S₁, S₂) obtained in step (3.2) and the numbers of content units N(t) obtained in step (4.3);

Select the larger of the two sentences' content-unit counts, max(N(t₁), N(t₂)), and use it to apply a length penalty to dis(S₁, S₂), obtaining the final text distance dis'(S₁, S₂):

$$\mathrm{dis}'(S_1, S_2) = \frac{\mathrm{dis}(S_1, S_2)}{\max\bigl(N(t_1), N(t_2)\bigr)} \qquad (3)$$

Compared with traditional methods, the invention has the following advantage:

The text distance is penalized using content units. Through the equivalence relation between Heaps' law and Zipf's law, the invention introduces a method for estimating the number of content units in a text. Penalizing the initial semantic distance with the number of content units overcomes the error that traditional methods incur by using the raw sentence length.

Description of drawings

Fig. 1 is a flowchart of the short-text semantic distance adjustment method and system based on content-unit estimation.

Fig. 2 is the descending word-frequency table.

Fig. 3 shows the relative accuracy curves of the distance computation methods.

Embodiment

A content unit (Distinct Word) is a word in the text that carries real content meaning. Because a text contains some words with no content meaning, such as function words, which contribute nothing to the content the text expresses, the number of content units is usually smaller than the text length; the specific estimate, however, needs a sound basis.

The invention consists of the following parts:

First, text preprocessing, whose purpose is to standardize the format of the text data. Online comments extracted directly from the Internet contain a large amount of web-page markup and much variant short-text content, and this noise strongly affects the distance computation. The invention combines commonly used preprocessing operations into a text preprocessing module: removing web-page markup, processing variant short text, and word segmentation. Removing markup and processing variant short text normalizes the short online comments; the normalized text is then converted into a word string by a word-segmentation algorithm.

Variant short text is text that uses symbols or colloquial vocabulary to express conventional meaning. The phenomenon is especially evident in online comments and instant messaging. Variant short text usually has the following features:

1. Pinyin substitution, e.g. the word for "not have" written in pinyin as "meiyou";

2. Mixed simplified and traditional characters, e.g. "PLA" written partly in traditional characters;

3. Special symbols, e.g. ">_<" used to represent an emoticon;

4. Irregular use of digits and punctuation marks, e.g. "8...8.".

Second, the text distance is computed on the processed text. The invention uses an improved version of the traditional edit-distance algorithm; other classical algorithms, such as methods based on syntactic dependency, can also be used. Edit distance is the minimum number of "insert", "delete", and "replace" operations needed to transform one text into another. In the invention the word is the unit of computation: the minimum number of word-level edit operations needed to transform one text into another is the edit distance of the texts. Analysis of the edit-distance algorithm shows that length differences strongly affect the result.

Finally, to reduce the computational error of traditional text distance methods, the invention penalizes the text distance with the number of content units in the text and introduces a method for estimating that number. The invention counts the word frequencies in the short online comments, fits the Zipf exponent according to Zipf's law, and, based on the equivalence relation between Heaps' law and Zipf's law, gives a method for estimating the number of content units in a text; the resulting number is used to penalize text length. Experimental results show that penalizing the text distance with the number of content units outperforms the traditional method.

The invention introduces a method for estimating the number of content units. The number of content units of a text and the text length satisfy Heaps' law:

$$N(t) \sim t^{\lambda}, \quad \lambda < 1 \qquad (4)$$

where N(t) is the number of content units, t is the length of the text with the word as the unit, and λ is the Heaps exponent.

Because the Heaps exponent λ is difficult to compute directly, the number of content units is estimated through Zipf's law, using the equivalence relation between Heaps' law and Zipf's law.

The word frequencies of all segmented short texts are counted, and all words are sorted in descending order of frequency to form the descending word-frequency table. The relation between word frequency f and rank r in Zipf's law is then used:

$$f(r) = f_{\max} \cdot r^{-\alpha} \qquad (5)$$

where r is the rank of a word, α is the Zipf exponent, f_max is a coefficient, and f(r) is the frequency of the word at rank r.

Using the Zipf exponent α, the number of content units in a text is estimated as:

$$N(t) = \begin{cases} (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}, & \alpha > 1 \\ (1 - \alpha)\, t, & 0 < \alpha \leq 1 \end{cases} \qquad (6)$$

The number of content units N(t) obtained in this way is used to penalize the initial semantic distance dis(S₁, S₂), giving the final text distance.

The invention is further described below with reference to the concrete operating steps.

The experimental data are short online comments obtained from the Internet. One class consists of microblog posts about the NetEase mailbox from Sina Weibo (http://weibo.com); the other consists of comments about World of Warcraft from the World of Warcraft section of the Battle.net game forum (http://www.battlenet.com.cn/wow/zh/forum/). 210 texts are selected from each class, 420 short texts in total.

Step 1: the 420 raw online comments are first preprocessed. The invention uses a text preprocessing module to preprocess the short online comments; the module comprises three processing steps:

Step 1.1: remove web-page markup. The raw online comment data contain a large amount of HTML markup, which must be removed first. HTML tags such as <html>, <body>, and <div> are removed, and the comment content is extracted from the HTML page.

Step 1.2: process variant short text. The short online comments obtained in step 1.1 are cleaned of variant short text, for example by removing content such as "meiyong", "orz", and ">_<".
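
As an illustrative sketch only, steps 1.1 and 1.2 could look as follows in Python. The regular expressions, the small noise-token list, and the function names strip_html, clean_variant_text, and preprocess are assumptions made for this example; the patent does not specify exact cleaning rules.

```python
import html
import re

# Hypothetical cleaning rules, chosen only to illustrate steps 1.1 and 1.2.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")  # tags such as <div class="c"> or </p>
NOISE_RE = re.compile(
    r">_<|(?<![A-Za-z])(?:meiyong|orz)(?![A-Za-z])", re.IGNORECASE
)
REPEAT_PUNCT_RE = re.compile(r"([.!?,。！？，])\1+")  # runs of one punctuation mark


def strip_html(text: str) -> str:
    """Step 1.1: replace HTML tags with spaces, then decode entities like &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def clean_variant_text(text: str) -> str:
    """Step 1.2: drop listed noise tokens and collapse repeated punctuation."""
    text = NOISE_RE.sub(" ", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return " ".join(text.split())  # normalise whitespace


def preprocess(raw: str) -> str:
    return clean_variant_text(strip_html(raw))


print(preprocess("<div><p>mailbox login failed!!! >_< orz</p></div>"))
```

The tag pattern requires a letter after "<" so that an emoticon such as ">_<" is not mistaken for the start of a tag.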

Step 1.3: segment the text processed in step 1.2 with a word-segmentation algorithm, converting it into a word string. Commonly used segmenters include MMSEG4, ICTCLAS, and Pangu; the invention uses the ICTCLAS word-segmentation algorithm. The text is passed to the segmentation module through the ICTCLAS segmentation interface, and the result returned is the word string of the short text. For example, the NetEase mailbox comment "seek-help account problem" and the World of Warcraft comment "comrade-in-arms recruitment reward problem" are converted to the word strings "seek help / account / problem" and "comrade-in-arms / recruit / reward / problem", respectively.

Step 2: after word segmentation, the online comment data are converted into 420 word-string records, and the text distance module computes the edit distance of the texts. The word is the unit of computation: the minimum number of word-level edit operations needed to transform one text into another is the edit distance of the texts. For the NetEase mailbox comment S₁ ("seek help / account / problem") and the World of Warcraft comment S₂ ("comrade-in-arms / recruit / reward / problem"), the edit distance is computed with the three edit operations "insert", "delete", and "replace". The edit-distance matrix of S₁ and S₂ is as follows:

	S₂	comrade-in-arms	recruit	reward	problem
S₁	0	1	2	3	4
seek help	1	1	2	3	4
account	2	2	2	3	4
problem	3	3	3	3	3

The value of the last cell of the matrix gives the edit distance of the two short texts: dis(S₁, S₂) = 3.
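
The word-level edit distance of step 2 can be sketched with the standard dynamic-programming recurrence, as below. This is a minimal illustration, assuming unit cost for each insert, delete, and replace; the function names are hypothetical and the tokens are English glosses of the two example comments.

```python
from typing import List, Sequence


def edit_distance_matrix(s1: Sequence[str], s2: Sequence[str]) -> List[List[int]]:
    """Build the (len(s1)+1) x (len(s2)+1) edit-distance matrix over words."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete i words
    for j in range(cols):
        d[0][j] = j  # insert j words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace, or keep when words match
            )
    return d


def word_edit_distance(s1: Sequence[str], s2: Sequence[str]) -> int:
    """The text distance dis(S1, S2) is the last cell of the matrix."""
    return edit_distance_matrix(s1, s2)[-1][-1]


s1 = ["seek help", "account", "problem"]
s2 = ["comrade-in-arms", "recruit", "reward", "problem"]
for row in edit_distance_matrix(s1, s2):
    print(row)
print(word_edit_distance(s1, s2))
```

Comparing whole words rather than characters means that two different words always cost one operation, however many characters they share.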

Step 3: penalize the text distance with the number of content units in the text, using an algorithm for estimating that number.

Step 3.1: count word frequencies over the 420 word strings and sort the words in descending order of frequency to obtain the descending word-frequency table shown in Fig. 2.
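
A descending word-frequency table of the kind described in step 3.1 can be sketched as follows. The function name frequency_table and the alphabetical tie-breaking rule are assumptions made for this example; the two word strings are the English glosses of the example comments, not the 420-text data set.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def frequency_table(docs: Iterable[Sequence[str]]) -> List[Tuple[int, str, int]]:
    """Return (rank, word, frequency) rows sorted by descending frequency."""
    counts = Counter(word for doc in docs for word in doc)
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


docs = [
    ["seek help", "account", "problem"],
    ["comrade-in-arms", "recruit", "reward", "problem"],
]
for row in frequency_table(docs):
    print(row)
```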

Step 3.2: fit the Zipf exponent α using the relation between word frequency f and rank r in formula (5).

The invention fits the value of the Zipf exponent α from the descending word-frequency table with the Curve Fitting Tool in the Matlab toolbox. The table is first imported into Matlab as a data set; the a*x^b option under the Power option is selected as the objective function, and fitting gives b = -0.697, so by formula (5) the Zipf exponent is α = 0.697.
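
For readers without Matlab, the power-law fit can be approximated as sketched below. Note the swapped-in technique: this example uses ordinary least squares on (log r, log f), not the Curve Fitting Tool's fit of a*x^b, so on real data the fitted exponent may differ somewhat from the Matlab result. The function names and the synthetic frequencies are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(freqs: Sequence[float]) -> Tuple[float, float]:
    """Fit f(r) = a * r**b by least squares on log r versus log f.

    freqs[k] is the frequency of the word at rank k + 1; all values must be > 0.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def zipf_exponent(freqs: Sequence[float]) -> float:
    """Zipf exponent alpha = |b|, following the patent's convention."""
    return abs(fit_power_law(freqs)[1])


# Synthetic frequencies that follow f(r) = 100 * r**-0.7 exactly.
synthetic = [100 * rank ** -0.7 for rank in range(1, 51)]
print(fit_power_law(synthetic), zipf_exponent(synthetic))
```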

Step 3.3: estimate the number of content units in the text. The number of content units is computed from the fitted Zipf exponent α and formula (6).

For the NetEase mailbox comment "seek help / account / problem" and the World of Warcraft comment "comrade-in-arms / recruit / reward / problem", the text lengths are t₁ = 3 and t₂ = 4. With α = 0.67 obtained in step 3.2, formula (6) gives the number of content units of the two texts: N(t₁) = (1 - 0.67) × 3 = 0.99 and N(t₂) = (1 - 0.67) × 4 = 1.32.
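
The piecewise estimate of N(t) can be written as a small function. This is a sketch under the assumption that α > 0, as stated for the Zipf exponent; the function name content_units is hypothetical, and α = 0.67 is the value used in the worked example above.

```python
def content_units(t: float, alpha: float) -> float:
    """Estimate the number of content units N(t) for a text of t words."""
    if alpha <= 0:
        raise ValueError("the Zipf exponent alpha must be greater than 0")
    if alpha > 1:
        # Sublinear branch: N(t) = (alpha - 1)^(1/alpha) * t^(1/alpha)
        return (alpha - 1) ** (1 / alpha) * t ** (1 / alpha)
    # Linear branch for 0 < alpha <= 1: N(t) = (1 - alpha) * t
    return (1 - alpha) * t


alpha = 0.67
print(round(content_units(3, alpha), 2), round(content_units(4, alpha), 2))
```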

Step 3.4: penalize the distance between texts with the number of content units. Step 2 gave the edit distance between the NetEase mailbox comment and the World of Warcraft comment as dis(s₁, s₂) = 3. The larger of the two content-unit counts, max(N(t₁), N(t₂)) = N(t₂), is used to apply the length penalty to dis(s₁, s₂), giving the penalized text distance dis'(s₁, s₂) = dis(s₁, s₂) / N(t₂) = 3 / 1.32 = 2.273, compared with the original edit distance dis(s₁, s₂) = 3.
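
The penalty step itself is a single division, sketched below with the numbers of the worked example. The function name penalized_distance and the guard against a non-positive denominator are assumptions added for this illustration.

```python
def penalized_distance(dis: float, n1: float, n2: float) -> float:
    """Divide the edit distance by the larger of the two content-unit counts."""
    largest = max(n1, n2)
    if largest <= 0:
        raise ValueError("content-unit counts must be positive")
    return dis / largest


# Worked example: dis = 3, N(t1) = 0.99, N(t2) = 1.32
print(round(penalized_distance(3, 0.99, 1.32), 3))
```

Using the larger count means the distance between a long and a short text is scaled by the longer text's content, so long texts are not judged distant merely for containing more words.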

Finally, to check whether the method of the invention improves on the traditional method, the following experiment was carried out.

The experimental data are short online comments obtained from the Internet. One class consists of microblog posts about the NetEase mailbox from Sina Weibo (http://weibo.com); the other consists of comments about World of Warcraft from the World of Warcraft section of the Battle.net game forum (http://www.battlenet.com.cn/wow/zh/forum/). 210 texts are selected from each class, 420 short texts in total, and ten-fold cross-validation is performed.

Step 2 is first used to compute the distance dis(s₁, s₂) between every pair of texts; these distances form the distance matrix DisMatrix₁. Step 3 is then used to penalize dis(s₁, s₂), giving dis'(s₁, s₂); the penalized distances form the distance matrix DisMatrix₂. A traditional clustering algorithm is applied to each of the two matrices. Traditional clustering algorithms include hierarchical clustering, K-means clustering, and Affinity Propagation clustering; the invention uses the Affinity Propagation algorithm to cluster the distance matrices.

Because the experimental data come from two categories, mailbox and game, the target number of clusters for the Affinity Propagation algorithm is set to 2: once the samples have been clustered into two classes, the algorithm terminates automatically and returns the clustering result.

To compare the experimental results, the invention uses the relative accuracy rate to compare clustering performance:

$$\mathrm{Ra} = \frac{n_1 + n_2}{N} \qquad (7)$$

where n₁ is the number of texts of the same topic clustered together in the first class, n₂ is the number of texts of the same topic clustered together in the second class, and N is the total number of short texts in the experiment.

Both algorithms are applied to the 10 data sets to obtain the corresponding distance matrices. The Affinity Propagation algorithm clusters each distance matrix, and the resulting performance figures are used to compare the effectiveness of the two text distance methods.

Table 1 gives the relative accuracy obtained by processing the experimental data with the two text distance methods. Fig. 3 plots the relative accuracy.

Table 1. Relative accuracy (%)

Data set	Edit distance algorithm	Edit distance + the invention
Dataset1	63.49	70.63
Dataset2	61.64	71.16
Dataset3	62.17	72.22
Dataset4	66.67	72.49
Dataset5	61.64	65.54
Dataset6	62.70	75.66
Dataset7	62.43	70.37
Dataset8	63.49	66.40
Dataset9	66.40	72.22
Dataset10	63.23	72.75

The experimental results show that penalizing the text distance with the number of content units improves on the traditional method: relative accuracy is higher on all ten data sets, by between 2.91 and 12.96 percentage points.
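
The figures in Table 1 can be summarized with a few lines of Python. This sketch only averages the values reported in the table; the variable names are assumptions, and no significance test is implied.

```python
from statistics import mean

# Relative accuracy (%) from Table 1, Dataset1 to Dataset10.
baseline = [63.49, 61.64, 62.17, 66.67, 61.64, 62.70, 62.43, 63.49, 66.40, 63.23]
proposed = [70.63, 71.16, 72.22, 72.49, 65.54, 75.66, 70.37, 66.40, 72.22, 72.75]

gains = [p - b for b, p in zip(baseline, proposed)]
print("mean baseline:", round(mean(baseline), 3))
print("mean proposed:", round(mean(proposed), 3))
print("mean gain:", round(mean(gains), 3))
print("improved on every data set:", all(g > 0 for g in gains))
```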

Claims

1. A method for computing the text distance between short texts based on estimation of the number of linguistic content units, characterized in that it is carried out in a computer in the following steps:

Step (1) Computer initialization

the word-segmentation software module of the Chinese lexical analysis system ICTCLAS;

the function-fitting tool Curve Fitting Tool in the Matlab toolbox;

Step (2) Text preprocessing

$$f(r) = f_{\max} \cdot r^{-\alpha}$$

Select the larger of the two sentences' content-unit counts in the two classes of short online comments, max(N(t₁), N(t₂)), and use it to apply a length penalty to dis(S₁, S₂), obtaining the final text distance dis'(S₁, S₂):

$$\mathrm{dis}'(S_1, S_2) = \frac{\mathrm{dis}(S_1, S_2)}{\max\bigl(N(t_1), N(t_2)\bigr)}$$