CN102622405B - Method for computing text distance between short texts based on language content unit number evaluation - Google Patents

Method for computing text distance between short texts based on language content unit number evaluation

Info

Publication number
CN102622405B
Authority
CN
China
Prior art keywords
text
distance
online comment
short texts
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201210012475
Other languages
Chinese (zh)
Other versions
CN102622405A (en)
Inventor
杨震
王来涛
赖英旭
高凯明
张龙伯
段立娟
范科峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN 201210012475 priority Critical patent/CN102622405B/en
Publication of CN102622405A publication Critical patent/CN102622405A/en
Application granted granted Critical
Publication of CN102622405B publication Critical patent/CN102622405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for computing the text distance between short texts based on an estimate of the number of language content units, in the field of Chinese short-text information processing. The method is intended for clustering short online comments and comprises: removing web-page markup; normalizing the short texts; segmenting them into words so that each text becomes a word string; computing the edit distance between two sentences on the word strings, with the word as the unit; defining the number of words that carry substantive meaning in a sentence as its number of content units; estimating the number of content units in each sentence by means of Heaps' law; taking the larger of the two sentences' content unit counts; and applying a length penalty to the edit-distance-based text distance using that larger count, which yields the text distance penalized by the number of content units. The method avoids the errors that conventional methods incur by penalizing with the raw sentence length.

Description

Method for computing the text distance between short texts based on estimation of the number of language content units
Technical field
The present invention relates to a method and system for computing the text distance between short texts based on estimation of language content units, and belongs to the field of text information processing.
Background art
In recent years, with the spread of the Internet and the rapid development of information technology, the Internet has become the public's main medium. Web 2.0 technology has made online information easier to obtain and lets every user become a source of information, so the volume of online information keeps growing. By analyzing the content of online information, especially user-published content, one can learn the current hot topics in society and people's views and positions on various social phenomena.
Online comments usually arise from a public incident or hot topic. Although their content is highly subjective, they reflect the public's attitude toward an event. Their main sources are microblogs, forum comments, and news comments. With the rise of microblogs and forums, online comments have become the public's main way of expressing opinions. They spread quickly and have wide influence: a comment represents not only the commenter's own view but can also influence the views of other participants, so analyzing online comments is an important part of online public opinion analysis. Governments monitor public opinion in order to guide it properly and maintain social stability; enterprises analyze product reviews to obtain the latest feedback on their products. Research on online comments is therefore significant for the state, society, and enterprises, and has attracted great attention from government, academia, and industry.
The texts produced by applications such as SMS messages, microblogs, forum comments, and news comments are generally within 100 Chinese characters; texts of this length are called short texts (Short Text). In response to growing user demand, many information filtering systems for short texts have appeared, including public opinion monitoring systems, personalized information recommendation systems, and product quality survey systems. Whatever the system, it must solve one basic problem: text clustering. The basic process is to compute the similarity between short texts and to group texts with high similarity (small text distance) under one topic. Text similarity computation is the most critical technical problem in text clustering. Because text distance and text similarity express the same concept with numerically opposite values, the remainder of this description does not distinguish them and refers to both as text distance. Traditional short-text distance methods mostly measure the difference between sentences from the perspective of syntactic structure, for example methods based on semantic dependency or on edit distance. Because traditional methods are strongly affected by text length, their results for texts of different lengths contain large errors. To reduce the error that length differences between short texts introduce into the result, the present invention introduces a way of penalizing text length, which overcomes the error of traditional methods in computing short-text distance.
Summary of the invention
The object of the present invention is to provide a method and system for computing the text distance between short texts on the Internet. Building on traditional text distance computation, the invention introduces a way to estimate the number of content units (Distinct Words Length) of a text and uses that number to penalize the text distance. This overcomes the errors that arise in traditional short-text distance methods, which either do not account for sentence length or penalize with the raw sentence length.
A method for computing the text distance between short texts based on estimation of the number of language content units, characterized in that it is carried out in a computer by the following steps in order:
Step (1): computer initialization
Input: two classes of online comment short texts obtained from the Internet, each class consisting of a number of sentences;
the word segmentation software module of the Chinese lexical analysis system ICTCLAS;
the function fitting tool Curve Fitting Tool in the Matlab toolbox;
Step (2): text preprocessing
Step (2.1): remove the HTML web-page markup, including <html>, <body>, and <div>, contained in the two classes of online comment short texts;
Step (2.2): apply variant short-text processing to the two classes of online comment short texts from which the HTML markup was removed in step (2.1): remove nonstandard alphabetic writing, normalize traditional Chinese characters, remove nonstandard symbols used to represent facial expressions, and normalize the use of digits and punctuation marks;
Step (3): compute the text distance between the two classes of online comment short texts preprocessed in step (2), as follows:
Step (3.1): segment the preprocessed online comment short texts with the ICTCLAS word segmentation algorithm, converting each short-text sentence into a word string;
Step (3.2): taking the word as the unit, compute the edit distance between the online comment short texts of step (3.1) with the edit distance algorithm and use it as the text distance, that is, the minimum number of word-level edit operations needed to convert one sentence into the other; compute the edit distance matrix of the two sentences; the value of the last cell of this matrix is the text distance between the two sentences, dis(S1, S2), where "dis" denotes the text distance and S1, S2 denote the two sentences;
Step (4): using the number of content units in the online comment short texts of step (1), penalize the text distance dis(S1, S2) obtained in step (3.2) by the following steps, obtaining the text distance penalized by the number of content units, dis'(S1, S2);
Step (4.1): count the word frequencies in each of the two classes of online comment short texts of step (1) and sort the words in descending order of frequency, obtaining a descending word-frequency table for each class;
Step (4.2): import each descending word-frequency table into the Matlab toolbox as a data set, with reference to the relation between word frequency f and rank r in Zipf's law:
$$f(r) = f_{\max} \cdot r^{-\alpha} \tag{1}$$
where f_max is a coefficient greater than 0 and α is the Zipf exponent, greater than 0;
select a·x^b under the Power option in the Matlab toolbox as the objective function and fit the data to obtain b, with b < 0, which gives the Zipf exponent α = |b| for each of the two classes of online comment short texts;
Step (4.3): obtain the number of content units N(t) of each sentence in the two classes of online comment short texts by the following formula:
$$N(t) = \begin{cases} (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}, & \alpha > 1 \\ (1 - \alpha)\, t, & 0 < \alpha \le 1 \end{cases} \tag{2}$$
where the number of content units is the number of words with substantive content in each sentence of the two classes of online comment short texts, excluding text that contributes nothing to expressing the content of the text; t is the length, in words, of each sentence;
Step (4.4): compute the final text distance dis'(S1, S2) from the original text distance dis(S1, S2) obtained in step (3.2) and the number of content units N(t) obtained in step (4.3);
take the larger of the two sentences' content unit counts, max(N(t1), N(t2)), and use it to apply a length penalty to dis(S1, S2), obtaining the final text distance dis'(S1, S2):
$$dis'(S_1, S_2) = \frac{dis(S_1, S_2)}{\max(N(t_1), N(t_2))} \tag{3}$$
Compared with traditional methods, the invention has the following advantage:
The text distance is penalized by content units. Using the relationship between Heaps' law and Zipf's law, the invention introduces a method for estimating the number of content units in a text. The initial semantic distance is penalized by the number of content units in the text, which overcomes the error that traditional methods incur by working with the raw sentence length.
Description of drawings
Fig. 1 is a flow chart of the method and system for adjusting short-text semantic distance based on estimation of language content units.
Fig. 2 is the descending word-frequency table.
Fig. 3 shows the relative accuracy curves of the distance computation methods.
Embodiment
A content unit (Distinct Word) is a word in the text that carries substantive content. Because a text also contains some words without substantive content, which contribute nothing to the content the text expresses, the number of content units of a text is usually smaller than the text length; the concrete estimate, however, needs a sound basis.
The invention comprises the following parts:
First, text preprocessing is carried out; its purpose is to standardize the format of the text data. Online comments extracted directly from the Internet contain a large amount of web-page markup and many variant short-text elements, and this noise strongly affects the result of the text distance computation. The invention combines commonly used data preprocessing operations into a text preprocessing module. These operations comprise removing web-page markup, processing variant short text, and word segmentation. Markup removal and variant short-text processing normalize the online comment short texts; the normalized texts are then converted into word strings by the word segmentation algorithm.
Variant short text is text that conveys ordinary meaning through symbols or colloquial vocabulary. The phenomenon is especially evident in online comments and instant chat. Variant short text usually has the following characteristics:
1. Pinyin substitution, e.g. the word for "not have" written as its pinyin "meiyou";
2. Mixed simplified and traditional characters, e.g. the word for "People's Liberation Army" written with both simplified and traditional characters;
3. Special symbols, e.g. ">_<" used to represent a facial expression;
4. Disordered use of digits and punctuation marks, e.g. "8...8.";
Second, the text distance is computed on the preprocessed texts. The invention uses an improved version of the traditional edit distance algorithm to compute the distance between texts; other classical algorithms, such as methods based on syntactic dependency, could also be used. Edit distance is the minimum number of edit operations needed to transform one text into another using the three operations "insert", "delete", and "replace". In the invention the word is the unit of computation, so the edit distance between two texts is the minimum number of word-level edit operations needed to transform one into the other. Analysis of the edit distance algorithm shows that differences in text length strongly affect the result.
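The word-level edit distance described above is commonly computed with the standard dynamic-programming recurrence. The following is a sketch, assuming unit cost for each of the three operations (the text does not state the costs explicitly). Here $a_i$ and $b_j$ denote the $i$-th word of $S_1$ and the $j$-th word of $S_2$, and $m$, $n$ are the lengths of the two texts in words; these symbols are introduced for illustration only.

```latex
\begin{aligned}
D(i, 0) &= i, \qquad D(0, j) = j, \\
D(i, j) &= \min
  \begin{cases}
    D(i-1, j) + 1 & \text{(delete } a_i\text{)} \\
    D(i, j-1) + 1 & \text{(insert } b_j\text{)} \\
    D(i-1, j-1) + [a_i \neq b_j] & \text{(replace, or keep if equal)}
  \end{cases} \\
dis(S_1, S_2) &= D(m, n)
\end{aligned}
```

Under this recurrence the distance is at least $|m - n|$, which is one way to see why length differences dominate the raw value for short texts.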
Finally, to reduce the computational error of traditional text distance methods, the invention penalizes the text distance with the number of content units in the text and introduces a method for estimating that number. The invention counts the word frequencies in the online comment short texts, fits the Zipf exponent according to Zipf's law, and, based on the relationship between Heaps' law and Zipf's law, gives a method for estimating the number of content units in a text; the resulting number is used to penalize text length. Experimental results show that penalizing the text distance with the number of content units works better than the traditional method.
The invention introduces a method for estimating the number of content units. The number of content units of a text and the text length are found to satisfy Heaps' law:
$$N(t) \sim t^{\lambda}, \quad \lambda < 1 \tag{4}$$
where N(t) is the number of content units, t is the length of the text in words, and λ is the Heaps exponent.
Because the Heaps exponent λ is difficult to compute directly, the content units are instead estimated through Zipf's law, using the relationship between Heaps' law and Zipf's law.
The word frequencies are counted over all segmented short texts, and all words are sorted in descending order of frequency to form the descending word-frequency table. The relation between word frequency f and rank r in Zipf's law is then used:
$$f(r) = f_{\max} \cdot r^{-\alpha} \tag{5}$$
where r is the rank of a word, α is the Zipf exponent, f_max is a coefficient, and f(r) is the frequency of the word at rank r.
Using the Zipf exponent α, the number of content units in a text is estimated as:
$$N(t) = \begin{cases} (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}, & \alpha > 1 \\ (1 - \alpha)\, t, & \alpha \le 1 \end{cases} \tag{6}$$
The resulting number of content units N(t) is used to penalize the initial semantic distance dis(S1, S2), which yields the final text distance.
The invention is further described below with reference to the concrete operating steps:
The experimental data are online comment short texts obtained from the Internet. One class consists of microblog posts about NetEase Mailbox from Sina Weibo (http://weibo.com); the other consists of comments about World of Warcraft from the World of Warcraft section of the Battle.net forum (http://www.battlenet.com.cn/wow/zh/forum/). 210 texts are selected from each class, 420 short texts in total.
Step 1, preprocessing, is applied first to these 420 raw online comments. The invention preprocesses the online comment short texts with the text preprocessing module, which comprises 3 processing steps:
Step 1.1: remove web-page markup. Because the raw online comment data contain a large amount of HTML markup, the markup must be removed first. HTML tags in the text such as <html>, <body>, and <div> are removed, and the comment content is extracted from the HTML page.
Step 1.2: variant short-text processing. The online comment short texts obtained in step 1.1 are processed for variant short text. Variant short text in the comments is cleaned, for example by removing text such as "meiyong", "orz", and ">_<".
Step 1.3: segment the texts that have undergone variant short-text processing with a word segmentation algorithm, converting each text into a word string. Commonly used word segmentation tools include MMSEG4, ICTCLAS, and Pangu Segment; the invention uses the ICTCLAS word segmentation algorithm. Through the segmentation interface of ICTCLAS, a text is passed to the segmentation module and the returned result is the word string of the short text. For example, the NetEase Mailbox comment "seek-help account problem" and the World of Warcraft comment "comrade recruitment reward problem" are converted into the word strings "seek help / account / problem" and "comrade / recruit / reward / problem" respectively.
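Steps 1.1 and 1.2 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the regular expression, and the three-entry `VARIANT_TOKENS` list are assumptions made for the example (the list only repeats the tokens named in step 1.2), and word segmentation with ICTCLAS (step 1.3) is not shown because it relies on an external tool.

```python
import html
import re

# Hypothetical token list: only the three examples named in step 1.2.
# A real system would need a much larger, curated list.
VARIANT_TOKENS = ["meiyong", "orz", ">_<"]

# Matches opening, closing and self-closing tags such as <div>, </div>, <br/>.
# Requiring a letter after "<" keeps emoticons like ">_<" from being
# mistaken for markup. Comments and scripts are not handled in this sketch.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_html(raw: str) -> str:
    """Step 1.1: replace HTML tags with spaces, then decode entities."""
    return html.unescape(TAG_RE.sub(" ", raw))


def remove_variants(text: str) -> str:
    """Step 1.2 (simplified): delete listed variant tokens, collapse spaces.

    Plain substring replacement is used, so a token that happens to occur
    inside a longer word would also be removed.
    """
    for token in VARIANT_TOKENS:
        text = text.replace(token, " ")
    return " ".join(text.split())


def preprocess(raw: str) -> str:
    """Run markup removal followed by variant-token removal."""
    return remove_variants(strip_html(raw))


if __name__ == "__main__":
    page = "<html><body><div>account problem &gt;_&lt; orz</div></body></html>"
    print(preprocess(page))
```

Decoding entities after tag removal means an emoticon that arrives as `&gt;_&lt;` is restored to ">_<" before the variant-token pass, so it can still be matched and removed.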
Step 2: after word segmentation, the online comment data have been converted into 420 word-string records, and the text distance module computes the edit distance between texts. In the invention the word is the unit of computation, so the edit distance between two texts is the minimum number of word-level edit operations needed to transform one into the other. For the NetEase Mailbox comment "seek-help account problem" and the World of Warcraft comment "comrade recruitment reward problem", the edit distance is computed using the three edit operations "insert", "delete", and "replace". The edit distance matrix of the short texts S1 and S2 is as follows:
             S2   comrade   recruit   reward   problem
S1            0         1         2        3         4
seek help     1         1         2        3         4
account       2         2         2        3         4
problem       3         3         3        3         3
The value in the last cell of the edit distance matrix gives the edit distance between the two short texts: dis(S1, S2) = 3.
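The matrix above can be reproduced with a few lines of Python. This is a sketch assuming unit costs for insertion, deletion, and replacement; the function name and the English glosses used as tokens are chosen for illustration and stand in for the segmented Chinese words.

```python
def edit_distance_matrix(a, b):
    """Return the (len(a)+1) x (len(b)+1) word-level edit distance matrix.

    a and b are lists of words; cell [i][j] is the edit distance between
    the first i words of a and the first j words of b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i words
    for j in range(cols):
        d[0][j] = j  # insert all j words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement or match
            )
    return d


s1 = ["seek help", "account", "problem"]
s2 = ["comrade", "recruit", "reward", "problem"]
matrix = edit_distance_matrix(s1, s2)
distance = matrix[-1][-1]  # value of the last cell

if __name__ == "__main__":
    for row in matrix:
        print(row)
    print("dis(S1, S2) =", distance)
```

Only the final "problem" token matches, so three edits are needed, which agrees with dis(S1, S2) = 3.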
Step 3: penalize the text distance with the number of content units in the text, using an algorithm for estimating the number of content units.
Step 3.1: first count the word frequencies over the 420 text word strings and sort the words in descending order of frequency, obtaining the descending word-frequency table shown in Fig. 2.
Step 3.2: fit the Zipf exponent α using the relation between word frequency f and rank r in formula (5).
In the invention the Curve Fitting Tool in the Matlab toolbox is used to fit the value of the Zipf exponent α from the descending word-frequency table. The table is first imported into Matlab as a data set; the a*x^b option under the Power option is selected as the objective function, and fitting the data gives b = -0.697, so that by formula (5) the Zipf exponent is α = 0.697.
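Steps 3.1 and 3.2 can be approximated without Matlab. The sketch below builds the descending frequency table and estimates α by ordinary least squares on log f against log r. This is a swapped-in technique, not the Matlab procedure: the Curve Fitting Tool fits a·x^b directly by nonlinear least squares, so the two approaches can return different values on real data. The function names are assumptions made for the example.

```python
import math
from collections import Counter


def frequency_table(token_lists):
    """Step 3.1: word frequencies over all texts, in descending order."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return [freq for _, freq in counts.most_common()]


def fit_zipf_exponent(frequencies):
    """Step 3.2 (approximation): estimate alpha in f(r) = f_max * r**(-alpha).

    Fits a straight line to (log r, log f) with ranks starting at 1.
    Needs at least two positive frequencies.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # plays the role of b in a * x^b
    return abs(slope)


if __name__ == "__main__":
    # Synthetic frequencies that follow an exact power law with alpha = 0.7.
    synthetic = [1000 * r ** -0.7 for r in range(1, 51)]
    print(fit_zipf_exponent(synthetic))
```

On data that follow a power law exactly, the log-log fit recovers the exponent; on real word counts the low-frequency tail tends to influence a log-log fit more than a direct fit, which is one reason the results may differ.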
Step 3.3: estimate the number of content units in each text. The number of content units is computed from the fitted Zipf exponent α and formula (6).
For the NetEase Mailbox comment "seek-help account problem" and the World of Warcraft comment "comrade recruitment reward problem", the text lengths are t1 = 3 and t2 = 4. Using the Zipf exponent from step 3.2, taken here as α = 0.67, formula (6) gives the number of content units of the two texts: N(t1) = (1 - 0.67) × 3 = 0.99 and N(t2) = (1 - 0.67) × 4 = 1.32.
Step 3.4: penalize the distance between texts with the number of content units. The edit distance algorithm in step 2 gave the edit distance between the NetEase Mailbox comment "seek-help account problem" and the World of Warcraft comment "comrade recruitment reward problem" as dis(s1, s2) = 3. The larger of the two short texts' content unit counts, max(N(t1), N(t2)) = N(t2), is used to apply a length penalty to dis(s1, s2), giving the text distance penalized by the number of content units: dis'(s1, s2) = dis(s1, s2) / N(t2) = 3 / 1.32 = 2.273, compared with the original edit distance dis(s1, s2) = 3.
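The arithmetic of steps 3.3 and 3.4 can be checked with a short Python sketch. The function names are assumptions for illustration. The branch for α ≤ 1 follows the worked example directly; the branch for α > 1 follows formula (6) as read from the text and is not exercised by the example.

```python
def content_units(t, alpha):
    """Estimate the number of content units N(t) of a text of t words."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if alpha > 1:
        # Branch as read from formula (6); not covered by the worked example.
        return (alpha - 1) ** (1 / alpha) * t ** (1 / alpha)
    return (1 - alpha) * t


def penalized_distance(dis, t1, t2, alpha):
    """Divide the raw edit distance by the larger content-unit estimate.

    Note: alpha == 1 gives N(t) == 0 and therefore a division by zero;
    this sketch does not guard against that case.
    """
    return dis / max(content_units(t1, alpha), content_units(t2, alpha))


if __name__ == "__main__":
    alpha = 0.67  # value used in the worked example
    print(round(content_units(3, alpha), 2))             # N(t1)
    print(round(content_units(4, alpha), 2))             # N(t2)
    print(round(penalized_distance(3, 3, 4, alpha), 3))  # dis'(s1, s2)
```

With α = 0.67, t1 = 3, t2 = 4 and dis = 3, the sketch reproduces the values 0.99, 1.32 and 2.273 from the worked example.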
Finally, to verify that the method of the invention improves on the traditional method, the following experiment was carried out.
The experimental data are online comment short texts obtained from the Internet. One class consists of microblog posts about NetEase Mailbox from Sina Weibo (http://weibo.com); the other consists of comments about World of Warcraft from the World of Warcraft section of the Battle.net forum (http://www.battlenet.com.cn/wow/zh/forum/). 210 texts are selected from each class, 420 short texts in total, and a ten-fold cross experiment is carried out.
The invention first uses step 2 to compute the distance dis(s1, s2) between every pair of texts; these pairwise distances form the distance matrix DisMatrix1. Step 3 is then used to penalize the distances dis(s1, s2), giving dis'(s1, s2); all penalized distances form the distance matrix DisMatrix2. A traditional clustering algorithm is applied to each of the two matrices. Traditional clustering algorithms include hierarchical clustering, K-means clustering, and Affinity Propagation clustering; the invention uses the Affinity Propagation algorithm to cluster the distance matrices.
Because the experimental data come from two categories, mailbox and game, the target number of classes for the Affinity Propagation algorithm is set to 2: once the samples have been grouped into two classes during clustering, the algorithm terminates automatically and returns the clustering result.
To compare the experimental results, the invention uses the relative accuracy (relative accuracy rate) to compare clustering performance:
$$Ra = \frac{n_1 + n_2}{N} \tag{7}$$
where n1 is the number of texts of the 1st class that are grouped together under the same topic, n2 is the number of texts of the 2nd class that are grouped together under the same topic, and N is the total number of short texts in the experiment.
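A possible reading of formula (7) is sketched below in Python. The text does not spell out how n1 and n2 are counted, so this sketch assumes two classes and two clusters, both labelled 0 and 1, and assumes that each cluster is matched to a different class in whichever way gives the higher score, since cluster identifiers are arbitrary. The function name is hypothetical.

```python
def relative_accuracy(true_labels, cluster_labels):
    """Ra = (n1 + n2) / N for two classes and two clusters (labels 0 and 1).

    Assumption: n1 + n2 is the number of texts whose cluster agrees with
    their class under the better of the two one-to-one cluster/class
    assignments.
    """
    if not true_labels or len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    total = len(true_labels)
    same = sum(1 for t, c in zip(true_labels, cluster_labels) if t == c)
    # Cluster ids are arbitrary, so also consider the swapped assignment.
    return max(same, total - same) / total


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    clusters = [1, 1, 0, 0, 0, 0]
    print(relative_accuracy(truth, clusters))
```

In the usage example, two texts of class 0 share one cluster and all three texts of class 1 share the other, so n1 + n2 = 5 and Ra = 5/6.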
The 10 data sets are processed with both algorithms to obtain the corresponding distance matrices. The Affinity Propagation algorithm is applied to each distance matrix for cluster analysis, and the resulting performance figures are used to compare the effectiveness of the two text distance methods.
Table 1 gives the relative accuracy obtained on the experimental data with the two text distance methods. Fig. 3 shows the relative accuracy graphically.
Table 1. Relative accuracy (%)
Dataset      Edit distance   Edit distance + present method
Dataset1     63.49           70.63
Dataset2     61.64           71.16
Dataset3     62.17           72.22
Dataset4     66.67           72.49
Dataset5     61.64           65.54
Dataset6     62.7            75.66
Dataset7     62.43           70.37
Dataset8     63.49           66.40
Dataset9     66.40           72.22
Dataset10    63.23           72.75
The experimental results show that penalizing the text distance with the number of content units improves on the traditional method: on all ten data sets the relative accuracy is higher, by between 2.91 and 12.96 percentage points.

Claims (1)

1. A method for computing the text distance between short texts based on estimation of the number of language content units, characterized in that it is carried out in a computer by the following steps in order:
Step (1): computer initialization
Input: two classes of online comment short texts obtained from the Internet, each class consisting of a number of sentences;
the word segmentation software module of the Chinese lexical analysis system ICTCLAS;
the function fitting tool Curve Fitting Tool in the Matlab toolbox;
Step (2): text preprocessing
Step (2.1): remove the HTML web-page markup, including <html>, <body>, and <div>, contained in the two classes of online comment short texts;
Step (2.2): apply variant short-text processing to the two classes of online comment short texts from which the HTML markup was removed in step (2.1): remove nonstandard alphabetic writing, normalize traditional Chinese characters, remove nonstandard symbols used to represent facial expressions, and normalize the use of digits and punctuation marks;
Step (3): compute the text distance between the two classes of online comment short texts preprocessed in step (2), as follows:
Step (3.1): segment the preprocessed online comment short texts with the ICTCLAS word segmentation algorithm, converting each short-text sentence into a word string;
Step (3.2): taking the word as the unit, compute the edit distance between the online comment short texts of step (3.1) with the edit distance algorithm and use it as the text distance, that is, the minimum number of word-level edit operations needed to convert one sentence into the other; compute the edit distance matrix of the two sentences; the value of the last cell of this matrix is the text distance between the two sentences, dis(S1, S2), where "dis" denotes the text distance and S1, S2 denote the two sentences;
Step (4): using the number of content units in the online comment short texts of step (1), penalize the text distance dis(S1, S2) obtained in step (3.2) by the following steps, obtaining the text distance penalized by the number of content units, dis'(S1, S2);
Step (4.1): count the word frequencies in each of the two classes of online comment short texts of step (1) and sort the words in descending order of frequency, obtaining a descending word-frequency table for each class;
Step (4.2): import each descending word-frequency table into the Matlab toolbox as a data set, with reference to the relation between word frequency f and rank r in Zipf's law:
$$f(r) = f_{\max} \cdot r^{-\alpha}$$
where f_max is a coefficient greater than 0 and α is the Zipf exponent, greater than 0;
select a·x^b under the Power option in the Matlab toolbox as the objective function and fit the data to obtain b, with b < 0, which gives the Zipf exponent α = |b| for each of the two classes of online comment short texts;
Step (4.3): obtain the number of content units N(t) of each sentence in the two classes of online comment short texts by the following formula:
$$N(t) = \begin{cases} (\alpha - 1)^{1/\alpha}\, t^{1/\alpha}, & \alpha > 1 \\ (1 - \alpha)\, t, & 0 < \alpha \le 1 \end{cases}$$
where the number of content units is the number of words with substantive content in each sentence of the two classes of online comment short texts, excluding text that contributes nothing to expressing the content of the text; t is the length, in words, of each sentence;
Step (4.4): compute the final text distance dis'(S1, S2) from the original text distance dis(S1, S2) obtained in step (3.2) and the number of content units N(t) obtained in step (4.3);
take the larger of the two sentences' content unit counts, max(N(t1), N(t2)), and use it to apply a length penalty to dis(S1, S2), obtaining the final text distance dis'(S1, S2):
$$dis'(S_1, S_2) = \frac{dis(S_1, S_2)}{\max(N(t_1), N(t_2))}$$
CN 201210012475 2012-01-16 2012-01-16 Method for computing text distance between short texts based on language content unit number evaluation Active CN102622405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210012475 CN102622405B (en) 2012-01-16 2012-01-16 Method for computing text distance between short texts based on language content unit number evaluation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210012475 CN102622405B (en) 2012-01-16 2012-01-16 Method for computing text distance between short texts based on language content unit number evaluation

Publications (2)

Publication Number Publication Date
CN102622405A CN102622405A (en) 2012-08-01
CN102622405B true CN102622405B (en) 2013-08-21

Family

ID=46562325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210012475 Active CN102622405B (en) 2012-01-16 2012-01-16 Method for computing text distance between short texts based on language content unit number evaluation

Country Status (1)

Country Link
CN (1) CN102622405B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679462B (en) * 2012-08-31 2019-01-15 阿里巴巴集团控股有限公司 A kind of comment data treating method and apparatus, a kind of searching method and system
CN104008166B (en) * 2014-05-30 2017-05-24 华东师范大学 Dialogue short text clustering method based on form and semantic similarity
CN112148947B (en) * 2020-09-28 2024-03-22 微梦创科网络科技(中国)有限公司 Method and system for excavating and brushing users in batches

Citations (4)

Publication number Priority date Publication date Assignee Title
US6643641B1 (en) * 2000-04-27 2003-11-04 Russell Snyder Web search engine with graphic snapshots
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
CN101609472A (en) * 2009-08-13 2009-12-23 腾讯科技(深圳)有限公司 A kind of keyword evaluation method and device based on the question and answer platform

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20090240498A1 (en) * 2008-03-19 2009-09-24 Microsoft Corporation Similiarity measures for short segments of text

Also Published As

Publication number Publication date
CN102622405A (en) 2012-08-01

Similar Documents

Publication Publication Date Title
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
CN102831184B (en) According to the method and system text description of social event being predicted to social affection
CN105808768B (en) A kind of construction method of the concept based on books-descriptor knowledge network
Soliman et al. Sentiment analysis of Arabic slang comments on facebook
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN103154936A (en) Methods and systems for automated text correction
CN105183833A (en) User model based microblogging text recommendation method and recommendation apparatus thereof
CN101520802A (en) Question-answer pair quality evaluation method and system
CN106202584A (en) A kind of microblog emotional based on standard dictionary and semantic rule analyzes method
CN107943824A (en) A kind of big data news category method, system and device based on LDA
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN110134799B (en) BM25 algorithm-based text corpus construction and optimization method
CN108733644A (en) A kind of text emotion analysis method, computer readable storage medium and terminal device
Qiu et al. Advanced sentiment classification of tibetan microblogs on smart campuses based on multi-feature fusion
CN109657064A (en) A kind of file classification method and device
Bilgin et al. Sentiment analysis with term weighting and word vectors
CN115186654B (en) Method for generating document abstract
Silveira et al. Combining a double clustering approach with sentence simplification to produce highly informative multi-document summaries
CN104462408A (en) Topic modeling based multi-granularity sentiment analysis method
CN110457711A (en) A kind of social media event topic recognition methods based on descriptor
CN103608805B (en) Dictionary generation and method
Liu et al. Extract Product Features in Chinese Web for Opinion Mining.
CN102622405B (en) Method for computing text distance between short texts based on language content unit number evaluation
Touahri et al. Deep analysis of an Arabic sentiment classification system based on lexical resource expansion and custom approaches building
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant