
US006917936B2

(12) United States Patent (10) Patent No.: US 6,917,936 B2

Cancedda (45) Date of Patent: Jul. 12, 2005

(54) METHOD AND APPARATUS FOR MEASURING SIMILARITY BETWEEN DOCUMENTS

(75) Inventor: Nicola Cancedda, Grenoble (FR)

(73) Assignee: Xerox Corporation, Stamford, CT (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 201 days.

(21) Appl. No.: 10/321,869

(22) Filed: Dec. 18, 2002

(65) Prior Publication Data

US 2004/0128288 A1 Jul. 1, 2004

(51) Int. Cl.7 G06F 17/30

(52) U.S. Cl. 707/4; 707/5; 707/6

(58) Field of Search 707/1-100; 704/9, 704/10; 715/513

(56) References Cited

U.S. PATENT DOCUMENTS


Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins, "Text classification using string kernels", in NeuroCOLT2 Technical Report Series NC-TR-2000-079, 2000.

Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins, "Text classification using string kernels", in Advances in Neural Information Processing Systems, pp. 563-569, Cambridge, MA, 2001.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins, "Text classification using string kernels", in Journal of Machine Learning Research, 2:419-444, Feb. 2002.

Chris Watkins, "Dynamic Alignment Kernels", in Technical Report CSD-TR-98-11, Department of Computer Science, Royal Holloway University of London, Jan. 1999.

* cited by examiner


A measure of similarity between a first sequence of symbols and a second sequence of symbols is computed. Memory is allocated for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols. A processor computes for the computational unit the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops. The measure of similarity is output by the computational unit to an information processing application.

21 Claims, 7 Drawing Sheets


Direct [

Input:
    s, t ∈ Σ+      # s & t are two sequences of symbols
    λ ∈ [0,1]      # decay factor
    N ∈ ℕ+         # subsequence length

Output:
    K_N(s,t)       # measure of similarity

]

{

    # Initialize base values of K, K' and K''
    K'_0(i,j) = 1,     0 ≤ i ≤ |s|, 0 ≤ j ≤ |t|
    K'_n(n-1,j) = 0,   1 ≤ n < N, n-1 ≤ j ≤ |t|
    K'_n(i,n-1) = 0,   1 ≤ n < N, n-1 ≤ i ≤ |s|
    K''_n(n-1,j) = 0,  1 ≤ n < N, n-1 ≤ j ≤ |t|

    for n = 1 to N-1             # for increasing subsequence length n
        for i = n to |s|         # for increasing prefixes s
            for j = n to |t| {   # for increasing prefixes t
                if s_i = t_j then

                else

            }

    for i = n to |s| {           # compute K
        K = 0

        for j = 1 to |t|
            if s_i = t_j then

    }

}

FIG. 5
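The update expressions inside the loops of FIG. 5 did not survive in this copy. The Python sketch below shows one way the nested-loop computation could look, assuming the gap-weighted string-kernel recursion of Lodhi et al. (2002) from the references cited above. The function names, the signature, and the update formulas are assumptions for illustration, not the patent's own listing. A brute-force enumeration is included only as a cross-check of the definition on short inputs.

```python
from itertools import combinations


def subsequence_kernel(s, t, n, lam):
    """Gap-weighted subsequence similarity K_n(s, t) by dynamic programming.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)
    # kp[m][i][j] stores K'_m for the prefixes s[:i] and t[:j].
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for i in range(ls + 1):
        for j in range(lt + 1):
            kp[0][i][j] = 1.0  # base value K'_0 = 1
    for m in range(1, n):                # increasing subsequence length
        for i in range(m, ls + 1):       # increasing prefixes of s
            kpp = 0.0                    # K''_m(i, m-1) = 0
            for j in range(m, lt + 1):   # increasing prefixes of t
                kpp *= lam
                if s[i - 1] == t[j - 1]:
                    # assumed update, taken from Lodhi et al. (2002)
                    kpp += lam * lam * kp[m - 1][i - 1][j - 1]
                kp[m][i][j] = lam * kp[m][i - 1][j] + kpp
    k = 0.0
    for i in range(n, ls + 1):           # accumulate K_n
        for j in range(n, lt + 1):
            if s[i - 1] == t[j - 1]:
                k += lam * lam * kp[n - 1][i - 1][j - 1]
    return k


def brute_force_kernel(s, t, n, lam):
    """Reference value by enumerating every pair of index tuples (slow)."""
    total = 0.0
    for idx_s in combinations(range(len(s)), n):
        for idx_t in combinations(range(len(t)), n):
            if all(s[a] == t[b] for a, b in zip(idx_s, idx_t)):
                span = (idx_s[-1] - idx_s[0] + 1) + (idx_t[-1] - idx_t[0] + 1)
                total += lam ** span
    return total
```

For example, subsequence_kernel("cat", "car", 2, 0.5) evaluates to 0.5**4 = 0.0625, because "ca" is the only common subsequence of length 2 and it spans two symbols in each string. The dynamic-programming version needs on the order of N·|s|·|t| steps, whereas the brute-force check grows combinatorially and is only practical for very short sequences.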
