US006917936B2
(12) United States Patent (10) Patent No.: US 6,917,936 B2
Cancedda (45) Date of Patent: Jul. 12, 2005
(54) METHOD AND APPARATUS FOR
MEASURING SIMILARITY BETWEEN DOCUMENTS
(75) Inventor: Nicola Cancedda, Grenoble (FR)
(73) Assignee: Xerox Corporation, Stamford, CT (US)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 201 days.
(21) Appl. No.: 10/321,869
(22) Filed: Dec. 18, 2002
(65) Prior Publication Data
US 2004/0128288 A1 Jul. 1, 2004
(51) Int. Cl.7 G06F 17/30
(52) U.S. Cl. 707/4; 707/5; 707/6
(58) Field of Search 707/1-100; 704/9, 704/10; 715/513
(56) References Cited
U.S. PATENT DOCUMENTS
![[blocks in formation]](http://www.google.ca/patents?id=AZYVAAAAEBAJ&ie=ISO-8859-1&output=text&pg=PA1&img=1&zoom=3&hl=en&q=&cds=1&sig=ACfU3U0jBxvZYKizCEAwvi880ylEKW8fCQ&edge=0&edge=stretch&ci=127,626,382,229)
Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins, "Text classification using string kernels", in NeuroCOLT2 Technical Report Series NC-TR-2000-079, 2000.
Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins, "Text classification using string kernels", in Advances in Neural Information Processing Systems, pp. 563-569, Cambridge, MA, 2001.
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins, "Text classification using string kernels", in Journal of Machine Learning Research, 2:419-444, Feb. 2002.
Chris Watkins, "Dynamic Alignment Kernels", in Technical Report CSD-TR-98-11, Department of Computer Science, Royal Holloway University of London, Jan. 1999.
* cited by examiner
![[blocks in formation]](http://www.google.ca/patents?id=AZYVAAAAEBAJ&ie=ISO-8859-1&output=text&pg=PA1&img=1&zoom=3&hl=en&q=&cds=1&sig=ACfU3U0jBxvZYKizCEAwvi880ylEKW8fCQ&edge=0&edge=stretch&ci=482,529,284,70)
A measure of similarity between a first sequence of symbols and a second sequence of symbols is computed. Memory is allocated for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols. A processor computes for the computational unit the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops. The measure of similarity is output by the computational unit to an information processing application.
21 Claims, 7 Drawing Sheets
Direct [
Input:
    s, t ∈ Σ+    # s & t are two sequences of symbols
    λ ∈ [0,1]    # decay factor
    N ∈ ℕ+       # subsequence length
Output:
    K_N(s,t)
]
{
    # Initialize base values of K, K' and K''
    K'_0(i,j) = 1,     0 ≤ i ≤ |s|, 0 ≤ j ≤ |t|
    K''_n(n-1,j) = 0,  1 ≤ n < N, n-1 ≤ j ≤ |t|
    K''_n(i,n-1) = 0,  1 ≤ n < N, n-1 ≤ i ≤ |s|
    K'_n(n-1,j) = 0,   1 ≤ n < N, n-1 ≤ j ≤ |t|
    for n = 1 to N-1              # for increasing subsequence length n
        for i = n to |s|          # for increasing prefixes s
            for j = n to |t| {    # for increasing prefixes t
                if s_i = t_j then
                    ...
                else
                    ...
            }
    K = 0                         # compute K
    for i = n to |s| {
        for j = 1 to |t|
            if s_i = t_j then
                ...
    }
}
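The nested-loop computation outlined in the Direct listing can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: the update rules for K'' and K' and the final accumulation of K are taken from the recursion published in the Lodhi et al. references cited above, which is an assumption about what the listing computes. The function name `subsequence_kernel`, the parameter name `lam` (for the decay factor λ) and the table name `kp` are hypothetical.

```python
def subsequence_kernel(s, t, N, lam):
    """Return K_N(s, t), the decay-weighted count of common (possibly
    gapped) subsequences of length N shared by sequences s and t.

    Assumption: the recursion is the one published by Lodhi et al. (2002).
    """
    ls, lt = len(s), len(t)
    # kp[n][i][j] stores K'_n for the prefixes s[:i] and t[:j].
    # Base values: K'_0 = 1 everywhere; K'_n = 0 wherever a prefix is too short.
    kp = [[[1.0 if n == 0 else 0.0 for _ in range(lt + 1)]
           for _ in range(ls + 1)]
          for n in range(N)]

    for n in range(1, N):                 # increasing subsequence length n
        for i in range(n, ls + 1):        # increasing prefixes of s
            kpp = 0.0                     # K''_n(i, n-1) = 0, carried along j
            for j in range(n, lt + 1):    # increasing prefixes of t
                if s[i - 1] == t[j - 1]:
                    kpp = lam * (kpp + lam * kp[n - 1][i - 1][j - 1])
                else:
                    kpp = lam * kpp
                kp[n][i][j] = lam * kp[n][i - 1][j] + kpp

    # Compute K by summing over every pair of matching final symbols.
    k = 0.0
    for i in range(N, ls + 1):
        for j in range(1, lt + 1):
            if s[i - 1] == t[j - 1]:
                k += lam ** 2 * kp[N - 1][i - 1][j - 1]
    return k


# "cat" and "car" share only the length-2 subsequence "ca" (span 2 in each).
print(subsequence_kernel("cat", "car", 2, 0.5))  # 0.0625, i.e. 0.5 ** 4
```

The sketch keeps the full three-dimensional K' table for readability and carries K'' as a running scalar along j, so it uses O(N·|s|·|t|) time and memory. An implementation concerned with memory could retain only the two most recent levels of K'.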