CA1306062C - Computer information retrieval using latent semantic structure - Google Patents
- Publication number
- CA1306062C (also listed: CA000596524A)
- Authority
- CA
- Canada
- Prior art keywords
- term
- data
- pseudo
- matrix
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
Abstract
A methodology for retrieving textual data objects is disclosed. The information is treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in the data objects. Estimates of this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
Description
This invention relates generally to computer-based information retrieval and, in particular, to user accessibility to and display of textual material stored in computer files.
Background of the Invention
Increases in computer storage capacity, transmission rates and processing speed mean that many large and important collections of data are now available electronically, such as via bulletin boards, mail, and on-line texts, documents and directories. While many of the technological barriers to information access and display have been removed, the human/system interface problem of being able to locate what one really needs from the collections remains. Methods for storing, organizing and accessing this information range from electronic analogs of familiar paper-based techniques, such as tables of contents or indices, to richer associative connections that are feasible only with computers, such as hypertext and full-context addressability. While these techniques may provide retrieval benefits over the prior paper-based techniques, many advantages of electronic storage are yet unrealized. Most systems still require a user or provider of information to specify explicit relationships and links between data objects or text objects, thereby making the systems tedious to use or to apply to large, heterogeneous computer information files whose content may be unfamiliar to the user.
To exemplify one standard approach whose difficulties and deficiencies are representative of conventional approaches, the retrieval of information using keyword matching is considered. This technique depends on matching individual words in a user's request with individual words in the total database of textual material. Text objects that contain one or more words in common with those in the user's query are returned as relevant. Keyword-based retrieval systems like this are, however, far from ideal. Many objects relevant to the query may be missed, and oftentimes unrelated objects are retrieved.
The fundamental deficiency of current information retrieval methods is that the words a searcher uses are often not the same as those by which the information sought has been indexed. There are actually two aspects to the problem. First, there is a tremendous diversity in the words people use to describe the same object or concept; this is called synonymy. Users in different contexts, or with different needs, knowledge or linguistic habits will describe the same information using different terms. For example, it has been demonstrated that any two people choose the same main keyword for a single, well-known object less than 20% of the time on average. Indeed, this variability is much
greater than commonly believed and this places strict, low limits on the expected performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct meaning. In different contexts or when used by different people the same word takes on varying referential significance (e.g., "bank" in river bank versus "bank" in a savings bank). Thus the use of a term in a search query does not necessarily mean that a text object containing or labeled by the same term is of interest.
Because human word use is characterized by extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings -- relevant materials will be missed because different people describe the same topic using different words and, because the same word can have different meanings, irrelevant material will be retrieved. The basic problem may be simply summarized by stating that people want to access information based on meaning, but the words they select do not adequately express intended meaning. Previous attempts to improve standard word searching and overcome the diversity in human word usage have involved: restricting the allowable vocabulary and training intermediaries to generate indexing and search keys; hand-crafting thesauri to provide synonyms; or constructing explicit models of the relevant domain knowledge. Not only are these methods expert-labor intensive, but they are often not very successful.
Summary of the Invention
These shortcomings as well as other deficiencies and limitations of information retrieval are obviated, in accordance with the present invention, by automatically constructing a semantic space for retrieval. This is effected by treating the unreliability of observed word-to-text object association data as a statistical problem. The basic postulate is that there is an underlying latent semantic structure in word usage data that is partially hidden or obscured by the variability of word choice. A statistical approach is utilized to estimate this latent structure and uncover the latent meaning. Words, the text objects and, later, user queries are processed to extract this underlying meaning and the new, latent semantic structure domain is then used to represent and retrieve information.
The organization and operation of this invention will be better understood from a consideration of the detailed description of the illustrative embodiment thereof, which follows, when taken in conjunction with the accompanying drawing.
Brief Description of the Drawing
FIG. 1 is a plot of the "term" coordinates and the "document" coordinates based on a two-dimensional singular value decomposition of an original "term-by-document" matrix; and
FIG. 2 is a flow diagram depicting the processing to generate the "term" and "document" matrices using singular value decomposition as well as the processing of a user's query.
Detailed Description
Before discussing the principles and operational characteristics of this invention in detail, it is helpful to present a motivating example. This also aids in introducing terminology utilized later in the discussion.
Simple Example Illustrating the Method
The contents of Table 1 are used to illustrate how semantic structure analysis works and to point out the differences between this method and conventional keyword matching.
TABLE 1
DOCUMENT SET BASED ON TITLES
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: Systems and human systems engineering testing of EPS-2
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
In this example, a file of text objects consists of nine titles of technical documents with titles c1-c5 concerned with human/computer interaction and titles m1-m4 concerned with mathematical graph theory. In Table 1, words occurring in more than one title are italicized. Using conventional keyword retrieval, if a user requested papers dealing with "human computer interaction," titles c1, c2, and c4 would be returned, since these titles
contain at least one keyword from the user request. However, c3 and c5, while related to the query, would not be returned since they share no words in common with the request. It is now shown how latent semantic structure analysis treats this request to return titles c3 and c5.
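The keyword-matching baseline just described can be sketched in a few lines. This is only an illustration under simple assumptions: words are lower-cased and compared exactly, and the names `TITLES`, `tokenize`, and `keyword_match` are hypothetical helpers, not part of the disclosure.

```python
import re

# Titles from Table 1, keyed by their labels.
TITLES = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "Systems and human systems engineering testing of EPS-2",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_match(query, titles):
    """Return labels of titles sharing at least one exact word with the query."""
    query_words = tokenize(query)
    return [label for label, title in titles.items()
            if query_words & tokenize(title)]


matches = keyword_match("human computer interaction", TITLES)
print(matches)  # ['c1', 'c2', 'c4']
```

Titles c3 and c5 are absent from the result because neither contains "human", "computer", or "interaction", which is exactly the shortcoming the text attributes to word matching.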
Table 2 depicts the "term-by-document" matrix for the 9 technical document titles. Each cell entry, (i,j), is the frequency of occurrence of term i in document j. This basic term-by-document matrix or a mathematical transformation thereof is used as input to the statistical procedure described below.
TABLE 2
TERMS                 DOCUMENTS
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
tree         0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minor        0   0   0   0   0   0   0   1   1
For this example the documents and terms have been carefully selected to yield a good approximation in just two dimensions for expository purposes. FIG. 1 is a two-dimensional graphical representation of the two largest dimensions resulting from the statistical process, singular value decomposition. Both document titles and the terms used in them are fit into the same space. Terms are shown as circles and labeled by number.
Document titles are represented by squares with the numbers of constituent terms indicated parenthetically. The cosine or dot product between two objects (terms or documents) describes their estimated similarity. In this representation, the two types of documents form two distinct groups: all the mathematical graph theory titles occupy the same region in space (basically along Dimension 1 of FIG. 1), whereas a quite distinct group is formed for human/computer interaction titles (essentially along Dimension 2 of FIG. 1).
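The counts in Table 2 follow directly from the titles in Table 1, so the matrix can be rebuilt programmatically. The sketch below is one possible way to do that; the plural handling (treating "trees" as "tree") and the helper name `term_by_document` are assumptions made for illustration only.

```python
import re
import numpy as np

# Titles c1-c5 and m1-m4 from Table 1, in column order.
TITLES = [
    "Human machine interface for Lab ABC computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "Systems and human systems engineering testing of EPS-2",
    "Relation of user-perceived response time to error measurement",
    "The generation of random, binary, unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV: Widths of trees and well-quasi-ordering",
    "Graph minors: A survey",
]
# The twelve index terms (those occurring in more than one title), lower-cased.
TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "tree", "graph", "minor"]


def term_by_document(titles, terms):
    """Count occurrences of each term (row) in each title (column)."""
    Y = np.zeros((len(terms), len(titles)), dtype=int)
    for j, title in enumerate(titles):
        for word in re.findall(r"[a-z]+", title.lower()):
            for i, term in enumerate(terms):
                # Crude plural handling (assumption): "trees" counts as "tree".
                if word == term or word == term + "s":
                    Y[i, j] += 1
    return Y


Y = term_by_document(TITLES, TERMS)
print(Y)
```

The entry for "system" in c4 comes out as 2 because that title uses the word twice.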
To respond to a user query about "human computer interaction," the query is first folded into this two-dimensional space using those query terms that occur in the space (namely, "human" and "computer"). The query vector is located in the direction of the weighted average of these constituent terms, and is denoted by a directional arrow labeled "Q" in FIG. 1. A measure of closeness or similarity is related to the angle between the query vector and any given term or document vector. One such measure is the cosine between the query vector and a given term or document vector. In FIG. 1 the cosine between the query vector and each of the titles c1-c5 is greater than 0.90; the angle corresponding to the cosine value of 0.90 with the query is shown by the dashed lines in FIG. 1. With this technique, documents c3 and c5 would be returned as matches to the user query, even though they share no common terms with the query. This is because the latent semantic structure (represented in FIG. 1) fits the overall pattern of term usage across documents.
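A rough numerical version of this query example is sketched below with NumPy. The scaling is an assumption rather than something this passage specifies: the query is placed at the sum of its term vectors (equivalently, the common fold-in $q^T T S^{-1}$ rescaled by $S$), and documents are taken as rows of $DS$. The matrix `Y` holds the Table 2 counts as derived from the Table 1 titles.

```python
import numpy as np

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "tree", "graph", "minor"]
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
Y = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

k = 2  # two dimensions, as in FIG. 1
T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
T, s, D = T0[:, :k], s0[:k], D0t[:k, :].T

# Query as a vector of term counts; "interaction" is not an index term
# and is therefore ignored.
q = np.zeros(len(TERMS))
for word in "human computer interaction".split():
    if word in TERMS:
        q[TERMS.index(word)] += 1

# Assumed scaling: q^T T S^-1 rescaled by S reduces to q^T T;
# documents are represented by the rows of D S.
q_vec = q @ T
doc_vecs = D * s

cosines = doc_vecs @ q_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
for name, c in zip(DOCS, cosines):
    print(f"{name}: {c:+.2f}")
```

Under these assumptions all five c titles, including c3 and c5, score far higher than any of the m titles, even though c3 and c5 share no word with the query. A different choice of scaling would shift the individual cosine values.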
Description of Singular Value Decomposition
To obtain the data to plot FIG. 1, the "term-by-document" matrix of Table 2 is decomposed using singular value decomposition (SVD). A reduced SVD is employed to approximate the original matrix in terms of a much smaller number of orthogonal dimensions. This reduced SVD is used for retrieval; it describes major associational structures in the matrix but it ignores small variations in word usage. The number of dimensions to represent adequately a particular domain is largely an empirical matter. If the number of dimensions is too large, random noise or variations in word usage will be modeled. If the number of dimensions is too small, significant semantic content will remain uncaptured. For diverse information sources, 100 or more dimensions may be needed.
To illustrate the decomposition technique, the term-by-document matrix, denoted Y, is decomposed into three other matrices, namely, the term matrix (TERM), the document matrix (DOCUMENT), and a diagonal matrix of singular values (DIAGONAL), as follows:
$$Y_{t,d} = \mathrm{TERM}_{t,m}\;\mathrm{DIAGONAL}_{m,m}\;\mathrm{DOCUMENT}^{T}_{m,d}$$
where Y is the original t-by-d matrix, TERM is the t-by-m matrix that has unit-length orthogonal columns, $\mathrm{DOCUMENT}^{T}$ is the transpose of the d-by-m DOCUMENT matrix with unit-length orthogonal columns, and DIAGONAL is the m-by-m diagonal matrix of singular values typically ordered by magnitude.
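The decomposition can be tried out with a general-purpose SVD routine. The sketch below uses `numpy.linalg.svd` as a stand-in for whatever routine an implementation would actually use; the variable names mirror the TERM, DIAGONAL, and DOCUMENT matrices of the text, and `Y` holds the Table 2 counts as derived from the Table 1 titles.

```python
import numpy as np

# Term-by-document counts (12 terms by 9 documents).
Y = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # tree
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minor
], dtype=float)

# full_matrices=False returns the compact form with m = min(t, d) columns.
TERM, singular_values, DOCUMENT_T = np.linalg.svd(Y, full_matrices=False)
DIAGONAL = np.diag(singular_values)
m = len(singular_values)

print(TERM.shape, DIAGONAL.shape, DOCUMENT_T.shape)  # (12, 9) (9, 9) (9, 9)
print(np.round(singular_values, 2))

# The three factors reproduce Y, and the singular vectors are orthonormal.
reconstructs = np.allclose(TERM @ DIAGONAL @ DOCUMENT_T, Y)
term_orthonormal = np.allclose(TERM.T @ TERM, np.eye(m))
doc_orthonormal = np.allclose(DOCUMENT_T @ DOCUMENT_T.T, np.eye(m))
print(reconstructs, term_orthonormal, doc_orthonormal)
```

The printed singular values can be compared against Table 5. The signs of individual singular-vector columns are not unique, so a column of TERM (together with the matching row of DOCUMENT_T) may come out negated relative to Tables 3 and 4.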
The dimensionality of the full solution, denoted m, is the rank of the t-by-d matrix, that is, $m \le \min(t,d)$. Tables 3, 4 and 5 below show the TERM and DOCUMENT matrices and the diagonal elements of the DIAGONAL matrix, respectively, as found via SVD.
TABLE 3
TERM MATRIX (12 terms by 9 dimensions)
human       0.22  -0.11   0.29  -0.41  -0.11  -0.34   0.52  -0.06  -0.41
interface   0.20  -0.07   0.14  -0.55   0.28   0.50  -0.07  -0.01  -0.11
computer    0.24   0.04  -0.16  -0.59  -0.11  -0.25  -0.30   0.06   0.49
user        0.40   0.06  -0.34   0.10   0.33   0.38   0.00   0.00   0.01
system      0.64  -0.17   0.36   0.33  -0.16  -0.21  -0.16   0.03   0.27
response    0.26   0.11  -0.42   0.07   0.08  -0.17   0.28  -0.02  -0.05
time        0.26   0.11  -0.42   0.07   0.08  -0.17   0.28  -0.02  -0.05
EPS         0.30  -0.14   0.33   0.19   0.11   0.27   0.03  -0.02  -0.16
survey      0.20   0.27  -0.18  -0.03  -0.54   0.08  -0.47  -0.04  -0.58
tree        0.01   0.49   0.23   0.02   0.59  -0.39  -0.29   0.25  -0.22
graph       0.04   0.62   0.22   0.00  -0.07   0.11   0.16  -0.68   0.23
minor       0.03   0.45   0.14  -0.01  -0.30   0.28   0.34   0.68   0.18
TABLE 4
DOCUMENT MATRIX (9 documents by 9 dimensions)
c1    0.20  -0.06   0.11  -0.95   0.04  -0.08   0.18  -0.01  -0.06
c2    0.60   0.16  -0.50  -0.?3  -0.21  -0.02  -0.43   0.05   0.2?
c3    0.?6  -0.13   0.21   0.04   0.38   0.07  -0.24   0.01   0.02
c4    0.54  -0.23   0.57   0.27  -0.20  -0.04   0.25  -0.02  -0.08
c5    0.28   0.11  -0.50   0.15   0.33   0.03   0.67  -0.06  -0.26
m1    0.00   0.19   0.10   0.02   0.39  -0.30  -0.34   0.45  -0.62
m2    0.01   0.44   0.19   0.02   0.35  -0.21  -0.15  -0.76   0.02
m3    0.02   0.62   0.25   0.01   0.15   0.00   0.25   0.45   0.52
m4    0.08   0.53   0.08  -0.02  -0.60   0.36   0.04  -0.07  -0.45
(A "?" marks a digit that is illegible in the source.)
TABLE 5
DIAGONAL (9 singular values)
3.34  2.54  2.35  1.64  1.50  1.31  0.84  0.56  0.36
As alluded to earlier, data to plot FIG. 1 was obtained by presuming that two dimensions are sufficient to capture the major associational structure of the t-by-d matrix, that is, m is set to two in the expression for $Y_{t,d}$, yielding an approximation of the original matrix. Only the first two columns of the TERM and DOCUMENT matrices are considered with the remaining columns being ignored. Thus, the term data point corresponding to "human" in FIG. 1 is plotted with coordinates (0.22,-0.11), which are extracted from the first row and the two left-most columns of the TERM matrix. Similarly, the document data point corresponding to title m1 has coordinates (0.00,0.19), coming from row six and the two left-most columns of the DOCUMENT matrix.
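The coordinate extraction described here amounts to slicing the first two columns of the two singular-vector matrices. A minimal sketch, assuming `numpy.linalg.svd` as the SVD routine and the Table 2 counts as derived from the Table 1 titles:

```python
import numpy as np

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "tree", "graph", "minor"]
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
Y = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)

# Keep only the two left-most columns of the TERM and DOCUMENT matrices.
term_xy = T0[:, :2]
doc_xy = D0t[:2, :].T

human = term_xy[TERMS.index("human")]   # first row of the TERM matrix
m1 = doc_xy[DOCS.index("m1")]           # row six of the DOCUMENT matrix
print("human:", np.round(human, 2))
print("m1:   ", np.round(m1, 2))
```

Because the sign of each singular-vector pair is arbitrary, a given SVD routine may return a dimension mirrored relative to FIG. 1; the magnitudes should still agree with the coordinates quoted in the text.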
General Model Details
It is now instructive to describe in somewhat more detail the mathematical model underlying the latent structure, singular value decomposition technique.
Any rectangular matrix Y of t rows and d columns, for example, a t-by-d matrix of terms and documents, can be decomposed into a product of three other matrices:
$$Y = T_0 S_0 D_0^T \qquad (1)$$
such that $T_0$ and $D_0$ have unit-length orthogonal columns (i.e., $T_0^T T_0 = I$; $D_0^T D_0 = I$) and $S_0$ is diagonal. This is called the singular value decomposition (SVD) of Y. (A procedure for SVD is described in the text Numerical Recipes, by Press, Flannery, Teukolsky and Vetterling, 1986, Cambridge University Press, Cambridge, England.) $T_0$ and $D_0$ are the matrices of left and right singular vectors and $S_0$ is the diagonal matrix of singular values. By convention, the diagonal elements of $S_0$ are ordered in decreasing magnitude.
With SVD, it is possible to devise a simple strategy for an optimal approximation to Y using smaller matrices. The k largest singular values and their associated columns in $T_0$ and $D_0$ may be kept and the remaining entries set to zero. The product of the resulting matrices is a matrix $Y_R$ which is approximately equal to Y, and is of rank k. The new matrix $Y_R$ is the matrix of rank k which is the closest in the least squares sense to Y. Since zeros were introduced into $S_0$, the representation of $S_0$ can be simplified by deleting the rows and columns having these zeros to obtain a new diagonal matrix S, and then deleting the corresponding columns of $T_0$ and $D_0$ to define new matrices T and D, respectively. The result is a reduced model such that
$$Y_R = T S D^T. \qquad (2)$$
The value of k is chosen for each application; it is generally such that $k \ge 100$ for collections of 1000-3000 data objects.
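The truncation step can be illustrated as follows. This is a sketch on stand-in random data (not data from the text); `reduced_svd` is a hypothetical helper name. It also checks the least-squares property numerically: the Frobenius error of the rank-k model equals the root of the summed squares of the discarded singular values.

```python
import numpy as np


def reduced_svd(Y, k):
    """Return T (t-by-k), s (k singular values) and D (d-by-k) of the rank-k model."""
    T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
    return T0[:, :k], s0[:k], D0t[:k, :].T


rng = np.random.default_rng(0)
Y = rng.integers(0, 3, size=(12, 9)).astype(float)  # stand-in term counts

k = 2
T, s, D = reduced_svd(Y, k)
Y_R = T @ np.diag(s) @ D.T          # equation (2)

all_singular_values = np.linalg.svd(Y, compute_uv=False)
error = np.linalg.norm(Y - Y_R)     # Frobenius norm of the residual
discarded = np.sqrt(np.sum(all_singular_values[k:] ** 2))

print("rank of Y_R:", np.linalg.matrix_rank(Y_R))
print("approximation error:", error)
print("norm of discarded singular values:", discarded)
```

Raising k shrinks the error toward zero, at the cost of modeling more of the incidental variation in the data.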
For discussion purposes, it is useful to interpret the SVD geometrically. The rows of the reduced matrices T and D may be taken as vectors representing the terms and documents, respectively, in a k-dimensional space. With appropriate rescaling of the axes, by quantities related to the associated diagonal values of S, dot products between points in the space can be used to access and compare objects. (A simplified approach which did not involve rescaling was used to plot the data of FIG. 1, but this was strictly for expository purposes.) These techniques are now discussed.
Fundamental Comparisons
There are basically three types of comparisons of interest: (i) those comparing two terms; (ii) those comparing two documents or text objects; and (iii) those comparing a term and a document or text object. As used throughout, the notion of a text object or data object is general whereas a document is a specific instance of a text object or data object. Also, text or data objects are stored in the computer system in files.
Two Terms: In the data, the dot product between two row vectors of $Y_R$ tells the extent to which two terms have a similar pattern of occurrence across the set of documents. The matrix $Y_R Y_R^T$ is the square symmetric matrix approximation containing all the term-by-term dot products. Using equation (2) and the fact that $D^T D = I$,
$$Y_R Y_R^T = (T S D^T)(T S D^T)^T = T S^2 T^T = (TS)(TS)^T. \qquad (3)$$
This means that the dot product between the i-th row and j-th row of $Y_R$ can be obtained by calculating the dot product between the i-th and j-th rows of the $TS$ matrix. That is, considering the rows of $TS$ as vectors representing the terms, dot products between these vectors give the comparison between the terms. The relation between taking the rows of $T$ as vectors and those of $TS$ as vectors is simple since $S$ is a diagonal matrix; each vector element has been stretched or shrunk by the corresponding element of $S$.
Two Documents: In this case, the dot product is between two column vectors of $Y$. The document-to-document dot product is approximated by
$$Y_R^T Y_R = (T S D^T)^T (T S D^T) = D S^2 D^T = (DS)(DS)^T. \qquad (4)$$
Thus the rows of the $DS$ matrix are taken as vectors representing the documents, and the comparison is via the dot product between the rows of the $DS$ matrix.
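The two "within" comparisons of equations (3) and (4) can be checked numerically. The sketch below assumes a small random count matrix and $k = 2$; neither comes from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(0, 3, size=(6, 5)).astype(float)  # hypothetical term-by-document counts

k = 2
T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
Y_R = T @ S @ D.T

TS = T @ S   # rows are the term vectors used for term-to-term comparisons
DS = D @ S   # rows are the document vectors used for document-to-document comparisons

term_term = TS @ TS.T   # equation (3): all term-by-term dot products
doc_doc = DS @ DS.T     # equation (4): all document-by-document dot products
```

Both products use only $k$-column matrices, so the full $Y_R$ never has to be formed to compare two terms or two documents.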
Term and Document: This comparison is somewhat different. Instead of trying to estimate the dot product between rows or between columns of $Y$, the fundamental comparison between a term and a document is the value of an individual cell in $Y$. The approximation of $Y$ is simply equation (2), i.e., $Y_R = T S D^T$. The i,j cell of $Y_R$ may therefore be obtained by taking the dot product between the i-th row of the matrix $T S^{1/2}$ and the j-th row of the matrix $D S^{1/2}$. While the "within" (term or document) comparisons involved using rows of $TS$ and $DS$ as vectors, the "between" comparison requires $T S^{1/2}$ and $D S^{1/2}$ for coordinates. Thus it is not possible to make a single configuration of points in a space that will allow both "between" and "within" comparisons. They will be similar, however, differing only by a stretching or shrinking of the dimensional elements by a factor $S^{1/2}$.
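The "between" comparison can be illustrated the same way. The matrix values and the chosen cell indices below are assumptions for the example only.

```python
import numpy as np

# Hypothetical 4-term by 3-document count matrix.
Y = np.array([[2., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 2., 1.]])

k = 2
T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
T, s, D = T0[:, :k], s0[:k], D0t[:k, :].T
Y_R = T @ np.diag(s) @ D.T

root_S = np.diag(np.sqrt(s))
term_coords = T @ root_S   # rows of T S^(1/2)
doc_coords = D @ root_S    # rows of D S^(1/2)

i, j = 1, 2                               # term 1 against document 2
cell = term_coords[i] @ doc_coords[j]     # approximates cell (i, j) of Y
```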
Representations of Pseudo-Objects
The previous results show how it is possible to compute comparisons between the various objects associated with the rows or columns of $Y$. It is very important in information retrieval applications to compute similar comparison quantities for objects such as queries that do not appear explicitly in $Y$. For example, it is necessary to be able to take a completely novel query, find a location in the $k$-dimensional latent semantic space for it, and then evaluate its cosine or inner product with respect to terms or objects in the space. Another example would be trying, after the fact, to find representations for documents that did not appear in the original space. The new objects for both these examples are equivalent to objects in the matrix $Y$ in that they may be represented as vectors of terms. For this reason they are called pseudo-documents specifically or pseudo-objects generically.
In order to compare pseudo-documents to other documents, the starting point is defining a pseudo-document vector, designated $Y_q$. Then a representation $D_q$ is derived such that $D_q$ can be used just like a row of $D$ in the comparison relationships described in the foregoing sections. One criterion for such a derivation is that the insertion of a real document $Y_i$ should give $D_i$ when the model is ideal (i.e., $Y = Y_R$). With this constraint,
$$Y_q = T S D_q^T$$
or, since $T^T T$ equals the identity matrix,
$$D_q^T = S^{-1} T^T Y_q$$
or, finally,
$$D_q = Y_q^T T S^{-1}. \qquad (5)$$
Thus, with appropriate rescaling of the axes, this amounts to placing the pseudo-object at the vector sum of its corresponding term points. Then $D_q$ may be used like any row of $D$ and, appropriately scaled by $S$ or $S^{1/2}$, can be used like a usual document vector for making "within" and "between" comparisons. It is to be noted that if the measure of similarity to be used in comparing the query against all the documents is one in which only the angle between the vectors is important (such as the cosine), there is no difference for comparison purposes between placing the query at the vector average or the vector sum of its terms.
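Equation (5) and the two properties stated above (a real document folds back to its own row of $D$ in the ideal case, and sum versus average placement gives the same cosine) can be sketched as follows. The helper names `fold_in` and `cosine`, and the exactly rank-2 example matrix, are assumptions chosen so that $Y = Y_R$ holds.

```python
import numpy as np

def fold_in(Y_q, T, s):
    """Equation (5): D_q = Y_q^T T S^{-1}, for a term vector Y_q of length t."""
    return (Y_q @ T) / s

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical matrix of exact rank 2, so the rank-2 model is "ideal" (Y == Y_R).
A = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 0.], [0., 3.]])
B = np.array([[1., 0., 2., 1.], [0., 1., 1., 3.]])
Y = A @ B

k = 2
T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
T, s, D = T0[:, :k], s0[:k], D0t[:k, :].T

D_1 = fold_in(Y[:, 1], T, s)               # a real document column maps back to its row of D

Y_q = np.array([1., 0., 1., 0., 0.])       # pseudo-document containing terms 0 and 2
D_q = fold_in(Y_q, T, s)                   # placed at the (rescaled) vector sum
D_q_avg = fold_in(Y_q / Y_q.sum(), T, s)   # placed at the vector average instead

cos_sum = cosine(D_q * s, D[0] * s)        # "within" comparison against document 0
cos_avg = cosine(D_q_avg * s, D[0] * s)
```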
Illustrative Embodiment
The foundation principles presented in the foregoing sections are now applied to a practical example by way of teaching an illustrative embodiment in accordance with the present invention.
The system under consideration is one that receives a request for technical information from a user and returns as a response display the most appropriate groups in a large, technically diverse company dealing with that technical information. The size of each group is from five to ten people. There is no expert who understands in detail what every group is accomplishing. Each person's understanding or knowledge of the company's technical work tends to be myopic; that is, each one knows their particular group's work, knows less about neighboring groups, and their knowledge becomes less precise or even nonexistent as one moves further away from the core group.
If each group can be described by a set of terms, then the latent semantic indexing procedure can be applied. For instance, one set of textual descriptions might include annual write-ups each group member must prepare in describing the planned activity for the coming year. Another input could be the abstracts of technical memoranda written by members of each group.
The technique for processing the documents gathered together to represent the company technical information is shown in block diagram form in FIG. 2. The first processing activity, as illustrated by processing block 100, is that of text preprocessing. All the combined text is preprocessed to identify terms and possible compound noun phrases. First, phrases are found by identifying all words lying between delimiters of three kinds: (1) words on a precompiled list of stop words; (2) punctuation marks; or (3) parenthetical remarks.
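The phrase-finding step of block 100 might be sketched as below. This is only one reading of the text: the short stop-word list is hypothetical, and treating a whole parenthetical remark as a boundary (discarding its contents) is an assumption, since the description does not say what happens to the words inside it.

```python
import re

# Hypothetical, very short stop-word list; the description mentions a precompiled list.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "for", "is", "to", "with", "on"}

def find_phrases(text):
    """Return maximal runs of words bounded by stop words, punctuation, or parentheticals."""
    # Assumption: a parenthetical remark acts purely as a phrase boundary.
    text = re.sub(r"\([^)]*\)", " . ", text)
    phrases = []
    for chunk in re.split(r"[.,;:!?\"]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

phrases = find_phrases(
    "Retrieval of technical information, and the design of a current blocking layer (see the appendix)."
)
```

No grammatical parsing is involved; the stop words and punctuation alone delimit the candidate phrases.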
To obtain more stable estimates of word frequencies, all inflectional suffixes (past tense, plurals, adverbials, progressive tense, and so forth) are removed from the words. Inflectional suffixes, in contrast to derivational suffixes, are those that do not usually change the meaning of the base word. (For example, removing the "s" from "boys" does not change the meaning of the base word whereas stripping "ation" from "information" does change the meaning.) Since no single set of pattern-action rules can correctly describe the English language, the suffix stripper sub-program may contain an exception list.
The next step in the processing is represented by block 110 in FIG. 2. Based upon the earlier text preprocessing, a system lexicon is created. The lexicon includes both single words and noun phrases. The noun phrases provide for a richer semantic space. For example, the "information" in "information retrieval" and "information theory" have different meanings. Treating these as separate terms places each of the compounds at different places in the $k$-dimensional space. (For a word in radically different semantic environments, treating it as a single word tends to place the word in a meaningless place in $k$-dimensional space, whereas treating each of its different semantic environments separately using separate compounds yields spatial differentiation.)
Compound noun phrases may be extracted using a simplified, automatic procedure. First, phrases are found using the "pseudo" parsing technique described with respect to step 100. Then all left and right branching subphrases are found. Any phrase or subphrase that occurs in more than one document is a potential compound phrase. Compound phrases may range from two to many words (e.g., "semi-insulating Fe-doped InP current blocking layer"). From these potential compound phrases, all longest-matching phrases as well as single words making up the compounds are entered into the lexicon base to obtain spatial separation.
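The selection of potential compound phrases could look like the following sketch. The description does not define "left and right branching subphrases"; here they are assumed to be the word-level prefixes and suffixes of a phrase that are at least two words long, and both function names are hypothetical.

```python
from collections import defaultdict

def branching_subphrases(phrase):
    """Prefixes and suffixes of at least two words, including the whole phrase (an assumption)."""
    words = phrase.split()
    subs = set()
    for n in range(2, len(words) + 1):
        subs.add(" ".join(words[:n]))    # anchored at the left end
        subs.add(" ".join(words[-n:]))   # anchored at the right end
    return subs

def candidate_compounds(doc_phrases):
    """doc_phrases holds one list of phrases per document.
    Keep every phrase or subphrase that occurs in more than one document."""
    seen_in = defaultdict(set)
    for doc_id, phrases in enumerate(doc_phrases):
        for phrase in phrases:
            for sub in branching_subphrases(phrase):
                seen_in[sub].add(doc_id)
    return {sub for sub, docs in seen_in.items() if len(docs) > 1}

# Hypothetical phrases already extracted from three documents.
docs = [["current blocking layer", "information retrieval"],
        ["thin current blocking layer"],
        ["information theory"]]
compounds = candidate_compounds(docs)
```

In this toy input, "information retrieval" and "information theory" each occur in only one document, so neither becomes a compound; a larger collection would be needed for them to qualify.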
In the illustrative embodiment, all inflectionally stripped single words occurring in more than one document and that are not on the list of most frequently used words in English (such as "the", "and") are also included in the system lexicon. Typically, the exclusion list comprises about 150 common words.
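A rough sketch of inflectional suffix stripping with an exception list, followed by the single-word lexicon rule stated above, is given here. The rule table, exception list, and stop-word list are all hypothetical and far cruder than a real suffix stripper would be.

```python
# Hypothetical exception list, pattern-action rules, and exclusion list.
EXCEPTIONS = {"this": "this", "analysis": "analysis"}
SUFFIX_RULES = [("ss", "ss"), ("ies", "y"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]
STOP_WORDS = {"the", "and", "a", "of", "in"}

def strip_inflection(word):
    """Remove one inflectional suffix; derivational endings such as "ation" are left alone."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters so short words are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

def single_word_lexicon(docs):
    """Stripped words that occur in more than one document and are not on the exclusion list."""
    doc_count = {}
    for text in docs:
        stems = {strip_inflection(w) for w in text.lower().split()}
        for stem in stems - STOP_WORDS:
            doc_count[stem] = doc_count.get(stem, 0) + 1
    return sorted(stem for stem, n in doc_count.items() if n > 1)

docs = ["the boys tested lasers", "a laser and the boy", "information theory"]
lexicon = single_word_lexicon(docs)
```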
From the list of lexicon terms, the Term-by-Document matrix is created, as depicted by processing block 120 in FIG. 2. In one exemplary situation, the matrix contained 7100 terms and 728 documents representing 480 groups.
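Building the Term-by-Document matrix from a lexicon can be sketched as follows. Raw occurrence counts are used as the cell values, in line with the "frequency of occurrence" wording of claim 1; the function name and the toy data are assumptions.

```python
import numpy as np

def term_by_document(lexicon, docs_terms):
    """lexicon: list of terms; docs_terms: one list of extracted terms per document."""
    row = {term: i for i, term in enumerate(lexicon)}
    Y = np.zeros((len(lexicon), len(docs_terms)))
    for j, terms in enumerate(docs_terms):
        for term in terms:
            if term in row:            # terms outside the lexicon are ignored
                Y[row[term], j] += 1   # raw frequency of occurrence
    return Y

# Hypothetical lexicon and per-document term lists.
lexicon = ["information retrieval", "laser", "layer"]
docs_terms = [["laser", "layer", "laser"],
              ["information retrieval", "layer"],
              ["theory"]]
Y = term_by_document(lexicon, docs_terms)
```

At the scale quoted above (7100 terms by 728 documents) most cells are zero, so a sparse representation would be a natural choice in practice; a dense array is used here only for brevity.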
The next step is to perform the singular value decomposition on the Term-by-Document matrix, as depicted by processing block 130. This analysis is only effected once (or each time there is a significant update in the storage files).
The last step in processing the documents prior to a user query is depicted by block 140. In order to relate a selected document to the group responsible for that document, an organizational database is constructed. This latter database may contain, for instance, the group manager's name and the manager's mail address.
The user query processing activity is depicted on the right-hand side of FIG. 2. The first step, as represented by processing block 200, is to preprocess the query in the same way as the original documents.
As then depicted by block 210, the longest matching compound phrases as well as single words not part of compound phrases are extracted from the query. For each query term also contained in the system lexicon, the $k$-dimensional vector is located. The query vector is the weighted vector average of the $k$-dimensional vectors. Processing block 220 depicts the generation step for the query vector.
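The query vector generation of block 220 might be sketched as below. The text does not say how the weights are chosen or which scaling of the term vectors is used, so the `weights` dictionary, the default of equal weights, and the toy term vectors are all assumptions.

```python
import numpy as np

def query_vector(query_terms, lexicon, term_vectors, weights=None):
    """Weighted average of the k-dimensional vectors of the query terms found in the lexicon."""
    row = {term: i for i, term in enumerate(lexicon)}
    known = [t for t in query_terms if t in row]
    if not known:
        return None  # no query term appears in the lexicon
    vectors = np.array([term_vectors[row[t]] for t in known])
    if weights is None:
        w = np.ones(len(known))
    else:
        w = np.array([weights.get(t, 1.0) for t in known])  # hypothetical per-term weights
    return np.average(vectors, axis=0, weights=w)

# Hypothetical lexicon with 2-dimensional term vectors.
lexicon = ["expert", "semantic structure", "system"]
term_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

query_terms = ["expert", "semantic structure", "locating"]  # "locating" is not in the lexicon
q_equal = query_vector(query_terms, lexicon, term_vectors)
q_weighted = query_vector(query_terms, lexicon, term_vectors, weights={"expert": 3.0})
```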
The next step in the query processing is depicted by processing block 230. In order that the best matching document is located, the query vector is compared to all documents in the space. The similarity metric used is the cosine between the query vector and the document vectors. A cosine of 1.0 would indicate that the query vector and the document vector were on top of one another in the space. The cosine metric is similar to a dot product measure except that it ignores the magnitude of the vectors and simply uses the angle between the vectors being compared.
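The cosine comparison against every document vector can be sketched as follows. The function name, the toy vectors, and the cutoff `n_best` are assumptions; the sketch also assumes no vector has zero length.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vectors, n_best):
    """Return (document index, cosine) pairs for the n_best documents, best first."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
    cosines = doc_vectors @ query_vec / norms
    order = np.argsort(-cosines)[:n_best]   # largest cosine first
    return [(int(j), float(cosines[j])) for j in order]

# Hypothetical document vectors in a 2-dimensional space.
doc_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 2.0])

best = rank_by_cosine(query_vec, doc_vectors, n_best=2)
best_scaled = rank_by_cosine(10.0 * query_vec, doc_vectors, n_best=2)
```

Scaling the query vector leaves the ranking unchanged, which is the sense in which the cosine ignores vector magnitude.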
The cosines are sorted, as depicted by processing block 240, and for each of the best N matching documents (typically N=8), the value of the cosine along with organizational information corresponding to the document's group are displayed to the user, as depicted by processing block 250. Table 6 shows a typical input and output for N=5.
INPUT QUERY: An Expert/Expert-Locating System Based on Automatic Representation of Semantic Structure
OUTPUT RESULTS:
1. Group: B
   Group Title: Artificial Intelligence and Information Science Research
   Group Manager: D. E. Walker, Address B, Phone B
   Fit (Cosine): 0.67
2. Group: A
   Group Title: Artificial Intelligence and Communications Research
   Group Manager: L. A. Streeter, Address A, Phone A
   Fit (Cosine): 0.64
3. Group: E
   Group Title: Cognitive Science Research
   Group Manager: T. K. Landauer, Address E, Phone E
   Fit (Cosine): 0.63
4. Group: C
   Group Title: Experimental Systems
   Group Manager: C. A. Riley, Address C, Phone C
   Fit (Cosine): 0.62
5. Group: D
   Group Title: Software Technology
   Group Manager: C. P. Lewis, Address D, Phone D
   Fit (Cosine): 0.55
It is to be further understood that the methodology described herein is not limited to the specific forms disclosed by way of illustration, but may assume other embodiments limited only by the scope of the appended claims.
Claims (11)
1. An information retrieval method comprising the steps of generating term-by-data object matrix data to represent information files stored in a computer system, said matrix data being indicative of the frequency of occurrence of selected terms contained in the data objects stored in the information files, decomposing said matrix into a reduced singular value representation composed of distinct term and data object files, in response to a user query, generating a pseudo-object utilizing said selected terms and inserting said pseudo-object into said matrix data, and examining the similarity between said pseudo-object and said term and data object files to generate an information response and storing said response in the system in a form accessible by the user.
2. The method as recited in claim 1 wherein said step of generating said matrix data includes the step of producing a lexicon database defining said selected terms.
3. The method as recited in claim 2 wherein said step of producing said lexicon database includes the step of parsing the data objects.
4. The method as recited in claim 3 wherein said step of parsing includes the steps of removing inflectional suffixes and isolating phrases in the data objects.
5. The method as recited in claim 2 wherein said step of generating said pseudo-object includes the step of parsing said pseudo-object with reference to said lexicon database.
6. The method as recited in claim 1 further including the step of generating an organizational database associated with the authorship of the data objects and storing said organizational database in the system and said response includes information from said organizational database based on said similarity.
7. The method as recited in claim 1 wherein said matrix database is expressed as $Y$, said step of decomposing produces said representation in the form $Y = T_0 S_0 D_0^T$ of rank $m$, and an approximation representation $Y_R = T S D^T$ of rank $k < m$, where $T_0$ and $D_0$ represent said term and data object databases and $S_0$ corresponds to said singular value representation and where $T$, $D$ and $S$ represent reduced forms of $T_0$, $D_0$ and $S_0$, respectively, said pseudo-object is expressible as $Y_q$ and said step of inserting includes the step of computing $D_q = Y_q^T T S^{-1}$, and said step of examining includes the step of evaluating the dot products between said pseudo-object and said term and document matrices.
8. The method as recited in claim 7 wherein the degree of similarity is measured by said dot products exceeding a predetermined threshold.
9. The method as recited in claim 8 wherein said approximation representation is obtained by setting (k+1) through m diagonal values of S0 to zero.
10. The method as recited in claim 1 wherein said matrix database is expressed as $Y$, said step of decomposing produces said representation in the form $Y = T_0 S_0 D_0^T$ of rank $m$, and an approximation representation $Y_R = T S D^T$ of rank $k < m$, where $T_0$ and $D_0$ represent said term and data object databases and $S_0$ corresponds to said singular value representation and where $T$, $D$ and $S$ represent reduced forms of $T_0$, $D_0$ and $S_0$, respectively, said pseudo-object is expressible as $Y_q$ and said step of inserting includes the step of computing $D_q = Y_q^T T S^{-1}$, and said step of examining includes the step of evaluating the cosines between said pseudo-object and said term and document matrices.
11. A method for retrieving information from an information file stored in a computer system comprising the steps of generating term-by-data object matrix data by processing the information file, performing a singular value decomposition on said matrix data to obtain the reduced term and data object vectors and diagonal values, in response to a user query, generating a pseudo-object vector and augmenting said matrix data with said pseudo-vector using reduced forms of said term vector and said diagonal values and storing said augmented data in the system, and examining the similarities between said pseudo-object vector and said reduced term vector and a reduced form of said data object vector to generate the information and storing the information in a response file accessible to the user.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US07/244,349 US4839853A (en) | 1988-09-15 | 1988-09-15 | Computer information retrieval using latent semantic structure |
US07/244,349 | 1988-09-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
CA1306062C true CA1306062C (en) | 1992-08-04 |
Family
ID=22922358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA000596524A Expired - Lifetime CA1306062C (en) | 1988-09-15 | 1989-04-12 | Computer information retrieval using latent semantic structure |
Country Status (2)
Country | Link |
---|---|
US (1) | US4839853A (en) |
CA (1) | CA1306062C (en) |
Families Citing this family (467)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5142681A (en) * | 1986-07-07 | 1992-08-25 | International Business Machines Corporation | APL-to-Fortran translators |
US5408655A (en) * | 1989-02-27 | 1995-04-18 | Apple Computer, Inc. | User interface system and method for traversing a database |
US5197005A (en) * | 1989-05-01 | 1993-03-23 | Intelligent Business Systems | Database retrieval system having a natural language interface |
US5241671C1 (en) | 1989-10-26 | 2002-07-02 | Encyclopaedia Britannica Educa | Multimedia search system using a plurality of entry path means which indicate interrelatedness of information |
US6978277B2 (en) * | 1989-10-26 | 2005-12-20 | Encyclopaedia Britannica, Inc. | Multimedia search system |
US5301109A (en) * | 1990-06-11 | 1994-04-05 | Bell Communications Research, Inc. | Computerized cross-language document retrieval using latent semantic indexing |
US5321833A (en) * | 1990-08-29 | 1994-06-14 | Gte Laboratories Incorporated | Adaptive ranking system for information retrieval |
US5559940A (en) * | 1990-12-14 | 1996-09-24 | Hutson; William H. | Method and system for real-time information analysis of textual material |
US5348020A (en) * | 1990-12-14 | 1994-09-20 | Hutson William H | Method and system for near real-time analysis and display of electrocardiographic signals |
US5490516A (en) * | 1990-12-14 | 1996-02-13 | Hutson; William H. | Method and system to enhance medical signals for real-time analysis and high-resolution display |
DE69229521T2 (en) * | 1991-04-25 | 2000-03-30 | Nippon Steel Corp | Database discovery system |
US6643656B2 (en) | 1991-07-31 | 2003-11-04 | Richard Esty Peterson | Computerized information retrieval system |
US5265065A (en) * | 1991-10-08 | 1993-11-23 | West Publishing Company | Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query |
JP2792293B2 (en) * | 1991-11-29 | 1998-09-03 | 日本電気株式会社 | Information retrieval device |
US5369575A (en) * | 1992-05-15 | 1994-11-29 | International Business Machines Corporation | Constrained natural language interface for a computer system |
US5598557A (en) * | 1992-09-22 | 1997-01-28 | Caere Corporation | Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files |
US5440481A (en) * | 1992-10-28 | 1995-08-08 | The United States Of America As Represented By The Secretary Of The Navy | System and method for database tomography |
JP3025724B2 (en) * | 1992-11-24 | 2000-03-27 | 富士通株式会社 | Synonym generation processing method |
DE69426541T2 (en) * | 1993-03-12 | 2001-06-13 | Toshiba Kawasaki Kk | Document detection system with presentation of the detection result to facilitate understanding of the user |
US5652897A (en) * | 1993-05-24 | 1997-07-29 | Unisys Corporation | Robust language processor for segmenting and parsing-language containing multiple instructions |
US5544352A (en) * | 1993-06-14 | 1996-08-06 | Libertech, Inc. | Method and apparatus for indexing, searching and displaying data |
JPH07105239A (en) * | 1993-09-30 | 1995-04-21 | Omron Corp | Data base managing method and data base retrieving method |
US5873056A (en) * | 1993-10-12 | 1999-02-16 | The Syracuse University | Natural language processing system for semantic vector representation which accounts for lexical ambiguity |
US5692176A (en) * | 1993-11-22 | 1997-11-25 | Reed Elsevier Inc. | Associative text search and retrieval system |
US5584024A (en) * | 1994-03-24 | 1996-12-10 | Software Ag | Interactive database query system and method for prohibiting the selection of semantically incorrect query parameters |
US5630125A (en) * | 1994-05-23 | 1997-05-13 | Zellweger; Paul | Method and apparatus for information management using an open hierarchical data structure |
US5745745A (en) * | 1994-06-29 | 1998-04-28 | Hitachi, Ltd. | Text search method and apparatus for structured documents |
US5706497A (en) * | 1994-08-15 | 1998-01-06 | Nec Research Institute, Inc. | Document retrieval using fuzzy-logic inference |
US7467137B1 (en) | 1994-09-02 | 2008-12-16 | Wolfe Mark A | System and method for information retrieval employing a preloading procedure |
US6604103B1 (en) | 1994-09-02 | 2003-08-05 | Mark A. Wolfe | System and method for information retrieval employing a preloading procedure |
US7103594B1 (en) | 1994-09-02 | 2006-09-05 | Wolfe Mark A | System and method for information retrieval employing a preloading procedure |
US5715445A (en) * | 1994-09-02 | 1998-02-03 | Wolfe; Mark A. | Document retrieval system employing a preloading procedure |
US5687364A (en) * | 1994-09-16 | 1997-11-11 | Xerox Corporation | Method for learning to infer the topical content of documents based upon their lexical content |
US5659766A (en) * | 1994-09-16 | 1997-08-19 | Xerox Corporation | Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision |
US5855015A (en) * | 1995-03-20 | 1998-12-29 | Interval Research Corporation | System and method for retrieval of hyperlinked information resources |
US5870770A (en) * | 1995-06-07 | 1999-02-09 | Wolfe; Mark A. | Document research system and method for displaying citing documents |
US7302638B1 (en) * | 1995-06-07 | 2007-11-27 | Wolfe Mark A | Efficiently displaying and researching information about the interrelationships between documents |
US5675710A (en) * | 1995-06-07 | 1997-10-07 | Lucent Technologies, Inc. | Method and apparatus for training a text classifier |
US7246310B1 (en) * | 1995-06-07 | 2007-07-17 | Wolfe Mark A | Efficiently displaying and researching information about the interrelationships between documents |
US5724571A (en) | 1995-07-07 | 1998-03-03 | Sun Microsystems, Inc. | Method and apparatus for generating query responses in a computer-based document retrieval system |
US5787422A (en) * | 1996-01-11 | 1998-07-28 | Xerox Corporation | Method and apparatus for information access employing overlapping clusters |
US5787450A (en) * | 1996-05-29 | 1998-07-28 | International Business Machines Corporation | Apparatus and method for constructing a non-linear data object from a common gateway interface |
US5926812A (en) * | 1996-06-20 | 1999-07-20 | Mantra Technologies, Inc. | Document extraction and comparison method with applications to automatic personalized database searching |
US5778362A (en) * | 1996-06-21 | 1998-07-07 | Kdl Technologies Limited | Method and system for revealing information structures in collections of data items |
US5813002A (en) * | 1996-07-31 | 1998-09-22 | International Business Machines Corporation | Method and system for linearly detecting data deviations in a large database |
JP3916007B2 (en) * | 1996-08-01 | 2007-05-16 | 高嗣 北川 | Semantic information processing method and apparatus |
US5765149A (en) * | 1996-08-09 | 1998-06-09 | Digital Equipment Corporation | Modified collection frequency ranking method |
US6745194B2 (en) * | 2000-08-07 | 2004-06-01 | Alta Vista Company | Technique for deleting duplicate records referenced in an index of a database |
US5765150A (en) * | 1996-08-09 | 1998-06-09 | Digital Equipment Corporation | Method for statistically projecting the ranking of information |
US5745890A (en) | 1996-08-09 | 1998-04-28 | Digital Equipment Corporation | Sequential searching of a database index using constraints on word-location pairs |
US5909680A (en) * | 1996-09-09 | 1999-06-01 | Ricoh Company Limited | Document categorization by word length distribution analysis |
US5857179A (en) * | 1996-09-09 | 1999-01-05 | Digital Equipment Corporation | Computer method and apparatus for clustering documents and automatic generation of cluster keywords |
US5987446A (en) * | 1996-11-12 | 1999-11-16 | U.S. West, Inc. | Searching large collections of text using multiple search engines concurrently |
US5915001A (en) * | 1996-11-14 | 1999-06-22 | Vois Corporation | System and method for providing and using universally accessible voice and speech data files |
US6415319B1 (en) | 1997-02-07 | 2002-07-02 | Sun Microsystems, Inc. | Intelligent network browser using incremental conceptual indexer |
US6076051A (en) * | 1997-03-07 | 2000-06-13 | Microsoft Corporation | Information retrieval utilizing semantic representation of text |
US5996011A (en) * | 1997-03-25 | 1999-11-30 | Unified Research Laboratories, Inc. | System and method for filtering data received by a computer system |
US6539430B1 (en) | 1997-03-25 | 2003-03-25 | Symantec Corporation | System and method for filtering data received by a computer system |
US8626763B1 (en) | 1997-05-22 | 2014-01-07 | Google Inc. | Server-side suggestion of preload operations |
US6356864B1 (en) | 1997-07-25 | 2002-03-12 | University Technology Corporation | Methods for analysis and evaluation of the semantic content of a writing based on vector length |
US6078878A (en) * | 1997-07-31 | 2000-06-20 | Microsoft Corporation | Bootstrapping sense characterizations of occurrences of polysemous words |
US6112304A (en) * | 1997-08-27 | 2000-08-29 | Zipsoft, Inc. | Distributed computing architecture |
US6122628A (en) * | 1997-10-31 | 2000-09-19 | International Business Machines Corporation | Multidimensional data clustering and dimension reduction for indexing and searching |
US6134541A (en) * | 1997-10-31 | 2000-10-17 | International Business Machines Corporation | Searching multidimensional indexes using associated clustering and dimension reduction information |
US7257604B1 (en) | 1997-11-17 | 2007-08-14 | Wolfe Mark A | System and method for communicating information relating to a network resource |
US6272531B1 (en) * | 1998-03-31 | 2001-08-07 | International Business Machines Corporation | Method and system for recognizing and acting upon dynamic data on the internet |
US7194471B1 (en) | 1998-04-10 | 2007-03-20 | Ricoh Company, Ltd. | Document classification system and method for classifying a document according to contents of the document |
US6211876B1 (en) * | 1998-06-22 | 2001-04-03 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for displaying icons representing information items stored in a database |
US6173441B1 (en) | 1998-10-16 | 2001-01-09 | Peter A. Klein | Method and system for compiling source code containing natural language instructions |
US6256629B1 (en) * | 1998-11-25 | 2001-07-03 | Lucent Technologies Inc. | Method and apparatus for measuring the degree of polysemy in polysemous words |
US6868389B1 (en) | 1999-01-19 | 2005-03-15 | Jeffrey K. Wilkins | Internet-enabled lead generation |
US6574378B1 (en) | 1999-01-22 | 2003-06-03 | Kent Ridge Digital Labs | Method and apparatus for indexing and retrieving images using visual keywords |
US6282540B1 (en) * | 1999-02-26 | 2001-08-28 | Vicinity Corporation | Method and apparatus for efficient proximity searching |
US6584464B1 (en) | 1999-03-19 | 2003-06-24 | Ask Jeeves, Inc. | Grammar template query system |
US6510406B1 (en) * | 1999-03-23 | 2003-01-21 | Mathsoft, Inc. | Inverse inference engine for high performance web search |
US6862710B1 (en) | 1999-03-23 | 2005-03-01 | Insightful Corporation | Internet navigation using soft hyperlinks |
US6629097B1 (en) | 1999-04-28 | 2003-09-30 | Douglas K. Keith | Displaying implicit associations among items in loosely-structured data sets |
US6493702B1 (en) * | 1999-05-05 | 2002-12-10 | Xerox Corporation | System and method for searching and recommending documents in a collection using share bookmarks |
US6611825B1 (en) | 1999-06-09 | 2003-08-26 | The Boeing Company | Method and system for text mining using multidimensional subspaces |
US6701305B1 (en) | 1999-06-09 | 2004-03-02 | The Boeing Company | Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace |
KR20010004404A (en) | 1999-06-28 | 2001-01-15 | 정선종 | Keyfact-based text retrieval system, keyfact-based text index method, and retrieval method using this system |
US6598047B1 (en) * | 1999-07-26 | 2003-07-22 | David W. Russell | Method and system for searching text |
US8914361B2 (en) * | 1999-09-22 | 2014-12-16 | Google Inc. | Methods and systems for determining a meaning of a document to match the document to content |
US6816857B1 (en) * | 1999-11-01 | 2004-11-09 | Applied Semantics, Inc. | Meaning-based advertising and document relevance determination |
US6453315B1 (en) * | 1999-09-22 | 2002-09-17 | Applied Semantics, Inc. | Meaning-based information organization and retrieval |
US7925610B2 (en) * | 1999-09-22 | 2011-04-12 | Google Inc. | Determining a meaning of a knowledge item using document-based information |
US8051104B2 (en) | 1999-09-22 | 2011-11-01 | Google Inc. | Editing a network of interconnected concepts |
JP3335602B2 (en) | 1999-11-26 | 2002-10-21 | 株式会社クリエイティブ・ブレインズ | Thinking system analysis method and analyzer |
US6480837B1 (en) * | 1999-12-16 | 2002-11-12 | International Business Machines Corporation | Method, system, and program for ordering search results using a popularity weighting |
US6751621B1 (en) | 2000-01-27 | 2004-06-15 | Manning & Napier Information Services, Llc. | Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors |
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US6757646B2 (en) | 2000-03-22 | 2004-06-29 | Insightful Corporation | Extended functionality for an inverse inference engine based web search |
US6925427B1 (en) | 2000-04-04 | 2005-08-02 | Ford Global Technologies, Llc | Method of determining a switch sequence plan for an electrical system |
US7912868B2 (en) * | 2000-05-02 | 2011-03-22 | Textwise Llc | Advertisement placement method and system using semantic analysis |
US6728695B1 (en) * | 2000-05-26 | 2004-04-27 | Burning Glass Technologies, Llc | Method and apparatus for making predictions about entities represented in documents |
JP3672234B2 (en) | 2000-06-12 | 2005-07-20 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Method for retrieving and ranking documents from a database, computer system, and recording medium |
JP3573688B2 (en) | 2000-06-28 | 2004-10-06 | 松下電器産業株式会社 | Similar document search device and related keyword extraction device |
CN100437574C (en) * | 2000-07-06 | 2008-11-26 | 金时焕 | Information searching system and method thereof |
DE10033612B4 (en) * | 2000-07-11 | 2004-05-13 | Siemens Ag | Method for controlling access to a storage device |
US6687696B2 (en) * | 2000-07-26 | 2004-02-03 | Recommind Inc. | System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models |
US7024407B2 (en) | 2000-08-24 | 2006-04-04 | Content Analyst Company, Llc | Word sense disambiguation |
WO2002021335A1 (en) * | 2000-09-01 | 2002-03-14 | Telcordia Technologies, Inc. | Automatic recommendation of products using latent semantic indexing of content |
US6615208B1 (en) | 2000-09-01 | 2003-09-02 | Telcordia Technologies, Inc. | Automatic recommendation of products using latent semantic indexing of content |
AU2001296304A1 (en) * | 2000-09-25 | 2002-04-08 | Insightful Corporation | Extended functionality for an inverse inference engine based web search |
US6678679B1 (en) * | 2000-10-10 | 2004-01-13 | Science Applications International Corporation | Method and system for facilitating the refinement of data queries |
JP2002157270A (en) * | 2000-11-17 | 2002-05-31 | Mitsubishi Space Software Kk | System and method for distributing interesting article |
US6937986B2 (en) * | 2000-12-28 | 2005-08-30 | Comverse, Inc. | Automatic dynamic speech recognition vocabulary based on external sources of information |
US20030083860A1 (en) * | 2001-03-16 | 2003-05-01 | Eli Abir | Content conversion method and apparatus |
US20030093261A1 (en) * | 2001-03-16 | 2003-05-15 | Eli Abir | Multilingual database creation system and method |
US7711547B2 (en) * | 2001-03-16 | 2010-05-04 | Meaningful Machines, L.L.C. | Word association method and apparatus |
US8874431B2 (en) * | 2001-03-16 | 2014-10-28 | Meaningful Machines Llc | Knowledge system method and apparatus |
US7860706B2 (en) | 2001-03-16 | 2010-12-28 | Eli Abir | Knowledge system method and apparatus |
US8744835B2 (en) * | 2001-03-16 | 2014-06-03 | Meaningful Machines Llc | Content conversion method and apparatus |
US7062572B1 (en) | 2001-03-19 | 2006-06-13 | Microsoft Corporation | Method and system to determine the geographic location of a network user |
US7120646B2 (en) * | 2001-04-09 | 2006-10-10 | Health Language, Inc. | Method and system for interfacing with a multi-level data structure |
US7062220B2 (en) | 2001-04-18 | 2006-06-13 | Intelligent Automation, Inc. | Automated, computer-based reading tutoring systems and methods |
US7627588B1 (en) | 2001-05-07 | 2009-12-01 | Ixreveal, Inc. | System and method for concept based analysis of unstructured data |
US7194483B1 (en) | 2001-05-07 | 2007-03-20 | Intelligenxia, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US7536413B1 (en) | 2001-05-07 | 2009-05-19 | Ixreveal, Inc. | Concept-based categorization of unstructured objects |
USRE46973E1 (en) * | 2001-05-07 | 2018-07-31 | Ureveal, Inc. | Method, system, and computer program product for concept-based multi-dimensional analysis of unstructured information |
US6654740B2 (en) | 2001-05-08 | 2003-11-25 | Sunflare Co., Ltd. | Probabilistic information retrieval based on differential latent semantic space |
US7050964B2 (en) * | 2001-06-01 | 2006-05-23 | Microsoft Corporation | Scaleable machine translation system |
US7734459B2 (en) * | 2001-06-01 | 2010-06-08 | Microsoft Corporation | Automatic extraction of transfer mappings from bilingual corpora |
US7430562B1 (en) | 2001-06-19 | 2008-09-30 | Microstrategy, Incorporated | System and method for efficient date retrieval and processing |
US8005870B1 (en) | 2001-06-19 | 2011-08-23 | Microstrategy Incorporated | System and method for syntax abstraction in query language generation |
US7003512B1 (en) * | 2001-06-20 | 2006-02-21 | Microstrategy, Inc. | System and method for multiple pass cooperative processing |
US6820073B1 (en) | 2001-06-20 | 2004-11-16 | Microstrategy Inc. | System and method for multiple pass cooperative processing |
US20030004996A1 (en) * | 2001-06-29 | 2003-01-02 | International Business Machines Corporation | Method and system for spatial information retrieval for hyperlinked documents |
US8301503B2 (en) * | 2001-07-17 | 2012-10-30 | Incucomm, Inc. | System and method for providing requested information to thin clients |
KR20030009704A (en) * | 2001-07-23 | 2003-02-05 | 한국전자통신연구원 | System for drawing patent map using technical field word, its method |
US20020010715A1 (en) * | 2001-07-26 | 2002-01-24 | Garry Chinn | System and method for browsing using a limited display device |
US7526425B2 (en) * | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
US7398201B2 (en) | 2001-08-14 | 2008-07-08 | Evri Inc. | Method and system for enhanced data searching |
US7283951B2 (en) * | 2001-08-14 | 2007-10-16 | Insightful Corporation | Method and system for enhanced data searching |
US6978275B2 (en) * | 2001-08-31 | 2005-12-20 | Hewlett-Packard Development Company, L.P. | Method and system for mining a document containing dirty text |
US8078545B1 (en) | 2001-09-24 | 2011-12-13 | Aloft Media, Llc | System, method and computer program product for collecting strategic patent data associated with an identifier |
US7124081B1 (en) * | 2001-09-28 | 2006-10-17 | Apple Computer, Inc. | Method and apparatus for speech recognition using latent semantic adaptation |
ITFI20010199A1 (en) | 2001-10-22 | 2003-04-22 | Riccardo Vieri | SYSTEM AND METHOD TO TRANSFORM TEXTUAL COMMUNICATIONS INTO VOICE AND SEND THEM WITH AN INTERNET CONNECTION TO ANY TELEPHONE SYSTEM |
JP3953295B2 (en) * | 2001-10-23 | 2007-08-08 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information search system, information search method, program for executing information search, and recording medium on which program for executing information search is recorded |
US20070156665A1 (en) * | 2001-12-05 | 2007-07-05 | Janusz Wnek | Taxonomy discovery |
US6965900B2 (en) * | 2001-12-19 | 2005-11-15 | X-Labs Holdings, Llc | Method and apparatus for electronically extracting application specific multidimensional information from documents selected from a set of documents electronically extracted from a library of electronically searchable documents |
US7137062B2 (en) | 2001-12-28 | 2006-11-14 | International Business Machines Corporation | System and method for hierarchical segmentation with latent semantic indexing in scale space |
US7124073B2 (en) * | 2002-02-12 | 2006-10-17 | Sunflare Co., Ltd | Computer-assisted memory translation scheme based on template automaton and latent semantic index principle |
US8589413B1 (en) | 2002-03-01 | 2013-11-19 | Ixreveal, Inc. | Concept-based method and system for dynamically analyzing results from search engines |
US6847966B1 (en) | 2002-04-24 | 2005-01-25 | Engenium Corporation | Method and system for optimally searching a document database using a representative semantic space |
US7158983B2 (en) | 2002-09-23 | 2007-01-02 | Battelle Memorial Institute | Text analysis technique |
US20040133574A1 (en) * | 2003-01-07 | 2004-07-08 | Science Applications International Corporation | Vector space method for secure information sharing |
US7421418B2 (en) | 2003-02-19 | 2008-09-02 | Nahava Inc. | Method and apparatus for fundamental operations on token sequences: computing similarity, extracting term values, and searching efficiently |
US7557805B2 (en) * | 2003-04-01 | 2009-07-07 | Battelle Memorial Institute | Dynamic visualization of data streams |
US7152065B2 (en) * | 2003-05-01 | 2006-12-19 | Telcordia Technologies, Inc. | Information retrieval and text mining using distributed latent semantic indexing |
US7734627B1 (en) * | 2003-06-17 | 2010-06-08 | Google Inc. | Document similarity detection |
US20040260551A1 (en) * | 2003-06-19 | 2004-12-23 | International Business Machines Corporation | System and method for configuring voice readers using semantic analysis |
GB0322600D0 (en) * | 2003-09-26 | 2003-10-29 | Univ Ulster | Thematic retrieval in heterogeneous data repositories |
JP4428036B2 (en) | 2003-12-02 | 2010-03-10 | ソニー株式会社 | Information processing apparatus and method, program, information processing system and method |
US7689536B1 (en) | 2003-12-18 | 2010-03-30 | Google Inc. | Methods and systems for detecting and extracting information |
US7299110B2 (en) * | 2004-01-06 | 2007-11-20 | Honda Motor Co., Ltd. | Systems and methods for using statistical techniques to reason with noisy data |
US20050175972A1 (en) * | 2004-01-13 | 2005-08-11 | Neuroscience Solutions Corporation | Method for enhancing memory and cognition in aging adults |
US20070111173A1 (en) * | 2004-01-13 | 2007-05-17 | Posit Science Corporation | Method for modulating listener attention toward synthetic formant transition cues in speech stimuli for training |
US8210851B2 (en) * | 2004-01-13 | 2012-07-03 | Posit Science Corporation | Method for modulating listener attention toward synthetic formant transition cues in speech stimuli for training |
US20060177805A1 (en) * | 2004-01-13 | 2006-08-10 | Posit Science Corporation | Method for enhancing memory and cognition in aging adults |
US20060073452A1 (en) * | 2004-01-13 | 2006-04-06 | Posit Science Corporation | Method for enhancing memory and cognition in aging adults |
US20060051727A1 (en) * | 2004-01-13 | 2006-03-09 | Posit Science Corporation | Method for enhancing memory and cognition in aging adults |
US20070065789A1 (en) * | 2004-01-13 | 2007-03-22 | Posit Science Corporation | Method for enhancing memory and cognition in aging adults |
US20060105307A1 (en) * | 2004-01-13 | 2006-05-18 | Posit Science Corporation | Method for enhancing memory and cognition in aging adults |
US20060047441A1 (en) * | 2004-08-31 | 2006-03-02 | Ramin Homayouni | Semantic gene organizer |
US20070011155A1 (en) * | 2004-09-29 | 2007-01-11 | Sarkar Pte. Ltd. | System for communication and collaboration |
US20060074980A1 (en) * | 2004-09-29 | 2006-04-06 | Sarkar Pte. Ltd. | System for semantically disambiguating text information |
US7680648B2 (en) * | 2004-09-30 | 2010-03-16 | Google Inc. | Methods and systems for improving text segmentation |
US20070266020A1 (en) * | 2004-09-30 | 2007-11-15 | British Telecommunications | Information Retrieval |
US7996208B2 (en) | 2004-09-30 | 2011-08-09 | Google Inc. | Methods and systems for selecting a language for text segmentation |
US8051096B1 (en) | 2004-09-30 | 2011-11-01 | Google Inc. | Methods and systems for augmenting a token lexicon |
US7814105B2 (en) * | 2004-10-27 | 2010-10-12 | Harris Corporation | Method for domain identification of documents in a document database |
US7984388B2 (en) | 2004-12-10 | 2011-07-19 | International Business Machines Corporation | System and method for partially collapsing a hierarchical structure for information navigation |
US8843536B1 (en) | 2004-12-31 | 2014-09-23 | Google Inc. | Methods and systems for providing relevant advertisements or other content for inactive uniform resource locators using search queries |
JP2008538019A (en) * | 2005-01-31 | 2008-10-02 | ムスグローブ テクノロジー エンタープライジィーズ,エルエルシー | System and method for generating linked classification structures |
EP1846815A2 (en) * | 2005-01-31 | 2007-10-24 | Textdigger, Inc. | Method and system for semantic search and retrieval of electronic documents |
JP4524640B2 (en) * | 2005-03-31 | 2010-08-18 | ソニー株式会社 | Information processing apparatus and method, and program |
US20060224584A1 (en) * | 2005-03-31 | 2006-10-05 | Content Analyst Company, Llc | Automatic linear text segmentation |
US7720792B2 (en) * | 2005-04-05 | 2010-05-18 | Content Analyst Company, Llc | Automatic stop word identification and compensation |
US7580910B2 (en) * | 2005-04-06 | 2009-08-25 | Content Analyst Company, Llc | Perturbing latent semantic indexing spaces |
JP2008537225A (en) * | 2005-04-11 | 2008-09-11 | テキストディガー,インコーポレイテッド | Search system and method for queries |
US20060242190A1 (en) * | 2005-04-26 | 2006-10-26 | Content Analyst Company, Llc | Latent semantic taxonomy generation |
US7765098B2 (en) * | 2005-04-26 | 2010-07-27 | Content Analyst Company, Llc | Machine translation using vector space representations |
US7844566B2 (en) * | 2005-04-26 | 2010-11-30 | Content Analyst Company, Llc | Latent semantic clustering |
US20060253423A1 (en) * | 2005-05-07 | 2006-11-09 | Mclane Mark | Information retrieval system and method |
US8312034B2 (en) | 2005-06-24 | 2012-11-13 | Purediscovery Corporation | Concept bridge and method of operating the same |
US20060294101A1 (en) * | 2005-06-24 | 2006-12-28 | Content Analyst Company, Llc | Multi-strategy document classification system and method |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US7747618B2 (en) * | 2005-09-08 | 2010-06-29 | Microsoft Corporation | Augmenting user, query, and document triplets using singular value decomposition |
US20080215614A1 (en) * | 2005-09-08 | 2008-09-04 | Slattery Michael J | Pyramid Information Quantification or PIQ or Pyramid Database or Pyramided Database or Pyramided or Selective Pressure Database Management System |
US8688673B2 (en) * | 2005-09-27 | 2014-04-01 | Sarkar Pte Ltd | System for communication and collaboration |
US7562074B2 (en) * | 2005-09-28 | 2009-07-14 | Epacris Inc. | Search engine determining results based on probabilistic scoring of relevance |
US7633076B2 (en) | 2005-09-30 | 2009-12-15 | Apple Inc. | Automated response to and sensing of user activity in portable devices |
JP5368100B2 (en) * | 2005-10-11 | 2013-12-18 | アイエックスリビール インコーポレイテッド | System, method, and computer program product for concept-based search and analysis |
US9069847B2 (en) * | 2005-10-21 | 2015-06-30 | Battelle Memorial Institute | Data visualization methods, data visualization devices, data visualization apparatuses, and articles of manufacture |
DE102005054510A1 (en) * | 2005-11-16 | 2007-05-24 | Voith Patent Gmbh | Tissue machine |
WO2007059287A1 (en) * | 2005-11-16 | 2007-05-24 | Evri Inc. | Extending keyword searching to syntactically and semantically annotated data |
WO2007064375A2 (en) * | 2005-11-30 | 2007-06-07 | Selective, Inc. | Selective latent semantic indexing method for information retrieval applications |
US20070134635A1 (en) * | 2005-12-13 | 2007-06-14 | Posit Science Corporation | Cognitive training using formant frequency sweeps |
US20070143307A1 (en) * | 2005-12-15 | 2007-06-21 | Bowers Matthew N | Communication system employing a context engine |
US8694530B2 (en) | 2006-01-03 | 2014-04-08 | Textdigger, Inc. | Search system with query refinement and search method |
US7676485B2 (en) * | 2006-01-20 | 2010-03-09 | Ixreveal, Inc. | Method and computer program product for converting ontologies into concept semantic networks |
US20070219946A1 (en) * | 2006-03-15 | 2007-09-20 | Emmanuel Roche | Information repository and answering system |
WO2007114932A2 (en) | 2006-04-04 | 2007-10-11 | Textdigger, Inc. | Search system and method with text function tagging |
US8060567B2 (en) * | 2006-04-12 | 2011-11-15 | Google Inc. | Method, system, graphical user interface, and data structure for creating electronic calendar entries from email messages |
JP2009540398A (en) * | 2006-06-02 | 2009-11-19 | テルコーディア テクノロジーズ インコーポレイテッド | Concept-based cross-media indexing and retrieval of audio documents |
US8401841B2 (en) | 2006-08-31 | 2013-03-19 | Orcatec Llc | Retrieval of documents using language models |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US20080086490A1 (en) * | 2006-10-04 | 2008-04-10 | Sap Ag | Discovery of services matching a service request |
US8024193B2 (en) * | 2006-10-10 | 2011-09-20 | Apple Inc. | Methods and apparatus related to pruning for concatenative text-to-speech synthesis |
US9165040B1 (en) | 2006-10-12 | 2015-10-20 | Google Inc. | Producing a ranking for pages using distances in a web-link graph |
US8672055B2 (en) | 2006-12-07 | 2014-03-18 | Canrig Drilling Technology Ltd. | Automated directional drilling apparatus and methods |
US11725494B2 (en) | 2006-12-07 | 2023-08-15 | Nabors Drilling Technologies Usa, Inc. | Method and apparatus for automatically modifying a drilling path in response to a reversal of a predicted trend |
US7860593B2 (en) * | 2007-05-10 | 2010-12-28 | Canrig Drilling Technology Ltd. | Well prog execution facilitation system and method |
US8065307B2 (en) * | 2006-12-20 | 2011-11-22 | Microsoft Corporation | Parsing, analysis and scoring of document content |
US8954469B2 (en) | 2007-03-14 | 2015-02-10 | Vcvciii Llc | Query templates and labeled search tip system, methods, and techniques |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US8279465B2 (en) | 2007-05-01 | 2012-10-02 | Kofax, Inc. | Systems and methods for routing facsimiles based on content |
US8451475B2 (en) | 2007-05-01 | 2013-05-28 | Kofax, Inc. | Systems and methods for routing a facsimile confirmation based on content |
US9069861B2 (en) | 2007-05-29 | 2015-06-30 | Brainspace Corporation | Query generation system for an information retrieval system |
US20080312985A1 (en) * | 2007-06-18 | 2008-12-18 | Microsoft Corporation | Computerized evaluation of user impressions of product artifacts |
US8006121B1 (en) * | 2007-06-28 | 2011-08-23 | Apple Inc. | Systems and methods for diagnosing and fixing electronic devices |
US20090228777A1 (en) * | 2007-08-17 | 2009-09-10 | Accupatent, Inc. | System and Method for Search |
US9053089B2 (en) | 2007-10-02 | 2015-06-09 | Apple Inc. | Part-of-speech tagging using latent analogy |
US8594996B2 (en) | 2007-10-17 | 2013-11-26 | Evri Inc. | NLP-based entity recognition and disambiguation |
US8700604B2 (en) * | 2007-10-17 | 2014-04-15 | Evri, Inc. | NLP-based content recommender |
US8694483B2 (en) | 2007-10-19 | 2014-04-08 | Xerox Corporation | Real-time query suggestion in a troubleshooting context |
WO2009059297A1 (en) * | 2007-11-01 | 2009-05-07 | Textdigger, Inc. | Method and apparatus for automated tag generation for digital content |
US8580149B2 (en) | 2007-11-16 | 2013-11-12 | Lawrence Livermore National Security, Llc | Barium iodide and strontium iodide crystals and scintillators implementing the same |
US8620662B2 (en) | 2007-11-20 | 2013-12-31 | Apple Inc. | Context-aware unit selection |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US20090226872A1 (en) * | 2008-01-16 | 2009-09-10 | Nicholas Langdon Gunther | Electronic grading system |
US8065143B2 (en) | 2008-02-22 | 2011-11-22 | Apple Inc. | Providing text input using speech data and non-speech data |
US20090228296A1 (en) * | 2008-03-04 | 2009-09-10 | Collarity, Inc. | Optimization of social distribution networks |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US20090276694A1 (en) * | 2008-05-02 | 2009-11-05 | Accupatent, Inc. | System and Method for Document Display |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US8103669B2 (en) | 2008-05-23 | 2012-01-24 | Xerox Corporation | System and method for semi-automatic creation and maintenance of query expansion rules |
US8464150B2 (en) | 2008-06-07 | 2013-06-11 | Apple Inc. | Automatic language identification for dynamic text processing |
US8438178B2 (en) * | 2008-06-26 | 2013-05-07 | Collarity Inc. | Interactions among online digital identities |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
WO2010021530A1 (en) * | 2008-08-20 | 2010-02-25 | Instituto Tecnologico Y De Estudios Superiores De Monterrey | System and method for displaying relevant textual advertising based on semantic similarity |
US8768702B2 (en) | 2008-09-05 | 2014-07-01 | Apple Inc. | Multi-tiered voice feedback in an electronic device |
US8898568B2 (en) | 2008-09-09 | 2014-11-25 | Apple Inc. | Audio user interface |
TW201013430A (en) | 2008-09-17 | 2010-04-01 | Ibm | Method and system for providing suggested tags associated with a target page for manipulation by a user |
US8583418B2 (en) | 2008-09-29 | 2013-11-12 | Apple Inc. | Systems and methods of detecting language and natural language strings for text to speech synthesis |
US8712776B2 (en) | 2008-09-29 | 2014-04-29 | Apple Inc. | Systems and methods for selective text to speech synthesis |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US20100094814A1 (en) * | 2008-10-13 | 2010-04-15 | James Alexander Levy | Assessment Generation Using the Semantic Web |
US8156120B2 (en) | 2008-10-22 | 2012-04-10 | James Brady | Information retrieval using user-generated metadata |
US20100114890A1 (en) * | 2008-10-31 | 2010-05-06 | Purediscovery Corporation | System and Method for Discovering Latent Relationships in Data |
US20100131569A1 (en) * | 2008-11-21 | 2010-05-27 | Robert Marc Jamison | Method & apparatus for identifying a secondary concept in a collection of documents |
WO2010067118A1 (en) | 2008-12-11 | 2010-06-17 | Novauris Technologies Limited | Speech recognition involving a mobile device |
US9311391B2 (en) * | 2008-12-30 | 2016-04-12 | Telecom Italia S.P.A. | Method and system of content recommendation |
US8862252B2 (en) | 2009-01-30 | 2014-10-14 | Apple Inc. | Audio user interface for displayless electronic device |
US8380507B2 (en) | 2009-03-09 | 2013-02-19 | Apple Inc. | Systems and methods for determining the language to use for speech generated by a text to speech engine |
US8166032B2 (en) * | 2009-04-09 | 2012-04-24 | MarketChorus, Inc. | System and method for sentiment-based text classification and relevancy ranking |
US9245243B2 (en) * | 2009-04-14 | 2016-01-26 | Ureveal, Inc. | Concept-based analysis of structured and unstructured data using concept inheritance |
CA2796408A1 (en) * | 2009-04-16 | 2010-10-21 | Evri Inc. | Enhanced advertisement targeting |
US8346685B1 (en) | 2009-04-22 | 2013-01-01 | Equivio Ltd. | Computerized system for enhancing expert-based processes and methods useful in conjunction therewith |
US8527523B1 (en) | 2009-04-22 | 2013-09-03 | Equivio Ltd. | System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith |
US8533194B1 (en) | 2009-04-22 | 2013-09-10 | Equivio Ltd. | System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith |
WO2010134885A1 (en) * | 2009-05-20 | 2010-11-25 | Farhan Sarwar | Predicting the correctness of eyewitness' statements with semantic evaluation method (sem) |
US10540976B2 (en) | 2009-06-05 | 2020-01-21 | Apple Inc. | Contextual voice commands |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US20120311585A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Organizing task items that represent tasks to perform |
US8510308B1 (en) * | 2009-06-16 | 2013-08-13 | Google Inc. | Extracting semantic classes and instances from text |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
GB2472250A (en) * | 2009-07-31 | 2011-02-02 | Stephen Timothy Morris | Method for determining document relevance |
US8666994B2 (en) | 2009-09-26 | 2014-03-04 | Sajari Pty Ltd | Document analysis and association system and method |
US8645372B2 (en) * | 2009-10-30 | 2014-02-04 | Evri, Inc. | Keyword-based search engine results using enhanced query strategies |
US8682649B2 (en) | 2009-11-12 | 2014-03-25 | Apple Inc. | Sentiment prediction from textual data |
KR20120113717A (en) * | 2009-12-04 | 2012-10-15 | 소니 주식회사 | Search device, search method, and program |
US8600743B2 (en) | 2010-01-06 | 2013-12-03 | Apple Inc. | Noise profile determination for voice-related feature |
US8311838B2 (en) | 2010-01-13 | 2012-11-13 | Apple Inc. | Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts |
US8381107B2 (en) | 2010-01-13 | 2013-02-19 | Apple Inc. | Adaptive audio feedback system and method |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US8661361B2 (en) | 2010-08-26 | 2014-02-25 | Sitting Man, Llc | Methods, systems, and computer program products for navigating between visual components |
US9715332B1 (en) | 2010-08-26 | 2017-07-25 | Cypress Lake Software, Inc. | Methods, systems, and computer program products for navigating between visual components |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US8780130B2 (en) | 2010-11-30 | 2014-07-15 | Sitting Man, Llc | Methods, systems, and computer program products for binding attributes between visual components |
WO2011089450A2 (en) | 2010-01-25 | 2011-07-28 | Andrew Peter Nelson Jerram | Apparatuses, methods and systems for a digital conversation management platform |
US9183288B2 (en) * | 2010-01-27 | 2015-11-10 | Kinetx, Inc. | System and method of structuring data for search using latent semantic analysis techniques |
US10397639B1 (en) | 2010-01-29 | 2019-08-27 | Sitting Man, Llc | Hot key systems and methods |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US9710556B2 (en) | 2010-03-01 | 2017-07-18 | Vcvc Iii Llc | Content recommendation based on collections of entities |
US8645125B2 (en) | 2010-03-30 | 2014-02-04 | Evri, Inc. | NLP-based systems and methods for providing quotations |
US8255401B2 (en) | 2010-04-28 | 2012-08-28 | International Business Machines Corporation | Computer information retrieval using latent semantic structure via sketches |
US7933859B1 (en) | 2010-05-25 | 2011-04-26 | Recommind, Inc. | Systems and methods for predictive coding |
US8161325B2 (en) | 2010-05-28 | 2012-04-17 | Bank Of America Corporation | Recommendation of relevant information to support problem diagnosis |
US20110302153A1 (en) * | 2010-06-04 | 2011-12-08 | Google Inc. | Service for Aggregating Event Information |
US8713021B2 (en) | 2010-07-07 | 2014-04-29 | Apple Inc. | Unsupervised document clustering using latent semantic density analysis |
US8838633B2 (en) | 2010-08-11 | 2014-09-16 | Vcvc Iii Llc | NLP-based sentiment analysis |
US8719006B2 (en) | 2010-08-27 | 2014-05-06 | Apple Inc. | Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis |
US9405848B2 (en) | 2010-09-15 | 2016-08-02 | Vcvc Iii Llc | Recommending mobile device activities |
US8719014B2 (en) | 2010-09-27 | 2014-05-06 | Apple Inc. | Electronic device with text error correction based on voice recognition data |
US11398310B1 (en) | 2010-10-01 | 2022-07-26 | Cerner Innovation, Inc. | Clinical decision support for sepsis |
US10431336B1 (en) | 2010-10-01 | 2019-10-01 | Cerner Innovation, Inc. | Computerized systems and methods for facilitating clinical decision making |
US20120089421A1 (en) | 2010-10-08 | 2012-04-12 | Cerner Innovation, Inc. | Multi-site clinical decision support for sepsis |
US10734115B1 (en) | 2012-08-09 | 2020-08-04 | Cerner Innovation, Inc. | Clinical decision support for sepsis |
US8725739B2 (en) | 2010-11-01 | 2014-05-13 | Evri, Inc. | Category-based content recommendation |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US10515147B2 (en) | 2010-12-22 | 2019-12-24 | Apple Inc. | Using statistical language models for contextual lookup |
US10289802B2 (en) | 2010-12-27 | 2019-05-14 | The Board Of Trustees Of The Leland Stanford Junior University | Spanning-tree progression analysis of density-normalized events (SPADE) |
US10628553B1 (en) | 2010-12-30 | 2020-04-21 | Cerner Innovation, Inc. | Health information transformation system |
US8781836B2 (en) | 2011-02-22 | 2014-07-15 | Apple Inc. | Hearing assistance system for providing consistent human speech |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US9116995B2 (en) | 2011-03-30 | 2015-08-25 | Vcvc Iii Llc | Cluster-based identification of news stories |
US20120310642A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Automatically creating a mapping between text data and audio data |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US9785634B2 (en) | 2011-06-04 | 2017-10-10 | Recommind, Inc. | Integration and combination of random sampling and document batching |
US8812294B2 (en) | 2011-06-21 | 2014-08-19 | Apple Inc. | Translating phrases from one language into another using an order-based set of declarative rules |
JP5742506B2 (en) * | 2011-06-27 | 2015-07-01 | 日本電気株式会社 | Document similarity calculation device |
US8983963B2 (en) | 2011-07-07 | 2015-03-17 | Software Ag | Techniques for comparing and clustering documents |
US8706472B2 (en) | 2011-08-11 | 2014-04-22 | Apple Inc. | Method for disambiguating multiple readings in language conversion |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US9442930B2 (en) | 2011-09-07 | 2016-09-13 | Venio Inc. | System, method and computer program product for automatic topic identification using a hypertext corpus |
US9442928B2 (en) | 2011-09-07 | 2016-09-13 | Venio Inc. | System, method and computer program product for automatic topic identification using a hypertext corpus |
US8762156B2 (en) | 2011-09-28 | 2014-06-24 | Apple Inc. | Speech recognition repair using contextual information |
US8856156B1 (en) | 2011-10-07 | 2014-10-07 | Cerner Innovation, Inc. | Ontology mapper |
US9430563B2 (en) | 2012-02-02 | 2016-08-30 | Xerox Corporation | Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US8805842B2 (en) | 2012-03-30 | 2014-08-12 | Her Majesty The Queen In Right Of Canada, As Represented By The Minister Of National Defence, Ottawa | Method for displaying search results |
US10249385B1 (en) | 2012-05-01 | 2019-04-02 | Cerner Innovation, Inc. | System and method for record linkage |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US8775442B2 (en) | 2012-05-15 | 2014-07-08 | Apple Inc. | Semantic search using a single-source semantic model |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
WO2013185109A2 (en) | 2012-06-08 | 2013-12-12 | Apple Inc. | Systems and methods for recognizing textual identifiers within a plurality of words |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9002842B2 (en) | 2012-08-08 | 2015-04-07 | Equivio Ltd. | System and method for computerized batching of huge populations of electronic documents |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
US8935167B2 (en) | 2012-09-25 | 2015-01-13 | Apple Inc. | Exemplar-based latent perceptual modeling for automatic speech recognition |
US9075846B2 (en) | 2012-12-12 | 2015-07-07 | King Fahd University Of Petroleum And Minerals | Method for retrieval of arabic historical manuscripts |
US11894117B1 (en) | 2013-02-07 | 2024-02-06 | Cerner Innovation, Inc. | Discovering context-specific complexity and utilization sequences |
KR102516577B1 (en) | 2013-02-07 | 2023-04-03 | 애플 인크. | Voice trigger for a digital assistant |
US10946311B1 (en) | 2013-02-07 | 2021-03-16 | Cerner Innovation, Inc. | Discovering context-specific serial health trajectories |
US10769241B1 (en) | 2013-02-07 | 2020-09-08 | Cerner Innovation, Inc. | Discovering context-specific complexity and utilization sequences |
US9308446B1 (en) | 2013-03-07 | 2016-04-12 | Posit Science Corporation | Neuroplasticity games for social cognition disorders |
US9972030B2 (en) | 2013-03-11 | 2018-05-15 | Criteo S.A. | Systems and methods for the semantic modeling of advertising creatives in targeted search advertising campaigns |
US10572476B2 (en) | 2013-03-14 | 2020-02-25 | Apple Inc. | Refining a search based on schedule items |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
US9977779B2 (en) | 2013-03-14 | 2018-05-22 | Apple Inc. | Automatic supplementation of word correction dictionaries |
US10642574B2 (en) | 2013-03-14 | 2020-05-05 | Apple Inc. | Device, method, and graphical user interface for outputting captions |
US9733821B2 (en) | 2013-03-14 | 2017-08-15 | Apple Inc. | Voice control to diagnose inadvertent activation of accessibility features |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
US8788516B1 (en) | 2013-03-15 | 2014-07-22 | Purediscovery Corporation | Generating and using social brains with complimentary semantic brains and indexes |
US10748529B1 (en) | 2013-03-15 | 2020-08-18 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
CN112230878A (en) | 2013-03-15 | 2021-01-15 | 苹果公司 | Context-sensitive handling of interrupts |
US11151899B2 (en) | 2013-03-15 | 2021-10-19 | Apple Inc. | User training by intelligent digital assistant |
WO2014144949A2 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | Training an at least partial voice command system |
US9122681B2 (en) | 2013-03-15 | 2015-09-01 | Gordon Villy Cormack | Systems and methods for classifying electronic information using advanced active learning techniques |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
US9760644B2 (en) | 2013-04-17 | 2017-09-12 | Google Inc. | Embedding event creation link in a document |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
EP3008641A1 (en) | 2013-06-09 | 2016-04-20 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
CN105265005B (en) | 2013-06-13 | 2019-09-17 | 苹果公司 | System and method for the urgent call initiated by voice command |
JP6225543B2 (en) * | 2013-07-30 | 2017-11-08 | 富士通株式会社 | Discussion support program, discussion support apparatus, and discussion support method |
WO2015020942A1 (en) | 2013-08-06 | 2015-02-12 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US10446273B1 (en) | 2013-08-12 | 2019-10-15 | Cerner Innovation, Inc. | Decision support with clinical nomenclatures |
US10483003B1 (en) | 2013-08-12 | 2019-11-19 | Cerner Innovation, Inc. | Dynamically determining risk of clinical condition |
US10378329B2 (en) | 2013-08-20 | 2019-08-13 | Nabors Drilling Technologies Usa, Inc. | Rig control system and methods |
JP6241211B2 (en) * | 2013-11-06 | 2017-12-06 | 富士通株式会社 | Education support program, method, apparatus and system |
US10224119B1 (en) | 2013-11-25 | 2019-03-05 | Quire, Inc. (Delaware corporation) | System and method of prediction through the use of latent semantic indexing |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
EP3149728B1 (en) | 2014-05-30 | 2019-01-16 | Apple Inc. | Multi-command single utterance input method |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US10049102B2 (en) | 2014-06-26 | 2018-08-14 | Hcl Technologies Limited | Method and system for providing semantics based technical support |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9576023B2 (en) | 2014-07-14 | 2017-02-21 | International Business Machines Corporation | User interface for summarizing the relevance of a document to a query |
US9703858B2 (en) | 2014-07-14 | 2017-07-11 | International Business Machines Corporation | Inverted table for storing and querying conceptual indices |
US10503761B2 (en) | 2014-07-14 | 2019-12-10 | International Business Machines Corporation | System for searching, recommending, and exploring documents through conceptual associations |
US10162882B2 (en) | 2014-07-14 | 2018-12-25 | International Business Machines Corporation | Automatically linking text to concepts in a knowledge base |
US10437869B2 (en) | 2014-07-14 | 2019-10-08 | International Business Machines Corporation | Automatic new concept definition |
US9710570B2 (en) | 2014-07-14 | 2017-07-18 | International Business Machines Corporation | Computing the relevance of a document to concepts not specified in the document |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9734144B2 (en) | 2014-09-18 | 2017-08-15 | Empire Technology Development Llc | Three-dimensional latent semantic analysis |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10360229B2 (en) | 2014-11-03 | 2019-07-23 | SavantX, Inc. | Systems and methods for enterprise data search and analysis |
US10915543B2 (en) | 2014-11-03 | 2021-02-09 | SavantX, Inc. | Systems and methods for enterprise data search and analysis |
US20160154844A1 (en) * | 2014-11-29 | 2016-06-02 | Infinitt Healthcare Co., Ltd. | Intelligent medical image and medical information search method |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10229117B2 (en) | 2015-06-19 | 2019-03-12 | Gordon V. Cormack | Systems and methods for conducting a highly autonomous technology-assisted review classification |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9734141B2 (en) | 2015-09-22 | 2017-08-15 | Yang Chang | Word mapping |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10003559B2 (en) * | 2015-11-12 | 2018-06-19 | International Business Machines Corporation | Aggregating redundant messages in a group chat |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US9836669B2 (en) | 2016-02-22 | 2017-12-05 | International Business Machines Corporation | Generating a reference digital image based on an indicated time frame and searching for other images using the reference digital image |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US10372872B2 (en) | 2016-04-22 | 2019-08-06 | The Boeing Company | Providing early warning and assessment of vehicle design problems with potential operational impact |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
US20180173850A1 (en) * | 2016-12-21 | 2018-06-21 | Kevin Erich Heinrich | System and Method of Semantic Differentiation of Individuals Based On Electronic Medical Records |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11328128B2 (en) | 2017-02-28 | 2022-05-10 | SavantX, Inc. | System and method for analysis and navigation of data |
EP3590053A4 (en) | 2017-02-28 | 2020-11-25 | SavantX, Inc. | System and method for analysis and navigation of data |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | Synchronization and task delegation of a digital assistant |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
CN107943978B (en) * | 2017-11-29 | 2020-11-24 | 北京金堤科技有限公司 | Storage method and device for user access records |
US10902066B2 (en) | 2018-07-23 | 2021-01-26 | Open Text Holdings, Inc. | Electronic discovery using predictive filtering |
JP7255684B2 (en) * | 2019-07-17 | 2023-04-11 | 富士通株式会社 | Specific Programs, Specific Methods, and Specific Devices |
DE102019212421A1 (en) | 2019-08-20 | 2021-02-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and device for identifying similar documents |
US11730420B2 (en) | 2019-12-17 | 2023-08-22 | Cerner Innovation, Inc. | Maternal-fetal sepsis indicator |
CN113377923B (en) * | 2021-06-25 | 2024-01-09 | 北京百度网讯科技有限公司 | Semantic retrieval method, apparatus, device, storage medium and computer program product |
DE102022203475A1 (en) | 2022-04-07 | 2023-10-12 | Zf Friedrichshafen Ag | System for generating a human-perceptible explanation output for an anomaly predicted by an anomaly detection module on high-frequency sensor data or quantities derived therefrom of an industrial manufacturing process, method and computer program for monitoring artificial intelligence-based anomaly detection in high-frequency sensor data or quantities derived therefrom of an industrial manufacturing process and method and computer program for monitoring artificial intelligence-based anomaly detection during an end-of-line acoustic test of a transmission |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4384325A (en) * | 1980-06-23 | 1983-05-17 | Sperry Corporation | Apparatus and method for searching a data base using variable search criteria |
DE3069324D1 (en) * | 1980-12-19 | 1984-10-31 | Ibm | Interactive data retrieval apparatus |
US4495566A (en) * | 1981-09-30 | 1985-01-22 | System Development Corporation | Method and means using digital data processing means for locating representations in a stored textual data base |
US4506326A (en) * | 1983-02-28 | 1985-03-19 | International Business Machines Corporation | Apparatus and method for synthesizing a query for accessing a relational data base |
US4575798A (en) * | 1983-06-03 | 1986-03-11 | International Business Machines Corporation | External sorting using key value distribution and range formation |
- 1988-09-15: US application US07/244,349, published as US4839853A (not active, Expired - Lifetime)
- 1989-04-12: CA application CA000596524A, published as CA1306062C (not active, Expired - Lifetime)
Also Published As
Publication number | Publication date |
---|---|
US4839853A (en) | 1989-06-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA1306062C (en) | Computer information retrieval using latent semantic structure | |
US5987446A (en) | Searching large collections of text using multiple search engines concurrently | |
Lochbaum et al. | Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval | |
Liu et al. | Mining topic-specific concepts and definitions on the web | |
Ding | A similarity-based probability model for latent semantic indexing | |
Wang et al. | Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization | |
Deerwester et al. | Indexing by latent semantic analysis | |
US7269598B2 (en) | Extended functionality for an inverse inference engine based web search | |
Dumais | Improving the retrieval of information from external sources | |
US5301109A (en) | Computerized cross-language document retrieval using latent semantic indexing | |
EP0597630B1 (en) | Method for resolution of natural-language queries against full-text databases | |
Croft | Advances in information retrieval: recent research from the center for intelligent information retrieval | |
US20070143235A1 (en) | Method, system and computer program product for organizing data | |
Cruz et al. | Measuring structural similarity among web documents: preliminary results | |
CA2423476C (en) | Extended functionality for an inverse inference engine based web search | |
Kim et al. | Cluster-based faq retrieval using latent term weights | |
Corston-Oliver et al. | Less is more: eliminating index terms from subordinate clauses | |
Cigarrán et al. | Automatic selection of noun phrases as document descriptors in an FCA-based information retrieval system | |
Feuer et al. | Implementing and evaluating phrasal query suggestions for proximity search | |
Rodrigues et al. | Concept based search using LSI and automatic keyphrase extraction | |
Rungsawang | Dsir: The first trec-7 attempt | |
Rotella et al. | A domain based approach to information retrieval in digital libraries | |
Liao et al. | A domain‐independent software reuse framework based on a hierarchical thesaurus | |
Güran et al. | A comparison of feature and semantic-based summarization algorithms for Turkish | |
Gadge et al. | Query expansion using WordNet in N-layer vector space model | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MKEX | Expiry |