WO2007057799A1 - Method, system and device for obtaining a representation of a text - Google Patents

Method, system and device for obtaining a representation of a text

Info

Publication number
WO2007057799A1
Authority
WO
WIPO (PCT)
Prior art keywords
fragments
fragment
candidate files
cluster
character strings
Prior art date
Application number
PCT/IB2006/053975
Other languages
French (fr)
Inventor
Gijs Geleijnse
Johannes H. M. Korst
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of obtaining a data file (20) including a representation of a text, e.g. the lyrics of a song, includes obtaining multiple candidate files (19) having character strings as contents, on the basis of a search query submitted to a server system (5) arranged to permit a search of the contents of at least one server (1-3) to be performed. It also includes partitioning the contents of the multiple candidate files (19) into fragments, and forming the representation of the text as an ordered sequence (43) of fragments, at least one of which is provided on the basis of a cluster (50) of fragments among those of fragments (41,48) selected from the candidate files (19) that satisfy a certain criterion.

Description

Method, system and device for obtaining a representation of a text
The invention relates to a method of obtaining a data file including a representation of a text, e.g. the lyrics of a song, including obtaining multiple candidate files having character strings as contents, on the basis of a search query submitted to a server system arranged to permit a search of the contents of at least one server to be performed.
The invention also relates to a system for obtaining a data file including a representation of a text, e.g. the lyrics of a song, including a client for submitting a search query to a server system arranged to permit a search of the contents of at least one server to be performed, and for obtaining multiple candidate files having character strings as contents in response to the search query.
The invention also relates to a consumer electronics device, comprising a network port and configured for communicating via the network port with a server system arranged to permit a search of the contents of at least one server.
The invention also relates to a computer program.
Respective examples of such a method, system, consumer electronics device and computer program are known from EvilLyrics, http://www.evillabs.sk/evillyrics FAQ: "How does it determine where to look for lyrics?": browse candidates manually, 22 November 2003. EvilLyrics uses general search engines (Google, Alltheweb, Altavista) to look for lyrics. From the results returned it picks those which are known lyrics sites. It downloads the first of them and tries to parse it using built-in filters. If the page seems to be fitting, it displays what it considers to be the lyrics in a lyrics pane. Sometimes it returns pages from lyrics sites which are not actual lyrics pages but, for example, a list of lyrics for the whole album. In this case EvilLyrics parses the page and tries to find the link to a corresponding lyrics page. If this fails, it resumes with another hit from the result set returned by the search engine. If all the results are used and none of them seems to be what it was looking for, an error message is displayed and the lyrics page stays blank. A problem of downloading the first of the picked lyrics sites is that the first lyrics site might not host the most complete and error-free rendition of the lyrics that are sought. It is up to the user to assess whether it is plausible that the returned lyrics are complete and error-free, but he or she can only do so without reference to results obtained from other lyrics sites.
It is an object of the invention to provide a method, system, consumer electronics device and computer program of the types mentioned in the opening paragraphs that permit a data processing system to provide a data file including a relatively complete and accurate representation of a text stored in multiple, slightly different versions on at least one server.
This object is achieved by the method according to the invention which is characterised by partitioning the contents of the multiple candidate files into fragments, and forming the representation of the text as an ordered sequence of fragments, at least one of which is provided on the basis of a cluster of fragments among those of fragments selected from the candidate files that satisfy a certain criterion.
Thus, the representation of the text is, in general, based on fragments from several candidate files, so that full use is made of the different versions. The certain criterion is usable to select the best of several available fragments to form a cluster of fragments from the multiple candidate files on which to base the corresponding fragment of the representation of the text. Such a comparison obviates the need to assess how plausible it is that one of the multiple candidate files is correct on the basis of only that one candidate file. The fragments from the other candidate files form a reference, enabling a data processing system to detect missing or aberrant fragments.
In an embodiment, the cluster of fragments is formed by comparing data based on contents of the fragments from the candidate files, and only fragments for which the compared data satisfies a similarity criterion with respect to data based on contents of a fragment from at least one other of the candidate files are included in the cluster.
Thus, a criterion is used that is well-suited to implementation in a data processing system, since it requires no analysis of the meaning of the information contents of the fragments.
An embodiment includes extracting a certain number of different character strings from each of the fragments selected from the candidate files, to form respective characterising sets of character strings for the selected fragments, and comparing a plurality of the characterising sets of character strings to at least one of the other characterising sets of character strings, wherein fragments for which the characterising sets of character strings have more than a certain number of character strings in common are added to the cluster.
This embodiment has the advantage of being computationally efficient relative to other possible types of comparison between character strings in data fragments, such as, for example, longest common sub-string comparison. Each comparison of two fragments is linear in the length of the text formed by all character strings in two fragments. To extract a certain, i.e. corresponding, number of character strings, say k character strings, from a body of n character strings requires O(n) operations. To sort k character strings in an order, e.g. in alphabetical order, requires O(k log k) operations. To compare k character strings requires O(k) operations. The total number of operations for a comparison is thus O(n + k + k log k), whereas it is O(n^2) for the longest common sub-string comparison.
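The sort-then-compare cost described above can be illustrated with a minimal Python sketch. It assumes the characterising sets have already been extracted; the names `count_common`, `similar` and `threshold` are hypothetical and do not appear in the source.

```python
def count_common(strings_a, strings_b):
    """Count strings shared by two characterising sets using a sorted merge."""
    a = sorted(set(strings_a))  # O(k log k)
    b = sorted(set(strings_b))  # O(k log k)
    i = j = common = 0
    while i < len(a) and j < len(b):  # single O(k) pass over both sorted lists
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def similar(strings_a, strings_b, threshold):
    """True when the sets share more than `threshold` strings (assumed criterion)."""
    return count_common(strings_a, strings_b) > threshold
```

In Python a set intersection would give the same count; the explicit merge is shown only because it mirrors the sort and compare steps counted in the text.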
In an embodiment, a fragment of the representation of the text is provided on the basis of a cluster of fragments of the candidate files by partitioning each fragment in the cluster into sub-fragments, and forming the fragment as an ordered sequence of sub-fragments, at least one of which is constructed on the basis of a cluster of sub-fragments among those of sub-fragments selected from the sub-fragments that satisfy a certain criterion.
This embodiment has the advantage of resulting in a more accurate version of a particular fragment where the corresponding fragments in the multiple candidate files differ widely.
An embodiment includes partitioning the multiple candidate files into fragments included in respective associated ordered sequences of fragments, wherein a fragment of the representation of the text is constructed on the basis of a cluster of fragments among those of a selected plurality of fragments from the ordered sequences that satisfy the certain criterion, wherein, of each of the respective ordered sequences of which at least one fragment is included among the selected plurality, at least one fragment is selected to be included among the selected plurality on the basis of its position in the ordered sequence of fragments. The effects of these measures are on the one hand to make it possible to ensure that fragments from the candidate files that are unlikely to be a version of the fragment of the representation of the text to be constructed are not assessed as to whether they satisfy the certain criterion. This includes fragments at the beginning of candidate files where the fragment of the representation of the text to be constructed is at the end. It also includes fragments at positions before a fragment previously selected from a candidate file. On the other hand, since the multiple candidate files are obtained on the basis of a single search query, it is likely that they are all more or less the same. Thus, corresponding fragments are likely to be at corresponding positions in each partition of a candidate file. Where one or more candidate files have extra information, then a corresponding fragment will be at a corresponding position offset by a certain number of fragments relative to the other candidate files. It is therefore efficient and effective to select fragments for assessment against the certain criterion on the basis of their position.
In a variant, the selected plurality of fragments includes at least one fragment from at least one of the ordered sequences of fragments associated with the candidate files at one of at least one position in the ordered sequence following that of the fragment selected on the basis of its position.
This variant has the effect of allowing full use of the information contents of a candidate file with extra inserted information relative to the other candidate files. Account is taken of the fact that a fragment in such a candidate file is offset relative to the similar fragment in another candidate file.
In an embodiment, for a fragment subsequent to another fragment of the ordered sequence of fragments forming the representation of the text, selection of at least one fragment on the basis of its position from an ordered sequence associated with a candidate file includes determining whether any fragment in the ordered sequence of fragments has been included in a cluster on the basis of which another fragment in the ordered sequence of fragments forming the representation of the text has been constructed, and if such is determined to be the case, selecting a fragment at a position immediately succeeding that of a last selected one of the fragments determined to have been included in a cluster.
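The position-based selection described above might be sketched as follows. This is only an illustration under stated assumptions: `sequences` holds one ordered list of fragments per candidate file, `last_used` records the position of the last fragment of each file that was included in a cluster (-1 if none), and `lookahead` is a hypothetical bound on how many following positions are also considered.

```python
def next_candidates(sequences, last_used, lookahead=1):
    """Select, per candidate file, the fragment immediately after the last
    clustered one, plus up to `lookahead` fragments following it."""
    picked = []
    for file_id, fragments in enumerate(sequences):
        start = last_used[file_id] + 1  # position immediately succeeding the last used one
        stop = min(start + 1 + lookahead, len(fragments))
        for pos in range(start, stop):
            picked.append((file_id, pos, fragments[pos]))
    return picked
```

Fragments before `start` are never revisited, which reflects the statement that fragments at positions before a previously selected fragment are not assessed.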
This embodiment is relatively efficient, since fragments that are unlikely to be a version of the fragment of the representation of the text that is to be constructed are not considered.
In an embodiment, each ordered sequence of fragments associated with one of the multiple candidate files is formed by partitioning the multiple candidate files into respective intermediate ordered sequences of fragments, obtaining a set of at least one dummy fragment, constructed such that the dummy fragments in the set satisfy the certain criterion, and appending a dummy fragment from the set to the intermediate ordered sequence of fragments to form the ordered sequence of fragments associated with the candidate files.
This helps to ensure that fragments appended to only one or a few of the candidate files also find their way into the representation of the text to be constructed. This is the case where the dummy fragment is retrieved from one candidate file and similar fragments are then identified in the other candidate files. It is then clear that there are no further fragments to be selected, since the last fragment from each ordered sequence has been selected.
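One way to picture the dummy fragment is as an end marker appended to every sequence. The sketch below is an assumption-laden illustration: the marker value `END_MARKER` and the helper names are hypothetical, and identical markers are assumed to satisfy the similarity criterion trivially.

```python
END_MARKER = "\x00END\x00"  # hypothetical dummy fragment, identical in every sequence


def with_end_marker(fragments):
    """Return the intermediate sequence with the dummy fragment appended."""
    return list(fragments) + [END_MARKER]


def all_finished(sequences, last_used):
    """True when the last selected fragment of every sequence is its final one,
    i.e. the dummy fragment has been reached in each candidate file."""
    return all(pos == len(seq) - 1 for seq, pos in zip(sequences, last_used))
```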
According to another aspect, the system for obtaining the data file according to the invention is characterised in that the system is configured to partition the contents of the candidate files into fragments, and to form the representation of the text as an ordered sequence of fragments, at least one of which is provided on the basis of a cluster of fragments among those of fragments selected from the candidate files that satisfy a certain criterion.
Preferably, the system is configured to execute a method according to the invention.
According to another aspect, the invention provides a consumer electronics device, comprising a network port and configured for communicating via the network port with a server system arranged to permit a search of the contents of at least one server, wherein the consumer electronics device comprises a system according to the invention.
According to another aspect, the invention provides a computer program including a set of instructions capable, when incorporated in a machine readable medium, of causing a system having information processing capabilities to perform a method according to the invention.
The invention also provides for a device for obtaining a data file including a representation of a text, e.g. the lyrics of a song, the device being configured
- for obtaining multiple candidate files having character strings,
- to partition the candidate files into fragments, and
- to form the representation of the text as an ordered sequence of fragments, at least one of which is provided on the basis of a cluster of fragments among those of fragments selected from the candidate files that satisfy a certain criterion.
The device may be implemented as a lyrics server for communicating with a consumer electronics device and a server system, e.g. a search engine.
The invention will now be explained in further detail with reference to the accompanying drawings, in which
Fig. 1 illustrates schematically an example of a system for application of a method of obtaining a representation of a text,
Fig. 2 is a flow chart showing a first example of a method of obtaining a representation of a text,
Fig. 3 is a flow chart showing a second example of a method of obtaining a representation of a text,
Fig. 4 is a flow chart illustrating an example of a method of forming a representation of a text on the basis of a set of candidate files, for use in either of the methods illustrated in Fig. 2 and Fig. 3,
Fig. 5 is a flow chart illustrating an implementation of the last step in the method illustrated in Fig. 4, and
Fig. 6 is a flow chart illustrating several steps that are conditionally executed when executing the method illustrated in Fig. 5.
In the following description, details will be given of methods wherein a text file containing the lyrics of a song is obtained on the basis of a query to a server system implementing a conventional search engine. The methods are, however, equally suited for obtaining representations of other kinds of text of which different versions are hosted on a plurality of servers, e.g. servers storing HTML files. Examples include files containing the text of well-known speeches or books, e.g. the Gettysburg address, Bible texts, etc.
In Fig. 1, first, second and third web servers 1-3 are connected to a wide area network (WAN) 4, e.g. the Internet. Each of the web servers 1-3 hosts a plurality of HTML files including character strings representing text and strings representing control codes for controlling the presentation of the text by a browser, i.e. a software application that enables a user to display and interact with the HTML documents hosted by the web servers 1-3. Of course, the number of web servers 1-3 is limited to three in Fig. 1 for simplicity, there being many more servers in a practical implementation.
A server system 5 is arranged to permit a search of the contents of files hosted on the web servers 1-3. The server system 5 implements a search engine. The search engine is of a type known per se, for example Google, Yahoo! search, MSN search etc. In alternative embodiments, the server system 5 is of a type submitting a search query to several of such search engines and amalgamating the results. The invention is not limited to HTML documents, but may also use the results of a search query submitted to a search engine arranged to search for other types of content including RSS feeds (a type of Extensible Markup Language format for web syndication) and PDF (Portable Document Format) files. Also, although the web servers 1-3 operate in accordance with the HTTP protocol, variants of the methods presented below make use of the results provided by search engines for searching FTP servers or search engines for the Gopher protocol.
Web search engines, such as those of which use is made in the situation depicted in Fig. 1, function by retrieving files from the web servers 1-3. These files are retrieved by a spider or crawler. The retrieved files are first converted to HTML, if they are in another format, and subsequently cached. The contents of the cached HTML files are indexed by analysing their contents. Data resulting from the indexing process is stored in an index database. When a search query is submitted to the server system 5, this search query is compared against data in the index database to return a result including links to the locations at which the indexed files were stored when retrieved by the crawler.
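The indexing and lookup described above can be pictured with a toy inverted index. This is a simplified sketch, not how any particular search engine works; the function names and the word-splitting rule are assumptions.

```python
import re
from collections import defaultdict


def build_index(cached_pages):
    """Map each lower-cased word to the set of URLs whose cached text contains it."""
    index = defaultdict(set)
    for url, text in cached_pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs whose cached text contains every word of the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```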
Search queries are submitted to the server system 5 in the form of regular expressions. A regular expression, sometimes known as a pattern, is a string that describes or matches a set of strings according to certain syntax rules.
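As a small illustration of a pattern describing a set of strings, the Python sketch below uses the standard `re` module. The pattern and the sample titles are invented for the example and are not taken from the source.

```python
import re

# One pattern that matches "yesterday lyrics", "Yesterday  Words", and so on.
pattern = re.compile(r"yesterday\s+(lyrics|words)", re.IGNORECASE)

titles = [
    "The Beatles - Yesterday lyrics",
    "Yesterday   Words and chords",
    "Yesterday (album review)",
]

matches = [title for title in titles if pattern.search(title)]
```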
The system illustrated in Fig. 1 includes a lyrics server 6. The system further includes a mobile content player 7, for example a cellular telephone with a decoder application for decoding compressed music files, such as files in the MP3, WMA or similar format. The mobile content player 7 is connected to the WAN 4 via a gateway 8 and cellular radio communications network 9. The lyrics server 6 is arranged to execute a method as will be described below, in order to provide the mobile content player 7 with a file comprising a representation of the lyrics of a song. The mobile content player 7 sends a message to the lyrics server 6 containing a request for a lyrics file. The request comprises data associated with the song of which the lyrics are requested. For example, the mobile content player 7 may retrieve one or more identification tags from the file containing the compressed audio data. Such identification tags generally include the name of the artist and the name of the track.
The lyrics server 6 receives the request and retrieves the data identifying the requested song from the request. This data is used to formulate a search query, a regular expression, which is submitted to the server system 5 via the WAN 4. A wrapper program is used to obtain search results from the server system 5 comprising the search engine. The wrapper program extracts data from the web-site provided as an interface to the search engine by the server system 5. The wrapper program uses the coherent structure of the web-site provided by the server system 5 to retrieve URLs (Uniform Resource Locators) of the locations at which files are stored that match the search query. The lyrics server 6 preferably uses an API (Application Program Interface) provided by the search engine to retrieve the contents of the URLs indicated as search results.
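How identification tags might be turned into a query string can be sketched as follows. The query syntax (quoted phrases plus the keyword "lyrics") is an assumption for illustration; the source does not specify the exact form of the query, and `build_query` is a hypothetical helper.

```python
def build_query(artist, title):
    """Quote the non-empty tag values as phrases and append the keyword 'lyrics'."""
    parts = ['"{}"'.format(value.strip())
             for value in (artist, title)
             if value and value.strip()]
    parts.append("lyrics")
    return " ".join(parts)
```

For example, `build_query("The Beatles", "Yesterday")` yields `"The Beatles" "Yesterday" lyrics`.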
In an embodiment, the API provides a method referred to as a cache request, with which a URL is submitted to the search engine's API service. The latter returns the contents of the URL as cached by the server system 5 when the search engine's crawler last visited the URL. The effect is that the lyrics server 6 need not handle error messages that might occur if it tried to retrieve the contents from one of the web servers 1-3 after the contents had been moved. Preferably, the cache maintained by the server system 5 is in the form of only HTML files. This obviates the need for conversion by the lyrics server 6. In one embodiment, illustrated in Fig. 2, the lyrics server 6 retrieves a set 10 of HTML files by submitting a series of cache requests to the server system 5 (step 11).
In a subsequent step 12 the lyrics server 6 generates what will be referred to herein as a super-set 13 of candidate files. It is noted that, as used herein, the term file means a sequence of bits stored as a single unit. The units need not correspond to the files maintained by the file system in use on the lyrics server 6. Nevertheless, in a simple, and for this reason preferred, implementation, the super-set 13 of candidate files is formed by a set of plain text files. Each text file is based on a corresponding one of the set 10 of HTML files.
When executing the step 12 of extracting lyrics from the set 10 of HTML files, the lyrics server 6 analyses the character strings and strings representing control codes for controlling a browser client. The character strings are filtered out to form the super-set 13 of candidate files, each based on a respective one of the set 10 of HTML files. In this process, HTML tags, advertisements and surrounding text are discarded or replaced by the corresponding character code in a plain text file. For example, the <br> tag is replaced by the newline character. The process of extracting lyrics to form the super-set 13 of candidate files is carried out on the basis of structural characteristics of lyrics so as to identify the lyrics within the total contents of an HTML document. Thus, a set of rules is used to form the super-set 13 of candidate files.
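The tag-stripping part of this step might look like the sketch below, which uses Python's standard `html.parser` module. It only replaces <br> by a newline and drops other tags and script or style content; removing advertisements and surrounding text would need the structural rules and is not attempted here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, turning <br> into newlines and dropping other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```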
Examples of rules include:
- The lyrics of a song are composed of blocks of text, separated by blank lines. There are typically one to ten blocks. Each block typically consists of one to ten lines, and each line typically consists of three to sixty characters, of which at least half are letters.
- The lines of the lyrics are explicitly broken by a <BR> tag and do not contain other HTML tags.
- The lyrics are usually preceded by a line containing at least the song title and sometimes the artists' names, the album name, or the term "Lyrics". This line is usually in a different font from that of the lyrics.
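The first rule above can be turned into a simple plausibility check on already extracted plain text. The sketch treats the "typical" ranges as hard limits, which is a simplifying assumption; a practical implementation would more likely score candidates than reject them outright. The function name is hypothetical.

```python
def looks_like_lyrics(text):
    """Check block, line and character counts against the typical ranges."""
    blocks = [b for b in text.strip().split("\n\n") if b.strip()]
    if not 1 <= len(blocks) <= 10:
        return False
    for block in blocks:
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if not 1 <= len(lines) <= 10:
            return False
        for line in lines:
            letters = sum(ch.isalpha() for ch in line)
            # Three to sixty characters, of which at least half are letters.
            if not 3 <= len(line) <= 60 or 2 * letters < len(line):
                return False
    return True
```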
In a subsequent step 14 a certain number k of different character strings are extracted from each of the multiple candidate files in the super-set 13 to form a characterising set of character strings for each of the multiple candidate files. These characterising sets are referred to as fingerprints herein, and shown as a table 15 of fingerprints in Fig. 2. Although the term fingerprints is used herein, it should be noted that these are not fingerprints in the conventional sense, as a fingerprint need not be unique for the candidate file for which, and on the basis of which, it is generated. The number k is the same for each of the candidate files in the super-set 13. In this embodiment it is a pre-determined number. It may be a variable, dependent on the number of candidate files in the super-set 13 of candidate files.
One of a number of alternative possible implementations of the step 14 of extracting fingerprints is employed.
In a first embodiment, different character strings in at least part of each of the multiple candidate files in the super-set 13 are sorted according to their length and the k character strings are selected from among the longest. In principle, the k longest are selected. However, there may be one or more rules prohibiting the selection of certain character strings. These might include character strings corresponding to words in the title, for example. In one variant, each of the super-set 13 of candidate files is analysed in its entirety. In another variant only a part of each candidate file is analysed to determine the k longest character strings. If the analysis reveals that there are several different character strings of equal length, then a sufficient number of them are chosen in accordance with a further rule, so as to arrive at a set of k character strings. For example, those of the character strings with equal length appearing with the highest frequency in the part of the candidate file of which the character strings have been sorted according to their length may be chosen to complete the fingerprint.
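A minimal sketch of this first variant follows. It assumes that a "character string" is a lower-cased word, that excluded strings (for example title words) are passed in by the caller, and that remaining ties are broken alphabetically; none of these details is fixed by the source.

```python
import re
from collections import Counter


def longest_strings_fingerprint(text, k, excluded=()):
    """Return the k longest distinct words; equal lengths are ordered by
    higher frequency first, then alphabetically."""
    words = re.findall(r"[a-z']+", text.lower())
    banned = {w.lower() for w in excluded}
    counts = Counter(w for w in words if w not in banned)
    ranked = sorted(counts, key=lambda w: (-len(w), -counts[w], w))
    return ranked[:k]
```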
In a second embodiment, the lyrics server 6 determines a frequency of occurrence of at least selected different character strings in a candidate file. It forms the fingerprint from those of the selected different character strings having a highest frequency of occurrence, at least within a selected frequency range. To prevent the selection of common stop words, such as "the", "a", conjugations of the verbs "to be" and "to have", etc., these can be excluded from selection. Alternatively, knowledge of the usual frequency of occurrence of the stop words in texts in the language of the lyrics under consideration can be used to limit the frequency range. The language of the lyrics may be made known to the lyrics server 6 via the request submitted by the mobile content player 7.
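The frequency-based variant can be sketched in the same style. The stop-word list below is a short illustrative sample for English, not a list given in the source, and the alphabetical tie-break is an assumption.

```python
import re
from collections import Counter

# Illustrative English stop words only; a real list would be longer.
STOP_WORDS = frozenset({
    "the", "a", "an", "is", "am", "are", "was", "were",
    "be", "been", "have", "has", "had",
})


def frequency_fingerprint(text, k, stop_words=STOP_WORDS):
    """Return the k most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:k]
```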
Regardless of the way in which the fingerprints in the table 15 of fingerprints are obtained, a table 16 of matching fingerprints is subsequently formed (step 17). In this step 17, the fingerprints based on (i.e. corresponding to) at least some of the character strings in the candidate files are each compared to at least one other of the fingerprints to determine whether they satisfy a measure of similarity. In the embodiment of Fig. 2, in contrast to that of Fig. 3, each fingerprint is compared to each other fingerprint. If a certain number of the k character strings in two fingerprints match, then the measure of similarity is satisfied. In one variant, the group of fingerprints satisfying the similarity measure and having most members is selected to form the table 16 of matching fingerprints.
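The all-pairs comparison of Fig. 2 could be sketched as below. How a "group" is delimited is not spelled out in the text; this sketch assumes that the group of a fingerprint consists of itself and every fingerprint sharing at least `min_common` strings with it, and returns the largest such group. Both `min_common` and the function name are hypothetical.

```python
def largest_matching_group(fingerprints, min_common):
    """Return the indices of the largest group of mutually referenced matches,
    where each group is built around one fingerprint."""
    sets = [set(fp) for fp in fingerprints]
    best = []
    for i in range(len(sets)):
        group = [j for j in range(len(sets))
                 if j == i or len(sets[i] & sets[j]) >= min_common]
        if len(group) > len(best):
            best = group
    return best
```

Comparing every fingerprint with every other one takes a number of comparisons that grows quadratically with the number of candidate files.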
Subsequently (step 18) the candidate files associated with the fingerprints in the table 16 of matching fingerprints are determined. These form a set 19 of candidate files on the basis of which a single lyrics file 20 is formed (step 21) using a method to be explained in more detail below with reference to Figs. 4-6. The lyrics file 20 is provided to the mobile content player 7 via the WAN 4, gateway 8 and cellular radio communications network 9.
A second method of obtaining the lyrics file 20 is illustrated in Fig. 3. A first step 22 corresponds to the first step 11 in the method of Fig. 2, and is used to obtain a set 23 of HTML files. Any of the variants discussed above with regard to the first step 11 of the method illustrated in Fig. 2 is usable to implement the first step 22 shown in Fig. 3. A super-set 24 of candidate files is created (step 25) in exactly the same way as in the corresponding step 12 in the method illustrated in Fig. 2. A first table 26 of fingerprints is created (step 27) as in the corresponding step 14 in the method of Fig. 2.
In the variant of Fig. 3, a clustering algorithm is used, in order to match fingerprints relatively efficiently. In a first step 28, an ordered table 29 of fingerprints is created by ranking the fingerprints in the first table 26 according to significance of at least one of the character strings in each fingerprint, as determined by the criterion for selecting the character strings for inclusion in the fingerprint. Thus, where the character strings in the candidate files of the super-set 24 have been sorted according to their length in order to select from them the longest k character strings, the fingerprints in the first table 26 are now sorted according to the length of the character strings comprised in them. In one variant the length of the longest character string in each fingerprint is used to rank the fingerprints. In another variant, the length of the shortest character string is taken. In another variant, the average length of the character strings in each fingerprint is determined and used to rank the fingerprints. In yet another variant, the sum of the lengths of the respective character strings in the fingerprints is used. In an advantageous variant, the ordering is carried out by first comparing the most significant character string of the fingerprints. When the measures associated therewith are equal (the lengths of the longest character strings in two fingerprints are equal), the next most significant character strings in two fingerprints are compared, etc.
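The advantageous ordering variant, in which the longest strings are compared first and ties fall through to the next longest, amounts to a lexicographic comparison of length lists. A minimal sketch, assuming length-based fingerprints and a hypothetical function name:

```python
def order_fingerprints(fingerprints):
    """Rank fingerprints by the lengths of their strings, most significant
    (longest) string first, with ties decided by the next longest string."""
    def key(fingerprint):
        return sorted((len(s) for s in fingerprint), reverse=True)
    # Python compares lists element by element, which gives the tie-breaking rule.
    return sorted(fingerprints, key=key, reverse=True)
```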
Where, in the step 27 of extracting the fingerprints, the frequency of appearance of selected character strings has been used, the ordered table 29 ranks the fingerprints according to the frequency associated with one or several of the character strings in the respective fingerprints. In one variant, the fingerprints are ranked according to the sum of the frequencies of appearance of the character strings forming the respective fingerprints.
A base set 30 of candidate files is now selected (step 31). The base set 30 starts with at least one candidate file, for which the fingerprint appears at the top of the ordered table 29 of fingerprints. The effect of the sorting operation (step 28) is that the fingerprints appearing at the top of the ordered table 29 are likely to be fingerprints for complete lyrics, whereas those near to the bottom are likely to be fingerprints for incomplete lyrics. Thus, the clustering starts with the candidate files most likely to represent the "correct" lyrics.
In the preferred variant, the top of the ordered table 29 is searched for two fingerprints having at least C character strings in common. The associated candidate files are assigned to the base set 30 as initial candidate files. Because the initial candidate files are selected from those for which the fingerprints appear at the top of the ordered table 29, they are most likely to represent a complete version of the lyrics.
In a next step 32 a further fingerprint is compared to the fingerprints for only those candidate files that have already been added to the base set 30. If the further fingerprint does not satisfy the similarity criterion, a next one of the fingerprints in the ordered table 29 is selected. If the fingerprint does satisfy the similarity criterion, the associated candidate file is added to the base set (step 33).
Assuming that there are N candidate files in the super-set 24, the steps 32,33 to add candidate files to the base set 30 are repeated until the base set is large enough. The criterion for this is that it comprise more than N/i members, with 2 < i < N. If the criterion is not satisfied after all fingerprints have been compared, then a different pair of initial candidate files is selected for inclusion in at least one further base set. This is done in such a way that none of the different pair has been selected as initial candidate file for any of the previously formed base sets.
If the first or any of the further base sets satisfies the criterion of including more than N/i members, then the set 19 of candidate files is formed (step 34), which is constituted by the base set 30 satisfying the criterion of having a sufficient number of members.
If, upon forming a plurality of base sets and determining that each comprises fewer than N/i members, it is found that no more base sets can or should be formed, the largest of the previously formed plurality of base sets is used to constitute the set 19 of candidate files. The number of iterations of the steps 31-33 to form a base set may, for example, be limited to a pre-determined number. Alternatively, the lyrics server 6 may determine that each of the candidate files in the super-set 24 has been selected as initial candidate files for a base set 30.
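The formation of base sets in the steps 31-33 can be sketched as follows. This is a sketch under stated assumptions rather than the definitive procedure: all function names are hypothetical, a fingerprint is taken to satisfy the similarity criterion if it shares at least C strings with any member already in the base set, the whole ordered table (not only its top) is searched for an initial pair, and each base set is grown by a single pass over the table.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Fingerprint = FrozenSet[str]


def similar(a: Fingerprint, b: Fingerprint, c: int) -> bool:
    # Similarity criterion: at least c character strings in common.
    return len(a & b) >= c


def find_initial_pair(ordered: List[str], fps: Dict[str, Fingerprint], c: int,
                      used: Set[str]) -> Optional[Tuple[str, str]]:
    # Search from the top of the ordered table, skipping files that have
    # already served as initial candidate files for an earlier base set.
    fresh = [name for name in ordered if name not in used]
    for x, y in combinations(fresh, 2):
        if similar(fps[x], fps[y], c):
            return x, y
    return None


def grow_base_set(ordered: List[str], fps: Dict[str, Fingerprint], c: int,
                  pair: Tuple[str, str]) -> List[str]:
    base = list(pair)
    for name in ordered:
        # Compare only against files already in the base set.
        if name not in base and any(similar(fps[name], fps[m], c) for m in base):
            base.append(name)
    return base


def select_candidates(ordered: List[str], fps: Dict[str, Fingerprint], c: int,
                      i: int, max_rounds: int = 10) -> List[str]:
    threshold = len(ordered) / i  # the base set needs more than N/i members
    used: Set[str] = set()
    best: List[str] = []
    for _ in range(max_rounds):  # pre-determined limit on the number of base sets
        pair = find_initial_pair(ordered, fps, c, used)
        if pair is None:
            break
        used.update(pair)
        base = grow_base_set(ordered, fps, c, pair)
        if len(base) > threshold:
            return base
        if len(base) > len(best):
            best = base
    return best  # fall back to the largest base set formed
```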
The lyrics file 20 is formed (step 35) on the basis of the set 19 of candidate files, using a method such as the one illustrated in Figs. 4-6, which will now be discussed. It is noted, however, that the method of Figs. 4-6 is applied directly to the super-sets 13,24 of candidate files in another embodiment, i.e. without reducing the number of candidate files on the basis of a comparison of characterising sets of character strings extracted from them.
As outlined above, the set 19 of candidate files is obtained on the basis of a search query submitted to the server system 5 arranged to permit a search of the contents of the web servers 1-3 to be carried out. The candidate files in the set 19 include character strings within their contents, and are preferably text files. In a first step 36, the contents of the candidate files in the set 19 are partitioned into fragments, to form a set 37 of intermediate ordered sequences of fragments. As illustrated, an index is maintained to associate with each fragment its position in the intermediate ordered sequence.
In a subsequent step 38, a dummy fragment is added to each intermediate ordered sequence of fragments to form a set 39 of ordered sequences of fragments. Each ordered sequence is thus associated with one of the multiple candidate files in the set 19. The composition of the dummy fragments will be detailed below.
In a next step 40, n+1 fragments are selected from each of the ordered sequences of fragments in the set 39. These are the first n+1 fragments. The first fragment from each ordered sequence is selected on the basis of its being at the first position in the ordered sequence. The next n fragments are selected to take account of the fact that some of the multiple candidate files may start with character strings that do not form part of the actual lyrics, for example, headings, biographical information concerning the composer or performing artist, etc. Together, the first n+1 fragments from each of the ordered sequences in the set 39 form a first group 41 of fragments.
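The steps 36, 38 and 40 can be illustrated with a short sketch. It assumes, for illustration only, that fragments are paragraphs separated by blank lines and that a single identical dummy fragment is appended to every sequence; the sentinel value `DUMMY` and the function names are hypothetical.

```python
from typing import List, Tuple

DUMMY = "<<END>>"  # hypothetical dummy fragment, identical for every sequence


def to_ordered_sequence(text: str) -> List[str]:
    # Step 36: partition the contents into fragments (here: paragraphs).
    fragments = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Step 38: append the dummy fragment to the intermediate sequence.
    return fragments + [DUMMY]


def first_group(sequences: List[List[str]], n: int) -> List[List[Tuple[int, str]]]:
    # Step 40: take the first n+1 fragments of every ordered sequence,
    # keeping each fragment's position as an index.
    return [list(enumerate(seq[: n + 1])) for seq in sequences]


files = [
    "Lyrics by X\n\nVerse one\n\nChorus",  # starts with a heading
    "Verse one\n\nChorus",
]
sequences = [to_ordered_sequence(f) for f in files]
group = first_group(sequences, n=1)
```

With n = 1 the heading of the first file and its first verse both enter the first group, so the verse can still be matched against the first fragment of the second file.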
The contents of the lyrics file 20 are also formed (step 42) as an ordered sequence 43 of fragments. Each fragment in the ordered sequence 43 of fragments is provided on the basis of a cluster of fragments in a sequence 44 of clusters of fragments. Each cluster is associated with a position in the ordered sequence 43 of fragments.
To form the first cluster of the sequence 44 (step 45), the fragments in the first group 41 of fragments are compared to determine which, if any, satisfy a certain criterion. The criterion could be based on a heuristic involving an analysis of the build-up of the fragments in the first group in terms of the character strings contained therein (length, number of words between new line characters, etc.). Preferably, however, a certain number k' of different character strings from each of the fragments in the first group 41 is extracted to form respective characterising sets of character strings for the selected fragments forming the first group 41. A plurality of the characterising sets of character strings are compared to at least one of the other characterising sets of character strings. Those fragments for which the characterising sets of character strings have more than a certain number b' of character strings in common are added to the first cluster in the sequence 44. The cluster starts with fragments selected on the basis of their position. The cluster at the first position is expanded by adding only those of the other fragments for which the fingerprints match those of only fragments previously added to the cluster. This ensures that parts of the lyrics are not "skipped" due to their absence from a majority of the candidate files. In an alternative embodiment, the fragments in the first group 41 are examined for identity. Thus, in such an embodiment, a characterising set of character strings comprises all the character strings of a fragment, but is no longer made up of a certain number of them (i.e. a number common to all the fragments). The cluster then comprises only one fragment.
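A minimal sketch of the preferred fingerprint-based clustering of step 45 follows. Several details are assumptions made for illustration: the fingerprint is taken to be the k' longest distinct lower-cased words of a fragment, the cluster is seeded with the earliest-positioned fragment that matches a fragment from another candidate file, and all names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Fragment = Tuple[int, int, str]  # (file index, position in sequence, text)


def fingerprint(text: str, k: int) -> FrozenSet[str]:
    # Assumed criterion: the k longest distinct words, ties broken alphabetically.
    words = sorted(set(text.lower().split()), key=lambda w: (-len(w), w))
    return frozenset(words[:k])


def form_cluster(group: List[Fragment], k: int, b: int) -> List[Fragment]:
    prints = [fingerprint(text, k) for _, _, text in group]

    def matches(i: int, j: int) -> bool:
        # More than b strings in common, and taken from different files.
        return group[i][0] != group[j][0] and len(prints[i] & prints[j]) > b

    # Seed with the earliest-positioned fragment that matches another file.
    by_position = sorted(range(len(group)), key=lambda i: (group[i][1], group[i][0]))
    seed = next((i for i in by_position
                 if any(matches(i, j) for j in range(len(group)))), None)
    if seed is None:
        return []
    cluster = [seed]
    grew = True
    while grew:  # expand only via fragments already in the cluster
        grew = False
        for i in range(len(group)):
            if i not in cluster and any(matches(i, j) for j in cluster):
                cluster.append(i)
                grew = True
    return [group[i] for i in sorted(cluster)]


verse = "yesterday all my troubles seemed so far away"
second = "now it looks as though they are here to stay"
example_group: List[Fragment] = [
    (0, 0, "Lyrics provided by example site"), (0, 1, verse),
    (1, 0, verse), (1, 1, second),
    (2, 0, "Yesterday all my troubles seemed so far away"), (2, 1, second),
]
first_cluster = form_cluster(example_group, k=3, b=2)
```

Here the heading in the first file matches nothing and is left out, while the second verse, although present in two files, does not match any fragment already in the cluster and is deferred to a later cluster.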
The position in the ordered sequence in the set 39 from which each fragment selected for the first cluster was taken is recorded. To obtain a next cluster in the sequence 44, starting fragments are selected (step 46) from the respective ordered sequences in the set 39. The starting fragments are selected on the basis of their position, namely one position advanced with respect to the position recorded when the previous cluster in the sequence 44 was formed. In other words, it is determined for each ordered sequence in the set 39 whether any fragment has been included in a cluster already present in the sequence 44 of clusters of fragments. If such is the case, the fragment at a position immediately succeeding that of a last selected one of the fragments determined to have been included in a cluster is selected as starting fragment.
To the starting fragments are added (step 47) at most n of the fragments at the next n positions immediately following upon those of the corresponding starting fragments, in the example all n fragments. Thus a further group 48 of fragments is formed, associated with a position in the ordered sequence 43 of fragments forming the representation of the lyrics in the lyrics file 20.
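The selection of a further group in the steps 46 and 47 can be sketched as a sliding window per ordered sequence. The function name and the `last_clustered` mapping are hypothetical; the sketch also assumes that a sequence from which no fragment has yet been clustered keeps its starting fragment at the first position, a case the description leaves open.

```python
from typing import Dict, List, Tuple


def next_group(sequences: List[List[str]], last_clustered: Dict[int, int],
               n: int) -> List[Tuple[int, int, str]]:
    group = []
    for file_idx, seq in enumerate(sequences):
        # Step 46: start one position after the last fragment of this
        # sequence that was included in a cluster (assumed 0 if none was).
        start = last_clustered.get(file_idx, -1) + 1
        # Step 47: add at most n fragments following the starting fragment.
        for pos in range(start, min(start + n + 1, len(seq))):
            group.append((file_idx, pos, seq[pos]))
    return group


seqs = [["heading", "verse", "chorus", "<<END>>"], ["verse", "chorus", "<<END>>"]]
# Positions recorded when the first cluster (the verse) was formed.
further_group = next_group(seqs, {0: 1, 1: 0}, n=1)
```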
In a next step 49, the fragments of the further group 48 are compared and clustered, preferably using the fingerprint extraction and comparison method outlined with regard to the corresponding step 45 used for the first group 41 of fragments. The resulting cluster of fragments is added to the sequence 44 of clusters at the second position. Further iterations of the steps 46,47,49 are performed until a dummy fragment is selected as a starting fragment from at least one of the ordered sequences in the set 39 and a dummy fragment is present among at least the n fragments following the starting fragment in each of the other ordered sequences of the set 39. There are then no further starting fragments to be selected in the first step 46 of the iteration.
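The termination condition for the iterations can be expressed as a small predicate. This sketch assumes a single shared dummy value and that every entry of `starts` is a valid index into its sequence; the names are hypothetical.

```python
from typing import List


def should_stop(sequences: List[List[str]], starts: List[int], n: int,
                dummy: str) -> bool:
    pairs = list(zip(sequences, starts))
    # A dummy fragment is the starting fragment of at least one sequence ...
    dummy_is_start = any(seq[s] == dummy for seq, s in pairs)
    # ... and every sequence has a dummy fragment within its window of the
    # starting fragment plus the n fragments that follow it.
    dummy_in_every_window = all(dummy in seq[s: s + n + 1] for seq, s in pairs)
    return dummy_is_start and dummy_in_every_window


END = "<<END>>"
lyric_sequences = [["heading", "verse", "chorus", END], ["verse", "chorus", END]]
```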
To ensure that the last cluster in the sequence 44 comprises only dummy fragments, several variants of the step 38 in which the dummy fragments are added are possible. In a first variant, an identical dummy fragment is appended to each intermediate ordered sequence of fragments in the set 37. In another variant, one of a set of several possible dummy fragments is appended. The set comprises only dummy fragments that have the same fingerprint. That is to say that, when a method such as one of those outlined above with regard to the steps 14 and 27 - the steps in which fingerprints are extracted from candidate files - is applied to only the character strings comprised in the dummy fragments, characterising sets of character strings are obtained that satisfy the similarity criterion in use for comparing candidate file fragments in the steps 45 and 49.
Figs. 5 and 6 provide a detailed illustration of an implementation of the step 42 in which the ordered sequence 43 representing the lyrics for the lyrics file 20 is formed on the basis of the ordered sequence 44 of clusters of fragments.
Each cluster 50 of fragments is retrieved in turn (step 51). A set 52 of fingerprints associated with the respective fragments in the cluster 50 is retrieved next (step 53). These fingerprints are preferably retained when the steps 45 and 49 of forming the clusters are performed using a fingerprint comparison.
Next (step 54) the cluster 50 is searched for fragments with identical fingerprints. Fragments in a sub-set 55 of fragments with identical fingerprints are analysed to determine whether they contain exactly the same character strings (step 56). If there are no fragments with identical fingerprints, then the algorithm terminates (not illustrated in Fig. 5). In that case, it is to be assumed that each of the multiple candidate files in the super-set ends differently, for example with an advertisement. In case there are multiple groups of fragments with identical fingerprints, the largest group is used for the sub-set 55.
In case two or more fragments in the sub-set 55 are identical, one is chosen (step 57) to be inserted at the position in the ordered sequence 43 of fragments corresponding to the position of the cluster 50 of fragments in the ordered sequence 44 of clusters. The method continues with the next cluster 50 of fragments in the ordered sequence 44 of clusters.
If it is determined that there are no identical fragments in the sub-set 55 of fragments that share a fingerprint, then the steps set out in Fig. 6 are performed.
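The decision logic of the steps 54-57 can be sketched as follows. The return values "stop", "chosen" and "refine" are hypothetical labels for the three outcomes, and choosing the most frequent of the identical fragments is an assumption, since the description only states that one is chosen.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple


def pick_fragment(cluster: List[Tuple[FrozenSet[str], str]]) -> Tuple[str, Optional[str]]:
    # cluster holds (fingerprint, fragment text) pairs.
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for fp, text in cluster:
        groups[fp].append(text)
    # Step 54: the largest group of fragments with identical fingerprints.
    subset = max(groups.values(), key=len, default=[])
    if len(subset) < 2:
        return ("stop", None)  # no fragments share a fingerprint
    # Step 56: do any fragments contain exactly the same character strings?
    text, count = Counter(subset).most_common(1)[0]
    if count >= 2:
        return ("chosen", text)  # step 57
    return ("refine", None)  # continue with the sub-fragment method of Fig. 6


fp_a = frozenset({"yesterday", "troubles"})
fp_b = frozenset({"download", "ringtone"})
```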
First (step 58), each fragment in the cluster 50 is partitioned into sub-fragments, so that the cluster 50 becomes a cluster 59 of intermediate ordered sequences of sub-fragments. As in the corresponding step 38 in the method performed on the fragments, as outlined in Fig. 4, dummy sub-fragments are added (step 60) to form a cluster 61 of ordered sequences of sub-fragments.
In a next step 62, m+1 sub-fragments are selected from each of the ordered sequences of sub-fragments in the cluster 61. These are the first m+1 sub-fragments. The first sub-fragment from each ordered sequence is selected on the basis of its being at the first position in the ordered sequence of sub-fragments. The next m sub-fragments are selected to take account of the fact that some of the multiple candidate files may have extra words inserted. Together, the first m+1 sub-fragments from each of the ordered sequences in the cluster 61 form a first group 63 of sub-fragments.
The fragment under construction is also formed (step 64) as an ordered sequence 65 of sub-fragments. Each sub-fragment in the ordered sequence 65 of sub-fragments is provided on the basis of a cluster of sub-fragments in a sequence 66 of clusters of sub-fragments. Each cluster is associated with a position in the ordered sequence 65 of sub-fragments.
To form the first cluster of the sequence 66 (step 67), the sub-fragments in the first group 63 of sub-fragments are compared to determine which, if any, satisfy a certain criterion. Fingerprints could be formed of the sub-fragments and compared. Alternatively, the step 67 comprises examining the sub-fragments in the first group 63 for identity. In the latter embodiment, each cluster in the ordered sequence 66 of clusters of sub-fragments will comprise only identical sub-fragments, so that this step 67 and the step 64 of forming the ordered sequence 65 of sub-fragments could be combined.
The position in the ordered sequence of sub-fragments in the cluster 61 of ordered sequences from which each sub-fragment selected for the first cluster was taken is recorded. To obtain a next cluster in the sequence 66, a further group 68 of starting sub-fragments is selected (step 69) from the respective ordered sequences in the cluster 61. The starting sub-fragments in the further group 68 are selected on the basis of their position, namely one position advanced with respect to the position recorded when the previous cluster in the sequence 66 of clusters was formed. In other words, it is determined for each ordered sequence in the cluster 61 of ordered sequences of sub-fragments whether any sub-fragment has been included in a cluster already present in the sequence 66 of clusters. If such is the case, the sub-fragment at a position immediately succeeding that of a last selected one of the sub-fragments determined to have been included in a cluster is selected as starting sub-fragment.
To the starting sub-fragments are added (step 70) at most m of the sub-fragments at the next m positions immediately following upon those of the corresponding starting sub-fragments, in the example all m sub-fragments. Thus a further group 71 of sub-fragments is formed, associated with a position in the ordered sequence 65 of sub-fragments forming the fragment to be constructed.
In a next step 72, the sub-fragments of the further group 71 are compared and clustered, preferably again by examining them for identity. The resulting cluster of sub-fragments is added to the sequence 66 of clusters of sub-fragments at the second position. Further iterations of the steps 69,70,72 are performed until a dummy sub-fragment is selected as a starting sub-fragment from at least one of the sequences in the cluster 61 of ordered sequences of sub-fragments and a dummy sub-fragment is present among at least the m sub-fragments following the starting sub-fragment in each of the other ordered sequences of sub-fragments. There are then no further starting sub-fragments to be selected in the first step 69 of the iteration.
In the case of the lyrics file 20, the fragments advantageously correspond to paragraphs (chorus, refrain), with the sub-fragments corresponding to individual character strings representing words, or to lines of character strings. The step 58 of partitioning fragments is easily performed by identifying the words by the fact that they are separated by a character representing a <space>. Alternatively, lines are identified by the presence of <new line> characters.
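The partitioning of step 58 can be sketched in a few lines. The function name and the `unit` parameter are hypothetical, and splitting words on any run of whitespace (rather than on single space characters only) is an assumption made so that line breaks inside a paragraph do not end up inside a word.

```python
from typing import List


def partition(fragment: str, unit: str = "word") -> List[str]:
    if unit == "line":
        # Lines are identified by the presence of new-line characters.
        return [line for line in fragment.split("\n") if line]
    # Words are identified by the whitespace that separates them.
    return fragment.split()


chorus = "let it be\nlet it  be"
words = partition(chorus, "word")
lines = partition(chorus, "line")
```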
In a further developed embodiment, if necessary, a sub-fragment is formed as an ordered sequence of sub-sub-fragments, at least one of which is provided on the basis of a cluster of sub-sub-fragments among those of sub-fragments into which sub-fragments have been partitioned that satisfy a certain criterion. Thus, where the sub-fragments in the ordered sequence 65 represent words, those words would be formed by partitioning the sub-fragments in each cluster of the ordered sequence 66 of clusters into sequences of sub-sub-fragments representing individual characters. Then, each sub-fragment of the ordered sequence 65, representing a word, is provided on the basis of those sub-sub-fragments among those selected from the sub-fragments in the sequences that are identical. This variant is useful in filtering out misspellings in some of the multiple candidate files in the set 19 with which the method of Figs. 4-6 started.
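The effect of working at the level of individual characters can be illustrated with a deliberately simplified stand-in: a position-wise majority vote over the variants of one word. This is not the windowed clustering method of Figs. 4-6; it ignores the look-ahead over the next m positions and therefore only handles substituted characters well, not inserted or dropped ones. The function name is hypothetical.

```python
from collections import Counter
from itertools import zip_longest
from typing import List


def vote_word(variants: List[str]) -> str:
    chars = []
    # Compare the characters found at the same position in every variant.
    for column in zip_longest(*variants, fillvalue=""):
        char, count = Counter(column).most_common(1)[0]
        # Keep a character only if at least two variants agree on it.
        if count >= 2:
            chars.append(char)
    return "".join(chars)


corrected = vote_word(["yesterday", "yesterdey", "yesterday"])
```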
It should be noted that the above-mentioned embodiments illustrate, rather than limit, the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. For instance, although an embodiment using a mobile content player 7 and a lyrics server 6 has been described, an alternative embodiment includes only a program on a single computer with a network connection, for example a personal computer. Alternatively, the mobile content player 7 may perform the entire method leading to a text file, or the entire method may be performed by the server system 5 that also comprises the search engine for searching the Internet.

Claims

CLAIMS:
1. Method of obtaining a data file (20) including a representation of a text, e.g. the lyrics of a song, including the steps of:
- obtaining multiple candidate files (19) having character strings as contents, on the basis of a search query submitted to a server system (5) arranged to permit a search of the contents of at least one server (1-3) to be performed,
- partitioning the contents of the multiple candidate files (19) into fragments, and
- forming the representation of the text as an ordered sequence (43) of fragments, at least one of which is provided on the basis of a cluster (50) of fragments among those of fragments (41,48) selected from the candidate files (19) that satisfy a certain criterion.
2. Method according to claim 1, wherein the cluster of fragments is formed by comparing data based on contents of the fragments from the candidate files (19), and wherein only fragments for which the compared data satisfies a similarity criterion with respect to data based on contents of a fragment from at least one other of the candidate files (19) are included in the cluster (50).
3. Method according to claim 2, including extracting a certain number of different character strings from each of the fragments selected from the candidate files (19), to form respective characterising sets of character strings for the selected fragments, and comparing a plurality of the characterising sets of character strings to at least one of the other characterising sets of character strings, wherein fragments for which the characterising sets of character strings have more than a certain number of character strings in common are added to the cluster (50).
4. Method according to any one of claims 1-3, wherein a fragment of the representation of the text is provided on the basis of a cluster (50) of fragments of the candidate files (19) by partitioning each fragment in the cluster (50) into sub-fragments, and forming the fragment as an ordered sequence (65) of sub-fragments, at least one of which is constructed on the basis of a cluster of sub-fragments among those of sub-fragments (61,68) selected from the sub-fragments that satisfy a certain criterion.
5. Method according to any one of claims 1-4, including partitioning the multiple candidate files (19) into fragments included in respective associated ordered sequences (39) of fragments, wherein a fragment of the representation of the text is constructed on the basis of a cluster (50) of fragments among those of a selected plurality (41,48) of fragments from the ordered sequences (39) that satisfy the certain criterion, wherein, of each of the respective ordered sequences (39) of which at least one fragment is included among the selected plurality (41,48), at least one fragment is selected to be included among the selected plurality on the basis of its position in the ordered sequence (39) of fragments.
6. Method according to claim 5, wherein the selected plurality of fragments includes at least one fragment from at least one of the ordered sequences (39) of fragments associated with the candidate files (19) at one of at least one position in the ordered sequence (39) following that of the fragment selected on the basis of its position.
7. Method according to claim 5 or 6, wherein, for a fragment subsequent to another fragment of the ordered sequence (43) of fragments forming the representation of the text, selection of at least one fragment on the basis of its position from an ordered sequence (39) associated with a candidate file (19) includes determining whether any fragment in the ordered sequence of fragments (39) has been included in a cluster (50) on the basis of which another fragment in the ordered sequence (43) of fragments forming the representation of the text has been constructed, and if such is determined to be the case, selecting a fragment at a position immediately succeeding that of a last selected one of the fragments determined to have been included in a cluster (50).
8. Method according to any one of claims 5-7, wherein each ordered sequence (39) of fragments associated with one of the multiple candidate files (19) is formed by partitioning the multiple candidate files (19) into respective intermediate ordered sequences (37) of fragments, obtaining a set of at least one dummy fragment, constructed such that the dummy fragments in the set satisfy the certain criterion, and appending a dummy fragment from the set to each intermediate ordered sequence (37) of fragments to form the ordered sequences (39) of fragments associated with the candidate files (19).
9. System for obtaining a data file (20) including a representation of a text, e.g. the lyrics of a song, including a client for submitting a search query to a server system (5) arranged to permit a search of the contents of at least one server (1-3) to be performed, and for obtaining multiple candidate files (19) having character strings as contents in response to the search query, the system being configured to partition the contents of the candidate files (19) into fragments, and to form the representation of the text as an ordered sequence (43) of fragments, at least one of which is provided on the basis of a cluster (50) of fragments among those of fragments selected from the candidate files (19) that satisfy a certain criterion.
10. System according to claim 9, configured to execute a method according to any one of claims 1-8.
11. Consumer electronics device, comprising a network port and configured for communicating via the network port with a server system (5) arranged to permit a search of the contents of at least one server (1-3), wherein the consumer electronics device comprises a system according to any one of claims 9-10.
12. Computer program including a set of instructions capable, when incorporated in a machine readable medium, of causing a system having information processing capabilities to perform a method according to any one of claims 1-8.
13. Device (6) for obtaining a data file including a representation of a text, e.g. the lyrics of a song, the device being configured for
- obtaining multiple candidate files having character strings,
- to partition the candidate files into fragments, and
- to form the representation of the text as an ordered sequence of fragments, at least one of which is provided on the basis of a cluster of fragments among those of fragments selected from the candidate files that satisfy a certain criterion.
PCT/IB2006/053975 2005-11-15 2006-10-27 Method, system and device for obtaining a representation of a text WO2007057799A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05110742.3 2005-11-15
EP05110742 2005-11-15

Publications (1)

Publication Number Publication Date
WO2007057799A1 true WO2007057799A1 (en) 2007-05-24

Family

ID=37776615

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/053975 WO2007057799A1 (en) 2005-11-15 2006-10-27 Method, system and device for obtaining a representation of a text

Country Status (1)

Country Link
WO (1) WO2007057799A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5857179A (en) * 1996-09-09 1999-01-05 Digital Equipment Corporation Computer method and apparatus for clustering documents and automatic generation of cluster keywords
WO2004025490A1 (en) * 2002-09-16 2004-03-25 The Trustees Of Columbia University In The City Of New York System and method for document collection, grouping and summarization
US20050203970A1 (en) * 2002-09-16 2005-09-15 Mckeown Kathleen R. System and method for document collection, grouping and summarization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PATRÍCIASILVA PERES ET AL: "Application of Clustering Technique in Multiple Sequence Alignment", SPIRE 2005, 5 November 2005 (2005-11-05), pages 202 - 205, XP019023025 *
PETER KNEES, MARKUS SCHEDL, GERHARD WIDMER: "multiple lyrics alignment: automatic retrieval of song lyrics", UNIVERSITY OF LONDON, 30 September 2005 (2005-09-30), ismir 2005, XP002423234 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06821219

Country of ref document: EP

Kind code of ref document: A1