US8977606B2 - Method and apparatus for generating extended page snippet of search result - Google Patents

Publication number
US8977606B2
Authority
US
United States
Prior art keywords
webpage
page
snippet
column names
generating
Prior art date
Legal status
Expired - Fee Related, expires
Application number
US13/628,077
Other versions
US20130086035A1 (en
Inventor
Sheng Hua Bao
Jian Chen
Zhong Su
Xin Ying Yang
Xiang Zhou
Current Assignee
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date
Filing date
Publication date
Application filed by International Business Machines Corp
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: BAO, SHENG HUA, CHEN, JIAN, SU, Zhong, YANG, XIN YING, ZHOU, XIANG
Publication of US20130086035A1
Application granted
Publication of US8977606B2
Expired - Fee Related
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F17/30864
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/248: Presentation of query results
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/338: Presentation of query results
    • G06F17/30554
    • G06F17/30696

Definitions

  • the invention relates generally to the technical field of generating an extended page snippet of a search result in a search engine, and particularly to a method and apparatus for generating a page snippet in table style.
  • A search engine operates in the following manner: once a user submits an inquiry through a client, the search engine returns the retrieved webpages to the user through a search result page.
  • One important object of the search engine is to provide the set of links the user wants for a specific search inquiry; another is to inform the user clearly and quickly of the content associated with each link. Therefore, when the search result is returned, besides a title and a uniform resource locator (URL) of the webpage, the search result page also contains a short text description related to the webpage. This short text description is usually referred to as a page snippet.
  • The search engine extracts the page snippet from the webpage by extracting and combining text segments that include a keyword involved in the inquiry.
  • The search engine differentiates the display of the inquired keyword from other text in the page snippet by various means, such as highlighting, underlining, a different font, and the like, in order to draw the user's attention and help the user decide whether to click through to the webpage.
  • The page snippet in the prior art reflects a correlation between the webpage and the inquiry to a certain extent.
  • However, the current page snippet in the prior art consists of the text segments containing the inquired keyword, and the selection of a text segment does not take account of content other than the keyword in that segment. It also does not take account of the table format information of the text segment.
  • a table is an important data source, and some widely used data types adapted to be presented in a table are listed as follows: traditional Web Table type of data, for example, information such as members, companies, situations, merchandise, movies, and music, including both bordered tables and non-bordered tables.
  • The application of business intelligence (BI) causes a large amount of enterprise data to be generated as reports (in formats such as Web reports, PDF, Excel®, Word and the like), and many enterprise-level BI analysis and presentation tools, such as IBM Cognos® and the like, generate and publish a large number of such reports.
  • various mainstream search engines have already brought documents in Excel, Word and the like under the retrieval.
  • the prior art also provides a search result preview function which may preview webpage information in the manner of a picture.
  • The room for modification keeps shrinking, and improving and innovating on the search engine is increasingly difficult. Therefore, even a small modification may mean a great improvement to the user experience.
  • the snippet is different from the preview.
  • the preview does not generate a relative segment for a final user's fast understanding on the basis of the inquiry, but simply outputs the content of the original webpage. Whereas the snippet is used for the user to quickly judge the correlation with the inquired word, the preview is used to further judge the correlation after the judgment through the snippet; the stages of using them are different.
  • a display space of the snippet is very narrow and small, while the display space of the preview is very large.
  • The snippet is displayed by default, whereas the preview is displayed only after the mouse is moved to a particular position (such as a title, a snippet, a network address and the like) to trigger it, and there is also a delay before it appears (depending on the displayed content and the network speed).
  • the snippet and the preview are absolutely different technical solutions for those skilled in the art.
  • the table format information thereof is also an extremely important part which facilitates the user to quickly understand the search result through the webpage snippet.
  • The search technology needs to be further improved to at least present the table format information in the page snippet to a certain extent.
  • the present invention provides a method for generating an extended page snippet in a search engine, comprising: retrieving and returning an associated table webpage having a table related to an inquired keyword; obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result; determining relative row instances related to said inquired keyword; and generating a page snippet in a table style in accordance with said column names and said relative row instances.
  • the present invention provides an apparatus for generating an extended page snippet in a search engine, comprising: means for retrieving and returning an associated table webpage having a table related to an inquired keyword; means for obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result; means for determining the relative row instances related to said inquired keyword; means for generating a page snippet in a table style in accordance with said column names and said relative row instances.
  • FIG. 1 shows an exemplary computer system for implementing an embodiment of the present invention
  • FIG. 2 shows a method flowchart for generating an extended snippet of a search result of the present application
  • FIG. 3 shows a schematic diagram of an apparatus for generating an extended snippet of a search result of the present application
  • FIG. 4 shows a schematic diagram of webpage 1 in an embodiment
  • FIG. 5 shows a schematic diagram of webpage 2 in an embodiment
  • FIG. 6 shows a schematic diagram of webpage 3 in an embodiment.
  • The present invention can be embodied as a system, a method or a computer program product. Accordingly, the present invention can be embodied in any one of the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of a software part and a hardware part, referred to as a “circuit,” a “module,” or a “system” in this document.
  • the present invention may also take a form of computer program product embodied in any tangible medium of expression having computer usable non-transient program codes.
  • the computer readable medium can be a computer readable signal medium or a computer readable storage medium.
  • the computer readable storage medium can include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semi-conductive system, apparatus, device or propagation medium, or any appropriate combination thereof.
  • the computer readable storage medium includes the following: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination thereof.
  • the computer readable storage medium can be any tangible medium containing or storing a program for use by or in connection with an instruction executing system, apparatus or device.
  • the computer readable signal medium can include, for example, a data signal propagated in a base band or as part of a carrier wave, which carries the computer readable program codes. Such a propagated signal can adopt any appropriate form including, but not limited to, an electromagnetic signal, an optical signal or any appropriate combination thereof.
  • the computer readable signal medium can be any computer readable medium other than a computer readable storage medium, which is capable of transmitting, propagating or transporting the program for use by or in connection with an instruction executing system, apparatus or device.
  • the non-transient program codes contained on the computer readable medium can be transmitted with any appropriate medium including, but not limited to, a wireless medium, a wire, an optical fiber cable, an RF or the like, or any appropriate combination thereof.
  • Computer non-transient program code for carrying out operations of the present invention can be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the non-transient program code can execute entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN), or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means implementing the functions and operations specified in the block or blocks in the flowcharts and/or block diagrams.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operation steps to be performed on the computer or other programmable data processing apparatus to generate a computer implemented process such that the instructions which execute on the computer or other programmable data processing apparatus provide processes for implementing the functions and operations specified in the block or blocks in the flowcharts and/or block diagrams.
  • FIG. 1 shows a block diagram of an exemplary computer system 100 adapted to implement an embodiment of the present invention.
  • The computer system 100 can include a CPU (Central Processing Unit) 101, a RAM (Random Access Memory) 102, a ROM (Read Only Memory) 103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial peripheral device 112, a parallel peripheral device 113, and a display 114.
  • Among these components, the CPU 101, the RAM 102, the ROM 103, the hard disk controller 105, the keyboard controller 106, the serial interface controller 107, the parallel interface controller 108 and the display controller 109 are coupled with the system bus 104.
  • The hard disk 110 is coupled with the hard disk controller 105, the keyboard 111 is coupled with the keyboard controller 106, the serial peripheral device 112 is coupled with the serial interface controller 107, the parallel peripheral device 113 is coupled with the parallel interface controller 108, and the display 114 is coupled with the display controller 109.
  • The structure block diagram illustrated in FIG. 1 is shown for the purpose of an example only and not as a limitation to the scope of the present invention. In some cases, some devices can be added or removed depending on a specific situation.
  • FIG. 2 shows a method flowchart for generating an extended snippet of a search result in one embodiment, including the following steps:
  • Step 201: retrieving and returning an associated table webpage having a table related to an inquired keyword.
  • a webpage series related to the inquired keyword can be retrieved and returned, and the webpage series includes at least one associated table webpage having a table related to the inquired keyword.
  • the inquired keyword can include one or more keywords, the number of which depends on the user's input.
  • the webpage series related to the inquiry can be determined with a technology in the existing search engines.
  • A table related to the inquired keyword means a table that matches some or all of the inquired keywords.
  • A table consists of three parts, i.e., rows, columns and cells. The cells in the first row are the table header information: the content of each cell in the first row is the column name of the respective column, and the data in the cells of each row in the table form a row instance.
  • the table usually adopts the formats of HTML, Excel, Word, PDF, and so on.
  • Step 202: obtaining a parsed result of the table in the associated table webpage, and extracting the column names and the respective row instances therefrom.
  • the existing search engines can be classified into two types according to the search result source.
  • One type has its own webpage crawling, indexing and retrieving system (Indexer), with an independent “Spider” program, “Crawler” program or “Robot” program (the three names having the same meaning). It can build a webpage database itself, and the search result is served directly from its own database.
  • the second type rents a database of another search engine and sorts the search results in its self-defined format.
  • the parsed result of the table can also be obtained by a variety of ways.
  • Tables in all webpages are parsed when the spider program is used to crawl the webpages.
  • the parsed result is stored in a self-built webpage database, and then the parsed result of the table is returned when the webpage series is returned in step 201 .
  • a real time manner can be employed to parse the tables in the associated table webpage, thereby obtaining the parsed result.
  • Apache POI (Poor Obfuscation Implementation) is an open source library of the Apache Software Foundation. It provides an API for Java programs so that they can read and write Microsoft Office format files.
  • Apache POI is also open source software used in many search applications, and it can be used to parse tables in various Office formats in the webpages. For example, a table in a Word format can be read and parsed through classes such as Table, TableCell, TableRow, TableIterator, and the like in the POI, specifically exemplified as follows:
  • the content of the Excel table can be parsed through elements of HSSFWorkbook, HSSFSheet, HSSFRow, HSSFCell and the like in the POI, specifically exemplified as follows:
  • HTML Parser (sourceforge, http://htmlparser.sourceforge.net)
  • Extracting the column names and the instances from the parsed result of the table can also be done in a variety of embodiments. In one embodiment, column name information can be extracted according to a column name tag, and instance information can be extracted according to an instance tag. For example, after an HTML table is parsed, the column names are extracted by the <TH> tag, and the instance information of the respective columns is extracted by the <TD> tag. In another embodiment, for example for a table obtained by the POI, there may be no explicit tag representing the column name. In this case, the first non-null row in the table can be examined. Since the data format of the table header is generally different from the data format of the contents of the other rows in the table, if the element format of that row is clearly distinguished from all the rest of the rows, then that row can be used as the column name row.
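  • The tag-based extraction described above can be sketched in Python with the standard library's `html.parser` module. This is a minimal illustration only, not the implementation used in the embodiments: the class name `TableExtractor`, the helper `extract_table` and the sample markup are hypothetical, and the sketch assumes a single, non-nested table whose header cells use the TH tag.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects <th> cells as column names and <td> cells as row instances.

    Hypothetical helper: assumes one table per document and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.column_names = []
        self.row_instances = []
        self._current_row = None  # cells of the <tr> currently being read
        self._cell_tag = None     # "th" or "td" while inside a cell
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._current_row = []
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and tag == self._cell_tag:
            text = "".join(self._cell_text).strip()
            if tag == "th":
                self.column_names.append(text)
            elif self._current_row is not None:
                self._current_row.append(text)
            self._cell_tag = None
        elif tag == "tr":
            # A row that only held <th> cells is empty here and is skipped.
            if self._current_row:
                self.row_instances.append(self._current_row)
            self._current_row = None


def extract_table(html):
    """Return (column names, row instances) for the table in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.column_names, parser.row_instances


# Made-up sample markup for illustration.
SAMPLE = """
<table>
  <tr><th>HeaderA</th><th>HeaderB</th></tr>
  <tr><td>a1</td><td>b1</td></tr>
  <tr><td>a2</td><td>b2</td></tr>
</table>
"""

columns, rows = extract_table(SAMPLE)
print(columns)
print(rows)
```

  • A table without explicit header tags would leave `column_names` empty in this sketch; the format-comparison approach described for POI tables would then be needed instead.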
  • Step 203: determining a row instance related to the inquired keyword.
  • covered rows are determined, the column name is selected, and the instance rows are selected.
  • the display space is limited, and only a limited number of rows can be displayed.
  • selection of the relative instance rows is very important.
  • the covered column names can be all displayed basically.
  • weight information of the inquired keyword can also be taken into account, thereby assisting selection of relative instances and relative column names.
  • the weight information can also be used to adjust the displayed content and order of the instances and the column names so that the most relative instance is displayed in front.
  • the inquired word weight is one factor that needs to be considered when an adjustment to the snippet display order is made, and is usually the information provided by the search engine provider according to the statistics. As an example, different weights can be assigned according to a frequency that the inquired keyword is searched.
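  • As a rough illustration of step 203, the following Python sketch selects the row instances that contain at least one inquired keyword and records which keywords each row matched. The function name `find_relative_rows` and the sample data are hypothetical, and case-insensitive substring matching is an assumption; the text does not prescribe a particular matching rule.

```python
def find_relative_rows(row_instances, keywords):
    """Return (row, matched keywords) pairs for rows containing any keyword.

    Assumption: a keyword matches a row when it occurs, ignoring case,
    as a substring of at least one cell in that row.
    """
    relative = []
    for row in row_instances:
        cells = [cell.lower() for cell in row]
        matched = [k for k in keywords if any(k.lower() in cell for cell in cells)]
        if matched:
            relative.append((row, matched))
    return relative


# Made-up row instances for illustration.
rows = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
    ["a9", "b9", "c9"],
]
for row, matched in find_relative_rows(rows, ["a2", "b2", "b9"]):
    print(row, matched)
```

  • Substring matching is deliberately loose (the keyword "a2" would also match a cell "a22"); a real system would more likely rely on the tokenization already used by the search engine's index.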
  • Step 204: generating the page snippet in a table style in accordance with the column names and the relative row instances.
  • the step can include: statistically calculating the weights of the inquired keywords in the relative row instances to obtain the correlation of the row instances; and generating the page snippet in the table style in accordance with the column names and at least one relative row instance with the correlation arranged in the top.
  • the selected row instances can be presented according to an original order in the table, or the relative row instances and the corresponding column names can be presented from highest to lowest correlation.
  • the form of the table in the page snippet in the table style can display either a border or no border, but it is at least necessary that the column names in the table correspond to the position of the instances with each other.
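  • Step 204 can be sketched as follows: each row instance is scored by the summed weights of the inquired keywords it contains, the top-ranked rows are kept, and a borderless table is rendered so that each cell lines up under its column name. The names `rank_rows` and `render_snippet`, the integer weights and the limit of two rows are all hypothetical choices for illustration; the text leaves the exact correlation formula open.

```python
def rank_rows(row_instances, keyword_weights, max_rows=2):
    """Keep the rows with the highest summed keyword weight.

    Assumption: the correlation of a row is the sum of the weights of the
    keywords that occur (case-insensitively) anywhere in the row.
    """
    scored = []
    for position, row in enumerate(row_instances):
        text = " ".join(row).lower()
        score = sum(w for k, w in keyword_weights.items() if k.lower() in text)
        if score > 0:
            scored.append((score, position, row))
    # Highest score first; ties keep the original table order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [row for _, _, row in scored[:max_rows]]


def render_snippet(column_names, rows):
    """Render a borderless table with cells aligned under the column names.

    Assumes every row has as many cells as there are column names.
    """
    table = [column_names] + rows
    widths = [max(len(r[i]) for r in table) for i in range(len(column_names))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    ]
    return "\n".join(lines)


# Made-up table and weights for illustration.
columns = ["HeaderA", "HeaderB", "HeaderC"]
rows = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
    ["a9", "b9", "c9"],
]
weights = {"a2": 5, "b2": 3, "b9": 4}

top_rows = rank_rows(rows, weights)
print(render_snippet(columns, top_rows))
```

  • Sorting by correlation corresponds to the "highest to lowest correlation" presentation; re-sorting `top_rows` by original position would give the alternative presentation in the original table order.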
  • A flow for generating the snippet in the table style crossing pages is further explained in conjunction with FIG. 2, where a plurality of associated table webpages are returned in step 201 shown in FIG. 2.
  • pages having a similarity are aggregated by webpage clustering in accordance with the inquired keyword and the webpage series returned by the search engine.
  • the plurality of associated table webpages are all in the same cluster.
  • the webpage clustering can adopt well known technical means which will not be stated in more detail herein.
  • the webpages from the same website domain name in the webpage series are clustered and the plurality of associated table webpages are included in the clustered result.
  • The webpage aggregation is performed on webpages from the same website because tables having a high correlation usually occur in webpages under the same website domain name. Thus the correlation of the aggregation can be increased. For instance, in the information published in a company website, the information of one employee can be published in a plurality of different tables for the same employee. Thus the webpages on which the snippet crossing pages can be performed are found more exactly by aggregating the webpages belonging to the website of the company.
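  • Grouping the returned webpages by website domain name can be sketched with Python's standard `urllib.parse` module. The function name `cluster_by_domain` and the example URLs are hypothetical, and the sketch groups by the full host name, which is a simplifying assumption: hosts such as `www.example.com` and `hr.example.com` would land in different clusters.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cluster_by_domain(urls):
    """Group URLs by host name (assumption: host name stands in for the
    website domain name)."""
    clusters = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        clusters[host].append(url)
    return dict(clusters)


# Made-up URLs for illustration.
urls = [
    "http://www.example.com/page1.html",
    "http://www.example.com/page2.html",
    "http://other.example.org/index.html",
]
clusters = cluster_by_domain(urls)
for host, pages in clusters.items():
    print(host, pages)
```

  • Only clusters that hold more than one associated table webpage would be candidates for a snippet crossing pages.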
  • the page snippet in the table style crossing pages can be generated in the following two embodiments.
  • the page snippet in the table style crossing pages combines the column names and the instances associated with the inquired keyword in the plurality of associated table webpages.
  • the snippets in the table style are generated for each of the associated table webpages through step 202 to step 204 .
  • This embodiment includes: combining the snippet in the table style of the plurality of associated table webpages to obtain a combined snippet; determining the relative row instances and the column names in the combined snippet in accordance with the inquired keyword; and outputting the page snippet in the table style crossing pages in accordance with the relative row instances and the column names. Referring to Table 1, this embodiment is explained.
  • The snippets in the table style of pages P1 and P3 shown in Table 1 match all inquired keywords KEY1, KEY2 and KEY3, and the snippet in the table style of page P2 matches part of the inquired keywords, KEY1.
  • the combined snippet in the table style is generated.
  • a blend and a concatenation of the column names and the instances occur in the combination of the snippets in the table style, that is, the parts with the same column name and cell data are blended, and the parts with the different column names and cell data are concatenated.
  • Table 2 illustrates the combined snippet in the table style.
  • New relative instances and new relative column names are selected in the combined snippet in the table style according to the inquired keyword. After a plurality of snippets in the table style are blended, the size thereof may no longer be adapted to be displayed as the snippet, so it is necessary to further select the relative instances and the relative column names. Moreover, a final snippet in the table style is outputted according to the new relative instances and the new relative column names, and the inquiry result including the webpage series and the page snippet is generated.
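  • The blend-and-concatenate combination can be read in more than one way. The Python sketch below shows one possible reading, stated here as an assumption rather than as the method of the embodiments: column names are merged into an ordered union, a row is blended into an existing row when the two agree on every column they share, and it is concatenated as a new row otherwise. The function name `combine_snippets` and the sample data are hypothetical.

```python
def combine_snippets(snippets):
    """Combine per-page table snippets into one table.

    `snippets` is a list of (column names, rows) pairs, one per webpage.
    Assumed reading of "blend" and "concatenate":
      * columns with the same name are blended into a single column;
      * a row is blended into an existing row when both share at least one
        column and agree on all shared columns;
      * any other row is concatenated as a new row.
    """
    columns = []
    merged_rows = []  # each row is a dict: column name -> cell value
    for page_columns, page_rows in snippets:
        for name in page_columns:
            if name not in columns:
                columns.append(name)
        for values in page_rows:
            new_row = dict(zip(page_columns, values))
            target = None
            for row in merged_rows:
                shared = [c for c in new_row if c in row]
                if shared and all(row[c] == new_row[c] for c in shared):
                    target = row
                    break
            if target is None:
                merged_rows.append(new_row)
            else:
                target.update(new_row)
    # Cells that no page supplied are left empty.
    table = [[row.get(c, "") for c in columns] for row in merged_rows]
    return columns, table


# Made-up snippets from three pages for illustration.
page1 = (["HeaderA", "HeaderB"], [["a2", "b2"], ["a9", "b9"]])
page2 = (["HeaderA", "HeaderH"], [["a2", "h2"]])
page3 = (["HeaderB", "HeaderJ"], [["b9", "j9"]])

combined_columns, combined_rows = combine_snippets([page1, page2, page3])
print(combined_columns)
for row in combined_rows:
    print(row)
```

  • The combined table can grow beyond what fits in a snippet, which is why the relative instances and column names are selected again afterwards.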
  • the parsed results of the tables in the associated table webpages are obtained, the parsed results of the tables of the plurality of associated table webpages are combined to obtain a combined parsed result of the table.
  • the row instances and the column names are extracted from the combined parsed result of the table.
  • the page snippet in the table style crossing pages is generated through step 202 to step 204 .
  • the parsed results of the plurality of associated table webpages are combined as the new parsed result, then the instances and the column names related to the inquired keyword are further selected, so the instance is selected only once.
  • FIG. 3 shows an architecture schematic diagram of the apparatus, mainly including: a means 301 for retrieving and returning an associated table webpage having a table related to an inquired keyword; a means 302 for obtaining a parsed result of the table in the associated table webpage, and extracting column names and respective row instances on the basis of the parsed result; a means 303 for determining the row instances related to the inquired keyword; and a means 304 for generating a page snippet in a table style in accordance with the column names and the relative row instances.
  • the means for retrieving and returning an associated table webpage having a table related to an inquired keyword returns a plurality of associated table webpages.
  • the means for obtaining a parsed result of the table in the associated table webpage and extracting column names and respective row instances on the basis of the parsed result includes: a means for combining the parsed results of the tables of the plurality of associated table webpages to obtain a combined parsed result of the table after the parsed results of the tables in the associated table webpages are obtained; and extracting the column names and the respective row instances on the basis of the combined parsed result of the table, wherein the means for generating the page snippet in the table style in accordance with the column names and the relative row instances generates the page snippet in the table style crossing pages.
  • means for combining the page snippets in the table style of the plurality of associated table webpages; means for determining the row instances related to the inquired keyword in the combined page snippet in the table style; and means for generating the page snippet in the table style crossing pages in accordance with the column names and the relative row instances.
  • the means for retrieving and returning an associated table webpage having a table related to an inquired keyword clusters the webpages from the same website domain name, and determines the plurality of associated table webpages in the clustering.
  • the column names and the instances from different webpages are visually distinguished in the page snippet in the table style crossing pages.
  • the inquired keywords are plural in the means for retrieving and returning
  • the means for generating the page snippet in the table style in accordance with the column names and the relative row instances include: a means for statistically calculating weights of the inquired keywords in the relative row instances to obtain correlations of the row instances; and a means for generating the page snippet in the table style in accordance with said column names and at least one relative row instance with the correlation arranged in the top.
  • The parsed result of the table is a result which is obtained and stored by parsing the tables in all webpages when a spider program crawls the webpages.
  • the parsed result of the table is obtained by parsing the table in the associated table webpage in real time.
  • The inquiry is understood on the basis of parsing the table information in documents in various formats. Further, a page snippet in the table style that preserves the table format information is generated. This remedies the deficiency of the prior art, in which only the keyword in the search result is extracted and no table format information is preserved.
  • Page 1 is a webpage in the returned webpage series.
  • Page 1 shown in FIG. 4 includes a table related to the inquired keywords.
  • The position of the table is located and acquired by the <Table> tag from the above parsed structure, and the information of the column names is extracted by the <TH> tag as follows:
  • The information of the respective row instances is extracted by the <TD> tag at the same time, for example:
  • HeaderA  HeaderB  HeaderC  HeaderD  HeaderE  HeaderF
    a2       b2       c2       d2       e2       f2
    a9       b9       c9       d9       e9       f9
  • Unrelated table columns at the end can be omitted. See Table 4 for the exemplary snippet in the table style.
  • HeaderA  HeaderB  HeaderC  HeaderD
    a2       b2       c2       d2
    a9       b9       c9       d9
  • FIGS. 5 and 6 show webpage 2 (Page 2) and webpage 3 (Page 3), respectively, which include the following table information in the same website.
  • The inquired keywords are a2, b2, b9, h2 and j9.
  • Webpage 1 has a table matching part of the keywords: a2, b2 and b9.
  • Webpage 2 has a table matching part of the keywords: a2 and h2.
  • Webpage 3 has a table matching part of the keywords: b2, b9 and j9.
  • None of the pages alone matches all of the inquired keywords. Referring to Table 5, the snippet in the table style crossing pages obtained by the method provided by the present application is shown exemplarily.
  • the parts from different webpages can be visually differentiated in the generated snippet in the table style with different format information, and the user can click the corresponding part and jump to the source webpage to browse the information.
  • Each block in the flowcharts or block diagrams may represent a module, program segment, or part of code, which includes one or more executable instructions for implementing the specified logic function(s).
  • the functions noted in the block can also occur in an order other than as noted in the drawings. For example, two blocks consecutively shown may, in fact, be performed substantially in parallel, or sometimes they can be performed in a reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts and combinations of blocks in the block diagrams and/or flowcharts can be implemented by using a special purpose hardware-based system that executes the specified functions or operations, or by using a combination of a special purpose hardware and computer instructions.

Abstract

A method and apparatus for generating an extended page snippet in a search engine. The method includes: retrieving and returning an associated table webpage having a table related to an inquired keyword; obtaining a parsed result of the table in the associated table webpage, and extracting column names and respective row instances on the basis of the parsed result; determining the row instances related to the inquired keyword; and generating a page snippet in a table style in accordance with the column names and the relative row instances. The page snippet in the table style can be generated by using a solution of the present invention.

Description

CROSS REFERENCE TO RELATED APPLICATION
This application claims priority under 35 U.S.C. 119 from Chinese Application 201110294672.4, filed Sep. 30, 2011, the entire contents of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to the technical field of generating an extended page snippet of a search result in a search engine, and particularly to a method and apparatus for generating a page snippet in table style.
2. Description of the Related Art
As the Internet business continuously grows, various existing search engines have become indispensable tools that people use to find network resources of interest, for example webpages.
Generally, a search engine operates in the following manner: once a user submits an inquiry through a client, the search engine returns the retrieved webpages to the user through a search result page. One important object of the search engine is to provide the set of links desired by the user for a specific search inquiry, and another is to inform the user of the content associated with each link clearly and quickly. Therefore, when the search result is returned, besides a title and a uniform resource locator (URL) of the webpage, the search result page also contains a short text description related to the webpage. This short text description is usually referred to as a page snippet. In general, the search engine extracts the page snippet from the webpage by extracting and combining text segments that include a keyword involved in the inquiry. In the search result page, the search engine differentiates the display of the inquired keyword from other text in the page snippet by various means, such as highlighting, underlining, a different font, and the like, in order to draw the user's attention and help the user determine whether to click the webpage. The page snippet in the prior art reflects a correlation between the webpage and the inquiry to a certain extent. However, the page snippet in the prior art consists only of the text segments containing the inquired keyword, and the selection of a text segment does not take account of the content other than the keyword in the text segment. It also does not take account of the table format information of the text segment.
However, a table is an important data source. Some widely used data types suited to presentation in a table are as follows. One is the traditional Web table type of data, for example, information such as members, companies, situations, merchandise, movies, and music, including both bordered tables and non-bordered tables. In addition, the application of business intelligence (BI) causes a large amount of enterprise data to be generated in the form of reports (in formats such as Web reports, PDF, Excel®, Word and the like), and many enterprise-level BI analysis and presentation tools, such as IBM Cognos® and the like, generate and publish many reports. There is a strong search demand for such massive data in an enterprise or on the Internet. Moreover, on the basis of file parsing tools, various mainstream search engines have already brought documents in Excel, Word and similar formats into the scope of retrieval.
In order to improve the user experience, the prior art also provides a search result preview function which may preview webpage information in the manner of a picture. In the field of increasingly mature search engine technology, the space for modifying is getting smaller and smaller, and difficulty in improvement and innovation to the search engine is increasing. Therefore, a little modification may mean a great improvement to the user experience. However, the snippet is different from the preview. The preview does not generate a relative segment for a final user's fast understanding on the basis of the inquiry, but simply outputs the content of the original webpage. Whereas the snippet is used for the user to quickly judge the correlation with the inquired word, the preview is used to further judge the correlation after the judgment through the snippet; the stages of using them are different. A display space of the snippet is very narrow and small, while the display space of the preview is very large. The snippet is displayed as default, but the preview is not and is displayed only after a mouse is moved to a particular position (including a title, a snippet, a network address and the like) to trigger the display, and there is also a delay in showing the display (depending on the displayed content and the network speed). Thus, the snippet and the preview are absolutely different technical solutions for those skilled in the art.
Accordingly, with respect to the table data source, the table format information thereof is also an extremely important part, which helps the user quickly understand the search result through the webpage snippet. The search technology needs to be further improved to at least present the table format information in the page snippet to a certain extent.
BRIEF SUMMARY OF THE INVENTION
In order to overcome these deficiencies, the present invention provides a method for generating an extended page snippet in a search engine, comprising: retrieving and returning an associated table webpage having a table related to an inquired keyword; obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result; determining relative row instances related to said inquired keyword; and generating a page snippet in a table style in accordance with said column names and said relative row instances.
According to another aspect, the present invention provides an apparatus for generating an extended page snippet in a search engine, comprising: means for retrieving and returning an associated table webpage having a table related to an inquired keyword; means for obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result; means for determining the relative row instances related to said inquired keyword; means for generating a page snippet in a table style in accordance with said column names and said relative row instances.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
The inventive features regarded as characteristic of the present invention are stated in the appended claims. However, the present invention and the preferable usage modes, objects, features and advantages thereof can be better understood by reading the detailed description of explanatory embodiments below with reference to the appended drawings, wherein:
FIG. 1 shows an exemplary computer system for implementing an embodiment of the present invention;
FIG. 2 shows a method flowchart for generating an extended snippet of a search result of the present application;
FIG. 3 shows a schematic diagram of an apparatus for generating an extended snippet of a search result of the present application;
FIG. 4 shows a schematic diagram of webpage 1 in an embodiment;
FIG. 5 shows a schematic diagram of webpage 2 in an embodiment; and
FIG. 6 shows a schematic diagram of webpage 3 in an embodiment.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Those skilled in the art will appreciate that the present invention can be embodied as a system, a method or a computer program product. Accordingly, the present invention can take any one of the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of a software part and a hardware part, referred to as a “circuit,” a “module,” or a “system” in this document. In addition, the present invention may also take the form of a computer program product embodied in any tangible medium of expression having computer usable non-transient program codes.
Any combination of one or more computer readable medium(s) can be used. The computer readable medium can be a computer readable signal medium or a computer readable storage medium. For example, the computer readable storage medium can include, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semi-conductive system, apparatus, device or propagation medium, or any appropriate combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination thereof. In the context of this document, the computer readable storage medium can be any tangible medium containing or storing a program for use by or in connection with an instruction executing system, apparatus or device.
The computer readable signal medium can include, for example, a data signal propagated in a base band or as part of a carrier wave, which carries the computer readable program codes. Such a propagated signal can adopt any appropriate form including, but not limited to, an electromagnetic signal, an optical signal or any appropriate combination thereof. The computer readable signal medium can be any computer readable medium other than a computer readable storage medium, which is capable of transmitting, propagating or transporting the program for use by or in connection with an instruction executing system, apparatus or device.
The non-transient program codes contained on the computer readable medium can be transmitted with any appropriate medium including, but not limited to, a wireless medium, a wire, an optical fiber cable, an RF or the like, or any appropriate combination thereof.
Computer non-transient program code for carrying out operations of the present invention can be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The non-transient program code can execute entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer or entirely on a remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN), or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described below with reference to the flowcharts and/or block diagrams of the method, apparatus (system) and computer program product according to the embodiments of the present invention. It is understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can each be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, thereby producing a machine, such that the instructions, which are executed by the computer or the other programmable data processing apparatus, create means for implementing the functions and operations specified in the block or blocks in the flowcharts and/or block diagrams.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means implementing the functions and operations specified in the block or blocks in the flowcharts and/or block diagrams.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operation steps to be performed on the computer or other programmable data processing apparatus to generate a computer implemented process such that the instructions which execute on the computer or other programmable data processing apparatus provide processes for implementing the functions and operations specified in the block or blocks in the flowcharts and/or block diagrams.
FIG. 1 shows a block diagram of an exemplary computer system 100 adapted to implement an embodiment of the present invention. As shown, the computer system 100 can include a CPU (Central Processing Unit) 101, a RAM (Random Access Memory) 102, a ROM (Read Only Memory) 103, a system bus 104, a hard disk controller 105, a keyboard controller 106, a serial interface controller 107, a parallel interface controller 108, a display controller 109, a hard disk 110, a keyboard 111, a serial peripheral device 112, a parallel peripheral device 113, and a display 114. Of these devices, the CPU 101, the RAM 102, the ROM 103, the hard disk controller 105, the keyboard controller 106, the serial interface controller 107, the parallel interface controller 108 and the display controller 109 are coupled with the system bus 104. The hard disk 110 is coupled with the hard disk controller 105, the keyboard 111 is coupled with the keyboard controller 106, the serial peripheral device 112 is coupled with the serial interface controller 107, the parallel peripheral device 113 is coupled with the parallel interface controller 108, and the display 114 is coupled with the display controller 109. It should be understood that the structure block diagram illustrated in FIG. 1 is shown for the purpose of an example only and not as a limitation to the scope of the present invention. In some cases, some devices can be added or removed depending on the specific situation.
Referring to FIG. 2, it shows a method flowchart for generating an extended snippet of a search result in one embodiment, including the steps as follows:
Step 201, retrieving and returning an associated table webpage having a table related to an inquired keyword.
In one embodiment, a webpage series related to the inquired keyword can be retrieved and returned, and the webpage series includes at least one associated table webpage having a table related to the inquired keyword. The inquired keyword can include one or more keywords, the number of which depends on the user's input. The webpage series related to the inquiry can be determined with a technology in the existing search engines. In the associated table webpage, the table related to the inquired keyword means matching part or all of the keywords in the inquired keywords in the table.
Generally, a table consists of three parts, i.e. rows, columns and cells, in which the cells in the first row are table header information, the contents of the respective cells in the first row are the column names of the respective columns, and the data in the cells of each row in the table are a row instance. The table usually adopts a format such as HTML, Excel, Word, PDF, and so on.
Step 202, obtaining a parsed result of the table in the associated table webpage, and extracting the column names and the respective row instances therefrom.
The existing search engines can be classified into two types according to the source of the search results. The first type possesses its own webpage crawling, indexing and retrieving system (Indexer), has an independent “Spider” program, “Crawler” program, or “Robot” program (the three names having the same meaning), and can build a webpage database itself, so that the search result is retrieved directly from its own database. The second type rents the database of another search engine and sorts the search results in its own self-defined format.
Accordingly, the parsed result of the table can also be obtained in a variety of ways. In an embodiment using the first type of search engine as a background, before the retrieving step 201, the tables in all webpages are parsed when the spider program crawls the webpages. The parsed result is stored in a self-built webpage database, and the parsed result of the table is then returned when the webpage series is returned in step 201. In an embodiment using the second type of search engine as the background, however, the tables in the associated table webpage can be parsed in real time, thereby obtaining the parsed result.
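By way of a non-limiting illustration, the two ways of obtaining the parsed result described above can be sketched as follows. The class name ParsedTableStore, its method names, and the choice to fall back to real-time parsing when nothing was stored at crawl time are all assumptions made for this sketch; the embodiments above describe the two ways separately.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical store of parsed tables, keyed by webpage URL (illustration only). */
final class ParsedTableStore {
    // Parsed results saved while the spider program crawls the webpages.
    private final Map<String, String[][]> parsedAtCrawlTime = new HashMap<>();
    // Parser invoked in real time when no stored result exists (assumed fallback).
    private final Function<String, String[][]> realTimeParser;
    private int realTimeParses = 0;

    ParsedTableStore(Function<String, String[][]> realTimeParser) {
        this.realTimeParser = realTimeParser;
    }

    /** First way: store the parsed table before the retrieving step. */
    void storeAtCrawlTime(String url, String[][] parsedTable) {
        parsedAtCrawlTime.put(url, parsedTable);
    }

    /** Returns the stored parse if present; otherwise parses in real time. */
    String[][] parsedResultFor(String url) {
        String[][] stored = parsedAtCrawlTime.get(url);
        if (stored != null) {
            return stored;
        }
        realTimeParses++;
        return realTimeParser.apply(url);
    }

    int realTimeParseCount() {
        return realTimeParses;
    }

    public static void main(String[] args) {
        ParsedTableStore store =
            new ParsedTableStore(url -> new String[][] {{"HeaderA"}, {"parsed in real time"}});
        store.storeAtCrawlTime("http://example.com/page1", new String[][] {{"HeaderA"}, {"a1"}});
        System.out.println(store.parsedResultFor("http://example.com/page1")[1][0]);
        System.out.println(store.parsedResultFor("http://example.com/page2")[1][0]);
    }
}
```

In this sketch the first row of each String[][] holds the column names and the remaining rows hold the row instances.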
In the prior art, a variety of parsers are provided for parsing tables in diverse formats.
Among them, Apache POI (Poor Obfuscation Implementation) is an open-source library of the Apache Software Foundation. It provides an API that lets a Java program read and write Microsoft Office format files. Apache POI is also open-source software used in much search software, and can be used to parse tables in various Office formats in webpages. For example, a table in the Word format can be read and parsed through classes such as Table, TableCell, TableRow, TableIterator, and the like in the POI, specifically exemplified as follows:
   // The classes used here are in org.apache.poi.hwpf.usermodel; range is assumed
   // to be the Range of an opened Word document (for example, HWPFDocument.getRange()).
   TableIterator it = new TableIterator(range);   // iterating all tables in the document
   while (it.hasNext()) {
     Table tb = (Table) it.next();
     // iterating rows, starting from 0 as default
     for (int i = 0; i < tb.numRows(); i++) {
       TableRow tr = tb.getRow(i);
       // iterating columns, starting from 0 as default
       for (int j = 0; j < tr.numCells(); j++) {
         TableCell td = tr.getCell(j);   // obtaining cells
         // obtaining contents of the cells
         for (int k = 0; k < td.numParagraphs(); k++) {
           Paragraph para = td.getParagraph(k);
           String s = para.text();
           System.out.println(s);
         }
       }
     }
   }
For a table in an Excel format, the content of the Excel table can be parsed through classes such as HSSFWorkbook, HSSFSheet, HSSFRow, HSSFCell and the like in the POI, specifically exemplified as follows:
  // The classes used here are in org.apache.poi.hssf.usermodel; is is an InputStream
  // over the Excel file, and currSheet is the index of the sheet being read.
  HSSFWorkbook workbook = new HSSFWorkbook(is);   // if it is an Excel file, then the HSSFWorkbook reader is created
  int numOfSheets = workbook.getNumberOfSheets();   // obtaining the number of sheets
  HSSFSheet sheet = workbook.getSheetAt(currSheet);   // obtaining the current sheet
  int currPosition = 0;   // setting the current row position to zero
  int row = currPosition;
  HSSFRow rowline = sheet.getRow(row);
  int filledColumns = rowline.getLastCellNum();   // obtaining the column number of the current row
  HSSFCell cell = null;
  for (int i = 0; i < filledColumns; i++) {   // circularly traversing all the columns
      cell = rowline.getCell(i);   // obtaining the current cell
  }
There also exists in the prior art a parser for HTML webpages (HTML Parser, hosted on SourceForge at http://htmlparser.sourceforge.net), which is mainly used to modify or extract HTML, provides an interface for doing so, and supports both linear and nested HTML text.
Extracting the column names and the instances from the parsed result of the table can also be done in a variety of ways. In one embodiment, column name information is extracted according to a column name tag, and instance information is extracted according to an instance tag. For example, after an HTML table is parsed, the column names are extracted by the <TH> tag, and the instance information of the respective columns is extracted by the <TD> tag. In another embodiment, for example for a table obtained by the POI, there may be no explicit tag representing the column name. In this case, the first non-null row in the table can be examined. Since the data format of the table header is generally different from the data format of the contents of the respective rows in the table, if the element format of that row is clearly distinguished from that of all the rest of the rows, then that row can be used as the column name row.
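By way of a non-limiting illustration, the check of the first non-null row described above can be sketched as follows. The notion of "element format" used here (each cell classified as empty, numeric or text) and the names HeaderRowGuess, cellKind, signature and findHeaderRow are assumptions made for this sketch; the description does not specify how formats are compared.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical heuristic for tables that carry no explicit column-name tag. */
final class HeaderRowGuess {
    /** Classifies a cell as E (empty), N (numeric) or T (text); an assumed notion of format. */
    static char cellKind(String cell) {
        if (cell == null || cell.trim().isEmpty()) {
            return 'E';
        }
        return cell.trim().matches("[-+]?\\d+(\\.\\d+)?") ? 'N' : 'T';
    }

    /** Format signature of a row, one letter per cell. */
    static String signature(List<String> row) {
        StringBuilder sb = new StringBuilder();
        for (String cell : row) {
            sb.append(cellKind(cell));
        }
        return sb.toString();
    }

    static boolean isBlank(List<String> row) {
        for (String cell : row) {
            if (cellKind(cell) != 'E') {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the index of the first non-blank row if its signature differs from
     * that of every later non-blank row; returns -1 otherwise.
     */
    static int findHeaderRow(List<List<String>> rows) {
        int first = -1;
        for (int i = 0; i < rows.size(); i++) {
            if (!isBlank(rows.get(i))) {
                first = i;
                break;
            }
        }
        if (first < 0) {
            return -1;
        }
        String headerSignature = signature(rows.get(first));
        boolean sawDataRow = false;
        for (int i = first + 1; i < rows.size(); i++) {
            if (isBlank(rows.get(i))) {
                continue;
            }
            sawDataRow = true;
            if (signature(rows.get(i)).equals(headerSignature)) {
                return -1; // the candidate row looks like an ordinary data row
            }
        }
        return sawDataRow ? first : -1;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Name", "Age"),
            Arrays.asList("Ann", "31"),
            Arrays.asList("Bob", "45"));
        System.out.println(findHeaderRow(rows));
    }
}
```

A table whose cells are all text gives no format difference to work with, so this sketch reports no column name row for it; a practical system would likely also consider fonts, cell styles or other format cues.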
Step 203, determining a row instance related to the inquired keyword.
According to a position of the inquired keyword in the table, covered rows are determined, the column name is selected, and the instance rows are selected. For the snippet, the display space is limited, and only a limited number of rows can be displayed. Thus, selection of the relative instance rows is very important. In contrast, since the width requirement of the snippet is not strict, as long as the snippet does not exceed the width of the display screen, the covered column names can be all displayed basically.
As an option, weight information of the inquired keyword can also be taken into account, thereby assisting selection of relative instances and relative column names. The weight information can also be used to adjust the displayed content and order of the instances and the column names so that the most relative instance is displayed in front. The inquired word weight is one factor that needs to be considered when an adjustment to the snippet display order is made, and is usually the information provided by the search engine provider according to the statistics. As an example, different weights can be assigned according to a frequency that the inquired keyword is searched.
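By way of a non-limiting illustration, determining the covered rows and columns from the position of the inquired keywords in the table can be sketched as follows. The names KeywordCoverage, coveredCells and coveredColumns are hypothetical, and case-insensitive substring matching is an assumption of this sketch; the description does not fix a matching rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper that locates inquired keywords inside the row instances. */
final class KeywordCoverage {
    /** Maps each covered row index to the column indexes whose cell contains a keyword. */
    static Map<Integer, Set<Integer>> coveredCells(List<List<String>> rowInstances,
                                                   Collection<String> keywords) {
        Map<Integer, Set<Integer>> covered = new TreeMap<>();
        for (int r = 0; r < rowInstances.size(); r++) {
            List<String> row = rowInstances.get(r);
            for (int c = 0; c < row.size(); c++) {
                String cell = row.get(c).toLowerCase(Locale.ROOT);
                for (String keyword : keywords) {
                    // Assumed rule: a cell matches if it contains the keyword.
                    if (cell.contains(keyword.toLowerCase(Locale.ROOT))) {
                        covered.computeIfAbsent(r, k -> new TreeSet<>()).add(c);
                    }
                }
            }
        }
        return covered;
    }

    /** Union of the covered column indexes over all covered rows. */
    static Set<Integer> coveredColumns(Map<Integer, Set<Integer>> covered) {
        Set<Integer> columns = new TreeSet<>();
        for (Set<Integer> rowColumns : covered.values()) {
            columns.addAll(rowColumns);
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("a1", "b1", "c1"));
        rows.add(Arrays.asList("a2", "b2", "c2"));
        Map<Integer, Set<Integer>> covered = coveredCells(rows, Arrays.asList("a2", "b2"));
        System.out.println(covered + " columns " + coveredColumns(covered));
    }
}
```

Row and column indexes in this sketch are zero-based, so the "second row" of a table corresponds to index 1.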
Step 204, generating the page snippet in a table style in accordance with the column names and the relative row instances.
In one embodiment, if a plurality of inquired keywords appear in step 201, then the step can include: statistically calculating the weights of the inquired keywords in the relative row instances to obtain the correlation of the row instances; and generating the page snippet in the table style in accordance with the column names and at least one relative row instance with the correlation arranged in the top. In the snippet, the selected row instances can be presented according to an original order in the table, or the relative row instances and the corresponding column names can be presented from highest to lowest correlation. Further, the form of the table in the page snippet in the table style can display either a border or no border, but it is at least necessary that the column names in the table correspond to the position of the instances with each other.
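By way of a non-limiting illustration, calculating the correlation of the row instances from keyword weights and generating a snippet from the top-ranked rows can be sketched as follows. The names TableSnippet, correlation, topRows and render are hypothetical; defining the correlation as the matched weight divided by the total weight, matching whole cells exactly, and rendering with " | " separators are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of ranking row instances and rendering a table-style snippet. */
final class TableSnippet {
    /** Weight of the keywords found in the row divided by the total keyword weight. */
    static double correlation(List<String> row, Map<String, Double> keywordWeights) {
        double total = 0.0;
        double matched = 0.0;
        for (Map.Entry<String, Double> entry : keywordWeights.entrySet()) {
            total += entry.getValue();
            if (row.contains(entry.getKey())) { // assumed rule: exact cell match
                matched += entry.getValue();
            }
        }
        return total == 0.0 ? 0.0 : matched / total;
    }

    /** Keeps rows with non-zero correlation, highest first, limited to maxRows. */
    static List<List<String>> topRows(List<List<String>> rows,
                                      Map<String, Double> keywordWeights,
                                      int maxRows) {
        List<List<String>> related = new ArrayList<>();
        for (List<String> row : rows) {
            if (correlation(row, keywordWeights) > 0.0) {
                related.add(row);
            }
        }
        // List.sort is stable, so rows with equal correlation keep their table order.
        related.sort((a, b) -> Double.compare(correlation(b, keywordWeights),
                                              correlation(a, keywordWeights)));
        return new ArrayList<>(related.subList(0, Math.min(maxRows, related.size())));
    }

    /** Renders column names and rows so that each cell stays under its column name. */
    static String render(List<String> columnNames, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder(String.join(" | ", columnNames));
        for (List<String> row : rows) {
            sb.append('\n').append(String.join(" | ", row));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("a2", 1.0);
        weights.put("b2", 1.0);
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("a1", "b1"), Arrays.asList("a2", "b2"));
        System.out.println(render(Arrays.asList("HeaderA", "HeaderB"), topRows(rows, weights, 1)));
    }
}
```

Unequal weights, for example weights derived from how often each keyword is searched, can be supplied through the same map without changing the code.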
Now a flow for generating the snippet in the table style crossing pages is further explained in conjunction with FIG. 2, in which a plurality of associated table webpages are returned in step 201 shown in FIG. 2. As an optional step, in one embodiment, pages having a similarity are aggregated by webpage clustering in accordance with the inquired keyword and the webpage series returned by the search engine. In this embodiment, the plurality of associated table webpages are all in the same cluster. The webpage clustering can adopt well known technical means which will not be stated in more detail herein. In one embodiment, the webpages from the same website domain name in the webpage series are clustered and the plurality of associated table webpages are included in the clustered result. The webpage aggregation is performed on webpages from the same website because tables having a high correlation usually occur in webpages under the same website domain name. Thus the correlation of the aggregation can be increased. For instance, in the information published on a company website, the information of one employee can be published in a plurality of different tables for the same employee. Thus the webpages on which the snippet crossing pages can be performed are found more exactly by aggregating the webpages belonging to the website of the company.
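By way of a non-limiting illustration, clustering the returned webpages by website domain name can be sketched as follows. The names DomainClusters and byHost are hypothetical, and treating the host part of the URL as the "website domain name" is a simplifying assumption of this sketch (it does not group sub-domains of one site together).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: cluster a returned webpage series by host name. */
final class DomainClusters {
    /** Groups URLs by lower-cased host, keeping the order in which hosts first appear. */
    static Map<String, List<String>> byHost(List<String> urls) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String url : urls) {
            // URI.create throws IllegalArgumentException for a malformed URL.
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip URLs without a host part
            }
            clusters.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayList<>())
                    .add(url);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://corp.example.com/staff/page1.html",
            "http://other.example.org/page.html",
            "http://corp.example.com/staff/page2.html");
        System.out.println(byHost(urls));
    }
}
```

The associated table webpages that fall into the same cluster are then the candidates for a snippet crossing pages.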
Furthermore, the page snippet in the table style crossing pages can be generated in the following two embodiments. The page snippet in the table style crossing pages combines the column names and the instances associated with the inquired keyword in the plurality of associated table webpages.
In the first embodiment, after the plurality of associated table webpages are returned in step 201 shown in FIG. 2, a snippet in the table style is generated for each of the associated table webpages through step 202 to step 204. This embodiment includes: combining the snippets in the table style of the plurality of associated table webpages to obtain a combined snippet; determining the relative row instances and the column names in the combined snippet in accordance with the inquired keyword; and outputting the page snippet in the table style crossing pages in accordance with the relative row instances and the column names. This embodiment is explained with reference to Table 1. The snippets in the table style of pages P1 and P3 shown in Table 1 match all of the inquired keywords KEY1, KEY2 and KEY3, and the snippet in the table style of page P2 matches only a part of the inquired keywords, namely KEY1.
TABLE 1
P1
T1 T4 T2 T5 T3
KEY1 KEY2 KEY3
P2
T1 T6 T7 T8
KEY1
P3
T1 T2 T3 T5 T9
KEY1 KEY2 KEY3
After the snippets in the table style of the plurality of pages are combined, the combined snippet in the table style is generated. A blend and a concatenation of the column names and the instances occur in the combination of the snippets in the table style, that is, the parts with the same column name and cell data are blended, and the parts with the different column names and cell data are concatenated. As shown in Table 2, the combined snippet in the table style is illustrated.
TABLE 2
T1 T2 T3 T5 T4 T9 T6 T7 T8
KEY1 KEY2 KEY3
New relative instances and new relative column names are selected in the combined snippet in the table style according to the inquired keyword. After a plurality of snippets in the table style are blended, the size thereof may no longer be adapted to be displayed as the snippet, so it is necessary to further select the relative instances and the relative column names. Moreover, a final snippet in the table style is outputted according to the new relative instances and the new relative column names, and the inquiry result including the webpage series and the page snippet is generated.
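By way of a non-limiting illustration, the blend and concatenation of column names and instances described above can be sketched as follows. The names SnippetCombiner, combineColumns, canBlend and combineRows are hypothetical. The sketch assumes that each row instance is held as a map from column name to cell data, that two rows are blended when they share at least one identical cell and disagree on none, and that combined column names are listed in order of first appearance, which is not necessarily the column order shown in Table 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of combining table-style snippets from several webpages. */
final class SnippetCombiner {
    /** Union of the column names, in order of first appearance. */
    static List<String> combineColumns(List<List<String>> columnLists) {
        LinkedHashSet<String> all = new LinkedHashSet<>();
        for (List<String> columns : columnLists) {
            all.addAll(columns);
        }
        return new ArrayList<>(all);
    }

    /** True if the rows share at least one identical cell and disagree on none. */
    static boolean canBlend(Map<String, String> a, Map<String, String> b) {
        boolean sharesCell = false;
        for (Map.Entry<String, String> entry : a.entrySet()) {
            String other = b.get(entry.getKey());
            if (other == null) {
                continue; // column present in only one of the rows
            }
            if (!other.equals(entry.getValue())) {
                return false; // same column name, different cell data
            }
            sharesCell = true;
        }
        return sharesCell;
    }

    /** Blends a row into the first compatible combined row, else concatenates it. */
    static List<Map<String, String>> combineRows(List<Map<String, String>> rows) {
        List<Map<String, String>> combined = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> target = null;
            for (Map<String, String> existing : combined) {
                if (canBlend(existing, row)) {
                    target = existing;
                    break;
                }
            }
            if (target == null) {
                combined.add(new LinkedHashMap<>(row)); // concatenation
            } else {
                target.putAll(row);                      // blend
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, String> fromPage1 = new LinkedHashMap<>();
        fromPage1.put("Name", "Ann");
        fromPage1.put("Department", "Sales");
        Map<String, String> fromPage2 = new LinkedHashMap<>();
        fromPage2.put("Name", "Ann");
        fromPage2.put("Phone", "123");
        System.out.println(combineRows(Arrays.asList(fromPage1, fromPage2)));
    }
}
```

The employee data in main is invented for this sketch and follows the company-website scenario mentioned earlier; after combination, the relative instances and relative column names would still need to be selected again so that the result fits the snippet display space.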
In another embodiment, after the parsed results of the tables in the associated table webpages are obtained, the parsed results of the tables of the plurality of associated table webpages are combined to obtain a combined parsed result of the table. The row instances and the column names are extracted from the combined parsed result of the table. Thereafter, the page snippet in the table style crossing pages is generated through step 202 to step 204. The parsed results of the plurality of associated table webpages are combined as the new parsed result, then the instances and the column names related to the inquired keyword are further selected, so the instance is selected only once.
By implementing the method flow disclosed above in FIG. 2 in the computer system shown in FIG. 1, the present application is also embodied as an apparatus for generating a page snippet in a table style in a search engine. FIG. 3 shows an architecture schematic diagram of the apparatus, mainly including: a means 301 for retrieving and returning an associated table webpage having a table related to an inquired keyword; a means 302 for obtaining a parsed result of the table in the associated table webpage, and extracting column names and respective row instances on the basis of the parsed result; a means 303 for determining the row instances related to the inquired keyword; and a means 304 for generating a page snippet in a table style in accordance with the column names and the relative row instances.
In an embodiment, the means for retrieving and returning an associated table webpage having a table related to an inquired keyword returns a plurality of associated table webpages.
Further, in an embodiment, the means for obtaining a parsed result of the table in the associated table webpage and extracting column names and respective row instances on the basis of the parsed result includes: a means for combining the parsed results of the tables of the plurality of associated table webpages to obtain a combined parsed result of the table after the parsed results of the tables in the associated table webpages are obtained; and extracting the column names and the respective row instances on the basis of the combined parsed result of the table, wherein the means for generating the page snippet in the table style in accordance with the column names and the relative row instances generates the page snippet in the table style crossing pages.
In an embodiment, further included are: means for combining the page snippets in the table style of the plurality of associated table webpages; means for determining the row instances related to the inquired keyword in the combined page snippet in the table style; and means for generating the page snippet in the table style crossing pages in accordance with the column names and the relative row instances.
In another embodiment, the means for retrieving and returning an associated table webpage having a table related to an inquired keyword clusters the webpages from the same website domain name, and determines the plurality of associated table webpages in the clustering.
In an embodiment, the column names and the instances from different webpages are visually distinguished in the page snippet in the table style crossing pages.
In an embodiment, the inquired keywords are plural in the means for retrieving and returning, and the means for generating the page snippet in the table style in accordance with the column names and the relative row instances include: a means for statistically calculating weights of the inquired keywords in the relative row instances to obtain correlations of the row instances; and a means for generating the page snippet in the table style in accordance with said column names and at least one relative row instance with the correlation arranged in the top.
In an embodiment, the parsed result of the table is a result which is obtained and stored by parsing the tables in all webpages when a spider program crawls the webpages.
In an embodiment, the parsed result of the table is obtained by parsing the table in the associated table webpage in real time.
With the aforesaid solutions, the inquiry is understood on the basis of parsing the table information in documents in various formats. Further, a page snippet in the table style that preserves the table format information is generated. This remedies the deficiency of the prior art, in which only the keyword in the search result is extracted and no table format information is preserved.
Next, the technical solution of the present application is exemplarily explained in one complete embodiment for a webpage. It is assumed that the inquired keywords are a2, b2, and b9, and Page1 is a webpage in the returned webpage series. The page 1 (Page1) shown in FIG. 4 includes a table related to the inquired keywords.
After being parsed by the HTML Parser, the result is:
[−] <html>
  [+] <head>
  [−] <body>
     <h1> Page 1 </h1>
     <h2> This page talks about table 1 </h2>
     <p> bla bla bla . . . </p>
     <h2> The content of the table is shown as below </h2>
    [−] <table border="1">
      [−] <tbody>
       [−] <tr>
          <th> HeaderA </th>
          <th> HeaderB </th>
          <th> HeaderC </th>
          <th> HeaderD </th>
          <th> HeaderE </th>
          <th> HeaderF </th>
         </tr>
       [−] <tr>
          <td> a1 </td>
          <td> b1 </td>
          <td> c1 </td>
          <td> d1 </td>
          <td> e1 </td>
          <td> f1 </td>
         </tr>
       [−] <tr>
          <td> a2 </td>
          <td> b2 </td>
          <td> c2 </td>
          <td> d2 </td>
          <td> e2 </td>
          <td> f2 </td>
         </tr>
       [+] <tr>
       [+] <tr>
       [+] <tr>
       [+] <tr>
       [+] <tr>
       [+] <tr>
       [+] <tr>
The table is located by the <Table> tag in the above parsed structure, and the column names are extracted by the <TH> tags as follows:
[−] <tr>
   <th> HeaderA </th>
   <th> HeaderB </th>
   <th> HeaderC </th>
   <th> HeaderD </th>
   <th> HeaderE </th>
   <th> HeaderF </th>
  </tr>
The respective row instances are extracted by the <TD> tags at the same time, for example:
[−] <tr>
    <td> a1 </td>
    <td> b1 </td>
    <td> c1 </td>
    <td> d1 </td>
    <td> e1 </td>
    <td> f1 </td>
  </tr>
[−] <tr>
    <td> a2 </td>
    <td> b2 </td>
    <td> c2 </td>
    <td> d2 </td>
    <td> e2 </td>
    <td> f2 </td>
  </tr>
[+] <tr>
[+] <tr>
[+] <tr>
[+] <tr>
[+] <tr>
[+] <tr>
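The extraction of column names and row instances described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration only, not the parser used in the embodiment: the class name `TableExtractor`, the helper `parse_table`, and the shortened three-column sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects <th> text as column names and <td> text as row instances."""

    def __init__(self):
        super().__init__()
        self.column_names = []   # text of every <th> cell, in document order
        self.row_instances = []  # one list of <td> texts per data row
        self._cell_tag = None    # "th" or "td" while inside a cell
        self._cell_text = []
        self._row = None         # cells of the <tr> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            text = "".join(self._cell_text).strip()
            if tag == "th":
                self.column_names.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell_tag = None
        elif tag == "tr":
            if self._row:  # a header row holds no <td> cells, so it is skipped
                self.row_instances.append(self._row)
            self._row = None


def parse_table(html):
    """Hypothetical helper: returns (column_names, row_instances)."""
    extractor = TableExtractor()
    extractor.feed(html)
    return extractor.column_names, extractor.row_instances


# Shortened stand-in for the Page1 table (assumed sample, three columns only).
PAGE1 = """
<table border="1"><tbody>
<tr><th> HeaderA </th><th> HeaderB </th><th> HeaderC </th></tr>
<tr><td> a1 </td><td> b1 </td><td> c1 </td></tr>
<tr><td> a2 </td><td> b2 </td><td> c2 </td></tr>
</tbody></table>
"""

columns, rows = parse_table(PAGE1)
print(columns)
print(rows)
```

The sketch handles a single flat table; nested tables or cells spanning several columns would need additional bookkeeping.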
By matching the keywords a2, b2 and b9, it can be determined that the instances of the second row and the ninth row match. Assuming that each keyword is equally important, the correlation of the second row is ⅔ (two of the three keywords) and the correlation of the ninth row is ⅓ (one of the three keywords). It can also be determined that the second row matches in columns a and b, and the ninth row matches in column b. Accordingly, the second row instance and the ninth row instance are relative instances, and HeaderA and HeaderB are relative column names. Table 3 is a schematic of the final generated snippet in the table style.
TABLE 3
HeaderA HeaderB HeaderC HeaderD HeaderE HeaderF
a2 b2 c2 d2 e2 f2
a9 b9 c9 d9 e9 f9
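The correlation calculation behind Table 3 can be reproduced with a short sketch. It assumes equal keyword weights, as in the example, and takes the correlation of a row to be the fraction of inquired keywords found in that row. The function name `rank_rows` and the way the Page1 data is rebuilt here are assumptions made for illustration.

```python
from fractions import Fraction

# Page1 as described in the text: six columns, nine data rows (a1..f1 to a9..f9).
column_names = [f"Header{c}" for c in "ABCDEF"]
row_instances = [[f"{c}{n}" for c in "abcdef"] for n in range(1, 10)]


def rank_rows(column_names, row_instances, keywords):
    """Return (ranked, relative_columns).

    ranked is a list of (correlation, row) pairs for rows that match at
    least one keyword, ordered by descending correlation.
    """
    keywords = set(keywords)
    ranked = []
    relative_columns = []
    for row in row_instances:
        matched = {cell for cell in row if cell in keywords}
        if not matched:
            continue  # not a relative row instance
        # Equal weights: correlation = matched keywords / all inquired keywords.
        ranked.append((Fraction(len(matched), len(keywords)), row))
        for name, cell in zip(column_names, row):
            if cell in keywords and name not in relative_columns:
                relative_columns.append(name)
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked, relative_columns


ranked, relative_columns = rank_rows(column_names, row_instances, ["a2", "b2", "b9"])

# Render the snippet in the table style: header line, then the relative rows.
print(" ".join(column_names))
for correlation, row in ranked:
    print(" ".join(row), f"(correlation {correlation})")
print("relative column names:", relative_columns)
```

With unequal keyword weights, the numerator would become the sum of the weights of the matched keywords and the denominator the sum of all weights.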
If the column space is constrained, in one embodiment, some of the unrelated table columns at the end can be omitted. See Table 4 for an exemplary snippet in the table style.
TABLE 4
HeaderA HeaderB HeaderC HeaderD
a2 b2 c2 d2
a9 b9 c9 d9
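One simple way to obtain Table 4 from Table 3 is to drop columns from the end for as long as the snippet is too wide and the last column is not a relative column. The sketch below assumes this rule; the function name `trim_trailing_columns` and the `max_columns` budget are hypothetical, since the text does not say how the available column space is measured.

```python
def trim_trailing_columns(column_names, rows, relative_columns, max_columns):
    """Drop unrelated columns from the end until the table fits max_columns.

    Trimming stops early if the last remaining column is a relative column,
    so a relative column is never removed.
    """
    keep = len(column_names)
    while keep > max_columns and column_names[keep - 1] not in relative_columns:
        keep -= 1
    return column_names[:keep], [row[:keep] for row in rows]


# Snippet of Table 3: all six columns, rows 2 and 9.
column_names = [f"Header{c}" for c in "ABCDEF"]
snippet_rows = [[f"{c}{n}" for c in "abcdef"] for n in (2, 9)]

trimmed_columns, trimmed_rows = trim_trailing_columns(
    column_names, snippet_rows, ["HeaderA", "HeaderB"], max_columns=4
)

print(" ".join(trimmed_columns))
for row in trimmed_rows:
    print(" ".join(row))
```

With a budget of four columns this yields HeaderA through HeaderD, matching Table 4.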
Next, an example of generating the snippet crossing pages is provided.
FIGS. 5 and 6 show webpage 2 (Page2) and webpage 3 (Page3), respectively, which include table information in the same website. It is assumed that the inquired keywords are a2, b2, b9, h2 and j9. It can be determined that webpage 1 has a table matching the keywords a2, b2 and b9, webpage 2 has a table matching the keywords a2 and h2, and webpage 3 has a table matching the keywords b2, b9 and j9. No single page matches all of the inquired keywords. Table 5 shows an exemplary snippet in the table style crossing pages obtained by the method provided by the present application.
TABLE 5
HeaderA HeaderB HeaderC HeaderD HeaderE HeaderF HeaderG HeaderH HeaderI HeaderJ
a2 b2 c2 d2 e2 f2 g2 h2 i2 j2
a9 b9 c9 d9 e9 f9 g9 h9 i9 j9
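The combination of parsed results across pages can be illustrated as follows. The tables of Page2 and Page3 appear only in FIGS. 5 and 6, so their layouts here are assumptions: Page2 is taken to hold HeaderA, HeaderG and HeaderH, and Page3 to hold HeaderB, HeaderI and HeaderJ, which is consistent with the keywords each page matches. Joining rows on the column names that the tables share is likewise an assumed strategy for this sketch, not a rule stated in the text, and `make_table` and `combine_tables` are hypothetical helpers.

```python
def make_table(letters, n_rows=9):
    """Build (column_names, rows) with cells such as a1, b1, ... (sample data)."""
    names = [f"Header{c.upper()}" for c in letters]
    rows = [[f"{c}{n}" for c in letters] for n in range(1, n_rows + 1)]
    return names, rows


def combine_tables(tables):
    """Left-join each table onto the first one using shared column names."""
    first_names, first_rows = tables[0]
    combined_names = list(first_names)
    records = [dict(zip(first_names, row)) for row in first_rows]
    for names, rows in tables[1:]:
        shared = [n for n in names if n in combined_names]
        new = [n for n in names if n not in combined_names]
        if not shared:
            continue  # nothing to join on; this table is left out
        index = {}
        for row in rows:
            record = dict(zip(names, row))
            index[tuple(record[n] for n in shared)] = record
        for record in records:
            match = index.get(tuple(record[n] for n in shared), {})
            for n in new:
                record[n] = match.get(n, "")  # blank cell when no row matches
        combined_names.extend(new)
    return combined_names, [[r[n] for n in combined_names] for r in records]


page1 = make_table("abcdef")  # layout taken from the Page1 example
page2 = make_table("agh")     # assumed layout of Page2
page3 = make_table("bij")     # assumed layout of Page3

combined_names, combined_rows = combine_tables([page1, page2, page3])

keywords = {"a2", "b2", "b9", "h2", "j9"}
scored = [(sum(cell in keywords for cell in row), row) for row in combined_rows]
snippet_rows = [
    row for hits, row in sorted(scored, key=lambda item: item[0], reverse=True) if hits
]

print(" ".join(combined_names))
for row in snippet_rows:
    print(" ".join(row))
```

Under these assumptions the second row matches three of the five keywords and the ninth row matches two, giving the two rows of Table 5.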
If the column space is also constrained, some of the unrelated table columns can be omitted, as shown schematically in Table 6.
TABLE 6
HeaderA HeaderB HeaderC HeaderH HeaderI HeaderJ
a2 b2 c2 h2 i2 j2
a9 b9 c9 h9 i9 j9
In one embodiment, the parts from different webpages can be visually differentiated in the generated snippet in the table style with different format information, and the user can click the corresponding part and jump to the source webpage to browse the information.
It should be pointed out that the above description is only an example and does not limit the present invention. The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or part of code, which includes one or more executable instructions for implementing the specified logic function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order other than that noted in the drawings. For example, two blocks shown consecutively may, in fact, be performed substantially in parallel, or they can sometimes be performed in reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special purpose hardware-based system that executes the specified functions or operations, or by a combination of special purpose hardware and computer instructions.

Claims (20)

What is claimed is:
1. A method for generating an extended page snippet in a search engine, comprising:
retrieving and returning an associated table webpage having a table related to an inquired keyword;
obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result;
determining relative row instances related to said inquired keyword;
generating a page snippet in a table style in accordance with said column names and said relative row instances;
wherein retrieving and returning an associated table webpage returns a plurality of associated table webpages;
wherein obtaining a parsed result returns a plurality of parsed results; and
wherein obtaining a parsed result and extracting column names and respective row instances further comprises:
combining said plurality of parsed results to obtain a combined parsed result of the table;
extracting said column names and respective row instances based on said combined parsed result of the table; and
generating said page snippet further generates said page snippet in the table style crossing pages.
2. The method according to claim 1, further comprising:
generating a plurality of page snippets;
combining said page snippets in the table style of said plurality of associated table webpages;
determining the row instances related to said inquired keyword in said combined page snippet in the table style; and
generating said page snippet in the table style crossing pages in accordance with said column names and said relative row instances.
3. The method according to claim 2, wherein said column names and row instances from different webpages are visually distinguished in said page snippet in the table style crossing pages.
4. The method according to claim 1, wherein webpages from a same website domain name are clustered, and said plurality of associated table webpages are determined in said clustered result in said step of retrieving and returning.
5. The method according to claim 1, wherein said column names and row instances from different webpages are visually distinguished in said page snippet in the table style crossing pages.
6. The method according to claim 1, wherein said inquired keywords are plural in said step of retrieving and returning, and said step of generating said page snippet further comprises:
statistically calculating weights of said inquired keywords in said relative row instances to obtain a correlation of said row instances; and
generating said page snippet in the table style in accordance with said column names and at least one relative row instance with said correlation arranged in the top.
7. The method according to claim 1, wherein obtaining said parsed result of said table comprises:
parsing said table in a webpage series by snatching said webpage series with a spider program, wherein said webpage series includes said associated table webpage, and
storing said parsed result.
8. The method according to claim 1, wherein said parsed result of the table is obtained by parsing the table in said associated table webpage in real time.
9. An apparatus for generating an extended page snippet in a search engine, comprising:
a memory;
a processor communicatively coupled to the memory; and
an extended page snippet generation module communicatively coupled to the memory and the processor, wherein the extended page snippet generation module is configured to perform the steps of a method comprising:
retrieving and returning an associated table webpage having a table related to an inquired keyword;
obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result;
determining the relative row instances related to said inquired keyword; and
generating a page snippet in a table style in accordance with said column names and said relative row instances;
wherein retrieving and returning an associated table webpage returns a plurality of associated table webpages;
wherein obtaining a parsed result returns a plurality of parsed results; and
wherein obtaining a parsed result and extracting column names and respective row instances further comprises:
combining said plurality of parsed results to obtain a combined parsed result of said table;
extracting said column names and respective row instances based on said combined parsed result of the table; and
generating said page snippet further generates said page snippet in the table style crossing pages.
10. The apparatus according to claim 9, wherein generating a page snippet generates a plurality of page snippets, the method further comprising:
combining said page snippets in the table style of said plurality of associated table webpages;
determining the row instances related to said inquired keyword in said combined page snippet in the table style; and
generating said page snippet in the table style crossing pages in accordance with said column names and said relative row instances.
11. The apparatus according to claim 10, wherein said column names and said row instances from different webpages are visually distinguished in said page snippet in the table style crossing pages.
12. The apparatus according to claim 9, wherein retrieving and returning an associated table webpage clusters webpages from a same website domain name, and determines said plurality of associated table webpages in said clustering.
13. The apparatus according to claim 9, wherein said column names and said row instances from different webpages are visually distinguished in said page snippet in the table style crossing pages.
14. The apparatus according to claim 9, wherein said inquired keywords are plural in said retrieving and returning, and generating said page snippet further comprises:
statistically calculating weights of said inquired keywords in said relative row instances to obtain a correlation of said row instances; and
generating said page snippet in the table style in accordance with said column names and at least one relative row instance with said correlation arranged in the top.
15. The apparatus according to claim 9, wherein obtaining said parsed result of said table comprises:
parsing said table in a webpage series by snatching said webpage series with a spider program, wherein said webpage series includes said associated table webpage, and
storing said parsed results.
16. The apparatus according to claim 9, wherein said parsed result of the table is obtained by parsing said table in said associated table webpage in real time.
17. A non-transitory computer readable storage medium tangibly embodying a computer readable program code having computer readable instructions which, when implemented, cause a computer device to carry out the steps of a method for generating an extended page snippet in a search engine, the method comprising:
retrieving and returning an associated table webpage having a table related to an inquired keyword;
obtaining a parsed result of the table in said associated table webpage, and extracting column names and respective row instances based on said parsed result;
determining relative row instances related to said inquired keyword;
generating a page snippet in a table style in accordance with said column names and said relative row instances;
wherein retrieving and returning an associated table webpage returns a plurality of associated table webpages;
wherein obtaining a parsed result returns a plurality of parsed results; and
wherein obtaining a parsed result and extracting column names and respective row instances further comprises:
combining said plurality of parsed results to obtain a combined parsed result of the table;
extracting said column names and respective row instances based on said combined parsed result of the table; and
generating said page snippet further generates said page snippet in the table style crossing pages.
18. The computer readable storage medium according to claim 17, the method further comprising:
generating a plurality of page snippets;
combining said page snippets in the table style of said plurality of associated table webpages;
determining the row instances related to said inquired keyword in said combined page snippet in the table style; and
generating said page snippet in the table style crossing pages in accordance with said column names and said relative row instances.
19. The computer readable storage medium according to claim 17, wherein webpages from a same website domain name are clustered, and said plurality of associated table webpages are determined in said clustered result in said step of retrieving and returning.
20. The computer readable storage medium according to claim 17, wherein said column names and row instances from different webpages are visually distinguished in said page snippet in the table style crossing pages.
US13/628,077 2011-09-30 2012-09-27 Method and apparatus for generating extended page snippet of search result Expired - Fee Related US8977606B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201110294672.4 2011-09-30
CN201110294672 2011-09-30
CN201110294672.4A CN103034633B (en) 2011-09-30 2011-09-30 Generate the method and device of the result of page searching summary of extension

Publications (2)

Publication Number Publication Date
US20130086035A1 (en) 2013-04-04
US8977606B2 (en) 2015-03-10

Family

ID=47993600

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/628,077 Expired - Fee Related US8977606B2 (en) 2011-09-30 2012-09-27 Method and apparatus for generating extended page snippet of search result

Country Status (2)

Country Link
US (1) US8977606B2 (en)
CN (1) CN103034633B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636400A (en) * 2013-11-15 2015-05-20 腾讯科技(深圳)有限公司 Browser webpage generating method, browser and system
CN104182549A (en) * 2014-09-15 2014-12-03 中国联合网络通信集团有限公司 E-mail digest generation method and device
EP3220284A1 (en) * 2014-11-14 2017-09-20 Fujitsu Limited Data acquisition program, data acquisition method and data acquisition device
CN105808561A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting abstract from webpage
CN105808562A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting webpage abstract based on weight
US10140880B2 (en) * 2015-07-10 2018-11-27 Fujitsu Limited Ranking of segments of learning materials
CN105487746A (en) * 2015-08-28 2016-04-13 小米科技有限责任公司 Search result displaying method and device
CN105447191B (en) * 2015-12-21 2019-12-31 北京奇虎科技有限公司 Intelligent abstract method for providing image-text guiding step and corresponding device
CN105930471A (en) * 2016-04-25 2016-09-07 上海交通大学 Speech abstract generation method and apparatus
CN106095948A (en) * 2016-06-13 2016-11-09 网易(杭州)网络有限公司 The querying method of form, device and equipment
CN106126561A (en) * 2016-06-16 2016-11-16 北京百度网讯科技有限公司 The generation method and device of Search Results summary
CN109670028A (en) * 2018-12-27 2019-04-23 天津字节跳动科技有限公司 Table search method and device in online document
CN109783612B (en) * 2018-12-29 2020-12-29 上海智臻智能网络科技股份有限公司 Report data positioning method and device, storage medium and terminal
CN110334331A (en) * 2019-05-30 2019-10-15 重庆金融资产交易所有限责任公司 Method, apparatus and computer equipment based on order models screening table
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024574A1 (en) * 2000-02-08 2009-01-22 Sybase, Inc. System and Methodology for Extraction and Aggregation of Data from Dynamic Content
US7617176B2 (en) 2004-07-13 2009-11-10 Microsoft Corporation Query-based snippet clustering for search result grouping
US7725813B2 (en) * 2005-03-30 2010-05-25 Arizan Corporation Method for requesting and viewing a preview of a table attachment on a mobile communication device
US20100228744A1 (en) 2009-02-24 2010-09-09 Microsoft Corporation Intelligent enhancement of a search result snippet
US7836009B2 (en) 2004-08-19 2010-11-16 Claria Corporation Method and apparatus for responding to end-user request for information-ranking
US7900181B2 (en) 2005-05-24 2011-03-01 International Business Machines Corporation Systems, methods, and media for block-based assertion generation, qualification and analysis
US20110153577A1 (en) 2004-08-13 2011-06-23 Jeffrey Dean Query Processing System and Method for Use with Tokenspace Repository
US8533761B1 (en) * 2007-04-30 2013-09-10 Google Inc. Aggregating media information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030226147A1 (en) * 2002-05-31 2003-12-04 Richmond Michael S. Associating an electronic program guide (EPG) data base entry and a related internet website
CN101576891A (en) * 2008-05-05 2009-11-11 北京瑞佳晨科技有限公司 Method for analyzing web page form object nodes
CN101615193A (en) * 2009-07-07 2009-12-30 北京大学 A kind of based on the integrated inquiry system of encyclopaedia data extract

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"New snippets for list pages," Krishnan, Raj, Inside Search the Official Google Search Blog, article dated Aug. 26, 2011, retrieved from http://insidesearch.blogspot.com/2011/08/new-snippets-for-list-pages.html on May 9, 2014. *
Turpin, Andrew et al., Fast Generation of Result Snippets in Web Search, SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference, 2007, pp. 127-134.

Also Published As

Publication number Publication date
US20130086035A1 (en) 2013-04-04
CN103034633A (en) 2013-04-10
CN103034633B (en) 2016-08-03

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BAO, SHENG HUA;CHEN, JIAN;SU, ZHONG;AND OTHERS;REEL/FRAME:029034/0104

Effective date: 20120918

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20190310