CN1402156A - Web site information extracting system and method - Google Patents


Info

Publication number
CN1402156A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 01123635
Other languages
Chinese (zh)
Inventor
黄子癸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weise Sci & Tech Co Ltd
Original Assignee
Weise Sci & Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2001-08-22
Filing date
2001-08-22
Publication date
2003-03-12
Application filed by Weise Sci & Tech Co Ltd
Priority to CN 01123635
Publication of CN1402156A
Current legal status: Pending

Abstract

A system and method for extracting Web site information, used to browse and filter web page data on the World Wide Web. The system comprises a search device that searches web page data and outputs the results to a search page document, a data extraction device that extracts content from the search page document to form an extraction document, and a storage device that stores the search condition, the filter condition, and the search page document. Search data from different Web sites can be displayed together by the system.

Description

Web site information extracting system and method
(1) Technical field
The present invention relates to a text retrieval system and method, and in particular to a web page text retrieval system and method for the World Wide Web.
(2) Background
With the development of the Internet, transmitting and sharing information has become ever faster and more convenient. A user only needs an Internet connection to reach the World Wide Web, which is formed by Web sites around the globe, and to use the data and information on it. At present, users often rely on a search engine or a web page text retrieval system to search for or retrieve the data they need on the World Wide Web.
Please refer to Fig. 1, which shows a schematic flow of how a conventional search engine searches the World Wide Web. First, the user enters the keyword or topic to be searched into the search engine, which then connects to the World Wide Web and begins retrieval. The search engine then lists for the user the web page addresses (URLs) that match the entered keyword or topic, and the user connects to those URLs to browse their content. Although this conventional method is simple, it has the following shortcomings:
(1) Although the search engine has retrieved the URLs related to the keyword, the user must still connect to the web page at each URL to see its content. Moreover, web pages often contain data the user does not need, which is very inconvenient: the user may have to run another text search within the page to find the needed data.
(2) The user cannot compare the web page data of the retrieved URLs against one another. For example, if the user is searching for the price of a product, the user cannot tell from the search engine results in Fig. 1 which Web site offers the lowest price.
(3) Summary of the invention
Therefore, the object of the present invention is to provide a Web site information extracting system and method. With the system and method of the present invention, the user can retrieve the needed data from the World Wide Web and have the system display all of the retrieved data, which makes it easy to browse the retrieval results from different web pages.
According to the object of the present invention, a Web site information extracting system is proposed. The system is connected to the World Wide Web via the Internet in order to browse and filter web page data on the World Wide Web. The system comprises at least a search device, a data extraction device, and a storage device. The search device is connected to the World Wide Web via the Internet, searches the web page data on the World Wide Web according to a web search condition set by the user, and outputs the search results to a search page document. The data extraction device receives the search page document, extracts its content according to a web page filter condition set by the user, and forms an extraction document. The storage device stores the web search condition, the web page filter condition, the search page document, and the extraction document.
The data extraction device further comprises a field extraction unit, a tag deletion unit, and a paragraph extraction unit. The field extraction unit extracts the specified field data from the search page document. The tag deletion unit deletes all web page display control tags from the search page document. The paragraph extraction unit deletes or keeps whole paragraphs of the search page document, and can also delete specified text (the "text to be deleted") from the search page document.
According to the object of the present invention, a Web site information extracting method is also proposed, with which the user browses and filters web page data on the World Wide Web. In this method, the user first sets a web search condition and a web page filter condition. The web page data on the World Wide Web is then searched according to the web search condition, and the search results are output to a search page document. Next, the content of the search page document is extracted according to the web page filter condition to form an extraction document.
In this method, the step of extracting the content of the search page document and forming the extraction document according to the web page filter condition further comprises: deleting or keeping the data in the search page document between an extraction paragraph start word and an extraction paragraph end word; extracting the data in the search page document between an extraction field start word and an extraction field end word; and deleting all web page display control tags in the search page document.
To make the above objects, features, and advantages of the present invention easier to understand, a preferred embodiment is described in detail below with reference to the accompanying drawings.
(4) Brief description of the drawings
Fig. 1 is a schematic flow diagram of how a conventional search engine searches the World Wide Web.
Fig. 2 is a system architecture diagram of a Web site information extracting system according to a preferred embodiment of the present invention.
Fig. 3 is a block diagram of the Web site information extracting system 201 of Fig. 2.
Fig. 4 is a block diagram of the data extraction device 303 of Fig. 3.
Fig. 5 is a schematic flow diagram of how the Web site information extracting system 201 of Fig. 2 extracts Web site information.
Fig. 6 is a flowchart of the Web site information extracting method of the Web site information extracting system 201 of Fig. 2.
Fig. 7 is a schematic of the settings interface in which the data extraction setting unit 401 configures paragraph extraction.
Fig. 8 is a schematic of the settings interface in which the data extraction setting unit 401 configures field extraction.
Fig. 9 is a schematic of the settings interface in which the data extraction setting unit 401 configures tag deletion.
Fig. 10 is a flowchart of the sub-steps of step 605 of Fig. 6.
(5) Detailed description of the embodiment
Please refer to Fig. 2, which shows the system architecture of a Web site information extracting system according to a preferred embodiment of the present invention. In Fig. 2, the Web site information extracting system 201 is connected to the World Wide Web 205 via the Internet 203. The World Wide Web 205 comprises a plurality of Web sites 207. The Web site information extracting system 201 lets the user browse and search the web pages of each Web site 207 on the World Wide Web 205, filters out unneeded data, and extracts the web page data and field data the user needs.
Next, please refer to Fig. 3, which shows a block diagram of the Web site information extracting system 201 of Fig. 2. As shown in Fig. 3, the Web site information extracting system 201 comprises a search device 301, a data extraction device 303, a storage device 305, a search device setting device 307, and a monitor 309. The search device setting device 307 lets the user set the web search condition, which the search device 301 uses to decide which Web sites' pages need to be searched and which pages need not be retrieved. The search device 301 is connected to the World Wide Web 205 via the Internet 203 in order to search each Web site 207 of the World Wide Web 205 and extract the web page data that meets the web search condition. The search device 301 outputs these search results to a search page document, which is stored in the storage device 305.
At this point the search page document is raw web page data: it contains web page display control tags and data the user does not need. The data extraction device 303 extracts the data content or fields the user needs from the search page document according to a web page filter condition set by the user, and stores the result as an extraction document. The monitor 309 displays the content of the extraction document. The storage device 305 stores the web search condition, the web page filter condition, the search page document, and the extraction document.
Next, please refer to Fig. 4, which shows a block diagram of the data extraction device 303 of Fig. 3. As shown in Fig. 4, the data extraction device 303 comprises a data extraction setting unit 401, a field extraction unit 403, a tag deletion unit 405, and a paragraph extraction unit 407. The data extraction setting unit 401 lets the user set the web page filter condition. The web page filter condition may include an extraction field start word, an extraction field end word, an extraction paragraph start word, an extraction paragraph end word, and text to be deleted.
The field extraction unit 403 extracts the data in the search page document located between the extraction field start word and the extraction field end word, or at a specified field position. The tag deletion unit 405 deletes all web page display control tags from the search page document. The paragraph extraction unit 407 lets the user choose to delete or keep the data in the search page document between the extraction paragraph start word and the extraction paragraph end word, and can also delete the text to be deleted that the user has set.
In addition, the data extraction setting unit 401 lets the user flexibly set the execution order of the field extraction unit 403, the tag deletion unit 405, and the paragraph extraction unit 407, so that the needed data can be extracted smoothly.
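The effect of a user-selected execution order can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the step functions, the `[list]` and `<b>` markers, and the `run_pipeline` helper are hypothetical stand-ins for units 407, 403, and 405, not the patent's implementation.

```python
import re


def paragraph_step(text):
    # Stand-in for unit 407: keep only the text between two hypothetical markers.
    start, end = text.find("[list]"), text.find("[/list]")
    return text[start + len("[list]"):end] if -1 < start < end else text


def field_step(text):
    # Stand-in for unit 403: take the value between a start word and an end word.
    match = re.search(r"<b>(.*?)</b>", text)
    return match.group(1) if match else text


def tag_step(text):
    # Stand-in for unit 405: strip anything that looks like a markup tag.
    return re.sub(r"<[^>]*>", "", text)


STEPS = {"paragraph": paragraph_step, "field": field_step, "tag": tag_step}


def run_pipeline(text, order):
    """Apply the named steps in the order chosen by the user."""
    for name in order:
        text = STEPS[name](text)
    return text
```

With this sketch, running the tag step first removes the `<b>` markers that the field step relies on, so the two orders give different results. That is one plausible reason for leaving the order to the user.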
Please refer to Fig. 5, which shows a schematic flow of how the Web site information extracting system 201 of Fig. 2 extracts Web site information. Suppose, for example, that the user wants to retrieve the price of a PDA product from related Web sites. First, the search device 301 retrieves from the Web sites 207, according to the web search condition, a search page document whose content is the raw web page data. The data extraction device 303 then extracts the data the user needs from the search page document and stores it as an extraction document. As shown in Fig. 5, the user can see the products and prices of each related Web site directly in the extraction document, without having to connect to each Web site's address to view its content.
Next, please refer to Fig. 6, which shows the flowchart of the Web site information extracting method of the Web site information extracting system 201 of Fig. 2. In step 601, the user sets the web search condition in the search device setting device 307 and the web page filter condition in the data extraction setting unit 401. The web search condition settings include at least:
(1) Search address setting: the user sets at least one web address for the search device 301 to connect to and search.
(2) Full-text search condition setting: the user sets at least one search keyword, which the search device 301 uses to decide whether to extract the content of the web page at an address.
(3) Address search condition setting: the user may optionally set a special word; if the search device 301 finds that a web address contains this special word, it decides to extract that page's content.
(4) URL path search setting: the user may optionally set a path keyword; the search device 301 checks whether a web address contains this path keyword to decide whether to continue searching the subdirectories of that address.
(5) Account and password setting: the user may optionally set an account and password; when a web address requires an account and password to be viewed, the search device 301 logs in with the account and password preset by the user.
(6) Search depth: the user may optionally set the depth to which a Web site is searched.
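The six settings above could be held in a single record. The following Python sketch is one possible arrangement; the class and function names are hypothetical, and the way settings (2) and (3) are combined (either one is enough to extract a page) is an assumption, since the text does not say how they interact.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlparse


@dataclass
class WebSearchCondition:
    """Hypothetical container for the six settings listed above."""
    start_urls: List[str]                          # (1) addresses to connect to
    keywords: List[str]                            # (2) full-text search keywords
    url_word: Optional[str] = None                 # (3) special word in the address
    path_keyword: Optional[str] = None             # (4) keyword in the URL path
    credentials: Optional[Tuple[str, str]] = None  # (5) account and password
    max_depth: int = 1                             # (6) search depth


def should_extract(cond: WebSearchCondition, url: str, page_text: str) -> bool:
    """Settings (2) and (3): assumed here to be alternatives, not both required."""
    if cond.url_word is not None and cond.url_word in url:
        return True
    return any(keyword in page_text for keyword in cond.keywords)


def should_descend(cond: WebSearchCondition, url: str) -> bool:
    """Setting (4): search subdirectories only if the path holds the keyword."""
    if cond.path_keyword is None:
        return True
    return cond.path_keyword in urlparse(url).path
```

Logging in with the stored credentials (setting 5) is left out of the sketch because it depends on how each Web site authenticates.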
In addition, the user uses the data extraction setting unit 401 to set whether the field extraction unit 403, the tag deletion unit 405, and the paragraph extraction unit 407 are executed, and in what order. This embodiment is described using the order paragraph extraction unit 407, field extraction unit 403, tag deletion unit 405 as an example, but the present invention is not limited to this order. Please also refer to Fig. 7, which shows the settings interface in which the data extraction setting unit 401 configures paragraph extraction. The drop-down menu 701 in Fig. 7 lets the user select the paragraph extraction, field extraction, or tag deletion option, thereby setting the execution order of the paragraph extraction unit 407, the field extraction unit 403, and the tag deletion unit 405. As shown in Fig. 7, the user uses the drop-down menu 701 to set the data extraction device 303 to execute the paragraph extraction unit 407 first, and can choose whether the operation of the paragraph extraction unit 407 is paragraph extraction or string extraction:
(1) Paragraph extraction: the user sets the extraction paragraph start word and the extraction paragraph end word, and uses the first option 703 to set whether the text between them is deleted or kept. The user can also use the second option 705 to set whether the selected paragraph includes the extraction paragraph start word and the extraction paragraph end word themselves.
(2) String extraction: the user enters the text to be deleted.
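The two operations of the paragraph extraction unit 407 can be sketched in Python as follows. The function names and parameters are hypothetical; `keep` plays the role of the first option 703 and `include_markers` the role of the second option 705. Returning the text unchanged when a marker is missing is an assumption, as the text does not cover that case.

```python
def paragraph_extract(text, start_word, end_word, keep=True, include_markers=True):
    """Keep or delete the first span between start_word and end_word."""
    start = text.find(start_word)
    if start == -1:
        return text  # assumption: leave the text alone if the start word is absent
    end = text.find(end_word, start + len(start_word))
    if end == -1:
        return text  # assumption: same when the end word is absent
    if include_markers:
        span_start, span_end = start, end + len(end_word)
    else:
        span_start, span_end = start + len(start_word), end
    if keep:
        return text[span_start:span_end]
    return text[:span_start] + text[span_end:]


def delete_text(text, unwanted):
    """String operation: remove every occurrence of the text to be deleted."""
    return text.replace(unwanted, "")
```

The sketch handles only the first matching span. Handling repeated paragraphs would need a loop over successive matches.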
Next, please refer to Fig. 8, which shows the settings interface in which the data extraction setting unit 401 configures field extraction. In Fig. 8, the user selects field extraction so that the data extraction device 303 executes the field extraction unit 403 next. The user can enter one or more pairs of extraction field start words and extraction field end words, so that the field extraction unit 403 extracts the field data between each extraction field start word and its extraction field end word.
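Field extraction with several start/end word pairs might look like the Python sketch below. The function name `extract_fields`, the use of `None` for a field that is not found, and the trimming of surrounding whitespace are illustrative assumptions, not details from the patent.

```python
def extract_fields(text, marker_pairs):
    """Return, for each (start_word, end_word) pair, the first value between them."""
    fields = []
    for start_word, end_word in marker_pairs:
        start = text.find(start_word)
        if start == -1:
            fields.append(None)  # assumption: report a missing field as None
            continue
        start += len(start_word)
        end = text.find(end_word, start)
        fields.append(text[start:end].strip() if end != -1 else None)
    return fields
```

Because the markers in this sketch are plain strings that may themselves contain tags, it would normally run before any tag deletion step.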
Please refer to Fig. 9, which shows the settings interface in which the data extraction setting unit 401 configures tag deletion. In Fig. 9, the user selects tag deletion so that the data extraction device 303 executes the tag deletion unit 405 as the third step. Here the user can also choose whether to delete blank lines.
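Tag deletion with the optional blank-line removal can be approximated in Python with a regular expression. This is a simplified sketch: the pattern treats anything between `<` and `>` as a tag and does not handle scripts, comments, or a `>` inside attribute values, for which a real HTML parser would be more robust. The function name and flag are hypothetical.

```python
import re

# Simplified assumption: a tag is anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def delete_tags(text, delete_blank_lines=True):
    """Remove display control tags; optionally drop lines left empty."""
    stripped = TAG_RE.sub("", text)
    if delete_blank_lines:
        lines = [line for line in stripped.splitlines() if line.strip()]
        return "\n".join(lines)
    return stripped
```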
Then, in step 603 shown in Fig. 6, the search device 301 searches the web page data of each Web site 207 on the World Wide Web 205 according to the settings in the web search condition, extracts the web page data that meets the web search condition, and outputs it to the search page document. Step 605 follows.
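Step 603 can be pictured as a depth-limited traversal of linked pages. The Python sketch below uses a small in-memory dictionary, `FAKE_WEB`, as a hypothetical stand-in for the World Wide Web so that it runs without network access. The breadth-first strategy, the function name `search`, and the single-keyword match are assumptions, since the patent does not specify the traversal.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> (page text, linked URLs).
FAKE_WEB = {
    "http://a.example/": ("Welcome", ["http://a.example/pda", "http://a.example/about"]),
    "http://a.example/pda": ("PDA X $199", ["http://a.example/pda/specs"]),
    "http://a.example/about": ("About us", []),
    "http://a.example/pda/specs": ("PDA X specs", []),
}


def search(start_urls, keyword, max_depth, fetch=FAKE_WEB.get):
    """Breadth-first search; returns {url: text} for pages containing keyword."""
    search_page_document = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # unreachable page: skip it
        text, links = page
        if keyword in text:
            search_page_document[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return search_page_document
```

A real search device would replace `fetch` with an HTTP client and a link parser; the `seen` set keeps pages that link to each other from being fetched twice.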
In step 605, the data extraction device 303 extracts content from the search page document according to the web page filter condition that was set, and forms the extraction document. The detailed sub-steps of this step are shown in Fig. 10, the sub-step flowchart of step 605 of Fig. 6. In step 1001, the paragraph extraction unit 407 deletes or keeps the data in the search page document between the extraction paragraph start word and the extraction paragraph end word, or deletes the text to be deleted that the user has set.
Then, in step 1003, the field extraction unit 403 extracts the data in the search page document between the extraction field start word and the extraction field end word. Next, in step 1005, the tag deletion unit 405 deletes all web page display control tags from the search page document. In step 607, the monitor 309 displays the content of the extraction document to the user. This completes the Web site information extracting method of the present invention.
In the embodiment above, the data extraction device 303 forms the extraction document from the search page document in the order paragraph extraction, field extraction, tag deletion, but the present invention is not limited to this order. The user can set the order as needed so that suitable data is extracted.
The Web site information extracting system and method disclosed in the above embodiment replace, through the setting steps described above, the heavy manual workload of searching for, extracting, and organizing data. In addition, for target data that has been selected for extraction, the extraction flow can be configured to keep the data up to date, so the data is kept current more efficiently than with an ordinary search engine. The present invention also has the following advantages:
(1) The Web site information extracting system of the present invention extracts and displays the data the user wants to retrieve and filters out unneeded data, which makes reading convenient and saves the user the time of searching again.
(2) The Web site information extracting system of the present invention displays side by side the data that meets the user's needs from each Web site 207 on the World Wide Web 205, making it easy for the user to compare the relevance and differences of data from different web pages.
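The side-by-side display in advantage (2) can be sketched as a small formatting routine in Python. The row layout of site, product, and price, the column widths, and the cheapest-first sorting are all assumptions made to illustrate the price comparison example; the patent does not describe how the monitor arranges the data.

```python
def show_side_by_side(rows):
    """Format (site, product, price) rows as aligned columns, cheapest first."""
    ordered = sorted(rows, key=lambda row: row[2])
    lines = [f"{'Site':<16}{'Product':<10}{'Price':>8}"]
    for site, product, price in ordered:
        lines.append(f"{site:<16}{product:<10}{price:>8.2f}")
    return "\n".join(lines)
```

For instance, `print(show_side_by_side([("shop-a.example", "PDA X", 199.0), ("shop-b.example", "PDA X", 189.5)]))` lists the `shop-b.example` row first because its price is lower.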
In summary, although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the present invention. Any person of ordinary skill in the art may make various modifications without departing from the spirit and scope of the present invention, so the scope of protection of the present invention is defined by the appended claims.

Claims (15)

1. A Web site information extracting system, connected to the World Wide Web via the Internet for browsing and filtering web page data of the World Wide Web, the Web site information extracting system comprising at least:
a search device, connected to the World Wide Web via the Internet, for searching the web page data of the World Wide Web according to a web search condition set by a user and outputting the search results to a search page document;
a data extraction device for receiving the search page document, extracting the content of the search page document according to a web page filter condition set by the user, and forming an extraction document; and
a storage device for storing the web search condition, the web page filter condition, the search page document, and the extraction document.
2. The system of claim 1, further comprising a monitor for displaying the content of the extraction document.
3. The system of claim 1, wherein the web page filter condition further comprises an extraction field start word, an extraction field end word, an extraction paragraph start word, an extraction paragraph end word, and text to be deleted, and the data extraction device further comprises:
a field extraction unit for extracting the data in the search page document between the extraction field start word and the extraction field end word;
a tag deletion unit for deleting all web page display control tags in the search page document; and
a paragraph extraction unit for deleting or keeping the data in the search page document between the extraction paragraph start word and the extraction paragraph end word, and also usable for deleting the text to be deleted in the search page document.
4. The system of claim 3, wherein the data extraction device further comprises a data extraction setting unit with which the user sets the web page filter condition.
5. The system of claim 1, further comprising a search device setting device with which the user sets the web search condition.
6. A Web site information extracting method for a user to browse and filter web page data of the World Wide Web, the method comprising:
the user setting a web search condition and a web page filter condition;
searching the web page data of the World Wide Web according to the web search condition, and outputting the search results to a search page document; and
extracting the content of the search page document according to the web page filter condition to form an extraction document.
7. The method of claim 6, further comprising:
displaying the content of the extraction document.
8. The method of claim 6, wherein the web page filter condition further comprises an extraction field start word, an extraction field end word, an extraction paragraph start word, and an extraction paragraph end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
deleting or keeping the data in the search page document between the extraction paragraph start word and the extraction paragraph end word;
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
9. The method of claim 6, wherein the web page filter condition further comprises an extraction field start word, an extraction field end word, and text to be deleted, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
deleting the text to be deleted in the search page document;
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
10. The method of claim 6, wherein the web page filter condition further comprises an extraction field start word and an extraction field end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
11. A computer-readable recording medium comprising a program for executing a Web site information extracting method, wherein the method is for a user to browse and filter web page data of the World Wide Web, the method comprising:
the user setting a web search condition and a web page filter condition;
searching the web page data of the World Wide Web according to the web search condition, and outputting the search results to a search page document; and
extracting the content of the search page document according to the web page filter condition to form an extraction document.
12. The computer-readable recording medium of claim 11, wherein the method further comprises:
displaying the content of the extraction document.
13. The computer-readable recording medium of claim 11, wherein the web page filter condition further comprises an extraction field start word, an extraction field end word, an extraction paragraph start word, and an extraction paragraph end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
deleting or keeping the data in the search page document between the extraction paragraph start word and the extraction paragraph end word;
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
14. The computer-readable recording medium of claim 11, wherein the web page filter condition further comprises an extraction field start word, an extraction field end word, and text to be deleted, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
deleting the text to be deleted in the search page document;
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
15. The computer-readable recording medium of claim 11, wherein the web page filter condition further comprises an extraction field start word and an extraction field end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:
extracting the data in the search page document between the extraction field start word and the extraction field end word; and
deleting all web page display control tags in the search page document.
CN 01123635 2001-08-22 2001-08-22 Web site information extracting system and method Pending CN1402156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 01123635 CN1402156A (en) 2001-08-22 2001-08-22 Web site information extracting system and method

Publications (1)

Publication Number Publication Date
CN1402156A true CN1402156A (en) 2003-03-12

Family

ID=4665196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 01123635 Pending CN1402156A (en) 2001-08-22 2001-08-22 Web site information extracting system and method

Country Status (1)

Country Link
CN (1) CN1402156A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100432996C (en) * 2004-12-07 2008-11-12 国际商业机器公司 System, method and program for extracting web page core content based on web page layout
CN101310277B (en) * 2005-11-15 2011-10-05 皇家飞利浦电子股份有限公司 Method of obtaining a representation of a text and system
CN100543741C (en) * 2006-02-10 2009-09-23 鸿富锦精密工业(深圳)有限公司 The system and method for automatic download and filtering web page
CN101127038B (en) * 2006-08-18 2012-09-19 鸿富锦精密工业(深圳)有限公司 System and method for downloading website static web page
CN100444174C (en) * 2006-09-25 2008-12-17 北京中搜在线软件有限公司 Method for picking-up, and aggregating micro content of web page, and automatic updating system
CN100458797C (en) * 2007-06-20 2009-02-04 精实万维软件(北京)有限公司 Process for ordering network advertisement
CN101409634B (en) * 2007-10-10 2011-04-13 中国科学院自动化研究所 Quantitative analysis tools and method for internet news influence based on information retrieval
CN101470731B (en) * 2007-12-26 2012-06-20 中国科学院自动化研究所 Personalized web page filtering method
CN101751438B (en) * 2008-12-17 2012-08-22 中国科学院自动化研究所 Theme webpage filter system for driving self-adaption semantics
CN101997915A (en) * 2010-10-29 2011-03-30 中国电信股份有限公司 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN101997915B (en) * 2010-10-29 2014-01-08 中国电信股份有限公司 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN102857885A (en) * 2012-08-17 2013-01-02 东莞宇龙通信科技有限公司 Method and communication terminal for sharing information
CN104065504A (en) * 2013-03-22 2014-09-24 腾讯科技(深圳)有限公司 Information processing method and device
CN104065504B (en) * 2013-03-22 2019-04-12 腾讯科技(深圳)有限公司 The processing method and processing device of information
CN107169076A (en) * 2017-05-10 2017-09-15 北京京东尚科信息技术有限公司 Method, system and the computer-readable recording medium cleaned for 2-D data
CN107169076B (en) * 2017-05-10 2020-06-05 北京京东尚科信息技术有限公司 Method, system and computer readable storage medium for two-dimensional data cleansing

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication