CN1402156A

CN1402156A - Web site information extracting system and method

Info

Publication number: CN1402156A
Application number: CN 01123635
Authority: CN
Inventors: 黄子癸
Original assignee: Weise Sci & Tech Co Ltd
Current assignee: Weise Sci & Tech Co Ltd
Priority date: 2001-08-22
Filing date: 2001-08-22
Publication date: 2003-03-12

Abstract

A system and method for extracting web site information, used to browse and filter web page data on the World Wide Web. The system comprises a search device that searches web page data and outputs the results to a search page document, a data extraction device that extracts content from that document to form an extraction document, and a storage device that stores the search condition, the filter condition and the search page document. Search data from different web sites can be displayed together by the system.

Description

Web site information extracting system and method

(1) Technical field

The present invention relates to a text retrieval system and method, and in particular to a system and method for retrieving web page text on the World Wide Web.

(2) Background technology

With the development of the Internet, information is transmitted and shared ever more quickly and conveniently. A user only needs an Internet connection to reach the World Wide Web, which is formed by web sites around the globe, and to use the data and information on it. At present, users commonly rely on a search engine or a web page text retrieval system to search for or retrieve the data they need on the World Wide Web.

Please refer to Fig. 1, which shows a schematic flow diagram of how a traditional search engine searches the World Wide Web. First, the user enters the keyword or topic to be searched into the search engine, which then connects to the World Wide Web and begins retrieval. The search engine then lists the web page addresses (URLs) that match the entered keyword or topic, and the user connects to those URLs to browse their content. Although this traditional method is simple, it has the following shortcomings:

(1) Although the search engine has retrieved URLs related to the keyword, the user must still connect to the web page at each URL to see its content. Moreover, web pages often contain data the user does not need, which is inconvenient: the user may have to run a further text search within the page to find the required data.

(2) The user cannot compare the web page data of the retrieved URLs with one another. For example, if the user is searching for the price of a product, the results returned by the search engine in Fig. 1 do not show which web site offers the lowest price.

(3) Summary of the invention

Therefore, the object of the present invention is to provide a web site information extracting system and method. With the system and method of the present invention, the user can retrieve the required data from the World Wide Web, and the system displays all of the retrieved data together, making it easy for the user to browse the retrieval results from different web pages.

In accordance with the object of the present invention, a web site information extracting system is proposed. The system is connected to the World Wide Web through the Internet and is used to browse and filter web page data on the World Wide Web. The system comprises at least a search device, a data extraction device and a storage device. The search device is connected to the World Wide Web through the Internet; it searches web page data on the World Wide Web according to a Web search condition set by the user and outputs the search results to a search page document. The data extraction device receives the search page document and, according to a web page filter condition set by the user, extracts the content of the search page document to form an extraction document. The storage device stores the Web search condition, the web page filter condition, the search page document and the extraction document.

The data extraction device further comprises a field extraction unit, a tag deletion unit and a paragraph extraction unit. The field extraction unit extracts the specified field data from the search page document. The tag deletion unit deletes all web display control tags from the search page document. The paragraph extraction unit deletes or keeps whole paragraphs of the search page document, and can also delete specified text from the search page document.

In accordance with the object of the present invention, a web site information extracting method is also proposed, with which a user browses and filters web page data on the World Wide Web. In the method, the user first sets a Web search condition and a web page filter condition. Web page data on the World Wide Web is then searched according to the Web search condition, and the search results are output to a search page document. Next, the content of the search page document is extracted according to the web page filter condition to form an extraction document.

The step of extracting the content of the search page document according to the web page filter condition to form the extraction document further comprises: deleting or keeping the data in the search page document between a paragraph extraction start word and a paragraph extraction end word; extracting the data in the search page document between a field extraction start word and a field extraction end word; and deleting all web display control tags in the search page document.

To make the above objects, features and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings.

(4) Description of drawings

Fig. 1 shows a schematic flow diagram of how a traditional search engine searches the World Wide Web.

Fig. 2 shows the system structure of a web site information extracting system according to a preferred embodiment of the present invention.

Fig. 3 shows a system block diagram of web site information extracting system 201 in Fig. 2.

Fig. 4 shows a system block diagram of data extraction device 303 in Fig. 3.

Fig. 5 shows a schematic flow diagram of how web site information extracting system 201 in Fig. 2 extracts web site information.

Fig. 6 shows a flow chart of the web site information extracting method of web site information extracting system 201 in Fig. 2.

Fig. 7 shows the setting interface in which data extraction setting unit 401 sets paragraph extraction.

Fig. 8 shows the setting interface in which data extraction setting unit 401 sets field extraction.

Fig. 9 shows the setting interface in which data extraction setting unit 401 sets tag deletion.

Fig. 10 shows a flow chart of the sub-steps of step 605 in Fig. 6.

(5) Embodiment

Please refer to Fig. 2, which shows the system structure of a web site information extracting system according to a preferred embodiment of the present invention. In Fig. 2, web site information extracting system 201 is connected to World Wide Web 205 through Internet 203. World Wide Web 205 comprises a plurality of web sites 207. Web site information extracting system 201 lets the user browse and search the web pages of each web site 207 on World Wide Web 205, filters out unneeded data, and extracts the web page data and field data that the user requires.

Next, please refer to Fig. 3, which shows a system block diagram of web site information extracting system 201 in Fig. 2. As shown in Fig. 3, web site information extracting system 201 comprises search device 301, data extraction device 303, storage device 305, search device setting device 307 and monitor 309. Search device setting device 307 lets the user set the Web search condition, which search device 301 uses to decide which web sites' pages need to be searched and which web pages need not be retrieved. Search device 301 is connected to World Wide Web 205 via Internet 203, and searches for and extracts the web page data in each web site 207 of World Wide Web 205 that meets the Web search condition. Search device 301 outputs these search results to a search page document, which is stored in storage device 305.

At this point the search page document is raw web page data, containing web display control tags and data the user does not need. Data extraction device 303 extracts the data content or fields the user requires from the search page document according to a web page filter condition set by the user, and stores them as an extraction document. Monitor 309 displays the content of the extraction document. Storage device 305 stores the Web search condition, the web page filter condition, the search page document and the extraction document.

Next, please refer to Fig. 4, which shows a system block diagram of data extraction device 303 in Fig. 3. As shown in Fig. 4, data extraction device 303 comprises data extraction setting unit 401, field extraction unit 403, tag deletion unit 405 and paragraph extraction unit 407. Data extraction setting unit 401 lets the user set the web page filter condition. The web page filter condition may include a field extraction start word, a field extraction end word, a paragraph extraction start word, a paragraph extraction end word and a text to be deleted.

Field extraction unit 403 extracts the data or fields in the search page document that lie between the field extraction start word and the field extraction end word. Tag deletion unit 405 deletes all web display control tags in the search page document. Paragraph extraction unit 407 lets the user choose to delete or keep the data in the search page document between the paragraph extraction start word and the paragraph extraction end word, and can also delete the user-specified text to be deleted from the search page document.

In addition, data extraction setting unit 401 lets the user flexibly set the execution order of field extraction unit 403, tag deletion unit 405 and paragraph extraction unit 407, so that the data the user needs can be extracted reliably.
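The following Python sketch illustrates why a user-settable execution order can matter. It is only an illustration under stated assumptions: the function names, the sample page and the simple tag pattern are hypothetical and are not taken from the patent. When the marker words that delimit a field are themselves tags, running tag deletion before field extraction removes the markers, so the field is no longer found.

```python
import re
from typing import Callable, Dict, List

Step = Callable[[str], str]


def delete_tags(text: str) -> str:
    """Hypothetical stand-in for unit 405: drop anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", text)


def make_field_step(start: str, end: str) -> Step:
    """Hypothetical stand-in for unit 403: keep the text between two marker words."""
    def step(text: str) -> str:
        i = text.find(start)
        if i < 0:
            return ""  # assumption: a missing marker yields no data
        i += len(start)
        j = text.find(end, i)
        return text[i:j] if j >= 0 else ""
    return step


def run_in_order(text: str, steps: Dict[str, Step], order: List[str]) -> str:
    """Apply the named steps in the order chosen by the user."""
    for name in order:
        text = steps[name](text)
    return text


# Hypothetical sample page; the start marker is itself made of tags.
page = "<html><b>Price:</b> <i>199</i> USD</html>"
steps = {
    "field": make_field_step("<b>Price:</b>", "USD"),
    "tags": delete_tags,
}

fields_first = run_in_order(page, steps, ["field", "tags"]).strip()
tags_first = run_in_order(page, steps, ["tags", "field"]).strip()
```

With this sample, extracting the field first and deleting tags afterwards leaves the price, whereas deleting tags first leaves nothing, because the marker no longer exists in the text.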

Please refer to Fig. 5, which shows a schematic flow diagram of how web site information extracting system 201 in Fig. 2 extracts web site information. Suppose, for example, that the user wants to retrieve the price of a PDA product from related web sites. First, search device 301 searches each web site 207 according to the Web search condition and produces a search page document whose content is the raw web page data. Data extraction device 303 then extracts the data the user needs from the search page document and stores it as an extraction document. As shown in Fig. 5, the user can see the products and prices of each related web site directly in the extraction document, without connecting to the address of each web site to view its content.
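As a rough illustration of the PDA example, the Python sketch below lays out already-extracted product and price data from several sites in one listing. The site names, product name, prices and the choice to sort by price are all hypothetical assumptions made for the example; the patent only states that the data is shown together.

```python
from typing import List, Tuple

Row = Tuple[str, str, float]  # (site, product, price); hypothetical record layout


def format_side_by_side(rows: List[Row]) -> str:
    """Return one aligned text line per site, cheapest offer first."""
    if not rows:
        return ""
    ordered = sorted(rows, key=lambda r: r[2])
    w_site = max(len(r[0]) for r in ordered)
    w_prod = max(len(r[1]) for r in ordered)
    lines = [
        f"{site:<{w_site}}  {product:<{w_prod}}  {price:>8.2f}"
        for site, product, price in ordered
    ]
    return "\n".join(lines)


# Hypothetical extraction results from three sites.
rows = [
    ("site-a.example", "PDA X100", 219.0),
    ("site-b.example", "PDA X100", 199.0),
    ("site-c.example", "PDA X100", 205.5),
]
table = format_side_by_side(rows)
print(table)
```

Sorting is one possible design choice for making the lowest price easy to spot; any other ordering would serve the same side-by-side comparison.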

Next, please refer to Fig. 6, which shows a flow chart of the web site information extracting method of web site information extracting system 201 in Fig. 2. In step 601, the user sets the Web search condition in search device setting device 307 and the web page filter condition in data extraction setting unit 401. The Web search condition comprises at least the following settings:

(1) Search address setting: the user sets at least one web address for search device 301 to connect to and search.

(2) Full-text search condition setting: the user sets at least one search keyword, which search device 301 uses to decide whether to extract the content of the web page at an address.

(3) Address search condition setting: the user may optionally set a special word; if search device 301 finds that a web address contains this special word, it decides to extract the content of that web page.

(4) Search URL path setting: the user may optionally set a path keyword; search device 301 checks whether a web address contains this path keyword to decide whether to continue searching the subdirectories of that address.

(5) Account and password setting: the user may optionally set an account and password; when a web address requires an account and password to be viewed, search device 301 logs in with the account and password preset by the user.

(6) Search depth: the user may optionally set the depth to which a web site is searched.
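The settings listed above can be pictured as a small configuration record plus a depth-limited traversal. The Python sketch below is a minimal illustration, not the patent's implementation: the class and field names are hypothetical, the "web" is an in-memory dictionary rather than a network, login with an account and password is not modelled, and treating the keyword test and the address test as alternatives (either one triggers extraction) is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class WebSearchCondition:
    start_urls: List[str]                # (1) addresses to connect to
    keywords: List[str]                  # (2) full-text search keywords
    url_word: Optional[str] = None       # (3) special word looked for in the address
    path_keyword: Optional[str] = None   # (4) controls descent into sub-pages
    max_depth: int = 1                   # (6) search depth


Page = Tuple[str, List[str]]  # (content, outgoing links)


def search(web: Dict[str, Page], cond: WebSearchCondition) -> Dict[str, str]:
    """Breadth-first search over an in-memory 'web', collecting matching pages."""
    found: Dict[str, str] = {}
    seen = set(cond.start_urls)
    queue = deque((url, 0) for url in cond.start_urls)
    while queue:
        url, depth = queue.popleft()
        if url not in web:
            continue
        content, links = web[url]
        text_hit = any(k in content for k in cond.keywords)
        url_hit = cond.url_word is not None and cond.url_word in url
        if text_hit or url_hit:  # assumption: either test is enough to extract
            found[url] = content
        descend = cond.path_keyword is None or cond.path_keyword in url
        if descend and depth < cond.max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return found


# Hypothetical four-page site.
web = {
    "http://shop.example/": ("Welcome", ["http://shop.example/pda/", "http://shop.example/about"]),
    "http://shop.example/pda/": ("PDA model X: 199 USD", ["http://shop.example/pda/specs"]),
    "http://shop.example/about": ("About us", []),
    "http://shop.example/pda/specs": ("PDA specs", []),
}
cond = WebSearchCondition(start_urls=["http://shop.example/"], keywords=["PDA"], max_depth=1)
result = search(web, cond)
```

With a depth of 1 only the page one link away from the start address is collected; raising `max_depth` to 2 would also reach the specification page.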

In addition, the user uses data extraction setting unit 401 to set whether field extraction unit 403, tag deletion unit 405 and paragraph extraction unit 407 are executed, and in what order. This embodiment is described using the order paragraph extraction unit 407, field extraction unit 403, tag deletion unit 405 as an example, but the present invention is not limited to this order. Please also refer to Fig. 7, which shows the setting interface in which data extraction setting unit 401 sets paragraph extraction. With drop-down menu 701 in Fig. 7, the user selects the paragraph extraction, field extraction or tag deletion option, and thereby sets the execution order of paragraph extraction unit 407, field extraction unit 403 and tag deletion unit 405. As shown in Fig. 7, the user uses drop-down menu 701 to set data extraction device 303 to execute paragraph extraction unit 407 first, and can choose whether paragraph extraction unit 407 performs paragraph extraction or string extraction:

(1) Paragraph extraction: the user sets the paragraph extraction start word and the paragraph extraction end word, and uses first option 703 to choose whether the text between the start word and the end word is deleted or kept. The user can additionally use second option 705 to set whether the selected paragraph includes the start word and the end word themselves.

(2) String extraction: the user enters the text to be deleted.
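The two operations of paragraph extraction unit 407 can be sketched in Python as follows. This is a minimal illustration under assumptions: the function and parameter names are hypothetical, only the first occurrence of the marker words is handled, and the text is returned unchanged when a marker is missing. The `keep` flag plays the role of first option 703 and `include_markers` the role of second option 705.

```python
def paragraph_filter(text: str, start: str, end: str,
                     keep: bool = True, include_markers: bool = False) -> str:
    """Keep or delete the span delimited by the start and end marker words."""
    i = text.find(start)
    if i < 0:
        return text  # assumption: no start marker, leave the text untouched
    j = text.find(end, i + len(start))
    if j < 0:
        return text  # assumption: no end marker, leave the text untouched
    if include_markers:
        lo, hi = i, j + len(end)
    else:
        lo, hi = i + len(start), j
    if keep:
        return text[lo:hi]
    return text[:lo] + text[hi:]


def delete_string(text: str, unwanted: str) -> str:
    """Remove every occurrence of the user-specified text to be deleted."""
    return text.replace(unwanted, "")


# Hypothetical sample with [BEGIN] and [END] as the marker words.
sample = "Header [BEGIN]PDA 199 USD[END] Footer"
kept = paragraph_filter(sample, "[BEGIN]", "[END]", keep=True)
removed = paragraph_filter(sample, "[BEGIN]", "[END]", keep=False, include_markers=True)
cleaned = delete_string("PDA 199 USD (ad)", " (ad)")
```

Here `kept` holds only the text between the markers, `removed` holds the text with the whole marked span cut out, and `cleaned` shows the string deletion case.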

Next, please refer to Fig. 8, which shows the setting interface in which data extraction setting unit 401 sets field extraction. In Fig. 8, the user selects field extraction to set data extraction device 303 to execute field extraction unit 403 next. The user can enter one or more pairs of field extraction start word and field extraction end word, so that field extraction unit 403 extracts the field data lying between each start word and its end word.
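Field extraction with several marker pairs might look like the Python sketch below. It is an assumption-laden illustration: the function name, the sample text and the marker words are hypothetical, each pair is matched only at its first occurrence, and an empty string is recorded when a marker is missing.

```python
from typing import List, Tuple


def extract_fields(text: str, markers: List[Tuple[str, str]]) -> List[str]:
    """Return, for each (start, end) pair, the text found between the two markers."""
    values: List[str] = []
    for start, end in markers:
        i = text.find(start)
        if i < 0:
            values.append("")  # assumption: missing start marker gives an empty field
            continue
        i += len(start)
        j = text.find(end, i)
        values.append(text[i:j].strip() if j >= 0 else "")
    return values


# Hypothetical page text and two marker pairs.
page_text = "Product: PDA X100 | Price: 199 USD | Stock: yes"
fields = extract_fields(page_text, [("Product:", "|"), ("Price:", "|")])
```

For this sample, `fields` holds the product name and the price as two separate values, which is the kind of per-field data that can later be shown next to the same fields from other sites.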

Please refer to Fig. 9, which shows the setting interface in which data extraction setting unit 401 sets tag deletion. In Fig. 9, the user selects tag deletion to set data extraction device 303 to execute tag deletion unit 405 as the third step. The user can also choose whether blank lines are deleted.
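A minimal Python sketch of tag deletion with the optional blank-line removal follows. The function name and the regular expression are assumptions made for illustration; the pattern simply removes anything between `<` and `>` and does not handle comments, scripts or a `>` character inside an attribute value, so it is not a full HTML parser.

```python
import re

TAG = re.compile(r"<[^>]*>")  # naive pattern: anything between '<' and '>'


def delete_tags(text: str, drop_blank_lines: bool = False) -> str:
    """Strip display control tags and, if requested, lines left empty afterwards."""
    stripped = TAG.sub("", text)
    if drop_blank_lines:
        lines = [line for line in stripped.splitlines() if line.strip()]
        return "\n".join(lines)
    return stripped


# Hypothetical three-line HTML fragment.
html = "<p>PDA X100</p>\n<br>\n<p>199 USD</p>"
plain = delete_tags(html)
compact = delete_tags(html, drop_blank_lines=True)
```

In this sample the `<br>` line becomes empty once its tag is removed; `plain` keeps that empty line and `compact` drops it.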

Then, in step 603 shown in Fig. 6, search device 301 searches the web page data of each web site 207 on World Wide Web 205 according to the settings in the Web search condition, extracts the web page data that meets the Web search condition and outputs it to the search page document. Step 605 is then performed.

In step 605, data extraction device 303 extracts content from the search page document according to the web page filter condition that has been set, and forms the extraction document. The detailed sub-steps of this step are shown in Fig. 10. In step 1001, paragraph extraction unit 407 deletes or keeps the data in the search page document between the paragraph extraction start word and the paragraph extraction end word, or deletes the text to be deleted that the user has set.

Then, in step 1003, field extraction unit 403 extracts the data in the search page document between the field extraction start word and the field extraction end word. Step 1005 follows, in which tag deletion unit 405 deletes all web display control tags in the search page document. In step 607, monitor 309 displays the content of the extraction document to the user. This completes the web site information extracting method of the present invention.

In the above embodiment, the order paragraph extraction, field extraction, tag deletion is used as an example of the order in which data extraction device 303 extracts content from the search page document to form the extraction document, but the present invention is not limited to this order. The user can set the order freely so that the appropriate data is extracted.

With the setting steps described above, the web site information extracting system and method disclosed in the above embodiment replace the considerable manual workload of searching, extracting and organising data. In addition, for the target data selected for extraction, the extraction flow can be configured so that the data is kept up to date, which makes the system more effective than a general search engine at providing timely data. The present invention also has the following advantages:

(1) The web site information extracting system of the present invention extracts and displays the data the user wants to retrieve and filters out unneeded data, which makes reading convenient and saves the user the time of searching again.

(2) The web site information extracting system of the present invention displays side by side the data from each web site 207 on World Wide Web 205 that meets the user's needs, so that the user can easily compare the relevance and differences of the data from different web pages.

In summary, although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the present invention. Any person of ordinary skill in the art may make various modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention is therefore defined by the appended claims.

Claims

1. A web site information extracting system, connected to the World Wide Web through the Internet and used to browse and filter web page data on the World Wide Web, the web site information extracting system comprising at least:

A search device, connected to the World Wide Web via the Internet, which searches web page data on the World Wide Web according to a Web search condition set by a user and outputs the search results to a search page document;

A data extraction device, which receives the search page document and, according to a web page filter condition set by the user, extracts the content of the search page document to form an extraction document; and

A storage device, which stores the Web search condition, the web page filter condition, the search page document and the extraction document.

2. The system as claimed in claim 1, wherein the system further comprises a monitor, the monitor being used to display the content of the extraction document.

3. The system as claimed in claim 1, wherein the web page filter condition further comprises a field extraction start word, a field extraction end word, a paragraph extraction start word, a paragraph extraction end word and a text to be deleted, and the data extraction device further comprises:

A field extraction unit, which extracts the data in the search page document between the field extraction start word and the field extraction end word;

A tag deletion unit, which deletes all web display control tags in the search page document; and

A paragraph extraction unit, which deletes or keeps the data in the search page document between the paragraph extraction start word and the paragraph extraction end word, and can also delete the text to be deleted from the search page document.

4. The system as claimed in claim 3, wherein the data extraction device further comprises a data extraction setting unit, the data extraction setting unit being used by the user to set the web page filter condition.

5. The system as claimed in claim 1, wherein the system further comprises a search device setting device, the search device setting device being used by the user to set the Web search condition.

6. A web site information extracting method, used by a user to browse and filter web page data on the World Wide Web, the web site information extracting method comprising:

The user sets a Web search condition and a web page filter condition;

According to the Web search condition, web page data on the World Wide Web is searched, and the search results are output to a search page document; and

According to the web page filter condition, the content of the search page document is extracted to form an extraction document.

7. The method as claimed in claim 6, wherein the method further comprises:

Displaying the content of the extraction document.

8. The method as claimed in claim 6, wherein the web page filter condition further comprises a field extraction start word, a field extraction end word, a paragraph extraction start word and a paragraph extraction end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:

Deleting or keeping the data in the search page document between the paragraph extraction start word and the paragraph extraction end word;

Extracting the data in the search page document between the field extraction start word and the field extraction end word; and

Deleting all web display control tags in the search page document.

9. The method as claimed in claim 6, wherein the web page filter condition further comprises a field extraction start word, a field extraction end word and a text to be deleted, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:

Deleting the text to be deleted from the search page document;

10. The method as claimed in claim 6, wherein the web page filter condition further comprises a field extraction start word and a field extraction end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:

11. A computer-readable recording medium containing a program for carrying out a web site information extracting method, wherein the method is used by a user to browse and filter web page data on the World Wide Web, the web site information extracting method comprising:

The user sets a Web search condition and a web page filter condition;

12. The computer-readable recording medium as claimed in claim 11, wherein the method further comprises:

Displaying the content of the extraction document.

13. The computer-readable recording medium as claimed in claim 11, wherein the web page filter condition further comprises a field extraction start word, a field extraction end word, a paragraph extraction start word and a paragraph extraction end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:

14. The computer-readable recording medium as claimed in claim 11, wherein the web page filter condition further comprises a field extraction start word, a field extraction end word and a text to be deleted, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises:

15. The computer-readable recording medium as claimed in claim 11, wherein the web page filter condition further comprises a field extraction start word and a field extraction end word, and the step of extracting the content of the search page document according to the web page filter condition to form an extraction document further comprises: