CN101094135A - Method and system for extracting information of content in Internet - Google Patents

Info

Publication number
CN101094135A
Authority
CN
China
Prior art keywords
source code
extraction
address
code
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200610090410
Other languages
Chinese (zh)
Other versions
CN100512181C (en)
Inventor
郭欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CNB2006100904105A (patent CN100512181C)
Publication of CN101094135A
Application granted
Publication of CN100512181C
Legal status: Active
Anticipated expiration

Abstract

The method comprises: a) obtaining the source code of a target web page; b) extracting, from the source code of the target web page, the address links that match a preset extraction condition; c) obtaining, for each extracted address link, the source code of the corresponding content web page; and d) extracting, from the content web page, the content information that matches the preset extraction condition. The system comprises: a setting unit for presetting a target web page and an extraction condition; a first acquisition unit for obtaining the address links from the target web page source code; and a second acquisition unit for obtaining the content information from the content web page source code.

Description

Method and system for extracting Internet content information
Technical field
The present invention relates to the fields of computer and Internet technology, and in particular to a method and a system for extracting Internet content information.
Background art
The Internet has by now grown to hold a massive amount of information content, but that content is scattered across thousands of websites, which makes browsing very inconvenient. Internet content extraction technology is therefore receiving more and more attention: it actively extracts information content and supplies the raw data for services such as content aggregation, content mining, and content publishing.
Extraction of Internet information content is a different concept from a search engine. A search engine takes keywords entered by the user, finds web pages that are related to those keywords, and lists the addresses of the matching pages for the user.
Extraction of Internet information content instead takes policy requirements entered by the user, analyzes the specified websites, finds the information content that meets the requirements, and separately extracts items such as the title, author, source, publication time, body text, and pictures. The extracted information is then delivered through an interface to other applications, such as a publishing system.
One existing technique for extracting information content is based on the Extensible Markup Language (XML) and is known as "RSS". RSS is a format for publishing and presenting content; it contains only data, organized as XML. Fig. 1 is a schematic diagram of how RSS works. Referring to Fig. 1, under RSS an information website must first publish its own RSS feed, that is, provide an XML page that presents a certain number of its latest items, each with a title, author, publication time, summary, body-text address link, and so on. A user then finds a feed of interest in some way and subscribes to it. The feed is subsequently refreshed at intervals to obtain the latest subscribed items, including title, author, publication time, summary, and body-text address link. By clicking the body-text address link, the user jumps to the website that published the feed and reads the original text there.
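To make the RSS workflow described above concrete, the following is a minimal sketch of what a reader does with a feed. The feed text, the URLs, and the function name `read_feed` are invented for illustration and are not part of the patent. Note that each item carries only a link to the body text, not the body text itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; the feed content is invented for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/news/1.htm</link>
      <pubDate>Tue, 30 May 2006 08:00:00 GMT</pubDate>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/news/2.htm</link>
      <pubDate>Tue, 30 May 2006 09:00:00 GMT</pubDate>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return the title, body-text link, and summary of every item in a feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),        # address link only, no body text
            "summary": item.findtext("description"),
        })
    return items


if __name__ == "__main__":
    for entry in read_feed(SAMPLE_RSS):
        print(entry["title"], "->", entry["link"])
```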
To make subscribing to RSS more convenient, many RSS readers have appeared, including desktop tools and Web tools. They store the feeds a user has subscribed to, fetch the latest information content periodically at the interval the user sets, and remind the user to read it.
However, the above prior art has the following technical problems:
1) Not all information websites provide RSS. As Fig. 1 shows, the prior art presupposes that the information website publishes an RSS feed first; only then can a user subscribe. Yet among the massive amount of information on the Internet, websites that publish RSS make up only a small part, and most information websites still rely on conventional web page browsing.
2) RSS content depends on the information website provider. At present, the RSS feeds of many information websites do not cover all the information on the site and offer only a small part of it. Content that is not offered through RSS cannot be obtained with the prior art, which limits the user's initiative in extracting information.
3) Body text cannot be obtained and saved through RSS. Current RSS feeds provide only the body-text address link, not the body text itself. The user must visit the address that the link points to in order to read the text, which makes browsing less convenient.
Summary of the invention
In view of this, the main object of the present invention is to provide a method for extracting Internet content information that lets a user actively extract the information content he or she needs from any information website on the Internet, without passively depending on whether the website publishes RSS or on what its RSS contains, so that richer and more detailed information content can be extracted from a wider range of information sources.
Another object of the present invention is to provide a system for extracting Internet content information that offers the same benefits: the user can actively extract the needed information content from any information website on the Internet, without depending on whether the website publishes RSS or on what its RSS contains, and can extract richer and more detailed information content from a wider range of sources.
To achieve the above objects, the main technical solution of the present invention is as follows:
A method for extracting Internet content information, the method comprising:
A. obtaining the source code of a target web page;
B. extracting, from the source code of the target web page, the address links that match the predetermined extraction conditions;
C. obtaining the source code of the content page corresponding to each address link successfully extracted in step B;
D. extracting, from the content page source code obtained in step C, the content information that meets the predetermined extraction conditions.
Preferably, steps A to D are executed cyclically for a given target web page, and step B further comprises: filtering out address links that were successfully extracted in an earlier cycle, and filtering out address links whose extraction has failed and whose accumulated number of attempts exceeds a preset extraction count.
Preferably, the predetermined extraction conditions of step D comprise a matching condition and a filter condition, and step D specifically comprises: first extracting from the content page source code the content information that matches the matching condition, and then filtering the matched content information according to the filter condition.
Preferably, the matching in step B is regular expression matching, and the matching in step D is regular expression matching, context matching, or both.
Preferably, when more than one identical item of content information is matched during matching, only the first matched item is extracted.
Preferably, the filter condition comprises: a character string to be filtered out, and a flag indicating whether HTML tags are to be filtered out.
Preferably, the predetermined extraction conditions of step D comprise a flag indicating that JS code and/or advertisement code is to be filtered out, and step D further comprises: first determining whether the content page source code contains JS code and/or advertisement code; if so, filtering out the JS code and/or advertisement code before performing the matching and filtering; if not, performing the matching and filtering directly.
Preferably, step D further comprises: when the content page source code contains picture tags, obtaining the list of picture addresses in the content page source code, requesting each picture address, saving the pictures locally, and renaming the pictures according to a given format.
Preferably, step D further comprises: when the content page source code contains paging tags, obtaining the address links of all paged content, and re-executing steps C and D for the address link of every page other than the first.
A system for extracting Internet content information, the system comprising:
a setting unit, configured to provide the user with an interface for setting the target web page and the predetermined extraction conditions, and to save the settings;
a first acquisition unit, configured to obtain the source code of the target web page set in the setting unit;
a first extraction unit, configured to extract, from the target web page source code obtained by the first acquisition unit, the address links that match the predetermined extraction conditions set in the setting unit;
a second acquisition unit, configured to obtain the source code of the content page corresponding to each address link successfully extracted by the first extraction unit;
a second extraction unit, configured to extract, from the content page source code obtained by the second acquisition unit, the content information that meets the predetermined extraction conditions set in the setting unit.
Preferably, the extraction system performs cyclic processing for a given target web page, and the first extraction unit further comprises a filter unit, configured to filter out address links that were successfully extracted in an earlier cycle and address links whose extraction has failed and whose accumulated number of attempts exceeds a preset extraction count.
Preferably, the predetermined extraction conditions comprise a matching condition and a filter condition, and the second extraction unit further comprises: a matching unit, configured to extract from the content page source code the content information that matches the matching condition; and a filter unit, configured to filter the content information matched by the matching unit according to the filter condition.
Preferably, the second extraction unit further comprises an extended filter unit, configured to determine whether the content page source code contains JS code and/or advertisement code; if so, to filter out the JS code and/or advertisement code and pass the filtered source code to the matching unit; if not, to pass the source code directly to the matching unit.
Preferably, the second extraction unit further comprises a picture processing unit, configured to determine whether the content page source code contains picture tags and, if so, to obtain the list of picture addresses in the content page source code, request each picture address, save the pictures locally, and rename the pictures according to a given format.
Preferably, the second extraction unit further comprises a paging processing unit, configured to determine whether the content page source code contains paging tags and, if so, to obtain the address links of all paged content and send the address link of every page other than the first to the second acquisition unit for processing.
Because the present invention actively obtains the source code of the target web page, extracts the address links in it, then actively obtains the source code behind those links and extracts the required content information from it, it relies on active acquisition, unlike the prior art. It can also provide the user with a setting interface through which the user sets the predetermined conditions as needed. The present invention therefore lets the user actively extract the needed information content from any information website on the Internet, without passively depending on whether the website publishes RSS or on what its RSS contains, so that richer and more detailed information content can be extracted from a wider range of information sources.
The present invention also provides an incremental extraction technique for Internet content information, which reduces repeated and wasted use of processing resources and improves extraction efficiency.
The present invention can also obtain the content behind each information link of an information website, so that the content information can be saved locally for the user to access, which makes browsing more convenient.
While extracting information, the present invention can filter out interfering material such as JS code and advertisement code, overcoming the drawback that a user who obtains information content directly from an information website is forced to receive the excessive interfering material it contains.
The present invention also provides an effective picture localization technique, which helps speed up the display of pictures, and a technique for extracting paged content, which makes it possible to extract content information that spans several associated web pages.
Brief description of the drawings
Fig. 1 is a schematic diagram of how RSS works;
Fig. 2 is a structural diagram of the system for extracting Internet content information according to the present invention;
Fig. 3 is a flow chart of the method for extracting Internet content information according to the present invention;
Fig. 4 is a detailed flow chart of obtaining the content page source code corresponding to an address link and extracting from it the content information that meets the predetermined extraction conditions.
Detailed description of the embodiments
The present invention is described in further detail below with reference to specific embodiments and the drawings.
The core idea of the present invention is to actively obtain the source code of the target web page, extract the address links in it, then actively obtain the source code behind those links and extract the required content information from it.
Fig. 2 is a structural diagram of the system for extracting Internet content information according to the present invention. Referring to Fig. 2, the extraction system 21 comprises:
A setting unit 201, configured to provide the user with an interface for setting the target web page and the predetermined extraction conditions, and to save the settings. Through the setting interface the user can specify the target web page of the information website to be visited (generally an index page) and customize the predetermined extraction conditions for that page and for the pages that its index addresses point to.
A first acquisition unit 202, connected to the setting unit 201 and configured to obtain, from the target information website, the source code of the target web page set in the setting unit.
A first extraction unit 203, connected to the setting unit 201 and the first acquisition unit 202 and configured to extract, from the target web page source code obtained by the first acquisition unit 202, the address links that match the predetermined extraction conditions set in the setting unit 201.
A second acquisition unit 204, connected to the first extraction unit 203 and configured to obtain from the target information website the content page source code corresponding to each address link successfully extracted by the first extraction unit 203.
A second extraction unit 205, connected to the setting unit 201 and the second acquisition unit 204 and configured to extract, from the content page source code obtained by the second acquisition unit 204, the content information that meets the predetermined extraction conditions set in the setting unit 201.
The extraction system 21 of the present invention can be deployed on a standalone server, independent of the information websites, and can therefore actively extract the information content the user needs.
Fig. 3 is a flow chart of the method for extracting Internet content information according to the present invention. Referring to Fig. 3, the flow comprises:
Step 301: obtain the source code of the target web page (generally an index page). The source code is HyperText Markup Language (HTML) source code. Because the HTML source code of a Web page is public, any request for the page can obtain it; the HTML source code of the target page can be obtained through the Hypertext Transfer Protocol (HTTP).
Step 302: extract from the source code of the target web page the address links that match the predetermined extraction conditions. The extraction condition here is a regular expression, such as "http://www.xinahuanet.com/news/[0-9]{8}_content.htm". The list of address links matched by this regular expression is obtained from the HTML source code of the target page. The list contains all the information content address links in the target web page, including address links that have already been extracted.
Step 303: obtain the content page source code corresponding to each address link successfully extracted in step 302.
Step 304: extract, from the content page source code obtained in step 303, the content information that meets the predetermined extraction conditions.
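Steps 301 to 304 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory `FAKE_WEB` dictionary stands in for real HTTP requests, and the URLs, the link regular expression, and all function names are hypothetical.

```python
import re

# Hypothetical in-memory "web": URL -> HTML source. A real system would fetch over HTTP.
FAKE_WEB = {
    "http://example.com/index.htm": (
        '<a href="http://example.com/news/20060530_content.htm">A</a>'
        '<a href="http://example.com/about.htm">About</a>'
    ),
    "http://example.com/news/20060530_content.htm": "<h1>Sample title</h1><p>Body</p>",
}

# Hypothetical link condition in the style of the regular expression in step 302.
LINK_REGEX = r"http://example\.com/news/[0-9]{8}_content\.htm"


def get_source(url):
    """Steps 301 and 303: obtain the source code for a URL (stubbed here)."""
    return FAKE_WEB[url]


def extract_links(index_html, link_regex):
    """Step 302: address links that match the predetermined regular expression."""
    return re.findall(link_regex, index_html)


def extract_title(page_html):
    """Step 304, reduced to a single field for brevity."""
    match = re.search(r"<h1>(.*?)</h1>", page_html, re.S)
    return match.group(1) if match else None


def run(index_url, link_regex):
    """Run steps 301 to 304 once and return {content page URL: extracted title}."""
    results = {}
    for link in extract_links(get_source(index_url), link_regex):
        results[link] = extract_title(get_source(link))
    return results


if __name__ == "__main__":
    print(run("http://example.com/index.htm", LINK_REGEX))
```

Only the link that matches the regular expression is followed; the "About" link on the index page is ignored.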
The present invention uses active extraction. The user predefines a set of extraction conditions in a fixed format and, following that format, customizes the extraction conditions for each information website. Using these conditions, a background program cyclically and incrementally extracts the latest information content of those websites, including title, author, source, publication time, body text, pictures, and so on. Incremental extraction means that only newly added content is extracted; content that has already been extracted is not extracted again.
The predetermined extraction conditions comprise the following:
1) The index address of the target web page, for example the home page address of a channel of an information website. In step 301 the source code of the target web page is obtained over HTTP from this address.
2) A regular expression that matches the information content address links on the target page. The extraction condition in step 302 is this regular expression.
3) The extraction conditions for extracting the content of each content page, that is, the predetermined extraction conditions in step 304.
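Item 1) relies on fetching a page over HTTP from its index address. A minimal sketch of that fetch, assuming Python's standard `urllib.request` and a page encoded in GB2312 (the encoding listed in Table 1), might look like this. The function names and the `fake_opener` stand-in are hypothetical; the stand-in only exists so the sketch can run without network access.

```python
import io
import urllib.request


def fetch_source(url, encoding="gb2312", opener=urllib.request.urlopen):
    """Request a page over HTTP and decode it with the configured encoding."""
    with opener(url) as response:
        raw = response.read()
    # errors="replace" keeps going if the page contains bytes outside the encoding.
    return raw.decode(encoding, errors="replace")


def fake_opener(url):
    """Offline stand-in for urlopen: returns a canned GB2312-encoded page."""
    return io.BytesIO("<title>资讯</title>".encode("gb2312"))


if __name__ == "__main__":
    # Uses the stand-in; drop the opener argument to perform a real request.
    print(fetch_source("http://example.com/index.htm", opener=fake_opener))
```

Passing the opener as a parameter is a design choice made here for testability; it is not something the patent describes.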
A concrete example of the extraction conditions used to extract the content of each content page is given below. Table 1 is a sample of such extraction conditions. As Table 1 shows, the extraction conditions define a matching condition and a filter condition for each part of the content to be extracted. For example, the matching condition may consist of a match string and a match mode, and the filter condition may consist of a filter string and a flag indicating whether HTML tags are filtered out. The user sets the specific matching and filter conditions as needed.
Setting                      Value
Title match string           class='txt 18'height='50'>|</td>
Title filter string          (none)
Title match mode             Context tags
Title: filter HTML           No
Source match string          Source:|</td>
Source filter string         (none)
Source match mode            Context tags
Source: filter HTML          Yes
Time match string            [0-9]{4}-[0-9]{2}-[0-9]{2}.*[0-9]{2}:[0-9]{2}|[0-9]{4}年[0-9]{2}月[0-9]{2}日.*[0-9]{2}:[0-9]{2}
Time filter string           (none)
Time match mode              Regular expression
Time: filter HTML            No
Category match string        Homepage.*</a>
Category filter string       Homepage
Category match mode          Regular expression
Category: filter HTML        Yes
Body match string            <td class="p1">|<table width="
Body filter string           (none)
Body match mode              Context tags
Body: filter HTML            Yes
Advertisement begin tag      <!--NEWSZW_HZH_BEGIN-->
Advertisement end tag        <!--NEWSZW_HZH_END-->
Character encoding           gb2312
Paging regular expression    target=_blank>[0-9]+</a>
Table 1
For each target web page, the extraction system holds a corresponding set of extraction conditions like the one above, stored in the system's database. For each target web page, the background program of the extraction system retrieves these extraction conditions and performs the extraction according to them, that is, it executes steps 301 to 304 above.
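One way to hold such a set of extraction conditions in a program is a small record type. The sketch below is an assumption about representation only: the class names, field names, and the example index address and link expression are hypothetical, while the advertisement tags, paging expression, and encoding are taken from Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class FieldRule:
    """Matching and filter condition for one part of the content (e.g. title)."""
    match_string: str          # a regular expression, or "start|end" context tags
    match_mode: str            # "regex" or "context"
    filter_string: str = ""    # character string to remove after matching
    strip_html: bool = False   # whether HTML tags are filtered out


@dataclass
class ExtractionConditions:
    """All settings stored for one target web page."""
    index_url: str
    link_regex: str
    encoding: str = "gb2312"
    ad_begin: str = ""
    ad_end: str = ""
    paging_regex: str = ""
    fields: dict = field(default_factory=dict)


conditions = ExtractionConditions(
    index_url="http://example.com/index.htm",                        # hypothetical
    link_regex=r"http://example\.com/news/[0-9]{8}_content\.htm",    # hypothetical
    ad_begin="<!--NEWSZW_HZH_BEGIN-->",
    ad_end="<!--NEWSZW_HZH_END-->",
    paging_regex=r"target=_blank>[0-9]+</a>",
    fields={
        "category": FieldRule(r"Homepage.*</a>", "regex", "Homepage", strip_html=True),
        "body": FieldRule('<td class="p1">|<table width="', "context", strip_html=True),
    },
)

if __name__ == "__main__":
    for name, rule in conditions.fields.items():
        print(name, rule.match_mode, rule.strip_html)
```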
Because an information website may update its content at any time, steps 301 to 304 can be executed cyclically for a given target web page at a predetermined period. Step 302 then additionally filters out address links that were successfully extracted in an earlier cycle, as well as address links whose extraction has failed and whose accumulated number of attempts exceeds the preset extraction count. Specifically: for each address link that matches the predetermined extraction conditions, determine whether it was already extracted successfully in an earlier cycle; if so, filter it out, otherwise pass it to step 303 for processing. For an address link that matched but whose extraction failed, determine whether the predetermined number of extraction attempts has been exceeded; if so, filter it out, otherwise increment its actual attempt count and extract it again in the next cycle.
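The incremental filtering described above can be sketched with a set of successfully extracted links and a per-link failure counter. The function names and the in-memory data structures are assumptions (the patent only says the conditions are kept in a database), and this sketch treats a link as exhausted once its failure count reaches the limit.

```python
def select_links(candidates, succeeded, fail_counts, max_attempts):
    """Return the links that still need extraction in this cycle."""
    todo = []
    for link in candidates:
        if link in succeeded:
            continue  # already extracted successfully in an earlier cycle
        if fail_counts.get(link, 0) >= max_attempts:
            continue  # failed too many times; give up on this link
        todo.append(link)
    return todo


def record_result(link, ok, succeeded, fail_counts):
    """Update the bookkeeping after an extraction attempt."""
    if ok:
        succeeded.add(link)
        fail_counts.pop(link, None)
    else:
        fail_counts[link] = fail_counts.get(link, 0) + 1


if __name__ == "__main__":
    done = {"a"}
    failures = {"b": 3, "c": 1}
    print(select_links(["a", "b", "c", "d"], done, failures, max_attempts=3))
```

In the usage above, link "a" is skipped because it already succeeded and "b" because it has used up its attempts, so only "c" and "d" go on to step 303.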
Fig. 4 is a detailed flow chart of obtaining the content page source code corresponding to an address link and extracting from it the content information that meets the predetermined extraction conditions. Referring to Fig. 4, this flow is a specific embodiment of steps 303 and 304 above and comprises:
Step 401: obtain the content page source code, that is, the HTML source code of the information content, corresponding to each information content address link successfully extracted in step 302.
Step 402: this step is optional. The body text sometimes contains JS (JavaScript, a scripting language) code, advertisement code, and/or other interfering code, which must be filtered out first so that it does not interfere with matching the body content. A flag indicating that JS code and/or advertisement code is to be filtered out, and/or other filter conditions, can therefore be set in the predetermined extraction conditions (see Table 1), in which case step 402 is executed as follows:
First determine whether the content page source code contains JS code, advertisement code, and/or other information to be filtered. If it does, filter out the JS code, advertisement code, and/or other information and then execute step 403; if it does not, execute step 403 directly.
JS code can be filtered by finding the paired JS tags "<script" and "</script>" in the HTML source code and deleting them together with the content they enclose.
Advertisement code is filtered according to code analysis rules customized for the specific web page. The rules specify an advertisement begin tag and end tag, which locate the advertisement code segment so that it can be deleted.
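A minimal sketch of the two filters in step 402 follows. It uses regular expressions as a simple stand-in for the tag search the text describes; the function names are hypothetical, and regular expressions are not a robust way to process arbitrary HTML. The advertisement tags in the example are the ones listed in Table 1.

```python
import re

# Matches a <script ...> ... </script> pair and everything between the tags.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.I | re.S)


def strip_scripts(html):
    """Delete JS code together with its enclosing script tags."""
    return SCRIPT_RE.sub("", html)


def strip_ads(html, begin_tag, end_tag):
    """Delete every segment between the configured advertisement tags."""
    if not begin_tag or not end_tag:
        return html  # no advertisement rule configured for this page
    pattern = re.escape(begin_tag) + r".*?" + re.escape(end_tag)
    return re.sub(pattern, "", html, flags=re.S)


if __name__ == "__main__":
    page = ('A<script type="text/javascript">x()</script>B'
            "<!--NEWSZW_HZH_BEGIN-->ad<!--NEWSZW_HZH_END-->C")
    cleaned = strip_ads(strip_scripts(page),
                        "<!--NEWSZW_HZH_BEGIN-->", "<!--NEWSZW_HZH_END-->")
    print(cleaned)
```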
Step 403: extract from the content page source code the content information that matches the matching condition (see Table 1). The categories of content information to be matched are any one, or any combination, of: title, author, source, time, body text, and pictures.
The matching here can be regular expression matching, context matching, or both. Regular expression matching means specifying a regular expression in the matching condition and matching the corresponding content with it. For example, if the regular expression for the publication date of an item is "[0-9]{4}-[0-9]{2}-[0-9]{2}", it matches a date such as "2006-05-30". Context matching means specifying, in the matching condition, the context tags that surround the content to be extracted, so that the content between the tags can be extracted. For example, the context tags for the title of an item may be "<h1>|</h1>", where "|" is the separator between the opening and closing context tags.
During matching, when more than one identical item of content information is matched, only the first matched item is extracted.
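The two matching modes, together with the first-match rule, can be sketched as follows. The function names and the sample page are hypothetical. The sketch assumes the context tags themselves do not contain the "|" separator.

```python
import re


def match_regex(source, pattern):
    """Regular expression mode: return the first match, or None."""
    found = re.search(pattern, source, re.S)
    return found.group(0) if found else None


def match_context(source, tags):
    """Context mode: return the text between the first pair of context tags."""
    start, end = tags.split("|", 1)   # "|" separates the opening and closing tag
    begin = source.find(start)
    if begin < 0:
        return None
    begin += len(start)
    stop = source.find(end, begin)
    if stop < 0:
        return None
    return source[begin:stop]


if __name__ == "__main__":
    page = "<h1>Headline</h1><p>Posted 2006-05-30</p><h1>Other</h1>"
    print(match_context(page, "<h1>|</h1>"))
    print(match_regex(page, r"[0-9]{4}-[0-9]{2}-[0-9]{2}"))
```

Both functions stop at the first occurrence, so the second heading in the sample page is never returned.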
Step 404: filter the matched content information according to the filter condition. The extraction conditions, for example those in Table 1, specify for each item of content whether it needs filtering and what is to be filtered out. The filter condition comprises a character string to be filtered out and a flag indicating whether HTML tags are to be filtered out. Each item of content information matched in step 403 can be filtered accordingly, for example by removing certain character strings and/or removing HTML tags.
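Step 404 can be sketched as a single function that applies both parts of the filter condition. The function name and the sample input are hypothetical; the tag-stripping regular expression is a simplification that assumes tags contain no ">" character inside attribute values.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simplified notion of an HTML tag


def apply_filter(text, filter_string="", strip_html=False):
    """Remove the configured character string and, if flagged, all HTML tags."""
    if filter_string:
        text = text.replace(filter_string, "")
    if strip_html:
        text = TAG_RE.sub("", text)
    return text.strip()


if __name__ == "__main__":
    matched = "Homepage - <a href='/n'>News</a>"
    print(apply_filter(matched, filter_string="Homepage", strip_html=True))
```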
Step 405: post-process the filtered content page source code, including:
1) When the content page source code contains picture tags, obtain the list of picture addresses in it, request each picture address, save the pictures locally, and rename them according to a given format. When a picture address is a relative address, combine it with the address of the information content page to obtain the absolute address of the picture, and save the picture locally by requesting that absolute address. Saving pictures locally in this way helps speed up their display.
2) When the content page source code contains paging tags, obtain the address links of all paged content, re-execute steps 303 and 304 for the address link of every page other than the first, and join the results to the first page.
3) Format the extracted content information that meets the predetermined extraction conditions according to a predetermined text format, for example by typesetting it, so that it is easier for the user to browse and read.
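The picture handling in item 1) can be sketched as follows. The sketch covers collecting picture addresses, resolving relative addresses against the content page address, and choosing a local file name; it does not download anything. The function names and the naming format (article id, index, original extension) are assumptions, since the text only says pictures are renamed "according to a given format".

```python
import os
import re
from urllib.parse import urljoin, urlparse

# Captures the src attribute of each img tag (simplified; not a full HTML parser).
IMG_SRC_RE = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']?([^"'\s>]+)""", re.I)


def picture_addresses(page_url, html):
    """Return the absolute address of every picture referenced in the page."""
    return [urljoin(page_url, src) for src in IMG_SRC_RE.findall(html)]


def local_name(article_id, index, url):
    """Hypothetical rename format: <article id>_<two-digit index><extension>."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return f"{article_id}_{index:02d}{ext}"


if __name__ == "__main__":
    page_url = "http://example.com/news/20060530_content.htm"
    html = '<p><img src="/img/a.jpg"><IMG alt="x" src=\'pics/b.png\'></p>'
    for i, address in enumerate(picture_addresses(page_url, html), start=1):
        print(address, "->", local_name("20060530", i, address))
```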
In general, step 302 successfully extracts more than one information content address link, so the flow of Fig. 4 must be executed for each address link.
Through the above flow, the article content and pictures the user needs can be output, so that the user obtains the required information content from the target web page.
Note that the operations of filtering JS code, advertisement code, and/or other interfering code, of processing pictures, and of processing paging have no strict order and can also be performed in parallel.
Corresponding to the above method, the extraction system of the present invention can perform cyclic processing for a given target web page. The first extraction unit 203 further comprises a filter unit, configured to filter out address links that were successfully extracted in an earlier cycle and address links whose extraction has failed and whose accumulated number of attempts exceeds the preset extraction count.
The second extraction unit 205 of the extraction system of the present invention can comprise:
A matching unit, configured to extract from the content page source code the content information that matches the matching condition contained in the predetermined extraction conditions; and a filter unit, configured to filter the content information matched by the matching unit according to the filter condition contained in the predetermined extraction conditions.
An extended filter unit, configured to determine whether the content page source code contains JS code and/or advertisement code; if so, to filter out the JS code and/or advertisement code and pass the filtered source code to the matching unit; if not, to pass the source code directly to the matching unit.
A picture processing unit, configured to determine whether the content page source code contains picture tags and, if so, to obtain the list of picture addresses in the content page source code, request each picture address, save the pictures locally, and rename the pictures according to a given format.
A paging processing unit, configured to determine whether the content page source code contains paging tags and, if so, to obtain the address links of all paged content and send the address link of every page other than the first to the second acquisition unit 204 for processing.
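The work of the paging processing unit can be sketched as follows. The anchor layout assumed by `PAGE_LINK_RE` (an href attribute followed by `target=_blank` and a page number, in the spirit of the paging regular expression in Table 1), the function names, and the URLs are all hypothetical; real pages would need a rule customized to their markup.

```python
import re
from urllib.parse import urljoin

# Assumed paging markup: <a href="..." target=_blank>N</a>
PAGE_LINK_RE = re.compile(r'<a\s+href="([^"]+)"\s+target=_blank>([0-9]+)</a>', re.I)


def other_page_links(page_url, html):
    """Return the address links of all pages except the first, without repeats."""
    links = []
    for href, number in PAGE_LINK_RE.findall(html):
        url = urljoin(page_url, href)
        if number != "1" and url != page_url and url not in links:
            links.append(url)
    return links


def merge_pages(first_body, fetch_body, links):
    """Join the body of the first page with the bodies of the remaining pages."""
    return "\n".join([first_body] + [fetch_body(url) for url in links])


if __name__ == "__main__":
    url = "http://example.com/news/c.htm"
    nav = ('<a href="c.htm" target=_blank>1</a>'
           '<a href="c_2.htm" target=_blank>2</a>'
           '<a href="c_3.htm" target=_blank>3</a>')
    print(other_page_links(url, nav))
```

Here `fetch_body` stands for re-running the acquisition and extraction steps on one paging link; it is passed in as a parameter so the sketch stays independent of how pages are fetched.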
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (15)

1. A method for extracting Internet content information, characterized in that the method comprises:
A. obtaining the source code of a target webpage;
B. extracting, from the source code of the target webpage, address links that match predetermined extraction conditions;
C. obtaining the corresponding content webpage source code according to each address link successfully extracted in step B;
D. extracting, from the content webpage source code obtained in step C, content information that meets the predetermined extraction conditions.
2. The method according to claim 1, characterized in that steps A to D are executed cyclically for a given target webpage, and step B further comprises: filtering out address links that were successfully extracted in an earlier cycle, and filtering out address links whose extraction has failed and whose cumulative number of extraction attempts exceeds a preset number.
3. The method according to claim 1 or 2, characterized in that the predetermined extraction conditions of step D comprise a matching condition and a filter condition; step D specifically comprises: first extracting, from the content webpage source code, content information that matches the matching condition, and then filtering the matched content information according to the filter condition.
4. The method according to claim 3, characterized in that the matching in step B is regular expression matching; the matching in step D is regular expression matching, context matching, or both regular expression matching and context matching.
5. The method according to claim 3, characterized in that, during matching, when more than one piece of identical content information is matched, only the content information matched first is extracted.
6. The method according to claim 3, characterized in that the filter condition comprises: character strings to be filtered out, and a flag indicating whether HTML tags are to be filtered out.
7. The method according to claim 3, characterized in that the predetermined extraction conditions of step D comprise a flag indicating that JS code and/or advertisement code is to be filtered out, and step D further comprises: first judging whether the content webpage source code contains JS code and/or advertisement code; if so, first filtering out the JS code and/or advertisement code and then performing the matching and filtering; if not, performing the matching and filtering directly.
8. The method according to claim 1, characterized in that step D further comprises: when the content webpage source code contains picture tags, obtaining the list of picture addresses in the content webpage source code, requesting each picture address, saving the pictures locally, and renaming the pictures according to a certain format.
9. The method according to claim 1, characterized in that step D further comprises: when the content webpage source code contains a paging tag, obtaining the address links of all paging content, and re-executing step C and step D for the paging address links of all pages other than the current page.
10. A system for extracting Internet content information, characterized in that the system comprises:
a setting unit, configured to provide the user with an interface for setting the target webpage and the predetermined extraction conditions, and to save the settings;
a first acquisition unit, configured to obtain the source code of the target webpage set in the setting unit;
a first extraction unit, configured to extract, from the target webpage source code obtained by the first acquisition unit, address links that match the predetermined extraction conditions set in the setting unit;
a second acquisition unit, configured to obtain the corresponding content webpage source code according to each address link successfully extracted by the first extraction unit;
a second extraction unit, configured to extract, from the content webpage source code obtained by the second acquisition unit, content information that meets the predetermined extraction conditions set in the setting unit.
11. The extraction system according to claim 10, characterized in that the extraction system processes a given target webpage cyclically; the first extraction unit further comprises a filtering unit, configured to filter out address links that were successfully extracted in an earlier cycle, and to filter out address links whose extraction has failed and whose cumulative number of extraction attempts exceeds a preset number.
12. The extraction system according to claim 10, characterized in that the predetermined extraction conditions comprise a matching condition and a filter condition; the second extraction unit further comprises: a matching unit, configured to extract, from the content webpage source code, content information that matches the matching condition; and a filtering unit, configured to filter the content information matched by the matching unit according to the filter condition.
13. The extraction system according to claim 12, characterized in that the second extraction unit further comprises: an extended filtering unit, configured to judge whether the content webpage source code contains JS code and/or advertisement code; if so, to first filter out the JS code and/or advertisement code and then send the filtered source code to the matching unit for processing; if not, to send the source code directly to the matching unit for processing.
14. The extraction system according to claim 10, characterized in that the second extraction unit further comprises: a picture processing unit, configured to judge whether the content webpage source code contains picture tags; if so, to obtain the list of picture addresses in the content webpage source code, request each picture address, save the pictures locally, and rename the pictures according to a certain format.
15. The extraction system according to claim 10, characterized in that the second extraction unit further comprises: a paging processing unit, configured to judge whether the content webpage source code contains a paging tag; if so, to obtain the address links of all paging content and send the paging address links of all pages other than the current page to the second acquisition unit for processing.
CNB2006100904105A 2006-06-23 2006-06-23 Method and system for extracting information of content in Internet Active CN100512181C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100904105A CN100512181C (en) 2006-06-23 2006-06-23 Method and system for extracting information of content in Internet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100904105A CN100512181C (en) 2006-06-23 2006-06-23 Method and system for extracting information of content in Internet

Publications (2)

Publication Number Publication Date
CN101094135A true CN101094135A (en) 2007-12-26
CN100512181C CN100512181C (en) 2009-07-08

Family

ID=38992180

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100904105A Active CN100512181C (en) 2006-06-23 2006-06-23 Method and system for extracting information of content in Internet

Country Status (1)

Country Link
CN (1) CN100512181C (en)

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639772A (en) * 2008-07-31 2010-02-03 国际商业机器公司 Method and device for generating window title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN101937469A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Information capture method of video website
CN101997915A (en) * 2010-10-29 2011-03-30 中国电信股份有限公司 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN102043862A (en) * 2010-12-29 2011-05-04 重庆新媒农信科技有限公司 Directional web data extraction method
CN102073678A (en) * 2010-12-03 2011-05-25 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102193944A (en) * 2010-03-12 2011-09-21 三星电子(中国)研发中心 Method for extracting webpage subject contents
CN101261643B (en) * 2008-05-04 2012-01-11 腾讯科技(深圳)有限公司 Website page information statistical method and apparatus
CN102375857A (en) * 2010-08-24 2012-03-14 腾讯科技(深圳)有限公司 Search method and device
CN102567521A (en) * 2011-12-29 2012-07-11 维构(上海)文化传媒有限公司 Webpage data capturing and filtering method
CN102722580A (en) * 2012-06-07 2012-10-10 杭州电子科技大学 Method for downloading video comments dynamically generated in video websites
CN102750392A (en) * 2012-07-09 2012-10-24 浙江省公众信息产业有限公司 Web topic information extraction method and system
CN102819613A (en) * 2012-08-28 2012-12-12 北京奇虎科技有限公司 RSS (really simple syndication) information paging fetching system and method
CN102929596A (en) * 2012-09-21 2013-02-13 华为技术有限公司 Code checking method and device
CN102929992A (en) * 2012-10-22 2013-02-13 卢屹韦 Method for periodically and automatically grabbing online news information
CN103020263A (en) * 2012-12-24 2013-04-03 北京小米科技有限责任公司 Method, device and terminal for storing webpage information
CN103064943A (en) * 2012-12-25 2013-04-24 北京奇虎科技有限公司 Customer premises equipment
CN103150389A (en) * 2013-03-21 2013-06-12 北京奇虎科技有限公司 Method and device for processing matching setting of webpage text contents
CN103164435A (en) * 2011-12-13 2013-06-19 北大方正集团有限公司 Acquisition method and system of network data
WO2013178094A1 (en) * 2012-05-31 2013-12-05 优视科技有限公司 Page display method and device
CN103838728A (en) * 2012-11-21 2014-06-04 腾讯科技(深圳)有限公司 Webpage information processing method and browser
CN104090933A (en) * 2014-06-25 2014-10-08 武汉传神信息技术有限公司 Method for window displaying of network information
CN104360882A (en) * 2014-11-07 2015-02-18 北京奇虎科技有限公司 Method and device for displaying images in web page in browser
CN104537128A (en) * 2015-01-30 2015-04-22 广联达软件股份有限公司 Webpage information extracting method and device
CN104572901A (en) * 2014-12-25 2015-04-29 小米科技有限责任公司 Method and device for downloading webpage data
CN102023998B (en) * 2009-09-21 2015-05-20 创新科技有限公司 Method and device for processing webpage so as to display on handheld equipment
CN104915415A (en) * 2015-06-08 2015-09-16 浪潮集团有限公司 Distributed internet data collection and analysis system
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN105550165A (en) * 2015-12-23 2016-05-04 深圳市八零年代网络科技有限公司 Plug-in and method capable of importing webpage article into webpage text editor
CN105930346A (en) * 2016-04-06 2016-09-07 清华大学 Internet case information extraction method and device
CN105938496A (en) * 2016-05-27 2016-09-14 深圳市永兴元科技有限公司 Webpage content extraction method and apparatus
CN103902578B (en) * 2012-12-27 2017-05-31 中国移动通信集团四川有限公司 A kind of method for abstracting web page information and device
CN107168948A (en) * 2017-04-19 2017-09-15 广州视源电子科技股份有限公司 A kind of sentence recognition methods and system
CN107623624A (en) * 2016-07-15 2018-01-23 阿里巴巴集团控股有限公司 The method and device of notification message is provided
CN107766384A (en) * 2016-08-22 2018-03-06 北京国双科技有限公司 A kind of method and apparatus for determining page issuing time
CN108170784A (en) * 2017-12-26 2018-06-15 佛山市道静科技有限公司 The method and system of content information on a kind of extraction internet
CN109522282A (en) * 2018-09-29 2019-03-26 中国平安人寿保险股份有限公司 Picture management method, device, computer installation and storage medium
CN109558123A (en) * 2018-12-03 2019-04-02 掌阅科技股份有限公司 The method of webpage conversion electrons book, electronic equipment, storage medium
CN110175288A (en) * 2019-05-23 2019-08-27 中国搜索信息科技股份有限公司 A kind of filter method and system of the writings and image data towards younger population
CN111026984A (en) * 2019-11-07 2020-04-17 国家计算机网络与信息安全管理中心 Method and device for detecting operation state of Internet financial company
CN113886661A (en) * 2021-12-06 2022-01-04 北京并行科技股份有限公司 Information acquisition method and device and computing equipment
CN114201971A (en) * 2021-12-13 2022-03-18 海南港航控股有限公司 Method and system for extracting character attributes from webpage
CN114417216A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Data acquisition method and device, electronic equipment and readable storage medium

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101261643B (en) * 2008-05-04 2012-01-11 腾讯科技(深圳)有限公司 Website page information statistical method and apparatus
CN101639772A (en) * 2008-07-31 2010-02-03 国际商业机器公司 Method and device for generating window title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN102023998B (en) * 2009-09-21 2015-05-20 创新科技有限公司 Method and device for processing webpage so as to display on handheld equipment
CN102193944A (en) * 2010-03-12 2011-09-21 三星电子(中国)研发中心 Method for extracting webpage subject contents
CN102375857A (en) * 2010-08-24 2012-03-14 腾讯科技(深圳)有限公司 Search method and device
CN102375857B (en) * 2010-08-24 2014-08-13 腾讯科技(深圳)有限公司 Search method and device
CN101937469B (en) * 2010-09-15 2012-09-05 任子行网络技术股份有限公司 Information capture method of video website
CN101937469A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Information capture method of video website
CN101997915A (en) * 2010-10-29 2011-03-30 中国电信股份有限公司 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN101997915B (en) * 2010-10-29 2014-01-08 中国电信股份有限公司 Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN102073678A (en) * 2010-12-03 2011-05-25 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102073678B (en) * 2010-12-03 2013-02-27 厦门市美亚柏科信息股份有限公司 System and method for analyzing information of websites
CN102043862B (en) * 2010-12-29 2012-10-17 重庆新媒农信科技有限公司 Directional web data extraction method
CN102043862A (en) * 2010-12-29 2011-05-04 重庆新媒农信科技有限公司 Directional web data extraction method
CN103164435B (en) * 2011-12-13 2016-03-09 北大方正集团有限公司 A kind of acquisition method of network data and system
US9525605B2 (en) 2011-12-13 2016-12-20 Peking University Founder Group Co., Ltd. Method of and system for collecting network data
CN103164435A (en) * 2011-12-13 2013-06-19 北大方正集团有限公司 Acquisition method and system of network data
WO2013087012A1 (en) * 2011-12-13 2013-06-20 北大方正集团有限公司 Method and system for collecting network data
CN102567521A (en) * 2011-12-29 2012-07-11 维构(上海)文化传媒有限公司 Webpage data capturing and filtering method
CN102567521B (en) * 2011-12-29 2013-08-07 维构(上海)文化传媒有限公司 Webpage data capturing and filtering method
RU2611965C2 (en) * 2012-05-31 2017-03-01 Юс Мобиле Лимитед Method and device for page display
US9684636B2 (en) 2012-05-31 2017-06-20 Uc Mobile Limited Ad blocking page display method and device
WO2013178094A1 (en) * 2012-05-31 2013-12-05 优视科技有限公司 Page display method and device
CN102722580A (en) * 2012-06-07 2012-10-10 杭州电子科技大学 Method for downloading video comments dynamically generated in video websites
CN102750392B (en) * 2012-07-09 2014-07-16 浙江省公众信息产业有限公司 Web topic information extraction method and system
CN102750392A (en) * 2012-07-09 2012-10-24 浙江省公众信息产业有限公司 Web topic information extraction method and system
CN102819613A (en) * 2012-08-28 2012-12-12 北京奇虎科技有限公司 RSS (really simple syndication) information paging fetching system and method
CN102819613B (en) * 2012-08-28 2015-11-25 北京奇虎科技有限公司 RSS information paging grasping system and method
CN102929596B (en) * 2012-09-21 2016-01-06 华为技术有限公司 Code arrange distinguish method and relevant apparatus
CN102929596A (en) * 2012-09-21 2013-02-13 华为技术有限公司 Code checking method and device
CN102929992A (en) * 2012-10-22 2013-02-13 卢屹韦 Method for periodically and automatically grabbing online news information
CN103838728A (en) * 2012-11-21 2014-06-04 腾讯科技(深圳)有限公司 Webpage information processing method and browser
CN103838728B (en) * 2012-11-21 2018-01-09 腾讯科技(深圳)有限公司 The processing method and browser of info web
CN103020263A (en) * 2012-12-24 2013-04-03 北京小米科技有限责任公司 Method, device and terminal for storing webpage information
CN103064943A (en) * 2012-12-25 2013-04-24 北京奇虎科技有限公司 Customer premises equipment
CN103902578B (en) * 2012-12-27 2017-05-31 中国移动通信集团四川有限公司 A kind of method for abstracting web page information and device
CN103150389A (en) * 2013-03-21 2013-06-12 北京奇虎科技有限公司 Method and device for processing matching setting of webpage text contents
CN104090933A (en) * 2014-06-25 2014-10-08 武汉传神信息技术有限公司 Method for window displaying of network information
CN104360882A (en) * 2014-11-07 2015-02-18 北京奇虎科技有限公司 Method and device for displaying images in web page in browser
CN104360882B (en) * 2014-11-07 2018-07-27 北京奇虎科技有限公司 Display methods and device are carried out to picture in webpage in a kind of browser
CN104572901A (en) * 2014-12-25 2015-04-29 小米科技有限责任公司 Method and device for downloading webpage data
CN104537128A (en) * 2015-01-30 2015-04-22 广联达软件股份有限公司 Webpage information extracting method and device
CN104915415A (en) * 2015-06-08 2015-09-16 浪潮集团有限公司 Distributed internet data collection and analysis system
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN105550165A (en) * 2015-12-23 2016-05-04 深圳市八零年代网络科技有限公司 Plug-in and method capable of importing webpage article into webpage text editor
CN105930346A (en) * 2016-04-06 2016-09-07 清华大学 Internet case information extraction method and device
CN105938496A (en) * 2016-05-27 2016-09-14 深圳市永兴元科技有限公司 Webpage content extraction method and apparatus
CN107623624A (en) * 2016-07-15 2018-01-23 阿里巴巴集团控股有限公司 The method and device of notification message is provided
CN107766384A (en) * 2016-08-22 2018-03-06 北京国双科技有限公司 A kind of method and apparatus for determining page issuing time
CN107168948A (en) * 2017-04-19 2017-09-15 广州视源电子科技股份有限公司 A kind of sentence recognition methods and system
CN108170784A (en) * 2017-12-26 2018-06-15 佛山市道静科技有限公司 The method and system of content information on a kind of extraction internet
CN109522282A (en) * 2018-09-29 2019-03-26 中国平安人寿保险股份有限公司 Picture management method, device, computer installation and storage medium
CN109522282B (en) * 2018-09-29 2024-02-02 中国平安人寿保险股份有限公司 Picture management method, device, computer device and storage medium
CN109558123A (en) * 2018-12-03 2019-04-02 掌阅科技股份有限公司 The method of webpage conversion electrons book, electronic equipment, storage medium
CN109558123B (en) * 2018-12-03 2022-09-16 掌阅科技股份有限公司 Method for converting webpage into electronic book, electronic equipment and storage medium
CN110175288A (en) * 2019-05-23 2019-08-27 中国搜索信息科技股份有限公司 A kind of filter method and system of the writings and image data towards younger population
CN111026984A (en) * 2019-11-07 2020-04-17 国家计算机网络与信息安全管理中心 Method and device for detecting operation state of Internet financial company
CN113886661A (en) * 2021-12-06 2022-01-04 北京并行科技股份有限公司 Information acquisition method and device and computing equipment
CN114201971A (en) * 2021-12-13 2022-03-18 海南港航控股有限公司 Method and system for extracting character attributes from webpage
CN114417216A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Data acquisition method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN100512181C (en) 2009-07-08

Similar Documents

Publication Publication Date Title
CN100512181C (en) Method and system for extracting information of content in Internet
CN100444174C (en) Method for picking-up, and aggregating micro content of web page, and automatic updating system
US6675350B1 (en) System for collecting and displaying summary information from disparate sources
US6605120B1 (en) Filter definition for distribution mechanism for filtering, formatting and reuse of web based content
US20100030752A1 (en) System, methods and applications for structured document indexing
KR100377515B1 (en) Method for managing advertisements on Internet and System therefor
JP2006309515A (en) Information delivery method and information delivery server
CN101231641A (en) Method and system for automatic analysis of hotspot subject propagation process in the internet
KR102222287B1 (en) Web Crawler System for Collecting a Structured and Unstructured Data in Hidden URL
CN102831252A (en) Method and device for updating index database and search method and system
CN103207874A (en) Updated webpage content prompting method and system
US20080263439A1 (en) Client application for identification of updates in selected network pages
JP2007279901A (en) Method for transmitting data relevant to document
CN103235800A (en) Preview method and preview system of search results
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN102314494A (en) Method and equipment for processing webpage contents
CN102023998A (en) Method and device for processing webpage so as to display on handheld equipment
JP2006277281A (en) Advertisement management method, web page displaying device, and computer program
CN105204806A (en) Individual display method and device for mobile terminal webpage
US20050131859A1 (en) Method and system for standard bookmark classification of web sites
CN101556592A (en) Method for intelligently parsing internet content
JP5089091B2 (en) Content collection system
CN102929992A (en) Method for periodically and automatically grabbing online news information
Lee et al. Scalable Web News Adaptation To Mobile Devices Using Visual Block Segmentation for Ubiquitous Media Services
Šimec et al. RSS as medium for information and communication technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant