CN104750804A - Plug-in type configurable vertical network spider implementation method

Info

Publication number: CN104750804A
Application number: CN201510131253.7A
Authority: CN
Inventors: 孟硕培; 程向飞
Original assignee: Nanjing Tu Niu Science And Technology Ltd
Current assignee: Nanjing Tu Niu Science And Technology Ltd
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2015-07-01

Abstract

The invention discloses a method for implementing a plug-in, configurable vertical web crawler (network spider). The method comprises a crawl stage and an extraction stage: the crawl stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase. Through configuration, the method achieves web page crawling and information extraction for multiple domains with high precision. It thereby overcomes the unclear intent and low precision of traditional search engines while still supporting web page crawling and information extraction across multiple domains.

Description

A method for implementing a plug-in, configurable vertical-domain web crawler

Technical field

The present invention relates to methods for implementing vertical-domain web crawlers, and in particular to a method for implementing a plug-in, configurable vertical-domain web crawler.

Background art

With the development of Internet technology, the amount of information on the network has become as vast as the open sea. How to find the required information quickly and accurately in this ocean of information has become a central concern. Faced with this massive volume of network information, practitioners in a given field need only the data relevant to that field and no other data, and they often place high demands on data quality. Traditional search engines interpret search intent poorly, return results of low precision, and deliver unstructured data, so they struggle to provide accurate and clean data.

A typical vertical-domain crawler is built with an algorithm designed for one specific domain. It can obtain information in that domain accurately, but it cannot crawl web pages or extract information in any other domain.

Summary of the invention

Purpose of the invention: the purpose of the present invention is to provide a method for implementing a plug-in, configurable vertical-domain web crawler that achieves web page crawling and information extraction for multiple domains through configuration, with high precision.

Technical solution: to achieve this purpose, the present invention adopts the following technical solution:

The method of the present invention for implementing a plug-in, configurable vertical-domain web crawler comprises a crawl stage and an extraction stage, wherein the crawl stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase;

Crawl configuration phase: configure the crawl configuration file;

Crawl program execution phase: read the crawl configuration file, perform the crawl, and obtain the crawled data;

Extraction configuration phase: configure the basic information, and configure the extraction pattern according to the crawled data;

Extraction program execution phase: read the extraction configuration file, perform extraction on the crawled data, and obtain the extracted data.

Further, the crawl configuration phase comprises the following steps:

Step (11): place all entry URLs in a separate configuration file and specify the corresponding state of each;

Step (12): configure the storage path of the crawled data and the encoding of the crawled pages;

Step (13): configure the storage settings for the crawled URLs;

Step (14): configure the path of the URL entry file;

Step (15): configure a crawl state, which defines the crawl rule for obtaining the URL list of the next state;

Step (16): determine whether the crawl state is the final state; if not, return to step (15) and configure the next state; if so, proceed to step (17);

Step (17): save the crawler's configuration information.

Further, the crawl rule in step (15) is implemented by a crawl plug-in, and the crawl plug-in uses XPath, a regular expression, or custom programming.
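One way to picture the plug-in mechanism is a registry that maps the plug-in type named in the configuration to a function that returns URLs. The following Python sketch is an illustration under assumptions, not the patent's implementation: the names `PLUGINS`, `register`, `regex_plugin`, `nextpage_plugin`, and `run_plugin` are hypothetical, and the rule format passed to each plug-in is invented for the example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry: plug-in type name -> function(page_text, rule) -> URLs.
PLUGINS: Dict[str, Callable[[str, str], List[str]]] = {}


def register(name: str):
    """Decorator that registers a crawl plug-in under the given type name."""
    def wrap(func):
        PLUGINS[name] = func
        return func
    return wrap


@register("regex")
def regex_plugin(page: str, rule: str) -> List[str]:
    # The rule is a regular expression with one group; each match is a URL.
    return re.findall(rule, page)


@register("nextpage")
def nextpage_plugin(page: str, rule: str) -> List[str]:
    # Custom-programmed plug-in: find the link whose text equals `rule`.
    pattern = r'<a\s+href="([^"]+)"[^>]*>\s*' + re.escape(rule) + r'\s*</a>'
    return re.findall(pattern, page)


def run_plugin(kind: str, page: str, rule: str) -> List[str]:
    """Look up the plug-in named in the configuration and apply it."""
    return PLUGINS[kind](page, rule)
```

With such a registry, supporting a new website layout would mean registering one more function rather than changing the crawler itself.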

Further, the crawl program execution phase comprises the following steps:

Step (21): read the configuration file;

Step (22): obtain the entry URL list and add it to the to-crawl URL queue;

Step (23): determine whether the to-crawl URL queue is empty; if it is empty, the crawl ends; otherwise, perform step (24);

Step (24): take one URL from the to-crawl URL queue;

Step (25): determine whether the URL taken out has already been crawled; if so, return to step (23); otherwise, perform step (26);

Step (26): crawl and store the web page that the URL points to;

Step (27): add the crawled URL to the URL pool;

Step (28): enter the crawl state configured in step (15) and obtain a new URL list according to the crawl rule;

Step (29): add the new URL list to the to-crawl URL queue, then perform step (23).
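Steps (22) to (29) amount to a queue-driven loop with a visited set. The sketch below is a minimal Python illustration under stated assumptions: the function name `crawl`, the `Rule` type, and the in-memory `pages` dictionary (standing in for on-disk storage) are hypothetical, and `fetch` is passed in so the example needs no network access.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# A rule turns a fetched page into new URLs and names the state they enter.
Rule = Tuple[Callable[[str], List[str]], int]


def crawl(entries: List[Tuple[str, int]],
          rules: Dict[int, List[Rule]],
          fetch: Callable[[str], str]) -> Dict[str, str]:
    """Run steps (22)-(29) over (url, state) pairs and return the stored pages."""
    queue = deque(entries)          # step (22): to-crawl URL queue
    seen = set()                    # URL pool of already crawled URLs
    pages: Dict[str, str] = {}      # stands in for the page storage path
    while queue:                    # step (23): stop when the queue is empty
        url, state = queue.popleft()            # step (24)
        if url in seen:                         # step (25)
            continue
        pages[url] = fetch(url)                 # step (26): crawl and store
        seen.add(url)                           # step (27)
        for extract_urls, next_state in rules.get(state, []):   # step (28)
            for new_url in extract_urls(pages[url]):
                queue.append((new_url, next_state))             # step (29)
    return pages
```

A state with no entry in `rules` yields no new URLs, which is how a final state ends the crawl in this sketch.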

Further, the extraction configuration phase comprises the following steps:

Step (31): configure the basic information, including the storage path of the crawled data, the storage path of the URLs to be extracted, and the storage path of the extracted data;

Step (32): configure the extraction pattern according to the crawled data, including the information to be extracted from the crawled web pages and the extraction rules.

Further, the extraction rule in step (32) is implemented by an extraction plug-in, and the extraction plug-in uses XPath, a regular expression, or custom programming.

Further, the extraction program execution phase comprises the following steps:

Step (41): read the configuration file;

Step (42): read the list of URLs to be extracted and add it to the to-extract URL queue;

Step (43): read one URL from the to-extract URL queue;

Step (44): retrieve the web page corresponding to the URL from the storage path of the crawled data described in step (31); if the web page exists, perform step (45); otherwise, perform step (46);

Step (45): extract the corresponding content according to the extraction rule and store it;

Step (46): determine whether the to-extract URL queue is empty; if it is empty, end; otherwise, perform step (43).
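The extraction loop of steps (42) to (46) can be sketched in a few lines of Python. This is an illustration only: `run_extraction` and its parameters are hypothetical names, `stored_pages` is an in-memory stand-in for the storage path of the crawled data, and each extraction rule is modelled as a plain function from page text to a field value.

```python
from collections import deque
from typing import Callable, Dict, List


def run_extraction(urls: List[str],
                   stored_pages: Dict[str, str],
                   fields: Dict[str, Callable[[str], str]]) -> List[Dict[str, str]]:
    """Apply every field rule to every stored page named in `urls`."""
    queue = deque(urls)                      # step (42): to-extract URL queue
    records: List[Dict[str, str]] = []
    while queue:                             # step (46): end when empty
        url = queue.popleft()                # step (43)
        page = stored_pages.get(url)         # step (44): look up the stored page
        if page is None:
            continue                         # page missing: go on to the next URL
        # step (45): extract each configured field and store the record
        records.append({name: rule(page) for name, rule in fields.items()})
    return records
```

URLs whose page was never stored are skipped rather than treated as errors, matching the branch from step (44) to step (46).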

Beneficial effects: the present invention achieves web page crawling and information extraction for multiple domains through configuration, with high precision. It overcomes the unclear intent and low precision of traditional search engines while also supporting web page crawling and information extraction across multiple domains.

Brief description of the drawings

Fig. 1 is a block diagram of the method of the present invention;

Fig. 2 is a flow chart of the crawl program execution phase of the present invention;

Fig. 3 is a flow chart of the extraction program execution phase of the present invention.

Embodiment

The technical solution of the present invention is further described below through an embodiment, with reference to the accompanying drawings.

The block diagram of the method of the present invention for implementing a plug-in, configurable vertical-domain web crawler is shown in Fig. 1. The method comprises a crawl stage and an extraction stage, wherein the crawl stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase.

Crawl configuration phase: configure the crawl configuration file;

Extraction configuration phase: configure the basic information, and configure the extraction pattern according to the crawled data;

Extraction program execution phase: read the extraction configuration file, perform extraction on the crawled data, and obtain the extracted data.

The technical solution of the present invention is further illustrated with the following embodiment. Suppose there is a requirement to analyze the price distribution of a website's Yunnan tour routes. To meet this requirement, the developer needs to obtain information such as the price, trip duration, and departure dates of each route on the website. Following the method of the present invention, the crawl stage is performed first to crawl the content of the website, and the extraction stage is then performed to extract the required information, such as route ID, route name, and price.

1. Crawl configuration phase

(1) Configure the entry URL

Step (11): navigate to the list page of the website's Yunnan tour routes, for example http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/. As long as all routes can be traversed by repeatedly following the next page, this page can serve as the entry page for the whole crawl. All entry URLs are placed in a separate configuration file, and the corresponding state of each is specified. The configuration is as follows:

The value in the state node indicates the state of this URL in the crawl procedure.
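The original configuration listing is not reproduced in this text. As a rough illustration only, an entry file of this kind might be read as in the following Python sketch. The element names `entries`, `entry`, and `url` and the function `load_entries` are assumptions made for the example; only the `state` node and the sample URL come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry file: `entries`, `entry`, and `url` are assumed names.
ENTRY_XML = """
<entries>
  <entry>
    <url>http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/</url>
    <state>1</state>
  </entry>
</entries>
"""


def load_entries(xml_text):
    """Return (url, state) pairs, one per entry node."""
    root = ET.fromstring(xml_text)
    return [(entry.findtext("url").strip(), int(entry.findtext("state")))
            for entry in root.iter("entry")]
```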

(2) Configure the storage path and encoding of the crawled pages, the storage settings for the crawled data, and the path of the URL entry file

Step (12): configure the storage path and encoding of the crawled data;

Step (13): configure the storage settings for the crawled URLs;

Step (14): configure the path of the URL entry file;

Configuration sample for steps (12), (13), and (14): all configuration information is placed in one XML file that holds every setting of the crawl configuration phase, including the storage path of the crawled data (index_dir), the encoding of the crawled pages (encoding), the path of the URL entry file (entry_urls), and the storage path of the crawled URLs (output_urls).
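A settings file with these four items could be loaded and validated as sketched below. This is a hedged illustration: the setting names `index_dir`, `encoding`, `entry_urls`, and `output_urls` come from the description, while the root element `crawler`, every value, and the function `load_settings` are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# Only the four setting names come from the text; the root element name
# `crawler` and all values are placeholders.
CONFIG_XML = """
<crawler>
  <index_dir>data/pages</index_dir>
  <encoding>utf-8</encoding>
  <entry_urls>conf/entry_urls.xml</entry_urls>
  <output_urls>data/urls.txt</output_urls>
</crawler>
"""

REQUIRED = ("index_dir", "encoding", "entry_urls", "output_urls")


def load_settings(xml_text):
    """Read the required settings and fail early if one is missing."""
    root = ET.fromstring(xml_text)
    settings = {}
    for key in REQUIRED:
        value = root.findtext(key)
        if value is None:
            raise ValueError(f"missing setting: {key}")
        settings[key] = value.strip()
    return settings
```

Failing at load time on a missing setting keeps configuration mistakes from surfacing midway through a long crawl.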

(3) Configure the crawl states

Step (15): once the basic information above is configured, the crawl states are configured. A crawl state defines the crawl rule for obtaining the URL list of the next state. Each state is written in an XML node named state, whose id attribute identifies the state. Each action node inside a state specifies the rule used to obtain the target URL list. The type attribute of the action node specifies the crawl plug-in to use, which may be XPath, a regular expression, or custom programming; the state attribute specifies the next state that the obtained URLs jump to.

The configuration process of step (15) is as follows:

First, configure state 1, which obtains the URLs of all list pages. State 1 traverses all list pages of the website's Yunnan tour routes. Each list page shows only 20 routes, so the crawler must follow the next page to reach all routes. The nextpage crawl plug-in therefore finds the link named "next page" in the web page and thereby obtains the URL of the next page.

Second, configure state 2, which obtains the URLs of all route detail pages. Once all list page URLs have been obtained, the next state, state 2, is entered. Take the page http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/ as an example: it lists 20 routes, and their URLs can be found with the XPath expression //dt/p//a/@href.

Finally, state 3 is entered once the route detail page URLs have been obtained. State 3 contains no action; it marks the stop condition, and the crawled route detail pages are saved to the server.

A configuration sample for step (15):
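The original listing is not reproduced in this text, so the following Python sketch only illustrates how state and action nodes of the kind described might be parsed. The `state` and `action` nodes and the `id`, `type`, and `state` attributes come from the description; the root element `states`, the choice to store each rule as the text of the action node, and the specific transitions are assumptions. How list pages move from state 1 to state 2 in the original sample cannot be recovered from the text, and the sketch does not try to settle it.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the root element and the rule-as-text convention
# are assumptions made for this sketch.
STATES_XML = """
<states>
  <state id="1">
    <action type="nextpage" state="1">next page</action>
  </state>
  <state id="2">
    <action type="xpath" state="3">//dt/p//a/@href</action>
  </state>
  <state id="3"/>
</states>
"""


def load_states(xml_text):
    """Return {state_id: [(plugin_type, rule, next_state), ...]}."""
    root = ET.fromstring(xml_text)
    states = {}
    for node in root.iter("state"):
        actions = [(a.get("type"), (a.text or "").strip(), int(a.get("state")))
                   for a in node.findall("action")]
        states[int(node.get("id"))] = actions
    return states


def final_states(states):
    """A state without any action marks the end of crawling, like state 3."""
    return [sid for sid, actions in states.items() if not actions]
```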

In summary, the whole crawl configuration phase is now complete. Next, the crawl program execution phase can proceed according to the configuration file produced in the crawl configuration phase.

2. Crawl program execution phase

The flow chart of the crawl program execution phase is shown in Fig. 2 and comprises the following steps:

Step (21): read the configuration file;

Step (24): take one URL from the to-crawl URL queue;

Step (27): add the crawled URL to the URL pool;

3. Extraction configuration phase

The data obtained once the crawl program finishes are whole web pages, which cannot be used directly for data analysis. The data in these web pages must be extracted to filter out the useful parts.

The extraction configuration phase comprises the following two steps:

Step (31): configure the basic information, including the storage path of the crawled data, the storage path of the URLs to be extracted, and the storage path of the extracted data;

Step (32): configure the extraction pattern according to the data obtained in the crawl stage (for example, web pages), including the information to be extracted from the crawled web pages and the extraction rules.

For step (31), the basic information for extraction is configured, including the storage path of the URLs to be extracted (url_list), the storage path of the crawled data (index_dir), and the storage path of the extracted data (output_dir). Taking the case where the extracted data are stored in a file system directory, a configuration sample for step (31) is as follows:
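The original sample is not reproduced in this text. As a stand-in, the sketch below shows one way the three paths could be loaded into a typed settings object. The names `url_list`, `index_dir`, and `output_dir` come from the description; the root element `extractor`, the path values, the `ExtractSettings` class, and `load_extract_settings` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
import xml.etree.ElementTree as ET

# Setting names come from the text; the root element `extractor` and the
# path values are placeholders.
EXTRACT_XML = """
<extractor>
  <url_list>data/urls.txt</url_list>
  <index_dir>data/pages</index_dir>
  <output_dir>data/extracted</output_dir>
</extractor>
"""


@dataclass(frozen=True)
class ExtractSettings:
    url_list: Path      # where the URLs to be extracted are stored
    index_dir: Path     # where the crawled pages are stored
    output_dir: Path    # where the extracted data will be written


def load_extract_settings(xml_text: str) -> ExtractSettings:
    root = ET.fromstring(xml_text)

    def path_of(tag: str) -> Path:
        value = root.findtext(tag)
        if value is None:
            raise ValueError(f"missing setting: {tag}")
        return Path(value.strip())

    return ExtractSettings(path_of("url_list"), path_of("index_dir"),
                           path_of("output_dir"))
```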

For step (32), take the web page http://sh.tuniu.com/tours/22512110 as an example. Three fields need to be extracted: product ID (id), name (name), and promotional price (price). Each field is written as an XML node that specifies its extraction path and the processing to apply.

Each node element is a field to be processed, and its name attribute specifies the field's name. The xpath attribute gives the XPath of the field in the web page. If the content obtained through the XPath does not meet requirements, it can be further processed by an extraction plug-in (action); for example, the replace extraction plug-in applies a regular-expression replacement to the obtained content. Extraction plug-ins use XPath or regular expressions; if the existing extraction plug-ins cannot meet a requirement, developers can write their own extraction plug-in through custom programming to provide the needed function.
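The interplay of the xpath attribute and a replace plug-in can be sketched as follows. This is an illustration under assumptions: the `node` element, its `name` and `xpath` attributes, the `action` plug-in node, and the replace plug-in are named in the description, whereas the root element `pattern`, the `pattern` and `with` attributes, the XPath values, and the sample page are invented. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and only well-formed markup; real web pages would normally need an HTML parser with full XPath support, such as lxml.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction pattern and sample page (see the lead-in for
# which names are assumptions).
PATTERN_XML = """
<pattern>
  <node name="id" xpath=".//span[@class='pid']"/>
  <node name="price" xpath=".//span[@class='price']">
    <action type="replace" pattern="[^0-9]" with=""/>
  </node>
</pattern>
"""

PAGE = """
<html><body>
  <span class="pid">22512110</span>
  <span class="price">CNY 1,299</span>
</body></html>
"""


def extract(page_text, pattern_text):
    """Return {field name: value} for every node in the extraction pattern."""
    page = ET.fromstring(page_text)
    record = {}
    for node in ET.fromstring(pattern_text).iter("node"):
        found = page.find(node.get("xpath"))
        value = (found.text or "").strip() if found is not None else ""
        # Post-process the raw value with each configured plug-in in order.
        for action in node.findall("action"):
            if action.get("type") == "replace":
                value = re.sub(action.get("pattern"), action.get("with"), value)
        record[node.get("name")] = value
    return record
```

Here the replace step strips every non-digit character, so a displayed price with a currency label and thousands separator is reduced to a plain number string.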

4. Extraction program execution phase

Once the extraction configuration is complete, the extraction program can be run on the data obtained in the crawl stage according to the information set in the extraction configuration phase. The flow chart of the extraction program execution phase is shown in Fig. 3 and comprises the following steps:

Step (41): read the configuration file;

Step (43): read one URL from the to-extract URL queue;

Step (44): retrieve the web page corresponding to the URL from the corresponding directory or database; if the web page exists, perform step (45); otherwise, perform step (46);

The data extracted in this embodiment are as follows:

id:22512110

name:<Kunming-Dali-Shuanglang-Lijiang 6-day round-trip flight tour> See the grand scenery of Dianchi Lake, savor the Shuanglang of your dreams

price:1299

The extracted data are well structured, and developers can readily use them for further research.

Claims

1. A method for implementing a plug-in, configurable vertical-domain web crawler, characterized in that it comprises a crawl stage and an extraction stage, wherein the crawl stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase;

Crawl configuration phase: configure the crawl configuration file;

2. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 1, characterized in that the crawl configuration phase comprises the following steps:

Step (13): configure the storage settings for the crawled URLs;

Step (14): configure the path of the URL entry file;

Step (17): save the crawler's configuration information.

3. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 2, characterized in that the crawl rule in step (15) is implemented by a crawl plug-in, and the crawl plug-in uses XPath, a regular expression, or custom programming.

4. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 2, characterized in that the crawl program execution phase comprises the following steps:

Step (21): read the configuration file;

Step (24): take one URL from the to-crawl URL queue;

Step (27): add the crawled URL to the URL pool;

5. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 1, characterized in that the extraction configuration phase comprises the following steps:

6. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 5, characterized in that the extraction rule in step (32) is implemented by an extraction plug-in, and the extraction plug-in uses XPath, a regular expression, or custom programming.

7. The method for implementing a plug-in, configurable vertical-domain web crawler according to claim 5, characterized in that the extraction program execution phase comprises the following steps:

Step (41): read the configuration file;

Step (43): read one URL from the to-extract URL queue;