Publication number: CN101997915 A
Publication type: Application
Application number: CN 201010532086
Publication date: 30 Mar 2011
Filing date: 29 Oct 2010
Priority date: 29 Oct 2010
Also published as: CN101997915B
Publication number: 201010532086.4, CN 101997915 A, CN 101997915A, CN 201010532086, CN-A-101997915, CN101997915 A, CN101997915A, CN201010532086, CN201010532086.4
Inventors: 杨俊, 蒋丹舟, 蔡逆水, 陈强
Applicant: 中国电信股份有限公司
Deep packet detection device, webpage data processing method, and webpage data acquisition method and system
CN 101997915 A
Abstract
The invention discloses a web page data processing method, a web page data collection method, a deep packet inspection (DPI) device, and a web page data collection system. The web page data collection method comprises the following steps: selectively capturing Hypertext Transfer Protocol (HTTP) messages flowing to a web server according to a web page address information library; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting the data in the captured HTTP messages according to the content of the tag field. By combining deep packet inspection with web page data collection, the invention improves the efficiency of collecting and analyzing web page data and reduces the cost of collecting and analyzing massive data; because tag fields are used, web page data can also be collected more accurately.
Claims (11), translated from Chinese
  1. A web page data processing method, characterized by comprising: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page, the content of the tag field indicating the data collection scope of the web page's HTTP message.
  2. The method according to claim 1, characterized in that the tag field is placed in a header field of the HTTP message of each web page.
  3. The method according to claim 1, characterized in that the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  4. A web page data collection method, characterized by comprising: selectively capturing, according to a web page address information library, HTTP messages flowing to a web server; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting data from the captured HTTP messages according to the content of the tag field.
  5. The method according to claim 4, characterized in that the HTTP messages flowing to the web server are formed by the following steps: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding the tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message.
  6. The method according to claim 4 or 5, characterized in that the tag field is placed in a header field of the HTTP message of each web page.
  7. The method according to claim 5, characterized in that the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  8. A deep packet inspection device, characterized by comprising: an address filtering module, configured to selectively capture, according to a web page address information library, HTTP messages flowing to a web server; a message parsing module, connected to the address filtering module and configured to parse the content of the captured HTTP messages; a tag content extraction module, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message; and a data collection module, connected to the tag content extraction module and configured to selectively collect data from the captured HTTP messages according to the content of the tag field.
  9. The device according to claim 8, characterized in that the tag field is placed in a header field of the HTTP messages flowing to the web server.
  10. The device according to claim 8, characterized in that the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.
  11. A web page data collection system, characterized by comprising the deep packet inspection device according to any one of claims 8 to 10 and a web page data processing device, wherein the web page data processing device comprises: a collection scope determination module, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page; and a data processing module, connected to the collection scope determination module and configured to add the tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message.
Description, translated from Chinese

Deep packet inspection device, web page data processing method, and web page data collection method and system

Technical Field

[0001] The present invention relates to the field of Internet technology, and in particular to a deep packet inspection device, a web page data processing method, a web page data collection method, and a web page data collection system.

Background

[0002] With the rapid development of web technology and web applications, centralized monitoring, user data collection, and statistical analysis are increasingly widely applied to all kinds of web application sites, especially platforms such as electronic channels and e-commerce. However, because electronic-channel and e-commerce platforms with large user bases generate massive amounts of user data, in practice that data has to be collected selectively.

[0003] Existing web pages, however, were not designed with data collection in mind, and they commonly suffer from problems such as disorganized page addresses, disorganized collected data, and low accuracy. It is therefore difficult to collect data efficiently and accurately from existing web pages.

Summary of the Invention

[0004] A technical problem to be solved by the present invention is to provide a deep packet inspection device, a web page data processing method, a web page data collection method, and a web page data collection system that can collect web page data efficiently and accurately.

[0005] According to one aspect of the invention, a web page data processing method is proposed, comprising: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page, the content of the tag field indicating the data collection scope of the web page's HTTP message.

[0006] According to one embodiment of the web page data processing method, the tag field is placed in a header field of the HTTP message of each web page.

[0007] According to another embodiment of the web page data processing method, the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0008] According to another aspect of the invention, a web page data collection method is also proposed, comprising: selectively capturing, according to a web page address information library, HTTP messages flowing to a web server; parsing the content of the captured HTTP messages; extracting the content of the tag field in the HTTP messages; and selectively collecting data from the captured HTTP messages according to the content of the tag field.

[0009] According to one embodiment of the web page data collection method, the HTTP messages flowing to the web server are formed by the following steps: determining, according to data collection requirements, the data collection scope of the HTTP message of each web page; and adding a tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message.

[0010] According to another embodiment of the web page data collection method, the tag field is placed in a header field of the HTTP message of each web page.

[0011] According to yet another embodiment of the web page data collection method, the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0012] According to yet another aspect of the invention, a deep packet inspection device is also proposed, comprising: an address filtering module, configured to selectively capture, according to a web page address information library, HTTP messages flowing to a web server; a message parsing module, connected to the address filtering module and configured to parse the content of the captured HTTP messages; a tag content extraction module, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message; and a data collection module, connected to the tag content extraction module and configured to selectively collect data from the captured HTTP messages according to the content of the tag field.

[0013] According to one embodiment of the deep packet inspection device, the tag field is placed in a header field of the HTTP messages flowing to the web server.

[0014] According to another embodiment of the deep packet inspection device, the data collection scope of the web page's HTTP message comprises extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message.

[0015] According to a further aspect of the invention, a web page data collection system is also proposed, comprising the deep packet inspection device of the above embodiments and a web page data processing device, wherein the web page data processing device comprises: a collection scope determination module, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page; and a data processing module, connected to the collection scope determination module and configured to add a tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message.

[0016] The deep packet inspection device, web page data processing method, web page data collection method, and web page data collection system provided by the invention combine Deep Packet Inspection (DPI) technology with web page data collection technology, which improves the efficiency of collecting web page data and reduces the cost of collecting and analyzing massive data. In addition, because a tag field is used, the data collection scope of a web page can be determined more precisely, which improves the accuracy of data collection.

Brief Description of the Drawings

[0017] The drawings described here provide a further understanding of the invention and form part of this application. In the drawings:

[0018] FIG. 1 is a schematic flowchart of one embodiment of the web page data processing method of the invention.

[0019] FIG. 2 is a schematic flowchart of one embodiment of the web page data collection method of the invention.

[0020] FIG. 3 is a schematic flowchart of yet another embodiment of the web page data collection method of the invention.

[0021] FIG. 4 is a schematic structural diagram of one embodiment of the deep packet inspection device of the invention.

[0022] FIG. 5 is a schematic structural diagram of one embodiment of the web page data collection system of the invention.

Detailed Description

[0023] The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention. The exemplary embodiments and their descriptions serve to explain the invention and do not unduly limit it.

[0024] The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.

[0025] The present invention combines DPI technology with web page data collection technology. Based on an analysis of the principle of selective DPI data collection, and in order to improve the efficiency of collection and analysis, it proposes a web page data processing method that facilitates DPI collection, together with a web page data collection method, a deep packet inspection device, and a web page data collection system.

[0026] Selective DPI collection first requires building a library that stores the addresses of the pages to be collected. When a request reaches the server, its address is first looked up in this library; if the web page's address matches a page address in the library, the content of the page is extracted.
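
The address lookup described in [0026] can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the library contents, the host `example.com`, and the function name `is_collectable` are all assumptions, and matching on host plus path prefix is only one possible reading of "matches".

```python
from urllib.parse import urlsplit

# Hypothetical address library: (host, path prefix) pairs for pages to collect.
ADDRESS_LIBRARY = {
    ("example.com", "/news/sports"),
    ("example.com", "/shop"),
}


def is_collectable(url: str) -> bool:
    """Return True if the URL's host and path match an entry in the library."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    for lib_host, prefix in ADDRESS_LIBRARY:
        if host != lib_host:
            continue
        # Match the prefix itself or anything beneath it, but not "/shopping".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return True
    return False
```

Only requests for which `is_collectable` returns true would go on to parsing and extraction; everything else is dropped before any further work is done.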

[0027] FIG. 1 is a schematic flowchart of one embodiment of the web page data processing method of the invention.

[0028] As shown in FIG. 1, this embodiment may include the following steps:

[0029] S102: determine, according to data collection requirements, the scope of data collection for the HTTP message of each web page;

[0030] S104: add a tag field to the HTTP message of each web page, the content of which indicates the scope of data collection for that web page's HTTP message. The tag field may be located anywhere in the HTTP message; preferably, it is placed in a header field of the HTTP message of each web page.

[0031] In addition, the scope of data collection for a web page's HTTP message may include extracting all data in the HTTP message (that is, everything from the message header to the end of the message), extracting part of the data in the HTTP message (for example, the IP address, user name, page address, access time, login category, and page parameters), and extracting no data from the HTTP message.

[0032] This embodiment takes DPI technology into account while processing web page data: based on an analysis of the principle of selective DPI data collection, it proposes a web page data processing method that facilitates DPI collection. The embodiment can significantly improve the efficiency of web page data collection and the accuracy of the data collected.

[0033] In another embodiment of the web page data processing method of the invention, the page addresses are first standardized (for example, http://202.23.24.153/news/sports denotes the sports content within news). The pages of the website are then divided into levels that correspond to different data collection scopes, and a tag field whose content corresponds to the applicable data collection scope is added to the web page's HTTP message. According to the RFC protocol specifications, custom fields may be embedded in the header of an HTTP message as a particular application requires. Custom HTTP header field information (that is, the tag field) can therefore be embedded when the electronic-channel web pages are implemented: different custom information is embedded in web pages for different data collection requirements, which classifies the pages by level and prepares them for data collection.
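
Steps S102 and S104 can be sketched as below, under stated assumptions. The header name `X-type` and the values 0 and 1 follow the example given later in the description; the value 2 for "extract nothing", the `PAGE_LEVELS` table, and the function names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Scope(Enum):
    ALL = "0"      # extract the whole message
    PARTIAL = "1"  # extract selected fields only
    NONE = "2"     # extract nothing (value assumed for illustration)


# Hypothetical page levels: path prefix -> collection scope (step S102).
PAGE_LEVELS = {
    "/news/sports": Scope.PARTIAL,
    "/account": Scope.ALL,
}


def scope_for(path: str) -> Scope:
    """Pick the scope of the longest matching prefix; default to NONE."""
    matches = [p for p in PAGE_LEVELS if path.startswith(p)]
    return PAGE_LEVELS[max(matches, key=len)] if matches else Scope.NONE


def add_tag(message: bytes, scope: Scope) -> bytes:
    """Append an X-type header to the header block of a raw HTTP message (step S104)."""
    head, sep, body = message.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("message has no header terminator")
    tag = b"X-type: " + scope.value.encode("ascii")
    return head + b"\r\n" + tag + sep + body
```

Placing the tag in the header block means a collector can decide what to keep after reading the headers alone, without inspecting the body.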

[0034] FIG. 2 is a schematic flowchart of one embodiment of the web page data collection method of the invention.

[0035] As shown in FIG. 2, this embodiment may include the following steps:

[0036] S202: selectively capture HTTP messages flowing to the web server according to the web page address information library. The library may store the addresses of the pages to be captured; a page flowing to the web server is captured, and then parsed and subjected to data extraction, only when its address satisfies the library's requirements (for example, when the page address is stored in the library);

[0037] S204: parse the content of the captured HTTP messages;

[0038] S206: extract the content of the tag field in the HTTP messages;

[0039] S208: selectively collect data from the captured HTTP messages according to the content of the tag field.

[0040] The HTTP messages flowing to the web server may be formed by the following steps: determining, according to data collection requirements, the scope of data collection for the HTTP message of each web page; and adding a tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the scope of data collection for the web page's HTTP message.

[0041] In one example, the scope of data collection for a web page's HTTP message may include extracting all data in the HTTP message (that is, everything from the message header to the end of the message), extracting part of the data in the HTTP message (for example, the IP address, user name, page address, access time, login category, and page parameters), and extracting no data from the HTTP message.

[0042] Optionally, the tag field may be located anywhere in the HTTP message of each web page; preferably, it is placed in a header field of the HTTP message of each web page.

[0043] When collecting data, this embodiment first filters the pages to be collected according to the web page address information library, which greatly reduces interference from massive data. The embodiment further parses the HTTP header fields of the captured pages, extracts the content of the custom header field tag, and follows a different collection and extraction procedure depending on that content: for example, it may extract the entire content of the HTTP message, part of the content, or nothing. This achieves selective data collection, eases the technical and cost pressure of massive data, and improves the efficiency and accuracy of data collection.

[0044] In another embodiment of the web page data collection method of the invention, HTTP messages flowing to the website's server are parsed according to the RFC protocol specifications, and specific information is extracted from the corresponding positions of the data to be collected according to the parsed content of the tag field. Specifically, when processing HTTP, the DPI device parses the custom header field content (that is, the content of the tag field) and invokes a different data collection procedure depending on how that content is defined, thereby extracting the web page data. The custom content embedded in the HTTP header consists of two parts, a tag and its content. By convention, a custom header field may begin with "X-"; for example, "X-type: 0" may mean extracting the entire content of the HTTP message, and "X-type: 1" may mean extracting only the URL. Depending on the levels of data to be collected, one or more custom header tags can be defined and given different contents to represent the extraction of different data.
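
The tag handling in [0044] can be illustrated with the sketch below. It assumes the captured message is a raw HTTP/1.1 request held in memory; the function names are hypothetical, rebuilding the URL from the `Host` header and the request target is an assumption about what "extracting only the URL" means, and treating a missing or unknown tag as "collect nothing" is an assumed default.

```python
from typing import Dict, Optional, Tuple


def parse_head(message: bytes) -> Tuple[str, Dict[str, str]]:
    """Split a raw HTTP request into its request line and lower-cased headers."""
    head, _, _ = message.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return lines[0], headers


def collect(message: bytes) -> Optional[dict]:
    """Collect data from a captured request according to its X-type tag."""
    request_line, headers = parse_head(message)
    tag = headers.get("x-type")
    if tag == "0":
        # Tag 0: keep the entire message.
        return {"raw": message}
    if tag == "1":
        # Tag 1: keep only the URL, rebuilt from the Host header and request target.
        parts = request_line.split(" ")
        target = parts[1] if len(parts) >= 2 else ""
        return {"url": "http://" + headers.get("host", "") + target}
    # Missing or unknown tag: collect nothing (an assumed default).
    return None
```

Header names are compared in lower case because HTTP header field names are case-insensitive.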

[0045] FIG. 3 is a schematic flowchart of yet another embodiment of the web page data collection method of the invention.

[0046] As shown in FIG. 3, this embodiment may include the following steps:

[0047] S302: set up the DPI collection system and mirror the data of the target website to it;

[0048] S304: build the web page address information library, which stores the addresses of the pages to be captured;

[0049] S306: build a selective parsing depth information library, which stores the data collection and parsing subroutines corresponding to the different custom tags, for example a full-data collection and parsing subroutine used to extract the entire content of an HTTP message and a partial-data collection and parsing subroutine used to extract part of the content of an HTTP message;

[0050] S308: selectively capture pages flowing to the web server according to the web page address information library;

[0051] S310: store the captured data;

[0052] S312: parse the content of the HTTP messages of the captured pages, and selectively collect data from the captured HTTP messages according to the content of the tag field in those messages;

[0053] S314: store the parsed data by category.
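
One way to picture the library of step S306 and the categorized storage of step S314 is a lookup table from tag content to subroutine, as in the hedged sketch below. The subroutine names, the header fields kept by the partial subroutine, and the in-memory `store` are assumptions for illustration; headers are assumed to be already parsed into a dictionary with lower-cased names.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, Iterable, List, Tuple

Headers = Dict[str, str]
Parser = Callable[[Headers, bytes], dict]


def collect_everything(headers: Headers, body: bytes) -> dict:
    """Full-data subroutine: keep all headers and the body."""
    return {"headers": dict(headers), "body": body}


def collect_selected(headers: Headers, body: bytes) -> dict:
    """Partial-data subroutine: keep a few header fields (field choice is assumed)."""
    wanted = ("host", "user-agent")
    return {name: headers[name] for name in wanted if name in headers}


# Hypothetical "selective parsing depth" library: tag content -> subroutine (S306).
PARSER_REPOSITORY: Dict[str, Parser] = {
    "0": collect_everything,
    "1": collect_selected,
}


def parse_and_store(
    captured: Iterable[Tuple[Headers, bytes]],
    store: DefaultDict[str, List[dict]],
) -> None:
    """Run the subroutine selected by each tag and file the result by tag (S312, S314)."""
    for headers, body in captured:
        tag = headers.get("x-type", "")
        parser = PARSER_REPOSITORY.get(tag)
        if parser is None:
            continue  # no subroutine registered: nothing is collected
        store[tag].append(parser(headers, body))
```

Keeping the subroutines in a table means a new collection level can be supported by registering another entry, without changing the capture or dispatch code.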

[0054] FIG. 4 is a schematic structural diagram of one embodiment of the deep packet inspection device of the invention.

[0055] As shown in FIG. 4, the deep packet inspection device 10 of this embodiment may include:

[0056] an address filtering module 11, configured to selectively capture HTTP messages flowing to the web server according to the web page address information library;

[0057] a message parsing module 12, connected to the address filtering module and configured to parse the content of the captured HTTP messages;

[0058] a tag content extraction module 13, connected to the message parsing module and configured to extract the content of the tag field in the HTTP messages, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message; optionally, that scope may include extracting all data in the HTTP message, extracting part of the data in the HTTP message, and extracting no data from the HTTP message;

[0059] a data collection module 14, connected to the tag content extraction module and configured to selectively collect data from the captured HTTP messages according to the content of the tag field.

[0060] Optionally, the tag field may be placed in a header field of the HTTP messages flowing to the web server.

[0061] When collecting data, this embodiment first filters the pages to be collected by web page address, which greatly reduces the volume of data that has to be processed. The embodiment also parses the HTTP header fields of the captured pages, extracts the content of the custom header field tag, and follows a different collection and extraction procedure depending on that content: it may extract the entire content of the HTTP message, part of the content, or nothing. This achieves selective data collection, eases the technical and cost pressure of massive data, and improves the efficiency and accuracy of data collection.

[0062] FIG. 5 is a schematic structural diagram of one embodiment of the web page data collection system of the invention.

[0063] As shown in FIG. 5, the web page data collection system of this embodiment may include the deep packet inspection device 10 of the preceding embodiment and a web page data processing device 21, wherein the web page data processing device 21 includes:

[0064] a collection scope determination module 211, configured to determine, according to data collection requirements, the data collection scope of the HTTP message of each web page;

[0065] a data processing module 212, connected to the collection scope determination module and configured to add a tag field to the HTTP message of each web page to form the HTTP messages flowing to the web server, wherein the content of the tag field indicates the data collection scope of the web page's HTTP message.

[0066] Although some specific embodiments of the invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Patent Citations
Cited Patent | Filing date | Publication date | Applicant | Title
CN1402156A * | 22 Aug 2001 | 12 Mar 2003 | 威瑟科技股份有限公司 | Web site information extracting system and method
CN101094135A * | 23 Jun 2006 | 26 Dec 2007 | 腾讯科技(深圳)有限公司 | Method and system for extracting information of content in Internet
CN101399749A * | 27 Sep 2007 | 1 Apr 2009 | 华为技术有限公司 | Method, system and device for packet filtering
CN101556609A * | 19 May 2009 | 14 Oct 2009 | 杭州信杨通信技术有限公司 | Customer behavior analysis and service system based on web contents
CN101667182A * | 5 Sep 2008 | 10 Mar 2010 | 华为技术有限公司 | Method, system and device for performing secondary operation on web pages
US20020091755 * | 5 Jan 2001 | 11 Jul 2002 | Attila Narin | Supplemental request header for applications or devices using web browsers
WO2007101478A1 * | 9 Mar 2006 | 13 Sep 2007 | Tecs Research And Development Limited | A method of monitoring online banner activity
Non-Patent Citations
Reference
1 *中华人民共和国工业和信息化部: "《中华人民共和国通信行业标准》", 15 June 2009, article "深度包检测设备技术要求"
2 *韩树人 等: "基于嵌入式Web服务器的远程实时数据采集", 《计算机技术与发展》, vol. 18, no. 1, 31 January 2008 (2008-01-31)
Referenced by
Citing Patent | Filing date | Publication date | Applicant | Title
CN104486157A * | 16 Dec 2014 | 1 Apr 2015 | 国家电网公司 | Information system performance detecting method based on deep packet analysis
Classifications
International Classification: H04L29/08
Legal Events
Date | Code | Event | Description
30 Mar 2011 | C06 | Publication |
6 Jul 2011 | C10 | Entry into substantive examination |
8 Jan 2014 | C14 | Grant of patent or utility model |