CN102768660B - Dynamic-interaction-based generation method of template of internet acquisition system - Google Patents

Info

Publication number
CN102768660B
CN102768660B CN201110114641.6A CN201110114641A CN102768660B CN 102768660 B CN102768660 B CN 102768660B CN 201110114641 A CN201110114641 A CN 201110114641A CN 102768660 B CN102768660 B CN 102768660B
Authority
CN
China
Prior art keywords
node
acquisition system
source code
generation method
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110114641.6A
Other languages
Chinese (zh)
Other versions
CN102768660A (en)
Inventor
陈宗华
陈永江
伊鹏
刘永超
李存华
仲兆满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golden Feather Network Technology Nanjing Co ltd
Original Assignee
JIANGSU JINGE NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention discloses a dynamic-interaction-based template generation method for an internet acquisition system. The method comprises the following steps: first, accessing the internet to load a target web page and obtain a web page text source code set S; second, identifying a node set N in the set S according to a regular expression for tags, and adding a unique sequence number to each node in the set N; third, constructing a model tree T with an interdependent hierarchical structure from the set N; and fourth, inputting a node ID and a move operation, and using a traversal analysis algorithm to iteratively calculate the preceding tag expression and the following tag expression of the node. With this method, template customization is completed through human-computer interaction, and the operation can be carried out with mouse clicks alone. The collected information distinguishes the titles, authors, contents, reply times, reply contents, and other items that can be seen in a browser. Collection efficiency is high: only the content defined by the template is collected, so little network bandwidth is used. In addition, the template supports new-content detection, so that only updated content is collected.

Description

A template generation method for an internet acquisition system based on dynamic interaction
Technical field
The invention belongs to the field of internet information acquisition, and specifically relates to a template generation method for an internet acquisition system based on dynamic interaction.
Background art
With the rapid development of the information society, the network has become an important source of information. Network information, however, is massive, complex, and unstructured, which makes acquiring it, and the network-based information search, analysis, and research built on it, very difficult. Extensive practice also shows that collecting information from the various information carriers on the network (news sites, blogs, forums, microblogs, and so on) is difficult. In particular, when information must be collected ad hoc for a specific target, high demands are placed on the acquisition system's adaptability, collection efficiency, and ease of operation. To meet growing market demand, methods for rapidly generating a template for each acquisition target have emerged. In the field of automatic collection, the template generation method based on dynamic interaction can customize templates for specific information carriers (news sites, blogs, forums, microblogs, and so on), covering common elements such as the title, content, author, and time of occurrence of the information.
The template generation method based on dynamic interaction is applied, on the one hand, to public opinion management, for use by government departments such as public security, safety, and safety supervision; on the other hand, it can also be used in information analysis, for example in the headhunting industry. The method assumes that the information carriers of the internet are varied and that collecting from different carriers requires different templates. Many information acquisition systems are already on the market, but most suffer from problems such as overly specific collected content, a high technical threshold for template configuration, and a small range of acquisition targets. For example, the TIS information collector is well suited to collecting from news websites but is less efficient on other information carriers (such as forums, blogs, and microblogs). Heritrix supports a very wide range of acquisition targets, but it takes in everything it collects and does not apply different templates to different kinds of information, so the collected information is a mix of useful and useless material that is hard to analyze. The Network expression information acquisition system improves on the defects of the two systems above, but not thoroughly: when collecting from different targets it only defines some collection rules to distinguish them, and those rules are highly specialized and require considerable operating skill, which hinders broad adoption in the market.
Summary of the invention
The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a template generation method for an internet acquisition system based on dynamic interaction; the method can load the acquisition target automatically and build a specific template through interaction with the user.
To solve the above problem, the invention adopts the following technical solution: a template generation method for an internet acquisition system based on dynamic interaction, characterized in that its steps are as follows:
(1) Access the internet and load the target page to obtain the web page text source code set S;
(2) Identify the node set N in the set S according to a regular expression for tags, and add a unique sequence number to each node in the set N;
(3) Construct a hierarchical model tree T with interdependent levels from the node set N;
(4) Finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the node's preceding and following tag expressions.
In the technical solution of the template generation method described above, step (1) can be carried out by the following specific operation steps:
(1-1) Input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) Use Cobra to execute the JavaScript in the set S1 and fill the returned results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
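As a rough illustration of step (1-1) only, the sketch below fetches the raw source of a page with Python's standard urllib in place of HttpClient. The function name fetch_source, the User-Agent string, and the timeout value are assumptions made for this example, and the JavaScript execution of step (1-2), which the description assigns to Cobra, is not reproduced here.

```python
from urllib.request import Request, urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Return the raw text behind `url` (the set S1 in the description)."""
    # Hypothetical User-Agent; some servers reject requests that send none.
    request = Request(url, headers={"User-Agent": "template-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string still contains unexecuted scripts, so a page such as the one in Embodiment 6 would show its original paragraph text rather than the script-generated text until a JavaScript-capable step has processed it.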
In the technical solution of the template generation method described above, step (2) can be carried out by the following specific operation steps:
(2-1) Obtain a node n by matching the text source code set S against the regular expression;
(2-2) Determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, execute step (2-3); otherwise execute step (2-1);
(2-3) Use a sequence generator to generate a unique system identification number and append it to the tag of node n;
(2-4) Return to and repeat step (2-1) until all nodes n have been identified; the set composed of all nodes n is the node set N.
In the technical solution of the template generation method described above, step (3) can be carried out by the following specific operation steps:
(3-1) Use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time, create a custom hierarchical model tree node HtmlNode as node R, where R is initially the root node of the tree;
(3-2) Assign the node content, ID, and child node information of R' to node R;
(3-3) If node R' has child nodes, convert the tag of R' into an end node and add it as a sibling node of R;
(3-4) Assign the next node of R' to R' and return to step (3-2), until traversal of the HTML DOM is finished; the hierarchy produced by the nodes R is the hierarchical model tree T.
In the technical solution of the template generation method described above, step (4) can be carried out by the following specific operation steps:
(4-1) Click the load-node operation to obtain the node identifier ID;
(4-2) Look up the corresponding node object in tree T by ID;
(4-3) Click the preceding/following move operation O;
(4-4) Compute the preceding/following boundary;
(4-5) Obtain the preceding/following expression.
Compared with the prior art, the template generation method for an internet acquisition system based on dynamic interaction of the invention has the following effects:
1. Template customization is completed through human-computer interaction: the user only needs to click the mouse to complete the operation, and no specialized technical knowledge is required;
2. Information attributes are distinguished during collection: the collected information distinguishes the title, author, content, reply time, reply content, and other items seen in the browser;
3. Collection efficiency is high: during collection only the content defined by the template is collected, so little network bandwidth is used;
4. The template supports new-content detection: during collection only updated content is collected, and nothing is collected twice.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the invention;
Fig. 2 is a flowchart of identifying the node set N in the tag set S, as described in step 102 of Fig. 1;
Fig. 3 is a flowchart of constructing the hierarchical model tree T of the nodes, as described in step 103 of Fig. 1;
Fig. 4 is a flowchart of inputting a node ID and a move operation and computing the node's preceding and following tag expressions, as described in step 104 of Fig. 1.
Detailed description of the embodiments
The specific technical solution of the invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the invention; this description does not limit the scope of its claims.
Embodiment 1: a template generation method for an internet acquisition system based on dynamic interaction, whose steps are as follows:
(1) Access the internet and load the target page to obtain the web page text source code set S;
(2) Identify the node set N in the set S according to a regular expression for tags, and add a unique sequence number to each node in the set N;
(3) Construct a hierarchical model tree T with interdependent levels from the node set N;
(4) Finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the node's preceding and following tag expressions.
Embodiment 2: the specific operation steps of step (1) of the template generation method described in Embodiment 1 are as follows:
(1-1) Input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) Use Cobra to execute the JavaScript in the set S1 and fill the returned results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
Embodiment 3: the specific operation steps of step (2) of the template generation method described in Embodiment 1 or 2 are as follows:
(2-1) Obtain a node n by matching the text source code set S against the regular expression;
(2-2) Determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, execute step (2-3); otherwise execute step (2-1);
(2-3) Use a sequence generator to generate a unique system identification number and append it to the tag of node n;
(2-4) Return to and repeat step (2-1) until all nodes n have been identified; the set composed of all nodes n is the node set N.
Embodiment 4: the specific operation steps of step (3) of the template generation method described in Embodiment 1, 2, or 3 are as follows:
(3-1) Use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time, create a custom hierarchical model tree node HtmlNode as node R, where R is initially the root node of the tree;
(3-2) Assign the node content, ID, and child node information of R' to node R;
(3-3) If node R' has child nodes, convert the tag of R' into an end node and add it as a sibling node of R;
(3-4) Assign the next node of R' to R' and return to step (3-2), until traversal of the HTML DOM is finished; the hierarchy produced by the nodes R is the hierarchical model tree T.
Embodiment 5: the specific operation steps of step (4) of the template generation method described in any one of Embodiments 1-4 are as follows:
(4-1) Click the load-node operation to obtain the node identifier ID;
(4-2) Look up the corresponding node object in tree T by ID;
(4-3) Click the preceding/following move operation O;
(4-4) Compute the preceding/following boundary;
(4-5) Obtain the preceding/following expression.
Embodiment 6: with reference to Figs. 1-4, an operational experiment carried out with the template generation method for an internet acquisition system based on dynamic interaction of the invention comprises the following steps:
Step 101: load the target page to obtain the text source code set S, specifically as follows:
(1) Input the web page address and use HttpClient to obtain the original HTML source code set S1. For example, the original HTML source code set S1 obtained from the internet is as follows:
<html>
<head>
<title>Template target to be generated</title>
<script type="text/javascript">
var go = function(){
document.getElementById("content_id").innerHTML = "JS replacement"; }
</script>
</head>
<body onload="javascript:go();">
<p id="content_id">content</p>
</body>
</html>
The HTML source code set is made up of various HTML tags and the page content;
(2) Use Cobra to execute the JavaScript in S1 and fill the results into the corresponding positions in S1. Much web page information is only presented after scripts such as JS have run, so the information carried by scripts must be processed. For example, in the set S1 shown in sub-step (1), part of the information resides in the JS script; after Cobra executes it, the text content of the p tag should be "JS replacement". This processing yields a new source code set S2, which is the web page text source code set S.
Step 102: identify the node set N in the tag set S. With reference to Fig. 2, this comprises the following steps:
Step 201: use the regular expression r = "<([^<>]*)>" to identify a tag in the set S as node n. Node n comprises a name, attributes, and text content. For example, the node n corresponding to the p tag in the set S1 shown in sub-step (1) of step 101 is "<p id="content_id">content";
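The matching in step 201 can be tried out with a few lines of Python. This is only a sketch: the helper name iter_nodes is hypothetical, and treating a node's text content as everything between its tag and the next tag is an assumption drawn from the p-tag example above.

```python
import re

TAG_RE = re.compile(r"<([^<>]*)>")  # the regular expression r from step 201


def iter_nodes(source):
    """Yield (tag_body, text) for every tag in `source`, in document order."""
    matches = list(TAG_RE.finditer(source))
    for index, match in enumerate(matches):
        # The text of a node runs from the end of its tag to the start of the next tag.
        if index + 1 < len(matches):
            text_end = matches[index + 1].start()
        else:
            text_end = len(source)
        yield match.group(1).strip(), source[match.end():text_end].strip()
```

For the p tag of the sample page, the first pair yielded carries the tag body p id="content_id" and the text content, which together correspond to the node n quoted in step 201.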
Step 202: determine whether node n is null. If it is null, all tags in the set S have been identified and the identification begun in step 201 is complete; if it is not null, the tags in the set S have not all been identified, and the method proceeds to step 203;
Step 203: determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node. If it is one of these types, return to step 201 to identify the next node n; if it is none of them, proceed to step 204;
Step 204: the sequence generator generates a unique system identifier and appends it to the node's tag. For example, to add a sequence number to the p tag in the set S1 shown in sub-step (1) of step 101: if the generated sequence number is 20100120112233 (the current date-time stamp + a random number between 1 and 1000000), it is added to the p tag, whose content then becomes "<p id="content_id" systemid="20100120112233">". After the system identifier has been added, return to step 201, until a system identifier has been added to every node.
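Steps 201 to 204 amount to one pass over the source that skips the excluded tag types and appends a systemid attribute to every other opening tag. The sketch below is one way to write that pass. The function name add_system_ids is hypothetical, and a simple counter stands in for the timestamp-plus-random sequence generator so that the output is repeatable; that substitution is an assumption of this example, not the method of the description.

```python
import itertools
import re

TAG_RE = re.compile(r"<([^<>]*)>")    # regular expression r from step 201
SKIPPED_NAMES = {"link", "br"}         # tag names skipped in step 203


def add_system_ids(source, start=1):
    """Return (tagged_source, ids); ids maps each system ID to its tag name."""
    sequence = itertools.count(start)  # stand-in for the sequence generator
    ids = {}
    in_script = False

    def tag_one(match):
        nonlocal in_script
        original = match.group(0)
        inner = match.group(1).strip()
        name = inner.split()[0].rstrip("/").lower() if inner else ""
        if name == "/script":
            in_script = False
            return original
        if in_script or not name or name[0] in "/!" or name in SKIPPED_NAMES:
            return original            # nested in script, end tag, comment, link, br
        if name == "script":
            in_script = True
            return original
        system_id = str(next(sequence))
        ids[system_id] = name
        if inner.endswith("/"):        # keep self-closing tags well formed
            return f'<{inner[:-1].rstrip()} systemid="{system_id}"/>'
        return f'<{inner} systemid="{system_id}">'

    return TAG_RE.sub(tag_one, source), ids
```

A production sequence generator would also need to guarantee uniqueness across runs, which a timestamp combined with a random number only makes likely rather than certain.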
Step 103: construct the hierarchical model tree T of the nodes. With reference to Fig. 3, this comprises the following steps:
Step 301: use HtmlParser to parse the set S, to which the system identifiers have been added, and generate the HTML DOM structure R'; R' is initially the root node of the DOM;
Step 302: create a custom hierarchical model tree node HtmlNode as node R; R is initially the root node of the tree;
Step 303: assign the tag name, attributes, and text content of node R' to node R. For example, when the p tag in the set S1 shown in sub-step (1) of step 101 is assigned to node R, the information held by node R includes the name "<p id="content_id">" and the system identifier 20100120112233;
Step 304: determine whether the current node R' has child nodes. If it has, proceed to step 305; if it has not, go to step 307;
Step 305: convert the tag name of node R' into an end tag L. For example, the tag of node p in the set S1 shown in sub-step (1) of step 101, "<p id="content_id" systemid="20100120112233">", is converted into the end tag "</p>", denoted L;
Step 306: convert the end tag L obtained in step 305 into a node and add it as a sibling node of node R;
Step 307: obtain the next child node of R' and assign it to R' as the node for the next iteration;
Step 308: determine whether the newly assigned R' is null. If R' is null, traversal of the HTML DOM structure generated in step 301 is finished (the set of all nodes R is the hierarchical model tree T), and step 103 ends; if R' is not null, traversal of the HTML DOM structure generated in step 301 is not finished, and the method returns to step 303.
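The tree construction of step 103 can be approximated with Python's standard html.parser module standing in for the HtmlParser library. In this sketch the class TreeBuilder, the helpers build_tree and index_by_system_id, the synthetic "#document" root, and the VOID_TAGS list are all assumptions of the example. An end-tag node is added as a sibling only when the element has child nodes or text, following steps 304 to 306.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without content


@dataclass
class HtmlNode:
    name: str
    attrs: Dict[str, Optional[str]] = field(default_factory=dict)
    text: str = ""
    children: List["HtmlNode"] = field(default_factory=list)

    @property
    def system_id(self):
        return self.attrs.get("systemid")


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = HtmlNode("#document")    # synthetic root of tree T
        self._stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        node = HtmlNode(tag, dict(attrs))    # step 303: copy name and attributes
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_data(self, data):
        self._stack[-1].text += data.strip()

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for depth in range(len(self._stack) - 1, 0, -1):
            if self._stack[depth].name == tag:
                node = self._stack[depth]
                del self._stack[depth:]
                if node.children or node.text:
                    # Steps 305-306: the end tag becomes a sibling of the node.
                    self._stack[-1].children.append(HtmlNode("/" + tag))
                return


def build_tree(tagged_source):
    builder = TreeBuilder()
    builder.feed(tagged_source)
    builder.close()
    return builder.root


def index_by_system_id(root):
    """Map every system ID found in the tree to its node."""
    index, pending = {}, [root]
    while pending:
        node = pending.pop()
        if node.system_id is not None:
            index[node.system_id] = node
        pending.extend(node.children)
    return index
```

The index returned by index_by_system_id gives a direct lookup from a system identifier to its node, which is the kind of query that step 402 performs on tree T.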
Step 104: input a node ID and a move operation and compute the node's preceding and following tag expressions. With reference to Fig. 4, this comprises the following steps:
Step 401: click to load a page node and obtain the identifier ID with which the node was marked. For example, for the set S2 obtained in sub-step (2) of step 101, the browser displays the text "JS replacement"; select that text and click the load button, and the p tag passes its assigned system identifier 20100120112233 to step 402 as the request parameter;
Step 402: after receiving the request parameter, query tree T by ID to obtain the corresponding node;
Step 403: click the preceding/following move operation button;
Step 404: determine whether the operation is a preceding or a following operation. If it is a preceding operation, go to step 405; if it is a following operation, go to step 407;
Step 405: compute the preceding boundary;
Step 406: obtain the preceding expression;
Step 407: compute the following boundary;
Step 408: obtain the following expression.
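The description does not spell out how the boundaries in steps 405 and 407 are computed, so the following is only one possible reading, offered as a sketch. It assumes that the preceding boundary starts at the selected node's opening tag, that the following boundary starts at its matching end tag, and that each move operation shifts the corresponding boundary outward by one tag in document order. The names tag_expressions, pre_moves, and post_moves are hypothetical, and the sketch works on the flat tag sequence rather than on tree T to stay self-contained.

```python
import re

TAG_RE = re.compile(r"<[^<>]*>")
SYSTEMID_RE = re.compile(r'\s+systemid="[^"]*"')


def tag_name(tag):
    """Return the lowercase tag name, keeping a leading '/' for end tags."""
    body = tag[1:-1].strip()
    return body.split()[0].rstrip("/").lower() if body else ""


def find_end_tag(tags, start):
    """Index of the end tag that closes tags[start], or start if none is found."""
    name = tag_name(tags[start])
    depth = 0
    for i in range(start, len(tags)):
        current = tag_name(tags[i])
        if current == name:
            depth += 1
        elif current == "/" + name:
            depth -= 1
            if depth == 0:
                return i
    return start


def tag_expressions(tagged_source, system_id, pre_moves=0, post_moves=0):
    """Return (preceding_expression, following_expression) for one node."""
    tags = TAG_RE.findall(tagged_source)
    marker = f'systemid="{system_id}"'
    hits = [i for i, tag in enumerate(tags) if marker in tag]
    if not hits:
        raise KeyError(system_id)
    start = hits[0]
    end = find_end_tag(tags, start)
    # Each move widens the window by one tag, clamped to the document.
    pre = max(0, start - pre_moves)
    post = min(len(tags) - 1, end + post_moves)
    # The live page has no systemid attribute, so strip it from the result.
    return SYSTEMID_RE.sub("", tags[pre]), SYSTEMID_RE.sub("", tags[post])
```

Under these assumptions, a pair of expressions delimits the text of the selected node, and widening the window with move operations lets the user pick delimiters that are more distinctive on the target page.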

Claims (4)

1. A template generation method for an internet acquisition system based on dynamic interaction, characterized in that its steps are as follows:
(1) Access the internet and load the target page to obtain the web page text source code set S;
(2) Identify the node set N in the set S according to a regular expression for tags, and add a unique sequence number to each node in the set N; the specific operation steps are as follows:
(2-1) Obtain a node n by matching the text source code set S against the regular expression;
(2-2) Determine whether the type of node n is one of: end-tag node, script node or a node nested in it, link node, br node, or comment/comment-expression node; if it is none of these types, execute step (2-3); otherwise execute step (2-1);
(2-3) Use a sequence generator to generate a unique system identification number and append it to the tag of node n;
(2-4) Return to and repeat step (2-1) until all nodes n have been identified; the set composed of all nodes n is the node set N;
(3) Construct a hierarchical model tree T with interdependent levels from the node set N;
(4) Finally, input a node ID and a move operation, and use a traversal analysis algorithm to iteratively compute the node's preceding and following tag expressions.
2. The template generation method for an internet acquisition system based on dynamic interaction according to claim 1, characterized in that the specific operation steps of step (1) are as follows:
(1-1) Input the web page address and use HttpClient to obtain the original HTML source code set S1;
(1-2) Use Cobra to execute the JavaScript in the set S1 and fill the returned results into the corresponding positions in S1, finally obtaining the text source code set S2, which is the web page text source code set S.
3. The template generation method for an internet acquisition system based on dynamic interaction according to claim 1, characterized in that the specific operation steps of step (3) are as follows:
(3-1) Use HtmlParser to generate the HTML DOM structure as R', where R' is initially the root node of the DOM; at the same time, create a custom hierarchical model tree node HtmlNode as node R, where R is initially the root node of the tree;
(3-2) Assign the node content, ID, and child node information of R' to node R;
(3-3) If node R' has child nodes, convert the tag of R' into an end node and add it as a sibling node of R;
(3-4) Assign the next node of R' to R' and return to step (3-2), until traversal of the HTML DOM is finished; the hierarchy produced by the nodes R is the hierarchical model tree T.
4. The template generation method for an internet acquisition system based on dynamic interaction according to claim 1, characterized in that the specific operation steps of step (4) are as follows:
(4-1) Click the load-node operation to obtain the node identifier ID;
(4-2) Look up the corresponding node object in tree T by ID;
(4-3) Click the preceding/following move operation O;
(4-4) Compute the preceding/following boundary;
(4-5) Obtain the preceding/following expression.
CN201110114641.6A 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system Active CN102768660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110114641.6A CN102768660B (en) 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110114641.6A CN102768660B (en) 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system

Publications (2)

Publication Number Publication Date
CN102768660A CN102768660A (en) 2012-11-07
CN102768660B true CN102768660B (en) 2014-09-03

Family

ID=47096064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110114641.6A Active CN102768660B (en) 2011-05-05 2011-05-05 Dynamic-interaction-based generation method of template of internet acquisition system

Country Status (1)

Country Link
CN (1) CN102768660B (en)

Also Published As

Publication number Publication date
CN102768660A (en) 2012-11-07

Similar Documents

Publication Publication Date Title
CN109739994B (en) API knowledge graph construction method based on reference document
CN103970845B (en) Webpage filtering method based on program slicing technology
Kellou-Menouer et al. Schema discovery in RDF data sources
CN103678511B (en) The method and device of webpage content extraction is carried out according to visual template
CN103246732B (en) A kind of abstracting method of online Web news content and system
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN102567494B (en) Website classification method and device
Ji et al. Tag tree template for Web information and schema extraction
Liao et al. Research on learning owl ontology from relational database
CN105389330B (en) Across the community open source resources of one kind match correlating method
Hu et al. A Virtual Dataspaces Model for large-scale materials scientific data access
Liu et al. Dynamically querying possibilistic XML data
CN103853659B (en) Browser compatibility testing method and device
CN102768660B (en) Dynamic-interaction-based generation method of template of internet acquisition system
Setayesh et al. Presentation of an Extended Version of the PageRank Algorithm to Rank Web Pages Inspired by Ant Colony Algorithm
Kadam et al. A methodology for template extraction from heterogeneous web pages
Jiang et al. Personalized recommendation method of E-commerce based on fusion technology of smart ontology and big data mining
Li et al. Ontology-based modeling of manufacturing information and its semantic retrieval
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
CN102236735B (en) Method and system for processing data relationships in power design
Ye et al. Social network supported process recommender system
Xu et al. Micro-blog sentiment analysis based on emoticon preferences and emotion commonsense
Piris Extracting knowledge bases from table-structured web resources applied to the semantic based requirements engineering methodology softwiki
Wei et al. A new path filling method on data preprocessing in web mining
Loskyll et al. UbisEditor 3.0: Collaborative ontology development on the Web

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200521

Address after: Room 17-2-1209, Huaguoshan Avenue, Haizhou District, Lianyungang City, Jiangsu Province

Patentee after: Lianyungang Dayu Information Technology Co.,Ltd.

Address before: 222000, room 7, building 706, West Tower, dragon river building, Sinpo District, Jiangsu, Lianyungang, China

Patentee before: JIANGSU JINGE NETWORK TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240129

Address after: Room 302, Building G, Yunmi City, No. 19 Ningshuang Road, Yuhuatai District, Nanjing City, Jiangsu Province, 210012

Patentee after: Golden Feather Network Technology (Nanjing) Co.,Ltd.

Country or region after: China

Address before: Room 17-2-1209, Huaguoshan Avenue, Haizhou District, Lianyungang City, Jiangsu Province, 222000

Patentee before: Lianyungang Dayu Information Technology Co.,Ltd.

Country or region before: China