CN1327334C - File grouping device

Info

Publication number: CN1327334C
Application number: CNB02151836XA
Authority: CN
Inventors: 武并佳则
Original assignee: Association For Advancement Of Information Processing; Sumitomo Electric Industries Ltd
Current assignee: Association For Advancement Of Information Processing; Sumitomo Electric Industries Ltd
Priority date: 2001-11-08
Filing date: 2002-11-08
Publication date: 2007-07-18
Anticipated expiration: 2022-11-08
Also published as: CN1432908A

Abstract

The present invention makes it possible to quickly and easily cluster a plurality of documents and to determine the central document of each cluster. A document clustering device 102 is provided with a document group storing part 118 for storing a document group, a keyword extracting part 18 for extracting keywords from the document group, a similarity information retrieving part 20 for calculating the similarity among all the documents, a similarity table 30 for storing the similarities, a clustering part 22 for clustering the documents based on the bias of the similarity distribution, a central document calculating part 112 for calculating the central document of each cluster, and a clustering information preparing part 114 and a clustering information storing part 120 for preparing and storing information related to each cluster. The device 102 is also provided with a document classifying part 116 for comparing an additional document with the characteristic document of each cluster and classifying it.

Description

File grouping device

Technical field

The present invention relates to a file grouping device, and in particular to a file grouping device that groups similar files in order to support the creation of an FAQ (Frequently Asked Questions) from stored fault records.

Background technology

For a business operator with many customers, meeting customer demands is an important issue. It is widely recognized that answering customer inquiries and complaints quickly and accurately improves customer satisfaction, and that doing so cost-effectively bears directly on the success of the business.

Conventionally, in a help desk system that stores the inquiries and complaints received from customers, together with the answers given, as fault records, the typical frequently asked inquiries and their corresponding answers are compiled into an FAQ. When a customer inquiry or complaint is received, the FAQ is consulted first in preparing the answer, which improves the efficiency of customer support.

However, an FAQ is usually created by hand on the basis of the stored fault records. When the number of stored fault records is large, creating the FAQ therefore requires a great deal of labor. Moreover, although it is important for a business operator to pick up the hints of customer needs hidden in inquiries and complaints, such analysis is difficult when the number of fault records is large.

This task of analyzing data and finding its meaning is not limited to creating an FAQ from fault records. File groups produced by everyday economic and academic activity are also often analyzed to discover the value they contain. It is therefore necessary to classify (group) a file group into groups of similar files, and a system that can do this in as short a time as possible without much labor is desired.

Japanese Laid-Open Patent Publication No. Hei 5-205058 discloses such a system. In the system described in that publication, a grouping-state evaluation quantity that takes its minimum value when the data are classified into the optimum number of groups is defined from the number of data elements in each classified group and the variance of the data, and the data are classified using the number of groups for which this evaluation quantity is minimal.

However, the system disclosed in Japanese Laid-Open Patent Publication No. Hei 5-205058 has the problem that the grouping-state evaluation quantity is difficult to determine. It also cannot classify the data into a specified number of groups. Furthermore, when other data are added after the data have once been classified, the grouping must be performed again, which takes a long time.

Summary of the invention

The present invention has been made to solve the above problems, and an object thereof is to provide a file grouping device that supports the grouping of similar files.

Another object of the present invention is to provide a file grouping device that can group similar files in a short time.

A further object of the present invention is to provide a file grouping device that can, in a short time, classify files added at any time into the appropriate groups.

A file grouping device according to the present invention comprises: a similarity calculating unit that calculates the similarity between the files of a file group; a similarity threshold calculating unit, connected to the similarity calculating unit, that calculates a similarity threshold for grouping the file group from the bias in the distribution of the similarities between the files; and a grouping unit, connected to the similarity threshold calculating unit and the similarity calculating unit, that groups the file group according to the similarity threshold and the similarities between the files. The similarity threshold calculating unit comprises: a similarity threshold-group number relation calculating unit that obtains, from the similarities between the files, the relation between an arbitrary similarity threshold and the number of groups produced when the grouping unit groups the files using that threshold; and a calculating unit, connected to the similarity threshold-group number relation calculating unit, that calculates the similarity threshold from the bias in the distribution of inter-file similarities that appears in the relation between the similarity threshold and the number of groups.

Using a similarity threshold determined from the bias in the distribution of inter-file similarities, the file group can be grouped according to the similarities between the files. The files in the file group can thus easily be grouped into suitable groups automatically.

Preferably, the similarity threshold calculating unit comprises: a similarity threshold-group number relation calculating unit that obtains, from the similarities between the files, the relation between an arbitrary similarity threshold and the number of groups produced when the grouping unit groups the files using that threshold; and a unit, connected to the similarity threshold-group number relation calculating unit, that calculates the similarity threshold from the bias in the inter-file similarities that appears in the relation between the similarity threshold and the number of groups.

A similarity threshold suitable for grouping can be calculated from the bias in the similarity distribution that appears in the relation between the similarity threshold and the number of groups. The most suitable similarity threshold can thus be calculated automatically.

Preferably, the similarity threshold calculating unit further comprises a unit, connected to the similarity threshold-group number relation calculating unit, that calculates the optimum similarity threshold for having the grouping unit group the file group into the number of groups specified by the operator.

In addition to grouping into an automatically determined number of groups, a new similarity threshold that yields the number of groups specified by the operator can be calculated and the grouping performed again. The file group can thus be classified into the desired number of groups.

Preferably, the file grouping device further comprises a similarity storage unit that stores the inter-file similarities calculated by the similarity calculating unit, and the similarity threshold calculating unit and the grouping unit use the similarities stored in the similarity storage unit to perform their respective calculation processing and grouping processing.

Once the calculated similarities have been stored, processing is faster when the similarity-based calculation and the grouping are repeated.

Preferably, the file grouping device further comprises: a characteristic file calculating unit that calculates a characteristic file for each group formed by the grouping unit; and an additional grouping unit that groups an ungrouped added file according to the similarity between the added file and the characteristic file of each group.

When further files to be grouped are added after the initial grouping, the added files can be grouped according to their similarity to the characteristic file of each group. Because the grouping need not be repeated from the beginning, the additional grouping can be performed at high speed.

Preferably, the additional grouping unit comprises: a unit that calculates the maximum of the similarities between the added file and the characteristic files of the groups; a unit that judges whether the maximum satisfies a prescribed condition; and a unit that, when the maximum is judged to satisfy the prescribed condition, classifies the added file into the group that gave the maximum.

When an added file is to be classified into an existing group, classifying it into any group is inappropriate if its similarity to none of the existing groups satisfies the prescribed condition. By classifying the added file into the group with the maximum similarity only when the prescribed condition is satisfied, inappropriate classification can be avoided.

Preferably, the additional grouping unit further comprises a unit that classifies the added file into a specific unclassified group when the maximum is judged not to satisfy the prescribed condition.

By classifying files that do not fit any group into a specific unclassified group, the files that are dissimilar to every existing group can be collected in one place.

Preferably, the additional grouping unit further comprises a unit that, in response to the number of added files classified into the unclassified group satisfying a prescribed condition, performs grouping on the added files classified into the unclassified group.

When the number of added files classified into the unclassified group satisfies the prescribed condition, grouping is performed on those files. Because these files are dissimilar to every existing group, the result of this grouping is the addition of new groups. Grouping need not be repeated on all files; only the files in the unclassified group are grouped, so all files, including the added ones, can be grouped appropriately in a short time.
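The additional grouping described above might be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the similarity measure or the "prescribed conditions", so the word-overlap similarity `jaccard`, the cut-off `min_similarity`, the pool size `pool_limit`, the `regroup` callback and all function names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; an assumed stand-in for the device's similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def classify_added(doc, characteristic, min_similarity):
    """Return the id of the most similar group, or None for the unclassified group."""
    best_id, best_sim = None, -1.0
    for group_id, char_doc in characteristic.items():
        s = jaccard(doc, char_doc)
        if s > best_sim:
            best_id, best_sim = group_id, s
    # Assumed "prescribed condition": the maximum similarity reaches min_similarity.
    return best_id if best_sim >= min_similarity else None


def add_file(doc, groups, characteristic, unclassified,
             min_similarity=0.5, pool_limit=2,
             regroup=lambda docs: [list(docs)]):
    """Place one added file; regroup the unclassified pool once it is large enough."""
    group_id = classify_added(doc, characteristic, min_similarity)
    if group_id is not None:
        groups[group_id].append(doc)
        return
    unclassified.append(doc)
    # Assumed "prescribed condition" on the pool: it holds pool_limit files.
    if len(unclassified) >= pool_limit:
        for members in regroup(unclassified):
            new_id = max(groups, default=-1) + 1
            groups[new_id] = members
            characteristic[new_id] = members[0]  # placeholder characteristic file
        unclassified.clear()


groups = {0: ["printer does not print"], 1: ["cannot log in to mail"]}
characteristic = {0: "printer does not print", 1: "cannot log in to mail"}
unclassified = []
for added in ["printer does not print color",
              "screen flickers",
              "screen flickers badly"]:
    add_file(added, groups, characteristic, unclassified)
print(groups)
```

Only the pooled files are regrouped, so the existing groups are never recomputed. The default `regroup` simply turns the whole pool into one new group; in practice a threshold-based grouping would be passed in instead.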

A file grouping device according to another aspect of the present invention comprises: a similarity calculating unit that obtains the similarity between the files of a file group; a group number receiving unit that receives a number of groups input by the operator; a grouping unit, connected to the similarity calculating unit and the group number receiving unit, that groups the file group according to a predetermined similarity threshold and the bias in the distribution of the similarities; a group number match judging unit, connected to the group number receiving unit and the grouping unit, that judges whether the number of groups in the grouping result matches the number of groups received from the operator by the group number receiving unit; and a similarity threshold changing unit, connected to the group number receiving unit, the group number match judging unit and the grouping unit, that changes the predetermined similarity threshold according to the output of the group number match judging unit and supplies it to the grouping unit.

An appropriate similarity threshold is determined automatically to match the number of groups specified by the operator, and the grouping is performed automatically. The operator need not repeat the grouping with various similarity thresholds, and appropriate grouping into the desired number of groups is obtained.

Preferably, the file grouping device further comprises a unit that excludes, from the objects grouped by the grouping unit, any group containing no more than a number of files determined by a prescribed method.

Regrouping after excluding the groups that contain few files improves the accuracy of the grouping.

Preferably, the file grouping device further comprises a similarity storage unit that stores the inter-file similarities calculated by the similarity calculating unit, and, when the similarity storage unit holds the latest similarities, the grouping unit performs the grouping using the similarities stored in the similarity storage unit.

Once the calculated similarities have been stored, they need not be calculated again in later grouping, which speeds up repeated grouping.

Brief description of the drawings

Fig. 1 is a block diagram showing the configuration of the FAQ creation support system according to the first embodiment of the present invention.

Fig. 2 shows an example of the FAQ creation support main screen.

Fig. 3 is a flowchart of FAQ creation processing in the multiple keyword specification mode.

Fig. 4 shows an example of the screen for narrowing down by conditional search.

Fig. 5 shows an example of the keyword extraction screen.

Fig. 6 is a flowchart of FAQ creation processing with a specified similarity threshold.

Fig. 7 shows an example of the "similarity threshold specification" tab panel.

Fig. 8 is a flowchart of grouping processing by similarity threshold.

Fig. 9 is a flowchart of FAQ creation processing with a specified number of groups.

Fig. 10 shows an example of the "group number specification" tab panel.

Fig. 11 is a flowchart of grouping processing by group number specification.

Fig. 12 is a flowchart of grouping processing by group number specification using binary search.

Fig. 13 is a flowchart of automatic FAQ creation processing.

Fig. 14 is a flowchart of automatic grouping processing.

Fig. 15 shows an example of a similarity threshold-group number relation graph.

Fig. 16 is a block diagram of the file grouping system according to the second embodiment of the present invention.

Fig. 17 is a flowchart showing the overall sequence of the grouping work in the system of the second embodiment.

Fig. 18 is a flowchart of the initial processing in the grouping work of the system of the second embodiment.

Fig. 19 is a flowchart of the classification processing for ungrouped files in the system of the second embodiment.

Embodiments of the invention

[First embodiment]

Referring to Fig. 1, the FAQ creation support system 2 according to the first embodiment of the present invention comprises a server computer 40 and a GUI (Graphical User Interface) 12 displayed on the screen of a display (not shown) or the like connected to the server computer 40. The GUI 12 is a program through which, using the display, keyboard, pointing device and their device drivers provided on the computer, the user can give any instruction or data input to the computer and the computer can present information to the user.

The server computer 40 comprises: a fault record storage unit 28 that stores fault records; a condition search unit 16, connected to the fault record storage unit 28, that retrieves the fault records satisfying prescribed conditions specified by the operator; a keyword extraction unit 18, connected to the fault record storage unit 28, that extracts keywords from the fault records in the fault record storage unit 28; a similarity information search unit (similarity calculating unit) 20, connected to the fault record storage unit 28, that calculates the similarity between fault records for all combinations; a similarity table 30, connected to the similarity information search unit 20, that stores the calculated similarities in table form; and a grouping unit 22, connected to the similarity table 30, that groups the fault records according to the similarities between them. When the similarity table 30 holds the latest similarities, the grouping unit 22 does not recalculate them and performs the grouping using the similarities stored in the similarity table 30.

The server computer 40 further comprises: a similarity threshold-group number relation data storage unit 32, connected to the grouping unit 22, that stores "similarity threshold-group number relation data" representing the relation between the similarity threshold used in grouping and the resulting number of groups; a representative fault record calculating unit 24, connected to the grouping unit 22 and the similarity threshold-group number relation data storage unit 32, that calculates, for each of the groups formed, the fault record representing that group; an FAQ storage unit 34, connected to the condition search unit 16, that stores the fault records found by conditional search; an FAQ creation unit 26, connected to the FAQ storage unit 34, that creates an FAQ from the fault records stored in the FAQ storage unit 34 and stores it in the FAQ storage unit 34; and a processing control unit 14, connected to the GUI 12, the condition search unit 16, the keyword extraction unit 18, the similarity information search unit 20, the grouping unit 22, the representative fault record calculating unit 24 and the FAQ creation unit 26, that controls each part of the server computer 40 and serves as the interface with the GUI 12.

The "similarity threshold" is the threshold used to judge whether a given fault record is classified into a given group. Specifically, the average of the similarities between the fault record and all fault records in the group is obtained. If this average is equal to or greater than the similarity threshold, the fault record is classified into that group; if it is below the threshold, it is not. In the present embodiment, when the average similarity between a fault record and the fault records of every group is below the similarity threshold, a new group containing that fault record is created.
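The membership rule can be illustrated with a short sketch. It is not the device's implementation: the record names, the lookup table `sim` (standing in for the similarity table 30), the 0 to 100 scale and the helper names are all assumptions made for the example.

```python
from statistics import mean


def average_similarity(record, group, sim):
    """Mean similarity between one record and every record already in a group."""
    return mean(sim[frozenset((record, member))] for member in group)


def fits_group(record, group, sim, threshold):
    """True when the average similarity reaches the similarity threshold."""
    return average_similarity(record, group, sim) >= threshold


# Hypothetical pairwise similarities on a 0 to 100 scale.
sim = {
    frozenset(("r1", "r2")): 80,
    frozenset(("r1", "r3")): 20,
    frozenset(("r2", "r3")): 30,
}
group = ["r1", "r2"]
print(average_similarity("r3", group, sim))  # average of 20 and 30
print(fits_group("r3", group, sim, 50))      # below the threshold, so r3 would start a new group
```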

In the present embodiment, an FAQ can be created in three modes: (1) the multiple keyword specification mode, (2) the parameter specification mode, and (3) the automatic grouping mode. The operator selects among the three modes through the GUI 12. The processing of each mode is described in detail below.

[(1) Multiple keyword specification mode]

Fig. 2 shows the FAQ creation support main screen of the GUI 12 used by the operator.

The creation of an FAQ in the multiple keyword specification mode is described below, mainly with reference to Fig. 2 and Fig. 3.

First, the fault records to be processed for FAQ creation are retrieved from the fault records stored in the fault record storage unit 28 (S2). This processing is called conditional search or keyword search. To perform a conditional search, the object name of the fault records, the period in which they were created, and so on are input. Specifically, pressing button 60 "Narrow down by conditional search..." in Fig. 2 displays the screen of Fig. 4. On that screen, the fault records are narrowed down by entering the object name, the creation period, and so on. A list of the resulting fault records is displayed in display field 68 "Fault records to be grouped".

To narrow down the conditionally searched fault records further, the operator presses button 64 "Keyword extraction". The keyword extraction screen shown in Fig. 5 then appears, listing the keywords extracted from the fault records in dictionary order. The operator selects the desired keywords from the list and presses the "OK" button. When button 66 "Narrow down" is pressed after the keywords have been selected, the fault records containing text that matches the keywords are picked out of those shown in display field 68 and displayed in display field 68 (S4). Of course, the operator may instead enter keywords directly in input field 62 "Keyword specification" without pressing button 64, and then press button 66 to narrow down.

The operator selects radio button 72 "Do not group" and presses button 52 "FAQ candidates". The fault records shown in display field 68 are then displayed as a single group in FAQ candidate field 80. When that group is selected in the FAQ candidate field, the fault records it contains are displayed in the in-group fault record display field 82. From these, the operator specifies the fault record that represents the group (hereinafter the "representative fault record") (S6).

The keyword extraction in S4 can use existing techniques such as morphological analysis. Morphological analysis decomposes text into morphemes and determines their parts of speech, based on lexical knowledge such as a dictionary and inflection rules, and on knowledge about patterns of word categories. Morphological analysis identifies the word sequence of the input text, that is, it narrows down the part-of-speech candidates for each word.

Through the above processing, the operator specifies multiple keywords, the fault records matching those keywords are extracted, and a representative fault record is chosen as the FAQ.

[(2) Parameter specification mode]

In the parameter specification mode, the operator specifies various parameters, the fault records are grouped, and the representative fault record of each group is found and used as an FAQ.

With reference to Fig. 6, a method is described in which the operator specifies a similarity threshold, the fault records are grouped, and FAQs are obtained.

First, conditional search processing is executed (S2). This is the same as described with reference to Fig. 3, so the description is not repeated here.

Next, the operator inputs the similarity threshold used for grouping (S12). Pressing tab 78 in Fig. 2 displays the tab panel shown in Fig. 7. The operator enters a suitable similarity threshold, determined from experience, in the similarity threshold field. Similarity takes values from 0 to 100, and the threshold that can be specified here ranges from 1 to 99.

After that, when the operator selects radio button 70 "Group" and presses button 52, the fault records are grouped according to the input similarity threshold (S14). The fault records processed here are those extracted by the conditional search in S2. The processing of S14 is described in detail later.

The fault records classified into groups are displayed in FAQ candidate field 80. When one of the displayed groups is selected, the fault records contained in that group are displayed in the in-group fault record display field 82. The operator selects the representative fault record of the group from among them, and that fault record is registered as an FAQ (S16).

The processing of S14 is described in detail below with reference to Fig. 8.

1 is assigned to the variable n, which holds the number of the current fault record (S42), and 1 is assigned to the variable K, which holds the number of groups (S44). Group K, to which the n-th fault record belongs, is created (S46). That is, group 1, to which the first fault record belongs, is created here.

The variable n is incremented by 1 (S48). The variable i, which indicates the group to be compared with the n-th fault record, is set to 1 (S50). That is, group 1 is set as the current comparison target for the n-th fault record.

0 is assigned to the variable max_similarity, which holds the largest of the average similarities between the n-th fault record and the fault records belonging to each group i (S52), and i is assigned to the variable max_group, which holds the corresponding group (S54).

The average of the similarities between the n-th fault record and the fault records belonging to group i is obtained (S56). This average is compared with the value of max_similarity (S58). If the average is greater than max_similarity ("Yes" in S58), the average is assigned to max_similarity (S60) and the value of i is assigned to max_group (S62).

After S62, or when the average similarity is not greater than max_similarity ("No" in S58), the variable i is incremented by 1 (S64). The value of i is then compared with the value of K (S66); that is, it is judged whether the series of steps S56 to S64 has been carried out for the n-th fault record against every group. If unprocessed groups remain (i ≤ K; "No" in S66), processing returns to S56.

When all groups have been processed (i > K; "Yes" in S66), the value of max_similarity is compared with the preset threshold simThreshold (S68).

When max_similarity is equal to or greater than simThreshold ("Yes" in S68), the n-th fault record is classified into group max_group (S70).

When max_similarity is below simThreshold ("No" in S68), the variable K, which holds the total number of groups, is incremented by 1 (S72), group K is created for the n-th fault record, and the fault record is classified into group K (S74). After S70 or S74, the variable n indicating the current fault record is incremented by 1 (S76).

The number n of the current fault record is compared with the total number N of fault records (S78). When n is not greater than N (n ≤ N; "No" in S78), unprocessed, ungrouped fault records remain, so processing returns to S50. When n is greater than N (n > N; "Yes" in S78), every fault record has been placed in some group, and the processing ends.

As described above, the operator specifies a similarity threshold, the fault records are grouped, and FAQs can be obtained.
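The flow of Fig. 8 (S42 to S78) can be sketched in a few lines. This is a minimal illustration rather than the device's code: the function name, the 0-based record indices, and the hypothetical similarity matrix `sim` (where `sim[a][b]` is the similarity between records a and b on a 0 to 100 scale) are assumptions.

```python
def group_by_threshold(sim, threshold):
    """Group record indices; sim[a][b] is the similarity between records a and b."""
    n_records = len(sim)
    if n_records == 0:
        return []
    groups = [[0]]                                   # S42-S46: first record starts group 1
    for n in range(1, n_records):                    # S48, S76, S78: remaining records
        max_similarity, max_group = 0.0, 0           # S52, S54
        for i, members in enumerate(groups):         # S50, S64, S66: every existing group
            average = sum(sim[n][m] for m in members) / len(members)  # S56
            if average > max_similarity:             # S58
                max_similarity, max_group = average, i  # S60, S62
        if max_similarity >= threshold:              # S68
            groups[max_group].append(n)              # S70: join the best group
        else:
            groups.append([n])                       # S72, S74: start a new group
    return groups


# Hypothetical similarities: records 0 and 1 are alike, as are records 2 and 3.
sim = [
    [100, 90, 10, 20],
    [90, 100, 15, 25],
    [10, 15, 100, 85],
    [20, 25, 85, 100],
]
print(group_by_threshold(sim, 50))  # [[0, 1], [2, 3]]
```

Each record is compared only with the groups that exist when it is processed, so the result depends on the order of the records. A higher threshold tends to yield more groups, which the group-number modes that follow rely on.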

Next, with reference to Fig. 9, a method is described in which the operator specifies the number of groups instead of a similarity threshold, the fault records are grouped, and FAQs are obtained.

First, conditional search processing is executed (S2). This is the same as described with reference to Fig. 3, so the description is not repeated.

Next, the operator specifies the final number of groups into which the conditionally searched fault records are to be grouped (S22). Pressing tab 76 in Fig. 2 displays the tab panel shown in Fig. 10. The operator enters the desired number of groups in the group number field. The number of groups that can be specified ranges from 2 up to the number of fault records found by the conditional search.

After that, when radio button 70 "Group" is selected and button 52 is pressed, the fault records are grouped into the specified number of groups (S24). The fault records processed here are those extracted by the conditional search in S2. The processing of S24 is described in detail later.

The fault records classified into groups are displayed in FAQ candidate field 80, and the processing of S16 is performed, whereby representative fault records are registered as FAQs. S16 is the same as described with reference to Fig. 6, so its detailed description is not repeated.

The processing of S24 is described in detail below with reference to Fig. 11.

First, the threshold simThreshold used in grouping is set to the value obtained by dividing the specified number of groups by the total number of fault records and multiplying the result by a constant K1 (S82). The constant K1 is, for example, 2.0.

Grouping is performed with the threshold simThreshold by the S14 processing described with reference to Fig. 8 (S14). The number of groups resulting from S14 is then compared with the specified number of groups (S86). When the two are equal ("Yes" in S86), the processing ends.

When the number of groups after processing is greater than the specified number ("Yes" in S88), a constant K2 is subtracted from simThreshold (S90). It is then judged whether, at the previous S88 judgment, the number of groups after processing was also greater than the specified number (S92). If it was, or if this is the first time S88 has been executed ("Yes" in S92), processing returns to S14 and grouping is performed again with the new simThreshold.

When the number of groups after processing is not greater than the specified number ("No" in S88), the constant K2 is added to simThreshold (S94). It is then judged whether, at the previous S88 judgment, the number of groups after processing was also not greater than the specified number (S96). If it was, or if this is the first time S88 has been executed ("Yes" in S96), processing returns to S14 and grouping is performed again with the new simThreshold.

When the result of comparing the number of groups with the specified number differs between the previous and the current processing ("No" in S92 or "No" in S96), the number of groups is converging on the specified number. In that case K2 is divided by 2.0 and the result becomes the new K2 (S98), and K2 is compared with a prescribed constant K3 (for example, 0.01) (S100). If K2 is equal to or greater than K3 ("No" in S100), processing returns to S14 and grouping is performed again with the new simThreshold.

Once K2 falls below K3 ("Yes" in S100), the processing is terminated, and the grouping with the largest number of groups not exceeding the specified number is output as the result.

As described above, the operator specifies the number of groups, the fault records are grouped, and FAQs can be obtained.
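The step-halving search of Fig. 11 might look like the sketch below. Several things here are assumptions rather than details from the text: the initial step `k2=0.1` (no starting value is given), similarities on a 0.0 to 1.0 scale, the function names, and the compact grouping helper, which is included only so the example runs on its own.

```python
def group_by_threshold(sim, threshold):
    """Compact form of the threshold grouping (S14), included so the sketch is self-contained."""
    groups = [[0]]
    for n in range(1, len(sim)):
        averages = [sum(sim[n][m] for m in g) / len(g) for g in groups]
        best = max(range(len(groups)), key=averages.__getitem__)
        if averages[best] >= threshold:
            groups[best].append(n)
        else:
            groups.append([n])
    return groups


def group_to_count(sim, target, k1=2.0, k2=0.1, k3=0.01):
    """Adjust the threshold in steps of k2 until the grouping has `target` groups."""
    n_records = len(sim)
    if not 2 <= target <= n_records:
        raise ValueError("target must be between 2 and the number of records")
    threshold = k1 * target / n_records              # S82
    previous_more = None                             # direction of the previous round
    best = None                                      # largest grouping below the target
    while True:
        groups = group_by_threshold(sim, threshold)  # S14
        result = (groups, threshold)
        if len(groups) == target:                    # S86
            return result
        if len(groups) < target and (best is None or len(groups) > len(best[0])):
            best = result
        more = len(groups) > target                  # S88
        threshold += -k2 if more else k2             # S90 / S94
        if previous_more is not None and previous_more != more:  # "No" in S92 / S96
            k2 /= 2.0                                # S98
            if k2 < k3:                              # S100
                return best or result
        previous_more = more


# Hypothetical similarities on a 0.0 to 1.0 scale.
sim = [
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.15, 0.25],
    [0.10, 0.15, 1.00, 0.85],
    [0.20, 0.25, 0.85, 1.00],
]
groups, threshold = group_to_count(sim, target=2)
print(groups)  # [[0, 1], [2, 3]]
```

With this data the threshold starts at 1.0 (four groups), drops to about 0.9 (three groups) and then to about 0.8, where the two requested groups are obtained.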

Binary search can also be used as a method of grouping the fault records into a specified number of groups to obtain FAQs. Grouping using binary search is described below.

Referring to Figure 12, 0.0 is assigned to the threshold leftSimThres at the left end of the search interval, and 1.0 is assigned to the threshold rightSimThres at the right end (S142).

The similarity threshold curSimThres is obtained as curSimThres=(leftSimThres+rightSimThres)/2.0 (S144). The same grouping processing as described above is carried out with the similarity threshold curSimThres (S14). When the resulting number of groups equals the number of groups specified by the user (S146 "Yes"), the processing ends.

When the resulting number of groups differs from the specified number (S146 "No"), it is checked whether the width of the search interval (rightSimThres-leftSimThres) is below the prescribed threshold simThresDiff (S148). When it is below simThresDiff (S148 "Yes"), the processing ends.

When the width is equal to or greater than simThresDiff, it is checked whether the number of groups is greater than the specified number (S150). When it is greater (S150 "Yes"), the search range is changed by assigning curSimThres to the threshold rightSimThres, which represents the right end of the search range (S152).

When the number of groups is less than the specified number (S150 "No"), curSimThres is assigned to the threshold leftSimThres, which represents the left end of the search range (S154). After the processing of S152 or S154, control returns to S144.

By adopting this binary search, the grouping processing can be carried out at high speed.
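Steps S142 to S154 can be sketched as follows. This is a minimal sketch under stated assumptions: `count_groups` is a hypothetical stand-in for the grouping processing of S14, and the default value of `sim_thres_diff` (simThresDiff) is an assumption, since the text gives no value.

```python
from typing import Callable, Tuple


def search_threshold_binary(
    count_groups: Callable[[float], int],  # hypothetical stand-in for S14
    target: int,                           # number of groups specified by the user
    sim_thres_diff: float = 0.001,         # assumed value of simThresDiff
) -> Tuple[float, int]:
    """Return the last threshold tried and the number of groups it produced."""
    left, right = 0.0, 1.0                 # S142
    while True:
        cur = (left + right) / 2.0         # S144
        n = count_groups(cur)              # S14
        if n == target:                    # S146 "Yes"
            return cur, n
        if right - left < sim_thres_diff:  # S148 "Yes": interval exhausted
            return cur, n
        if n > target:                     # S150 "Yes"
            right = cur                    # S152
        else:
            left = cur                     # S154


def demo_count(threshold: float) -> int:
    """Toy curve: the group count rises by one at each breakpoint."""
    breaks = [0.13, 0.27, 0.41, 0.58, 0.77]
    return 1 + sum(b <= threshold for b in breaks)


print(search_threshold_binary(demo_count, target=3))
```

Because the interval is halved on every pass, the loop ends even when no threshold produces exactly the specified number of groups; the sketch then returns whatever grouping the last threshold gave.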

[(3) Automatic grouping mode]

In the automatic grouping mode, the operator specifies no parameters. A suitable similarity threshold is determined automatically, the fault dockets are grouped, and the representative fault docket of each group is obtained as an FAQ.

Referring to Figure 13, the condition retrieval processing is executed (S2). The condition retrieval processing is the same as that described with reference to Fig. 3, so its description is not repeated here.

Next, the operation for automatic grouping is performed (S32). First, tab 74 is selected, and then the radio button 70 "grouping" is selected. When button 52 is then pressed, the grouping processing of the fault dockets is executed automatically (S32). Here, the fault dockets to be grouped are those extracted by the condition retrieval processing of S2. The processing of S32 is described in detail later.

The fault dockets classified into groups are displayed in the FAQ candidate column 80, and the processing of S16 is carried out, in which the representative fault dockets are registered as FAQs. The processing of S16 is the same as that described with reference to Fig. 6, so its description is not repeated here.

The processing of S32 is described in detail below with reference to Figure 14.

The similarity chart 30 stores the similarities calculated in advance for all combinations of fault dockets. Referring to the similarity chart 30, the grouping portion 22 makes the similarity threshold-group number association diagram, which represents, as a curve such as that shown in Figure 15, the relation between the similarity threshold and the corresponding number of groups (S112). This curve shows how the number of groups changes as the similarity threshold is changed. It can be made by repeating the processing shown in Figure 8 while changing the similarity threshold simThreshold.
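Making the association diagram of S112 amounts to sampling the number of groups at a series of thresholds. A minimal sketch follows; `count_groups` is a hypothetical stand-in for the processing of Figure 8, and the evenly spaced sampling with `steps` intervals is an assumption, since the text does not say how the threshold is varied.

```python
from typing import Callable, List, Tuple


def build_threshold_curve(
    count_groups: Callable[[float], int],  # hypothetical stand-in for Figure 8
    steps: int = 20,                       # assumed number of sampling intervals
) -> List[Tuple[float, int]]:
    """Sample (similarity threshold, number of groups) over [0.0, 1.0]."""
    curve = []
    for i in range(steps + 1):
        threshold = i / steps
        curve.append((threshold, count_groups(threshold)))
    return curve


# Toy grouping function: one group below 0.5, two groups from 0.5 upward.
for threshold, groups in build_threshold_curve(lambda t: 1 + int(t >= 0.5), steps=4):
    print(threshold, groups)
```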

The following processing automatically searches for the part where the number of groups changes least with respect to a change in the similarity threshold, that is, the flattest part of the curve shown in Figure 15. By finding this position, the fault dockets can be grouped appropriately. The reason is explained briefly below.

As a typical example, suppose that the fault dockets are classified into M groups and that each group contains N fault dockets. In this case, the similarity simIn between fault dockets within the same group is large (for example, simIn=0.8). On the other hand, fault dockets belonging to different groups have essentially different contents, so the similarity simEx between them should be considerably smaller than simIn (for example, simEx=0.2). The similarities therefore concentrate at two positions, a large value and a small value, and form an uneven distribution.

If a value satisfying simEx＜simThreshold＜simIn is adopted as the similarity threshold simThreshold, the fault dockets can be expected to be grouped appropriately, and the resulting number of groups is M. Because there is a large difference between simIn and simEx, changing simThreshold slightly in either direction hardly changes the resulting number of groups. Therefore, if grouping is carried out with a similarity threshold near the place where the gradient of the curve shown in Figure 15 is gentlest, it is likely that the fault dockets are grouped appropriately into M groups. That is, from the distribution of the similarities it is possible to determine a similarity threshold that sorts mutually similar fault dockets and mutually dissimilar fault dockets into separate groups.
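The typical example above can be checked numerically. The sketch below builds the similarity matrix for M groups of N fault dockets with simIn=0.8 and simEx=0.2 and counts the groups at several thresholds. The names `block_similarity` and `count_groups` are hypothetical, and `count_groups` is only an assumed leader-style stand-in (a fault docket joins a group when its similarity to that group's first member reaches the threshold, and otherwise starts a new group); it is not the processing of Figure 8 itself.

```python
from typing import List


def block_similarity(m: int, n: int, sim_in: float = 0.8,
                     sim_ex: float = 0.2) -> List[List[float]]:
    """Similarity matrix for m groups of n dockets each (the typical example)."""
    size = m * n
    sim = []
    for a in range(size):
        row = []
        for b in range(size):
            if a == b:
                row.append(1.0)
            elif a // n == b // n:   # same group
                row.append(sim_in)
            else:                    # different groups
                row.append(sim_ex)
        sim.append(row)
    return sim


def count_groups(sim: List[List[float]], threshold: float) -> int:
    """Assumed leader-style grouping: compare each docket with group seeds."""
    seeds: List[int] = []            # index of the first docket of each group
    for doc in range(len(sim)):
        best = max((sim[doc][s] for s in seeds), default=-1.0)
        if best < threshold:         # not similar enough to any existing group
            seeds.append(doc)
    return len(seeds)


sim = block_similarity(m=4, n=5)
for threshold in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(threshold, count_groups(sim, threshold))
```

With M=4 and N=5, every threshold strictly between simEx and simIn gives 4 groups, while a threshold below simEx merges everything into 1 group and a threshold above simIn leaves all 20 fault dockets in groups of their own. The plateau between 0.2 and 0.8 is the flat part of the curve that the automatic mode looks for.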

The above is a typical example, but whenever the differences in similarity between files are large, roughly the same relation holds between the similarity threshold and the number of groups obtained with it. Therefore, the flat part of the curve shown in Figure 15 can be found, and the similarity at that position can be adopted as the similarity threshold.

Accordingly, the system of the present embodiment finds the flat part of the curve shown in Figure 15 as follows. For a range of a certain width in the number of groups (hereinafter called the "class range"), the upper and lower limits of the corresponding similarity thresholds are obtained, and this is repeated while the class range is moved. After the class range has been moved over the whole vertical axis of Figure 15, the class range for which the difference between the upper and lower limits of the corresponding similarity thresholds is largest is judged to be where the slope of the curve is gentlest, and the similarity threshold corresponding to the number of groups at the center of that class range is adopted. The largest number of groups in the class range is called the "group maximum value", and the smallest number of groups is called the "group minimum value".

The value representing the size of the class range used in the following processing is defined as the maximum number of groups divided by a constant K5 (for example, 10) (S114). Here, the "maximum number of groups" is different from the "group maximum value": it is the largest number of groups permitted as a result of the grouping processing. The maximum number of groups is normally 2 or more and no more than the number of files to be grouped. This value is specified by the user when the automatic grouping processing is started. Before this input, a number calculated by a prescribed formula may first be displayed as the default maximum number of groups. For example, the logarithm of the number of target files may be taken, and the smallest integer exceeding that value may be used as the default maximum number of groups. Alternatively, the number of target files divided by a constant may simply be used as the default maximum number of groups.

The value of the "group increment" used in the following processing is defined as the above maximum number of groups divided by a constant K6 (for example, 20) (S116). In the present embodiment, the class range is moved along the vertical axis of Figure 15 from bottom to top, and the group increment is the step by which the class range is moved.

Next, 1 is assigned as the group minimum value (S118). The following processing is repeated while the group minimum value is moved by the group increment each time.

In this repeated processing, first, the value representing the size of the class range is added to the group minimum value (S120). This gives the group maximum value of the class range currently under examination. Then the minimum and maximum of the similarity threshold corresponding to the region enclosed by the current group minimum value and group maximum value are obtained.

Specifically, the group maximum value is first compared with the maximum number of groups (S122). This judgment determines whether the class range has been moved over the whole range. When the group maximum value is greater than the maximum number of groups (S122 "No"), the class range has reached and passed the top of the vertical axis of Figure 15, so the repeated processing ends and control proceeds to S128. When the group maximum value is equal to or less than the maximum number of groups (S122 "Yes"), the range of the similarity threshold contained in the region between the group minimum value and the group maximum value is obtained (S124). The group minimum value is then increased by the group increment (S126), and control returns to S120. In this way, the similarity threshold range corresponding to each position of the class range is obtained while the class range is moved.

When the group maximum value is greater than the maximum number of groups (S122 "No"), control proceeds to S128. In S128, among the similarity threshold ranges obtained in S124, the group minimum value and group maximum value for which the range is widest (that is, where the curve shown in Figure 15 is flattest) are obtained (S128), and the average of this group minimum value and group maximum value is taken as the number of groups sought (S130). In other words, the part where the number of groups in Figure 15 changes slowly is the part where the similarity threshold range corresponding to a class range of given size is widest, so the number of groups at that part is determined to be the appropriate number of groups.

From the similarity threshold-group number association diagram, the similarity threshold corresponding to the number of groups obtained in S130 is found and taken as the threshold simThreshold (S132). Then the processing of S14 described with reference to Fig. 8 is carried out with the threshold simThreshold (S14), and the processing ends.
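The search of S114 to S132 can be sketched as follows. This is an illustrative sketch under stated assumptions: the curve is given as sampled (threshold, number of groups) pairs, the width of a threshold range is taken as the difference between the largest and smallest sampled thresholds whose group count lies in the class range, ties between equally wide ranges go to the first one found, and `threshold_for` picks the middle threshold among the samples whose group count is nearest the chosen number. None of these details is fixed by the text, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[float, int]]  # (similarity threshold, number of groups)


def pick_group_count(curve: Curve, max_groups: int,
                     k5: float = 10, k6: float = 20) -> Optional[float]:
    """Return the center of the class range whose threshold range is widest."""
    range_size = max_groups / k5                   # S114
    increment = max_groups / k6                    # S116
    group_min = 1.0                                # S118
    best_width, best_center = -1.0, None
    while True:
        group_max = group_min + range_size         # S120
        if group_max > max_groups:                 # S122 "No"
            break
        inside = [t for t, n in curve if group_min <= n <= group_max]  # S124
        width = max(inside) - min(inside) if inside else 0.0
        if width > best_width:                     # S128: widest range so far
            best_width = width
            best_center = (group_min + group_max) / 2.0  # S130
        group_min += increment                     # S126
    return best_center


def threshold_for(curve: Curve, group_count: float) -> float:
    """S132: a threshold whose group count is nearest the chosen number."""
    gap = min(abs(n - group_count) for _, n in curve)
    candidates = sorted(t for t, n in curve if abs(n - group_count) == gap)
    return candidates[len(candidates) // 2]


# Made-up sample curve with a plateau at 5 groups between 0.3 and 0.5.
curve = [(0.0, 1), (0.1, 1), (0.2, 2), (0.3, 5), (0.4, 5), (0.5, 5),
         (0.6, 6), (0.7, 9), (0.8, 13), (0.9, 17), (1.0, 20)]
center = pick_group_count(curve, max_groups=20)
print(center, threshold_for(curve, center))
```

For very small values of the maximum number of groups the loop may never examine a class range, in which case the sketch returns `None`.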

As described above, even when the operator specifies no parameters, the fault dockets can be grouped and the FAQs obtained. In this case, the similarity threshold is determined automatically, from the distribution of the similarities among all the fault dockets, as the value that best separates the groups.

In the automatic grouping processing of the 1st embodiment described above, the similarity threshold is determined by repeatedly moving the class range along the vertical axis of the similarity threshold-group number relation curve. However, the processing that determines the similarity threshold is not limited to this method. For example, the similarity threshold-group number relation curve may be approximated by a polynomial (for example, a fourth-degree polynomial) and differentiated to determine the similarity threshold at which the slope of the curve is gentlest.

[the 2nd embodiment]

In the 1st embodiment described above, the similarity between each fault docket to be grouped and every other fault docket is obtained. The amount of calculation for grouping therefore grows with the square of the number of fault dockets. The size of the storage area needed for the calculation also grows with the square of the number of fault dockets. Consequently, as the number of fault dockets to be processed increases, the hardware burden grows quadratically and so does the computing time. To finish the processing in a realistic time with the hardware at hand, the number of fault dockets to be processed must be reduced to some extent before the grouping processing.
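The quadratic growth can be made concrete by counting the pairs of fault dockets whose similarity must be calculated: for n fault dockets there are n(n-1)/2 pairs. A short sketch follows; the function name `pair_count` and the sample sizes are chosen only for illustration.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (100, 1000, 10000):
    print(n, pair_count(n))
```

Multiplying the number of fault dockets by 10 multiplies the number of pairs by roughly 100, which is the growth that the 2nd embodiment is intended to avoid.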

For this reason, the fault dockets to be processed must be narrowed down in advance by the condition retrieval or keyword retrieval described in the 1st embodiment. However, because the fault dockets to be processed have been narrowed down in advance, appropriate grouping may not be possible, and there is a danger that the FAQ content finally obtained is incomplete.

This problem is not limited to fault dockets; it also arises in the general grouping of files produced daily in an enterprise. In addition, the grouping processing is not finished after being carried out once: when files are appended after grouping, the grouping processing must be carried out again. Therefore, if the method of the 1st embodiment is followed as it is, each grouping takes a long time, and this time grows with the square of the number of files, so the method is impractical when there are many files.

In the 2nd embodiment, even when there are many files to be processed, the grouping processing can be carried out in a realistic processing time without requiring excessive hardware. Furthermore, once the grouping processing of the files has been carried out, files to be grouped that are appended afterwards can also be handled appropriately without requiring excessive processing time.

The file grouping system of the 2nd embodiment of the present invention is described below. In the drawings used in the following description, parts having the same functions as in the system of the 1st embodiment are given the same reference numerals and names, and their detailed description is not repeated. Also, since the following description does not assume that the objects to be grouped are limited to fault dockets, they are generally called "files".

Referring to Figure 16, the file grouping system 100 of the 2nd embodiment comprises: a file grouping system server 102 implemented by a computer or a group of computers; and a GUI 12 displayed on a display or the like (not shown) connected to the server 102.

The file grouping system server 102 comprises: a file group storage part 118 that stores files; an attribute retrieval portion 110 that retrieves and extracts, from the file group stored in the file group storage part 118, the files having the attributes specified by the operator; a keyword extraction portion 18, connected to the file group storage part 118, that extracts keywords from the file group stored in the file group storage part 118; a similarity information retrieval portion (similarity calculating part) 20, connected to the file group storage part 118, that calculates the similarity for every combination (file pair) of the files in the file group extracted by the attribute retrieval portion 110; a similarity chart 30, connected to the similarity information retrieval portion 20, that stores the calculated similarities in the form of a table; and a grouping portion 22, connected to the similarity chart 30, that groups the files according to the similarities between files.

The file grouping system server 102 also comprises: a similarity threshold-group number relation data storage part 32; a representative file calculating part 112, connected to the grouping portion 22 and the similarity threshold-group number relation data storage part 32, that calculates, for each group formed, the file that represents the group; a grouping information storage part 120, connected to the attribute retrieval portion 110, that stores the files extracted by the attribute retrieval portion 110 and the grouping information described later; a grouping information making portion 114 that makes the grouping information from the files stored in the grouping information storage part 120 and stores it in the grouping information storage part 120; a document classification portion 116, connected to the file group storage part 118 and the grouping information storage part 120, that classifies files other than those that were the object of the initial grouping processing into the groups obtained by the initial grouping processing, and makes new groups where necessary; and a processing control portion 104, connected to the GUI 12, the attribute retrieval portion 110, the keyword extraction portion 18, the similarity information retrieval portion 20, the grouping portion 22, the representative file calculating part 112, the grouping information making portion 114 and the document classification portion 116, that controls each part of the file grouping system server 102 and serves as the interface with the GUI 12.

The file group storage part 118 corresponds to the fault docket storage part 28 of the 1st embodiment. The attribute retrieval portion 110 corresponds to the conditional information retrieval portion 16 of the 1st embodiment. The representative file calculating part 112 corresponds to the representative fault docket calculating part 24 of the 1st embodiment. The grouping information making portion 114 corresponds to the FAQ making portion 26 of the 1st embodiment. The grouping information storage part 120 corresponds to the FAQ storage part 34 of the 1st embodiment.

Grouping information is made portion 114, and the file group from each group that is included in the packet transaction result is extracted keyword sets, gives the tag file of importance degree as each group at each keyword.The various yardsticks such as importance degree that keyword importance degree, the score in the time of can adopting keyword extraction according to purposes, each keyword are included in the frequency of file in the group, each keyword is given in advance.

In the present embodiment, the tag file of each group is set as tag file = "representative file (the file calculated by the representative file calculating part 112) + keyword set". In the device of the present embodiment, the user can use the GUI 12 to edit, by appending, deleting and changing, the keyword set of each group that has been made automatically and stored in the grouping information storage part 120. Such editing can easily be realized with, for example, a general editing program. A dedicated application for this purpose may also be prepared, and a person skilled in the art can easily write such an application.

What the device of the 2nd embodiment includes and the 1st embodiment does not is the document classification portion 116. Once the grouping processing has been carried out, the document classification portion 116 has the function of classifying, by the method described later, the files that were not the object of the grouping processing into the existing groups. These are the files that were included in the initial file group but were not retrieved by the attribute retrieval portion 110, and the files that were not included in the initial file group and were appended to the file group later. The document classification portion 116 also has the following function: files that cannot be classified into any existing group are classified into the "unfiled" group and, when a prescribed condition is satisfied, new groups are made from the files classified as "unfiled" and stored in the grouping information storage part 120.

The processing control portion 104 has the same functions as the processing controls portion 14 of Fig. 1, with the added functions of controlling the initial processing described later and the file classification processing carried out by the document classification portion 116.

Referring to Figure 17, the structure of the program that controls the operation of the file grouping system server 102 of the 2nd embodiment is outlined as follows. As a premise, the file group storage part 118 is assumed to store a considerable number of files to be grouped. First, in step 140, as the initial processing, the files stored in the file group storage part 118 are grouped, and the grouping information is made and stored in the grouping information storage part 120. The processing carried out in this initial processing 140 is described later with reference to Figure 18; its content is essentially the same as the FAQ making processing carried out in the system of the 1st embodiment.

In this way, the grouping information is made once for the initial file group. Next, consider the case where files are appended to the file group storage part 118. In the course of business activities, it is common for such files to be appended every day. Naturally, the appended files have not been grouped. Such a file that has not been grouped is called an "ungrouped file".

In step 142, the document classification portion 116 classifies the appended ungrouped files. This processing is described in detail later. Roughly speaking, the document classification portion 116 uses the similarity information retrieval portion (similarity calculating part) 20 to compare an ungrouped file with the tag file of each group and calculate the similarities. The ungrouped file is then classified into the group with the highest calculated similarity. Here, classification requires that the similarity be at or above a certain threshold; when the highest similarity is below this threshold, the ungrouped file is classified into the "unfiled" group. As the threshold, the threshold that was specified when the groups were formed in the initial processing may be used.

After step 142, it is judged whether the number of files classified into the unfiled group as a result of step 142 exceeds a prescribed number, for example 1000 (step 144). When it does not exceed 1000, control returns to step 142; when it exceeds 1000, control advances to step 146.

In step 146, the files classified into the unfiled group (1000 files in the present embodiment) are subjected to the same grouping processing as in the 1st embodiment and the initial processing step 140. As a result, new groups are formed, and these new groups are appended to the initial set of groups. Step 146 thus updates the set of groups formed initially by appending new groups that consist only of those later-appended files that belong to none of the groups made initially. Thereafter, the processing of steps 142 to 146 is repeated using the updated set of groups.
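The cycle of steps 142 to 146 can be sketched as follows. This is an illustrative sketch under stated assumptions: a file and a tag file are both reduced to a set of keywords, the Jaccard coefficient is used as an assumed similarity measure (the text does not define the similarity between a file and a tag file), and `form_groups` is a hypothetical callback standing in for the grouping processing of step 146 that returns the tag files of the new groups. The class name `IncrementalGrouper` is likewise hypothetical.

```python
from typing import Callable, List, Optional, Set

Keywords = Set[str]


def jaccard(a: Keywords, b: Keywords) -> float:
    """Assumed similarity measure: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalGrouper:
    def __init__(self, tag_files: List[Keywords],
                 form_groups: Callable[[List[Keywords]], List[Keywords]],
                 sim_threshold: float, limit: int = 1000) -> None:
        self.tag_files = list(tag_files)    # one tag file per existing group
        self.form_groups = form_groups      # stand-in for the step 146 grouping
        self.sim_threshold = sim_threshold
        self.limit = limit                  # prescribed number of step 144
        self.unfiled: List[Keywords] = []

    def add(self, file: Keywords) -> Optional[int]:
        """Step 142: return the index of the chosen group, or None if unfiled."""
        best_group, best_score = None, 0.0
        for i, tag in enumerate(self.tag_files):
            score = jaccard(file, tag)
            if score > best_score:
                best_group, best_score = i, score
        if best_group is not None and best_score >= self.sim_threshold:
            return best_group
        self.unfiled.append(file)
        if len(self.unfiled) > self.limit:                         # step 144
            self.tag_files.extend(self.form_groups(self.unfiled))  # step 146
            self.unfiled = []
        return None
```

For example, `IncrementalGrouper([{"printer", "jam"}], form_groups, 0.5, limit=1000)` would start from one initial group; files that match no group accumulate in `unfiled` until the limit is exceeded, at which point the groups returned by `form_groups` are appended and later files can be classified into them.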

The processing carried out in the initial processing step 140 of Figure 17 is described with reference to Figure 18. First, considering the number of files stored in the file group storage part 118, the user judges whether it is necessary to extract from the file group the files to be used for forming the initial groups (160). For example, when the file group contains too many files, the number of files must be reduced so that the grouping processing finishes within a certain time.

When extraction is necessary, a certain number of files are extracted in step 162 by using random numbers.
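The random extraction of step 162 can be sketched with the standard library as follows. The function name `extract_initial_files`, the `seed` parameter (added only to make a run repeatable) and the choice to return the remaining files as a second list are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def extract_initial_files(files: Sequence[T], sample_size: int,
                          seed: Optional[int] = None) -> Tuple[List[T], List[T]]:
    """Split files into a random sample for initial grouping and the remainder."""
    if sample_size >= len(files):
        return list(files), []              # no extraction needed
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(files)), sample_size))
    sample = [f for i, f in enumerate(files) if i in picked]
    rest = [f for i, f in enumerate(files) if i not in picked]
    return sample, rest


sample, rest = extract_initial_files(list(range(100)), sample_size=10, seed=1)
print(len(sample), len(rest))
```

The remainder corresponds to the ungrouped files that are classified later in step 170.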

Then, in step 164, the processing that forms the initial groups (grouping processing) is carried out (164), taking as its object either the file group extracted in step 162 or, when extraction was judged unnecessary in step 160, all the files. This processing is substantially the same as the automatic grouping processing of fault dockets described in the 1st embodiment, but differs in detail, as described later with reference to Figure 19. The similarity information retrieval portion (similarity calculating part) 20 and the grouping portion 22 shown in Figure 16 are used in this processing.

Then, in step 166, for each group made by the initial group formation processing, the representative file that forms one part of the tag file is determined. This processing is substantially the same as the processing carried out in S16 of Figure 13 described in the 1st embodiment.

In step 168, the keyword extraction portion 18 automatically extracts, for each group, the keywords that form the other part of the tag file, and the keywords, each given an importance degree, are assigned to the group. Although not shown in the figure, after this processing the user can append, delete or change the keywords and thereby adjust the characteristics of each group.

This completes the initial group formation processing. If ungrouped files remain, they are classified by the document classification portion 116 (170). If no ungrouped files remain, the initial processing ends. When extraction was judged unnecessary in step 160, all the files were subjected to the grouping processing, so no ungrouped files remain to be classified in step 170 and the processing of step 170 is not executed.

The control structure of the program that realizes the classification processing carried out in step 170 of Figure 18 is described in detail below. As shown in Figure 19, this processing is similar to the grouping processing by similarity threshold of the 1st embodiment shown in Figure 8. The classification processing of Figure 19 differs from the processing of Figure 8 in the following respect: when the maximum of the similarities obtained by comparing a file with the tag files of all the groups is below the prescribed threshold, the file is classified into the unfiled group instead of a new group containing the file being formed (S74).

Figure 19 below is described.

Referring to Figure 19, first, 0 is assigned to the variable n that represents the number of the file to be classified (190). Next, 0 is assigned to the variable max_similarity, which holds the maximum similarity found during the processing (192). The variable n is then incremented by 1 (194), and it is judged whether n is greater than the number of files to be processed (that is, the extracted files when extraction was performed in step 162 of Figure 18, or all the files to be processed when it was not) (196). When n is greater than the number of files to be processed, the processing ends. When n is equal to or less than the number of files, control proceeds to step 198.

In step 198, 0 is assigned to the variable i that represents the group number. The variable i is then incremented by 1 in step 200. It is judged whether the resulting value of i exceeds the number of groups (202). The case where i exceeds the number of groups is described later. When i is equal to or less than the number of groups, control proceeds to step 204.

In step 204, the similarity information retrieval portion (similarity calculating part) 20 calculates the similarity between the n-th file and the tag file of group i (the i-th group). If the similarity obtained is greater than the variable max_similarity, the similarity calculated in step 204 is assigned to max_similarity, and control returns to step 200. If the calculated similarity is equal to or less than max_similarity, nothing is done and control returns to step 200.

When it is judged in step 202 that the value of i is greater than the number of groups, control proceeds to step 220. In step 220, it is judged whether the value of max_similarity is equal to or greater than the similarity threshold specified in advance. If it is, the n-th file is classified in step 222 into the group for which the similarity equal to max_similarity was obtained, and control returns to step 192. If the value of max_similarity is below the similarity threshold specified in advance, the n-th file is judged to belong to none of the existing groups and is classified into the "unfiled" group (step 224), and control returns to step 192.

The above is the detailed control structure of the processing carried out in step 170 of Figure 18.
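The loop of steps 190 to 224 can be sketched as follows. This is a minimal sketch under stated assumptions: each file and each tag file is reduced to a set of keywords, and the Jaccard coefficient serves as an assumed similarity measure, since the text leaves the similarity calculation to the similarity calculating part 20. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Keywords = Set[str]


def jaccard(a: Keywords, b: Keywords) -> float:
    """Assumed similarity measure: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_files(files: List[Keywords], tag_files: List[Keywords],
                   sim_threshold: float
                   ) -> Tuple[Dict[int, List[Keywords]], List[Keywords]]:
    """Return (files per group index, files sent to the unfiled group)."""
    assigned: Dict[int, List[Keywords]] = {}
    unfiled: List[Keywords] = []
    for file in files:                            # steps 190, 194, 196
        max_similarity = 0.0                      # step 192
        best_group = None
        for i, tag in enumerate(tag_files):       # steps 198 to 202
            similarity = jaccard(file, tag)       # step 204
            if similarity > max_similarity:       # step 206
                max_similarity = similarity       # step 208
                best_group = i
        if best_group is not None and max_similarity >= sim_threshold:  # step 220
            assigned.setdefault(best_group, []).append(file)            # step 222
        else:
            unfiled.append(file)                  # step 224
    return assigned, unfiled


tags = [{"printer", "paper", "jam"}, {"network", "login", "password"}]
docs = [{"paper", "jam", "tray"}, {"login", "password"}, {"screen", "flicker"}]
print(classify_files(docs, tags, sim_threshold=0.3))
```

Because each file is compared only with one tag file per group, the number of similarity calculations grows with the number of files times the number of groups rather than with the square of the number of files.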

A person skilled in the art will readily understand that the processing carried out in step 142 of Figure 17 corresponds to the processing of steps 198 to 224 shown in Figure 19.

In the 2nd embodiment, when ungrouped files are classified, groups that contain few files may be excluded from the groups into which files are classified. This removes noise and can improve the precision of the grouping. As the criterion for excluding a group, it may be decided, as appropriate for the use, whether the absolute number of files in the group is at or below a prescribed number, or whether the number of files in the group is at or below a prescribed ratio of the total number of files. The number of files or the ratio used for the exclusion may also be made specifiable as a parameter.
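The two exclusion criteria can be sketched as follows. The function name `drop_small_groups` and the parameter names `min_files` and `min_ratio` are hypothetical, and the sketch assumes that the total number of files means the total over the groups passed in.

```python
from typing import Dict, Hashable, List, Optional


def drop_small_groups(groups: Dict[Hashable, List],
                      min_files: Optional[int] = None,
                      min_ratio: Optional[float] = None) -> Dict[Hashable, List]:
    """Keep only the groups that exceed the given absolute size and ratio."""
    total = sum(len(members) for members in groups.values())
    kept = {}
    for group_id, members in groups.items():
        if min_files is not None and len(members) <= min_files:
            continue                      # at or below the prescribed number
        if min_ratio is not None and total and len(members) / total <= min_ratio:
            continue                      # at or below the prescribed ratio
        kept[group_id] = members
    return kept


groups = {"g1": list(range(50)), "g2": list(range(45)), "g3": list(range(5))}
print(sorted(drop_small_groups(groups, min_files=5)))
```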

The file grouping system 100 of the 2nd embodiment described above operates as follows. The parts that operate in the same way as in the 1st embodiment are not described again; only the parts related to the processing shown in Figures 17 to 19 are described.

Referring to Figure 17, the initial processing 140 is carried out first.

In the initial processing, referring to Figure 18, the user first judges, from the number of files to be processed, whether the files that are the object of the initial processing must be reduced to a certain number (160). If the number of files is small to begin with, the files are not reduced; if the number of files is large, the extraction processing of step 162 is carried out to reduce the number of files.

Then the processing that makes the initial groups is carried out on the reduced set of files (164). This processing is the same as that described in the 1st embodiment, so its detailed description is not repeated. Here, groups are formed automatically from the target file group, and each file is grouped into one of the groups.

Then, in steps 166 and 168, the representative file of each group is determined, and the keywords are extracted, given importance degrees and assigned to each group. After this, the user edits the keywords as the case requires.

When the files were reduced at the start of the initial processing, the remaining files (the ungrouped files) are classified in step 170 into the initial groups or the "unfiled" group.

Referring to Figure 19, in the processing of step 170, 0 is first assigned to the variable n representing the file number and to the variable max_similarity (190, 192). The variable n is then incremented by 1 (194). It is judged whether n is greater than the number of files (196). At the first judgment the result is normally "No", so the processing proceeds to step 198.

After 0 is assigned to the variable i in step 198, i is incremented by 1 (200). It is then judged whether the value of i (=1) exceeds the number of groups formed by the initial group formation processing (step 164 of Figure 18) (202). Normally there are multiple groups, so control proceeds to step 204. In step 204, the similarity between the 1st file and the tag file of the 1st group is calculated.

In step 206, it is judged whether the similarity calculated in step 204 is greater than the variable max_similarity. At this point, the value of max_similarity is the 0 set in step 192. Normally, the similarity between the 1st file and the tag file of the 1st group is greater than 0 and less than 1, so the result of the judgment here is "Yes"; in step 208 the similarity calculated in step 204 is assigned to max_similarity, and control returns to step 200. At this time, the value of the variable i (i=1 in the case now described) is stored in the variable that indicates the group for which the maximum similarity was obtained.

In step 200, i is incremented by 1, so that the value of i becomes 2. The processing of the subsequent steps 202 to 208 is carried out between the 1st file and the tag file of the 2nd group. With i set to 3, 4, 5 and so on, the similarities between the 1st file and the tag files of all the groups are calculated, and the largest of them is stored in the variable max_similarity. The number of the group that gave this value is also stored.

When the similarities between the 1st file and the tag files of all the groups have been calculated in this way, the result of the judgment in step 202 is "Yes", and control proceeds to step 220. In step 220, it is judged whether the value of max_similarity is equal to or greater than the similarity threshold specified in advance. If the result is "Yes", the 1st file is classified into the group that gave the maximum similarity (222); otherwise, the 1st file is classified into the "unfiled" group. The "unfiled" group is not an object of the processing of steps 200 to 208.

Control then returns to step 192, 0 is again assigned to variable max_similarity, n is incremented by 1 to become 2, and the same processing as for the 1st file is carried out on the 2nd file.

In this way, by carrying out the classification processing on all files, each ungrouped file is normally classified into one of the initial sets, and is classified into the "unfiled" group when it is dissimilar to all of them. When all files have been classified, the initial processing (step 140 of Figure 17) ends.

Referring again to Figure 17, the processing from step 142 onward is executed when a file is appended or at certain times.

When a file is appended, the processing of step 142 is executed. As described above, this processing is the same as that shown in steps 198 to 224 of Figure 19. As a result, the file is normally classified into one of the initial sets, and is classified into the "unfiled" group when it is dissimilar to the tag file of every group.

In this way, the classification processing is carried out each time a file is appended. When it is determined in step 144 that the number of files in the "unfiled" group exceeds 1000, however, the same processing as the "initial set formation" of Figure 18 is carried out in step 146 on the files in the "unfiled" group. In this processing, the newly formed groups are additionally registered among the existing groups.

As a result of step 146, new groups are thus added to the initial sets, and every file in the "unfiled" group is classified into one of these groups.
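The flow of steps 142 to 146 can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation of the embodiment: the class IncrementalGrouper and its attribute names are hypothetical, the similarity function is supplied by the caller, and form_groups is a hypothetical stand-in for the "initial set formation" of Figure 18 that is assumed to return (tag file, member files) pairs covering every file passed to it.

```python
class IncrementalGrouper:
    """Sketch of steps 142 to 146: classify appended files, regroup when needed."""

    def __init__(self, tag_files, threshold, similarity, form_groups, limit=1000):
        self.tag_files = list(tag_files)        # one tag file per existing group
        self.members = [[] for _ in self.tag_files]
        self.unfiled = []                       # the "unfiled" group
        self.threshold = threshold
        self.similarity = similarity            # assumed: (file, tag_file) -> float
        self.form_groups = form_groups          # assumed stand-in for Figure 18
        self.limit = limit                      # 1000 in the description

    def append(self, file):
        # Step 142: find the group whose tag file is most similar to the file.
        best, max_similarity = None, 0.0
        for i, tag in enumerate(self.tag_files):
            s = self.similarity(file, tag)
            if s > max_similarity:
                best, max_similarity = i, s
        if best is not None and max_similarity >= self.threshold:
            self.members[best].append(file)
        else:
            self.unfiled.append(file)
        # Step 144: has the "unfiled" group grown past the limit?
        if len(self.unfiled) > self.limit:
            self._regroup_unfiled()

    def _regroup_unfiled(self):
        # Step 146: form new groups from the "unfiled" files only and
        # register them alongside the existing groups.
        for tag, files in self.form_groups(self.unfiled):
            self.tag_files.append(tag)
            self.members.append(list(files))
        self.unfiled = []
```

Only the files in the "unfiled" group are regrouped in this sketch, so the cost of step 146 depends on the limit rather than on the total number of files already grouped.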

Thereafter, the processing of steps 142 to 146 is repeated. By repeating this processing, the grouping and classification of files can be carried out in a practical amount of time without requiring excessive hardware, even when the number of files to be grouped and the number of appended files are large.

The above description has dealt mainly with a method of grouping files. As those skilled in the art will readily understand, however, the present embodiment is not limited to the classification of files and can be used to group data of any kind. In particular, when the data to be grouped are numerous and data are frequently appended, the grouping and classification of the data can be carried out efficiently.

As is clear from the above description, the system of the present invention can be realized with a general-purpose computer and software running on it. It can of course also be realized with dedicated hardware.

The embodiment disclosed here is illustrative in all respects and should not be considered limiting. The scope of the present invention is indicated not by the above description but by the claims, and is intended to include all changes within the meaning and scope equivalent to the claims.

Claims

1. A file grouping device, characterized by comprising:

a similarity calculating unit that calculates the similarity between the files of a file group;

a similarity threshold calculating unit that is connected to said similarity calculating unit and that calculates, according to the distribution bias of the similarity between said files, a similarity threshold used for grouping said file group; and

grouping parts that are connected to said similarity threshold calculating unit and said similarity calculating unit and that group said file group according to said similarity threshold and the similarity between said files,

wherein said similarity threshold calculating unit comprises:

a similarity threshold-group number relation calculating unit that obtains, according to the similarity between said files, the relation between an arbitrary similarity threshold and the group number obtained when said grouping parts perform grouping using that arbitrary similarity threshold; and

a calculating unit that is connected to said similarity threshold-group number relation calculating unit and that calculates the similarity threshold according to the distribution bias of the similarity between said files as it appears in said relation between the similarity threshold and the group number.

2. The file grouping device as recited in claim 1, characterized in that said similarity threshold calculating unit further comprises a part that is connected to said similarity threshold-group number relation calculating unit and that calculates, according to a group number designated by an operator, an appropriate similarity threshold used by said grouping parts for grouping said file group.

3. The file grouping device as recited in claim 1, characterized in that

the device further comprises a similarity memory unit that stores the similarity between files calculated by said similarity calculating unit; and

said similarity threshold calculating unit and said grouping parts, when said similarity memory unit stores the latest similarity, use the similarity stored in said similarity memory unit to carry out the calculation of the similarity threshold and the grouping processing, respectively.

4. The file grouping device as recited in claim 1, characterized by further comprising:

a tag file calculating unit that calculates a tag file for each of said groups formed by said grouping parts; and

appending grouping parts that group an ungrouped appended file according to the similarity between said appended file and the tag file of each of said groups.

5. The file grouping device as recited in claim 4, characterized in that said appending grouping parts comprise:

a part that calculates the maximum value of the similarity between said appended file and the tag file of each of said groups;

a part that determines whether said maximum value satisfies a prescribed condition; and

a part that, when it is determined that said maximum value satisfies said prescribed condition, classifies said appended file into the group giving said maximum value.

6. The file grouping device as recited in claim 5, characterized in that said appending grouping parts further comprise a part that, when it is determined that said maximum value does not satisfy said prescribed condition, classifies said appended file into a specific unfiled group.

7. The file grouping device as recited in claim 6, characterized in that said appending grouping parts further comprise a part that, in response to the number of appended files classified into said unfiled group satisfying a prescribed condition, carries out said grouping processing on the appended files classified into said unfiled group.

8. A file grouping device, characterized by comprising:

a similarity calculating unit that obtains the similarity between the files of a file group;

group number accepting parts that accept input of a group number from an operator;

grouping parts that are connected to said similarity calculating unit and said group number accepting parts and that group said file group according to a predetermined similarity threshold and the distribution bias of said similarity;

group number consistency decision means that is connected to said group number accepting parts and said grouping parts and that determines whether the group number of the grouping result matches the group number from said operator accepted by said group number accepting parts; and

similarity threshold change parts that are connected to said group number accepting parts, said group number consistency decision means and said grouping parts, and that change said predetermined similarity threshold according to the output of said group number consistency decision means and supply it to said grouping parts.

9. The file grouping device as recited in claim 1 or claim 8, characterized by further comprising a part that removes, from the objects grouped by said grouping parts, any group in which the number of files contained is equal to or less than a number of files determined by a prescribed method.

10. The file grouping device as recited in claim 8, characterized in that

said grouping parts, when said similarity memory unit stores the latest similarity, use the similarity stored in said similarity memory unit to carry out the grouping processing.