
CN109062887A - Part-of-speech tagging method based on the averaged perceptron algorithm - Google Patents

Part-of-speech tagging method based on the averaged perceptron algorithm

Info

Publication number
CN109062887A
Authority
CN
China
Prior art keywords
word
speech
sentence
feature
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810561207.4A
Other languages
Chinese (zh)
Inventor
邵玉斌
郭海震
龙华
杜庆治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810561207.4A priority Critical patent/CN109062887A/en
Publication of CN109062887A publication Critical patent/CN109062887A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the field of natural language processing. The invention first trains on a training set: it extracts word information from the training set, such as the base form of the current word, its last two letters, and the part of speech of the previous word, updates the probability of each candidate part of speech under each feature according to the corpus, and finally stores the result locally as a byte stream in a nested-dictionary data structure. In the tagging stage, the sentence to be tagged is preprocessed, the features of each word are extracted, and the most probable part of speech is returned by comparison with the model file. The invention reaches high accuracy with a comparatively small training set, places only modest demands on equipment, and does not require long training times.

Description

Part-of-speech tagging method based on the averaged perceptron algorithm
Technical field
The present invention relates to a part-of-speech tagging method based on the averaged perceptron algorithm, and belongs to the field of natural language processing.
Background technique
Part-of-speech tagging is a fundamental task in natural language processing. It underlies many other NLP tasks and, to a large extent, determines the final performance of downstream work. Building a high-performance, efficient part-of-speech tagging system therefore has both academic significance and practical value.
The perceptron is a linear model for binary classification. Its input is the feature vector of an instance; its output is the instance's class, taking the binary values +1 and -1. The perceptron corresponds to a separating hyperplane in the input (feature) space that divides instances into positive and negative classes, and it is a discriminative model. Perceptron learning aims to find a separating hyperplane for linearly separable training data; to this end, a loss function based on misclassification is introduced and minimized by gradient descent, yielding the perceptron model. The perceptron learning algorithm is simple and easy to implement, and comes in a primal form and a dual form. Perceptron prediction classifies new input instances with the learned model.
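For reference, a minimal sketch of the classical primal-form perceptron described above (illustrative only; the function and variable names are ours, not part of the invention):

    # Primal-form perceptron: on each misclassified point (y * score <= 0),
    # move the hyperplane toward the point: w += y * x, b += y.
    def perceptron_train(examples, epochs=10):
        """examples: list of (x, y) pairs; x is a list of floats, y is +1 or -1."""
        n = len(examples[0][0])
        w, b = [0.0] * n, 0.0
        for _ in range(epochs):
            for x, y in examples:
                score = sum(wi * xi for wi, xi in zip(w, x)) + b
                if y * score <= 0:  # misclassified (or exactly on the hyperplane)
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    b += y
        return w, b

    def perceptron_predict(w, b, x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1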
Summary of the invention
The technical problem to be solved by the present invention is to propose a part-of-speech tagging method based on the averaged perceptron algorithm, in order to solve the problems described above.
The technical scheme of the present invention is a part-of-speech tagging method based on the averaged perceptron algorithm. Training stage: first, the corpus is preprocessed; then, training data is read from the corpus; next, features are extracted from the corpus; then, the weight values of the feature templates are trained; finally, the averaged weights are computed. Testing stage: first, the input is preprocessed; then, the features of each word in the sentence to be tagged are extracted and compared with the model obtained in the training stage, yielding the most probable part of speech for the word.
The specific steps are as follows:
(1) Read training data from the corpus: words are read from the corpus one at a time; reading a full stop marks the end of a sentence, so the preceding words are grouped into one sentence. Each sentence is stored in a sentence variable, and sentence is then appended to the train_data list, which serves as the training set;
(2) Read one sentence from train_data, where the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to avoid errors when processing the first or last word;
(5) Tag the words in the words list from step (3) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if it has, continue to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag the words in the words list from step (10) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store the result in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if it has, output the tokens list.
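A compact sketch, in Python, of how training steps (1)-(8) could fit together. The helpers normalize, get_features, predict and update, and the AveragedWeights bookkeeping class, are hypothetical names of ours; each is sketched after the corresponding definition below:

    def train(train_data, tag_dict, n_iter=5):
        """train_data: [([words], [tags]), ...]; tag_dict: the high-frequency dictionary."""
        model = AveragedWeights()                    # nested dictionary plus averaging bookkeeping
        for _ in range(n_iter):                      # several passes over the corpus
            for words, tags in train_data:           # step (2)
                norm = [normalize(w) for w in words]                    # step (3)
                context = ['START', 'START2'] + norm + ['END', 'END2']  # step (4)
                prev, prev2 = 'START', 'START2'
                for i, word in enumerate(norm):      # steps (5)-(6)
                    guess = tag_dict.get(word)       # high-frequency dictionary lookup
                    if guess is None:
                        feats = get_features(i, word, context, prev, prev2)
                        guess = predict(model.weights, feats)
                        update(model, tags[i], guess, feats)
                    prev2, prev = prev, guess
        model.average()                              # step (8): averaged weights
        return model.weights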
The sentence in step (1) has the form sentence = ([], []), where the first list holds the words and the second list holds the part of speech of each word in the first list;
train_data in step (1) has the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence;
Preprocessing the words in step (3) means: first, all words are converted to lowercase; second, numbers between 1900 and 2200 are mapped to YEAR and all other numbers to DIGITS; finally, a run of eleven consecutive digits is mapped to TELENUM;
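A sketch of this preprocessing rule (the class names YEAR, DIGITS and TELENUM follow the description above):

    def normalize(word):
        """Lowercase a token and collapse numbers into the classes defined above."""
        w = word.lower()
        if w.isdigit():
            if len(w) == 4 and 1900 <= int(w) <= 2200:
                return 'YEAR'
            if len(w) == 11:        # e.g. an eleven-digit telephone number
                return 'TELENUM'
            return 'DIGITS'
        return w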
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are inserted before each words list and 'END' and 'END2' are appended after it, to avoid errors when processing the first or last word;
The high-frequency dictionary in step (5): in English, roughly half of all words have a fixed, unambiguous part of speech. These words need no prediction and are read directly from the local high-frequency dictionary, which both improves the efficiency of the algorithm and raises its accuracy;
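The description does not fix how the high-frequency dictionary is built. One plausible construction, which keeps only frequent words whose dominant tag is almost always correct (the thresholds below are our assumption), is:

    from collections import Counter, defaultdict

    def build_tag_dict(train_data, min_freq=20, min_purity=0.97):
        counts = defaultdict(Counter)
        for words, tags in train_data:
            for word, tag in zip(words, tags):
                counts[word.lower()][tag] += 1
        tag_dict = {}
        for word, tag_counts in counts.items():
            tag, n = tag_counts.most_common(1)[0]
            total = sum(tag_counts.values())
            if total >= min_freq and n / total >= min_purity:
                tag_dict[word] = tag    # effectively unambiguous: tag it by lookup
        return tag_dict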
Extracting features in step (5) means collecting: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the following word itself; the last three letters of the following word; and the word two positions ahead itself;
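A sketch of this feature template; the feature-name strings mirror those in the worked example of the embodiment (e.g. 'i suffix ...', 'i-1 tag ...'), but the exact naming is illustrative:

    def get_features(i, word, context, prev, prev2):
        """i indexes the unpadded sentence; context carries the START/END padding."""
        feats = {}
        j = i + 2                                       # skip the two START pads
        feats['i suffix ' + word[-3:]] = 1              # last three letters of the word
        feats['i pref1 ' + word[0]] = 1                 # first letter of the word
        feats['i-1 tag ' + prev] = 1                    # tag of the previous word
        feats['i-2 tag ' + prev2] = 1                   # tag two positions back
        feats['i word ' + context[j]] = 1               # the word itself
        feats['i-1 word ' + context[j - 1]] = 1         # previous word itself
        feats['i-1 suffix ' + context[j - 1][-3:]] = 1  # last three letters of previous word
        feats['i-2 word ' + context[j - 2]] = 1         # word two positions back
        feats['i+1 word ' + context[j + 1]] = 1         # following word itself
        feats['i+1 suffix ' + context[j + 1][-3:]] = 1  # last three letters of following word
        feats['i+2 word ' + context[j + 2]] = 1         # word two positions ahead
        return feats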
Predicting the part of speech from the features in step (6) means: from features of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...}, the tag corresponding to the largest weight value among all of the word's features is selected as the word's predicted part of speech;
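A sketch of this selection rule. As described (and as the worked example in the embodiment confirms), the tag attached to the single largest weight across the word's features is returned, rather than a per-tag sum of weights as in the classical averaged perceptron:

    def predict(weights, features):
        best_tag, best_w = None, float('-inf')
        for feat in features:
            for tag, w in weights.get(feat, {}).items():
                if w > best_w:          # keep the single largest weight seen
                    best_tag, best_w = tag, w
        return best_tag                 # None if no feature has any weight yet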
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight values under each of the word's features remain unchanged; if the prediction is wrong, the weight of the correct tag under each of the word's features is increased by one and the weight of the wrongly predicted tag is decreased by one;
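A sketch of this update rule, written against the AveragedWeights bookkeeping sketched after step (8) below (the per-word step counter i drives the averaging):

    def update(model, truth, guess, features):
        model.i += 1                    # one more update step
        if truth == guess:
            return                      # correct prediction: weights unchanged
        for feat in features:
            model.upd_feat(feat, truth, +1.0)   # reward the correct tag
            model.upd_feat(feat, guess, -1.0)   # penalize the wrongly predicted tag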
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the freshly updated weights cannot be used directly. The problem with the plain algorithm is that training on two slightly different sets of examples can produce completely different models; the model does not handle this gracefully. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the whole model to accommodate them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records when each weight last changed; whenever a feature weight is about to change, that timestamp is read out to perform the averaging update;
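A sketch of this bookkeeping, following the widely used averaged-perceptron trick: _totals accumulates each weight multiplied by the number of steps it stayed in force, _tstamps records the step at which each (feature, tag) weight last changed, and averaging divides the accumulated total by the number of steps:

    class AveragedWeights:
        def __init__(self):
            self.weights = {}   # {feature: {tag: weight}} -- the nested dictionary
            self._totals = {}   # {(feature, tag): accumulated weight * steps}
            self._tstamps = {}  # {(feature, tag): step of last change}
            self.i = 0          # update steps seen so far

        def upd_feat(self, feat, tag, delta):
            key = (feat, tag)
            w = self.weights.setdefault(feat, {}).get(tag, 0.0)
            # credit the old weight for every step since it last changed
            self._totals[key] = self._totals.get(key, 0.0) \
                + (self.i - self._tstamps.get(key, 0)) * w
            self._tstamps[key] = self.i
            self.weights[feat][tag] = w + delta

        def average(self):
            # replace each live weight by its average over all update steps
            for feat, tag_weights in self.weights.items():
                for tag, w in tag_weights.items():
                    key = (feat, tag)
                    total = self._totals.get(key, 0.0) \
                        + (self.i - self._tstamps.get(key, 0)) * w
                    tag_weights[tag] = round(total / max(self.i, 1), 3)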
The nested dictionary in step (8) is a data structure of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...};
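The description says only that this nested dictionary is written to a local file as a byte stream; Python's pickle module is one way to realize that (our assumption, not specified by the patent):

    import pickle

    def save_model(weights, path='model.pickle'):
        with open(path, 'wb') as f:     # byte-stream serialization
            pickle.dump(weights, f)

    def load_model(path='model.pickle'):
        with open(path, 'rb') as f:
            return pickle.load(f)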
Preprocessing in step (10) means: first, all words are converted to lowercase; second, numbers between 1900 and 2200 are mapped to YEAR and all other numbers to DIGITS; finally, a run of eleven consecutive digits is mapped to TELENUM;
The high-frequency dictionary in step (11): in English, roughly half of all words have a fixed, unambiguous part of speech; these words need no prediction and are read directly from the local high-frequency dictionary;
Extracting features in step (11) means collecting the same features as in step (5): the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the following word itself; the last three letters of the following word; and the word two positions ahead itself;
Predicting the part of speech in step (12) means: from features of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...}, the tag corresponding to the largest weight value among all of the word's features is selected as the word's predicted part of speech;
The tokens list in step (12) has the form [(word 1, tag 1), (word 2, tag 2), ...].
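Putting steps (9)-(13) together, a sketch of the tagging phase that reuses the helper sketches above:

    def tag(sentence_words, weights, tag_dict):
        norm = [normalize(w) for w in sentence_words]           # step (10)
        context = ['START', 'START2'] + norm + ['END', 'END2']
        prev, prev2 = 'START', 'START2'
        tokens = []
        for i, word in enumerate(sentence_words):               # steps (11)-(12)
            t = tag_dict.get(norm[i])
            if t is None:
                t = predict(weights, get_features(i, norm[i], context, prev, prev2))
            tokens.append((word, t))
            prev2, prev = prev, t
        return tokens                                           # step (13)

    # e.g. tag(['I', 'love', 'you'], weights, tag_dict)
    #   -> [('I', 'NNP'), ('love', 'VB'), ('you', 'NNP')]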
The beneficial effects of the present invention are: the method reaches high accuracy with a relatively small training set, places only modest demands on equipment, trains quickly, and is simple and easy to implement.
Detailed description of the invention
Fig. 1 is a flow chart of the steps of the invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a part-of-speech tagging method based on the averaged perceptron algorithm proceeds as follows:
(1) Read training data from the corpus: words are read from the corpus one at a time; reading a full stop marks the end of a sentence, so the preceding words are grouped into one sentence. Each sentence is stored in a sentence variable, and sentence is then appended to the train_data list, which serves as the training set;
(2) Read one sentence from train_data, where the words form the words list and the parts of speech form the tags list, specifically: words = [I, love, you], tags = [NNP, VB, NNP];
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to avoid errors when processing the first or last word;
(5) Tag the words in the words list from step (3) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if it has, continue to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag the words in the words list from step (10) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store the result in the tokens list, specifically: tokens = [(I, NNP), (love, VB), (you, NNP)];
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if it has, output the tokens list.
The sentence in step (1) has the form sentence = ([], []), where the first list holds the words and the second list holds the part of speech of each word in the first list; specifically: sentence = ([I, love, you], [NNP, VB, NNP]);
train_data in step (1) has the form train_data = [sentence1, sentence2, ...], that is, train_data = [([], []), ([], []), ...], where each () is one sentence; specifically: train_data = [([I, love, you], [nnp, vb, nnp]), ([a, good, man], [at, adj, n])];
Preprocessing the words in step (3) means: first, all words are converted to lowercase; second, numbers between 1900 and 2200 are mapped to YEAR and all other numbers to DIGITS; finally, a run of eleven consecutive digits is mapped to TELENUM; specifically: 'There is a tree' is transformed into 'there is a tree';
Adding special tokens before and after the words list in step (4) means: 'START' and 'START2' are inserted before each words list and 'END' and 'END2' are appended after it, to avoid errors when processing the first or last word; specifically: [START, START2, I, love, you, END, END2];
The high-frequency dictionary in step (5): in English, roughly half of all words have a fixed, unambiguous part of speech; these words need no prediction and are read directly from the local high-frequency dictionary, which both improves the efficiency of the algorithm and raises its accuracy; specifically: {good: adj, man: n, ...};
Extracting features in step (5) means collecting: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the following word itself; the last three letters of the following word; and the word two positions ahead itself; specifically: for the sentence 'this be despite a system', the features of 'this' are: 'i suffix his', 'i pref1 t', 'i-1 tag -START-', 'i-2 tag -START2-', 'i word this', 'i-1 suffix T2-', 'i-2 word -START-', 'i+1 word be', 'i+1 suffix be', 'i+2 word despite';
Predicting the part of speech from the features in step (6) means: from features of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...}, the tag corresponding to the largest weight value among all of the word's features is selected as the word's predicted part of speech. Specifically, for the word 'this' the trained weights include (excerpt; the full entry lists every candidate tag under each of the word's features): {'i suffix his': {'dd1': 3.803, 'zz1_np1@': -1.0, 'nn2': -1.793, 'nn1_jjr': -0.947, 'jj': -0.892, 'appge': 2.185, 'nn1': -0.814, 'np1': -0.541}, 'i pref1 t': {'dd1': 2.437, 'to': 1.674, 'cst': 1.827, 'rr': -1.929, ...}, 'i-1 tag -START-': {'dd1': 0.008, 'dd2': 0.706, 'null': 0.882, ...}, 'i word this': {'dd1': 3.803, 'nn2': -0.963, 'nn1_jjr': -0.947, 'jj': -0.892}, 'i-1 tag+i word -START- this': {'dd1': 1.963, 'zz1_np1@': -1.0, 'nn2': -0.963}, 'i+1 word be': {'dd1': 1.09, 'np1': 1.092, 'nn2': 1.411, 'jj': -1.969, ...}, 'i+2 word despite': {'dd1': 1.0, 'zz1_np1@': -1.0, 'ii': 0.686, 'nn1_vv0_jj': -0.686}}; the largest weight, 3.803, is selected, so the corresponding tag 'dd1' is returned as the prediction;
Updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the weight values under each of the word's features remain unchanged; if the prediction is wrong, the weight of the correct tag under each of the word's features is increased by one and the weight of the wrongly predicted tag is decreased by one;
Averaging the weights in step (8) means: to improve the accuracy of the perceptron algorithm, the freshly updated weights cannot be used directly. The problem with the plain algorithm is that training on two slightly different sets of examples can produce completely different models; the model does not handle this gracefully. A bigger problem is that the algorithm pays too much attention to the misclassified points and adjusts the whole model to accommodate them. The remedy is therefore to return the averaged weights rather than the final weights. Concretely, an additional dictionary is maintained that records when each weight last changed; whenever a feature weight is about to change, that timestamp is read out to perform the averaging update;
The nested dictionary in step (8) is a data structure of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...}, of the same form as the weight excerpt shown above;
Preprocessing in step (10) means: first, all words are converted to lowercase; second, numbers between 1900 and 2200 are mapped to YEAR and all other numbers to DIGITS; finally, a run of eleven consecutive digits is mapped to TELENUM; specifically: 'There is a tree' is transformed into 'there is a tree';
The high-frequency dictionary in step (11): in English, roughly half of all words have a fixed, unambiguous part of speech; these words need no prediction and are read directly from the local high-frequency dictionary;
Extracting features in step (11) means collecting the same features as in step (5): the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the following word itself; the last three letters of the following word; and the word two positions ahead itself;
Predicting the part of speech in step (12) means: from features of the form {feature 1: {tag 1: weight 1, tag 2: weight 2, ...}, feature 2: {tag 1: weight 1, tag 2: weight 2, ...}, ...}, the tag corresponding to the largest weight value among all of the word's features is selected as the word's predicted part of speech;
The tokens list in step (12) has the form [(word 1, tag 1), (word 2, tag 2), ...]; specifically: [(a, at), (good, adj), (man, n)].
The embodiment of the present invention has been described in detail above with reference to the accompanying drawing, but the present invention is not limited to the above embodiment; various changes may be made within the knowledge of a person of ordinary skill in the art without departing from the concept of the invention.

Claims (4)

1. A part-of-speech tagging method based on the averaged perceptron algorithm, characterized by the following steps:
(1) Read training data from the corpus: words are read from the corpus one at a time; reading a full stop marks the end of a sentence, so the preceding words are grouped into one sentence. Each sentence is stored in a sentence variable, and sentence is then appended to the train_data list, which serves as the training set;
(2) Read one sentence from train_data, where the words form the words list and the parts of speech form the tags list;
(3) Preprocess the words obtained in step (2);
(4) Add special tokens before and after the words list from step (3), to avoid errors when processing the first or last word;
(5) Tag the words in the words list from step (3) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(6) Predict the word's part of speech from the features extracted in step (5), and update the weights according to the prediction result;
(7) Check whether train_data has been fully processed; if not, repeat steps (2) to (6); if it has, continue to the next step;
(8) Average the weights, and store each feature's candidate parts of speech and their weights locally as a byte stream, using a nested-dictionary data structure;
(9) To tag an input sentence, store the words of the sentence to be processed, in order, in the list words;
(10) Preprocess the words list from step (9);
(11) Tag the words in the words list from step (10) one by one, as follows: look the word up in the high-frequency dictionary; if it is found, its part of speech is determined; if not, extract the word's features;
(12) Predict the word's part of speech from the features extracted in step (11), and store the result in the tokens list;
(13) Check whether the words list from step (9) has been fully processed; if not, repeat steps (11) to (12); if it has, output the tokens list.
2. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that preprocessing the words in step (3) means: first, all words are converted to lowercase; second, numbers between 1900 and 2200 are mapped to YEAR and all other numbers to DIGITS; finally, a run of eleven consecutive digits is mapped to TELENUM.
3. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that extracting features in step (5) means collecting: the last three letters of the word; the first letter of the word; the part of speech of the previous word in the sentence; the part of speech of the word two positions back; the word itself; the previous word itself; the last three letters of the previous word; the word two positions back itself; the following word itself; the last three letters of the following word; and the word two positions ahead itself.
4. The part-of-speech tagging method based on the averaged perceptron algorithm according to claim 1, characterized in that updating the weights in step (6) means: if the part of speech predicted from the current features is correct, the corresponding weight values under each feature remain unchanged; if the prediction is wrong, the weight of the correct tag under each of the word's features is increased by one and the weight of the wrongly predicted tag is decreased by one.
CN201810561207.4A 2018-06-04 2018-06-04 Part-of-speech tagging method based on the averaged perceptron algorithm Pending CN109062887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810561207.4A CN109062887A (en) 2018-06-04 2018-06-04 Part-of-speech tagging method based on the averaged perceptron algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810561207.4A CN109062887A (en) 2018-06-04 2018-06-04 Part-of-speech tagging method based on the averaged perceptron algorithm

Publications (1)

Publication Number Publication Date
CN109062887A true CN109062887A (en) 2018-12-21

Family

ID=64820276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810561207.4A Pending CN109062887A (en) 2018-06-04 2018-06-04 Part-of-speech tagging method based on the averaged perceptron algorithm

Country Status (1)

Country Link
CN (1) CN109062887A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140104A1 (en) * 2005-05-05 2016-05-19 Cxense Asa Methods and systems related to information extraction
CN101295295A * 2008-06-13 2008-10-29 Institute of Computing Technology, Chinese Academy of Sciences Chinese lexical analysis method based on a linear model
CN102831558A * 2012-07-20 2012-12-19 Guilin University of Electronic Technology System and method for automatically scoring college English compositions independent of manual pre-scoring
CN107807910A * 2017-10-10 2018-03-16 Kunming University of Science and Technology Part-of-speech tagging method based on HMM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TOGETHER_CZ: "Python implementation of part-of-speech tagging based on the averaged perceptron algorithm", https://blog.csdn.net/together_cz/article/details/73821852 *
LI Zhenghua: "Research on key technologies of Chinese dependency parsing", China Doctoral Dissertations Full-text Database (Information Science and Technology) *
PU Haoyu: "Research on detection methods for non-specific events in Twitter", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697289A * 2018-12-28 2019-04-30 Beijing University of Technology Improved active learning method for named entity recognition
CN109697289B (en) * 2018-12-28 2023-01-13 北京工业大学 Improved active learning method for named entity recognition

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN109063159B (en) Entity relation extraction method based on neural network
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN107729314B (en) Chinese time identification method and device, storage medium and program product
CN102289522B (en) Method of intelligently classifying texts
CN108984745A Neural network text classification method fusing multiple knowledge graphs
CN109657239A Chinese named entity recognition method based on attention mechanism and language model learning
CN110309868A Hyperspectral image classification method combining unsupervised learning
CN109598517B (en) Commodity clearance processing, object processing and category prediction method and device thereof
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN107832458A Text classification method based on a character-level nested deep network
CN110263325A Chinese automatic word segmenter
CN111797622B (en) Method and device for generating attribute information
CN110781297B (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN109993216A Text classification method and device based on K-nearest neighbors (KNN)
CN107357895A Processing method for text representation based on the bag-of-words model
CN111859967A (en) Entity identification method and device and electronic equipment
CN109543036A (en) Text Clustering Method based on semantic similarity
CN114218945A (en) Entity identification method, device, server and storage medium
CN115563982A (en) Advertisement text optimization method and device, equipment, medium and product thereof
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN115700515A (en) Text multi-label classification method and device
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN110134956A Place name and organization name recognition method based on BLSTM-CRF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181221)