CN110019711A

CN110019711A - Control method and device for structured processing of medical text data

Info

Publication number: CN110019711A
Application number: CN201711205811.5A
Authority: CN
Inventors: 罗震; 吴谨准; 贾虎; 徐盛; 顾春宏
Original assignee: Individual
Current assignee: Basebit Shanghai Information Technology Co ltd; WU JINZHUN
Priority date: 2017-11-27
Filing date: 2017-11-27
Publication date: 2019-07-16

Abstract

The present invention provides a control method for structured processing of medical text data, comprising the following steps: a. performing entity extraction on the medical text data based on a structural model to obtain a plurality of medical entity mappings, wherein the structural model includes a plurality of label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes a plurality of character embeddings; b. combining the plurality of medical entity mappings to obtain the structured text. The invention also provides a control device for structured processing of medical text data, comprising an entity extraction device and a structuring device. By using entity extraction, the invention avoids dependence on a matching dictionary, improves extraction quality, generalization ability and scalability, and reduces maintenance cost.

Description

Control method and device for structured processing of medical text data

Technical field

The invention belongs to the technical field of information processing. It relates in particular to a method of processing medical text using artificial intelligence technology, and especially to a control method and device for structured processing of medical text data.

Background art

Artificial intelligence (AI) refers to intelligence exhibited by machines made by humans; it usually means intelligence implemented on ordinary computers. AI is divided into weak AI and strong AI. Weak AI (also called narrow AI) is generally understood as AI technology that focuses on solving problems in one specific field, and can also be regarded as a technical tool applied to that field.

Natural language processing (NLP) is an important branch of narrow AI. It focuses on processing and using natural language and is widely applied in human-computer interaction. Its scope includes information retrieval, information extraction, machine translation, text-to-speech, word segmentation, part-of-speech tagging, automatic summarization and other fields.

In practical healthcare big-data applications, the word segmentation and tagging techniques of NLP can be used to analyze the medical records that doctors write in natural language and to extract from them the patient's symptoms, treatment information, events and similar information. Acquiring and standardizing this information plays an important role in doctors' clinical research and in building applications such as AI-assisted diagnosis and treatment systems.

At present there is no control method that performs natural language processing specifically for medical big data; medical text is still analyzed with traditional word segmentation and tagging methods. Traditional segmentation works as follows: build a dictionary; build a scoring model from the frequency with which adjacent characters occur together; and handle unrecognized new words with other auxiliary methods. This has two drawbacks: first, processing takes long and response is not fast enough; second, the matching success rate is low when new words are encountered. Medical text has a relatively simple language structure but contains a large amount of specialized vocabulary and many long words. Applying existing segmentation and tagging methods to it slows response further, because dictionary-based segmentation is designed mainly for ordinary language and is weak on technical terms. As a result, new words appear frequently when medical text is processed, and the matching success rate drops even further.

Summary of the invention

In view of the technical defects of the prior art, according to one aspect of the invention, a control method for structured processing of medical text data is provided. It performs structured processing on medical text data in natural language to obtain structured text, and comprises the following steps:

a. performing entity extraction on the medical text data based on a structural model to obtain a plurality of medical entity mappings, wherein the structural model includes a plurality of label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes a plurality of character embeddings;

b. combining the plurality of medical entity mappings to obtain the structured text.

Preferably, step a includes the following steps:

a1. converting the medical text data into a two-dimensional character-embedding matrix and inputting it into a bidirectional long short-term memory network;

a2. the bidirectional long short-term memory network outputting a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length;

a3. passing the two-dimensional medical text data matrix into a conditional random field, obtaining the label sequence with the highest score, and mapping the character embeddings corresponding to that label sequence as the medical entities, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.

Preferably, the following steps are also performed before step a:

i. converting a standard character sequence into a two-dimensional character-embedding matrix and inputting it into the bidirectional long short-term memory network;

ii. the bidirectional long short-term memory network outputting a two-dimensional standard character sequence matrix whose length is the sequence length corresponding to the standard character sequence and whose width is a specified length, and passing this matrix into the conditional random field;

iii. the conditional random field computing the conditional probability of the structural model and obtaining the loss value of the structural model, and updating the weights of each layer of the structural model with the backpropagation algorithm to optimize the loss value;

iv. repeating steps i, ii and iii until the structural model converges.

Preferably, step b includes the following steps:

b1. performing a word segmentation operation on the medical text data to obtain a text segmentation result, and performing a word segmentation operation on the plurality of medical entity mappings to obtain a medical entity mapping segmentation result, wherein the segmentation operation is performed by a segmentation model formed by model training on the basis of manual annotation;

b2. matching the text segmentation result against the medical entity mapping segmentation result, and selecting a plurality of preferred medical entity mappings from the plurality of medical entity mappings based on the matching result;

b3. combining the plurality of preferred medical entity mappings to obtain the structured text.

Preferably, the following steps are included after step b:

c. inputting the plurality of medical entity mappings into a database for conversion to obtain a plurality of standard information fragments;

d. combining the plurality of standard information fragments to obtain standard structured text.

According to another aspect of the invention, a control device for structured processing of medical text data is also provided. It performs structured processing on medical text data in natural language to obtain structured text, and comprises:

an entity extraction device, used to perform entity extraction on the medical text data based on a structural model to obtain a plurality of medical entity mappings, wherein the structural model includes a plurality of label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes a plurality of character embeddings;

a structuring device, used to combine the plurality of medical entity mappings to obtain the structured text.

Preferably, the entity extraction device includes the following devices:

a first input device, used to convert the medical text data into a two-dimensional character-embedding matrix and input it into a bidirectional long short-term memory network;

a first output device, used for the bidirectional long short-term memory network to output a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length;

a first acquisition device, used to pass the two-dimensional medical text data matrix into a conditional random field, obtain the label sequence with the highest score, and map the character embeddings corresponding to that label sequence as the medical entities, wherein the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.

Preferably, the control device further includes the following devices:

a second input device, used to convert a standard character sequence into a two-dimensional character-embedding matrix and input it into the bidirectional long short-term memory network;

a second output device, used for the bidirectional long short-term memory network to output a two-dimensional standard character sequence matrix whose length is the sequence length corresponding to the standard character sequence and whose width is a specified length, and to pass this matrix into the conditional random field;

a second acquisition device, used for the conditional random field to compute the conditional probability of the structural model and obtain the loss value of the structural model, and to update the weights of each layer of the structural model with the backpropagation algorithm to optimize the loss value.

Preferably, the structuring device includes the following devices:

a segmentation device, used to perform a word segmentation operation on the medical text data to obtain a text segmentation result and to perform a word segmentation operation on the plurality of medical entity mappings to obtain a medical entity mapping segmentation result, wherein the segmentation operation is performed by a segmentation model formed by model training on the basis of manual annotation;

a matching device, used to match the text segmentation result against the medical entity mapping segmentation result and to select a plurality of preferred medical entity mappings from the plurality of medical entity mappings based on the matching result;

a combining device, used to combine the plurality of preferred medical entity mappings to obtain the structured text.

Preferably, the control device further includes the following devices:

a conversion device, used to input the plurality of medical entity mappings into a database for conversion to obtain a plurality of standard information fragments;

a standard combining device, used to combine the plurality of standard information fragments to obtain standard structured text.

The invention performs entity extraction on medical text data with a structural model that includes a plurality of label sequences to obtain medical entity mappings, and forms the final structured text from those mappings. By using entity extraction, the invention avoids dependence on a matching dictionary, improves extraction quality, generalization ability and scalability, and reduces maintenance cost. The invention can also update the structural model automatically, which suits the continually growing specialized vocabulary of the medical industry.

Brief description of the drawings

Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 shows a flowchart of a control method for structured processing of medical text data according to a specific embodiment of the invention;

Fig. 2 shows the flow of combining the plurality of medical entity mappings to obtain the structured text according to a specific embodiment of the invention;

Fig. 3 shows a flowchart of a control method for structured processing of medical text data according to the first embodiment of the invention;

Fig. 4 shows a flowchart for training the structural model according to the second embodiment of the invention;

Fig. 5 shows a flowchart of a control method that processes medical text data to obtain standard structured text according to the fourth embodiment of the invention;

Fig. 6 shows a functional block diagram of a control device for structured processing of medical text data according to another embodiment of the invention; and

Fig. 7 shows a functional block diagram of a control device for training the structural model according to the fifth embodiment of the invention.

Detailed description of the embodiments

Fig. 1 shows a flowchart of a control method for structured processing of medical text data according to a specific embodiment of the invention; the method performs structured processing on medical text data in natural language to obtain structured text. Those skilled in the art will understand that the content of the medical text data includes both ordinary natural language and medical technical terms. Before the following steps are executed, the medical text data can be converted into a plurality of character embeddings, one per single character, so that each character is represented by a fixed-length vector for use in subsequent processing. Specifically, the medical text data can be understood as a character sequence. For example, if the medical text data is "A, B, C, D, E, F", we want a vector (usually low-dimensional) for each character: the vector for A might be [0.3 0.7], the vector for B [-0.3 0.6], and so on until every character has a vector. These vectors are all the character embeddings included in the medical text data, and subsequent operations are carried out on them. Those skilled in the art will understand that this example is for illustration only and does not limit the invention.
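The embedding lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the table `EMBEDDINGS`, the fallback `UNKNOWN` and the function `embed` are hypothetical names, and only the vectors for "A" and "B" come from the text.

```python
# Hypothetical embedding table: one fixed-length vector per character.
# The vectors for "A" and "B" are taken from the text; "C" is made up.
EMBEDDINGS = {
    "A": [0.3, 0.7],
    "B": [-0.3, 0.6],
    "C": [0.1, -0.2],
}
UNKNOWN = [0.0, 0.0]  # assumed fallback for characters missing from the table


def embed(text):
    """Return a len(text) x dim matrix (list of rows), one row per character."""
    return [list(EMBEDDINGS.get(ch, UNKNOWN)) for ch in text]


matrix = embed("ABC")  # 3 rows, 2 columns
```

The resulting two-dimensional matrix has one row per character and one column per embedding dimension, which is the form the later steps take as input.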

Step S101 is executed first: entity extraction is performed on the medical text data based on the structural model to obtain a plurality of medical entity mappings. The structural model includes a plurality of label sequences, which are formed by model training on the basis of manual annotation, and the medical text data includes a plurality of character embeddings. Specifically, the label sequences can be understood as set manually according to the professional categories of medical terminology. For example, label sequences may correspond to symptom and symptom modifier; examination item and examination result; disease and disease modifier; treatment and treatment modifier; drug and drug modifier; onset time; and so on. Those skilled in the art will understand that setting a number of label sequences is faster than building a dictionary, and the amount of data involved is far smaller than that of a dictionary. More specifically, mapping segments of the character embeddings of the medical text data to the label sequences yields the plurality of medical entity mappings. For example, suppose the text of the medical text data is "The patient came in at 9 a.m. and reported a sudden fever of 37 degrees last night, but is currently in good spirits", and the label sequences cover "onset time", "symptom" and "symptom description". After entity extraction with these label sequences, the result is "time: yesterday, symptom: fever, symptom description: 37 degrees". The medical text data is thereby condensed and can finally be entered into a medical big-data database. As the description above shows, this step does not need to recognize the context of every word in the medical text, which greatly reduces processing time. The purpose of this step is to screen the medical text based on the structural model and so contribute to the accumulation of big data.
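As a rough sketch of how per-token labels could be turned into entity mappings like those in the example, the code below merges consecutive tokens that share a label. The label names, the "O" marker for unlabeled tokens and the function `group_entities` are assumptions made for illustration; the source does not specify a labeling scheme.

```python
def group_entities(tokens, labels, outside="O"):
    """Merge runs of consecutive tokens that share a label into (label, text) pairs."""
    entities = []
    current_label, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label != current_label:
            # A run just ended: keep it unless it was unlabeled.
            if current_label not in (None, outside):
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, []
        current_tokens.append(token)
    if current_label not in (None, outside):
        entities.append((current_label, " ".join(current_tokens)))
    return entities


tokens = ["yesterday", "sudden", "fever", "to", "37", "degrees"]
labels = ["time", "O", "symptom", "O", "symptom_desc", "symptom_desc"]
entities = group_entities(tokens, labels)
```

With these inputs the tokens marked "O" are dropped and the three labeled runs become the time, symptom and symptom description entries.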

Next, step S102 is executed: the plurality of medical entity mappings are combined to obtain the structured text. This step can be understood as organizing the result of step S101. The structured text may or may not be ordered according to ordinary natural-language logic. For example, the medical entity mappings obtained in step S101 may be laid out directly in modules, that is, each medical entity mapping is assigned to a corresponding module, and each module is part of the medical big data. Continuing the example of step S101, after "time: yesterday, symptom: fever, symptom description: 37 degrees" is obtained, one way of combining gives the result "fever of 37 degrees yesterday"; another way places "yesterday", "fever" and "37 degrees" in different modules. A module may correspond to the population of a particular region, so that statistics can be computed on the medical big data of that region's population.
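The two ways of combining can be sketched as below, under the assumption that entity mappings are (label, text) pairs. The function names, the label names and the fixed ordering are hypothetical.

```python
def combine_as_sentence(entities, order=("time", "symptom", "symptom_desc")):
    """Join entity texts in a fixed, assumed order to form one line of text."""
    by_label = dict(entities)
    return " ".join(by_label[label] for label in order if label in by_label)


def combine_as_modules(entities):
    """Place each entity text in the module (bucket) named after its label."""
    modules = {}
    for label, value in entities:
        modules.setdefault(label, []).append(value)
    return modules


entities = [("time", "yesterday"), ("symptom", "fever"), ("symptom_desc", "37 degrees")]
sentence = combine_as_sentence(entities)
modules = combine_as_modules(entities)
```

The module form is the one that lends itself to aggregation, since every value of one kind ends up in the same bucket.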

Specifically, Fig. 2 shows a specific implementation of step S102, the flow of combining the plurality of medical entity mappings to obtain the structured text, which includes the following steps:

Step S1021 is executed: a word segmentation operation is performed on the medical text data to obtain a text segmentation result, and a word segmentation operation is performed on the plurality of medical entity mappings to obtain a medical entity mapping segmentation result. The segmentation operation is performed by a segmentation model, which is formed by model training on the basis of manual annotation. Specifically, to reduce the errors introduced by entity extraction, the medical entity mappings can be further optimized after entity extraction has produced them. This first requires word segmentation of both the medical text data and the medical entity mappings. Word segmentation is a common prior-art technique. One approach is based on string matching: the string is scanned, and a substring that is identical to a dictionary word counts as a match. Another approach is based on statistics and machine learning. Those skilled in the art can implement this step with existing segmentation algorithms.
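The string-matching approach mentioned above can be illustrated with forward maximum matching, one common dictionary-based variant. It is chosen here only as an example; the source does not name a specific algorithm, and the dictionary and input string are made up.

```python
def forward_max_match(text, dictionary, max_len=8):
    """Scan left to right, always taking the longest dictionary word that matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


words = forward_max_match("feverheadache", {"fever", "headache"})
```

Unknown material falls through to single characters, which is the behavior that makes dictionary-based methods weak on new technical terms.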

Next, step S1022 is executed: the text segmentation result is matched against the medical entity mapping segmentation result, and a plurality of preferred medical entity mappings are selected from the plurality of medical entity mappings based on the matching result. The purpose of this step is to proofread the medical entity mappings against the medical text data; the medical text data is the original text and is therefore the more objective reference. Preferably, in addition to proofreading against the medical text data, the medical entity mappings can also be proofread against their own logical structure, that is, the logical continuity of all the medical entity mapping segmentation results is judged, and the preferred medical entity mappings are obtained on that basis.
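One possible matching criterion is sketched below: an entity mapping is kept only if its segmentation appears as a contiguous run in the segmentation of the original text, so that its boundaries agree with word boundaries. This criterion is an assumption made for illustration; the source does not state how the match is scored, and all names here are hypothetical.

```python
def is_contiguous_run(needle, haystack):
    """True if the list `needle` occurs as a contiguous slice of the list `haystack`."""
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def select_preferred(text_words, entity_segmentations):
    """Keep entities whose segmented words line up with the text segmentation."""
    return [
        entity
        for entity, entity_words in entity_segmentations.items()
        if is_contiguous_run(entity_words, text_words)
    ]


text_words = ["yesterday", "sudden", "fever", "37", "degrees"]
entity_segmentations = {
    "fever": ["fever"],
    "37 degrees": ["37", "degrees"],
    "den fev": ["den", "fev"],  # a badly cut extraction that should be rejected
}
preferred = select_preferred(text_words, entity_segmentations)
```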

Next, step S1023 is executed: the plurality of preferred medical entity mappings are combined to obtain the structured text.

Fig. 3 shows a flowchart of a control method for structured processing of medical text data according to the first embodiment of the invention, which specifically includes the following steps:

Step S201 is executed: the medical text data is converted into a two-dimensional character-embedding matrix and input into a bidirectional long short-term memory network. The conversion can be understood with reference to the description of character embeddings given for Fig. 1; that is, the medical text data is represented as a two-dimensional matrix of vectors. To explain the bidirectional long short-term memory network, the long short-term memory network (LSTM) must be introduced first. LSTM arose to solve problems of the recurrent neural network (RNN) and achieves its function by improving the hidden layer of the RNN. An LSTM is in essence still an RNN and can be understood as an improved network on the RNN architecture, which achieves better results through the cooperation and reuse of multiple network layers. An LSTM includes at least a cell for memory, an input gate and an output gate for the input and output of parameters, and a forget gate for forgetting. On this basis, the bidirectional long short-term memory network (BiLSTM) can be understood as an improvement on the bidirectional recurrent neural network (BiRNN). A BiRNN differs from an RNN in that it can access future context as well as past context. Its basic idea is to present each training sequence forwards and backwards to two separate RNNs that are both connected to one output layer, so that the output layer has complete past and future context for every point in the input sequence. Accordingly, a BiLSTM adds bidirectional cell units to the BiRNN, as can be understood from the first half of this paragraph. Those skilled in the art will understand that these network architectures correspond to specific algorithms in practice, but that is not the focus of the invention and is not described further here.
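To make the gate structure and the bidirectional pass concrete, here is a deliberately tiny sketch in which the input, hidden state and cell state are all scalars. The weights in `W` are arbitrary illustrative numbers, not trained values, and a real network would use vectors and matrices; none of this is taken from the source.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, w):
    """One LSTM step. w maps gate name -> (input weight, hidden weight, bias)."""
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])    # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])    # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])    # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])  # candidate memory
    c = f * c + i * g          # update the memory cell
    h = o * math.tanh(c)       # expose part of the cell as the hidden state
    return h, c


def run_lstm(xs, w):
    """Run the cell over a sequence and collect the hidden state at each position."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


def bilstm(xs, w_fwd, w_bwd):
    """Pair each position's forward (past) and backward (future) hidden state."""
    forward = run_lstm(xs, w_fwd)
    backward = run_lstm(xs[::-1], w_bwd)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]


W = {"i": (0.5, 0.1, 0.0), "f": (0.4, 0.2, 1.0), "o": (0.3, 0.1, 0.0), "g": (0.8, 0.2, 0.0)}
features = bilstm([0.3, -0.3, 0.7], W, W)  # 3 positions x 2 features
```

The output has one row per input position, and its width (2 here) is set by the network rather than by the input, which corresponds to the "sequence length by specified length" matrix of step S202.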

Next, step S202 is executed: the bidirectional long short-term memory network outputs a two-dimensional medical text data matrix whose length is the sequence length corresponding to the medical text data and whose width is a specified length. The sequence length is determined by the length of the sequence of character embeddings included in the medical text data; the specified length is a manually preset value that can be set differently for different algorithms. More specifically, the two-dimensional medical text data matrix can be understood as the matrix obtained after the BiLSTM has processed the vectors of all the character embeddings.

Next, step S203 is executed: the two-dimensional medical text data matrix is passed into a conditional random field, and the character embeddings corresponding to the highest-scoring label sequence are mapped as the medical entities. The score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score. Specifically, the conditional random field (CRF) is one of the common algorithms of natural language processing, often used for syntactic analysis, named entity recognition, part-of-speech tagging and the like. It uses a Markov chain as the score transition model of the hidden variables and infers the hidden variables from the observable states; the scores are obtained from statistics over an annotated corpus, and it is a discriminative model. In essence a CRF combines the Markov chain of the hidden variables with the scores of the observable states for the hidden variables. In prior-art applications, taking part-of-speech scoring as an example, the part-of-speech labels are assumed to satisfy the Markov property: the current part of speech has a score transition relationship only with the previous part of speech and is independent of the parts of speech at other positions. For example, an adjective followed by an adjective scores 0.5, an adjective followed by a modifying particle scores 0.5, and an adjective followed by a verb scores 0.

Furthermore, following the description of the CRF and the prior art above: in typical CRF applications, judgments are made on adjacent targets; in the part-of-speech example, the judgment is limited to the score transition relationship between the current part of speech and the previous or next one. In this step, the CRF is instead used to score how well character embeddings match the label sequences. Unlike the prior art, this step makes a global judgment rather than one based on front-back position (adjacency): judging whether a character embedding can be mapped as a medical entity requires considering the relationship between the label sequence and all the label sequences included in the structural model, from which the score of that label sequence is obtained. The specific algorithm can be implemented with reference to the prior art. This scheme is possible because the invention does not need to obtain the textual value of every character embedding; it only needs the CRF to screen out the character embeddings that match the label sequences. More specifically, the prior art would make judgments among all the character embeddings, whereas the invention converts this into judgments between character embeddings and label sequences. This is the process of judging "based on the structural model according to the global information of the medical text data". Those skilled in the art will understand that the amount of label sequence data is far smaller than the amount of character embedding data, so avoiding judgments based on front-back position (adjacency) can significantly improve computational efficiency. At the same time, using entity extraction avoids dependence on a matching dictionary, improves extraction quality, generalization ability and scalability, and reduces maintenance cost.
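For orientation, the sketch below shows how a highest-scoring label sequence is commonly selected in a linear-chain CRF, using Viterbi decoding. This is the standard textbook decoder and an assumption here: the source leaves its scoring and decoding algorithm to the prior art, and the per-position scores (`emissions`) and label-to-label scores (`transitions`) are made-up numbers.

```python
def viterbi(emissions, transitions):
    """Return (best label path, its total score).

    emissions[t][k]:   score of label k at position t (e.g. from the BiLSTM output)
    transitions[j][k]: score of moving from label j to label k
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_labels):
            best_prev = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best_last = max(range(n_labels), key=lambda k: score[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]


# Two labels for illustration: 0 = no entity, 1 = symptom.
emissions = [[2.0, 0.0], [0.0, 3.0], [0.5, 1.0]]
transitions = [[0.0, 0.0], [0.0, 1.0]]
best_path, best_score = viterbi(emissions, transitions)
```

Although each update only looks at neighboring labels, the returned path maximizes the score of the whole sequence, so the selection is global over the input.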

Next, step S204 is executed: the plurality of medical entity mappings are combined to obtain the structured text. This step can be understood with reference to step S102.

Fig. 4 shows a flowchart for training the structural model according to the second embodiment of the invention. The training process can be carried out before step S101 shown in Fig. 1 is executed, and includes the following steps:

Step S301 is executed first: a standard character sequence is converted into a two-dimensional character-embedding matrix and input into the bidirectional long short-term memory network. "Standard character sequence" here can be read in the same way as "medical text data" in step S201, so this step can be understood with reference to the description of step S201.

Next, step S302 is executed: the bidirectional long short-term memory network outputs a two-dimensional standard character sequence matrix whose length is the sequence length corresponding to the standard character sequence and whose width is a specified length, and this matrix is passed into the conditional random field. "Standard character sequence" here can be read in the same way as "medical text data" in step S202, so this step can be understood with reference to the description of step S202.

Next, step S303 is executed: the conditional random field computes the conditional probability of the structural model and obtains the loss value of the structural model, and the weights of each layer of the structural model are updated with the backpropagation algorithm to optimize the loss value. Specifically, the standard character sequences of steps S301 and S302 can be understood as training samples and can be varied according to the actual training situation. The conditional random field computing the conditional probability of the label sequences can be understood as follows: given a conditional random field P(Y | X), an input sequence x and an output sequence y, compute the conditional probabilities P(Y_i = y_i | x) and P(Y_{i-1} = y_{i-1}, Y_i = y_i | x) and the corresponding mathematical expectations, where x and y are the scores corresponding to the label sequences; the specific formulas can be implemented with reference to the prior art. Accordingly, after the conditional probabilities of all the label sequences of the structural model have been obtained, the mean squared error of all these conditional probabilities is calculated. The loss value of the structural model is obtained from this mean squared error and is used to judge whether the current structural model is good enough.
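For reference, the quantities named above can be written out in the standard linear-chain CRF formulation. This is an assumption for illustration, since the source defers its formulas to the prior art; the symbols E (per-position label scores), T (label-to-label transition scores) and s (sequence score) are introduced here and do not appear in the source.

```latex
% Score of a label sequence y = (y_1, ..., y_n) for input x
s(x, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=2}^{n} T_{y_{i-1}, y_i}

% Conditional probability of the whole sequence
P(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}

% Marginals referred to in the text
P(Y_i = y_i \mid x) = \sum_{y' :\, y'_i = y_i} P(y' \mid x)

P(Y_{i-1} = y_{i-1},\, Y_i = y_i \mid x) = \sum_{y' :\, y'_{i-1} = y_{i-1},\; y'_i = y_i} P(y' \mid x)
```

In many implementations the training loss for this model is the negative log-likelihood, -log P(y | x); the text above instead describes a loss based on the mean squared error of the conditional probabilities.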

Furthermore, after the loss value is obtained it is also optimized, to improve the accuracy of the judgment of the current structural model. Specifically, backpropagation is currently the most commonly used and most effective algorithm for training artificial neural networks (ANN). Its main idea is as follows: the training data is fed to the input layer of the ANN, passes through the hidden layers, and reaches the output layer, which outputs a result; this is the forward propagation of the ANN. Because the output of the ANN differs from the actual result, the error between the estimated value and the actual value is computed and propagated backwards from the output layer through the hidden layers to the input layer. During backpropagation the parameters are adjusted according to the error. This process is iterated until convergence.
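The forward pass, backward pass and repeat-until-convergence loop can be sketched on a toy network with one hidden unit and scalar weights. The network shape, learning rate, stopping tolerance and the sample data are all illustrative assumptions and are unrelated to the BiLSTM-CRF model itself.

```python
import math


def train(samples, w1=0.5, w2=0.5, lr=0.1, max_epochs=2000, tol=1e-6):
    """Fit y ~ w2 * tanh(w1 * x) by stochastic gradient descent with backpropagation."""
    history = []  # mean squared error per epoch
    for _ in range(max_epochs):
        total = 0.0
        for x, y in samples:
            # Forward propagation: input -> hidden -> output.
            h = math.tanh(w1 * x)
            y_hat = w2 * h
            error = y_hat - y
            total += error ** 2
            # Backpropagation: push the error from the output back to each weight.
            grad_out = 2.0 * error
            grad_w2 = grad_out * h
            grad_h = grad_out * w2
            grad_w1 = grad_h * (1.0 - h * h) * x
            # Adjust the parameters according to the error.
            w1 -= lr * grad_w1
            w2 -= lr * grad_w2
        history.append(total / len(samples))
        # Stop once the loss no longer changes noticeably (treated as convergence).
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return w1, w2, history


SAMPLES = [(0.5, 0.4), (-1.0, -0.7), (1.5, 0.8)]
w1, w2, history = train(SAMPLES)
```

The `history` list records the loss after each pass over the samples, which is the quantity one would watch to decide that training has converged.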

Those skilled in the art will appreciate that, to obtain a structural model that approaches the ideal, steps S301, S302 and S303 must be repeated until the structural model converges; that is, steps S301 to S303 constitute one training cycle.

Fig. 5 shows the fourth embodiment of the present invention: a flow chart of a control method for structuring medical text data to obtain standard structured text, which specifically comprises the following steps:

Step S501 is executed: entity extraction is performed on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple label sequences, the label sequences are formed by manual annotation, and the medical text data includes multiple word embeddings. Step S502 is executed: the multiple medical entity mappings are combined to obtain the structured text. Steps S501 and S502 can be understood with reference to steps S101 and S102.

Further, step S503 is executed: the multiple medical entity mappings are input into a database and converted to obtain multiple standard information segments. Those skilled in the art will appreciate that, to make the format and content of the final structured text more uniform, a database can be established in advance and used to convert the medical entity mappings into standard information. For example, each mapping can be converted into the format [type, description, value, time, additional information], so that every dimension of information in the medical text is expressed in a standard way.

Further, step S504 is executed: the multiple standard information segments are combined to obtain the standard structured text. Specifically, the segments may be combined according to the logical structure of natural language, or combined without regard to that logical structure; either way achieves the object of the present invention.
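Steps S503 and S504 can be pictured with a small lookup-based sketch. It is only an illustration under assumptions: the in-memory table `STANDARD_TERMS` stands in for the pre-established database, and the entity dictionary keys (`label`, `text`, `value`, `time`, `note`), the `StandardSegment` class, the helper names and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StandardSegment:
    """One record in the [type, description, value, time, additional information] format."""
    type: str
    description: str
    value: Optional[str] = None
    time: Optional[str] = None
    additional_info: Optional[str] = None

# Hypothetical stand-in for the database: (label, surface text) -> (type, description).
STANDARD_TERMS = {
    ("symptom", "high temp"): ("symptom", "fever"),
    ("symptom", "fever"): ("symptom", "fever"),
    ("drug", "ASA"): ("medication", "aspirin"),
}

def to_standard(entity):
    """Step S503 (sketch): convert one extracted entity into a standard segment."""
    key = (entity["label"], entity["text"])
    # Entries missing from the table fall back to the extracted label and text.
    std_type, description = STANDARD_TERMS.get(key, key)
    return StandardSegment(std_type, description,
                           entity.get("value"), entity.get("time"), entity.get("note"))

def combine(segments):
    """Step S504 (sketch): combine segments into rows, keeping their given order."""
    return [[s.type, s.description, s.value, s.time, s.additional_info]
            for s in segments]

# Made-up sample entities, only to exercise the functions.
ENTITIES = [
    {"label": "drug", "text": "ASA", "value": "100 mg"},
    {"label": "symptom", "text": "high temp", "value": "38.5 C", "time": "day 2"},
]
rows = combine([to_standard(e) for e in ENTITIES])
```

Here `combine` keeps the segments in input order, which corresponds to the option of combining without regard to the logical structure of natural language.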

As another embodiment of the invention, Fig. 6 shows a functional block diagram of a control device for structuring medical text data, which is used to structure medical text data in natural language to obtain structured text, comprising:

An entity extraction device 10, which is used to perform entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes multiple word embeddings.

A structuring device 20, which is used to combine the multiple medical entity mappings to obtain the structured text.

A conversion device 30, which is used to input the multiple medical entity mappings into a database for conversion to obtain multiple standard information segments.

A standard combination device 40, which is used to combine the multiple standard information segments to obtain standard structured text.

Preferably, the entity extraction device 10 includes the following devices:

A first input device 101, which is used to convert the medical text data into a two-dimensional word-embedding matrix and input it into a bidirectional long short-term memory network;

A first output device 102, which is used for the bidirectional long short-term memory network to output a two-dimensional medical text data matrix whose length is the sequence length of the medical text data and whose width is a specified length;

A first acquisition device 103, which is used to pass the two-dimensional medical text data matrix into a conditional random field and to obtain the word embeddings whose score is greater than a first score threshold as the medical entity mappings, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and the score corresponds to the word embedding.
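The decoding stage of devices 101 to 103 can be sketched with the Viterbi algorithm, which finds the label sequence with the highest score under a linear-chain conditional random field (the maximum-score variant stated in the claims, rather than the threshold variant above). This is a sketch under assumptions: the `emissions` matrix is written by hand instead of being produced by a trained bidirectional long short-term memory network, and the label set, `transitions` values, tokens and function name are hypothetical.

```python
def viterbi(emissions, transitions):
    """Return (best label path, its score) for a linear-chain CRF.

    emissions[t][k]  : score of label k for word t (sequence length x label count)
    transitions[j][k]: score of label j being followed by label k
    """
    T, K = len(emissions), len(transitions)
    score = list(emissions[0])   # best score of any path ending in each label
    back = []                    # back[t-1][k] = best previous label for k at t
    for t in range(1, T):
        new_score, pointers = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k] + emissions[t][k])
        score = new_score
        back.append(pointers)
    last = max(range(K), key=lambda k: score[k])
    best_score = score[last]
    # Follow the back-pointers from the last position to the first.
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score

# Hypothetical example: two labels and a three-word sentence.
LABELS = ["O", "SYMPTOM"]
TOKENS = ["patient", "has", "fever"]
emissions = [[2.0, 0.1], [1.5, 0.2], [0.3, 2.5]]   # assumed network output
transitions = [[0.5, 0.0], [0.0, 0.5]]             # assumed CRF parameters

path, best = viterbi(emissions, transitions)
# Words whose decoded label is not "O" are kept as (label, word) entity mappings.
entities = [(LABELS[k], tok) for k, tok in zip(path, TOKENS) if LABELS[k] != "O"]
```

Because the transition scores are part of every path score, the label chosen for one word depends on the whole sentence, which is one way to read the reference to global information.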

Preferably, the structuring device 20 further includes the following devices:

A word segmentation device 201, which is used to perform a word segmentation operation on the medical text data to obtain a text segmentation result, and to perform a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping segmentation result;

A matching device 202, which is used to match the text segmentation result against the medical entity mapping segmentation result, and to filter multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;

A combination device 203, which is used to combine the multiple preferred medical entity mappings to obtain the structured text.
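One possible reading of devices 201 and 202 is sketched here: a candidate entity is kept only if its segmented words appear as a contiguous run in the segmented text, so candidates that cut through a word are filtered out. The matching rule itself is an assumption, since the description does not fix it, and `segment` is a hypothetical whitespace splitter standing in for the trained word segmentation model.

```python
def segment(text):
    """Hypothetical stand-in for the trained segmentation model: split on whitespace."""
    return text.lower().split()

def contains_run(tokens, run):
    """True if `run` occurs as a contiguous slice of `tokens`."""
    n = len(run)
    return n > 0 and any(tokens[i:i + n] == run
                         for i in range(len(tokens) - n + 1))

def filter_entities(text, candidates):
    """Keep candidates whose segmentation lines up with the text segmentation."""
    text_tokens = segment(text)
    return [c for c in candidates if contains_run(text_tokens, segment(c))]

# Made-up example: "art failure" is a substring of the text but breaks the word
# "heart", so it does not match the text segmentation and is dropped.
TEXT = "patient reports chest pain and heart failure"
CANDIDATES = ["chest pain", "art failure", "heart failure"]
preferred = filter_entities(TEXT, CANDIDATES)
```

The retained candidates play the role of the preferred medical entity mappings that the combination device 203 then combines.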

Fig. 7 shows the fifth embodiment of the present invention: a functional block diagram of a control device for training the structural model, comprising:

A second input device 104, which is used to convert a standard word sequence into a two-dimensional word-embedding matrix and input it into a bidirectional long short-term memory network;

A second output device 105, which is used for the bidirectional long short-term memory network to output a two-dimensional standard word sequence matrix whose length is the sequence length of the standard word sequence and whose width is a specified length, and to pass the two-dimensional standard word sequence matrix into a conditional random field;

A second acquisition device 106, which is used for the conditional random field to compute the conditional probability of the structural model and obtain the loss value of the structural model, with the backpropagation algorithm used to update the weights of each layer of the structural model and to optimize the loss value.

Those skilled in the art will appreciate that the functions of each device in Fig. 6 and Fig. 7 above may be implemented in hardware, in software executed by a processor, or in a combination of the two. Specifically, if they are implemented as a software module, the program can be burned into the processor in advance, or the software can be installed into a preset system; if they are implemented in hardware, a field-programmable gate array (FPGA) can be used to implement the corresponding functions in fixed form.

Further, the software module may be stored in RAM, flash memory, ROM, EPROM, a hard disk, or any other form of storage medium known in the art. The storage medium is coupled to the processor so that the processor can read information from the storage medium and write information to it. As a variation, the storage medium may be an integral part of the processor, or the processor and the storage medium may both reside on an application-specific integrated circuit (ASIC).

Further, the hardware may be a general-purpose processor capable of implementing the specific functions, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or a combination of the above. As a variation, it may also be implemented by a combination of computing devices, for example a combination of a DSP and a microprocessor, a combination of multiple microprocessors, or one or more microprocessors combined in communication with a DSP.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described above; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A control method for structuring medical text data, used to structure medical text data in natural language to obtain structured text, characterized in that it comprises the following steps:

a. performing entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes multiple word embeddings;

2. The control method according to claim 1, characterized in that step a comprises the following steps:

a2. the bidirectional long short-term memory network outputs a two-dimensional medical text data matrix whose length is the sequence length of the medical text data and whose width is a specified length;

a3. the two-dimensional medical text data matrix is passed into a conditional random field, and the word embeddings corresponding to the label sequence with the maximum score are obtained as the medical entity mappings, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.

3. The control method according to claim 2, characterized in that the following steps are also performed before step a:

ii. the bidirectional long short-term memory network outputs a two-dimensional standard word sequence matrix whose length is the sequence length of the standard word sequence and whose width is a specified length, and the two-dimensional standard word sequence matrix is passed into a conditional random field;

iii. the conditional random field computes the conditional probability of the structural model and obtains the loss value of the structural model, and the backpropagation algorithm is used to update the weights of each layer of the structural model and to optimize the loss value;

iv. steps i, ii and iii are repeated until the structural model converges.

4. The control method according to any one of claims 1 to 4, characterized in that step b comprises the following steps:

b1. performing a word segmentation operation on the medical text data to obtain a text segmentation result, and performing a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping segmentation result, where the word segmentation operation is completed by a word segmentation model, and the word segmentation model is formed by model training on the basis of manual annotation;

b2. matching the text segmentation result against the medical entity mapping segmentation result, and filtering multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;

5. The control method according to any one of claims 1 to 5, characterized in that the following steps are included after step b:

6. A control device for structuring medical text data, used to structure medical text data in natural language to obtain structured text, characterized in that it comprises:

an entity extraction device, which is used to perform entity extraction on the medical text data based on a structural model to obtain multiple medical entity mappings, where the structural model includes multiple label sequences, the label sequences are formed by model training on the basis of manual annotation, and the medical text data includes multiple word embeddings;

7. The control device according to claim 6, characterized in that the entity extraction device includes the following devices:

a first input device, which is used to convert the medical text data into a two-dimensional word-embedding matrix and input it into a bidirectional long short-term memory network;

a first output device, which is used for the bidirectional long short-term memory network to output a two-dimensional medical text data matrix whose length is the sequence length of the medical text data and whose width is a specified length;

a first acquisition device, which is used to pass the two-dimensional medical text data matrix into a conditional random field and to obtain the word embeddings corresponding to the label sequence with the maximum score as the medical entity mappings, where the score is determined by the conditional random field based on the structural model according to the global information of the medical text data, and each label sequence corresponds to one score.

8. The control device according to claim 7, characterized in that the control device further includes the following devices:

a second input device, which is used to convert a standard word sequence into a two-dimensional word-embedding matrix and input it into a bidirectional long short-term memory network;

a second output device, which is used for the bidirectional long short-term memory network to output a two-dimensional standard word sequence matrix whose length is the sequence length of the standard word sequence and whose width is a specified length, and to pass the two-dimensional standard word sequence matrix into a conditional random field;

a second acquisition device, which is used for the conditional random field to compute the conditional probability of the structural model and obtain the loss value of the structural model, with the backpropagation algorithm used to update the weights of each layer of the structural model and to optimize the loss value.

9. The control device according to any one of claims 6 to 8, characterized in that the structuring device includes the following devices:

a word segmentation device, which is used to perform a word segmentation operation on the medical text data to obtain a text segmentation result, and to perform a word segmentation operation on the multiple medical entity mappings to obtain a medical entity mapping segmentation result, where the word segmentation operation is completed by a word segmentation model, and the word segmentation model is formed by model training on the basis of manual annotation;

a matching device, which is used to match the text segmentation result against the medical entity mapping segmentation result, and to filter multiple preferred medical entity mappings out of the multiple medical entity mappings based on the matching result;

10. The control device according to any one of claims 6 to 9, characterized in that the control device further includes the following devices:

a conversion device, which is used to input the multiple medical entity mappings into a database for conversion to obtain multiple standard information segments;