CN103853710A - Coordinated training-based dual-language named entity identification method - Google Patents

Info

Publication number
CN103853710A
Authority
CN
China
Prior art keywords
named entity
bilingual
language
mark
target language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310593746.3A
Other languages
Chinese (zh)
Other versions
CN103853710B (en
Inventor
黄河燕
史树敏
李业刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201310593746.3A priority Critical patent/CN103853710B/en
Publication of CN103853710A publication Critical patent/CN103853710A/en
Application granted granted Critical
Publication of CN103853710B publication Critical patent/CN103853710B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a named entity recognition method based on bilingual co-training and belongs to the field of natural language processing in computer science. The Chinese and English sides of a parallel sentence corpus are treated as two different views of one dataset for bilingual co-training. During label projection, a log-linear model corrects the projected labels, and when a model predicts unseen examples, the bilingual-alignment label agreement rate of named entities serves as the measure for estimating label confidence. Compared with the prior art, the method reduces the domain dependence of named entity recognition, combines the strengths of recognition in both languages, resolves part of the ambiguity found in monolingual recognition, and is particularly suited to recognizing named entities in both languages simultaneously on large-scale corpora.

Description

A bilingual named entity recognition method based on co-training
Technical field
The present invention relates to a method for recognizing bilingual named entities. It is particularly suited, as a preprocessing stage for machine translation, to recognizing named entities in large-scale cross-domain bilingual corpora, and belongs to the field of natural language processing (NLP) in computer science.
Background
A named entity is the proper name of a unique individual. Named entity recognition is a fundamental and difficult problem in natural language processing and has become one of the technical bottlenecks in multilingual information processing fields such as cross-language information retrieval and machine translation.
Researchers have developed many models for named entity recognition. Because rule-based methods do not transfer well between different languages, statistical methods have received wide attention in recent years. Among them, supervised learning performs well on named entity recognition but has two weaknesses. First, it needs a large amount of labeled data to learn accurately, so it is ill-suited to resource-poor languages. Second, its performance drops markedly when the available labeled data and the data to be labeled come from different domains. Unsupervised methods, for their part, perform unsatisfactorily. One way to address these shortcomings is to combine a small labeled corpus with a large amount of unlabeled data through co-training, a semi-supervised learning method.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art in recognizing bilingual named entities in large-scale cross-domain corpora by proposing a bilingual named entity recognition method based on co-training.
The technical solution is as follows. The two sides of a parallel Chinese-English sentence corpus are treated as two different views of one dataset, and bilingual co-training is carried out on them. On each side, an initial labeling model is trained on a small amount of labeled data, yielding two initial sequence labeling models. Each trained model labels named entities in a small portion of cross-domain unlabeled data, and the labels are then projected onto the corresponding sentences of the other language. During projection, a log-linear model that combines monolingual syntactic features with bilingual alignment features corrects the projected labels. This lowers the chance of mislabeled examples, reduces the noise passed to the other sequence labeling model, and thus improves the quality of co-training. When a sequence labeling model predicts unseen examples, the bilingual-alignment label agreement rate of named entities is used as the measure of label confidence, so confidence is estimated implicitly: the labeled set whose aligned labels agree most on the unlabeled sample becomes the labeled increment for the other side. This removes the dependence on a small labeled sample, improves the generalization of the algorithm, and so improves cross-domain named entity recognition.
To carry out bilingual collaborative recognition of named entities, the method uses three steps: labeling model initialization, bilingual co-training, and bilingual named entity labeling. As shown in Fig. 1, the process is as follows:
Step 1: initialize the sequence labeling models. Train an initial sequence labeling model for each language on a labeled, sentence-aligned Chinese-English corpus of a given size. The sequence labeling model may be a conditional random field (CRF), a maximum entropy model, or similar.
Step 2: as shown in Fig. 2, extract a number of aligned sentence pairs from the unlabeled, sentence-aligned Chinese-English corpus and label both sides with the sequence labeling models, forming a labeled sample set. Compute the bilingual label agreement rate of this sample set, and initialize the labeled-corpus increment set to empty.
The bilingual label agreement rate is the proportion of aligned word pairs, in a small unlabeled bilingual sample, that receive consistent labels after the sequence labeling models have labeled both sides.
The labeled-corpus increment set is the automatically labeled data that, at the end of one round of co-training, is added to the other model's training data as labeled corpus.
Specifically, randomly draw 10% of the sentence pairs from the labeled sample set to form a subset, and project the labels across the subset according to the word alignment. First, the projected named entity region from the source language to the target language is expanded so that it can accommodate more target-language named entity hypotheses. Then a log-linear model is built that combines the monolingual features of the target-language named entity with the alignment features of the bilingual named entity, and it is used to correct the projection result. The corrected result is used as a labeled-corpus increment to retrain the model. The retrained model labels the sample set again and the bilingual label agreement rate is recomputed. This cycle is repeated 10 times, and the labeled-corpus increment that yields the highest bilingual label agreement rate is taken as the source-language-side increment for this round of co-training. The labeled increment for the target-language side is found in the same way.
The monolingual features of a named entity are the boundary combination features of the named entity on one language side; they mainly ensure that the labeled increments used in co-training have the characteristics of named entities.
The alignment features of a bilingual named entity capture the consistency of the bilingual named entity and make full use of the complementarity of recognition in the two languages.
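The selection of one labeled increment over repeated draws can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: `select_increment`, `project_and_correct` and `retrain_and_score` are hypothetical names, the two callables stand in for the projection-correction step and for retraining followed by recomputing the agreement rate, and the best score starts at negative infinity rather than at the initial agreement rate.

```python
import random

def select_increment(sample_pairs, project_and_correct, retrain_and_score,
                     rounds=10, fraction=0.1, seed=0):
    """Keep the increment whose retrained model gives the highest agreement rate."""
    rng = random.Random(seed)
    best_score, best_increment = float("-inf"), []
    k = max(1, int(len(sample_pairs) * fraction))  # 10% of the sample set
    for _ in range(rounds):
        subset = rng.sample(sample_pairs, k)       # random draw without replacement
        increment = [project_and_correct(pair) for pair in subset]
        score = retrain_and_score(increment)       # agreement rate after retraining
        if score > best_score:
            best_score, best_increment = score, increment
    return best_score, best_increment
```

The same routine would be run once per language side, swapping the roles of source and target.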
Step 3: repeat Step 2, testing on a development set, until the algorithm converges. When the loop ends, it yields two sequence labeling models, one per language, which are the trained bilingual named entity recognition models. These can then be applied to large-scale cross-domain bilingual corpora to recognize named entities and build a named entity dictionary, or applied directly to a single sentence to be translated in order to improve machine translation quality.
Beneficial effects
The invention introduces co-training into the training of sequence labeling models for named entities and exploits both the complementarity of named entity recognition in two languages and the fact that named entities are mutual translations. Compared with the prior art, the method lets recognition in the two languages complement each other, improving precision and recall on large-scale cross-domain corpora; it reduces the dependence of named entity recognition on the domain of the labeled corpus, giving the models stronger generalization; and it produces recognition models for both languages at once, with co-training improving the consistency of recognition across the two languages, which helps the later construction of a named entity dictionary. In summary, the invention is especially suited to consistent recognition of bilingual named entities in large-scale cross-domain corpora.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a flowchart of the co-training process in the method of the invention.
Detailed description
The specific embodiment of the invention is described in further detail below with reference to the drawings.
A bilingual named entity recognition method based on co-training comprises the following steps:
Step 1: initialize the bilingual sequence labeling models. Train the Chinese and English sequence labeling models Cmodel(s) and Cmodel(t) on the labeled sets Ls and Lt of the sentence-aligned Chinese-English corpus. Three kinds of named entity are labeled in the corpus: PER (person name), LOC (place name) and ORG (organization name). The BIO tag set is used, so every token carries one of 7 tags: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG and O. For Chinese, the features are single-character features, single-word features, and combinations of characters or words over 2 to 3 positions; for English, the feature templates combine the word, its part of speech, and whether its initial letter is upper or lower case.
Step 2: extract 1000 aligned sentence pairs from the unlabeled, sentence-aligned Chinese-English sets Us and Ut, and label them with the sequence labeling models Cmodel(s) and Cmodel(t) respectively, forming a labeled sample set for each side. Compute the bilingual label agreement rate conformity_ratio of the sample sets, initialize the running maximum to this value, and initialize the labeled-corpus increment set to empty.
In bilingual named entity co-training, once an incorrectly labeled increment is selected, the error is learned and reinforced in later rounds, degrading the co-training algorithm. The algorithm therefore needs effective safeguards against introducing noisy data. Named entities are mutual translations, so correctly recognized Chinese and English named entities should carry consistent labels. The alignment label agreement rate is therefore used as the measure for selecting labeled increments. It is computed as in formula (1):
$$\mathrm{conformity\_ratio} = \frac{1}{n} \sum_{U} \frac{1}{K} \sum_{k=1}^{K} \mathrm{conformity}(ws_i, wt_j)_k \tag{1}$$
where
$$\mathrm{conformity}(ws_i, wt_j)_k = \begin{cases} 1 & T(ws_i) = T(wt_j) \\ 0 & T(ws_i) \neq T(wt_j) \end{cases}$$
Here $(ws_i, wt_j)_k$ denotes the k-th ($1 \le k \le K$) aligned word pair of a parallel sentence pair; $T(ws_i)$ and $T(wt_j)$ are the labels on the Chinese and English sides of the named entity; U is the unlabeled dataset; and n is the number of sentences in U. Because Chinese and English differ considerably in word order, the distinction between the "B" and "I" tags is ignored when computing the agreement rate, and the two are treated as the same label.
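As a rough illustration of formula (1), the following sketch computes the agreement rate from already-aligned tag pairs. The function names and the input layout (one list of `(source_tag, target_tag)` pairs per sentence pair) are assumptions made for the example, as is the choice to skip sentence pairs that have no aligned words, a case the formula leaves undefined.

```python
from typing import List, Tuple

def entity_type(tag: str) -> str:
    """Drop the BIO prefix so that B-PER and I-PER compare as equal."""
    return tag.split("-", 1)[1] if "-" in tag else tag

def conformity_ratio(corpus: List[List[Tuple[str, str]]]) -> float:
    """Mean over sentence pairs of the share of aligned word pairs whose labels agree."""
    per_sentence = []
    for pairs in corpus:
        if not pairs:  # assumption: skip sentence pairs with no aligned words (K = 0)
            continue
        agree = sum(entity_type(s) == entity_type(t) for s, t in pairs)
        per_sentence.append(agree / len(pairs))
    return sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
```

For example, a sentence pair whose two aligned word pairs both agree and a second pair where one of two agrees give rates of 1.0 and 0.5, so the corpus-level rate is their mean.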
Randomly draw 100 sentence pairs from the labeled sample set to form a subset, and project the labels from one side of the subset to the other according to the word alignment. Because Chinese and English differ considerably, target-language named entities obtained by label projection alone are partly unsatisfactory. The projection result is therefore corrected by combining the monolingual features of the target-language named entity with the alignment features of the bilingual named entity. First, the projected named entity region from the source language to the target language is expanded so that it can accommodate more target-language named entity hypotheses; each projected hypothesis, together with the source-language named entity, forms one bilingual named entity hypothesis.
For any named entity on the source side of the subset, projecting its words gives a set of target-side words. The contiguous block of target words that contains the projection's center word is taken as the minimum candidate region. The region covering all projected words, extended outward by 4 words at each end (fewer when the start or end of the sentence is reached), is taken as the maximum candidate region.
On the target-language side, a sliding window starts from the minimum candidate region and keeps extending by one word toward either side of the sentence until it reaches the boundary of the maximum candidate region, producing a series of candidate target-language named entity hypotheses. Each of these, combined with the source-language named entity, forms one bilingual named entity hypothesis.
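The window expansion can be sketched as an enumeration of spans between the minimum and maximum candidate regions. This is only an illustration: `candidate_spans` is a hypothetical helper, spans are inclusive 0-based word indices, and the minimum region is assumed to lie inside the projected region.

```python
def candidate_spans(min_span, proj_span, sent_len, pad=4):
    """List every target-side span that contains min_span and stays inside
    the projected region widened by `pad` words (clipped to the sentence)."""
    min_lo, min_hi = min_span
    max_lo = max(0, proj_span[0] - pad)             # clip at sentence start
    max_hi = min(sent_len - 1, proj_span[1] + pad)  # clip at sentence end
    return [(lo, hi)
            for lo in range(min_lo, max_lo - 1, -1)  # grow leftwards
            for hi in range(min_hi, max_hi + 1)]     # grow rightwards
```

Each returned span would then be paired with the source-language named entity to form one bilingual hypothesis.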
Then a log-linear model is constructed that combines the syntactic confidence of the target-language named entity with the alignment confidence of the bilingual named entity to score every bilingual named entity hypothesis.
Monolingual syntactic confidence of the named entity. To ensure that the projected target-language named entity has the syntactic characteristics of a named entity, the left and right boundary distribution probabilities are used as the syntactic confidence of the target-language named entity. The boundary distribution probability consists of the left-boundary and the right-boundary part-of-speech bigram co-occurrence frequencies. The left-boundary frequency is defined in formula (2):
$$P(ENTxl \mid \widetilde{ENTx}_a^b, S) = \max\left( \frac{\mathrm{count}(t_i, t_{i+1}, lw)}{\mathrm{count}(lw)}, \frac{\mathrm{count}(t_{i-1}, t_i, lw)}{\mathrm{count}(lw)} \right) \tag{2}$$
The right-boundary frequency is defined in formula (3):
$$P(ENTxr \mid \widetilde{ENTx}_a^b, S) = \max\left( \frac{\mathrm{count}(t_i, t_{i+1}, rw)}{\mathrm{count}(rw)}, \frac{\mathrm{count}(t_{i-1}, t_i, rw)}{\mathrm{count}(rw)} \right) \tag{3}$$
Here $t_i$, $t_{i-1}$ and $t_{i+1}$ are the parts of speech of the boundary word $w_i$, of the preceding word $w_{i-1}$ and of the following word $w_{i+1}$; count(*, *, *) is the number of times a part-of-speech bigram combination of the boundary word $w_i$ occurs in the corpus; and count($rw_i$) and count($lw_i$) are the numbers of times the right and left boundaries occur in the corpus. The data are smoothed with Katz back-off, computed as in formula (4):
$$P_{smooth}(t_i \mid t_{i-n+1}^{i-1}) = \begin{cases} P_{smooth} & \text{if } C(t_{i-n+1}^{i-1}) > 0 \\ \gamma(t_{i-n+1}^{i-1}) & \text{if } C(t_{i-n+1}^{i-1}) = 0 \end{cases} \tag{4}$$
Combining the left and right boundary information, the monolingual syntactic confidence of the projected named entity is computed as in formula (5):
$$P(ENTx \mid \widetilde{ENTx}, S) = P(ENTxl \mid \widetilde{ENTx}_a^b, S)\, P(ENTxr \mid \widetilde{ENTx}_a^b, S) \tag{5}$$
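A small sketch of formulas (2), (3) and (5) follows, assuming the counts have already been collected into `collections.Counter` objects keyed as shown. The helper names and the key layout are illustrative assumptions, and the Katz back-off smoothing of formula (4) is deliberately left out: an unseen boundary word simply gets probability 0 here.

```python
from collections import Counter

def boundary_prob(bigram_counts: Counter, word_counts: Counter,
                  word: str, t_prev: str, t_cur: str, t_next: str) -> float:
    """max of the two POS-bigram relative frequencies for one boundary word."""
    total = word_counts[word]
    if total == 0:
        return 0.0  # simplification: no smoothing in this sketch
    forward = bigram_counts[(t_cur, t_next, word)]   # count(t_i, t_{i+1}, w)
    backward = bigram_counts[(t_prev, t_cur, word)]  # count(t_{i-1}, t_i, w)
    return max(forward, backward) / total

def syntax_confidence(left_prob: float, right_prob: float) -> float:
    """Formula (5): product of the left- and right-boundary probabilities."""
    return left_prob * right_prob
```

In use, `boundary_prob` would be called once with left-boundary counts and once with right-boundary counts, and the two results multiplied.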
A maximum entropy model can combine features of different types. For the alignment confidence of a bilingual named entity, define the feature functions $f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES)$, $m = 1, 2, \ldots, M$, and model the confidence with a maximum entropy model as in formula (6). Each feature function $f_m$ has a corresponding model parameter $\lambda_m$, $m = 1, 2, \ldots, M$.
$$P(a_k \mid ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) = \frac{\exp\left( \sum_{m=1}^{M} \lambda_m f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) \right)}{\sum_{A} \exp\left( \sum_{m=1}^{M} \lambda_m f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) \right)} \tag{6}$$
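Formula (6) has the form of a softmax over candidate alignments. A minimal sketch, assuming each candidate has already been reduced to a feature vector and that the weights $\lambda_m$ are given (how they are trained is not covered here); `alignment_probs` is a hypothetical name.

```python
import math

def alignment_probs(feature_vectors, weights):
    """Normalized exp of the weighted feature sums, one probability per candidate."""
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feature_vectors]
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                               # normalizer over all candidates
    return [e / z for e in exps]
```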
Three features are used to model the alignment confidence of a bilingual named entity: the part-of-speech combination co-occurrence feature, the mutual translation feature, and the length association feature.
The part-of-speech combination co-occurrence feature is the co-occurrence frequency, over the whole corpus, of the corresponding Chinese and English part-of-speech combinations in a bilingual named entity. It is computed as in formula (7):
$$f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) = f_m(a_k, t\_ENTc_a^b, t\_\widetilde{ENTe}_c^d, CS, ES) = \frac{\mathrm{count}(t\_ENTc_a^b, t\_\widetilde{ENTe}_c^d)}{\mathrm{count}(*, *)} \tag{7}$$
where $\mathrm{count}(t\_ENTc_a^b, t\_\widetilde{ENTe}_c^d)$ is the number of times the named entity part-of-speech combinations co-occur in the corpus, and count(*, *) is the number of named entities in the corpus.
For a candidate bilingual named entity, the translation probabilities in the two directions between the source-language named entity and the projected target-language named entity are written $P(ENTc_a^b \mid \widetilde{ENTe}_c^d)$ and $P(\widetilde{ENTe}_c^d \mid ENTc_a^b)$. The mutual translation feature is given by formula (8):
$$f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) = \log P(ENTc_a^b \mid \widetilde{ENTe}_c^d) + \log P(\widetilde{ENTe}_c^d \mid ENTc_a^b) \tag{8}$$
For an optimal bilingual named entity, the difference in length between its two sides approximately follows a standard normal distribution. The length association feature is defined in formula (9):
$$f_m(a_k, ENTc_a^b, \widetilde{ENTe}_c^d, CS, ES) \approx f_m(a_k, |ENTc_a^b|, |\widetilde{ENTe}_c^d|) = \frac{|ENTc_a^b| - \delta\, |\widetilde{ENTe}_c^d|}{(|ENTc_a^b| + 1)\, \sigma^2} \tag{9}$$
where
$$\delta = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{count}(ENTe_i)}{\mathrm{count}(ENTc_i)}, \qquad \sigma^2 = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\mathrm{count}(ENTe_j)}{\mathrm{count}(ENTc_j)} - \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{count}(ENTe_i)}{\mathrm{count}(ENTc_i)} \right)^2$$
and count(*) is the number of characters in *: the number of letters for English and the number of Chinese characters for Chinese.
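The statistics and the feature of formula (9) can be sketched as follows. The function names are hypothetical, lengths are passed in as plain integers (character counts as defined above), and the variance is assumed to be non-zero.

```python
def length_stats(length_pairs):
    """length_pairs: (chinese_len, english_len) for known aligned entity pairs.
    Returns the mean English/Chinese length ratio and its variance."""
    ratios = [e_len / c_len for c_len, e_len in length_pairs]
    n = len(ratios)
    delta = sum(ratios) / n
    sigma2 = sum((r - delta) ** 2 for r in ratios) / n
    return delta, sigma2

def length_feature(c_len, e_len, delta, sigma2):
    """Formula (9) as written: (|c| - delta * |e|) / ((|c| + 1) * sigma^2)."""
    return (c_len - delta * e_len) / ((c_len + 1) * sigma2)
```

With pairs of lengths (2, 4) and (2, 8), the ratios are 2 and 4, so delta is 3 and the variance is 1.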
Each hypothesis $h_i$ in the expanded set of bilingual named entity hypotheses is scored as in formula (10):
$$\mathrm{score}(h_i) = \log P(h_i \mid ENTc, \widetilde{ENTe}, CS, ES) + \log P(ENTe \mid \widetilde{ENTe}, S) \tag{10}$$
Finally, a greedy search yields the optimal set of bilingual named entity hypotheses for the sentence pair. The optimal projection of a source-language named entity onto the target side is the target-language named entity that forms the optimal bilingual named entity hypothesis with it. All expanded bilingual named entity hypotheses in the sentence pair are scored with formula (10), and the optimal set for the sentence pair is selected by the following greedy procedure, which gives the optimal target-language named entity projection:
First, initialize the optimal set of bilingual named entity hypotheses to empty;
Then, compute $\mathrm{score}(h_i)$ for every bilingual named entity hypothesis in the sentence pair according to formula (10), and sort the hypotheses in descending order of score;
Next, take in turn an expanded hypothesis $h_i$ that has no boundary conflict with any bilingual named entity already in the current optimal set, and add it to the optimal set. Repeat this step until no expanded hypothesis satisfying the condition remains.
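The greedy procedure above can be sketched as follows. The representation is an assumption for illustration: each hypothesis is a `(score, source_span, target_span)` tuple with inclusive word-index spans, and "no boundary conflict" is read as "overlaps no already chosen span on either language side".

```python
def overlaps(a, b):
    """True when two inclusive (start, end) spans share at least one word."""
    return a[0] <= b[1] and b[0] <= a[1]

def greedy_select(hypotheses):
    """Pick hypotheses in descending score order, skipping any that conflict
    with one already chosen on the source or the target side."""
    chosen = []
    for hyp in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        _, src, tgt = hyp
        if all(not overlaps(src, c[1]) and not overlaps(tgt, c[2]) for c in chosen):
            chosen.append(hyp)
    return chosen
```

A single pass over the sorted list is equivalent to repeating the selection step until no admissible hypothesis remains, because a hypothesis rejected once stays in conflict with the growing set.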
Then the sentences in the subset are projected and corrected in turn to form the projection result. The sequence labeling model is retrained on the labeled set augmented with this result, the retrained Cmodel(t) labels the sample set again, and the bilingual label agreement rate is recomputed. If the new rate exceeds the running maximum, the maximum is updated and the current projection result is kept as the labeled-corpus increment. The selected increment is then added to Lt, and the sequence labeling model is retrained on Lt: Cmodel(t) ← Cmodel(Lt).
Similarly, randomly draw 100 sentence pairs from the sample set to form a subset, project the labels in the opposite direction according to the word alignment, and correct the projection result with the combined features. Retrain the sequence labeling model on the augmented labeled set, label the sample set again with Cmodel(s), and recompute the agreement rate. If the new rate exceeds the running maximum, update the maximum and keep the current result as the labeled-corpus increment. The selected increment is then added to Ls, and the sequence labeling model is retrained on Ls.
Step 3: repeat Step 2, observing the test results of the two sequence labeling models on the development set, until the algorithm converges, and output the final models Cmodel(s) and Cmodel(t). Use Cmodel(s) to recognize named entities in the source-language corpus and Cmodel(t) to recognize named entities in the target-language corpus, then compile a named entity dictionary.

Claims (5)

1. A bilingual named entity recognition method based on co-training, characterized by comprising the following steps:
Step 1: initialize the labeling models; on 2000 bilingual corpus entries labeled with named entities, train an initial named entity labeling model for Chinese and for English;
Step 2: on an unlabeled, sentence-aligned Chinese-English named entity corpus, select labeled increments by 10-fold crossing and carry out bilingual co-training, as follows:
First, randomly draw 1000 aligned sentence pairs from the unlabeled, sentence-aligned Chinese-English named entity corpus to form a sample set; using the labeling models obtained in Step 1, label named entities on both sides of the sentence pairs; compute the bilingual label agreement rate of the sample set, and initialize the labeled-corpus increment set to empty;
Then, randomly draw 10% of the sentence pairs from the sample set to form a subset; project the labels across the subset according to the word alignment, and expand the projected named entity region so that it can accommodate more target-language named entity hypotheses, each projected hypothesis forming one bilingual named entity hypothesis with the source-language named entity; next, combine the monolingual features of the target-language named entity with the alignment features of the bilingual named entity to correct the projection result, and take the corrected result as the target-side labeled-corpus increment; retrain the target-language named entity labeling model on the labeled corpus augmented with this increment, label the target side of the sample set again with the retrained model, and recompute the bilingual label agreement rate;
Repeat the above process for 10 crossings, and take the labeled-corpus increment with the highest bilingual label agreement rate in the loop as the target-side labeled-corpus increment for this round of co-training; retrain the target-language named entity labeling model on the labeled corpus augmented with it;
Use the same method to find the labeled increment for the source-language side, and retrain the source-language named entity labeling model on the labeled corpus augmented with it;
Step 3: repeat Step 2, testing on a development set, until the algorithm converges; when the loop ends, it yields the Chinese and English named entity labeling models, which are the trained bilingual named entity recognition models; finally, recognize named entities in cross-domain bilingual corpora and build a named entity dictionary.
2. The bilingual named entity recognition method based on co-training as claimed in claim 1, characterized in that the bilingual label agreement rate of the sample set is computed as follows:
Let the bilingual label agreement rate of the sample set be conformity_ratio:
$$\mathrm{conformity\_ratio} = \frac{1}{n} \sum_{U} \frac{1}{K} \sum_{k=1}^{K} \mathrm{conformity}(ws_i, wt_j)_k$$
Initialize max ← conformity_ratio, and initialize the labeled-corpus increment set to empty;
where
$$\mathrm{conformity}(ws_i, wt_j)_k = \begin{cases} 1 & T(ws_i) = T(wt_j) \\ 0 & T(ws_i) \neq T(wt_j) \end{cases}$$
$(ws_i, wt_j)_k$ denotes the k-th ($1 \le k \le K$) aligned word pair of a parallel sentence pair; $T(ws_i)$ and $T(wt_j)$ are the labels on the Chinese and English sides of the named entity; U is the unlabeled dataset; n is the number of sentences in U; three kinds of named entity are labeled in the corpus: PER (person name), LOC (place name) and ORG (organization name); under the BIO tag set, every token carries one of 7 tags: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG and O;
When computing the alignment label agreement rate, the distinction between the "B" and "I" tags is ignored, and the two are treated as the same label.
3. The bilingual named entity recognition method based on co-training as claimed in claim 1, characterized in that, in Step 2, the projected named entity region is expanded as follows:
First, the projected named entity region from the source language to the target language is expanded so that it can accommodate more target-language named entity hypotheses, each projected hypothesis forming one bilingual named entity hypothesis with the source-language named entity; for any named entity on the source side of the subset, the contiguous block of target-side words that is obtained by projecting its words and that contains the projection's center word is taken as the minimum candidate region; the region covering all projected words, extended outward by 4 words at each end, is taken as the maximum candidate region;
On the target-language side, a sliding window starts from the minimum candidate region and keeps extending by one word toward either side of the sentence until it reaches the boundary of the maximum candidate region, producing a series of candidate target-language named entity hypotheses; each of these, combined with the source-language named entity, forms one bilingual named entity hypothesis.
4. a kind of bilingual named entity recognition method based on coorinated training as claimed in claim 1, it is characterized in that in described step 2, merge single language feature of target language named entity and the alignment feature of bilingual named entity, and the method that projection result is revised is as follows:
By constructing a log-linear model, merge the syntax degree of confidence of target language named entity and the alignment degree of confidence of bilingual named entity, to all comprehensive marking of bilingual named entity hypothesis;
For guaranteeing that the projection of target language end named entity meets the syntactic feature of named entity, selects the named entity syntax degree of confidence of border, left and right distribution probability as target language; Border distribution probability comprises left margin binary part of speech co-occurrence frequency and right margin binary part of speech co-occurrence frequency; Left margin binary part of speech co-occurrence frequency definition as formula (2) as shown in:
The definition of right margin binary part of speech co-occurrence frequency as formula (3) as shown in:
Figure DEST_PATH_FDA0000468996390000032
Wherein, the t in formula i, t i-1, t i+1represent respectively border word w ipart of speech, border word w iprevious word w i-1part of speech and border word w ia rear word w i+1part of speech; Count (*, *, *) represents named entity border word w in corpus ithe number of times that occurs of binary part of speech combination; Count (rw i) and count (lw i) represent respectively the number of times that border, left and right occurs in language material;
Merge left and right boundary information, the calculating of single statement method degree of confidence of projection named entity as formula (4) as shown in:
Figure DEST_PATH_FDA0000468996390000033
Maximum entropy model can merge dissimilar feature, for the alignment degree of confidence of bilingual named entity make fundamental function
Figure DEST_PATH_FDA0000468996390000034
utilize maximum entropy model to carry out modeling, as formula (5) as shown in; For each fundamental function f m, corresponding model parameter is λ m, m=1,2 ..., M;
Figure DEST_PATH_FDA0000468996390000035
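As an illustration of how feature functions f_m and parameters λ_m combine in a maximum entropy model, the sketch below uses the standard form, in which the probability of a candidate is exp(Σ λ_m f_m) normalized over all candidates. Formula (5) itself is only available as an image, so treating it as this standard form is an assumption, and the function name `maxent_probabilities` is hypothetical.

```python
import math


def maxent_probabilities(feature_vectors, weights):
    """Return a probability for each candidate under a log-linear model.

    feature_vectors: one list [f_1, ..., f_M] per candidate alignment.
    weights:         the model parameters [lambda_1, ..., lambda_M].
    """
    # Weighted feature sum for every candidate.
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feature_vectors]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # normalization constant over all candidates
    return [e / z for e in exps]
```

With M = 3, each feature vector would hold the three alignment features named in this claim; the weights would be estimated from training data, which this sketch does not cover.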
Three features are used to model the bilingual named entity alignment confidence: the bilingual named entity POS combination co-occurrence feature, the bilingual named entity mutual translation feature, and the bilingual named entity length association feature. The POS combination co-occurrence feature is the co-occurrence frequency, over the whole corpus, of the corresponding Chinese-English POS combination in a bilingual named entity; it is calculated as shown in formula (6):
Figure DEST_PATH_FDA0000468996390000041
where the co-occurrence count is the number of times the named entity POS combination co-occurs in the corpus, and count(*, *) denotes the number of named entities in the corpus;
For a candidate bilingual named entity, the mutual translation probabilities between the source-language named entity and the projected named entity on the target-language side are denoted as in
Figure DEST_PATH_FDA0000468996390000044
and the bilingual named entity mutual translation feature is as shown in formula (7):
Figure DEST_PATH_FDA0000468996390000045
For the optimal bilingual named entity
Figure DEST_PATH_FDA0000468996390000046
,
Figure DEST_PATH_FDA0000468996390000047
the difference in length approximately follows a standard normal distribution; the length association feature is defined as shown in formula (8):
Figure DEST_PATH_FDA0000468996390000048
where
Figure DEST_PATH_FDA0000468996390000049
and count(*) denotes the number of characters contained in *, namely the number of letters for English and the number of Chinese characters for Chinese;
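A minimal sketch of the length association feature follows. The character-counting rule (letters for English, Chinese characters for Chinese) comes from the claim text. Everything else is an assumption, because formula (8) is only available as an image: the length difference is normalized as z = (len_en - ratio * len_zh) / sigma, and the feature is the standard normal density at z. The parameters `ratio` and `sigma` are hypothetical and would have to be estimated from a corpus.

```python
import math


def count_chars(text, lang):
    """count(*): letters for English ('en'), Chinese characters for Chinese ('zh')."""
    if lang == "en":
        return sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if lang == "zh":
        # Basic CJK Unified Ideographs block only; an illustrative simplification.
        return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    raise ValueError(f"unsupported language: {lang}")


def length_feature(zh_entity, en_entity, ratio, sigma):
    """Standard normal density of the normalized length difference (assumed form)."""
    len_zh = count_chars(zh_entity, "zh")
    len_en = count_chars(en_entity, "en")
    z = (len_en - ratio * len_zh) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
```

Under this assumed form the feature is largest when the English length matches the expected multiple of the Chinese length and decays as the two lengths diverge.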
In the expanded bilingual named entity hypothesis set
Figure DEST_PATH_FDA00004689963900000410
the score of each hypothesis is expressed in the form of formula (9):
Figure DEST_PATH_FDA00004689963900000412
Finally, the optimal bilingual named entity hypothesis set of the sentence pair is obtained by a greedy search, which yields the optimal target-language named entity projection; the optimal projection of a source-language named entity on the target-language side is the target-language named entity that, together with that source-language named entity, forms the optimal bilingual named entity hypothesis.
5. The bilingual named entity recognition method based on coordinated training according to claim 4, characterized in that the greedy search procedure is:
First, the optimal bilingual named entity hypothesis set is initialized to the empty set;
Then, according to
Figure DEST_PATH_FDA0000468996390000051
the score score(h_i) of every bilingual named entity hypothesis in the sentence pair is calculated, and the hypotheses are sorted in descending order of score;
Next, the hypotheses are examined in that order, and each expanded bilingual named entity hypothesis h_i that has no boundary conflict with the bilingual named entities already in the current optimal hypothesis set is put into the optimal bilingual named entity hypothesis set; this step is repeated until no expanded bilingual named entity hypothesis satisfying the condition can be found;
Finally, projection correction is applied in turn to each sentence in
Figure DEST_PATH_FDA0000468996390000052
forming the projection result
Figure DEST_PATH_FDA0000468996390000053
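The greedy search of claim 5 can be sketched as follows. The sort-then-select structure follows the claim text; the definition of a boundary conflict as an overlap of spans on either the source or the target side is an assumption, and the names `Hypothesis`, `overlaps`, `conflicts` and `greedy_select` are hypothetical. Scores are taken as already computed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    src_span: Tuple[int, int]  # (start, end) token indices, end inclusive
    tgt_span: Tuple[int, int]
    score: float               # precomputed score(h_i)


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two inclusive spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def conflicts(h: Hypothesis, g: Hypothesis) -> bool:
    """Assumed boundary conflict: overlap on the source or the target side."""
    return overlaps(h.src_span, g.src_span) or overlaps(h.tgt_span, g.tgt_span)


def greedy_select(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    selected: List[Hypothesis] = []  # start from an empty optimal set
    # Visit hypotheses in descending order of score.
    for h in sorted(hypotheses, key=lambda x: x.score, reverse=True):
        # Keep h only if it conflicts with nothing selected so far.
        if not any(conflicts(h, g) for g in selected):
            selected.append(h)
    return selected
```

A single pass over the sorted list is equivalent to repeating the selection step until no admissible hypothesis remains, because a hypothesis rejected once stays in conflict with the growing selected set.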
CN201310593746.3A 2013-11-21 2013-11-21 Bilingual named entity recognition method based on coordinated training Active CN103853710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310593746.3A CN103853710B (en) 2013-11-21 2013-11-21 Bilingual named entity recognition method based on coordinated training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310593746.3A CN103853710B (en) 2013-11-21 2013-11-21 Bilingual named entity recognition method based on coordinated training

Publications (2)

Publication Number Publication Date
CN103853710A true CN103853710A (en) 2014-06-11
CN103853710B CN103853710B (en) 2016-06-08

Family

ID=50861378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310593746.3A Active CN103853710B (en) 2013-11-21 2013-11-21 Bilingual named entity recognition method based on coordinated training

Country Status (1)

Country Link
CN (1) CN103853710B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104298714A (en) * 2014-09-16 2015-01-21 北京航空航天大学 Automatic massive-text labeling method based on exception handling
CN104965821A (en) * 2015-07-17 2015-10-07 苏州大学张家港工业技术研究院 Data annotation method and apparatus
CN106339371A (en) * 2016-08-30 2017-01-18 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN106649289A (en) * 2016-12-16 2017-05-10 中国科学院自动化研究所 Realization method and realization system for simultaneously identifying bilingual terms and word alignment
CN107357786A (en) * 2017-07-13 2017-11-17 山西大学 Bayesian word sense disambiguation method based on a large amount of pseudo data
CN107797988A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on Bi-LSTM
CN107798386A (en) * 2016-09-01 2018-03-13 微软技术许可有限责任公司 Multi-process collaborative training based on unlabeled data
CN107797987A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on Bi-LSTM-CNN
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on LSTM-CNN
CN107992468A (en) * 2017-10-12 2018-05-04 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on LSTM
WO2018153130A1 (en) * 2017-02-22 2018-08-30 华为技术有限公司 Translation method and apparatus
CN108959255A (en) * 2018-06-28 2018-12-07 北京百度网讯科技有限公司 Entity labeled data collection construction method, device and equipment
CN110765276A (en) * 2019-10-21 2020-02-07 北京明略软件系统有限公司 Entity alignment method and device in knowledge graph
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111143571A (en) * 2018-11-06 2020-05-12 马上消费金融股份有限公司 Entity labeling model training method, entity labeling method and device
CN111209754A (en) * 2020-02-25 2020-05-29 桂林电子科技大学 Data set construction method for Vietnamese entity recognition
CN111274829A (en) * 2020-02-07 2020-06-12 中国科学技术大学 Sequence labeling method using cross-language information
CN111461330A (en) * 2020-04-03 2020-07-28 中国建设银行股份有限公司 Multi-language knowledge base construction method and system based on multi-language resume
CN111723587A (en) * 2020-06-23 2020-09-29 桂林电子科技大学 Chinese-Thai entity alignment method oriented to cross-language knowledge graph
CN111738024A (en) * 2020-07-29 2020-10-02 腾讯科技(深圳)有限公司 Entity noun tagging method and device, computing device and readable storage medium
CN113221539A (en) * 2021-07-08 2021-08-06 华东交通大学 Method and system for identifying nested named entities integrated with syntactic information
CN114610852A (en) * 2022-05-10 2022-06-10 天津大学 Course learning-based fine-grained Chinese syntax analysis method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171350B2 (en) * 2002-05-03 2007-01-30 Industrial Technology Research Institute Method for named-entity recognition and verification
CN101295292A (en) * 2007-04-23 2008-10-29 北大方正集团有限公司 Method and device for modeling and naming entity recognition based on maximum entropy model
CN101763344A (en) * 2008-12-25 2010-06-30 株式会社东芝 Method for training translation model based on phrase, mechanical translation method and device thereof
CN102682763A (en) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 Method, device and terminal for correcting named entity vocabularies in voice input text
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7171350B2 (en) * 2002-05-03 2007-01-30 Industrial Technology Research Institute Method for named-entity recognition and verification
CN101295292A (en) * 2007-04-23 2008-10-29 北大方正集团有限公司 Method and device for modeling and naming entity recognition based on maximum entropy model
CN101763344A (en) * 2008-12-25 2010-06-30 株式会社东芝 Method for training translation model based on phrase, mechanical translation method and device thereof
CN102682763A (en) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 Method, device and terminal for correcting named entity vocabularies in voice input text
CN103268339A (en) * 2013-05-17 2013-08-28 中国科学院计算技术研究所 Recognition method and system of named entities in microblog messages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李波: "基于自主推理的中文命名实体识别方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑 》, no. 1, 15 January 2013 (2013-01-15) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104298714A (en) * 2014-09-16 2015-01-21 北京航空航天大学 Automatic massive-text labeling method based on exception handling
CN104298714B (en) * 2014-09-16 2017-12-08 北京航空航天大学 Automatic massive-text labeling method based on exception handling
CN104965821A (en) * 2015-07-17 2015-10-07 苏州大学张家港工业技术研究院 Data annotation method and apparatus
CN104965821B (en) * 2015-07-17 2018-01-05 苏州大学 Data annotation method and apparatus
CN106339371A (en) * 2016-08-30 2017-01-18 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN106339371B (en) * 2016-08-30 2019-04-30 齐鲁工业大学 English and Chinese word meaning mapping method and device based on word vectors
CN107798386A (en) * 2016-09-01 2018-03-13 微软技术许可有限责任公司 Multi-process collaborative training based on unlabeled data
CN106649289A (en) * 2016-12-16 2017-05-10 中国科学院自动化研究所 Realization method and realization system for simultaneously identifying bilingual terms and word alignment
WO2018153130A1 (en) * 2017-02-22 2018-08-30 华为技术有限公司 Translation method and apparatus
US11244108B2 (en) 2017-02-22 2022-02-08 Huawei Technologies Co., Ltd. Translation method and apparatus
CN107357786A (en) * 2017-07-13 2017-11-17 山西大学 Bayesian word sense disambiguation method based on a large amount of pseudo data
CN107797988A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on Bi-LSTM
CN107992468A (en) * 2017-10-12 2018-05-04 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on LSTM
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on LSTM-CNN
CN107797987A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 Mixed-corpus named entity recognition method based on Bi-LSTM-CNN
CN107797987B (en) * 2017-10-12 2021-02-09 北京知道未来信息技术有限公司 Bi-LSTM-CNN-based mixed corpus named entity identification method
CN108959255A (en) * 2018-06-28 2018-12-07 北京百度网讯科技有限公司 Entity labeled data collection construction method, device and equipment
CN111143571B (en) * 2018-11-06 2020-12-25 马上消费金融股份有限公司 Entity labeling model training method, entity labeling method and device
CN111143571A (en) * 2018-11-06 2020-05-12 马上消费金融股份有限公司 Entity labeling model training method, entity labeling method and device
CN110765276A (en) * 2019-10-21 2020-02-07 北京明略软件系统有限公司 Entity alignment method and device in knowledge graph
CN111062215B (en) * 2019-12-10 2024-02-13 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111062215A (en) * 2019-12-10 2020-04-24 金蝶软件(中国)有限公司 Named entity recognition method and device based on semi-supervised learning training
CN111274829A (en) * 2020-02-07 2020-06-12 中国科学技术大学 Sequence labeling method using cross-language information
CN111274829B (en) * 2020-02-07 2023-06-16 中国科学技术大学 Sequence labeling method utilizing cross-language information
CN111209754B (en) * 2020-02-25 2023-06-02 桂林电子科技大学 Data set construction method for Vietnam entity recognition
CN111209754A (en) * 2020-02-25 2020-05-29 桂林电子科技大学 Data set construction method for Vietnamese entity recognition
CN111461330A (en) * 2020-04-03 2020-07-28 中国建设银行股份有限公司 Multi-language knowledge base construction method and system based on multi-language resume
CN111461330B (en) * 2020-04-03 2023-09-15 中国建设银行股份有限公司 Multilingual knowledge base construction method and system based on multilingual resume
CN111723587A (en) * 2020-06-23 2020-09-29 桂林电子科技大学 Chinese-Thai entity alignment method oriented to cross-language knowledge graph
CN111738024B (en) * 2020-07-29 2023-10-27 腾讯科技(深圳)有限公司 Entity noun labeling method and device, computing device and readable storage medium
CN111738024A (en) * 2020-07-29 2020-10-02 腾讯科技(深圳)有限公司 Entity noun tagging method and device, computing device and readable storage medium
CN113221539A (en) * 2021-07-08 2021-08-06 华东交通大学 Method and system for identifying nested named entities integrated with syntactic information
CN114610852A (en) * 2022-05-10 2022-06-10 天津大学 Course learning-based fine-grained Chinese syntax analysis method and device

Also Published As

Publication number Publication date
CN103853710B (en) 2016-06-08

Similar Documents

Publication Publication Date Title
CN103853710A (en) Coordinated training-based dual-language named entity identification method
CN103154936B (en) Method and system for automated text correction
Hu et al. A state-transition framework to answer complex questions over knowledge base
CN107766324B (en) Text consistency analysis method based on deep neural network
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN105068997B (en) Method and device for constructing a parallel corpus
CN103942192B (en) Translation method based on separation and merging of bilingual maximal noun chunks
CN106383818A (en) Machine translation method and device
CN101866337A (en) Part-of-speech tagging system, and device and method thereof for training part-of-speech tagging model
CN111476031A (en) Improved Chinese named entity recognition method based on Lattice-LSTM
CN104915337A (en) Translation text integrity evaluation method based on bilingual text structure information
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN113312922B (en) Improved chapter-level triple information extraction method
CN106610937A (en) Information theory-based Chinese automatic word segmentation method
Bilgin et al. Sentiment analysis with term weighting and word vectors
CN102270196A (en) Machine translation method
CN104317882A (en) Decision-based Chinese word segmentation and fusion method
Huber et al. Predicting above-sentence discourse structure using distant supervision from topic segmentation
Qi et al. Translation-based matching adversarial network for cross-lingual natural language inference
Zhao Research and design of automatic scoring algorithm for english composition based on machine learning
CN102945231B (en) Construction method and system of incremental-translation-oriented structured language model
Tran et al. Preordering for Chinese-Vietnamese statistical machine translation
CN113190690A (en) Unsupervised knowledge graph inference processing method, unsupervised knowledge graph inference processing device, unsupervised knowledge graph inference processing equipment and unsupervised knowledge graph inference processing medium
Finch et al. A bayesian model of transliteration and its human evaluation when integrated into a machine translation system
Su et al. Alignment-consistent recursive neural networks for bilingual phrase embeddings

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant