CN103853710A

CN103853710A - Coordinated training-based dual-language named entity identification method

Info

Publication number: CN103853710A
Application number: CN201310593746.3A
Authority: CN
Inventors: 黄河燕; 史树敏; 李业刚
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2013-11-21
Filing date: 2013-11-21
Publication date: 2014-06-11
Anticipated expiration: 2033-11-21
Also published as: CN103853710B

Abstract

The invention discloses a named entity recognition method based on bilingual co-training, belonging to the field of natural language processing in computer science. The Chinese and English sides of a parallel sentence corpus are treated as two different views of one dataset for bilingual co-training; a log-linear model corrects the projected labels during label projection; and when the model predicts unseen examples, the consistency of named entity labels across the bilingual alignment is introduced as the measure for estimating label confidence. Compared with the prior art, the method reduces the domain dependence of named entity recognition, combines the strengths of recognition in both languages, resolves part of the ambiguity that arises in monolingual recognition, and is particularly suitable for synchronous bilingual named entity recognition on large-scale corpora.

Description

A bilingual named entity recognition method based on co-training

Technical field

The present invention relates to a method for recognizing bilingual named entities. It is particularly suited to recognizing named entities in large-scale cross-domain bilingual corpora as a preprocessing step for machine translation, and belongs to the field of natural language processing (NLP) in computer science.

Background technology

A named entity is the proper name of a unique individual. Named entity recognition is a fundamental and difficult problem in natural language processing, and has become one of the technical bottlenecks in multilingual information processing fields such as cross-language information retrieval and machine translation.

Researchers have developed many models for named entity recognition. Because rule-based methods are hard to transfer between different languages, statistical methods have attracted wide attention in recent years. Among statistical methods, supervised learning performs well on named entity recognition but has two weaknesses: first, it needs a large amount of labeled data to learn accurately, so it is unsuitable for resource-poor languages; second, when the existing labeled data and the data to be tagged do not belong to the same domain, its performance drops markedly. Unsupervised methods, for their part, perform unsatisfactorily. One way to address these shortcomings is to combine a small labeled corpus with a large amount of unlabeled data using co-training, a semi-supervised learning method.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art in bilingual named entity recognition on large-scale cross-domain corpora by proposing a bilingual named entity recognition method based on co-training.

The technical solution adopted by the invention is as follows. The Chinese and English sides of a parallel sentence corpus are treated as two different views of one dataset for bilingual co-training. On each side, an initial labeling model is trained on a small amount of labeled data, yielding two initial sequence labeling models. The trained initial models tag a small portion of cross-domain unlabeled data with named entities, and the labeling results are then projected to the corresponding other-language side. During projection, a log-linear model combining monolingual syntactic features and bilingual alignment features corrects the projected labels, which lowers the chance of mislabeled examples, reduces the noise introduced into the other sequence labeling model, and thereby improves the quality of co-training. When the sequence labeling models predict unseen examples, the agreement rate of named entity labels across the bilingual alignment is introduced as the measure of label confidence, so that confidence is estimated implicitly: the labeled set with the highest bilingual alignment label agreement rate among the unlabeled samples is taken as the increment of labeled data for the other side. This removes the dependence on a small labeled sample, improves the generalization ability of the algorithm, and thus improves cross-domain named entity recognition.

To carry out bilingual collaborative named entity recognition, the method has three steps: labeling model initialization, bilingual co-training, and bilingual named entity labeling. As shown in Fig. 1, the procedure is as follows:

Step 1: initialize the sequence labeling models. Initial sequence labeling models are trained separately on each side of a sentence-aligned Chinese-English labeled corpus of a given size. The sequence labeling model may be a conditional random field (CRF), a maximum entropy model, or similar.

Step 2: as shown in Fig. 2, extract a number of aligned sentence pairs from the sentence-aligned Chinese-English unlabeled corpus and tag both sides with the sequence labeling models to form a tagged sample set. Compute the bilingual label agreement rate of this sample set, and initialize the labeled-corpus increment set to empty.

The bilingual label agreement rate is the proportion of aligned word pairs, in a small amount of bilingual unlabeled data, whose labels agree after tagging by the sequence labeling models.

The labeled-corpus increment set is the automatically labeled data that, at the end of one round of co-training, is added to the other model's labeled corpus.

Specifically, randomly draw 10% of the sentence pairs from the tagged sample set and project the labels from the source-language side to the target-language side according to word alignment. First, the named entity projection region from source language to target language is expanded so that it can hold more target-language named entity hypotheses. Then the monolingual features of the target-language named entity and the alignment features of the bilingual named entity are combined in a log-linear model that corrects the projection result. The corrected result is used as a labeled-corpus increment and the model is retrained. The retrained model tags the sample set again and the bilingual label agreement rate is recomputed. This is repeated 10 times; the labeled-corpus increment that yields the highest bilingual label agreement rate becomes the increment for this round of co-training. The increment for the other language side is found in the same way.
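The selection loop described above can be sketched in Python. This is a minimal illustration, not the patent's implementation: `select_increment` and its callable parameters (`project_and_correct`, `retrain`, `tag`, `agreement_rate`) are hypothetical names standing in for the projection-and-correction step, model retraining, tagging, and the agreement-rate computation.

```python
import random


def select_increment(sample, project_and_correct, retrain, tag,
                     agreement_rate, initial_rate,
                     rounds=10, fraction=0.1, seed=0):
    """Keep the corrected projection whose retrained model gives the
    highest bilingual label agreement rate on the tagged sample set."""
    if not sample:
        return [], initial_rate
    rng = random.Random(seed)
    best_rate, best_increment = initial_rate, []
    k = max(1, int(len(sample) * fraction))  # 10% of the sample by default
    for _ in range(rounds):
        subset = rng.sample(list(sample), k)      # random draw of sentence pairs
        increment = project_and_correct(subset)   # projected and corrected labels
        model = retrain(increment)                # retrain with the candidate increment
        rate = agreement_rate(tag(model, sample)) # re-tag the sample, re-measure
        if rate > best_rate:                      # keep only improvements
            best_rate, best_increment = rate, increment
    return best_increment, best_rate
```

Starting from the initial agreement rate means an increment is only accepted if it improves on the models as they stood before the round; that reading of the initialization is an assumption of this sketch.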

The monolingual feature of a named entity is the boundary combination feature of the named entity on one language side; it mainly ensures that the incremental labeled data in co-training conform to the characteristics of named entities.

The alignment feature of a bilingual named entity is the consistency between the two sides of the bilingual named entity; it exploits the complementarity of recognition in the two languages.

Step 3: repeat step 2, testing on a development set, until the algorithm converges. When the loop ends, two sequence labeling models, one per language, have been produced; these are the trained bilingual named entity recognition models. They can then be used to recognize named entities in large-scale cross-domain bilingual corpora and to build a named entity dictionary, or to recognize named entities directly in a single sentence to be translated, improving machine translation quality.

Beneficial effect

By introducing the idea of co-training into the training of named entity sequence labeling models, the invention exploits the complementarity of bilingual named entity recognition and the mutual translatability of named entities to co-train the recognition models. Compared with the prior art, the method makes recognition in the two languages complementary and improves the precision and recall of named entity recognition on large-scale cross-domain corpora; it effectively reduces the dependence of named entity recognition on the domain of the labeled corpus, giving the models stronger generalization ability; and it produces named entity recognition models for both languages at once, with co-training improving the consistency of bilingual recognition, which helps the subsequent construction of a named entity dictionary. In summary, the invention is especially suitable for consistent recognition of bilingual named entities in large-scale cross-domain corpora.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a flow chart of the co-training process in the method of the invention.

Embodiment

The specific embodiment of the invention is described in further detail below with reference to the accompanying drawings.

A bilingual named entity recognition method based on co-training comprises the following steps:

Step 1: initialize the bilingual sequence labeling models. The Chinese and English sequence labeling models Cmodel(s) and Cmodel(t) are trained respectively on the labeled sets Ls and Lt of the sentence-aligned Chinese-English corpus. Three kinds of named entities are annotated in the labeled corpus: PER (person name), LOC (place name), and ORG (organization name). The BIO tag set is used, so every word receives one of 7 tags: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, and O. For Chinese, the features are word features, single-character features, and character or character-combination features at 2-3 positions; for English, a feature template combining the word, its part of speech, and the capitalization of its initial letter is used.
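As a small illustration of the BIO tag set described in step 1, the following Python sketch builds the 7 tags and converts entity spans to a tag sequence. The names `ENTITY_TYPES`, `TAGS`, and `spans_to_bio`, and the half-open span convention, are assumptions made for this example and do not come from the patent.

```python
ENTITY_TYPES = ["PER", "LOC", "ORG"]

# B-/I- tags for each entity type, plus O for tokens outside any entity.
TAGS = [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")] + ["O"]


def spans_to_bio(n_tokens, spans):
    """Convert (start, end, type) spans, with end exclusive, to BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for k in range(start + 1, end):
            tags[k] = f"I-{etype}"          # remaining tokens of the entity
    return tags
```

For example, a five-token sentence with a person name on tokens 0-1 and a place name on token 3 is passed as `spans_to_bio(5, [(0, 2, "PER"), (3, 4, "LOC")])`.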

Step 2: extract 1000 aligned sentence pairs from the sentence-aligned Chinese-English unlabeled sets Us and Ut, and tag them with the sequence labeling models Cmodel(s) and Cmodel(t) respectively to form a tagged sample set. Compute its bilingual label agreement rate conformity_ratio, initialize the running maximum max ← conformity_ratio, and initialize the labeled-corpus increment set to empty.

In bilingual named entity co-training, once an incorrect incremental label is selected, the error is learned and reinforced in later rounds, degrading the performance of the co-training algorithm. The algorithm therefore needs effective measures to keep noisy data out. Named entities are mutually translatable, so correctly recognized Chinese and English named entities should carry consistent labels. The alignment label agreement rate is therefore used as the measure for selecting incremental labels. It is computed as shown in formula (1):

$$\mathrm{conformity\_ratio} = \frac{1}{n} \sum_{U} \frac{1}{K} \sum_{k=1}^{K} \mathrm{conformity}(ws_i, wt_j)_k \qquad (1)$$

where

$$\mathrm{conformity}(ws_i, wt_j)_k = \begin{cases} 1 & T(ws_i) = T(wt_j) \\ 0 & T(ws_i) \neq T(wt_j) \end{cases}$$

Here (ws_i, wt_j)_k denotes the k-th (1 ≤ k ≤ K) aligned word pair of a parallel sentence pair; T(ws_i) and T(wt_j) denote the named entity labels on the Chinese and English sides respectively; U denotes the unlabeled data set; and n denotes the number of sentences in U. Because Chinese and English differ considerably in word order, the distinction between the "B" and "I" tags is ignored when computing the alignment label agreement rate, and they are treated as the same label.
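Formula (1) can be sketched in Python as follows. The input layout (a list of `(source_tags, target_tags, alignment)` triples, with the alignment given as index pairs) and the function names are assumptions for illustration; treating a sentence pair with no alignment links as contributing zero is also an assumption, since the source does not cover that case.

```python
def strip_bio(tag):
    """Drop the B-/I- prefix so that B-PER and I-PER compare as equal."""
    return tag.split("-", 1)[1] if "-" in tag else tag


def conformity_ratio(sentence_pairs):
    """Average, over sentence pairs, of the fraction of aligned word
    pairs whose labels agree (ignoring the B/I distinction)."""
    if not sentence_pairs:
        return 0.0
    total = 0.0
    for src_tags, tgt_tags, alignment in sentence_pairs:
        if not alignment:
            continue  # assumption: no alignment links contributes 0
        agree = sum(
            strip_bio(src_tags[i]) == strip_bio(tgt_tags[j])
            for i, j in alignment
        )
        total += agree / len(alignment)   # inner 1/K sum of formula (1)
    return total / len(sentence_pairs)    # outer 1/n sum of formula (1)
```

A pair whose aligned words are tagged `B-PER` on one side and `I-PER` on the other counts as agreeing, which reflects the rule that "B" and "I" are treated as the same label.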

Randomly draw 100 sentence pairs from the tagged sample set and project the labels from the source-language side to the target-language side according to word alignment. Because Chinese and English differ considerably, target-language named entities obtained by label projection alone are partly unsatisfactory, so the projection result is corrected by combining the monolingual features of the target-language named entity with the alignment features of the bilingual named entity. First, the named entity projection region from source language to target language is expanded so that it can hold more target-language named entity hypotheses; each projected named entity hypothesis forms a bilingual named entity hypothesis together with the source-language named entity.

For any named entity on the source side of the drawn sample, the contiguous target-side chunk obtained by word projection that contains the projected center word is taken as the minimum candidate region. The projected region containing all projected words, extended outward by 4 words at each end (fewer if the beginning or end of the sentence is reached first), is taken as the maximum candidate region.

On the target-language side, a sliding window is set up: starting from the minimum candidate region, words are added on either side until the boundary of the maximum candidate region is reached, producing a series of candidate target-language named entity hypotheses. Each target-language named entity hypothesis is combined with the source-language named entity to form one bilingual named entity hypothesis.
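The expansion from the minimum to the maximum candidate region can be sketched as an enumeration of spans. This is a minimal Python sketch under stated assumptions: spans are inclusive token-index pairs, the minimum region lies inside the projected region, and `candidate_spans` is a hypothetical name rather than anything defined in the patent.

```python
def candidate_spans(min_span, projected_span, sent_len, margin=4):
    """Enumerate every target-side span that contains the minimum
    candidate region and stays inside the maximum candidate region.

    min_span, projected_span: (start, end) inclusive token indices.
    The maximum region is the projected region widened by `margin`
    words per side, clipped at the sentence boundaries.
    """
    lo = max(0, projected_span[0] - margin)
    hi = min(sent_len - 1, projected_span[1] + margin)
    min_start, min_end = min_span
    return [
        (start, end)
        for start in range(lo, min_start + 1)   # left edge moves outward
        for end in range(min_end, hi + 1)       # right edge moves outward
    ]
```

Each returned span is one candidate target-language named entity; pairing it with the source-language entity gives one bilingual named entity hypothesis to be scored.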

Then a log-linear model is constructed that combines the syntactic confidence of the target-language named entity with the alignment confidence of the bilingual named entity to give an overall score to every bilingual named entity hypothesis. Consider first the monolingual syntactic confidence of the named entity. To ensure that the projected target-language named entity has the syntactic characteristics of a named entity, the left and right boundary distribution probabilities are chosen as the syntactic confidence of the target-language named entity. The boundary distribution probability comprises the left-boundary POS bigram co-occurrence frequency and the right-boundary POS bigram co-occurrence frequency. The left-boundary POS bigram co-occurrence frequency is defined in formula (2):

$$P(ENTxl \mid ENT\tilde{x}_a^b, S) = \max\left(\frac{count(t_i, t_{i+1}, lw)}{count(lw)}, \frac{count(t_{i-1}, t_i, lw)}{count(lw)}\right) \qquad (2)$$

The right-boundary POS bigram co-occurrence frequency is defined in formula (3):

$$P(ENTxr \mid ENT\tilde{x}_a^b, S) = \max\left(\frac{count(t_i, t_{i+1}, rw)}{count(rw)}, \frac{count(t_{i-1}, t_i, rw)}{count(rw)}\right) \qquad (3)$$

where t_i, t_{i-1}, and t_{i+1} denote the part of speech of the boundary word w_i, of the preceding word w_{i-1}, and of the following word w_{i+1} respectively; count(*, *, *) denotes the number of occurrences in the corpus of the POS bigram at the named entity boundary word w_i; and count(rw) and count(lw) denote the number of times the right and left boundaries occur in the corpus. Data smoothing uses Katz back-off, computed as shown in formula (4):

$$P_{smooth}(t_i \mid t_{i-n+1}^{i-1}) = \begin{cases} P_{smooth} & \text{if } C(t_{i-n+1}^{i-1}) > 0 \\ \gamma(t_{i-n+1}^{i-1}) & \text{if } C(t_{i-n+1}^{i-1}) = 0 \end{cases} \qquad (4)$$

Combining the left and right boundary information, the monolingual syntactic confidence of the projected named entity is computed as shown in formula (5):

$$P(ENTx \mid ENT\tilde{x}, S) = P(ENTxl \mid ENT\tilde{x}_a^b, S)\, P(ENTxr \mid ENT\tilde{x}_a^b, S) \qquad (5)$$
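Formulas (2), (3), and (5) can be sketched in Python as below. The count tables are assumed to be dictionaries keyed by POS bigrams and collected separately for left and right entity boundaries; these data structures, the function names, and the omission of Katz back-off smoothing are simplifications of this sketch, not details from the patent.

```python
def boundary_prob(pos_tags, i, bigram_counts, boundary_total):
    """Formulas (2)/(3): the larger of the two POS-bigram relative
    frequencies around boundary word i (no smoothing in this sketch)."""
    if boundary_total == 0:
        return 0.0
    following = 0
    preceding = 0
    if i + 1 < len(pos_tags):
        following = bigram_counts.get((pos_tags[i], pos_tags[i + 1]), 0)
    if i > 0:
        preceding = bigram_counts.get((pos_tags[i - 1], pos_tags[i]), 0)
    return max(following, preceding) / boundary_total


def syntactic_confidence(pos_tags, a, b, left_counts, left_total,
                         right_counts, right_total):
    """Formula (5): product of the left- and right-boundary probabilities
    for a candidate entity spanning tokens a..b (inclusive)."""
    left = boundary_prob(pos_tags, a, left_counts, left_total)
    right = boundary_prob(pos_tags, b, right_counts, right_total)
    return left * right
```

A candidate whose boundary words sit in POS contexts that are frequent at entity boundaries in the corpus receives a higher confidence than one whose boundaries fall in rarely seen contexts.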

A maximum entropy model can combine features of different types. For the alignment confidence of a bilingual named entity, define the feature functions

$$f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES), \quad m = 1, 2, \ldots, M,$$

and model the confidence with a maximum entropy model as shown in formula (6). Each feature function f_m has a corresponding model parameter λ_m, m = 1, 2, ..., M.

$$P(a_k \mid ENTc_a^b, ENT\tilde{e}_c^d, CS, ES) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES)\right)}{\sum_{A} \exp\left(\sum_{m=1}^{M} \lambda_m f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES)\right)} \qquad (6)$$
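Formula (6) is a normalized exponential of weighted feature sums. A minimal Python sketch follows, assuming the feature values of each candidate alignment have already been computed and are supplied as one vector per candidate; `maxent_probs` is a hypothetical name.

```python
import math


def maxent_probs(feature_vectors, weights):
    """Formula (6): for each candidate, exp(sum_m lambda_m * f_m),
    normalized over all candidates in the set A."""
    scores = [
        sum(w * f for w, f in zip(weights, features))
        for features in feature_vectors
    ]
    top = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                            # normalizer: the sum over A
    return [e / z for e in exps]
```

Subtracting the largest score before exponentiating leaves the probabilities unchanged and avoids overflow when the weighted sums are large.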

Three features are used to model the alignment confidence of a bilingual named entity: the POS combination co-occurrence feature, the mutual translation feature, and the length association feature.

The POS combination co-occurrence feature is the frequency with which the corresponding Chinese and English POS combinations of the bilingual named entity co-occur in the whole corpus. It is computed as shown in formula (7):

$$f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES) = f_m(a_k, t\_ENTc_a^b, t\_ENT\tilde{e}_c^d, CS, ES) = \frac{count(t\_ENTc_a^b, t\_ENT\tilde{e}_c^d)}{count(*, *)} \qquad (7)$$

where count(t_ENTc_a^b, t_ENTẽ_c^d) denotes the number of times the named entity POS combination co-occurs in the corpus, and count(*, *) denotes the number of named entities in the corpus.

For a candidate bilingual named entity, the translation probabilities in each direction between the source-language named entity and the projected target-language named entity are denoted P(ENTc_a^b | ENTẽ_c^d) and P(ENTẽ_c^d | ENTc_a^b). The bilingual named entity mutual translation feature is given by formula (8):

$$f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES) = \log P(ENTc_a^b \mid ENT\tilde{e}_c^d) + \log P(ENT\tilde{e}_c^d \mid ENTc_a^b) \qquad (8)$$

For an optimal bilingual named entity, the difference in length between its Chinese and English sides approximately follows a standard normal distribution. The length association feature is defined in formula (9):

$$f_m(a_k, ENTc_a^b, ENT\tilde{e}_c^d, CS, ES) \approx f_m(a_k, |ENTc_a^b|, |ENT\tilde{e}_c^d|) = \frac{|ENTc_a^b| - \delta\,|ENT\tilde{e}_c^d|}{\sqrt{(|ENTc_a^b| + 1)\,\sigma^2}} \qquad (9)$$

where

$$\delta = \frac{1}{n}\sum_{i=1}^{n} \frac{count(ENTe_i)}{count(ENTc_i)}, \qquad \sigma^2 = \frac{1}{n}\sum_{j=1}^{n}\left(\frac{count(ENTe_j)}{count(ENTc_j)} - \frac{1}{n}\sum_{i=1}^{n}\frac{count(ENTe_i)}{count(ENTc_i)}\right)^2,$$

and count(*) denotes the number of characters in *: the number of letters for English and the number of Chinese characters for Chinese.
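The length statistics and the length association feature can be sketched in Python. Several points are assumptions of this sketch: the denominator of formula (9) is read as the square root of (|ENTc| + 1) times σ², σ² is read as the variance of the English-to-Chinese length ratio, string length is used as the character count (so spaces in multi-word English names would be counted), and σ² is taken to be positive. The function names are hypothetical.

```python
import math


def length_stats(entity_pairs):
    """delta: mean English/Chinese character-length ratio over known
    entity pairs; sigma2: variance of that ratio."""
    ratios = [len(english) / len(chinese) for chinese, english in entity_pairs]
    delta = sum(ratios) / len(ratios)
    sigma2 = sum((r - delta) ** 2 for r in ratios) / len(ratios)
    return delta, sigma2


def length_feature(len_c, len_e, delta, sigma2):
    """Length association feature as read from formula (9);
    assumes sigma2 > 0."""
    return (len_c - delta * len_e) / math.sqrt((len_c + 1) * sigma2)
```

The statistics would be estimated once from a list of known bilingual named entities and then reused to score every candidate hypothesis.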

The score of each hypothesis h_i in the expanded set of bilingual named entity hypotheses is given by formula (10):

$$score(h_i) = \log P(h_i \mid ENTc, ENT\tilde{e}, CS, ES) + \log P(ENTe \mid ENT\tilde{e}, S) \qquad (10)$$

Finally, the optimal set of bilingual named entity hypotheses for the sentence pair is obtained by a greedy search. The optimal projection of a source-language named entity on the target side is the target-language named entity that forms the optimal bilingual named entity hypothesis with that source-language named entity. All expanded bilingual named entity hypotheses of the sentence pair are scored with formula (10), and the optimal set of hypotheses for the sentence pair is selected by the following greedy search procedure, yielding the optimal target-language named entity projection:

First, initialize the optimal set of bilingual named entity hypotheses to empty;

Then, compute score(h_i) for all bilingual named entity hypotheses of the sentence pair according to formula (10) and sort them in descending order;

Next, take in turn an expanded bilingual named entity hypothesis h_i that has no boundary conflict with the bilingual named entities already in the current optimal set, and put it into the optimal set. Repeat this step until no expanded bilingual named entity hypothesis satisfying the condition can be found.
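The greedy search can be sketched in Python as follows. Hypotheses are assumed to be `(score, source_span, target_span)` tuples with inclusive spans, and a "boundary conflict" is interpreted here as an overlap on either the source or the target side; both the data layout and that interpretation are assumptions of this sketch.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def greedy_select(hypotheses):
    """Pick hypotheses in descending score order, skipping any whose
    source or target span conflicts with one already chosen."""
    chosen = []  # the optimal set starts empty
    for hyp in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        _, src, tgt = hyp
        conflict = any(
            overlaps(src, kept[1]) or overlaps(tgt, kept[2])
            for kept in chosen
        )
        if not conflict:
            chosen.append(hyp)
    return chosen
```

Because the list is sorted once and each hypothesis is checked only against those already kept, the procedure stops naturally when no remaining hypothesis is free of conflicts.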

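Then, projection correction is applied in turn to the sentences of the drawn sample, forming the corrected projection result. The sequence labeling model is retrained with this result added to the labeled data, the retrained model Cmodel(t) tags the sample set again, and the bilingual label agreement rate is recomputed. If the new rate exceeds the current maximum, the maximum is updated and the corrected projection result is recorded as the target-side labeled-corpus increment. Finally, the increment is merged into Lt and the sequence labeling model is retrained on Lt: Cmodel(t) ← Cmodel(Lt).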

Similarly, 100 sentence pairs are randomly drawn from the sample set, labels are projected in the opposite direction according to word alignment, and the projection result is corrected using the combined features. The sequence labeling model is retrained with the corrected result, Cmodel(s) tags the sample set again, and the bilingual label agreement rate is recomputed. If the new rate exceeds the current maximum, the maximum and the source-side labeled-corpus increment are updated. Finally, the increment is merged into Ls and the sequence labeling model is retrained on Ls.

Step 3: repeat step 2, observing the test results of the bilingual sequence labeling models on the development set, until the algorithm converges, finally producing the models Cmodel(s) and Cmodel(t). Cmodel(s) is used to recognize named entities in the source-language corpus and Cmodel(t) to recognize named entities in the target-language corpus, and a named entity dictionary is then compiled.

Claims

1. A bilingual named entity recognition method based on co-training, characterized by comprising the following steps:

Step 1: initialize the labeling models; the initial Chinese and English named entity labeling models are trained respectively on 2000 entries of bilingual corpus annotated with named entities;

Step 2: on the sentence-aligned Chinese-English corpus not annotated with named entities, select incremental labels by 10-fold cross selection and carry out bilingual co-training; the detailed process is as follows:

First, randomly draw 1000 aligned sentence pairs from the sentence-aligned Chinese-English corpus not annotated with named entities to form a sample set; tag both sides of the sentence pairs with named entities using the labeling models obtained in step 1; compute the bilingual label agreement rate of the sample set, and initialize the labeled-corpus increment set to empty;

Then, randomly draw 10% of the sentence pairs from the sample set and project labels from the source-language side to the target-language side according to word alignment, expanding the projected named entity label region so that it can hold more target-language named entity hypotheses; each projected named entity hypothesis forms a bilingual named entity hypothesis together with the source-language named entity; next, combine the monolingual features of the target-language named entity with the alignment features of the bilingual named entity to correct the projection result, take the corrected result as the target-side labeled-corpus increment, retrain the target-language named entity labeling model with it, tag the sample set again with the retrained model, and recompute the bilingual label agreement rate;

The above process is repeated 10 times; the labeled-corpus increment with the highest bilingual label agreement rate over these iterations is taken as the target-side labeled-corpus increment for this round of co-training, and the target-language named entity labeling model is retrained with it;

The increment of labeled data for the source-language side is found in the same way, and the source-language named entity labeling model is retrained with it;

Step 3: repeat step 2, testing on a development set, until the algorithm converges; when the loop ends, two named entity labeling models, one Chinese and one English, have been produced, which are the trained bilingual named entity recognition models; finally, named entities are recognized in the cross-domain bilingual corpus and a named entity dictionary is built.

2. The bilingual named entity recognition method based on co-training as claimed in claim 1, characterized in that the bilingual label agreement rate of the sample set is computed as follows:

Let the bilingual label agreement rate of the sample set be conformity_ratio; initialize max ← conformity_ratio, and initialize the labeled-corpus increment set to empty,

where (ws_i, wt_j)_k denotes the k-th (1 ≤ k ≤ K) aligned word pair of a parallel sentence pair; T(ws_i) and T(wt_j) denote the named entity labels on the Chinese and English sides respectively; U denotes the unlabeled data set; n denotes the number of sentences in U; three kinds of named entities are annotated in the labeled corpus: PER (person name), LOC (place name), and ORG (organization name); tagging follows the BIO tag set, so every character receives one of 7 tags: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, and O;

When computing the alignment label agreement rate, the distinction between the "B" and "I" tags is ignored and they are treated as the same label.

3. The bilingual named entity recognition method based on co-training as claimed in claim 1, characterized in that, in step 2, the projected named entity label region is expanded as follows:

First, the named entity projection region from source language to target language is expanded so that it can hold more target-language named entity hypotheses; each projected named entity hypothesis forms a bilingual named entity hypothesis together with the source-language named entity; for any named entity on the source side of the sample, the contiguous target-side chunk obtained by word projection that contains the projected center word is taken as the minimum candidate region, and the projected region containing all projected words, extended outward by 4 words at each end, is taken as the maximum candidate region;

On the target-language side, a sliding window is set up: starting from the minimum candidate region, words are added on either side until the boundary of the maximum candidate region is reached, producing a series of candidate target-language named entity hypotheses; each target-language named entity hypothesis is combined with the source-language named entity to form one bilingual named entity hypothesis.

4. The bilingual named entity recognition method based on co-training as claimed in claim 1, characterized in that, in step 2, the monolingual features of the target-language named entity and the alignment features of the bilingual named entity are combined to correct the projection result as follows:

A log-linear model is constructed that combines the syntactic confidence of the target-language named entity with the alignment confidence of the bilingual named entity to give an overall score to every bilingual named entity hypothesis;

To ensure that the projected target-language named entity has the syntactic characteristics of a named entity, the left and right boundary distribution probabilities are chosen as the syntactic confidence of the target-language named entity; the boundary distribution probability comprises the left-boundary POS bigram co-occurrence frequency and the right-boundary POS bigram co-occurrence frequency; the left-boundary POS bigram co-occurrence frequency is defined by formula (2),

where t_i, t_{i-1}, and t_{i+1} denote the part of speech of the boundary word w_i, of the preceding word w_{i-1}, and of the following word w_{i+1} respectively; count(*, *, *) denotes the number of occurrences in the corpus of the POS bigram at the named entity boundary word w_i; and count(rw) and count(lw) denote the number of times the right and left boundaries occur in the corpus;

Combining the left and right boundary information, the monolingual syntactic confidence of the projected named entity is computed by formula (4);

A maximum entropy model can combine features of different types; for the alignment confidence of a bilingual named entity, feature functions are defined and the confidence is modeled with a maximum entropy model by formula (5); each feature function f_m has a corresponding model parameter λ_m, m = 1, 2, ..., M;

Three features are used to model the alignment confidence of a bilingual named entity: the POS combination co-occurrence feature, the mutual translation feature, and the length association feature; the POS combination co-occurrence feature is the frequency with which the corresponding Chinese and English POS combinations of the bilingual named entity co-occur in the whole corpus, and is computed by formula (6),

where the co-occurrence count is the number of times the named entity POS combination co-occurs in the corpus, and count(*, *) denotes the number of named entities in the corpus;

For a candidate bilingual named entity, the translation probabilities in each direction between the source-language named entity and the projected target-language named entity are used, and the bilingual named entity mutual translation feature is given by formula (7);

For an optimal bilingual named entity, the difference in length between its two sides approximately follows a standard normal distribution, and the length association feature is defined by formula (8),

where count(*) denotes the number of characters in *: the number of letters for English and the number of Chinese characters for Chinese;

The score of each hypothesis in the expanded set of bilingual named entity hypotheses is given by formula (9);

Finally, the optimal set of bilingual named entity hypotheses for the sentence pair is obtained by a greedy search, yielding the optimal target-language named entity projection; the optimal projection of a source-language named entity on the target side is the target-language named entity that forms the optimal bilingual named entity hypothesis with that source-language named entity.

5. The bilingual named entity recognition method based on co-training as claimed in claim 4, characterized in that the greedy search procedure is:

First, compute score(h_i) for all bilingual named entity hypotheses of the sentence pair according to the score formula and sort them in descending order;

Next, take in turn an expanded bilingual named entity hypothesis h_i that has no boundary conflict with the bilingual named entities already in the current optimal set, and put it into the optimal set; repeat this step until no expanded bilingual named entity hypothesis satisfying the condition can be found;

Finally, apply projection correction in turn to the sentences of the drawn sample, forming the projection result.