
The performance of spaCy's NER is not satisfactory, and different behaviors are observed between spaCy and the displaCy demo #7493

Closed
ShengdingHu opened this issue Mar 19, 2021 · 20 comments
Labels
feat / ner (Feature: Named Entity Recognizer), perf / accuracy (Performance: accuracy)

Comments

@ShengdingHu
ShengdingHu commented Mar 19, 2021

When I use the displaCy demo to do NER on the sentence
"drawing from mutualism, mikhail bakunin founded collectivist anarchism and entered the international workingmen's association, a class worker union later known as the first international that formed in 1864 to unite diverse revolutionary currents."
I get the result

[screenshot: displaCy demo NER output]

However, when I use the offline version, I only get two named entities, which is far from satisfactory.
[screenshot: local spaCy NER output with only two entities]

Similar observations appear in other sentences.

The package version is 3.0.0, which is different from the displaCy demo's 2.3.0. However, I believe the more recent version should not yield worse performance.

Did I miss something, such as preprocessing? Thanks in advance!

@adrianeboyd
Contributor

Hi, this is a good question! The performance difference is due to a bug related to data augmentation for the v3.0.0 models. We added data augmentation with lowercasing to our training for the v2.3 models, but there was a bug in the augmenter and it wasn't applied when we trained the v3.0.0 models. We've fixed the bug and plan to add it back for the v3.1.0 models.

In general, if you need the exact same performance, you want to be sure you use the exact same model version like v2.3.0. We try to provide useful pretrained models, but we don't try to guarantee the exact same performance across versions. For instance, we may change or update the underlying datasets or change training parameters between versions. (Actually a number of people complained about the differences between v2.2 and v2.3 models because the tagger confused common nouns and proper nouns more due to the lowercasing. It's impossible to provide a model that's perfect for every use case!)

Here my best recommendation would be to use v2.3.x models for processing texts without standard capitalization and reevaluate when v3.1.0 models are released. Or if you train your own model with spacy v3.0.5+, the lowercasing bugs should hopefully be fixed in the provided augmenters.
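As a quick sanity check when comparing results, something like this (a minimal sketch using the standard spacy.__version__ attribute and the model's meta.json fields) confirms which spaCy release and which model version are actually loaded:

import spacy

# Minimal version check: confirm which spaCy release and which trained
# pipeline version are in use before comparing NER output.
print("spaCy version:", spacy.__version__)

nlp = spacy.load("en_core_web_sm")
print("model version:", nlp.meta["version"])        # e.g. "2.3.0" or "3.0.0"
print("built for:    ", nlp.meta["spacy_version"])  # compatibility spec from meta.json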

@polm added the "feat / ner" (Feature: Named Entity Recognizer) label Mar 20, 2021
@omri374
Contributor
omri374 commented Mar 20, 2021

Not sure if this is related, but there's a regression in the en_core_web_lg model from v2.3.0 to v3.0.0. A very simple example doesn't work:

In spaCy 3.0.5 with en_core_web_lg version 3.0.0:

import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("My name is David")
for ent in doc.ents:
    print("Found entity:")
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# No entities found

With spaCy 2.3.2 and en_core_web_lg version 2.3.0:

import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("My name is David")
for ent in doc.ents:
    print("Found entity:")
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Found entity:
David 11 16 PERSON

@ShengdingHu
Author

Thanks! I tried version 2.3.0 (https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-2.3.0) and found its performance is the same as the displaCy demo's, which is insensitive to case! Hope this bug can be fixed soon.

@adrianeboyd
Contributor

For reference, you can see the model version number used in the demo under "Model" in the displaCy screenshot above. The demo is running that exact model underneath, so you should get the same results locally with that model and spacy.displacy.serve.
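For example, a minimal sketch to reproduce the demo output locally (the input text is the sentence from this issue; substitute whichever model version you want to compare):

import spacy
from spacy import displacy

# Load the same model version as shown under "Model" in the online demo.
nlp = spacy.load("en_core_web_sm")
doc = nlp("drawing from mutualism, mikhail bakunin founded collectivist anarchism ...")

# displacy.serve starts a local web server with the same entity visualization
# as the demo; displacy.render returns the markup instead of serving it.
displacy.serve(doc, style="ent")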

The 3.0.0 test with "David" looks more like a model fluke than a major regression related to NER or PERSON? The PERSON performance for en_core_web_lg is slightly lower (91.5 -> 90.0), but in that context or in different contexts "David", "David Johnson", or other names seem to work as expected?

@omri374
Contributor
omri374 commented Mar 22, 2021

@adrianeboyd were previous models (like en_core_web_lg) retrained for spaCy >=3?

@adrianeboyd
Contributor

Yes, the models are always retrained for new minor versions (v2.2, v2.3, v3.0, etc.), and sometimes there are multiple compatible model versions for a minor version if there are bug fixes or minor changes in model settings/data. The v3 models with configs are not compatible with v2 at all, and vice versa for the v2 models without configs.

Whenever the model version number is different (2.3.0 vs. 2.3.1 vs. 3.0.0), it's a newly trained model. The first two numbers in the model version tell you which version of spaCy it should be compatible with.

@svlandeg added the "perf / accuracy" (Performance: accuracy) and "resolved" (The issue was addressed / answered) labels Mar 31, 2021
@github-actions
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@omri374
Contributor
omri374 commented May 5, 2021

Coming back to this issue, I did a small test with different names and it seems like there's a pattern here.

Biblical names (in my small test) are much less likely to be identified by both en_core_web_lg and en_core_web_trf.
My speculation is that many sentences in the OntoNotes dataset originating from the Old and New Testament are not actually labeled (all tokens in such a sentence get the O tag even when it contains entities).

This is the short test I did, comparing non-biblical names (both old and new) with biblical names on two simple template sentences (tested on spaCy version 3.0.5):

biblical_names = ["David", "Moses", "God", 
                  "Abraham", "Samuel", "Jacob", 
                  "Isaac", "Jesus", "Matthew", 
                  "John", "Judas","Simon", "Mary"] # Random biblical names

other_names = ["Beyonce", "Ariana", "Katy", # Singers
               "Michael", "Lebron", "Coby", # NBA players
               "William", "Charles","Ruth", "Margaret","Elizabeth", "Anna", # Most popular (non biblical) names in 1900 (https://www.ssa.gov/oact/babynames/decades/names1900s.html)
               "Ronald", "George", "Bill", "Barack", "Donald", "Joe" # Presidents
               ]

template1 = "My name is {}"
template2 = "This is what God said to {}" # Note that 'God' in theory should also be a named entity.

Evaluating recall:

from typing import List
import itertools
import pprint
import spacy

en_core_web_lg = spacy.load("en_core_web_lg")

def names_recall(nlp: spacy.language.Language, names: List[str], template: str):
    """
    Run the spaCy NLP model on the template + name,
    calculate recall for detecting the "PERSON" entity
    and return a detailed list of detections
    :param nlp: spaCy nlp model
    :param names: list of names to run the model on
    :param template: sentence with a placeholder for the name (e.g. "He calls himself {}")
    """
    results = {}
    for name in names:
        doc = nlp(template.format(name))
        results[name] = len([ent for ent in doc.ents if ent.label_ == "PERSON"]) > 0
    recall = sum(results.values()) / (len(names) * 1.0)
    print(f"Recall: {recall:.2f}\n")
    return results


name_sets = {"Biblical": biblical_names, "Other": other_names}
templates = (template1, template2)

detailed_results = {}

print("Model name: en_core_web_lg")
for name_set, template in itertools.product(name_sets.items(), templates):
    print(f"Name set: {name_set[0]}, Template: \"{template}\"")
    results = names_recall(en_core_web_lg, name_set[1], template)
    detailed_results[(name_set[0], template)] = results

print("\nDetailed results:")
pprint.pprint(detailed_results)

Here are the results for the en_core_web_lg model:

Model name: en_core_web_lg
Name set: Biblical, Template: "My name is {}"
Recall: 0.23

Name set: Biblical, Template: "This is what God said to {}"
Recall: 0.00

Name set: Other, Template: "My name is {}"
Recall: 0.94

Name set: Other, Template: "This is what God said to {}"
Recall: 0.83

Detailed results:
{('Biblical', 'My name is {}'): {'Abraham': True,
'David': False,
'God': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': False,
'Simon': False},
('Biblical', 'This is what God said to {}'): {'Abraham': False,
'David': False,
'God': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': False,
'Matthew': False,
'Moses': False,
'Samuel': False,
'Simon': False},
('Other', 'My name is {}'): {'Anna': True,
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': False,
'Donald': True,
'Elizabeth': True,
'George': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Michael': True,
'Ronald': True,
'Ruth': True,
'William': True},
('Other', 'This is what God said to {}'): {'Anna': False,
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': False,
'Donald': True,
'Elizabeth': True,
'George': False,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Michael': True,
'Ronald': True,
'Ruth': True,
'William': True}}

The en_core_web_trf model produces different results, but the gap between the name sets is still there:

Model name: en_core_web_trf
Name set: Biblical, Template: "My name is {}"
Recall: 0.46

Name set: Biblical, Template: "This is what God said to {}"
Recall: 0.00

Name set: Other, Template: "My name is {}"
Recall: 1.00

Name set: Other, Template: "This is what God said to {}"
Recall: 0.44

Detailed results:
{('Biblical', 'My name is {}'): {'Abraham': False,
'David': True,
'God': False,
'Isaac': True,
'Jacob': False,
'Jesus': False,
'John': True,
'Judas': False,
'Mary': True,
'Matthew': True,
'Moses': False,
'Samuel': True,
'Simon': False},
('Biblical', 'This is what God said to {}'): {'Abraham': False,
'David': False,
'God': False,
'Isaac': False,
'Jacob': False,
'Jesus': False,
'John': False,
'Judas': False,
'Mary': False,
'Matthew': False,
'Moses': False,
'Samuel': False,
'Simon': False},
('Other', 'My name is {}'): {'Anna': True,
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': True,
'Coby': True,
'Donald': True,
'Elizabeth': True,
'George': True,
'Joe': True,
'Katy': True,
'Lebron': True,
'Margaret': True,
'Michael': True,
'Ronald': True,
'Ruth': True,
'William': True},
('Other', 'This is what God said to {}'): {'Anna': False,
'Ariana': True,
'Barack': True,
'Beyonce': True,
'Bill': True,
'Charles': False,
'Coby': True,
'Donald': True,
'Elizabeth': False,
'George': False,
'Joe': False,
'Katy': True,
'Lebron': False,
'Margaret': False,
'Michael': False,
'Ronald': True,
'Ruth': False,
'William': False}}

Not sure how to fix this, though. Either OntoNotes should be fixed (if this is in fact the reason), or additional augmentations should take place before training (e.g., extract a template out of a sentence in OntoNotes and inject other values for that entity).
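As a rough illustration of that template/injection idea (a sketch only: swap_person and the example offsets below are made up for illustration, and real OntoNotes examples would also need their other annotation layers realigned):

import random

def swap_person(text, ents, replacements):
    # ents: list of (start_char, end_char, label) tuples for the original text.
    # Replaces each PERSON span with a random substitute name and shifts the
    # character offsets of all following spans accordingly.
    new_text, new_ents, offset = text, [], 0
    for start, end, label in sorted(ents):
        start, end = start + offset, end + offset
        if label == "PERSON":
            name = random.choice(replacements)
            new_text = new_text[:start] + name + new_text[end:]
            offset += len(name) - (end - start)
            end = start + len(name)
        new_ents.append((start, end, label))
    return new_text, new_ents

# Hypothetical usage on one OntoNotes-style example:
text = "And David said, Why hast thou troubled us?"
ents = [(4, 9, "PERSON")]
print(swap_person(text, ents, ["Beyonce", "Lebron", "Margaret"]))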

I would appreciate @adrianeboyd @ines @honnibal's thoughts on this. Thanks for the amazing work!

@adrianeboyd
Contributor

As a note, the Biblical sections of OntoNotes are annotated for training purposes as having "missing" NER annotation, not as O, so while I'd agree that there are some performance regressions for NER in v3.0 models, I don't think it's related to this section of the training data. All the built-in components are designed to be able to handle partial annotation both in training and evaluation.

It does make sense to try out things like this, but I'm also not sure whether your comparison is testing what you intended here. :) Michael, Anna, and Ruth appear in the Biblical section and the training data is old enough that Beyonce and Lebron do not occur at all.

One suspect is missing lowercase augmentation (there was a bug when we trained v3.0 and the lowercase augmentation was skipped vs. v2.3), and overall it looks like some more general augmentation around proper names (especially to cover newer names that aren't in the old training data) is definitely a good idea!

@omri374
Contributor
omri374 commented May 6, 2021

Thanks @adrianeboyd! Are there any public versions of the spaCy training flow that handle the missing NER annotations?
I updated my analysis, replacing some of the names so that no biblical names remain in the "Other" set. The results are still quite striking: Lebron and Beyonce are detected (even though they are not in the training set), whereas names like Jacob and Isaac aren't. Please see a notebook gist here.

Results on three sentences using the en_core_web_lg model:

biblical_names = ["David", "Moses", "Abraham", "Samuel", "Jacob", 
                  "Isaac", "Jesus", "Matthew", 
                  "John", "Judas","Simon", "Mary"] # Random biblical names

other_names = ["Beyonce", "Ariana", "Katy", # Singers
               "Michael", "Lebron", "Coby", # NBA players
               "William", "Charles","Robert", "Margaret","Frank", "Helen", # Popular (non biblical) names in 1900 (https://www.ssa.gov/oact/babynames/decades/names1900s.html)
               "Ronald", "George", "Bill", "Barack", "Donald", "Joe" # Presidents
               ]

template1 = "My name is {}"
template2 = "And {} said, Why hast thou troubled us?"
template3 = "And she conceived again, a bare a son; and she called his name {}."

name_sets = {"Biblical": biblical_names, "Other": other_names}
templates = (template1, template2, template3)

Model name: en_core_web_lg
Name set: Biblical, Template: "My name is {}" Recall: 0.25
Name set: Other, Template: "My name is {}" Recall: 0.94
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?" Recall: 0.67
Name set: Other, Template: "And {} said, Why hast thou troubled us?" Recall: 0.94
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 0.58
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 0.94

How would you suggest adding augmentation to the spaCy training pipelines? In the work we do on Presidio, we extracted template sentences from OntoNotes, injected fake entities in place of the original entities, and trained the model on more data this way. Is this something you would consider as a contribution to spaCy?

@adrianeboyd
Contributor

This is also speculation to some degree, but I think another problem is that you rarely have first names in isolation in the training data, so the model ends up learning that David is B-PERSON and never sees it as U-PERSON. I suspect that unknown names are more likely to end up being accepted as U-PERSON than known first names, but I don't really know for sure.
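For illustration, the BILUO tags that the entity recognizer trains on can be inspected directly with spaCy's offsets_to_biluo_tags helper (the two example sentences are made up):

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# A lone first name is a single-token entity -> U-PERSON
doc1 = nlp.make_doc("My name is David")
print(offsets_to_biluo_tags(doc1, [(11, 16, "PERSON")]))
# ['O', 'O', 'O', 'U-PERSON']

# A full name is a multi-token entity -> B-PERSON ... L-PERSON
doc2 = nlp.make_doc("My name is David Johnson")
print(offsets_to_biluo_tags(doc2, [(11, 24, "PERSON")]))
# ['O', 'O', 'O', 'B-PERSON', 'L-PERSON']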

And without doing a lot of debugging I can't entirely rule out that something is going wrong with the training, either. I think that training with partial annotation is something that we do internally more than users typically do, so it's possible we've missed a subtle bug.

We have a corpus augmentation option in our corpus reader and the augmenters are here:

https://github.com/explosion/spaCy/blob/6788d90f61a1071c150ee73bc66efaf41a4e8da0/spacy/training/augment.py

There are docs with some examples here: https://spacy.io/usage/training#data-augmentation-custom
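For reference, a custom augmenter roughly follows the pattern from those docs. This is only a sketch along those lines, not a drop-in spaCy component, and "my_lowercase_augmenter.v1" is just an illustrative registry name:

import random
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

@spacy.registry.augmenters("my_lowercase_augmenter.v1")
def create_lowercase_augmenter(level: float) -> Callable[[Language, Example], Iterator[Example]]:
    def augment(nlp: Language, example: Example) -> Iterator[Example]:
        # Always yield the original example, and with probability `level`
        # also yield a lowercased copy that keeps the same annotations.
        yield example
        if random.random() < level:
            example_dict = example.to_dict()
            doc = nlp.make_doc(example.text.lower())
            example_dict["token_annotation"]["ORTH"] = [
                token.lower() for token in example_dict["token_annotation"]["ORTH"]
            ]
            yield example.from_dict(doc, example_dict)
    return augment

# In the training config this could then be referenced as, e.g.:
# [corpora.train.augmenter]
# @augmenters = "my_lowercase_augmenter.v1"
# level = 0.1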

We haven't done a lot of testing related to what kinds/amount of augmentation works best or how to evaluate it well. So far our augmentation just lowers our scores on our dev sets, because the augmented training texts are slightly less similar to the dev sets than the original training texts are.

We also don't currently have any examples of augmentations that change the number of tokens. We wanted to add some whitespace augmentation, but our initial attempt with random augmentation caused problems for the NER annotation (which shouldn't start or end with whitespace), and we stopped working on it at some point.

I think the trickiest part would be aligning all the other annotation layers, so if you're modifying names in a way that changes the number of tokens or the internal structure of the name, you'd also have to adjust the parse and tags, if present. On the other hand, lots of people only work on NER models, so an augmenter that is restricted to NER-only training data could still be quite useful.

In general, contributions are welcome! We'd have to discuss a bit about whether particular augmenters make sense in the core library, potentially in a separate package that we manage to some degree, or as a "spaCy universe" contribution where we're happy to advertise it but don't maintain the package.

@omri374
Contributor
omri374 commented May 6, 2021

Thanks @adrianeboyd. Very interesting points. I'll definitely look into the corpus augmentation option, and I agree this is difficult to evaluate, as the datasets themselves were not augmented prior to being labeled.
While the current augmentation capabilities in spaCy are limited to augmentations that keep the same number of tokens, one can create an augmented dataset a priori and train a model on that (for NER only, as you mention).
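A minimal sketch of that a-priori approach for NER-only data, using the standard DocBin workflow (the template and name lists here are just placeholders):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()

templates = ["My name is {}", "And {} said, Why hast thou troubled us?"]
names = ["David", "Isaac", "Beyonce", "Margaret"]  # placeholder augmentation lists

for template in templates:
    for name in names:
        text = template.format(name)
        doc = nlp.make_doc(text)
        start = text.index(name)
        span = doc.char_span(start, start + len(name), label="PERSON")
        if span is not None:          # char_span returns None if offsets don't align to token boundaries
            doc.ents = [span]
            db.add(doc)

db.to_disk("./ner_augmented.spacy")   # usable as extra NER-only training data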

FYI, a similar evaluation on two Flair models shows that the model trained on OntoNotes achieves significantly lower results on this test. The CoNLL-based model actually does pretty well.

CoNLL-03-based model results:
Model name: ner-english (CONLL)
Name set: Biblical, Template: "My name is {}" Recall: 1.00
Name set: Other, Template: "My name is {}" Recall: 1.00
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?" Recall: 1.00
Name set: Other, Template: "And {} said, Why hast thou troubled us?" Recall: 1.00
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 1.00
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 0.94

OntoNotes based model results:
Model name: ner-english-ontonotes
Name set: Biblical, Template: "My name is {}" Recall: 0.50
Name set: Other, Template: "My name is {}" Recall: 1.00
Name set: Biblical, Template: "And {} said, Why hast thou troubled us?" Recall: 0.00
Name set: Other, Template: "And {} said, Why hast thou troubled us?" Recall: 0.83
Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 0.00
Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}." Recall: 0.00

Hope this helps in any way :)

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@adrianeboyd
Contributor

@omri374: I just had another look at this and I think you were exactly right about the missing entity annotation in some sections of OntoNotes (in particular the Biblical sections) causing problems in the v3 models. I just tracked down bugs in the augmenters causing missing entity annotation to be converted to O in the augmented versions of the data, so the models were being trained with a mix of missing and O annotation for entities that occur in those sections.

Now to retrain the English pipelines and see what happens...

@github-actions (bot) removed the "resolved" (The issue was addressed / answered) label Sep 26, 2022
@adrianeboyd
Contributor

With the draft en_core_web_lg v3.4.1:

Model name: en_core_web_lg
Name set: Biblical, Template: "My name is {}"
Recall: 1.00

Name set: Other, Template: "My name is {}"
Recall: 0.94

Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 1.00

Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.94

Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 1.00

Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 1.00

The draft en_core_web_trf results are still surprisingly bad, so I will keep looking into it before publishing anything:

Model name: en_core_web_trf
Name set: Biblical, Template: "My name is {}"
Recall: 0.67

Name set: Other, Template: "My name is {}"
Recall: 1.00

Name set: Biblical, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.00

Name set: Other, Template: "And {} said, Why hast thou troubled us?"
Recall: 0.33

Name set: Biblical, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.00

Name set: Other, Template: "And she conceived again, a bare a son; and she called his name {}."
Recall: 0.56

@explosion unlocked this conversation Sep 30, 2022
@adrianeboyd
Contributor

After looking into this some more, it appears that training a shared transformer on data where there is partial NER annotation leads to the behavior above with widely varying recall.

We're still looking into exactly why this is happening, but as an initial improvement we've added silver NER annotation to the sections where NER is missing and retrained the pipeline, with 100% recall for the examples above and equivalent overall performance:

TOK      99.93
TAG      97.81
UAS      95.33
LAS      93.99
NER P    89.54
NER R    90.28
NER F    89.91
SENT P   95.86
SENT R   86.99
SENT F   91.21

We'll plan to publish these models as model version v3.4.1, probably at the same time as spacy v3.4.2 is released so we can include all the details in the release notes.
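For anyone wondering what "silver" annotation means in practice: the general idea is to run an already-trained pipeline over the unannotated sections and keep its predictions as (imperfect) training annotation. A rough sketch of that idea, not the exact internal training setup:

import spacy
from spacy.tokens import DocBin

# Let an existing trained model predict entities for sentences that have no
# gold NER, and store those predictions as silver training annotation.
teacher = spacy.load("en_core_web_lg")
db = DocBin()

unannotated_sentences = [
    "And Jacob said unto his father, I am Esau thy firstborn.",  # placeholder example text
]

for doc in teacher.pipe(unannotated_sentences):
    db.add(doc)  # doc.ents now holds the teacher's predicted (silver) entities

db.to_disk("./silver_ner.spacy")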

@adrianeboyd
Contributor

The new models (model version v3.4.1) have been published alongside spacy v3.4.2. They will be compatible with any spacy v3.4.x release.

@adrianeboyd added the "resolved" (The issue was addressed / answered) label Oct 20, 2022
@omri374
Contributor
omri374 commented Oct 20, 2022

Thank you @adrianeboyd!

@github-actions (bot) removed the "resolved" label Oct 20, 2022
@polm added the "resolved" label Oct 21, 2022
@github-actions
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@github-actions (bot) removed the "resolved" label Oct 28, 2022
@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions (bot) locked as resolved and limited conversation to collaborators Nov 28, 2022