
Implement flat:name...OR NOT #468

Closed
2 of 4 tasks
nschneid opened this issue Nov 6, 2023 · 12 comments

Comments

@nschneid
Contributor
nschneid commented Nov 6, 2023

See also #81

@amir-zeldes
Contributor

You're adding :name to EWT? I'm not sure I'm so keen on this, because then we are inflating the relation inventory again, and I think non-name will be rather rare. That would be bad for training parsers, plotting confmats, and everything else I raised RE nsubj:outer. Do you know how many cases of non-name flats we would have if we make this split?

@nschneid
Contributor Author
nschneid commented Nov 6, 2023

Didn't we say flat:name was a universal recommendation? I wouldn't want English to be incompatible with other languages... As for non-name flats, if we go by non-PROPN words, I'm seeing 74 in GUM and 200 in EWT.

@amir-zeldes
Contributor

Hm, 74 and 200... that makes it even rarer than nsubj:outer (180 in GUM, which I opposed among other reasons due to rarity) or orphan (92). There are other labels which are rarer (csubj:pass at 18, for example, and csubj:outer at a whopping 6), so it's not totally unheard of, but it is quite low, and I don't love introducing super rare labels which complicate things for parsing.

I know we want to be linguistically faithful, but supporting parsing is an explicit goal of UD and it seems a bit irresponsible to add labels we don't badly need in this way. How big do we want the label set to be? (currently 52 types)
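For readers who want to reproduce label counts like the ones discussed above, here is a minimal sketch that tallies DEPREL values in CoNLL-U text and lists the labels below a chosen threshold. The function names `deprel_counts` and `rare_labels`, the threshold, and the toy sentence are assumptions made for illustration; they are not taken from GUM, EWT, or any UD tooling.

```python
from collections import Counter

def deprel_counts(conllu_text):
    """Count DEPREL values over the basic word lines of a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep only ordinary words: 10 columns and a plain integer ID
        # (this skips multiword-token ranges like 1-2 and empty nodes like 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1  # column 8 is DEPREL
    return counts

def rare_labels(counts, threshold):
    """Return (label, count) pairs whose count is below threshold, sorted by label."""
    return sorted((label, n) for label, n in counts.items() if n < threshold)

# Toy sentence invented for illustration only.
SAMPLE = "\n".join([
    "# text = John Smith left",
    "\t".join(["1", "John", "John", "PROPN", "NNP", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "Smith", "Smith", "PROPN", "NNP", "_", "1", "flat", "_", "_"]),
    "\t".join(["3", "left", "leave", "VERB", "VBD", "_", "0", "root", "_", "_"]),
])

if __name__ == "__main__":
    print(rare_labels(deprel_counts(SAMPLE), threshold=2))
```

To run this on a real treebank, one would read the `.conllu` file into a string and pass it to `deprel_counts`; whether a given threshold counts as "too rare" is exactly the judgment call under debate here.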

@nschneid
Contributor Author
nschneid commented Nov 8, 2023

I don't know that flat:name would actually be hard for parsers because it is largely redundant with flat + PROPN. OTOH, that raises the question of why we need the subtype. It seems like it could be useful crosslinguistically if there is a wider range of non-name constructions in certain languages that correspond to flat. Perhaps @dan-zeman would like to weigh in?

@dan-zeman
Member

https://lindat.mff.cuni.cz/services/teitok/ud212/index.php?action=cqp

[deprel = "flat:name" & upos != "PROPN"] within text

There are 8683 results across UD 2.12. They can be SYM, NUM, NOUN, CCONJ, X, ADJ, ADP, DET... I'm not saying all of them are good examples of flat. Some of them should probably be annotated differently. But even then it won't mean that flat:name can be deduced from flat + PROPN.

There is a relatively long list of flat subtypes used in various UD treebanks. Other treebanks may use flat for various constructions and we may not see it because they don't distinguish them by subtypes.
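The CQP query quoted above can be approximated offline on a local CoNLL-U file. The sketch below selects tokens whose DEPREL matches a given label and whose UPOS is not PROPN, then groups the hits by UPOS. The function name `non_propn_by_upos` and the two-token sample are hypothetical; the sample is invented and does not come from any UD 2.12 treebank.

```python
from collections import Counter

def non_propn_by_upos(conllu_text, deprel="flat:name"):
    """Group tokens with the given DEPREL and a non-PROPN UPOS by their UPOS tag."""
    hits = Counter()
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Ordinary word lines have 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos, rel = cols[3], cols[7]
        if rel == deprel and upos != "PROPN":
            hits[upos] += 1
    return hits

# Invented example: a number attached to a name head with flat:name.
SAMPLE = "\n".join([
    "\t".join(["1", "Apollo", "Apollo", "PROPN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "11", "11", "NUM", "_", "_", "1", "flat:name", "_", "_"]),
])

if __name__ == "__main__":
    print(non_propn_by_upos(SAMPLE))
```

Passing `deprel="flat"` instead gives the complementary view for treebanks that do not use the subtype, which is roughly the "non-PROPN flat" count mentioned earlier in the thread.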

@amir-zeldes
Contributor

I'm not necessarily saying it's trivial, but I am wondering whether we have some obligation to try to keep the recommended tagset small and ideally free of very rare categories. Non :name flat in GUM would be reduced to just 70 cases by this label split, and the way that I was taught to build tagsets was to make important distinctions, but also avoid ones that would have too few occurrences.

For example, we have NER tags for PERSON and ANIMAL, and they are sometimes applied to supernatural beings like "angels/PERSON" or "virus/ANIMAL", and I think most people would agree that angels and viruses are not quite PERSON and ANIMAL - we just lump them in there because splitting off another category for such cases would be very sparse.

Adding lots of small and subtle label distinctions makes many things complicated: parser training, any evaluation of labels involving confusion matrices, teaching guidelines to students, and even annotation interfaces. Right now when annotating deprels, the drop down of possible values already doesn't fit on one screen. And when every year for teaching UD I have to cram in a few more distinctions, it really overloads what the students can reasonably get a grip on in the space of time I have to teach UD. So for me that raises the question for any proposed label split: is it worth it?

@nschneid
Contributor Author
nschneid commented Nov 8, 2023

Looking at @dan-zeman's query results: From what I can tell, a lot of these are cases where the head if not the dependent would likely be PROPN (e.g. numbers in addresses or brand names, or "Ó" in Irish names tagged as PART); years in citations (whether a full citation is a name or even flat is debatable); and words of foreign names tagged as X.

Foreign names are perhaps the best argument for flat:name—if it is in fact correct to tag the words as X rather than PROPN (cf. #440). Perhaps the English corpora have fewer foreign names tagged as X than many of the non-English corpora. For Foreign=Yes tokens, GUM uses a mix of X and PROPN, as does Irish IDT, whereas Czech-PDT, German-HDT, and Hebrew-IAHLTwiki are pretty much exclusively X (to choose just a small sample of treebanks).
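The X-versus-PROPN mix for Foreign=Yes tokens described above could be checked per treebank with something like the following sketch. It reads the FEATS column of CoNLL-U lines and counts UPOS tags for tokens carrying `Foreign=Yes`. The function name `foreign_upos_mix` and the sample tokens are assumptions for illustration and do not reproduce data from GUM, IDT, PDT, HDT, or IAHLTwiki.

```python
from collections import Counter

def foreign_upos_mix(conllu_text):
    """Count UPOS tags among tokens whose FEATS include Foreign=Yes."""
    mix = Counter()
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Ordinary word lines have 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        feats = cols[5].split("|")  # column 6 is FEATS, pipe-separated
        if "Foreign=Yes" in feats:
            mix[cols[3]] += 1  # column 4 is UPOS
    return mix

# Invented tokens: two foreign words with different UPOS, one native word.
SAMPLE = "\n".join([
    "\t".join(["1", "La", "La", "X", "_", "Foreign=Yes", "0", "root", "_", "_"]),
    "\t".join(["2", "Scala", "Scala", "PROPN", "_", "Foreign=Yes|Number=Sing", "1", "flat", "_", "_"]),
    "\t".join(["3", "opened", "open", "VERB", "_", "Tense=Past", "1", "dep", "_", "_"]),
])

if __name__ == "__main__":
    print(foreign_upos_mix(SAMPLE))
```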

@amir-zeldes
Contributor

Well, the issue is not that foreign names are rare, and in any case, the proposal is to give them the same deprel as non-foreign names. The issue is that the remaining cases of flat-not-name would be very very rare, making the label hard to justify for me. We already have a very high cognitive load in the label space for annotators to track, and those drop downs are starting to look ridiculous - they mess with the browser windows and cause things to scroll out of sight.

These are technical annoyances, to be sure, but I think they are a symptom of essentially having too many subtypes as part of the UD deprel annotation task (you'll notice I'm much more tolerant of FEATs, because we assign them automatically, their annotation is non-relational and therefore easy, and they basically constitute a separate, though not totally independent task). I think deprels should not need so many values, since we don't have separate labels for each tense/morphological class, but somehow we now have many more deprels than even xpos in English... In fact I would be for eliminating some of them, not expanding (I'm looking at :tmod and list in particular, and you know my opinion about nsubj/csubj:outer)

@jnivre
jnivre commented Nov 8, 2023

As pointed out by Dan previously, the main reason why both “flat:name” and “flat:foreign” exist (and are mentioned in the guidelines) is that there were distinct relations “name” and “foreign” in version 1. When we decided that these were superfluous and could be subsumed under the new “flat” relation, we simply suggested that people who were unhappy about losing this information could use subtypes to preserve it.

In the case of “name”, the situation is complicated by the fact that we also changed the general guidelines concerning names into the current recommendation that names that have internal syntactic structure should be annotated accordingly. Therefore, many complex names, in particular song, book and movie titles (like “She loves you”, “The old man and the sea”, and “Some like it hot”), which used to be annotated with the “name” relation in v1 (at least in some treebanks), should not use the new “flat” relation. The combined effect of these changes is that the “:name” subtype is almost redundant, as observed by Nathan and Amir, and it is only a small subset of all names (in the wider sense) that can be retrieved by looking for that specific subtype (and many of these could also be retrieved by looking for PROPN tags). I guess we didn’t quite see this effect when we launched v2 (at least I don’t remember any discussions about this).

Hence, I am not opposed to dropping the recommendation to use “flat:name”, but since all subtypes are optional, it seems hard to do anything stronger than that.

@nschneid
Contributor Author
nschneid commented Nov 9, 2023 via email

@jnivre
jnivre commented Nov 9, 2023 via email

@nschneid
Contributor Author
nschneid commented Dec 25, 2023

Not implementing :name subtype. Done with the clear errors. Other potential improvements to the use of flat are in linked issues.


4 participants