Implement flat:name...OR NOT #468

nschneid · 2023-11-06T05:41:56Z

PROPN-[flat]->non-PROPN: many are things like "Bush administration", which should be compound
- excluding numbered entities: https://universal.grew.fr/?custom=6588515c63838

non-PROPN-[flat]->PROPN (the reverse): mostly errors

PROPN-[flat]->PROPN: includes many titles (compound/flat inconsistency #59), e.g. President Bush, the Prophet Mohammed/the Rev. Gerald Robinson (should be appos?)

See also #81
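
For anyone who wants to reproduce the PROPN-[flat]->non-PROPN pattern on a local CoNLL-U file rather than through Grew-match, here is a minimal Python sketch. It is not the query behind the link above; the function names are made up for illustration, and it assumes standard 10-column CoNLL-U input and treats any subtype of flat as flat.

```python
from typing import Iterator, List, Tuple


def read_sentences(conllu_text: str) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of 10-column rows.

    Comment lines, multiword-token ranges (IDs like 1-2) and empty nodes
    (IDs like 1.1) are skipped.
    """
    rows: List[List[str]] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if rows:
                yield rows
                rows = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        rows.append(cols)
    if rows:
        yield rows


def propn_flat_non_propn(conllu_text: str) -> List[Tuple[str, str, str]]:
    """Return (head form, dependent form, dependent UPOS) for every flat
    edge whose head is PROPN and whose dependent is not."""
    hits = []
    for sent in read_sentences(conllu_text):
        by_id = {row[0]: row for row in sent}
        for row in sent:
            form, upos, head, deprel = row[1], row[3], row[6], row[7]
            # Assumption: compare on the base relation, so flat:foreign
            # etc. are included as well.
            if deprel.split(":")[0] != "flat":
                continue
            head_row = by_id.get(head)
            if head_row is not None and head_row[3] == "PROPN" and upos != "PROPN":
                hits.append((head_row[1], form, upos))
    return hits
```

Calling `propn_flat_non_propn(open(path, encoding="utf-8").read())` on a treebank file gives a list that can be reviewed by hand; candidates like "Bush administration" would show up as `("Bush", "administration", "NOUN")`.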

amir-zeldes · 2023-11-06T15:18:01Z

You're adding :name to EWT? I'm not sure I'm so keen on this, because then we are inflating the relation inventory again, and I think non-name will be rather rare. That would be bad for training parsers, plotting confmats, and everything else I raised RE nsubj:outer. Do you know how many cases of non-name flats we would have if we make this split?

nschneid · 2023-11-06T16:35:53Z

Didn't we say flat:name was a universal recommendation? I wouldn't want English to be incompatible with other languages... As for non-name flats, if we go by non-PROPN words, I'm seeing 74 in GUM and 200 in EWT.

amir-zeldes · 2023-11-07T20:48:31Z

Hm, 74 and 200... that makes it even rarer than nsubj:outer (180 in GUM, which I opposed among other reasons due to rarity) or orphan (92). There are other labels which are rarer (csubj:pass at 18, for example, and csubj:outer at a whopping 6), so it's not totally unheard of, but it is quite low, and I don't love introducing super rare labels which complicate things for parsing.

I know we want to be linguistically faithful, but supporting parsing is an explicit goal of UD and it seems a bit irresponsible to add labels we don't badly need in this way. How big do we want the label set to be? (currently 52 types)
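
The rarity comparison above rests on per-label counts. A rough way to get such counts from a CoNLL-U file is sketched below in Python; the helper names and the threshold are illustrative assumptions, and the figures quoted in this thread were not produced with this script.

```python
from collections import Counter
from typing import List, Tuple


def deprel_counts(conllu_text: str) -> Counter:
    """Count DEPREL values over ordinary token lines of CoNLL-U text."""
    counts: Counter = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def rare_labels(counts: Counter, threshold: int) -> List[Tuple[str, int]]:
    """Return (label, count) pairs with count below threshold,
    rarest first, ties broken alphabetically."""
    rare = [(label, n) for label, n in counts.items() if n < threshold]
    return sorted(rare, key=lambda pair: (pair[1], pair[0]))
```

With a treebank loaded into a string, `rare_labels(deprel_counts(text), 200)` would list the labels that fall under a chosen cutoff, and `len(deprel_counts(text))` gives the size of the label set actually in use.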

nschneid · 2023-11-08T03:10:22Z

I don't know that flat:name would actually be hard for parsers because it is largely redundant with flat + PROPN. OTOH, that raises the question of why we need the subtype. It seems like it could be useful crosslinguistically if there is a wider range of non-name constructions in certain languages that correspond to flat. Perhaps @dan-zeman would like to weigh in?

dan-zeman · 2023-11-08T09:34:53Z

https://lindat.mff.cuni.cz/services/teitok/ud212/index.php?action=cqp

[deprel = "flat:name" & upos != "PROPN"] within text

There are 8683 results across UD 2.12. They can be SYM, NUM, NOUN, CCONJ, X, ADJ, ADP, DET... I'm not saying all of them are good examples of flat. Some of them should probably be annotated differently. But even then it won't mean that flat:name can be deduced from flat + PROPN.
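
For a single treebank file, the token-level condition of the query above (deprel is flat:name and upos is not PROPN) can be approximated without the TEITOK service. The Python sketch below is only an approximation under the assumption of standard 10-column CoNLL-U input; the function name is hypothetical, and it will not reproduce the 8683 figure, which spans all of UD 2.12.

```python
from collections import Counter


def flat_name_non_propn(conllu_text: str, deprel: str = "flat:name") -> Counter:
    """Tally the UPOS of tokens attached with `deprel` whose UPOS is not PROPN."""
    tally: Counter = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        upos, rel = cols[3], cols[7]
        if rel == deprel and upos != "PROPN":
            tally[upos] += 1
    return tally
```

`flat_name_non_propn(text).most_common()` then shows which tags (SYM, NUM, NOUN, X and so on) account for the non-PROPN cases in that file.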

There is a relatively long list of flat subtypes used in various UD treebanks. Other treebanks may use flat for various constructions and we may not see it because they don't distinguish them by subtypes.

amir-zeldes · 2023-11-08T14:51:44Z

I'm not necessarily saying it's trivial, but I am wondering whether we have some obligation to try to keep the recommended tagset small and ideally free of very rare categories. Non :name flat in GUM would be reduced to just 70 cases by this label split, and the way that I was taught to build tagsets was to make important distinctions, but also avoid ones that would have too few occurrences.

For example, we have NER tags for PERSON and ANIMAL, and they are sometimes applied to supernatural beings like "angels/PERSON" or "virus/ANIMAL", and I think most people would agree that angels and viruses are not quite PERSON and ANIMAL - we just lump them in there because splitting off another category for such cases would be very sparse.

Adding lots of small and subtle label distinctions makes many things complicated: parser training, any evaluation of labels involving confusion matrices, teaching guidelines to students, and even annotation interfaces. Right now when annotating deprels, the drop down of possible values already doesn't fit on one screen. And when every year for teaching UD I have to cram in a few more distinctions, it really overloads what the students can reasonably get a grip on in the space of time I have to teach UD. So for me that raises the question for any proposed label split: is it worth it?

nschneid · 2023-11-08T15:04:12Z

Looking at @dan-zeman's query results: From what I can tell, a lot of these are cases where the head if not the dependent would likely be PROPN (e.g. numbers in addresses or brand names, or "Ó" in Irish names tagged as PART); years in citations (whether a full citation is a name or even flat is debatable); and words of foreign names tagged as X.

Foreign names are perhaps the best argument for flat:name—if it is in fact correct to tag the words as X rather than PROPN (cf. #440). Perhaps the English corpora have fewer foreign names tagged as X than many of the non-English corpora. For Foreign=Yes tokens, GUM uses a mix of X and PROPN, as does Irish IDT, whereas Czech-PDT, German-HDT, and Hebrew-IAHLTwiki are pretty much exclusively X (to choose just a small sample of treebanks).

amir-zeldes · 2023-11-08T15:34:42Z

Well, the issue is not that foreign names are rare, and in any case, the proposal is to give them the same deprel as non-foreign names. The issue is that the remaining cases of flat-not-name would be very very rare, making the label hard to justify for me. We already have a very high cognitive load in the label space for annotators to track, and those drop downs are starting to look ridiculous - they mess with the browser windows and cause things to scroll out of sight.

These are technical annoyances, to be sure, but I think they are a symptom of essentially having too many subtypes as part of the UD deprel annotation task (you'll notice I'm much more tolerant of FEATs, because we assign them automatically, their annotation is non-relational and therefore easy, and they basically constitute a separate, though not totally independent task). I think deprels should not need so many values, since we don't have separate labels for each tense/morphological class, but somehow we now have many more deprels than even xpos in English... In fact I would be for eliminating some of them, not expanding (I'm looking at :tmod and list in particular, and you know my opinion about nsubj/csubj:outer)

jnivre · 2023-11-08T15:48:32Z

As pointed out by Dan previously, the main reason why both “flat:name” and “flat:foreign” exist (and are mentioned in the guidelines) is that there were distinct relations “name” and “foreign” in version 1. When we decided that these were superfluous and could be subsumed under the new “flat” relation, we simply suggested that people who were unhappy about losing this information could use subtypes to preserve it.

In the case of “name”, the situation is complicated by the fact that we also changed the general guidelines concerning names into the current recommendation that names that have internal syntactic structure should be annotated accordingly. Therefore, many complex names, in particular song, book and movie titles (like “She loves you”, “The old man and the sea”, and “Some like it hot”), which used to be annotated with the “name” relation in v1 (at least in some treebanks), should not use the new “flat” relation. The combined effect of these changes is that the “:name” subtype is almost redundant, as observed by Nathan and Amir, and it is only a small subset of all names (in the wider sense) that can be retrieved by looking for that specific subtype (and many of these could also be retrieved by looking for PROPN tags). I guess we didn’t quite see this effect when we launched v2 (at least I don’t remember any discussions about this).

Hence, I am not opposed to dropping the recommendation to use “flat:name”, but since all subtypes are optional, it seems hard to do anything stronger than that.

nschneid · 2023-11-09T03:15:08Z

How about we simply avoid any recommendations regarding *flat* subtypes, but instead link to https://universaldependencies.org/ext-dep-index.html#flat and note that in many treebanks that predated v2, the v1 *name* and *foreign* relations were converted to *flat:name* and *flat:foreign* respectively?

jnivre · 2023-11-09T07:14:12Z

SGTM

nschneid · 2023-12-25T16:26:08Z

Not implementing :name subtype. Done with the clear errors. Other potential improvements to the use of flat are in linked issues.

nschneid changed the title ~~Implement flat:name~~ Implement flat:name...OR NOT Nov 9, 2023

nschneid mentioned this issue Nov 9, 2023

Update guidelines for fixed, flat, compound and be more careful about the term "multiword expression" UniversalDependencies/docs#989

Closed

nschneid added a commit that referenced this issue Nov 16, 2023

flat fixes (#468, #469); nmod:desc for "Aunt/Uncle NAME"

e0f59c6

nschneid added a commit that referenced this issue Dec 25, 2023

PROPN+NOUN: compound not flat (#468)

860fc98

nschneid added a commit that referenced this issue Dec 25, 2023

neaten.py: warn about PROPN-[flat]->NOUN; depedit script (#468)

547b675

nschneid closed this as completed Dec 25, 2023
