Implement flat:name...OR NOT #468
Comments
You're adding |
Didn't we say |
Hm, 74 and 200... that makes it even rarer than nsubj:outer (180 in GUM, which I opposed among other reasons due to rarity) or orphan (92). There are other labels which are rarer (csubj:pass at 18, for example, and csubj:outer at a whopping 6), so it's not totally unheard of, but it is quite low, and I don't love introducing super rare labels which complicate things for parsing. I know we want to be linguistically faithful, but supporting parsing is an explicit goal of UD and it seems a bit irresponsible to add labels we don't badly need in this way. How big do we want the label set to be? (currently 52 types) |
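For anyone who wants to reproduce label counts like the ones quoted above, here is a minimal sketch. It assumes plain CoNLL-U input with ten tab-separated columns and DEPREL in column 8; `count_deprels`, `row`, and the three-token sample are made up for illustration and are not part of any UD tool or treebank.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count how often each DEPREL value occurs in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence-level comments and blank sentence separators.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def row(i, form, upos, head, deprel):
    # Hypothetical helper: builds one CoNLL-U token line, unused columns set to "_".
    return "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Invented sample sentence, not taken from GUM or any other treebank.
sample = "\n".join([
    "# text = Mary Smith sleeps",
    row(1, "Mary", "PROPN", 3, "nsubj"),
    row(2, "Smith", "PROPN", 1, "flat"),
    row(3, "sleeps", "VERB", 0, "root"),
])

# Print labels from rarest to most frequent.
for label, n in sorted(count_deprels(sample).items(), key=lambda kv: kv[1]):
    print(label, n)
```

Subtypes are counted as separate labels here (so `flat` and `flat:name` would be two entries), which matches how the per-label figures are compared in this thread.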
I don't know that flat:name would actually be hard for parsers because it is largely redundant with flat + PROPN. OTOH, that raises the question of why we need the subtype. It seems like it could be useful crosslinguistically if there are a wider range of non-name constructions in certain languages that correspond to flat. Perhaps @dan-zeman would like to weigh in? |
https://lindat.mff.cuni.cz/services/teitok/ud212/index.php?action=cqp
There are 8683 results across UD 2.12. They can be There is a relatively long list of flat subtypes used in various UD treebanks. Other treebanks may use |
I'm not necessarily saying it's trivial, but I am wondering whether we have some obligation to try to keep the recommended tagset small and ideally free of very rare categories. Non For example, we have NER tags for PERSON and ANIMAL, and they are sometimes applied to supernatural beings like "angels/PERSON" or "virus/ANIMAL", and I think most people would agree that angels and viruses are not quite PERSON and ANIMAL - we just lump them in there because splitting off another category for such cases would be very sparse. Adding lots of small and subtle label distinctions makes many things complicated: parser training, any evaluation of labels involving confusion matrices, teaching guidelines to students, and even annotation interfaces. Right now when annotating deprels, the drop down of possible values already doesn't fit on one screen. And when every year for teaching UD I have to cram in a few more distinctions, it really overloads what the students can reasonably get a grip on in the space of time I have to teach UD. So for me that raises the question for any proposed label split: is it worth it? |
Looking at @dan-zeman's query results: From what I can tell, a lot of these are cases where the head if not the dependent would likely be PROPN (e.g. numbers in addresses or brand names, or "Ó" in Irish names tagged as PART); years in citations (whether a full citation is a name or even flat is debatable); and words of foreign names tagged as X. Foreign names are perhaps the best argument for flat:name—if it is in fact correct to tag the words as X rather than PROPN (cf. #440). Perhaps the English corpora have fewer foreign names tagged as X than many of the non-English corpora. For |
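One rough way to check how far flat overlaps with PROPN is to classify every flat edge by the UPOS of its dependent and its head. The sketch below assumes standard ten-column CoNLL-U with blank lines between sentences; `classify_flat`, `row`, and the two sample sentences are invented for illustration and do not reflect how any particular treebank annotates these phrases.

```python
from collections import Counter


def classify_flat(conllu_text):
    """Classify each flat(:subtype) edge by whether dependent or head is PROPN."""
    result = Counter()
    # Blank lines separate sentences in CoNLL-U.
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}  # token ID -> (UPOS, HEAD, DEPREL)
        for line in block.splitlines():
            cols = line.split("\t")
            # Skip comments, malformed lines, multiword ranges and empty nodes.
            if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
                continue
            tokens[cols[0]] = (cols[3], cols[6], cols[7])
        for upos, head, deprel in tokens.values():
            # Treat flat and all of its subtypes (flat:name, flat:foreign, ...) alike.
            if deprel.split(":")[0] != "flat":
                continue
            head_upos = tokens.get(head, ("_",))[0]
            if upos == "PROPN":
                result["dependent PROPN"] += 1
            elif head_upos == "PROPN":
                result["head PROPN only"] += 1
            else:
                result["neither PROPN"] += 1
    return result


def row(i, form, upos, head, deprel):
    # Hypothetical helper: builds one CoNLL-U token line, unused columns set to "_".
    return "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Invented sample: a personal name and a name containing a number.
sample = "\n".join([
    row(1, "Mary", "PROPN", 0, "root"),
    row(2, "Smith", "PROPN", 1, "flat:name"),
    "",
    row(1, "Route", "PROPN", 0, "root"),
    row(2, "66", "NUM", 1, "flat"),
])

print(classify_flat(sample))
```

Edges that land in the "neither PROPN" bucket are the ones where a subtype would carry information that the tags alone do not, for instance foreign names tagged X.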
Well, the issue is not that foreign names are rare, and in any case, the proposal is to give them the same deprel as non-foreign names. The issue is that the remaining cases of flat-not-name would be very very rare, making the label hard to justify for me. We already have a very high cognitive load in the label space for annotators to track, and those drop downs are starting to look ridiculous - they mess with the browser windows and cause things to scroll out of sight. These are technical annoyances, to be sure, but I think they are a symptom of essentially having too many subtypes as part of the UD deprel annotation task (you'll notice I'm much more tolerant of FEATs, because we assign them automatically, their annotation is non-relational and therefore easy, and they basically constitute a separate, though not totally independent task). I think deprels should not need so many values, since we don't have separate labels for each tense/morphological class, but somehow we now have many more deprels than even xpos in English... In fact I would be for eliminating some of them, not expanding (I'm looking at |
As pointed out by Dan previously, the main reason why both “flat:name” and “flat:foreign” exist (and are mentioned in the guidelines) is that there were distinct relations “name” and “foreign” in version 1. When we decided that these were superfluous and could be subsumed under the new “flat” relation, we simply suggested that people who were unhappy about losing this information could use subtypes to preserve it. In the case of “name”, the situation is complicated by the fact that we also changed the general guidelines concerning names into the current recommendation that names that have internal syntactic structure should be annotated accordingly. Therefore, many complex names, in particular song, book and movie titles (like “She loves you”, “The old man and the sea”, and “Some like it hot”), which used to be annotated with the “name” relation in v1 (at least in some treebanks), should not use the new “flat” relation. The combined effect of these changes is that the “:name” subtype is almost redundant, as observed by Nathan and Amir, and it is only a small subset of all names (in the wider sense) that can be retrieved by looking for that specific subtype (and many of these could also be retrieved by looking for PROPN tags). I guess we didn’t quite see this effect when we launched v2 (at least I don’t remember any discussions about this). Hence, I am not opposed to dropping the recommendation to use “flat:name”, but since all subtypes are optional, it seems hard to do anything stronger than that. |
How about we simply avoid any recommendations regarding *flat* subtypes, but instead link to https://universaldependencies.org/ext-dep-index.html#flat and note that in many treebanks that predated v2, the v1 *name* and *foreign* relations were converted to *flat:name* and *flat:foreign* respectively?
…On Wed, Nov 8, 2023 at 9:48 AM Joakim Nivre wrote: [quoted copy of the preceding comment]
|
SGTM
|
Not implementing |
compound
appos
?)See also #81