Looking at hyphenated compounds, there are several ways that English treebanks annotate these, sometimes inconsistently within the same treebank and across treebanks.
12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No
13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No
14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _
15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _
my understanding is that this should be:
12 Indo- Indo- X AFX Hyph=Yes 14 compound 14:compound SpaceAfter=No
13 Sri Sri PROPN NNP Number=Sing 14 compound 14:compound _
14 Lanka Lanka PROPN NNP Number=Sing 16 compound 16:compound _
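The retokenization proposed above can be sketched as a small transformation. This is only an illustration, not a tool from any treebank: `merge_hyphen_into_affix` is a hypothetical helper, and it assumes plain integer token IDs (no multiword ranges or empty nodes), that nothing depends on the hyphen token, and that the merged token should inherit the hyphen's MISC field.

```python
def merge_hyphen_into_affix(rows):
    """Fold a HYPH token into the preceding AFX token and renumber.

    rows: 10-field CoNLL-U token lines (lists of str) from one sentence.
    Assumes integer IDs only and that no token depends on the hyphen.
    """
    rows = [list(r) for r in rows]  # work on a copy
    dropped = set()
    for prev, cur in zip(rows, rows[1:]):
        # An affix followed by a hyphen that is attached to it.
        if prev[4] == "AFX" and cur[4] == "HYPH" and cur[6] == prev[0]:
            prev[1] += cur[1]  # FORM:  Indo + -
            prev[2] += cur[2]  # LEMMA: Indo + -
            feats = [] if prev[5] == "_" else prev[5].split("|")
            if "Hyph=Yes" not in feats:
                feats.append("Hyph=Yes")
            prev[5] = "|".join(sorted(feats, key=str.lower))
            prev[9] = cur[9]   # assumption: keep the hyphen's MISC
            dropped.add(int(cur[0]))

    def shift(token_id):
        # Subtract the number of removed tokens that preceded this ID.
        n = int(token_id)
        return str(n - sum(1 for d in dropped if d < n))

    out = []
    for r in rows:
        if int(r[0]) in dropped:
            continue
        r[0] = shift(r[0])  # ID
        r[6] = shift(r[6])  # HEAD
        if r[8] != "_":     # DEPS must follow the same renumbering
            deps = (d.split(":", 1) for d in r[8].split("|"))
            r[8] = "|".join(shift(h) + ":" + rel for h, rel in deps)
        out.append(r)
    return out


fragment = [line.split() for line in [
    "12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No",
    "13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No",
    "14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _",
    "15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _",
]]
merged = merge_hyphen_into_affix(fragment)
for row in merged:
    print("\t".join(row))
```

On this fragment the sketch yields Indo- as token 12 with Hyph=Yes, renumbers Sri and Lanka to 13 and 14, and shifts HEAD and DEPS together so the two columns stay in agreement.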
Hyph=Yes is indeed meant for the first part of such compounds when they are separate tokens and their form differs from that of the independent word. But it does not specify what should be done with tokenization, that is, whether the hyphen should be part of the form or a separate token. We use Hyph=Yes in Czech, but we don't include the hyphen in the token that contains the prefix and gets the feature.
AFAIK, the actual convention for AFX in LDC corpora is not like in EWT - in OntoNotes, it is only used for the same situations that Dan is referring to, where the affix 'word' is a separate token due to spacing, e.g.:
pro-/AFX and anti-abortionists/NNS (wsj_0290)
As the second noun demonstrates, the standard has been to not separate prefixes like anti- when they are spelled together, and GENTLE (and the other GU corpora) follows this standard.
Keeping the hyphen within the AFX token makes logical sense to me. I checked the EWT source trees from LDC and they do have the separated HYPH tokens, so either they changed their standard or didn't apply it consistently. There are very few AFX tokens with hyphens in EWT—I only see about 5.
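A count like the one mentioned above could be reproduced roughly as follows. This is a sketch under assumptions, not the procedure actually used: `count_afx_hyphen_patterns` and `tabbed` are hypothetical helpers, the input is assumed to be tab-separated CoNLL-U text, and the sample data is just the two Indo-Sri Lanka variants from this thread, not real treebank counts.

```python
from collections import Counter


def count_afx_hyphen_patterns(conllu_text):
    """Classify every XPOS=AFX token by how its hyphen is tokenized."""
    counts = Counter()
    sentence = []
    # The trailing "" guarantees the last sentence is flushed.
    for line in conllu_text.splitlines() + [""]:
        if line.startswith("#"):
            continue  # sentence-level comment
        if line.strip():
            cols = line.split("\t")
            if cols[0].isdigit():  # skip multiword ranges and empty nodes
                sentence.append(cols)
            continue
        # Blank line: the sentence is complete, so inspect it.
        for i, tok in enumerate(sentence):
            if tok[4] != "AFX":
                continue
            nxt = sentence[i + 1] if i + 1 < len(sentence) else None
            if tok[1].endswith("-"):
                counts["hyphen_in_token"] += 1
            elif nxt is not None and nxt[4] == "HYPH":
                counts["separate_hyph_token"] += 1
            else:
                counts["no_hyphen"] += 1
        sentence = []
    return counts


def tabbed(block):
    # Convert the space-separated fragments quoted here into tab-separated lines.
    return "\n".join("\t".join(l.split()) for l in block.strip().splitlines())


separate = tabbed("""
12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No
13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No
14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _
""")
attached = tabbed("""
12 Indo- Indo- X AFX Hyph=Yes 14 compound 14:compound SpaceAfter=No
13 Sri Sri PROPN NNP Number=Sing 14 compound 14:compound _
""")
counts = count_afx_hyphen_patterns(separate + "\n\n" + attached + "\n")
print(dict(counts))
```

To run it on a real treebank, the text of a `.conllu` file would be passed in place of the sample string.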
I'm basing this on https://universaldependencies.org/u/feat/Hyph.html.
Indo-Sri Lanka
EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:
my understanding is that this should be:
This should also apply to Anglo-Saxon, etc.
Proto-Indo-European
GENTLE sent_id GENTLE_dictionary_school-8
my understanding is that this should be:
This should also apply to pro-Muslim, anti-Semite, etc., with the pro-, anti-, etc. modifiers being their own AFX tokens.