Tokenization of hyphenated forms in English #1002

rhdunn · 2023-12-03T17:15:33Z

English treebanks annotate hyphenated compounds in several ways, sometimes inconsistently both within a single treebank and across treebanks.

I'm basing this on https://universaldependencies.org/u/feat/Hyph.html.

Indo-Sri Lanka

EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:

12	Indo	Indo	X	AFX	_	15	compound	15:compound	SpaceAfter=No
13	-	-	PUNCT	HYPH	_	12	punct	12:punct	SpaceAfter=No
14	Sri	Sri	PROPN	NNP	Number=Sing	15	compound	15:compound	_
15	Lanka	Lanka	PROPN	NNP	Number=Sing	17	compound	17:compound	_

My understanding is that this should be:

12	Indo-	Indo-	X	AFX	Hyph=Yes	14	compound	14:compound	SpaceAfter=No
13	Sri	Sri	PROPN	NNP	Number=Sing	14	compound	14:compound	_
14	Lanka	Lanka	PROPN	NNP	Number=Sing	16	compound	16:compound	_

This should also apply to Anglo-Saxon, etc.
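
The merge proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not an official UD tool: the function name `merge_afx_hyph` is made up, token IDs are assumed to be plain integers (no multiword tokens or empty nodes), and no other token is assumed to depend on the removed hyphen. HEAD and DEPS are renumbered together so that they stay consistent.

```python
def merge_afx_hyph(rows):
    """Merge each AFX token with the HYPH token that follows and attaches to it.

    `rows` is a list of 10-column CoNLL-U token rows (lists of strings).
    Returns new rows with ID, HEAD and DEPS renumbered.
    """
    # IDs of the hyphen tokens that will be folded into the preceding AFX token.
    drop = {cur[0] for prev, cur in zip(rows, rows[1:])
            if prev[4] == "AFX" and cur[4] == "HYPH" and cur[6] == prev[0]}

    def renumber(tok_id):
        # Shift an ID down by the number of dropped tokens that precede it.
        if not tok_id.isdigit():
            return tok_id
        return str(int(tok_id) - sum(int(d) < int(tok_id) for d in drop))

    out = []
    for i, row in enumerate(rows):
        if row[0] in drop:
            continue
        row = list(row)
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if nxt is not None and nxt[0] in drop:
            row[1] += nxt[1]  # FORM: Indo + - -> Indo-
            row[2] += nxt[2]  # LEMMA
            feats = [] if row[5] == "_" else row[5].split("|")
            row[5] = "|".join(sorted(set(feats + ["Hyph=Yes"]), key=str.lower))
            row[9] = nxt[9]   # MISC of the hyphen now describes the merged token
        row[0] = renumber(row[0])
        row[6] = renumber(row[6])
        if row[8] != "_":
            deps = [d.split(":", 1) for d in row[8].split("|")]
            row[8] = "|".join(f"{renumber(h)}:{rel}" for h, rel in deps)
        out.append(row)
    return out


# Columns are space-separated here for readability; real CoNLL-U uses tabs.
ewt_rows = [line.split() for line in [
    "12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No",
    "13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No",
    "14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _",
    "15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _",
]]
merged = merge_afx_hyph(ewt_rows)
for row in merged:
    print("\t".join(row))
```

Applied to the EWT fragment, this yields three tokens, with `Indo-` carrying `Hyph=Yes` and every later ID, HEAD and DEPS value shifted down by one.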

Proto-Indo-European

GENTLE sent_id GENTLE_dictionary_school-8

65	Proto-Indo-European	Proto-Indo-European	PROPN	NNP	Number=Sing	66	compound	66:compound	Entity=(33-abstract-new-cf19-2-sgl(34-abstract-new-cf23-1-coref-Proto%2DIndo%2DEuropean_language)|XML=<ref target:::"https://en.wikipedia.org/wiki/Proto-Indo-European_language"></ref>

My understanding is that this should be:

65	Proto-	proto-	X	AFX	Hyph=Yes	67	compound	67:compound	SpaceAfter=No
66	Indo-	Indo-	X	AFX	Hyph=Yes	67	compound	67:compound	SpaceAfter=No
67	European	European	PROPN	NNP	Number=Sing	68	compound	68:compound	_

This should also apply to pro-Muslim, anti-Semite, etc. with the pro-, anti-, etc. modifiers being their own AFX tokens.
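
To make the proposed segmentation concrete, here is a rough Python sketch that splits leading prefixes off a hyphenated form. The prefix list and the function name `split_prefixes` are illustrative assumptions, not part of any UD guideline or treebank; a real conversion would need a curated list and would also have to assign tags and dependencies.

```python
# Illustrative prefix list only; not taken from any guideline.
PREFIXES = {"proto", "indo", "anglo", "pro", "anti"}


def split_prefixes(form):
    """Split leading known prefixes off a hyphenated form, keeping the hyphen."""
    parts = []
    rest = form
    while "-" in rest:
        head, tail = rest.split("-", 1)
        # Stop at the first piece that is not a known prefix,
        # or when nothing follows the hyphen.
        if head.lower() not in PREFIXES or not tail:
            break
        parts.append(head + "-")
        rest = tail
    parts.append(rest)
    return parts


for word in ["Proto-Indo-European", "anti-Semite", "well-known"]:
    print(word, split_prefixes(word))
```

Forms whose first element is not in the list, such as well-known, are left as a single token by this sketch.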


dan-zeman · 2023-12-03T20:22:57Z

EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:

12	Indo	Indo	X	AFX	_	15	compound	15:compound	SpaceAfter=No
13	-	-	PUNCT	HYPH	_	12	punct	12:punct	SpaceAfter=No
14	Sri	Sri	PROPN	NNP	Number=Sing	15	compound	15:compound	_
15	Lanka	Lanka	PROPN	NNP	Number=Sing	17	compound	17:compound	_

My understanding is that this should be:

12	Indo-	Indo-	X	AFX	Hyph=Yes	14	compound	14:compound	SpaceAfter=No
13	Sri	Sri	PROPN	NNP	Number=Sing	14	compound	14:compound	_
14	Lanka	Lanka	PROPN	NNP	Number=Sing	16	compound	16:compound	_

Hyph=Yes is indeed meant for the first part of such compounds when the parts are separate tokens and the form of the first part differs from that of the independent word. However, it does not specify the tokenization, that is, whether the hyphen should be part of the form or a separate token. We use Hyph=Yes in Czech, but we do not include the hyphen in the prefix token that gets the feature.

amir-zeldes · 2023-12-04T20:51:13Z

AFAIK, the actual convention for AFX in LDC corpora differs from what EWT does. In OntoNotes, AFX is only used in the situations Dan is referring to, where the affix 'word' is a separate token due to spacing, e.g.:

pro-/AFX and anti-abortionists/NNS (wsj_0290)

As the second noun demonstrates, the standard has been to not separate prefixes like anti- when they are spelled together, and GENTLE (and the other GU corpora) follows this standard.

nschneid · 2023-12-04T22:16:51Z

Keeping the hyphen within the AFX token makes logical sense to me. I checked the EWT source trees from LDC and they do have the separated HYPH tokens, so either they changed their standard or didn't apply it consistently. There are very few AFX tokens with hyphens in EWT—I only see about 5.
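
A count like this can be reproduced with a few lines of Python. This is a minimal sketch that reads CoNLL-U text directly rather than the LDC source trees; the function name `count_afx` is hypothetical, and the rows in the example are constructed for illustration, not copied from EWT.

```python
def count_afx(conllu_text):
    """Return (total AFX tokens, AFX tokens whose FORM ends in a hyphen)."""
    total = 0
    with_hyphen = 0
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        if cols[4] == "AFX":  # XPOS column
            total += 1
            if cols[1].endswith("-"):  # FORM column
                with_hyphen += 1
    return total, with_hyphen


example = "\n".join([
    "# text = illustrative rows only",
    "\t".join(["1", "Indo", "Indo", "X", "AFX", "_", "3", "compound", "_", "SpaceAfter=No"]),
    "\t".join(["2", "-", "-", "PUNCT", "HYPH", "_", "1", "punct", "_", "SpaceAfter=No"]),
    "\t".join(["3", "European", "European", "PROPN", "NNP", "Number=Sing", "0", "root", "_", "_"]),
    "",
    "\t".join(["1", "pro-", "pro-", "X", "AFX", "Hyph=Yes", "2", "compound", "_", "_"]),
    "\t".join(["2", "choice", "choice", "NOUN", "NN", "Number=Sing", "0", "root", "_", "_"]),
])
print(count_afx(example))
```

To check a real treebank, read the `.conllu` file into a string and pass it to `count_afx`.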

dan-zeman added English tokenization standard needed labels Dec 3, 2023

dan-zeman added this to the v2.14 milestone Dec 3, 2023

rhdunn mentioned this issue Dec 7, 2023

Incorrect lemma capitalization UniversalDependencies/UD_English-PUD#37

Open

dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
