Tokenization of hyphenated forms in English #1002

Open
rhdunn opened this issue Dec 3, 2023 · 3 comments

@rhdunn
rhdunn commented Dec 3, 2023

Looking at hyphenated compounds, there are several ways that English treebanks annotate these, sometimes inconsistently within the same treebank and across treebanks.

I'm basing this on https://universaldependencies.org/u/feat/Hyph.html.

Indo-Sri Lanka

EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:

12	Indo	Indo	X	AFX	_	15	compound	15:compound	SpaceAfter=No
13	-	-	PUNCT	HYPH	_	12	punct	12:punct	SpaceAfter=No
14	Sri	Sri	PROPN	NNP	Number=Sing	15	compound	15:compound	_
15	Lanka	Lanka	PROPN	NNP	Number=Sing	17	compound	17:compound	_

My understanding is that this should be:

12	Indo-	Indo-	X	AFX	Hyph=Yes	14	compound	14:compound	SpaceAfter=No
13	Sri	Sri	PROPN	NNP	Number=Sing	14	compound	14:compound	_
14	Lanka	Lanka	PROPN	NNP	Number=Sing	16	compound	16:compound	_

This should also apply to Anglo-Saxon, etc.

Proto-Indo-European

GENTLE sent_id GENTLE_dictionary_school-8

65	Proto-Indo-European	Proto-Indo-European	PROPN	NNP	Number=Sing	66	compound	66:compound	Entity=(33-abstract-new-cf19-2-sgl(34-abstract-new-cf23-1-coref-Proto%2DIndo%2DEuropean_language)|XML=<ref target:::"https://en.wikipedia.org/wiki/Proto-Indo-European_language"></ref>

My understanding is that this should be:

65	Proto-	proto-	X	AFX	Hyph=Yes	67	compound	67:compound	SpaceAfter=No
66	Indo-	Indo-	X	AFX	Hyph=Yes	67	compound	67:compound	SpaceAfter=No
67	European	European	PROPN	NNP	Number=Sing	68	compound	68:compound	_

This should also apply to pro-Muslim, anti-Semite, etc. with the pro-, anti-, etc. modifiers being their own AFX tokens.
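The retokenization proposed above for Indo-Sri Lanka can be sketched in Python. This is only an illustration under stated assumptions, not an existing UD tool: the function name `merge_affix_hyphen` is hypothetical, the sentence is assumed to contain plain integer IDs only (no multiword-token ranges or empty nodes), and nothing is assumed to depend on the hyphen token. The five-token sample sentence is invented for the example. The point of the sketch is that HEAD and DEPS both have to be renumbered once the hyphen token disappears.

```python
def merge_affix_hyphen(rows):
    """Merge each AFX token with an immediately following HYPH token.

    `rows` is a list of 10-column CoNLL-U token rows (lists of strings)
    for one complete sentence. Returns new rows with the hyphen folded
    into the affix, Hyph=Yes added, and IDs/HEAD/DEPS renumbered.
    """
    # Pass 1: find hyphen tokens that directly follow an AFX token,
    # are attached to it, and are written without an intervening space.
    drop = set()
    for cur, nxt in zip(rows, rows[1:]):
        if (cur[4] == "AFX" and nxt[4] == "HYPH" and nxt[6] == cur[0]
                and "SpaceAfter=No" in cur[9].split("|")):
            drop.add(nxt[0])

    # Pass 2: map old IDs to new IDs, skipping the dropped hyphens.
    new_id = {"0": "0"}
    for row in rows:
        if row[0] not in drop:
            new_id[row[0]] = str(len(new_id))

    # Pass 3: rebuild the rows.
    out = []
    for i, row in enumerate(rows):
        if row[0] in drop:
            continue
        row = list(row)
        if i + 1 < len(rows) and rows[i + 1][0] in drop:
            hyphen = rows[i + 1]
            row[1] += hyphen[1]                      # FORM: Indo -> Indo-
            row[2] += hyphen[2]                      # LEMMA
            feats = [] if row[5] == "_" else row[5].split("|")
            feats.append("Hyph=Yes")
            row[5] = "|".join(sorted(feats, key=str.lower))
            row[9] = hyphen[9]                       # spacing after the hyphen
        row[0] = new_id[row[0]]
        row[6] = new_id[row[6]]                      # HEAD
        if row[8] != "_":                            # DEPS, e.g. 4:compound
            deps = []
            for dep in row[8].split("|"):
                head, _, rel = dep.partition(":")
                deps.append(new_id[head] + ":" + rel)
            row[8] = "|".join(deps)
        out.append(row)
    return out


# Invented sample sentence: "Indo-Sri Lanka relations"
sample = [line.split("\t") for line in [
    "1\tIndo\tIndo\tX\tAFX\t_\t4\tcompound\t4:compound\tSpaceAfter=No",
    "2\t-\t-\tPUNCT\tHYPH\t_\t1\tpunct\t1:punct\tSpaceAfter=No",
    "3\tSri\tSri\tPROPN\tNNP\tNumber=Sing\t4\tcompound\t4:compound\t_",
    "4\tLanka\tLanka\tPROPN\tNNP\tNumber=Sing\t5\tcompound\t5:compound\t_",
    "5\trelations\trelation\tNOUN\tNNS\tNumber=Plur\t0\troot\t0:root\t_",
]]
merged = merge_affix_hyphen(sample)
print("\n".join("\t".join(row) for row in merged))
```

A real conversion would also need to handle sentence comments, `# text` consistency, and any token whose head is the removed hyphen.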

@dan-zeman
Member

EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:

12	Indo	Indo	X	AFX	_	15	compound	15:compound	SpaceAfter=No
13	-	-	PUNCT	HYPH	_	12	punct	12:punct	SpaceAfter=No
14	Sri	Sri	PROPN	NNP	Number=Sing	15	compound	15:compound	_
15	Lanka	Lanka	PROPN	NNP	Number=Sing	17	compound	17:compound	_

My understanding is that this should be:

12	Indo-	Indo-	X	AFX	Hyph=Yes	14	compound	14:compound	SpaceAfter=No
13	Sri	Sri	PROPN	NNP	Number=Sing	14	compound	14:compound	_
14	Lanka	Lanka	PROPN	NNP	Number=Sing	16	compound	16:compound	_

Hyph=Yes is indeed meant for the first part of such compounds when the parts are separate tokens and the form of the first part differs from that of the independent word. But it does not specify how tokenization should be done, that is, whether the hyphen should be part of the form or a separate token. We use Hyph=Yes in Czech, but we don't include the hyphen in the token that contains the prefix and gets the feature.

@amir-zeldes
Contributor

AFAIK, the actual convention for AFX in LDC corpora is not what EWT does. In OntoNotes, AFX is used only in the situations Dan is referring to, where the affix 'word' is a separate token due to spacing, e.g.:

  • pro-/AFX and anti-abortionists/NNS (wsj_0290)

As the second noun demonstrates, the standard has been to not separate prefixes like anti- when they are spelled together, and GENTLE (and the other GU corpora) follows this standard.

@nschneid
Contributor
nschneid commented Dec 4, 2023

Keeping the hyphen within the AFX token makes logical sense to me. I checked the EWT source trees from LDC and they do have the separated HYPH tokens, so either they changed their standard or didn't apply it consistently. There are very few AFX tokens with hyphens in EWT; I only see about 5.
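A count like this can be reproduced with a short script over a treebank's CoNLL-U text. The sketch below is an assumption about how one might do it, not the method actually used for the figure above: the function name `count_afx_patterns` and the two counter names are hypothetical, and it reports two separate patterns because "AFX tokens with hyphens" could mean either one.

```python
def count_afx_patterns(conllu_text):
    """Count two AFX/hyphen patterns in CoNLL-U text.

    afx_then_hyph:   an AFX token directly followed by a separate HYPH token
    afx_with_hyphen: an AFX token whose FORM itself ends in a hyphen
    """
    counts = {"afx_then_hyph": 0, "afx_with_hyphen": 0}
    prev_xpos = None
    for line in conllu_text.splitlines():
        if not line.strip():
            prev_xpos = None          # sentence boundary
            continue
        if line.startswith("#"):
            continue                  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                  # skip multiword ranges and empty nodes
        form, xpos = cols[1], cols[4]
        if xpos == "AFX" and form.endswith("-"):
            counts["afx_with_hyphen"] += 1
        if xpos == "HYPH" and prev_xpos == "AFX":
            counts["afx_then_hyph"] += 1
        prev_xpos = xpos
    return counts


# Tiny invented sample with one instance of each pattern.
sample_text = "\n".join([
    "# sent_id = example-1",
    "\t".join(["1", "Indo", "Indo", "X", "AFX", "_", "3", "compound", "_", "SpaceAfter=No"]),
    "\t".join(["2", "-", "-", "PUNCT", "HYPH", "_", "1", "punct", "_", "SpaceAfter=No"]),
    "\t".join(["3", "European", "European", "PROPN", "NNP", "Number=Sing", "0", "root", "_", "_"]),
    "",
    "# sent_id = example-2",
    "\t".join(["1", "pro-", "pro-", "X", "AFX", "Hyph=Yes", "3", "conj", "_", "_"]),
    "\t".join(["2", "and", "and", "CCONJ", "CC", "_", "3", "cc", "_", "_"]),
    "\t".join(["3", "anti-abortionists", "anti-abortionist", "NOUN", "NNS", "Number=Plur", "0", "root", "_", "_"]),
])
result = count_afx_patterns(sample_text)
print(result)
```

Run on a real treebank file, the text would be read with `open(path, encoding="utf-8").read()` and passed in unchanged.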
