

Automatically generate descriptions for items based on their P31 (instance of) values
Open, High, Public

Assigned To
None
Authored By
Mahir256
Mar 13 2022, 5:52 AM
Referenced Files
F37145761: Screen Shot 2023-07-21 at 2.22.16 PM.png
Jul 21 2023, 6:44 PM

Description

As an editor I don't want to maintain redundant descriptions in order to reduce the amount of data to keep track of.
As the dev team we don't want to store a large amount of redundant descriptions in several hundred languages.

Problem:
We have certain classes of items where the description is more or less the same as the instance of (P31) statement and just causes additional maintenance and scaling issues with the query service. Unfortunately, bot activity is escalating this problem.

Example:

  • scientific articles
  • chemical compounds

Proposed solution:

  • We fall back to the label of the P31 value for all languages in which no description exists. Manually set descriptions continue to be used where available.
  • If a description exists in a fallback language, we use it rather than the P31 statement.
  • If multiple P31 statements exist, we list them all (comma-separated?).
  • We only consider best-ranked statements for this.
  • Where does it show up?
    • We do not want these automated descriptions to be materialized in Blazegraph. We can accomplish this by not including them in the dump flavor of the RDF produced by the Linked Data endpoint: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?flavor=dump
    • We do want them in the action API, the Linked Data endpoint (except RDF with flavor=dump), and the database dumps.
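The fallback chain in the proposed solution can be sketched as follows. This is a minimal illustration of the proposal, not Wikibase code; the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the proposed description fallback.
# None of these names correspond to actual Wikibase APIs.

def effective_description(item, lang, fallback_chain, p31_labels):
    """Return the description shown for `item` in `lang`.

    item: dict with a "descriptions" mapping of language code -> text
    fallback_chain: language fallback order, e.g. ["de-at", "de", "en"]
    p31_labels: labels of the best-ranked P31 values, in `lang`
    """
    descriptions = item.get("descriptions", {})
    # 1. A manually set description in the requested language wins.
    if lang in descriptions:
        return descriptions[lang]
    # 2. Next, a manually set description in a fallback language.
    for fallback_lang in fallback_chain:
        if fallback_lang in descriptions:
            return descriptions[fallback_lang]
    # 3. Otherwise, generate one from the best-ranked P31 value(s),
    #    comma-separated if there are several.
    if p31_labels:
        return ", ".join(p31_labels)
    return None

item = {"descriptions": {"nl": "chemische stof"}}
print(effective_description(item, "en", ["en-gb"], ["chemical compound"]))
# no "en" or "en-gb" description exists, so the P31 label is used
```

Note that step 3 only runs when neither the requested language nor any fallback language has a manually set description, which matches the "fallback, not replacement" intent above.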

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

Things to consider still:

  • If we put it into the Linked Data endpoint and the action API, then people might be inclined to use that for editing and then put the automatically generated description back as a materialized one. We don't want that and might need to introduce a flag to indicate that a description was generated automatically, e.g. "en": { "language": "en", "value": "chemical compound", "generated": true }.
  • We don’t want to have the generated descriptions as triples in Blazegraph, but users might still want to have working ?itemDescription variables via the label service when there’s only a generated description. Should the label service reimplement the description generation logic?
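To make the first point concrete, here is how a client that round-trips entity JSON could use such a flag. The "generated" field is the proposal from this task, not an existing API field, and the payload shape is only an assumption based on the current descriptions format.

```python
import json

# Hypothetical entity JSON carrying the proposed "generated" flag.
entity = json.loads("""
{
  "descriptions": {
    "en": {"language": "en", "value": "researcher", "generated": true},
    "nl": {"language": "nl", "value": "onderzoeker uit Nederland"}
  }
}
""")

# Before writing the entity back, drop generated descriptions so they
# are never materialized as manual ones.
manual_only = {
    lang: desc
    for lang, desc in entity["descriptions"].items()
    if not desc.get("generated", False)
}
print(sorted(manual_only))  # only the manually set description survives
```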

Original report:

T91981 was closed without comment in late 2020 (was it out of staleness?) despite the only objections to the issue provided being made over the course of a few days in August 2015. At that time Blazegraph was still maintained and there were between 20 and 21 million items in Wikidata (and possibly a sense of optimism in the air regarding how descriptions on individual items would turn out). Now there are more than 97 million items, due primarily to the imports of scientific articles in particular—with astronomical objects coming later, to boot—and we routinely speak of a potential Blazegraph failure and the need to seek alternatives to that software. One way that we might forestall a Blazegraph failure without disturbing people is to reduce the amount of excess triples that actually need to be separately stored, and one such place from which triples might be taken out is the set of descriptions.

Like it or not, there are certain classes of items that simply will not get descriptions more imaginative or customized or detailed than the ones which over time have been added to them in different languages. Yet there are users whose entire existence on Wikidata, judging from their edit history, seems to be the addition and maintenance of these repetitive/unimaginative/etc. descriptions, needing to run so many batches of edits just to correct a single letter across millions of items. An automatic description generation mechanism based on language and item class (following a P31/P279+ path, possibly involving a few other selected properties), whose outputs may be adjusted in exactly one place rather than in millions of items separately, would at least free these users of their labors, and would allow us to remove the excess of triples for their corresponding non-automatic but equally repetitive/unimaginative/etc. counterparts.

Some classes of items that would dearly benefit from such a thing immediately include items for

  1. scientific articles (33,000,000+),
  2. Wikimedia categories (5,000,000+),
  3. Wikimedia templates (~1,000,000),
  4. stars (~3,000,000),
  5. galaxies (~2,000,000),
  6. Unicode characters (~150,000),
  7. researchers (200,000+)

This is already near half the total number of items on Wikidata at the moment, and there are likely more item classes that are missing, and there are likely more items in the noted classes that will add to the above numbers.

Note to developers and other maintainers: It is vigorously beseeched that this task not be closed as a duplicate of the previous task, since circumstances have significantly changed over the last six and a half years.

Tasks may be obsoleted by this task: T159106: Show P31 in the Wikidata search results, T141553: [feature request] DAB1: add standard description to disambiguation items at Wikidata

Event Timeline

I'm very strongly in favour of having some form of dynamically generated descriptions. The current situation is completely absurd.

Here's most of the items Mahir listed plus some more that I could think of and the number of descriptions which are identical to the corresponding label on the item.

i.e. there are 1.5 billion descriptions which simply duplicate the labels of these 12 items.

That doesn't take into account any slight differences in spelling, capitalisation or language code, e.g. 20 variations of the "Wikimedia category" labels cover another 100 million descriptions.

(I have more thoughts on this, but I'll continue another time)

Bugreporter renamed this task from Provide auto-generated descriptions for certain classes of items to Automatic generate descriptions for items based on their P31 (instance of) values. (Mar 28 2022, 8:25 PM)
Bugreporter renamed this task from Automatic generate descriptions for items based on their P31 (instance of) values to Automatically generate descriptions for items based on their P31 (instance of) values.
Bugreporter updated the task description. (Show Details)

For people, the P106 (occupation) value may be more useful than P31.

One thing to consider: this may degrade ElasticSearch results.

I'm surprised that this hasn't received any attention in 15 months. As an update to @Nikki's numbers, there are now on the order of 2.5 BILLION of these bot-generated descriptions. The top 5 alone represent over 2 billion triples. That's a huge waste of resources!

| Q# | Entity Type | Descriptions (Billions) |
| Q13442814 | scholarly article | 1.32 |
| Q4167836 | Wikimedia category | 0.60 |
| Q4167410 | Wikimedia disambiguation page | 0.11 |
| Q11266439 | Wikimedia template | 0.09 |
| Q101352 | family name | 0.06 |

In addition to the usability and resource issues, there's also a substantial language equity issue associated with the lack of this functionality. The language with the largest number of descriptions is Dutch simply because there's a Dutch speaking bot operator who has vigorously added many, many machine generated descriptions. On the flip side, languages without the privilege of bot operators supporting them go wanting and have no way to disambiguate the terms that autocomplete / search offers them. Of course, if someone were to start adding machine generated descriptions for all those hundreds of languages, the situation would be completely untenable from a Blazegraph point of view.

As an alternative to a textual description, I'll offer the suggestion to consider building an autocomplete widget which looks more like this:

Screen Shot 2023-07-21 at 2.22.16 PM.png (335×710 px, 70 KB)
That's how Freebase Suggest did it back in 2008. Heck, you could even steal the code. One non-obvious aspect of their implementation was that they used metaschema annotations of types as being "Notable" or interesting enough to show the user. Similarly the properties which were displayed varied by entity type and were controlled by metaschema notations, so you might have birth date and place for a person, but containing/parent entity for something like a town or species. Of course, even just a simple list of the P31's would be better than the current situation.


See also https://autodesc.toolforge.org/, which is already used in various tools (e.g. Mix'n'Match). Previous discussion (dating back to 2012): https://www.wikidata.org/wiki/Wikidata:Automating_descriptions

Manuel triaged this task as High priority. (Jul 24 2023, 8:35 AM)

> I'm surprised that this hasn't received any attention in 15 months. As an update to @Nikki 's numbers there are now on the order of 2.5 BILLION of these bot generated descriptions. The top 5 alone represent over 2 billion triples. That's a huge waste of resources!

What exactly are you counting? (You don't seem to be counting the same thing as me, so they can't be directly compared)

I tried redoing my queries (and saved the URLs this time...):

| Item | Matching descriptions (March 2022) | Matching descriptions (August 2023) | Query |
| chemical compound (Q11173) | 22,436,766 | 38,777,020 | QLever |
| encyclopedia article (Q13433827) | 9,877,236 | 10,056,470 | QLever |
| galaxy (Q318) | 14,615,397 | 16,149,120 | QLever |
| protein (Q8054) | 1,116,867 | 1,155,777 | QLever |
| scholarly article (Q13442814) | 778,351,557 | 813,567,636 | query |
| star (Q523) | 943,976 | 1,179,311 | QLever |
| Unicode character (Q29654788) | 594,869 | 1,264,561 | QLever |
| Wikimedia category (Q4167836) | 495,506,461 | 471,340,460 | QLever |
| Wikimedia disambiguation page (Q4167410) | 77,473,195 | 78,644,158 | QLever |
| Wikimedia list article (Q13406463) | 17,270,013 | 17,383,921 | QLever |
| Wikimedia template (Q11266439) | 67,869,856 | 66,668,772 | QLever |
| Wikinews article (Q17633526) | 12,994,976 | 12,854,826 | QLever |
| family name (Q101352) | | 48,959,524 | QLever |
| given name (Q202444) | | 207,184 | QLever |
| female given name (Q11879590) | | 1,634,583 | QLever |
| male given name (Q12308941) | | 2,793,746 | QLever |
| unisex given name (Q3409032) | | 58,843 | QLever |

Even QLever can't count the scholarly article descriptions, so I had to write a query to generate a query that counts each label separately.

The number of descriptions matching the labels of the original 12 items went up by 30 million, which still rounds to 1.5 billion. Categories went down 24 million (lots of merges?), but chemical compound went up 16 million and scholarly article went up 35 million.

@Denny, @Jdforrester-WMF and I discussed this and the overlap with abstract descriptions at Wikimania. Here is what we came up with:
We change Wikibase to generate an automated description. Initially this just takes the first best-ranked instance-of value. Once Wikifunctions and Abstract Wikipedia are ready we can swap out this simple logic for something more complex. This avoids the complexity increase I feared and gives us a sensible way forward now I think.

This is somewhat related as autogeneration could also help human editors... but

After estimating how much time editors spend on adding edit summaries to Wikipedia edits, I figured we could do the same for Wikidata editors manually adding item descriptions.

Here's the Quarry query I used to find this: https://quarry.wmcloud.org/query/76538

The query counts only non-bot description-addition revisions from the recentchanges table (i.e. edits in the past 30 days) that were performed through the editor (mw.edit) rather than a tool, that are not reverted or restored revisions (those have autogenerated summaries), and that are not QuickStatements edits. The character count excludes the autogenerated section names of the edit summary (names between /* */), so it covers only the description that was added.

Average number of human-typed descriptions added per day on wikidata: 5,349
Average number of typed characters (non auto-generated) per description: 43.8792

Assuming descriptions are typed at an average typing speed of 200 characters per minute or 50 wpm, these results calculate to:

Note: This is across all description languages added. Some languages may have different characters per minute typing speeds.

234,660 description characters typed per day
1,173 minutes per day typed
19.5 hours per day spent typing descriptions
7,137 hours per year spent typing descriptions
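The arithmetic above can be reproduced from the two measured averages. (The figures in the comment are rounded, so the results differ by a fraction of a percent.)

```python
# Reproducing the typing-time estimate from the two measured averages.
descriptions_per_day = 5349          # human-typed descriptions/day
chars_per_description = 43.8792      # average non-generated characters
typing_speed_cpm = 200               # assumed characters per minute

chars_per_day = descriptions_per_day * chars_per_description
minutes_per_day = chars_per_day / typing_speed_cpm
hours_per_day = minutes_per_day / 60
hours_per_year = hours_per_day * 365

print(f"{chars_per_day:,.0f} characters/day")   # ~234,700
print(f"{hours_per_day:.1f} hours/day")         # ~19.6
print(f"{hours_per_year:,.0f} hours/year")      # ~7,100
```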

Note that some of these descriptions counted could be automatically generated too from user tools, but I'm not sure what those tools might be.

If these automated descriptions weren't in WDQS (and consumed triples), how could the label service fetch them and bring it into results? Generating descriptions on the fly couldn't work well with queries with too many results. Other queries using schema:description couldn't work at all.

> If these automated descriptions weren't in WDQS (and consumed triples), how could the label service fetch them and bring it into results? Generating descriptions on the fly couldn't work well with queries with too many results. Other queries using schema:description couldn't work at all.

The generated descriptions we're talking about here would come from the labels of P31 statements on the items, which can be selected using wdt:P31/rdfs:label.

See this query for example: https://w.wiki/9CUR. That's 25 semi-random items, their current description, the label of their P31 statement, and a description created by using the current description if it exists, or the P31 label if not. I assume the label service would do something similar to that.

The P31 of chemical compounds is now "type of chemical entity" (Q113145171) rather than chemical compound. So descriptions of chemical compounds would still need to be copied by bots from P279 rather than P31.

> P31 of chemical compound is "type of chemical entity" Q113145171 instead of chemical compound now. So descriptions of chemical compounds still needs to be copied from P279 rather than P31 by bots.

This does not seem to be a problem. Whether the description would be "chemical compound" or "type of chemical entity", both descriptions would serve the purpose of descriptions well enough. EDIT: and given the current guidelines, "chemical compound" in P279 should not appear at all.