

Automatically generate descriptions for items based on their P31 (instance of) values
Open, High, Public

Assigned To
None
Authored By
Mahir256
Mar 13 2022, 5:52 AM
Referenced Files
F37145761: Screen Shot 2023-07-21 at 2.22.16 PM.png
Jul 21 2023, 6:44 PM

Description

As an editor I don't want to maintain redundant descriptions in order to reduce the amount of data to keep track of.
As the dev team we don't want to store a large amount of redundant descriptions in several hundred languages.

Problem:
We have certain classes of items where the description is more or less the same as the instance of (P31) statement and just causes additional maintenance and scaling issues with the query service. Unfortunately, bot activity is escalating this problem.

Example:

  • scientific articles
  • chemical compounds

Proposed solution:

  • We fall back to the label of the P31 value for all languages in which no description exists. Manually set descriptions continue to be used where available.
  • If a description exists in a fallback language, we use it rather than the P31 statement.
  • If multiple P31 statements exist, we list them all (comma-separated?).
  • We only consider best-ranked statements for this.
  • Where does it show up?
    • We do not want these automated descriptions to be materialized in Blazegraph. We can accomplish this by not including them in the dump flavor of the RDF produced by the Linked Data endpoint: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?flavor=dump
    • We do want them in the action API, the Linked Data endpoint (except RDF with flavor=dump), and the database dumps.
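The fallback chain in the proposed solution can be sketched as follows. This is a minimal illustration of the proposal, not Wikibase code; the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the proposed description fallback.
# None of these names correspond to actual Wikibase APIs.

def effective_description(item, lang, fallback_chain, p31_labels):
    """Return the description shown for `item` in `lang`.

    item: dict with a "descriptions" mapping of language code -> text
    fallback_chain: language fallback order, e.g. ["de-at", "de", "en"]
    p31_labels: labels of the best-ranked P31 values, in `lang`
    """
    descriptions = item.get("descriptions", {})
    # 1. A manually set description in the requested language wins.
    if lang in descriptions:
        return descriptions[lang]
    # 2. Next, a manually set description in a fallback language.
    for fallback_lang in fallback_chain:
        if fallback_lang in descriptions:
            return descriptions[fallback_lang]
    # 3. Otherwise, generate one from the best-ranked P31 value(s),
    #    comma-separated if there are several.
    if p31_labels:
        return ", ".join(p31_labels)
    return None

item = {"descriptions": {"nl": "chemische stof"}}
print(effective_description(item, "en", ["en-gb"], ["chemical compound"]))
# no "en" or "en-gb" description exists, so the P31 label is used
```

Note that step 3 only runs when neither the requested language nor any fallback language has a manually set description, which matches the "fallback, not replacement" intent above.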

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

Things to consider still:

  • If we put it into the Linked Data endpoint and the action API, then people might be inclined to use that for editing and then put the automatically generated description back as a materialized one. We don't want that and might need to introduce a flag to indicate that a description was generated automatically, e.g. "en": { "language": "en", "value": "chemical compound", "generated": true }.
  • We don’t want to have the generated descriptions as triples in Blazegraph, but users might still want to have working ?itemDescription variables via the label service when there’s only a generated description. Should the label service reimplement the description generation logic?
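To make the first point concrete, here is how a client that round-trips entity JSON could use such a flag. The "generated" field is the proposal from this task, not an existing API field, and the payload shape is only an assumption based on the current descriptions format.

```python
import json

# Hypothetical entity JSON carrying the proposed "generated" flag.
entity = json.loads("""
{
  "descriptions": {
    "en": {"language": "en", "value": "researcher", "generated": true},
    "nl": {"language": "nl", "value": "onderzoeker uit Nederland"}
  }
}
""")

# Before writing the entity back, drop generated descriptions so they
# are never materialized as manual ones.
manual_only = {
    lang: desc
    for lang, desc in entity["descriptions"].items()
    if not desc.get("generated", False)
}
print(sorted(manual_only))  # only the manually set description survives
```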

Original report:

T91981 was closed without comment in late 2020 (was it out of staleness?) despite the only objections to the issue provided being made over the course of a few days in August 2015. At that time Blazegraph was still maintained and there were between 20 and 21 million items in Wikidata (and possibly a sense of optimism in the air regarding how descriptions on individual items would turn out). Now there are more than 97 million items, due primarily to the imports of scientific articles in particular—with astronomical objects coming later, to boot—and we routinely speak of a potential Blazegraph failure and the need to seek alternatives to that software. One way that we might forestall a Blazegraph failure without disturbing people is to reduce the amount of excess triples that actually need to be separately stored, and one such place from which triples might be taken out is the set of descriptions.

Like it or not, there are certain classes of items that simply will not get descriptions more imaginative or customized or detailed than the ones which over time have been added to them in different languages. Yet there are users whose entire existence on Wikidata, judging from their edit history, seems to be the addition and maintenance of these repetitive/unimaginative/etc. descriptions, needing to run so many batches of edits just to correct a single letter across millions of items. An automatic description generation mechanism based on language and item class (following a P31/P279+ path, possibly involving a few other selected properties), whose outputs may be adjusted in exactly one place rather than in millions of items separately, would at least free these users of their labors, and would allow us to remove the excess of triples for their corresponding non-automatic but equally repetitive/unimaginative/etc. counterparts.

Some classes of items that would dearly benefit from such a thing immediately include items for

  1. scientific articles (33,000,000+),
  2. Wikimedia categories (5,000,000+),
  3. Wikimedia templates (~1,000,000),
  4. stars (~3,000,000),
  5. galaxies (~2,000,000),
  6. Unicode characters (~150,000),
  7. researchers (200,000+)

This is already near half the total number of items on Wikidata at the moment, and there are likely more item classes that are missing, and there are likely more items in the noted classes that will add to the above numbers.

Note to developers and other maintainers: It is vigorously beseeched that this task not be closed as a duplicate of the previous task, since circumstances have significantly changed over the last six and a half years.

Tasks may be obsoleted by this task: T159106: Show P31 in the Wikidata search results, T141553: [feature request] DAB1: add standard description to disambiguation items at Wikidata

Event Timeline

I'm very strongly in favour of having some form of dynamically generated descriptions. The current situation is completely absurd.

Here's most of the items Mahir listed plus some more that I could think of and the number of descriptions which are identical to the corresponding label on the item.

i.e. there are 1.5 billion descriptions which simply duplicate the labels of these 12 items.

That doesn't take into account any slight differences in spelling, capitalisation or language code, e.g. 20 variations of the "Wikimedia category" labels cover another 100 million descriptions.

(I have more thoughts on this, but I'll continue another time)

Bugreporter renamed this task from Provide auto-generated descriptions for certain classes of items to Automatic generate descriptions for items based on their P31 (instance of) values. (Mar 28 2022, 8:25 PM)
Bugreporter renamed this task from Automatic generate descriptions for items based on their P31 (instance of) values to Automatically generate descriptions for items based on their P31 (instance of) values.
Bugreporter updated the task description. (Show Details)

For people, the P106 (occupation) value may be more useful than P31.

One thing to consider: this may degrade ElasticSearch results.

I'm surprised that this hasn't received any attention in 15 months. As an update to @Nikki's numbers, there are now on the order of 2.5 BILLION of these bot-generated descriptions. The top 5 alone represent over 2 billion triples. That's a huge waste of resources!

| Q# | Entity Type | Descriptions (Billions) |
| Q13442814 | scholarly article | 1.32 |
| Q4167836 | Wikimedia category | 0.60 |
| Q4167410 | Wikimedia disambiguation page | 0.11 |
| Q11266439 | Wikimedia template | 0.09 |
| Q101352 | family name | 0.06 |

In addition to the usability and resource issues, there's also a substantial language equity issue associated with the lack of this functionality. The language with the largest number of descriptions is Dutch simply because there's a Dutch speaking bot operator who has vigorously added many, many machine generated descriptions. On the flip side, languages without the privilege of bot operators supporting them go wanting and have no way to disambiguate the terms that autocomplete / search offers them. Of course, if someone were to start adding machine generated descriptions for all those hundreds of languages, the situation would be completely untenable from a Blazegraph point of view.

As an alternative to a textual description, I'll offer the suggestion to consider building an autocomplete widget which looks more like this:

Screen Shot 2023-07-21 at 2.22.16 PM.png (335×710 px, 70 KB)
That's how Freebase Suggest did it back in 2008. Heck, you could even steal the code. One non-obvious aspect of their implementation was that they used metaschema annotations of types as being "Notable" or interesting enough to show the user. Similarly the properties which were displayed varied by entity type and were controlled by metaschema notations, so you might have birth date and place for a person, but containing/parent entity for something like a town or species. Of course, even just a simple list of the P31's would be better than the current situation.


See also https://autodesc.toolforge.org/, which is already used in various tools (e.g. Mix'n'Match). Previous discussion (dating back to 2012): https://www.wikidata.org/wiki/Wikidata:Automating_descriptions

Manuel triaged this task as High priority. (Jul 24 2023, 8:35 AM)

> I'm surprised that this hasn't received any attention in 15 months. As an update to @Nikki 's numbers there are now on the order of 2.5 BILLION of these bot generated descriptions. The top 5 alone represent over 2 billion triples. That's a huge waste of resources!

What exactly are you counting? (You don't seem to be counting the same thing as me, so they can't be directly compared)

I tried redoing my queries (and saved the URLs this time...):

| Item | Matching descriptions (March 2022) | Matching descriptions (August 2023) | Query |
| chemical compound (Q11173) | 22,436,766 | 38,777,020 | QLever |
| encyclopedia article (Q13433827) | 9,877,236 | 10,056,470 | QLever |
| galaxy (Q318) | 14,615,397 | 16,149,120 | QLever |
| protein (Q8054) | 1,116,867 | 1,155,777 | QLever |
| scholarly article (Q13442814) | 778,351,557 | 813,567,636 | query |
| star (Q523) | 943,976 | 1,179,311 | QLever |
| Unicode character (Q29654788) | 594,869 | 1,264,561 | QLever |
| Wikimedia category (Q4167836) | 495,506,461 | 471,340,460 | QLever |
| Wikimedia disambiguation page (Q4167410) | 77,473,195 | 78,644,158 | QLever |
| Wikimedia list article (Q13406463) | 17,270,013 | 17,383,921 | QLever |
| Wikimedia template (Q11266439) | 67,869,856 | 66,668,772 | QLever |
| Wikinews article (Q17633526) | 12,994,976 | 12,854,826 | QLever |
| family name (Q101352) | | 48,959,524 | QLever |
| given name (Q202444) | | 207,184 | QLever |
| female given name (Q11879590) | | 1,634,583 | QLever |
| male given name (Q12308941) | | 2,793,746 | QLever |
| unisex given name (Q3409032) | | 58,843 | QLever |

Even QLever can't count the scholarly article descriptions, so I had to write a query to generate a query that counts each label separately.

The number of descriptions matching the labels of the original 12 items went up by 30 million, which still rounds to 1.5 billion. Categories went down 24 million (lots of merges?), but chemical compound went up 16 million and scholarly article went up 35 million.

@Denny, @Jdforrester-WMF and I discussed this and the overlap with abstract descriptions at Wikimania. Here is what we came up with:
We change Wikibase to generate an automated description. Initially this just takes the first best-ranked instance-of value. Once Wikifunctions and Abstract Wikipedia are ready we can swap out this simple logic for something more complex. This avoids the complexity increase I feared and gives us a sensible way forward now I think.

This is somewhat related as autogeneration could also help human editors... but

After estimating how much time editors spend on adding edit summaries to Wikipedia edits, I figured we could do the same for Wikidata editors manually adding item descriptions.

Here's the Quarry query I used to find this: https://quarry.wmcloud.org/query/76538

The query counts only non-bot description-addition revisions from the recentchanges table (i.e. edits in the past 30 days) that were performed through the editor (mw.edit) rather than a tool, that are not reverted or restored revisions (those have autogenerated summaries), and that are not QuickStatements edits. The character count excludes the autogenerated section names of the edit summary (names between /* */), so it covers only the description that was added.

Average number of human-typed descriptions added per day on wikidata: 5,349
Average number of typed characters (non auto-generated) per description: 43.8792

Assuming descriptions are typed at an average typing speed of 200 characters per minute or 50 wpm, these results calculate to:

Note: This is across all description languages added. Some languages may have different characters per minute typing speeds.

234,660 description characters typed per day
1,173 minutes per day typed
19.5 hours per day spent typing descriptions
7,137 hours per year spent typing descriptions
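The arithmetic above can be reproduced from the two measured averages. (The figures in the comment are rounded, so the results differ by a fraction of a percent.)

```python
# Reproducing the typing-time estimate from the two measured averages.
descriptions_per_day = 5349          # human-typed descriptions/day
chars_per_description = 43.8792      # average non-generated characters
typing_speed_cpm = 200               # assumed characters per minute

chars_per_day = descriptions_per_day * chars_per_description
minutes_per_day = chars_per_day / typing_speed_cpm
hours_per_day = minutes_per_day / 60
hours_per_year = hours_per_day * 365

print(f"{chars_per_day:,.0f} characters/day")   # ~234,700
print(f"{hours_per_day:.1f} hours/day")         # ~19.6
print(f"{hours_per_year:,.0f} hours/year")      # ~7,100
```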

Note that some of these descriptions counted could be automatically generated too from user tools, but I'm not sure what those tools might be.

If these automated descriptions weren't in WDQS (and consumed triples), how could the label service fetch them and bring it into results? Generating descriptions on the fly couldn't work well with queries with too many results. Other queries using schema:description couldn't work at all.

> If these automated descriptions weren't in WDQS (and consumed triples), how could the label service fetch them and bring it into results? Generating descriptions on the fly couldn't work well with queries with too many results. Other queries using schema:description couldn't work at all.

The generated descriptions we're talking about here would come from the labels of P31 statements on the items, which can be selected using wdt:P31/rdfs:label.

See this query for example: https://w.wiki/9CUR. That's 25 semi-random items, their current description, the label of their P31 statement, and a description created by using the current description if it exists, or the P31 label if not. I assume the label service would do something similar to that.

The P31 of chemical compounds is now "type of chemical entity" (Q113145171) rather than chemical compound. So descriptions of chemical compounds would still need to be copied by bots from P279 rather than P31.

> P31 of chemical compound is "type of chemical entity" Q113145171 instead of chemical compound now. So descriptions of chemical compounds still needs to be copied from P279 rather than P31 by bots.

This does not seem to be a problem. Whether the description would be "chemical compound" or "type of chemical entity", both descriptions would serve the purpose of descriptions well enough. EDIT: and given the current guidelines, "chemical compound" in P279 should not appear at all.