Wikidata:WikiProject Limits of Wikidata

This WikiProject aims to catalogue the current limits of Wikidata and to extrapolate their development until about 2030.
The formula depicted here describes the resolution limit of the light microscope. After it had served science for about a century, it was set in stone for a monument. The image was taken years later still, and two months after that, the 2014 Nobel Prize in Chemistry was awarded for overcoming this limit using fluorescent molecules and lasers.
Which of the limits of Wikidata are set in stone, and which ones should we strive to overcome?

About

This WikiProject aims to bring together various strands of conversations that touch upon the limits of Wikidata, in both technical and social terms. The aim is not to duplicate existing documentation but to collect pointers to places where the limits relevant to a given section are described or discussed. For background, see here.

Timeframe

While fundamental limits exist in nature, the technical and social limits we are discussing here are likely to shift over time, so any discussion of such limits has to come with some indication of an applicable timeframe. Since the Wikimedia community has used the year 2030 as a reference point for its Movement Strategy, we will also use it here as the default for projections into the future, and contrast these with current values (which may be available via Wikidata's Grafana dashboards). If other timeframes make more sense in specific contexts, please indicate that.

Design limits

"Design limits" are the limits which exist by intentional design of the infrastructure of our systems. As design choices, they have benefits and drawbacks. Such infrastructure limits are not necessarily problems to address and may instead be environmental conditions for using the Wikidata platform.

Software

Knowledge graphs in general

MediaWiki

maxlag

mw:Manual:Maxlag parameter, as explained here. API requests specifying the maxlag parameter will see the wiki as read-only if any replica server is lagged by more than the specified number of seconds. (It is customary for bots to specify maxlag=5 [seconds].)

Query service lag is factored into maxlag according to a scaling factor (wgWikidataOrgQueryServiceMaxLagFactor): in production the factor is 60 (as of 2024-08-08), i.e. a WDQS backend server lagged by five minutes counts like a database replica lagged by five seconds.

If a majority of database replicas are lagged by more than three seconds ($wgAPIMaxLagThreshold in production, as of 2024-08-08), all API requests will see the wiki as read-only (including all human edits to entities; Wikitext pages will remain editable). If lag on a majority of replicas exceeds six seconds ('max lag' in the production database configuration, as of 2024-08-08), the wiki becomes fully read-only until replication catches up again.
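
For illustration, here is a minimal sketch in Python (assuming the requests library; the retry count and wait handling are illustrative, not prescriptive) of how a bot can pass maxlag and back off when the API reports lag:

  import time
  import requests

  API = "https://www.wikidata.org/w/api.php"

  def api_get(params, maxlag=5, retries=5):
      """Call the Wikidata API, backing off while replicas are lagged."""
      params = dict(params, format="json", maxlag=maxlag)
      for _ in range(retries):
          response = requests.get(API, params=params)
          data = response.json()
          if data.get("error", {}).get("code") != "maxlag":
              return data
          # The server is lagged; it suggests a wait via the Retry-After header.
          time.sleep(int(response.headers.get("Retry-After", 5)))
      raise RuntimeError("Replication lag did not recover in time")

  # Example: a read request made with the same politeness an edit would use.
  print(api_get({"action": "wbgetentities", "ids": "Q42", "props": "labels"}))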

Page size

The maximum page size is controlled by $wgMaxArticleSize and maxSerializedEntitySize. It is not clear which of them applies to entities. (They are set to 2 MiB and 3 MiB in production, respectively.)

Special:LongPages suggests the effective maximum page size is a bit above 4 MiB; the historical maximum page size can be seen on Grafana (though data before October 2021 has been lost).
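
As a rough, unofficial check, one can compare the size of an entity's serialized JSON against the figures above; the item ID below is just an example, and the Special:EntityData output only approximates the serialization the limits actually apply to:

  import requests

  LIMIT_MIB = 2  # $wgMaxArticleSize in production, per the figures above

  def entity_json_mib(qid):
      # Special:EntityData serves the canonical JSON of an entity; its size only
      # approximates what the page-size limits are actually checked against.
      url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
      return len(requests.get(url).content) / 2**20

  qid = "Q42"  # example item
  print(f"{qid}: {entity_json_mib(qid):.2f} MiB (limit about {LIMIT_MIB} MiB)")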

Page load performance

mw:Wikimedia Performance Team/Page load performance, as explained here.

Wikibase

Generic Wikibase Repository
  • By design, the repository stores statements that *could* be true. There is no score yet that describes the validity or "common sense agreement" of a statement.
Data types
  • Item
  • Monolingual string
  • Single value store, but no time-series for KPIs
Data formats
  • JSON
  • RDF
  • etc.
Generic Wikibase Client
Wikidata's Wikibase Repository
Wikidata's Wikibase Client
Wikibase Repositories other than Wikidata
Wikibase Clients other than Wikidata
Wikidata bridge
Wikimedia wikis
Non-Wikimedia wikis

Wikidata Query Service

See also Future-proof WDQS.

Triple store
Blazegraph

As of 2024-08-08, known major issues with Blazegraph include:

  • Re-importing the Wikidata data from scratch (e.g. to a new server, or if the data file got corrupted) is extremely slow (on the order of weeks).
  • Sometimes there are issues with Blazegraph allocators that require a restart of a server (and subsequently catching up with missed updates).
Virtuoso
JANUS
Apache Rya

Apache Rya (Q28915769), source code (no development activity since 2020), manual

Oxigraph
Frontend
Timeout limit

Queries to the Wikidata Query Service time out after a configurable amount of time (currently 60 seconds on the public endpoint).

There are multiple related timeouts, e.g. a queryTimeout for Blazegraph's SPARQL LOAD command or a timeout parameter for the WDQS GUI build job.
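
A minimal sketch, assuming the public endpoint at query.wikidata.org and its 60-second budget, of what the frontend timeout looks like to a client; the error detection below is a heuristic, not an official contract:

  import requests

  WDQS = "https://query.wikidata.org/sparql"

  def run_query(sparql):
      response = requests.get(
          WDQS,
          params={"query": sparql, "format": "json"},
          headers={"User-Agent": "limits-of-wikidata-example/0.1"},
      )
      # A timed-out query comes back as a server error whose body mentions
      # java.util.concurrent.TimeoutException.
      if response.status_code >= 500 and "TimeoutException" in response.text:
          raise TimeoutError("query exceeded the WDQS timeout")
      response.raise_for_status()
      return response.json()

  # A small query that finishes well within the timeout.
  print(run_query("SELECT (COUNT(*) AS ?n) WHERE { wd:Q42 ?p ?o }"))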

JavaScript

The default UI is heavy on JavaScript, and so are many customizations. This creates problems on pages with many statements, which load more slowly or can freeze the browser.

Python

SPARQL

Hardware

"Firstly we need a machine to hold the data and do the needed processing. This blog post will use a “n1-highmem-16” (16 vCPUs, 104 GB memory) virtual machine on the Google Cloud Platform with 3 local SSDs held together with RAID 0."
"This should provide us with enough fast storage to store the raw TTL data, munged TTL files (where extra triples are added) as well as the journal (JNL) file that the blazegraph query service uses to store its data."
"This entire guide will work on any instance size with more than ~4GB memory and adequate disk space of any speed."

Functional limits

A "functional limit" exists when the system design encourages an activity, but somehow engaging in the activity at a large scale exceeds the system's ability to permit that activity. For example, by design Wikidata encourages users to share data and make queries, but it cannot accommodate users doing a mass import of huge amounts of data or billions of quick queries.

A March 2019 report considered the extent to which various functions on Wikidata can scale with increased use - wikitech:WMDE/Wikidata/Scaling.

Wikidata editing

Edits by nature of account

Edits by human users
Manual edits
  • ...
Tool-assisted edits
  • ...
Edits by bots
  • ...

Edits by nature of edit

Page creations
Page modifications
Page merges
Reverts
Page deletions

Edits by size

Edits by frequency

WDQS querying

A clear example of where we encounter problems is SPARQL queries against the WDQS that ask for things of some type (P31) and involve a large number of hits, for example querying all scholarly article titles. Queries that involve fewer items of a type typically do not run into these issues.
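
For illustration, here is a sketch of the query shape described above; the unbounded form times out on the public endpoint, and the LIMIT added below is a common but partial workaround:

  import requests

  # Asking for every scholarly article (Q13442814) with its title (P1476)
  # exceeds the public WDQS timeout; the LIMIT below sidesteps the problem for
  # demonstration purposes rather than solving it.
  QUERY = """
  SELECT ?article ?title WHERE {
    ?article wdt:P31 wd:Q13442814 ;
             wdt:P1476 ?title .
  }
  LIMIT 100
  """

  response = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "limits-of-wikidata-example/0.1"},
  )
  print(len(response.json()["results"]["bindings"]), "results")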

Query timeout

This is a design limit discussed under #Timeout limit above. It manifests itself as an error when the query takes more time to run than the timeout limit allows for.

Queries by usage

One-off or rarely used queries
Showcase queries
Maintenance queries
Constraint checks

Queries by user type

Manually run queries
Queries run through tools
Queries run by bots

Queries by visualization

  • Table
  • Map
  • Bubble chart
  • Graph
  • etc.

Multiple simultaneous queries

Wikidata dumps

Creating dumps

Using dumps

Ingesting dumps
Ingesting dumps into a Wikibase instance
Ingesting dumps into the Wikidata Toolkit

Updating Triple Store Content

Creating large numbers of new items does not itself seem to cause problems (apart from the WDQS querying issue mentioned above). However, there is frequently a lag between edits to Wikidata's wiki pages and those updates being propagated to the Wikidata Query Service servers.
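
One way to observe this lag is to ask WDQS when its copy of the data was last modified; this is a commonly used probe, but treat the exact subject IRI and query as an assumption rather than a guaranteed interface:

  from datetime import datetime, timezone
  import requests

  LAG_QUERY = "SELECT * WHERE { <http://www.wikidata.org> schema:dateModified ?updated }"

  response = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": LAG_QUERY, "format": "json"},
      headers={"User-Agent": "limits-of-wikidata-example/0.1"},
  )
  value = response.json()["results"]["bindings"][0]["updated"]["value"]
  last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
  print("approximate WDQS update lag:", datetime.now(timezone.utc) - last_update)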

Edits to large items

Performance issues

One bottleneck is editing existing Wikidata items that have a lot of properties. The underlying issue is that, for each edit, RDF for the full item is generated and the WDQS needs to process that full RDF. Therefore, independently of the size of the edit, edits to large items stress the system more than edits to small items. There is a Phabricator ticket to change how the WDQS triple store is updated.
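
To get a feel for how much RDF has to be regenerated per edit, one can fetch an item's full Turtle export; the item IDs below are arbitrary examples:

  import requests

  def turtle_size_kib(qid):
      # Special:EntityData serves the full RDF of an item, which is roughly what
      # has to be regenerated and reprocessed for every edit to that item.
      url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"
      return len(requests.get(url).content) / 1024

  for qid in ("Q42", "Q2"):  # a modest item and a heavily used one
      print(qid, f"{turtle_size_kib(qid):.0f} KiB of Turtle")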

Page size limits

Pages at the top of Special:LongPages are often at the size limit for a wiki page, which is set via $wgMaxArticleSize.

Merged QuickStatement edits

The current QuickStatements website is not always efficient in making edits: adding a statement with references can result in multiple separate edits. This behaviour makes the "Edits to large items" issue described above very visible.

Human engagement limits

"Human engagement limits" include everything to do with human ability and attention to engage in Wikidata. In general Wikidata is more successful when humans of diverse talent and ability enjoy putting more attention and time into their engagement with Wikidata.

Limits in this space include the number of contributors Wikidata has, how much time each one gives, and the capacity of Wikidata to invite more human participants to spend more time in the platform.

Wikidata users

Human users

Human Wikidata readers
Human Wikidata contributors
  • The format is machine-friendly but not human-friendly, which makes it hard for new editors to understand. It is nevertheless necessary to ensure that Wikidata brings in data that may not already be on the internet.
  • It is difficult for college classes and instructors to know how to organize mass contributions from their students; see for example Wikidata_talk:WikiProject_Chemistry#Edits_from_University_of_Cambridge.
  • Effective description of each type of entity requires guidance for the users who are entering a new item: What properties need to be used for each instance of tropical cyclone (Q8092)? How do we inform each user entering a new book item that they ought to create a version, edition or translation (Q3331189) and a written work (Q47461344) entity for that book (per Wikidata:WikiProject_Books)? In other words, how do we make the interface self-documenting for unfamiliar users? And where we have failed to do so, how do we clean up well-intentioned but non-standard edits by hundreds or thousands of editors operating without a common framework?
Human curation
  • Human curation of massive automated data imports: is a tool needed to ensure that data taken from large databases are reliable? Can we harness the power of human curators, who may identify different errors than machine-based checks do?

Tools

Tools for reading Wikidata
Tools for contributing to Wikidata
Tools for curating Wikidata
  • "Wikidata vandalism dashboard". Wikimedia Toolforge.
  • "Author Disambiguator". Wikimedia Toolforge.

Bots

Bots that read Wikidata
Bots that contribute to Wikidata

Users of Wikidata client wikis

Users of Wikidata data dumps

Users of dynamic data from Wikidata

SPARQL

Linked Data Fragments

Other

Users of Wikibase repositories other than Wikidata

Content limits

"Content limits" describe how much data Wikidata can meaningfully hold. Of special concern is limits on growth. Wikidata hosts a certain amount of content now, but limits on adding additional content impede the future development of the project.

A March 2019 report considered the rate of growth for Wikidata's content — wikitech:WMDE/Wikidata/Growth.

Generic

How many triples can we manage?

The Wikidata Query Service (WDQS) is already experiencing stability issues related to the graph size at the current (May 2024) number of triples in the graph (~16 billion). While there is no strict limit on the number of triples that Blazegraph can support, stability issues due to race conditions occur (see T263110). This is fundamentally a software issue that is unlikely to be fixed by more powerful hardware.
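
To check the current figure, one can ask the public endpoint for a full triple count; this particular aggregate usually returns quickly on Blazegraph, though that behaviour is an assumption rather than a guarantee:

  import requests

  response = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }",
              "format": "json"},
      headers={"User-Agent": "limits-of-wikidata-example/0.1"},
  )
  print(response.json()["results"]["bindings"][0]["triples"]["value"], "triples")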

The failure mode we are experiencing is corruption of the Blazegraph journal, which takes the affected server down. This happens more often during data load, when the system is under more stress. When failures occur during data load, the process has to be restarted from scratch, leading to reload times of more than 30 days.

Most of these limitations have been explained in past updates.

The WMF Search Platform team is currently working on splitting the WDQS graph into multiple subgraphs to mitigate this risk.

How many languages should be supported?

Items

Timeline of Wikidata item creation

How many items should there be?

The Gaia project has so far released data on over 1.6 billion stars in our galaxy; it would be nice if Wikidata could handle that. OpenStreetMap has about 540 million "ways". The number of scientific papers and their authors is on the order of 100-200 million. The total number of books ever published is probably over 130 million. OpenCorporates lists over 170 million companies. en:CAS Registry Numbers have been assigned to over 200 million substances or sequences. There are over 100 large art museums in the world, each with hundreds of thousands of items in their collections, so there are likely at least tens of millions of artworks or other artifacts that could be listed. According to en:Global biodiversity there may be as few as a few million or as many as a trillion species on Earth; on the low end we are already close, but if the real number is at the high end, could Wikidata handle it? Genealogical databases provide information on billions of deceased persons who have left some record of themselves; could we allow them all here?

From all these different sources, it seems likely there would be a demand for at least 1 billion items within the next decade or so; perhaps many times more than that.
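
Adding up the rough figures quoted above, as a back-of-the-envelope sum rather than authoritative counts, illustrates why a billion-item scale seems plausible:

  # Rough figures from the paragraph above, in numbers of potential items.
  estimates = {
      "Gaia stars": 1_600_000_000,
      "OpenStreetMap ways": 540_000_000,
      "scientific papers and authors": 200_000_000,
      "books ever published": 130_000_000,
      "OpenCorporates companies": 170_000_000,
      "CAS-registered substances": 200_000_000,
      "museum artworks and artifacts (low estimate)": 30_000_000,
      "genealogical person records (low estimate)": 1_000_000_000,
  }
  print(f"{sum(estimates.values()):,} potential items")  # roughly 3.9 billion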

How many statements should an item have?
  • The top-listed items on Special:LongPages have over 5000 statements. This slows down operations like editing and display.

Properties

How many properties should there be?

How many statements should a property have?

Lexemes

Overview of lexicographical data as of May 2019. Does not discuss limits other than those of QuickStatements.

How many lexemes should there be?

English Wiktionary has about 6 million entries (see wikt:Wiktionary:Statistics); according to en:Wiktionary there are about 26 million entries across all the language variations. These numbers give a rough idea of potential scale; however they cannot be translated directly to expected lexeme counts due to the structural differences between Wikidata lexemes and Wiktionary entries. Lexemes have a single language, lexical category and (general) etymology, while Wiktionary entries depend only on spelling and include all languages, lexical categories and etymologies in a single page. On the other hand, each lexeme includes a variety of spellings due to the various forms associated with a single lexeme and spelling variations due to language varieties. Very roughly, then, one might expect the eventual number of lexemes in Wikidata to be on the order of 10 million, while the number of forms might be 10 times as large. The vast majority of lexemes will likely have only one sense, though common lexemes may have 10 or more senses, so the expected number of senses would be somewhere in between the number of lexemes and the number of forms, probably closer to the number of lexemes.

How many statements should a lexeme have?

So far there are only a handful of properties relevant to lexemes, each likely to have only one or a very small number of values for a given lexeme, so on the order of 1 to 10 statements per lexeme/form/sense is to be expected. However, if we add more identifiers for dictionaries and link to them, we may have a much larger number of external-identifier links per lexeme in the long run, perhaps on the order of the number of dictionaries that have been published in each language.

References

How many references should there be?

How many references should a statement have?

Where should references be stored?

Subpages

Participants

The participants listed below can be notified using the following template in discussions:
{{Ping project|Limits of Wikidata}}