Commons:Requests for comment/Technical needs survey/Media dumps

Media dumps

Description of the Problem

  • Problem description:

There are no Wikimedia Commons dumps that include any media. A Phabricator ticket has been open since 2021 (T298394), but no major progress has been made. The root of the problem seems to be the enormous total size of all media currently on Commons (almost 500 TB). Fortunately, thanks to the hard work of a few people, Commons media now have 2 backups in locations that are very distant from each other (https://phabricator.wikimedia.org/T262668, https://wikitech.wikimedia.org/wiki/Media_storage/Backups), although they sit in the same data centers as the primary copies. Having copies in more locations would provide greater security, considering the value of some of the content hosted.

  • Proposal type: process request

  • Proposed solution:

There's no need at all to include ALL Commons media in dumps. The focus should be on media with special value, such as historical photographs or documents (here, historical does not necessarily mean old) or featured pictures. Using categories, it should be easy to select all pictures depicting paintings, books, documents and maps (with some kind of filter to exclude user-made or trivial maps, such as country location maps, which individually take very little space but exist in huge numbers), as well as photos of special historic value (again, they can be very recent, provided they depict something truly historic). Featured pictures are easy to select since they belong to specific categories; a rough sketch of this kind of category-based selection is shown below, after the proposal fields. This collection (a subset of Commons) could be split by topic, to produce even smaller individual dumps. These dumps could then be distributed to mirrors around the world (for example, at libraries or universities that volunteer to host them, using a model similar to Debian mirrors). The Internet Archive would be another possible host, but since it stores only 2 copies of each item, both of them in the San Francisco area, which has high seismic risk, it sadly cannot be relied on for long-term preservation unless that improves in the future or the paid Archive-It service (https://support.archive-it.org/hc/en-us/articles/208117536-Archive-It-Storage-and-Preservation-Policy) is used (more copies are stored in other locations under that option).

  • Phabricator ticket:

T298394

  • Further remarks:

This proposed solution consists only of general ideas that obviously need much more revision and elaboration, but the basic goal is to have dumps of at least the media deemed most important (leaving criteria and technical aspects aside). Having backups of all media in other locations besides the 2 main datacenters would be another, perhaps even better, solution. It costs money, but it should be a priority in the budget, as the Wikimedia Foundation's mission statement says: "The Foundation will make and keep useful information from its projects available on the internet free of charge, in perpetuity."
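To make the category-based selection sketched above a bit more concrete, here is a rough, unofficial illustration (Python, using only the public MediaWiki API) of how one could enumerate the files in a candidate category and estimate the size of the resulting subset. The category name is just an example, and an actual dump job would presumably run server-side against the database rather than through the API.

```python
"""Rough sketch only: estimate the size of a category-based subset of Commons
(e.g. featured pictures) via the public MediaWiki API. The category name and
the simplified continuation handling are illustrative assumptions."""
import requests

API = "https://commons.wikimedia.org/w/api.php"
CATEGORY = "Category:Featured pictures on Wikimedia Commons"  # example subset

def iter_category_files(category):
    """Yield (title, size_in_bytes) for every file directly in one category."""
    session = requests.Session()
    params = {
        "action": "query",
        "format": "json",
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmtype": "file",
        "gcmlimit": "max",
        "prop": "imageinfo",
        "iiprop": "size",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            info = (page.get("imageinfo") or [{}])[0]
            yield page["title"], info.get("size", 0)
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation

if __name__ == "__main__":
    sizes = [size for _, size in iter_category_files(CATEGORY)]
    print(f"{len(sizes)} files, about {sum(sizes) / 1e12:.2f} TB in {CATEGORY}")
```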

Discussion

  • Tending toward Support, but some things are a bit unclear. Are you saying there are 3 backups of WMC at a distant location, but all at the same place, so there should be a fourth at another place? (How many are there, and are you requesting that one backup be moved, or another full backup?)
I'm thinking about whether there are methods of excluding files to reduce the size, but that could also introduce problems, albeit ones of lower data-loss severity. Maybe there are some categories of files whose file sizes are very large despite being of little use, where all files, or all unused ones, could be excluded from the dump. Or there could be very small backups of all files that are in use or otherwise likely valuable. I think approaches that exclude files (such as all videos longer than 10 minutes or larger than 200 MB, plus all uncategorized unused images, etc.) rather than a whitelist approach would be best (a rough sketch of such a filter follows this comment).
I think small Wikimedia Commons data dumps of files as well as metadata (like file descriptions and cats) would be useful, and so far I haven't found any.
More full backups should certainly be made once there are new technologies for sustainable long-term large-scale data storage. For now, as far as non-public full backups are concerned, 3 backups do seem possibly enough. --Prototyperspective (talk) 11:47, 29 December 2023 (UTC)
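Purely to illustrate the exclusion-criteria idea above, a blacklist-style filter over per-file metadata might look roughly like the sketch below. The thresholds and metadata fields (mime, size, duration, usage and category counts) are assumptions about what a dump-generation job would have available, not an existing tool.

```python
"""Illustrative sketch of a blacklist filter for a reduced media dump.
All thresholds and field names are assumptions, not an existing tool."""

def include_in_dump(mime: str, size_bytes: int, duration_s: float,
                    usage_count: int, category_count: int) -> bool:
    """Return True if the file should go into the reduced media dump."""
    is_video = mime.startswith("video/")
    if is_video and (duration_s > 10 * 60 or size_bytes > 200 * 1024 ** 2):
        return False                     # long or very large videos
    if usage_count == 0 and category_count == 0:
        return False                     # uncategorized, unused files
    return True

# Examples: a 12-minute video is excluded, a small used JPEG is kept.
assert not include_in_dump("video/webm", 150 * 1024 ** 2, 720, 5, 3)
assert include_in_dump("image/jpeg", 4 * 1024 ** 2, 0.0, 2, 4)
```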
@Prototyperspective, I said that there are currently 2 backups, but they are at the same datacenters as the primary copies (there are 2 primary copies, at the Virginia and Texas datacenters, so the 2 datacenters are distant from each other, but there is no backup outside those 2 places, and I think it is advisable to have at least a third place, especially if there are no media dumps adding more copies).
I think approaches that exclude files (such as all videos longer than 10 minutes or larger than 200 MB, plus all uncategorized unused images, etc.) rather than a whitelist approach would be best: I totally agree: surely it would be a lot easier and produce a better result, thanks (I would add trivial maps (for example, <1 MB, or 500 KB) to the exclusion list, though, because there are lots of them and they are of very low value in themselves, but perhaps I'm obsessed with this and they don't take up that much space).
I think small Wikimedia Commons data dumps of files as well as metadata (like file descriptions and cats) would be useful, and so far I haven't found any: there haven't been any since 2013 (the dumps only include the metadata). The Phabricator ticket linked above contains more info about it. Size seems to be the main problem, so excluding certain files and creating several dumps instead of one would be of great help, I think. MGeog2022 (talk) 12:57, 29 December 2023 (UTC)
copies = backups (copies of the data). Please be clearer; this is very ambiguous. I think you're saying there is the database and two backups, one of which is at the same place as the live database, and that a third at another location would be good.
I missed saying that it could also be the case that excluding large files or categories containing large low-usefulness files wouldn't make much of a difference: it could be that the large size comes from the number of small-to-medium-sized files dispersed all across WMC. For example, if it reduced the size by 50 TB, it wouldn't make much of a difference at 500 TB. A treemap of file size by categories and file types, or something similar, could be very useful (an issue with that is that files are in multiple categories; a rough sketch of a size-by-type breakdown follows this comment). I do think that excluding unused and unlikely-to-be-very-useful videos would make a substantial difference in file size.
A third backup at a third location is something I support. Prototyperspective (talk) 13:38, 29 December 2023 (UTC)
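As a rough sketch of that kind of size breakdown, the snippet below tallies bytes per media type over a small sample taken through the public API. This is only an illustration: the sample is alphabetical rather than random, the sample size is an arbitrary assumption, and a real analysis would work from the image metadata in the existing dumps.

```python
"""Sketch only: tally bytes per media type over a small API sample.
Not a fair statistical sample; a real analysis would use the metadata dumps."""
from collections import Counter
import requests

API = "https://commons.wikimedia.org/w/api.php"
SAMPLE = 5000  # arbitrary sample size for illustration

def sample_sizes_by_type(sample=SAMPLE):
    bytes_per_type = Counter()
    session = requests.Session()
    params = {
        "action": "query", "format": "json", "list": "allimages",
        "aiprop": "size|mediatype", "ailimit": "max",
    }
    seen = 0
    while seen < sample:
        data = session.get(API, params=params, timeout=30).json()
        for img in data["query"]["allimages"]:
            bytes_per_type[img.get("mediatype", "UNKNOWN")] += img.get("size", 0)
            seen += 1
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation
    return bytes_per_type

if __name__ == "__main__":
    for mediatype, total in sample_sizes_by_type().most_common():
        print(f"{mediatype:10s} {total / 1e9:8.2f} GB in sample")
```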
@Prototyperspective, copies and backups are not the same (not all copies can be called backups). To be clearer, there are 4 copies: a production copy and a backup at each datacenter (that is, 2 production copies and 2 backups). An additional backup at a third place would be fine, especially if there are no dumps.
I missed saying that it could also be the case that excluding large files or categories containing large low-usefulness files wouldn't make much of a difference: certainly, additional information would be needed. That is why I initially proposed a whitelist instead of a blacklist: include only those files that are deemed especially important, to have working dumps as soon as possible (including more files, or even all of them, would produce better dumps, but if that comes at the expense of keeping a Phabricator ticket open for 10 years (the existing one has been open for 2 years, and no major progress is seen), I choose the most practical solution). More should be known about budget and technical issues before opting for one or the other (whitelist or blacklist). MGeog2022 (talk) 19:07, 29 December 2023 (UTC)
Another thing: the enormous total size of all media currently on Commons (almost 500 TB) is very misleading: 500 TB is very little if that is the actual size. Do you have a source or chart for that number?
That doesn't mean it couldn't change substantially with more HD videos getting uploaded (which could become a problem but isn't one now). So it seems like there are enough backups, but considering the small size, setting up an additional one at a third location in the near future indeed seems like a good thing to do. I don't think it's one of the most important issues at this point though. However, there should be a way for people to download rather than scrape all of WMC (or any select parts of it, such as all of its images or all but videos); that seems more important, but it's not clear if this proposal is also about that. I don't know if there is a text-only Wikipedia dump and a small WMC dump of all files used by it that you can combine if you have any versions of both (modular). --Prototyperspective (talk) 13:25, 31 December 2023 (UTC)
@Prototyperspective, this is the source for the size of all media currently in Commons.
500 TB is very little if that is the actual size: it depends on what you call "big" or "little". The fact is that there have been no media dumps for 10 years, and size seems to be the main cause (see here: generating and distributing 400TB of data among the many consumers that will likely be interested on those will still require some serious architectural design (e.g. compared to serving 356 KB pages, or 2.5T wikidata exports)). Storing 500 TB is not the same as distributing it as dumps (in fact, I doubt it is currently even possible with a single dump).
I don't know if there is a text-only Wikipedia dump: yes, there are, for all languages, and not only for Wikipedia, but also for all other Wikimedia projects (see here).
and a small WMC dump of all files used by it: no, there isn't any media dump. Again, "small" is relative, taking into account that the English Wikipedia dump (as compressed text, and I think it includes the full version history of all currently existing articles) is only 21.2 GB in size (obviously, when there were Commons media dumps 10 years ago, they were far bigger than this). I am sure we are talking about a challenge for the technical team, since a ticket for this has been open for 2 years now. With this proposal, I'm trying to make it possible by greatly reducing the dump size.
setting up an additional one at a third location in the near future indeed seems like a good thing to do. I don't think it's one of the most important issues at this point though: things are taken for granted, until they aren't. I'm not saying that the Commons backup policy is wrong (it is far better than the Internet Archive's, for example, of course), and I think a catastrophic loss of Commons content is highly unlikely. But all other Wikimedia content (text) is distributed as dumps on mirrors outside the Wikimedia Foundation datacenters, while media isn't. So in fact there are more backups of all other content (including past vandalized versions of Wikipedia articles, for example) than of any media, no matter how important it is. I think Wikimedia Commons is something really unique, especially if you think about its relationship with other projects such as Wikisource and Wikipedia. All this together is a really unique collection, and a freely distributable one (the sum of all human knowledge, as the Wikipedia slogan says, or at least the currently freely licensed part of it). And I believe it should be made easy to distribute copies of it around the world, since libraries or universities would probably be interested in hosting them. MGeog2022 (talk) 14:06, 31 December 2023 (UTC)
  • Why I consider this proposal (mine) important: a large number of files hosted on Wikimedia Commons are of high value as part of Wikimedia's sum of all human knowledge, but, unlike text content, whose dumps are hosted on several external mirrors, they are hosted only in the 2 Wikimedia Foundation datacenters. It can seem enough, and perhaps it even is enough... many museums and libraries have been there for centuries, they are there, and they will probably also remain there for many more centuries. But the purpose of the Wikimedia movement is to bring it all together: we are putting in the same place many of the videos, many of the songs, many of the books, many of the photos of items displayed at museums, among other things. I consider this sum to have really high value, so it is worth having it in more than 2 physical locations. Billions of people on 5 continents view all this content on our smartphones, and we take it for granted. For a moment, we can forget that, unlike most physical books, it is stored in only 2 places, and that can be a bit scary. MGeog2022 (talk) 15:00, 20 January 2024 (UTC)
  • I feel like the ask here is very unclear. Is it just that we don't trust the existing backup? Do we just want an offline backup somewhere? Do we just want more geographically distinct locations? Is it that we want there to exist a backup not under the control of the WMF? Do we want the ability for randoms to get a copy of the whole database? Respectfully, I would suggest this proposal should include the specific threats you want Commons to be protected against. If we don't have specific worries, then we should just let the people currently working on it get on with their jobs. Bawolff (talk) 18:35, 3 February 2024 (UTC)
    @Bawolff, there are many trust levels. Perhaps we can trust the current backups 95%, but not the other 5%. All text content in all Wikimedia projects (including Commons itself) has several dumps hosted both outside WMF control and in many different places around the world. I (and many other people) think media content shouldn't get less protection than text content. In addition, some reusers specifically want a media dump as such (see here, in Phabricator, a ticket that predates this proposal by several years). It's not about lack of trust in the WMF, but think for example about a possible sophisticated attack against the WMF: in such a case, it would be a huge advantage to have copies outside the WMF.
    then we should just let the people currently working on it get on with their jobs: I have nothing at all to say against this. If the ticket isn't stagnant and we see media dumps working over the next months/years thanks to the existing ticket, this proposal wouldn't make any sense. MGeog2022 (talk) 14:21, 5 February 2024 (UTC)
    My objection isn't that you shouldn't care about such things, but this proposal is so vague and subjective, with too many interrelated things mixed together, that it isn't effectively actionable. I don't think it's constructive to give the WMF proposals like this. By all means, if you want media dumps, ask for media dumps; if you want WMF backups to be able to survive the US government seizing all the servers, ask for that. If you want to ask the WMF to solve a problem, ask for that problem to be solved. Bawolff (talk) 20:00, 6 February 2024 (UTC)
    @Bawolff, I'm only suggesting options to evaluate, sorry if this looks like a vague proposal. I fully acknowledge that it can be misunderstood, but I offer alternatives in case my first proposal isn't feasible. I asked for media dumps: if that isn't feasible, it could be compensated for by an additional backup. On the other hand, I never said that the fact that all copies are in the U.S. was a cause for concern: in fact, I know that the WMF is based in the USA and has to serve its contents from there for legal reasons. That's why I said that these restrictions probably wouldn't apply to offline copies outside the USA, because almost all WMF datacenters that don't yet host a backup are outside the U.S. (except for San Francisco, but the earthquake risk there doesn't make it the best option for a new backup). MGeog2022 (talk) 20:51, 6 February 2024 (UTC)
    Sorry, the comment about copies outside the USA is further down in this conversation; I thought you were talking about it, but perhaps you weren't. MGeog2022 (talk) 21:00, 6 February 2024 (UTC)
    @Bawolff: Aside from the above, COM:AWB can work based on dumps, but it needs dumps to make that happen.   — 🇺🇦Jeff G. please ping or talk to me🇺🇦 14:26, 5 February 2024 (UTC)
    This doesn't make sense to me. What is an example of a task you want AWB to do that you would be able to do if not for the lack of media dumps? Bawolff (talk) 20:02, 6 February 2024 (UTC)
    @Bawolff: Yes, the Database Scanner works on "current" or "Pages" XML files. I want to be able to scan file description pages.   — 🇺🇦Jeff G. please ping or talk to me🇺🇦 17:11, 7 February 2024 (UTC)
    @Jeff G., I thought file descriptions (not media files themselves) were included in Commons dumps such as this (maybe I'm wrong, or maybe you weren't aware of it). MGeog2022 (talk) 19:10, 7 February 2024 (UTC)
    Yes, as MGeog2022 said, file description pages are already in the dumps, and they are not usually what people are talking about when they say "media dumps". Bawolff (talk) 22:29, 7 February 2024 (UTC)
  • I would say there should be at least two dumps: featured media, and media used on other wikis (a rough sketch of that usage check appears at the end of this thread). My guesstimate for the latter would be a few million photos and a few thousand videos. Audio files are negligible in size. The dumps also don't need to have file histories. Another question is whether we should apply downscaling, since I would still estimate the dump to be terabytes in size. SWinxy (talk) 02:12, 6 February 2024 (UTC)
    I think this proposal is mainly for a backup, not a publicly downloadable media dump, but I think the latter would be even more useful in addition. Thus this seems to be largely about a full backup, which only requires a few ~30 TB HDDs and a backup procedure. It would be nice to have the two you named be downloadable. One could also have a semi-full dump which, for example, doesn't include TIFF files and large, unused, rarely viewed videos, but excluding content like this would not reduce the size by more than maybe a quarter, so one could just as well back up everything more often at this point. I don't think downscaling is a good idea, and there could also be a downloadable used-or-featured/… images-only dump along with the entire category structure, which could be modular so you can combine it with newer or larger dumps. Prototyperspective (talk) 12:35, 6 February 2024 (UTC)
    @SWinxy, @Prototyperspective, if a multi-TB dump isn't viable, then I think the solution is to split it into several smaller dumps (for example, by root category), not downscaling. Storing some additional TB shouldn't be a major problem for the WMF, with the resources it has, but I acknowledge that an enormous single dump might not be the best solution for a downloadable dump. I also think that there are unused files, neither featured nor used in other projects, that are worth including in dumps, but this could be dealt with on a case-by-case basis. In principle, the existing ticket is for a full dump, but I don't see major advances in it; that's why I created this proposal. Finally, as Prototyperspective said, there are 2 goals here: to have a dump usable by third parties (if complete enough, this is almost like having many additional full backups, and is a great thing), and to have backups in more than just 2 places. I think the WMF could easily improve this last point, given its budget: it wouldn't be so expensive to have, for example, offline backups of both text and media content from all wikis (this goes well beyond the scope of Commons; I'm talking about private full database backups here, not about public dumps) in several of its caching datacenters, such as Amsterdam, San Francisco, Marseille or Singapore (as mere offline backups, I think that being outside the USA is no problem at all). MGeog2022 (talk) 13:06, 6 February 2024 (UTC)
    Of course, I always speak on the basis of public information. Perhaps there is an offline backup of all text and media content at an unknown third place, and its secrecy is precisely what helps it survive even the worst possible events. By the way, I'm never thinking about near-apocalyptic events: cyber attacks against 4 servers, or physical attacks against 2 buildings, are feasible; the world as we know it would still be there... well, no, not exactly as we know it: it would be missing an important part of its free knowledge. MGeog2022 (talk) 13:16, 6 February 2024 (UTC)
    Regarding the storage of copies of the dumps (Commons and others): Debian has a really large list of mirrors, and the total size of Debian packages is about 5 TB (not all mirrors store all content, though). Perhaps the WMF could do some advocacy work to get third parties to store dumps (currently, there are 10 of them, with only 1 of the 10 storing the really outdated media dump from around 2013). Probably few organizations are willing to store all the full dumps, but by dividing them by wiki, by language, etc., there could be many more copies of the dumps than there currently are... how many TB, or even PB, are being stored across all Debian mirrors, if we add them all together? And Wikimedia is much better known than Debian (at least to the general public). MGeog2022 (talk) 13:32, 6 February 2024 (UTC)
    OK, nix the downscaling. In a catastrophe, the bare minimum we'd want people to mirror is the vital media: featured media and media used on other projects. It being a fraction of Commons' entire media library makes it more appealing to save or mirror, and easier to distribute to the WMF's other datacenters. But yeah, we also need a full backup, which I assume would be on tape drives, which are better for archival storage. SWinxy (talk) 16:07, 6 February 2024 (UTC)
    @SWinxy, full media backups have indeed existed since 2021 (thanks to the hard work of some folks at the WMF, mainly Jaime Crespo, from what I see in Phabricator). Before that date, they didn't exist, and in my opinion that was a really terrible situation. The current backups, as far as I know from official sites, aren't fully offline, though, and are located in the same datacenters (not the same servers) as the production copies. I consider that additional fully offline backups, located at other places (I'm exceeding the scope of Commons here; this isn't only about media), would be a very good thing (at the same time, publicly downloadable dumps, mirrored in multiple places, would provide a similar level of security). In the medium term, tape backups could be the technology to use for that (that's up to the people at the WMF, of course). In the long term, there are really promising new technologies that will probably be a big revolution in computer data storage, both in terms of capacity and durability. MGeog2022 (talk) 20:36, 6 February 2024 (UTC)
    mmmk I see! SWinxy (talk) 21:44, 6 February 2024 (UTC)
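For reference, here is a minimal sketch of the "used on other wikis" test mentioned earlier in this thread, using the GlobalUsage API on Commons. The file name is only a placeholder, and an actual dump job would presumably work from the database rather than through the API.

```python
"""Minimal sketch (not an existing tool) of a "used on other wikis" check
via the GlobalUsage API on Wikimedia Commons. The file title is a placeholder."""
import requests

API = "https://commons.wikimedia.org/w/api.php"

def is_used_on_other_wikis(file_title: str) -> bool:
    """True if the Commons file is embedded on at least one non-Commons wiki."""
    params = {
        "action": "query", "format": "json",
        "titles": file_title,
        "prop": "globalusage",
        "gulimit": "50",
    }
    data = requests.get(API, params=params, timeout=30).json()
    for page in data["query"]["pages"].values():
        for usage in page.get("globalusage", []):
            if usage.get("wiki") != "commons.wikimedia.org":
                return True
    return False

# Placeholder usage: would this file belong in a "used on other wikis" dump?
print(is_used_on_other_wikis("File:Example.jpg"))
```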

Perhaps the WMF should make LTO-8 (say) tape collections of the content available on a cost-plus basis, in a number of different versions: say featured-only, then whitelist-only, then all. At 12 TB uncompressed per tape, that's not a vast number of tapes even for a full dump. The Anome (talk) 14:43, 5 August 2024 (UTC)

Votes

 Support 500 TB is only 42 × LTO-8 tapes. At $50 each, that's only $2100 for a set, a tiny, tiny fraction of the WMF's budget. You could also make Wikipedia's text dumps available on a similar basis. Make them available to all comers, both individual and corporate, at cost-plus, and give copies free to prominent, reputable physical archivers (e.g. the Internet Archive, Library of Congress, British Library, Deutsche Nationalbibliothek, Archive Team, Amazon...). There's no excuse for not doing something like this: as said above, keeping project data available forever is a core goal of the project. The Anome (talk) 14:46, 5 August 2024 (UTC)
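A quick back-of-the-envelope check of the figures in the vote above (the inputs are the assumed values quoted there; actual media size, tape capacities and tape prices vary):

```python
# Sketch only: verify the tape-count and cost estimate quoted above.
total_tb = 500          # approximate size of all Commons media (assumed)
tape_capacity_tb = 12   # LTO-8 native (uncompressed) capacity
tape_price_usd = 50     # assumed price per cartridge

tapes_needed = -(-total_tb // tape_capacity_tb)  # ceiling division -> 42
print(f"{tapes_needed} tapes, about ${tapes_needed * tape_price_usd} per full set")
```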