media-backups
Component · Active · Public

Members (2)

Watchers

  • This project does not have any watchers.

Details

Description

Tag for tickets related to WMF backup processes regarding backups and recoveries of multimedia files from wikis (including Wikimedia Commons), whose files are stored in production on SRE-swift-storage.

media-backups is one of the main components for handling WMF infrastructure backups and recoveries (Data-Persistence-Backup), the others being bacula and database-backups.

The project already produces working backups and is able to recover single files, but it is still under heavy development as of 2022.

Recent Activity

Jul 11 2024

Yuhong added a comment to T357184: Consider increasing $wgTranscodeBackgroundSizeLimit to 5GB.

Right now this limit is not even 4GB.

Jul 11 2024, 6:46 PM · TimedMediaHandler, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons

Jul 9 2024

Maintenance_bot removed a project from T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space: Patch-For-Review.
Jul 9 2024, 8:40 PM · Data-Persistence-Backup, media-backups

Jul 8 2024

jcrespo closed T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space as Resolved.

Resharding completed; the only thing pending is 2 purge screen sessions running on ms-backup2001 and ms-backup2002 to purge leftovers. backup1011 & backup2011 will have to be complemented by backup1012 and backup2012 this quarter.

Jul 8 2024, 12:40 PM · Data-Persistence-Backup, media-backups
jcrespo closed T365607: Reprovision missing files due to backup1005 hw issues as Resolved.
Jul 8 2024, 12:21 PM · Data-Persistence-Backup, media-backups

Jul 3 2024

jcrespo changed the status of T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space from Open to In Progress.

1 more week left to finish the resharding.

Jul 3 2024, 12:43 PM · Data-Persistence-Backup, media-backups
jcrespo triaged T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space as High priority.
Jul 3 2024, 12:42 PM · Data-Persistence-Backup, media-backups
jcrespo placed T351895: Make it easy to retrieve disk usage trends on backup storage for hw provisioning up for grabs.
Jul 3 2024, 12:42 PM · database-backups, media-backups, bacula, Data-Persistence-Backup
jcrespo changed the status of T365607: Reprovision missing files due to backup1005 hw issues from Open to In Progress.
Jul 3 2024, 12:39 PM · Data-Persistence-Backup, media-backups
jcrespo added a comment to T365607: Reprovision missing files due to backup1005 hw issues.

5 million files left to recover!

Jul 3 2024, 12:38 PM · Data-Persistence-Backup, media-backups
jcrespo updated the task description for T365607: Reprovision missing files due to backup1005 hw issues.
Jul 3 2024, 12:38 PM · Data-Persistence-Backup, media-backups

Jun 17 2024

ABran-WMF added a comment to P65105 Codfw media backup status.

Taking note of that paste, thanks! :)

Jun 17 2024, 12:35 PM · media-backups
jcrespo added a comment to P65105 Codfw media backup status.

And this is the wiki distribution:

Jun 17 2024, 10:19 AM · media-backups
jcrespo added a comment to P65105 Codfw media backup status.

This is the API request I filed: T267365

Jun 17 2024, 10:18 AM · media-backups
jcrespo added a project to P65105 Codfw media backup status: media-backups.
Jun 17 2024, 10:14 AM · media-backups

May 23 2024

jcrespo updated the task description for T365607: Reprovision missing files due to backup1005 hw issues.
May 23 2024, 8:06 AM · Data-Persistence-Backup, media-backups

May 22 2024

Stashbot added a comment to T365607: Reprovision missing files due to backup1005 hw issues.

Mentioned in SAL (#wikimedia-operations) [2024-05-22T15:01:41Z] <jynus> stopping eqiad mediabackups for cleaning up missing files T365607

May 22 2024, 3:01 PM · Data-Persistence-Backup, media-backups
jcrespo added a parent task for T361087: backup1005 crashed: T365607: Reprovision missing files due to backup1005 hw issues.
May 22 2024, 2:46 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T365607: Reprovision missing files due to backup1005 hw issues.

Followup to T361087.

May 22 2024, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo added a subtask for T365607: Reprovision missing files due to backup1005 hw issues: T361087: backup1005 crashed.
May 22 2024, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo triaged T365607: Reprovision missing files due to backup1005 hw issues as High priority.
May 22 2024, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo created T365607: Reprovision missing files due to backup1005 hw issues.
May 22 2024, 2:46 PM · Data-Persistence-Backup, media-backups

Apr 30 2024

jcrespo closed T361087: backup1005 crashed as Resolved.
Apr 30 2024, 10:45 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 25 2024

jcrespo added a comment to T361087: backup1005 crashed.

In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.

Apr 25 2024, 3:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
MoritzMuehlenhoff added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...

Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Yes. Check with @MoritzMuehlenhoff; he did something to fix something, but I'm not sure what, or whether it applies here.

Apr 25 2024, 3:36 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo updated subscribers of T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...

Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Apr 25 2024, 3:29 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
cmooney added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...
Apr 25 2024, 2:50 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye completed:

  • backup1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404251323_root_864747_backup1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Apr 25 2024, 2:10 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye

Apr 25 2024, 12:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye executed with errors:

  • backup1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" backup1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Apr 25 2024, 12:02 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

It booted into bullseye.

Apr 25 2024, 11:40 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye

Apr 25 2024, 11:17 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bookworm executed with errors:

  • backup1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" backup1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Apr 25 2024, 11:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Apr 25 2024, 11:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
ops-monitoring-bot added a comment to T361087: backup1005 crashed.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bookworm

Apr 25 2024, 11:10 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 24 2024

jcrespo claimed T361087: backup1005 crashed.

Will reimage soon.

Apr 24 2024, 4:51 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF changed the status of T361087: backup1005 crashed from Open to In Progress.
Apr 24 2024, 4:49 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
VRiley-WMF added a comment to T361087: backup1005 crashed.

We have received the PERC from Dell and I have just completed swapping it out. It looks like the system can now see the PERC (previously, it couldn't). However, it does seem that the system will need to be rebuilt. @jcrespo would you be able to verify this? Thank you!

Apr 24 2024, 4:08 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 16 2024

VRiley-WMF added a comment to T361087: backup1005 crashed.

We have been able to get Dell support on this unit. After we sent over the logs and they reviewed them, they suggested updating the BIOS and iDRAC. The BIOS install went through fine. After completing the iDRAC update, it's not loading properly. Currently working with Dell to resolve this new issue.

Apr 16 2024, 10:09 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 9 2024

VRiley-WMF moved T361087: backup1005 crashed from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Apr 9 2024, 8:39 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 5 2024

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Cloning speed for 133 GB / 28K objects:

# rclone copy -P backup2007:mediabackups/commonswiki/fff backup2011:mediabackups/commonswiki/
Transferred:      133.243 GiB / 133.243 GiB, 100%, 125.044 MiB/s, ETA 0s
Transferred:        28850 / 28850, 100%
Elapsed time:     14m24.4s
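
As a rough cross-check of the figures above, the whole-run average can be recomputed from the totals alone. This is only a sketch: the function name and its parameters are illustrative and not part of rclone or of the mediabackups code, and it assumes the printed total and elapsed time cover the same transfer.

```python
def average_throughput_mib_s(transferred_gib: float, minutes: int, seconds: float) -> float:
    """Whole-run average in MiB/s, from a total in GiB and an elapsed time."""
    elapsed_s = minutes * 60 + seconds
    return transferred_gib * 1024 / elapsed_s


# Totals copied from the rclone output above.
rate = average_throughput_mib_s(133.243, 14, 24.4)
objects_per_s = 28850 / (14 * 60 + 24.4)

print(f"{rate:.1f} MiB/s")              # 157.8 MiB/s
print(f"{objects_per_s:.1f} objects/s")  # 33.4 objects/s
```

The whole-run average comes out higher than the 125.044 MiB/s shown on the progress line, so the two figures are presumably averaged differently; either can serve as a ballpark for estimating cloning time.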
Apr 5 2024, 6:17 AM · media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo closed T361718: Resharded files fail to be deleted/recovered as Resolved.
Apr 5 2024, 5:20 AM · media-backups, Data-Persistence-Backup

Apr 4 2024

Maintenance_bot removed a project from T361718: Resharded files fail to be deleted/recovered: Patch-For-Review.
Apr 4 2024, 10:30 AM · media-backups, Data-Persistence-Backup
CodeReviewBot added a comment to T361718: Resharded files fail to be deleted/recovered.

jynus merged https://gitlab.wikimedia.org/repos/sre/mediabackups/-/merge_requests/3

Apr 4 2024, 9:58 AM · media-backups, Data-Persistence-Backup

Apr 3 2024

jcrespo added a comment to T361718: Resharded files fail to be deleted/recovered.

I tested the above patch and it solved the issue:

Apr 3 2024, 4:39 PM · media-backups, Data-Persistence-Backup
CodeReviewBot added a project to T361718: Resharded files fail to be deleted/recovered: Patch-For-Review.

jynus opened https://gitlab.wikimedia.org/repos/sre/mediabackups/-/merge_requests/3

Apr 3 2024, 4:38 PM · media-backups, Data-Persistence-Backup
jcrespo claimed T361718: Resharded files fail to be deleted/recovered.
Apr 3 2024, 4:25 PM · media-backups, Data-Persistence-Backup
jcrespo triaged T361718: Resharded files fail to be deleted/recovered as High priority.
Apr 3 2024, 4:25 PM · media-backups, Data-Persistence-Backup

Apr 2 2024

VRiley-WMF added a comment to T361087: backup1005 crashed.

Opened a ticket with Dell to see what they could assist with, since the day we first contacted them was the day the warranty expired. Awaiting response from Dell.

Apr 2 2024, 6:40 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Mar 27 2024

jcrespo added a comment to T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space.

The new shard is looking great:

Mar 27 2024, 7:05 PM · Data-Persistence-Backup, media-backups
Stashbot added a comment to T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T18:54:07Z] <jynus> increasing volume size of backup2011 T334069

Mar 27 2024, 6:54 PM · Data-Persistence-Backup, media-backups