OCG
OCG, or Offline Content Generation, was a Node.js service that converted Parsoid RDF into an offline form, typically a PDF document. For end users, it was only accessible through the Collection extension in MediaWiki.
OCG was accessible internally at: http://ocg.svc.eqiad.wmnet:8000
The service was turned off on Wikimedia wikis in October 2017 (background: mw:Reading/Web/PDF Functionality, T150871).
Installing a development instance
- First install the Collection extension on your local MediaWiki instance.
cd $IP/extensions; git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection
- Add the following to your $IP/LocalSettings.php:
// Collection extension
require_once("$IP/extensions/Collection/Collection.php");
// configuration borrowed from wmf-config/CommonSettings.php
// in operations/mediawiki-config
$wgCollectionFormatToServeURL['rdf2latex'] =
	$wgCollectionFormatToServeURL['rdf2text'] =
	'http://localhost:17080';
// MediaWiki namespace is not a good default
$wgCommunityCollectionNamespace = NS_PROJECT;
// Sidebar cache doesn't play nice with this
$wgEnableSidebarCache = false;
$wgCollectionFormats = array(
	'rdf2latex' => 'PDF',
	'rdf2text' => 'Plain text',
);
$wgLicenseURL = "http://creativecommons.org/licenses/by-sa/3.0/";
$wgCollectionPortletFormats = array( 'rdf2latex', 'rdf2text' );
- Create a new directory, which we'll call $OCG, and check out the OCG service, bundler, and some backends:
mkdir $OCG ; cd $OCG
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator mw-ocg-service
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator/bundler mw-ocg-bundler
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator/latex_renderer mw-ocg-latexer
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Collection/OfflineContentGenerator/text_renderer mw-ocg-texter
for f in mw-ocg-service mw-ocg-bundler mw-ocg-latexer mw-ocg-texter ; do
	cd $f ; npm install ; cd ..
done
- Follow the Installation instructions in the mw-ocg-latexer/README.md (installing system dependencies in particular).
- Follow the "Running a development server" instructions in mw-ocg-service/README.md to configure and start the OCG service. (Ignore the "Installation on ubuntu" part unless/until you want to install a production instance.)
- Your wiki sidebar should now have "Download as PDF" and "Download as Plain text" entries. Visit an article on your wiki and try them out! Diagnostics appear on the console where mw-ocg-service is running, unless you've configured some other logger.
- If running a private wiki, you must make sure that the $wgServer IP address is able to reach article content and the MediaWiki API. It is suggested to allow this IP address using NetworkAuth, adding it to the iprange array, or by adding the following to LocalSettings.php:
if ( @$_SERVER['REMOTE_ADDR'] == '<enter your $wgServer IP address>' ||
     @$_SERVER['REMOTE_ADDR'] == '127.0.0.1' ) {
    $wgGroupPermissions['*']['read'] = true;
}
- You can also use the bundler and backends directly from the command line. See the mw-ocg-latexer/README.md for an example; a rough sketch follows below.
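This sketch is based on the debugging commands later on this page; metabook.json and the mybook.* names are just placeholders for a saved Collection metabook and its output, not files that exist by default:
mw-ocg-bundler -v -o mybook.zip -m metabook.json   # bundle the articles described by the metabook
mw-ocg-latexer -v -o mybook.pdf mybook.zip         # render the bundle to a PDF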
Monitoring
- OCG eqiad cluster in Ganglia
- Icinga has service checks for HTTP on port 8000 and for disk and queue usage.
- Logging happens in /var/log/ocg.log. There is a log rotation setup in /etc/logrotate.d/ocg. (But you need to be in ocg-render-admins to look at these.)
- Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/ocg [Broken Link T211982]
- graphite.wikimedia.org has a Graphite/ocg/pdf tree with other useful statistics (see the example query after this list), like:
- job_queue_length.value -- The number of currently pending jobs
- status_objects.value -- The number of jobs we're currently tracking (this data is kept for a couple of days for caching purposes)
- [backend|frontend].restarts.count -- The number of times the given thread has restarted (indication of fatal errors)
- Dashboard for OCG is at https://logstash.wikimedia.org/app/kibana#/dashboard/OCG
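For a quick command-line check, Graphite's render API can return these metrics as JSON. The exact metric path below (ocg.pdf.job_queue_length.value) is an assumption based on the tree described above; adjust it to whatever the Graphite browser actually shows:
curl 'https://graphite.wikimedia.org/render?target=ocg.pdf.job_queue_length.value&from=-1h&format=json'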
When something goes wrong
C. Scott and Arlo know the most.
Hop into #mediawiki-parsoid on Libera.chat.
Reverting an OCG deployment
Code
ssh tin
cd /srv/deployment/ocg/ocg
git deploy revert   # pick the last good deployed version
If git deploy revert fails:
git deploy start
git reset --hard <desired changeset>
git submodule update --recursive
git deploy --force sync
Tracking down the source of a sudden traffic spike
MediaWiki asks OCG to render a PDF via its LVS IP, which hits OCG's frontend daemon. The request then becomes a job enqueued in the OCG Redis job queue, and another daemon (the OCG backend) takes care of the rendering. Oxygen and sampled-1000.json can help ops track down heavy hitters such as bots, for example by grepping for requests containing the word rdf2latex.
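As a rough sketch (the log location on oxygen and the field name are assumptions; adjust them to the actual sampled-1000.json layout), something like this surfaces the heaviest user agents requesting rdf2latex renders:
grep rdf2latex /a/log/webrequest/sampled-1000.json \
  | grep -oE '"user_agent": *"[^"]*"' \
  | sort | uniq -c | sort -rn | head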
Deploying changes
OCG is deployed using git-deploy. Briefly, you will run git deploy start, make whichever changes you need to make to the git clone (such as pulling, changing branches, committing live hacks, etc.), then run git deploy sync. The sync command pushes the new state to all backends and restarts them.
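In outline the sequence looks like this (the full, host-specific versions appear in the deploy sections below):
ssh tin                        # the production deploy host
cd /srv/deployment/ocg/ocg
git deploy start
# ...git pull / git checkout / git submodule update as needed...
git deploy sync                # push the new state to all backends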
You should have deploy access and be a member of the deployment-prep project (so you can deploy to beta). Since the service never restarts properly on beta, being listed on Special:NovaSudoer for the deployment-prep project (usually in the under_NDA group) is a good idea, so that you can sudo. In production, being a member of the ocg-render-admins puppet group is helpful in case salt fails to restart the ocg service; being a member of ocg-render-admins also lets you read /var/log if things go wrong.
Since OCG does not have regularly scheduled deploy windows (yet!), ping greg-g on #wikimedia-operations and ask him to schedule a window for your deploy when necessary.
Pre-deploy checks; preparing the deploy commit
OCG is a collection of submodules organized under the OCG Collection service. There are two branches: the master branch holds the latest versions of the code. In theory it would be what's running on our local pre-deploy testing machine (like parsoid.wmflabs.org and the round-trip-testing service for Parsoid), but at the moment nothing automatically pulls from the master branch. The wmf-deploy branch is code we've deemed stable enough to deploy. It should be what's running in beta and on the ocg servers. In addition, the wmf-deploy branch has a prebuilt node_packages folder, which is built for the version of node we run on the cluster (nodejs 0.10 x64, on Ubuntu 14.04).
On your local machine, you will need to update the submodules as required and then run 'make'. The make script builds the node dependencies. Note that your local machine must match the architecture and configuration of the deploy cluster; at the moment that is x64 Ubuntu 14.04 and node 0.10.25. (In the future we may provision an appropriately-configured labs machine to build deploy commits.)
- Begin a deployment summary on OCG/Deployments. Don't include all commits, but only notable fixes and changes (ignore code cleanup updates, test case updates, etc).
- I usually open an edit window on OCG/Deployments and update the various sections as I perform the steps below. I include the shortlog for the submodules in the commit message for the "Updating to latest masters" commit (below), and cut-and-paste that into the wiki summary. Then as the various master and wmf-deploy branch commits are created I update the wiki with the appropriate hashes and gerrit links, adding special notes if there are additional deploy branch commits which are being created or any other special work being done.
- Prepare an ocg-collection repo commit and push it for +2 (note that jenkins is not running on this repo, so you will need to V+2 and submit as well)
- First update the submodules, roughly:
cd ocg-collection
git checkout master ; git pull origin master
git submodule update
git submodule foreach git pull origin master
git add -u
git commit -m "Updating to latest masters"
git review
- You probably want to edit that commit message a bit more before submitting it; see deployment summary discussion above.
- Then build the dependencies:
git checkout wmf-deploy
git pull origin wmf-deploy
git merge master
git commit --amend
git review
- Note the git commit --amend after git merge master to allow the gerrit hooks to add an appropriate Change-Id field to the merge commit.
- If the package dependencies have changed, continue with:
make production
git add --all package.json node_modules
git commit -m "Rebuilding dependencies"
git review
- In order to ensure that the binary versions match, these steps can be done on deployment-pdf01.eqiad.wmflabs. However, the labs machines have packages installed (like node-request) which are not installed in production. Be careful. You can perform the build under /home. After setting up your user.name and user.email using git config --global and doing mkdir ~/bin ; ln -s $(which nodejs-ocg) ~/bin/node, try:
git clone https://gerrit.wikimedia.org/r/mediawiki/services/ocg-collection && cd ocg-collection && git submodule update --init --recursive
and then the above commands (starting with make production) to (re)build the dependencies.
- Other commands that might be useful: curl https://www.npmjs.org/install.sh | bash ; sudo apt-get install g++
- Run the service locally to ensure that nothing has broken: XXX MORE DETAILS HERE XXX
- Add yourself to the "deployer" field of Deployments if you're not already there
- Be online in freenode #wikimedia-operations and #wikimedia-releng (and stay online through the deployment window)
Deploying the latest version of OCG
We are going to deploy the latest version both to the beta cluster and to production. In theory we might separate these steps by a few days, but at the moment we just do a quick test on beta before deploying to production. Let's start by deploying to beta (see https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances for the .ssh/config needed):
$ ssh -A deployment-deploy-01.eqiad.wmflabs
deployment-deploy-01$ cd /srv/deployment/ocg/ocg
deployment-deploy-01$ git deploy start
deployment-deploy-01$ git pull
deployment-deploy-01$ git submodule update --init --recursive
deployment-deploy-01$ git deploy sync
You will then get status updates. If any minions are not ok, retry the deploy until all are. Proceed with 'y' when all minions are ok for each step. If minions are not ok, in general you just need to press 'd' to check the 'detailed status' (which will not restart the salt job, just repoll the status) until they are ok.
Nodes are not automatically restarted. To do this, use
deploy-01$ git deploy service restart
If that fails (in early 2014 it used to be in the habit of failing), deployment-prep admins can sudo service ocg restart on individual boxes -- for beta, that's deployment-pdf01.eqiad.wmflabs and deployment-pdf02.eqiad.wmflabs. On the deployment cluster, you need to be a member of ocg-render-admins to sudo service ocg restart on the ocg100[123].eqiad.wmnet boxes. (Once we get an rsh group we could do something like dsh -g ocg sudo service ocg restart.)
Now go to #wikimedia-releng and report the deploy:
!log updated OCG to version <new hash>
Assuming this all worked, you will want to test the deploy on beta before moving on to production.
- Do a test render on http://en.wikipedia.beta.wmflabs.org (use the "Download as PDF" link in the sidebar)
- For example, [1]
- Use 'force re-render' if necessary to ensure you're testing the latest code.
- You may also wish to use the "Create a book" option in the sidebar and add a few articles to a book, then render that.
- For example, [1]
Now let's deploy to production. Before you begin, notify ops in #wikimedia-operations:
!log starting OCG deploy
Now we're going to repeat the above steps, but on deployment.eqiad.wmnet rather than deployment-deploy-01.eqiad.wmflabs:
$ ssh -A deployment.eqiad.wmnet
tin$ cd /srv/deployment/ocg/ocg
tin$ git deploy start
tin$ git pull
tin$ git submodule update --init --recursive
tin$ git deploy sync
We used to have hosts here which were off but not yet depooled, so we would see minions fail. That shouldn't happen anymore, but if it does consult the old version of this page for advice.
You will then get status updates. If any minions are not ok, retry the deploy until all are. Proceed with 'y' when all minions are ok for each step. If minions are not ok, in general you just need to press 'd' to check the 'detailed status' (which will not restart the salt job, just repoll the status) until they are ok. Remember to then restart the service:
tin$ git deploy service restart
ocg1003.eqiad.wmnet: True
ocg1002.eqiad.wmnet: True
ocg1001.eqiad.wmnet: True
Fantastic! (Remember you can use sudo service ocg restart on the individual ocg100[123].eqiad.wmnet boxes if something goes wrong here.)
Once everything is done, log the deploy in #wikimedia-operations with something like
!log updated OCG to version <new hash> (T<bug number>, T<bug number>, ...)
listing the hash of the deployed OCG version (the hash of the wmf-deploy branch of the ocg-collection repository) as well as any bug numbers referenced in the deploy log. This creates a timestamped entry in the Server Admin Log and creates cross-references in the listed bugs to the SAL.
Post-deploy checks
- Render a page
- Use the 'book creator' link in the sidebar to render a collection.
Scripts
If you are a member of ocg-render-admins (or root) you can run scripts by doing:
$ ssh ocg1001.eqiad.wmnet
ocg1001$ sudo -u ocg -g ocg nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service/scripts/run-garbage-collect.js -c /etc/ocg/mw-ocg-service.js
The machines in question are ocg100[1234].eqiad.wmnet; see https://gerrit.wikimedia.org/r/#/c/150863/
sudo -u ocg -g ocg nodejs-ocg will put you in the same permissions context as the ocg service.
Maintenance scripts
All scripts take a `-c /etc/ocg/mw-ocg-service.js` configuration option which tells them to use the puppetized OCG configuration file.
- /srv/deployment/ocg/ocg/mw-ocg-service/scripts/clear-queue.js -- Clears the job queue if it's gotten too long
- /srv/deployment/ocg/ocg/mw-ocg-service/scripts/run-garbage-collect.js -- If the cron jobs have failed or if there are too many job status objects
There are actually three configuration files for the service:
mw-ocg-service/defaults.js -> /etc/ocg/mw-ocg-service.js -> /srv/deployment/ocg/ocg/LocalSettings.js
- defaults.js -- all the "default stuff"; it is well commented.
- /etc/ocg/mw-ocg-service.js -- all the puppetized stuff, e.g. the redis password, hosts, and file directories.
- LocalSettings.js -- committed to the git repo; holds stuff that is more for performance tweaking.
The service initializes a configuration object, then loads the file specified with the -c command-line option and passes the configuration object to it (the config files are treated as node modules and have a known entry point). In production we pass the /etc/ file to -c, and the /etc/ file then calls the LocalSettings.js file. So it's one big chain, and each step can override the previous one. (The ~/config.js example under "Pruning the queue" below shows what such a chained config looks like.)
(Note that commit da78e552232efe0078452b0f876b926332f49c84 to mw-ocg-service added a /etc/mw-collection-ocg.js configuration file, and a new mechanism for chaining configurations. We haven't switched to this new style of configuration in production yet.)
Pruning the queue
If the queue mechanism appears to be working, but you'd like to expire some jobs in order to free up space, the following command (repeated on each of ocg100[1234].eqiad.wmnet) could be useful:
sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/run-garbage-collect.js -c ~/config.js
where ~/config.js contains something like:
module.exports = function(config) {
	// chain to standard configuration
	config = require('/etc/ocg/mw-ocg-service.js')(config);
	// drastically reduce job lifetimes and frequencies
	var gc = config.garbage_collection;
	['every', 'job_lifetime', 'job_file_lifetime', 'failed_job_lifetime',
	 'temp_file_lifetime', 'postmortem_file_lifetime'].forEach(function(p) {
		gc[p] /= 1000; // maybe you don't need to be this dramatic
	});
	return config;
};
Decommissioning a host
If one of the OCG hosts needs to be taken down (for maintenance, upgrades, etc.), the cache entries for that host need to be removed from redis. The clear-host-cache.js script will do this.
First, remove the host from the round-robin DNS name specified in the Collection extension configuration, so it is no longer the target of new job requests from PHP. (Nowadays this means: [puppetmaster1001:~] $ sudo -i confctl select name=ocg1001.eqiad.wmnet set/pooled=no) This is the $wgCollectionMWServeURL variable, set to ocg.svc.eqiad.wmnet for production and deployment-pdf01 in labs.
You should also decommission the host in puppet, by writing a hieradata/hosts/ocg1003.yaml file (where ocg1003 is the name of the host being decommissioned) with the contents:
ocg::decommission: true
See https://gerrit.wikimedia.org/r/286070 for an example of this. Another simpler example is https://gerrit.wikimedia.org/r/#/c/347781/. This will stop the host from running new backend jobs. You should restart OCG on the affected host(s) once the puppet change has propagated for the configuration to take effect. (Once https://gerrit.wikimedia.org/r/284599 is enabled the explicit restart won't be necessary, but that's not turned on in our machine configuration yet. Baby steps.)
Once the DNS change has propagated and you've restarted OCG with the decommission configuration (restarting will wait for any existing jobs on that host to complete), you would run something like:
$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1003.eqiad.wmnet
where ocg1003.eqiad.wmnet is the fully-qualified domain name of the host you want to decommission.
If the hostname is omitted, the script will use the name of the host it is running on (presumably you'd run this on the OCG host itself, but you could also run it on a different OCG host). (Note that within a week or so of the deployment of https://gerrit.wikimedia.org/r/286068 you will have to clear both the FQDN of the host and the bare hostname. You can do that simultaneously by specifying both on the command line, as shown below.)
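For example, to clear both forms of the name in one invocation (substituting the host actually being decommissioned):
$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-host-cache.js -c /etc/ocg/mw-ocg-service.js ocg1003.eqiad.wmnet ocg1003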
The script will not remove job status entries for pending jobs (unless you use the --force flag). It will complain on the console if it finds pending jobs, and exit with a non-zero exit code. In that case, the operator should wait longer (say, 15 minutes) for the pending jobs to complete and the users to collect the results, before re-running the clear-host-cache script.
Clearing the job queue
If the job queue grows to ridiculous levels, it can impair usability for ordinary users. This can happen when someone decides to (say) spider all of wiktionary. In this case, it might be best to clear the entire job queue, aborting all jobs. The clear-queue.js script will do this; run it on any of the OCG hosts:
$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-queue.js -c /etc/ocg/mw-ocg-service.js
This script will set all pending job status entries to "failed" with the message "Killed by administrative action".
Clearing the cache for a given date range
Parsoid or RESTbase bugs might cause some cached content to be corrupted. After the bug is identified and fixed, the cache entries for some specific period of time might need to be removed from redis to clear the corruption.
The clear-time-range.js script will do this:
$ cd /srv/deployment/ocg/ocg/
$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-time-range.js -c /etc/ocg/mw-ocg-service.js 2015-04-23T23:30-0700 2015-04-24T13:00-0700
where 2015-04-23T23:30-0700 and 2015-04-24T13:00-0700 are the start and end times of the time range in question.
The script will not remove job status entries for pending jobs (unless you use the --force flag). It will complain on the console if it finds pending jobs, and exit with a non-zero exit code. In that case, the operator should wait longer (say, 15 minutes) for the pending jobs to complete and the users to collect the results, before re-running the clear-time-range script.
Regression testing
It is useful to run large numbers of articles through the backend in order to find crashers. We use the mw:Parsoid data set for this, which consists of 10,000 articles from a large number of wikis (and 1,000 articles from a larger set of wikis). To facilitate this, run your local mw-ocg-service as follows:
cd $OCG/mw-ocg-service
./mw-ocg-service.js -c localsettings-wmf.js
with localsettings-wmf.js looking something like:
module.exports = function(config) {
	// Increase this if you don't mind hosing your local machine
	config.coordinator.frontend_threads = config.coordinator.backend_threads = 1;
	// point parsoid at production
	config.backend.bundler.parsoid_api = 'http://parsoid-lb.eqiad.wikimedia.org/';
	// default to enwiki, although we'll be specifying prefixes explicitly
	config.backend.bundler.parsoid_prefix = 'enwiki';
	// optional, but useful if you want to collect postmortem info locally
	// make sure this directory exists
	config.backend.post_mortem_dir = __dirname + '/postmortem';
};
Now you can pull large quantities of articles through. Start with:
cd $OCG/ocg-collection   # checked out from https://gerrit.wikimedia.org/r/mediawiki/services/ocg-collection
cd loadtest ; npm install   # only the first time
./loadtest.js -p enwiki -o 0   # reads from ./pages.list, filters to only enwiki, outputs files named 0-*.txt
This will take a while. But once you have a list of crashers (in 0-failed-render.txt) you can make some fixes and then recheck just the crashers like:
cp 0-failed-render.txt 1.txt
./loadtest.js -o 1 1.txt
Rinse and repeat: copy 1-failed-render.txt to 2.txt once you've fixed some more bugs, and rerun to see what's left.
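Concretely, the second iteration looks just like the first:
cp 1-failed-render.txt 2.txt
./loadtest.js -o 2 2.txt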
Test scripts
- $OCG/ocg-collection/loadtest/loadtest.js - Add metabook jobs to the queue for load testing or to find regressions
- $OCG/ocg-collection/loadtest/injectMetabooks.js - Older version of the above; deprecated.
These scripts are also on the production machines in /srv/deployment/ocg/ocg/loadtest/ but you probably shouldn't run them from there. If you want to inject jobs into the production queue, I recommend running the loadtest script locally after first running:
ssh -L 17080:ocg.svc.eqiad.wmnet:8000 tin
to redirect queries to the production service. (The OCG service port is firewalled from outside connections, hence the need for the ssh tunnel.)
Finding crashers / Debugging "Error: 1"
Here's how to find crashers and reproduce them. Hopefully after that you can fix them!
First, go to logstash: https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Backend
Select an appropriate timeframe from the top-right dropdown, and then type "process died with" (with the quotes, replacing the default *) in the QUERY field.
You should now see some crashers, and some event counts.
Clicking on an entry under "all events" will give you the basic parameters of the request in the full_message.job.metabook field.
- Copy and paste the contents of the full_message.job.metabook field into a new local file, let's call it somebug.json.
- For some reason, logstash adds spurious semicolons. Search and replace all semicolons in the JSON file with nothing (a one-liner for this is sketched after this list).
- Remove the line "parsoid": "http://10.x.x.x", from the end of the JSON file.
- Run mw-ocg-bundler -v -D -o somebug.zip -m somebug.json. If this was a bundler crasher, this command should crash and you're done.
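For the semicolon cleanup step above, a one-liner like this should do it (it simply strips every semicolon from the copied file):
sed -i 's/;//g' somebug.json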
If this was a LaTeX crasher, then:
- Run mw-ocg-latexer -v -D -o somebug.pdf somebug.zip. Hopefully this will crash for you.
- To debug further, use mw-ocg-latexer -v -D -l -o somebug.tex somebug.zip.
- Now you can rerun LaTeX with: TEXINPUTS=tex/: xelatex somebug.tex
- Note that somebug.tex just includes the "real" TeX files in a temporary directory somewhere. Open that up in your editor. Commenting stuff out with % is a good first step to narrow down the bug. If this is a collection of multiple articles, comment out the \input lines in output.tex until you've figured out which article the problem is. The XeLaTeX output will hopefully then give you the line number within that file. Good luck!
Older hints
(04:30:28 PM) mwalker: so... the process right now is to look in logstash for the collection id
(04:31:00 PM) mwalker: cscott, https://logstash.wikimedia.org/#/dashboard/elasticsearch/OCG%20Backend
(04:31:17 PM) mwalker: cscott, which will tell you the IP of the generating host
(04:33:19 PM) mwalker: cscott, once you have the IP; run `host <ip>` on any of the ocg servers or tin to figure out what host that actually was; ssh there, and then look in /var/log/ocg.log for anything that looks like what you've kicked off