Author: ellyjones@
Audience: Chromium build sheriff rotation members
This document describes how to be a Chromium sheriff: what your responsibilities are and how to go about them. It outlines a specific, opinionated view of how to be an effective sheriff which is not universally practiced but may be useful for you.
Sheriffs have one overarching role: to ensure that the Chromium build infrastructure is doing its job of helping developers deliver good software. Every other sheriff responsibility flows from that one. In priority order, sheriffs need to ensure that:
As the sheriff, you not only have those responsibilities, but you have any necessary authority to fulfill them. In particular, you have the authority to:
Do not be shy about asking for debugging help from subject-matter experts, and do not hesitate to ask for guidance on Slack (see below) when you need it. In particular, there are many experienced sheriffs, ops folks, and other helpful pseudo-humans in Slack #sheriffing.
Effective sheriffing requires an up-to-date checkout, primarily so that you can create CLs to mark tests, but also so you can attempt local reproduction or debugging of failures if necessary. If you are a googler, you will want to have goma working as well, since fast builds provide faster turnaround times.
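If your checkout has fallen behind, a minimal update sequence looks something like the following (a sketch assuming a standard depot_tools setup, run from your src directory):

```shell
# Bring local branches up to date with the remote main branch so that
# test-disabling CLs and local repro attempts start from tip-of-tree.
git rebase-update

# Sync DEPS-managed dependencies; -D removes dependencies that are no
# longer referenced.
gclient sync -D
```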
Sheriffing is coordinated in the Slack #sheriffing channel. If you don't yet have Slack set up, it is worth setting it up in order to sheriff. If you don't want to use Slack, you will at a minimum need a way to coordinate with your fellow sheriffs and probably with the ops team.
These are important Slack channels for sheriffs:
A good way to use Slack for sheriffing is as follows: for each new task you do (investigating a build failure, marking a specific test flaky/slow/etc, working with a trooper, ...), post a new message in the #sheriffing channel, then immediately start a thread based on that message. Post all your status updates on that specific task in that thread, being verbose and using lots of detail/links; be especially diligent about referencing bug numbers, CL numbers, usernames, and so on, because these will be searchable later. For example, let's suppose you see the mac-rel bot has gone red in browser tests:
you: looking at mac-rel browser_tests failure
[in thread]
you: a handful of related-looking tests failed
you: ah, looks like this was caused by CL 12345, reverting
you: revert is CL 23456, TBR @othersheriff
you: revert is in now, snoozing failure
Only use “also send to channel” when either the thread is old enough to have scrolled off people's screens, or the message you are posting is important enough to appear in the channel at the top level as well - this helps keep the main channel clear of smaller status updates and makes it more like an index of the threads.
Monorail is our bug tracker. Hopefully you are already familiar with it. When sheriffing, you use Monorail to:
The CI console page, commonly referred to as “the waterfall” because of how it looks, displays the current state of many of the Chromium bots. In the old days before Sheriff-o-matic (see below) this was the main tool sheriffs used, and it is still extremely useful. Especially note the “links” panel, which can take you not just to other waterfalls but to other sheriffing tools.
Sheriff-o-matic attempts to aggregate build or test failures together into groups. How to use sheriff-o-matic effectively is described below, but the key parts of the UI are:
Sheriff-o-matic can automatically generate CLs to do some sheriffing tasks, like disabling specific tests. You can access this feature via the “Layout Test Expectations” link in the sidebar; it is called “TA/DA”.
Each builder page displays a view of the history of that builder's most recent builds - here is an example: Win10 Tests x64. You can get more history on this page using the links to the bottom-left of the build list, or by appending ?limit=n to the builder page's URL.
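For example, for a builder like Win10 Tests x64 the resulting URL will look something like https://ci.chromium.org/p/chromium/builders/ci/Win10%20Tests%20x64?limit=200, which shows the 200 most recent builds (the exact path depends on the builder; the name here is just the example builder used above).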
If you click through to an individual build (Win10 Tests x64 Build 36397 for example), the important features of the build page are:
The new flakiness dashboard is much faster than the old one but has a different set of features. Note that to effectively use this tool you must log in via the link in the top right.
To look at the history of a suite on a specific builder, set:
builder == (eg) Win10 Tests x64
test_type == (eg) browser_tests
To look at how flaky a specific test is across all builders:
This gives you a UI listing the failures for the named test broken down by builder. At this point you should check to see where the flakes start revision-wise. If that helps you identify a culprit CL, revert it and move on; otherwise, disable the test.
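For a gtest-based Chromium test, disabling means renaming the test in its source file with the DISABLED_ prefix; a sketch, with made-up test names and a placeholder bug number:

```cpp
// Disabled for cross-builder flakiness; see crbug.com/000000 (placeholder).
// Was: TEST_F(FooTest, DoesTheThing).
TEST_F(FooTest, DISABLED_DoesTheThing) {
  // ...test body unchanged...
}
```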
For more advice on dealing with flaky tests, look at the “Test Failed” section below under “Diagnosing Build Failures”.
The old flakiness dashboard is often extremely slow and has a tendency not to provide full results, but it sometimes can diagnose kinds of flakiness that are not as easy to see in the new dashboard. It is not generally used these days, but if you need it, you use it like this:
If you instead want to see how flaky a given test is across all builders, like if you're trying to diagnose whether a specific test flakes on macOS generally:
In either view, you are looking for grey or black cells, which indicate flakiness. Clicking one of these cells will let you see the actual log of these failures; you should eyeball a couple of these to make sure that it's the same kind of flake. It's also a good idea to look through history (scroll right) to see if the flakes started at a specific point, in which case you can look for culprit CLs around there. In general, for a flaky test, you should either:
For more advice on dealing with flaky tests, look at the “Test Failed” section below under “Diagnosing Build Failures”.
The tree status page tracks and lets you set the state of the Chromium tree. You'll need this page to reopen the tree, and sometimes to find out who previously closed it if it was manually closed. The tree states are:
Note that various pieces of automation parse the tree state directly from the textual message, and some pieces update it automatically as well - e.g., if the tree is automatically closed by a builder, it will automatically reopen if that builder goes green. To satisfy that automation, the message should always be formatted as “Tree is $state ($details)”, like:
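For example, messages along the lines of “Tree is closed (https://crbug.com/000000 is breaking compile on Win)” or “Tree is open (flake, reopening)” (the bug number and details here are made up for illustration).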
Another key phrase is “channel is sheriff”, which roughly means “nobody is on duty as sheriff right now”; you can use this if (eg) you are the only sheriff on duty and you need to be away for more than 15min or so:
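For example, something like “Tree is open (channel is sheriff, back in an hour)” - the details are free-form, as long as the “Tree is $state” prefix stays intact for the automation.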
You are on duty during your normal work hours; don't worry about adjusting your work hours for maximum coverage or anything like that.
Note that you are expected not to do any of your normal project work while you are on sheriff duty: you should be spending 100% of your work time on the sheriffing work listed here.
While you are on duty and at your desk, you should be in the “sheriffing loop”, which goes like this:
The loop is a series of checks on the current state of the tree, done in priority order; when the answer to one of them is yes, deal with it before moving on to the next. In particular, when the check is whether a closed tree can be reopened: do not wait for a slow builder to cycle green before reopening if you are reasonably confident you have landed a fix - “90% confidence” is an okay threshold for a reopen.
If none of the above conditions obtain, it's time to do some longer-term project health work!
Don't go back to doing your regular project work - sheriffing is a full-time job for the duration of your shift.
This section discusses how to figure out what's wrong with a failed build. Generally a build fails for one of four reasons, each of which has its own section below.
You can spot this kind of failure because the “Compile” step is marked red on the bot's page. For this kind of failure the cause is virtually always a CL in the bot's blamelist; find that CL and revert it. If you can't find the CL, try reproducing the build config for the bot locally and seeing if you can reproduce the compile failure. If you don't have that build setup (eg, the broken bot is an iOS bot and you are a Linux developer and thus unable to build for iOS), get in touch with members of that team for help reproducing or fixing the failure. Remember that you are empowered to pull in other engineers to help you fix the tree!
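If you do end up reproducing locally, the general shape is to mirror the bot's GN configuration and then build the target that failed; a rough sketch (the args below are placeholders - copy the real ones from the failing bot's build page):

```shell
# Create a build directory whose GN args match the failing bot as closely
# as possible (copy the args from the bot's build page).
gn gen out/Default --args='is_debug=false target_cpu="x64"'

# Build whatever target went red on the bot, e.g. a test binary.
autoninja -C out/Default browser_tests
```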
You can spot this kind of failure by a test step being marked red on the bot's page. Note that test steps with “(experimental)” at the end don't count - these can fail without turning the entire bot red, and can usually be safely ignored. Also, if a suite fails, it is usually best to focus on the first failed test within the suite, since a failing test will sometimes disturb the state of the test runner for the rest of the tests.
Take a look at the first red step. The buildbot page will probably say something like “Deterministic failures: Foo.Bar”, and then “Flaky failures: Baz.Quxx (ignored)”. You too can ignore the flaky failures for the moment - the deterministic failures are the ones that actually made the step go red. Note that here “deterministic” means “failed twice in a row” and “flaky” means “failed once”, so a deterministic failure can still be caused by a flake.
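If you have a local build of the failing suite, one way to investigate is to run just the deterministically-failing test and then repeat it to see whether it is really a flake; a sketch (the suite, test name, and output directory are illustrative):

```shell
# Run only the test reported as a deterministic failure on the bot.
out/Default/browser_tests --gtest_filter=Foo.Bar

# Repeat it to distinguish a consistent failure from a flake.
out/Default/browser_tests --gtest_filter=Foo.Bar --gtest_repeat=20
```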
From here there are a couple of places to go:
When debugging a layout_tests failure, use the layout_test_results step link from the bot page; this will give you a useful UI that lets you see image diffs and similar. This will usually be under “archive results”.
One thing to specifically look out for: if a test is often slow, it will sometimes flakily time out on bots that are extra-slow, especially bots that run sanitizers (MSAN/TSAN/ASAN) or debug bots. If you see flaky timeouts for a test, but only on these bots, that test might just be slow. For Blink tests there is a file called SlowTests that lists these tests and gives them more time to run; for Chromium tests you can just mark these as DISABLED_ in those configurations if you want.
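For the gtest-based case, a common pattern for disabling a test only in particular configurations is the MAYBE_ idiom; a sketch, with made-up test names, a placeholder bug number, and TSAN chosen as the example configuration:

```cpp
// Flaky timeouts under TSAN only; see crbug.com/000000 (placeholder).
#if defined(THREAD_SANITIZER)
#define MAYBE_LoadsLargePage DISABLED_LoadsLargePage
#else
#define MAYBE_LoadsLargePage LoadsLargePage
#endif
IN_PROC_BROWSER_TEST_F(FooBrowserTest, MAYBE_LoadsLargePage) {
  // ...original test body unchanged...
}
```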
When marking a Blink test as slow/flaky/etc, it's often not clear who to send the CL to, especially if the test is a WPT test imported from another repository or hasn't been edited in a while. In those situations, or when you just can't find the right owner for a test, feel free to TBR your CL to one of the other sheriffs on your rotation and kick it into the Blink triage queue (i.e., mark the bug as Untriaged with component Blink).
If a bot turns purple rather than red, that indicates an infra failure of some type. These can have unpredictable effects, and in particular, if a bot goes red after a purple run, that can often be caused by corrupt disk state on the bot. Ask a trooper for help with this via go/bugatrooper. Do not immediately ping in Slack #ops unless you have just filed (or are about to file) a Pri-0 infra bug, or you have a Pri-1 bug that has been ignored for >2 hours.
There are many other things that can go wrong, which are too individually rare and numerous to be listed here. Ask for help with diagnosis in Slack #sheriffing and hopefully someone else will be able to help you figure it out.