by Anthony Vallone

Unreproducible bugs are the bane of my existence. Far too often, I find a bug, report it, and hear back that it’s not a bug because it can’t be reproduced. Of course, the bug is still there, waiting to prey on its next victim. These types of bugs can be very expensive due to increased investigation time and overall lifetime. They can also have a damaging effect on product perception when users reporting these bugs are effectively ignored. We should be doing more to prevent them. In this article, I’ll go over some obvious, and maybe not so obvious, development/testing guidelines that can reduce the likelihood of these bugs from occurring.


Avoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues

I am lumping together many bug types in this section, but they are all related somewhat by how we test for them and how disproportionately hard they are to reproduce and debug. The root cause and effect can be separated by milliseconds or hours, and stack traces might be nonexistent or misleading. A system may fail in strange ways when exposed to unusual traffic spikes or insufficient resources. Race conditions and deadlocks may only be discovered during unique traffic patterns or resource configurations. Timing issues may only be noticed when many components are integrated and their performance parameters and failure/retry/timeout delays create a chaotic system. Memory corruption or uninitialized memory access may go unnoticed for a large percentage of calls but become fatal for rare states. Memory leaks may be negligible unless the system is exposed to load for an extended period of time.

Guidelines for development:

  • Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.
  • Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.
  • Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.
  • Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.

Guidelines for testing:

  • Stress test your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.
  • Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions.
  • Test with debug and optimized builds. You may find that a well behaved debug build works fine, but the system fails in strange ways once optimized.
  • Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.
  • Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.
  • Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.


Enforce preconditions

I’ve seen many well-meaning functions with a high tolerance for bad input. For example, consider this function:

void ScheduleEvent(int timeDurationMilliseconds) {
  if (timeDurationMilliseconds <= 0) {
    timeDurationMilliseconds = 1;
  }
  ...
}

This function is trying to help the calling code by adjusting the input to an acceptable value, but it may be doing damage by masking a bug. The calling code may be experiencing any number of problems described in this article, and passing garbage to this function will always work fine. The more functions that are written with this level of tolerance, the harder it is to trace back to the root cause, and the more likely it becomes that the end user will see garbage. Enforcing preconditions, for instance by using asserts, may actually cause a higher number of failures for new systems, but as systems mature, and many minor/major problems are identified early on, these checks can help improve long-term reliability.

Guidelines for development:

  • Enforce preconditions in your functions unless you have a good reason not to.


Use defensive programming

Defensive programming is another tried-and-true technique that is great at minimizing unreproducible bugs. If your code calls a dependency to do something, and that dependency quietly fails or returns garbage, how does your code handle it? You could test for situations like this via mocking or faking, but it’s even better to have your production code do sanity checking on its dependencies. For example:

double GetMonthlyLoanPayment() {
  double rate = GetTodaysInterestRateFromExternalSystem();
  if (rate < 0.001 || rate > 0.5) {
    throw BadInterestRate(rate);
  }
  ...
}

Guidelines for development:

  • When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls.

Guidelines for testing:

  • Use fuzz testing to test your systems hardiness when enduring bad data.


Don’t hide all errors from the user

There has been a trend in recent years toward hiding failures from users at all costs. In many cases, it makes perfect sense, but in some, we have gone overboard. Code that is very quiet and permissive during minor failures will allow an uninformed user to continue working in a failed state. The software may ultimately reach a fatal tipping point, and all the error conditions that led to failure have been ignored. If the user doesn’t know about the prior errors, they will not be able to report them, and you may not be able to reproduce them.

Guidelines for development:

  • Only hide errors from the user when you are certain that there is no impact to system state or the user.
  • Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.


Test error handling

The most common sections of code to remain untested is error handling code. Don’t skip test coverage here. Bad error handling code can cause unreproducible bugs and create great risk if it does not handle fatal errors well.

Guidelines for testing:

  • Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error.
  • It’s also a good practice to examine your log quality for all types of error handling.


Check for duplicate keys

If unique identifiers or data access keys are generated using random data or are not guaranteed to be globally unique, duplicate keys may cause data corruption or concurrency issues. Key duplication bugs are very difficult to reproduce.

Guidelines for development:

  • Try to guarantee uniqueness of all keys.
  • When not possible to guarantee unique keys, check if the recently generated key is already in use before using it.
  • Watch out for potential race conditions here and avoid them with synchronization.


Test for concurrent data access

Some bugs only reveal themselves when multiple clients are reading/writing the same data. Your stress tests might be covering cases like these, but if they are not, you should have special tests for concurrent data access. Case like these are often unreproducible. For example, a user may have two instances of your app running against the same account, and they may not realize this when reporting a bug.

Guidelines for testing:

  • Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well.


Steer clear of undefined behavior and non-deterministic access to data

Some APIs and basic operations have warnings about undefined behavior when in certain states or provided with certain input. Similarly, some data structures do not guarantee an iteration order (example: Java’s Set). Code that ignores these warnings may work fine most of the time but fail in unusual ways that are hard to reproduce.

Guidelines for development:

  • Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.
  • Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.


Log the details for errors or test failures

Issues described in this article can be easier to reproduce and debug when the logs contain enough detail to understand the conditions that led to an error.

Guidelines for development:

  • Follow good logging practices, especially in your error handling code.
  • If logs are stored on a user’s machine, create an easy way for them to provide you the logs.

Guidelines for testing:

  • Save your test logs for potential analysis later.


Anything to add?

Have I missed any important guidelines for minimizing these bugs? What is your favorite hard-to-reproduce bug that you discovered and resolved?