
Posted by Marc Kaplan, Test Engineering Manager

When deciding whether to check a new change in, there are several testing-related questions to ask, which at least include:
1. Does the new functionality work?
2. Did this change break existing functionality?
3. What is the performance impact of this change?

We can answer these questions either before we check things in, or after. We've found that we save a lot of detective work tracking down bad changes if we answer them before checking in to the depot. On most teams developers kick off the tests before they check anything in, but in case somebody forgets, or doesn't run all of the relevant tests, we wanted a more automated system. So at Google we have what we call presubmit systems, which run all of the relevant tests in an automated fashion before check-in, without requiring any extra effort from the developer. The idea is that once the code is written, there are minutes or hours before it's checked in while it's being reviewed via the Google code review process (more info on that in Guido's Tech Talk on Mondrian). We take advantage of this time by having the presubmit system find all of the tests that are relevant to the change, run them, and report the results.
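
To make the idea concrete, here is a minimal sketch of what such a hook does; the dependency map, function names, and test targets are illustrative stand-ins, not the actual presubmit system.

import subprocess

def relevant_tests(changed_files, dependency_map):
    # Map each changed file to the test targets that depend on it.
    tests = set()
    for path in changed_files:
        tests.update(dependency_map.get(path, []))
    return sorted(tests)

def run_presubmit(changed_files, dependency_map):
    # Run every relevant test and collect pass/fail results to report
    # back on the code review.
    results = {}
    for test in relevant_tests(changed_files, dependency_map):
        proc = subprocess.run([test], capture_output=True)
        results[test] = (proc.returncode == 0)
    return results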

Most of the presubmit work at Google has focused on functionality (questions 1 and 2 above). On the GFS team, however, where we make a lot of performance-related changes, we wondered whether a performance presubmit system would be possible. So we recently developed such a system and have begun using it. A hook was added to the code review process so that once a developer sends out a change for review, an automated performance test is started via the GFS Performance Testing Tool that we previously described in another Testing Blog post. Once two runs of the performance test have finished, they are compared against a known-good baseline that is automatically created at the beginning of each week, and a comparison graph is generated and e-mailed to the user submitting the CL. So now we know, before any code is checked in, whether it takes an unexpected hit to performance. Additionally, when a developer makes a change they think will improve performance, they don't need to do anything beyond a normal code review, which triggers an automatic performance test that shows whether they got the performance bump they were hoping for.
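
A rough sketch of the comparison step might look like the following; the metric names, the averaging of the two runs, and the 5% threshold are illustrative assumptions rather than the tool's actual logic.

REGRESSION_THRESHOLD = 0.05  # flag anything more than 5% worse than baseline

def compare_to_baseline(run_a, run_b, baseline, higher_is_better=("read_mb_per_s",)):
    # run_a, run_b, baseline: dicts mapping metric name -> measured value.
    report = {}
    for metric, base_value in baseline.items():
        candidate = (run_a[metric] + run_b[metric]) / 2.0
        delta = (candidate - base_value) / base_value
        if metric not in higher_is_better:
            delta = -delta  # for latency-style metrics, an increase is a regression
        report[metric] = {
            "baseline": base_value,
            "candidate": candidate,
            "regression": delta < -REGRESSION_THRESHOLD,
        }
    return report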

[Example of a performance presubmit report]

I've seen many companies and projects where performance testing is cumbersome and not run very often until near the release date. As you might expect, this is very problematic: once you find a performance bug you have to go back through all of the changes and try to narrow it down, and many times it's simply not that easy. Doing performance testing early and often helps narrow things down, but we've found it's even better to make sure these performance regressions never creep into the code at all, via pre-check-in performance testing. You should certainly still do a final performance test once the code is frozen, but with the presubmit performance testing system in place you can be reasonably confident that the performance will be as good as you expect it to be.


Posted by Rajat Jain and Marc Kaplan, Infrastructure Test Engineering

Google is unique in that we develop most of our software infrastructure from scratch inside the company. Distributed filesystems are no exception, and we have several here at Google that all serve different purposes. One such filesystem is the Google File System (GFS), which is used to store almost all data at Google. Although GFS is the ultimate endpoint for much of the data at Google, there are many other distributed filesystems built on top of GFS for a variety of purposes (see Bigtable, for example, but several others also exist), with developers constantly trying to improve performance to meet the ever-increasing demands of serving data at Google. The challenge for the teams testing the performance of these filesystems is that running performance tests, analyzing the results, and repeating over and over is very time consuming. Also, since each filesystem is different, we have traditionally had different performance testing tools for the different filesystems, which made it difficult to compare performance between them and led to a lot of unnecessary maintenance work on the tools.

In order to streamline testing of these filesystems, we wanted to create a new framework capable of easily performance testing any filesystem at Google. The goals of this system were as follows:

  • Generic: The testing framework should be generic enough to test any type of filesystem inside Google. With a generic framework, it is easier to compare the performance of different filesystems across many operations.
  • Ease of use: The framework should be easy enough to use so that software developers can design and run their own tests without any help from the test team.
  • Scalable: Testing can be done at various scales depending on the scalability of the filesystem. The framework can issue any number of operations simultaneously. So, for testing a Linux filesystem we might only issue 1,000 parallel requests, while for the Google File System we might want to issue requests at a much larger scale.
  • Extensible:
    • Firstly, it should be easy to add a new kind of operation to the framework if one is developed in the future (for example, the RecordAppend operation in GFS).
    • Also, the framework should allow the user to easily generate complex load scenarios on the server. For example, we might want a scenario in which we issue file Create operations simultaneously with Read, Write, and Delete operations. We want a good mix of operations, but not a randomized one, so that the results can be used as benchmarks.
  • Unified testing: The framework should be stand-alone and independent, i.e., it should be a one-stop solution to set up the tests, run them, and monitor the results.

We developed a framework which achieves all of the above goals. We used Google's generic File API for writing the framework, since every filesystem can then be tested just by changing the file namespace in which the test data is generated (e.g., /gfs vs. /bigtable). Following a standard Google pattern, we built a Driver + Worker system. The Driver coordinates the overall test: it reads configuration files to set up the test, automatically launches the appropriate number of workers for the load, monitors the health of the workers, collects performance data from each worker, and calculates the overall performance. The Worker class is the one that loads the filesystem with the appropriate operations. Worker is an abstract class, and a new child class can be created for each file operation, which gives us the flexibility to add any operation we want in the future. Separate Worker instances are launched on different machines depending on the load we want to generate, and it is simple to run more or fewer workers on remote machines just by changing the config file.
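
A stripped-down sketch of this structure is shown below; the class and method names are illustrative, and a local file create stands in for the real File API call.

import abc
import os

class Worker(abc.ABC):
    # One Worker instance runs a single operation `count` times against the
    # filesystem selected by the namespace prefix (e.g. a /gfs vs. /bigtable path).
    def __init__(self, namespace, file_prefix, count, shard_index):
        self.namespace = namespace
        self.file_prefix = file_prefix
        self.count = count
        self.shard_index = shard_index

    @abc.abstractmethod
    def run_one(self, path):
        """Perform a single operation against `path` and record its timing."""

    def run(self):
        for j in range(1, self.count + 1):
            path = os.path.join(self.namespace,
                                "worker.%d" % self.shard_index,
                                "%s.%d" % (self.file_prefix, j))
            self.run_one(path)

class CreateWorker(Worker):
    # A new child class per operation; this one only creates files.
    def run_one(self, path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

class Driver:
    # The real driver launches workers on remote machines and aggregates their
    # timings; this sketch just runs them locally, one shard after another.
    def __init__(self, worker_cls, shards, **worker_kwargs):
        self.workers = [worker_cls(shard_index=i, **worker_kwargs)
                        for i in range(1, shards + 1)]

    def run(self):
        for worker in self.workers:
            worker.run()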

The test is divided into various phases. In a phase, we run a single operation N times (with a given concurrency) and collect performance data. So we can run a create phase followed by a write phase followed by a read phase. We can also have multiple subphases inside a phase, which gives us the ability to generate many different simultaneous operations on the system. For example, in a phase we might add three subphases (create, write, and delete), which will issue all of these kinds of operations simultaneously from remote client machines against the distributed filesystem.

It is instructive to look at an example config file to see how load against the filesystem is specified:

# Example of create
phase {
  label: "create"
  shards: 200
  create {
    file_prefix: "metadata_perf"
  }
  count: 100
}

So in the example above, we launch 200 shards (which all run on different client machines) that all create files with a prefix of metadata_perf and suffixes based on the index of the worker shard. In practice, the user of the performance test passes a flag into the performance test binary that specifies a base path to use, e.g., /gfs/cell1/perftest_path, and the resulting files will be /gfs/cell1/perftest_path/worker.i/metadata_perf.j, for i = 1 to #shards and j = 1 to count.
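
A few lines of code spell out the resulting file layout (the base path here is just the hypothetical example from above):

def created_paths(base="/gfs/cell1/perftest_path",
                  prefix="metadata_perf", shards=200, count=100):
    # Enumerates every file the create phase above produces: one subdirectory
    # per worker shard, `count` files per shard.
    for i in range(1, shards + 1):
        for j in range(1, count + 1):
            yield "%s/worker.%d/%s.%d" % (base, i, prefix, j)

# e.g. next(created_paths()) == "/gfs/cell1/perftest_path/worker.1/metadata_perf.1"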

# Example of using subphases
phase {
  label: "subphase_test"
  subphase {
    shards: 200
    label: "stat_subphase"
    stat {
      file_prefix: "metadata_perf"
    }
    count: 100
  }
  subphase {
    shards: 200
    label: "open_subphase"
    open {
      file_prefix: "metadata_perf"
    }
    count: 100
  }
}

In the example above, we simultaneously do stats and opens of the files that were initially created in the create phase. Different workers execute these, and then report their results to the driver.

On conclusion of the test, the driver prints a performance results report that details the aggregate results from all of the clients: MB/s for data-intensive operations, ops/s for metadata-intensive operations, and measures of latency central tendency and dispersion.
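
The aggregation itself is straightforward; the sketch below shows the general shape, with made-up field names and the median and standard deviation chosen as one possible pair of central-tendency and dispersion measures.

import statistics

def aggregate(worker_results):
    # worker_results: one dict per worker with 'bytes', 'ops', 'seconds',
    # and a list of per-operation 'latencies_ms'.
    total_bytes = sum(r["bytes"] for r in worker_results)
    total_ops = sum(r["ops"] for r in worker_results)
    wall_time = max(r["seconds"] for r in worker_results)
    latencies = [l for r in worker_results for l in r["latencies_ms"]]
    return {
        "mb_per_s": total_bytes / (1024.0 * 1024.0) / wall_time,
        "ops_per_s": total_ops / wall_time,
        "latency_median_ms": statistics.median(latencies),
        "latency_stdev_ms": statistics.pstdev(latencies),
    }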

In conclusion, Google's generic File API, the use of a Driver and Workers, and the concept of phases have been very useful in developing the performance testing framework, and hence in making performance testing easier. Almost as important, the fact that this is a simple, script-driven way of testing complex distributed filesystems has given both developers and testers the ability to quickly experiment and iterate, resulting in faster code development and better performance overall.

Posted by Marc Kaplan, Test Engineering Lead

At Google, we have infrastructure that is shared between many projects. This creates a situation where we have many dependencies in terms of build requirements, but also in terms of test requirements. We've found that we actually need two approaches to deal with these requirements, depending on whether we are looking to run larger system tests or smaller unittests, both of which ultimately need to be executed to improve quality.

For unittests, we are typically interested only in the module or function under test at the time, and we don't care as much about downstream dependencies, except insofar as they relate to the module under test. So we typically write test mocks that simulate the behaviors and failure modes of the downstream components we aren't interested in actually running. Of course, this can only be done after understanding how the downstream module works and how it interfaces with our module.

As an example of mocking out a downstream component in Bigtable, we want to simulate the failure of Chubby, our external lock service, so we write a Chubby test mock that simulates the various ways that Chubby can interact with Bigtable. We then use this for the Bigtable unittests so that they a) run faster, b) have fewer external dependencies, and c) let us simulate various failure and retry conditions in Bigtable's Chubby-related code.
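
A hand-written mock in this spirit might look like the sketch below; the lock service interface and failure hooks are hypothetical stand-ins, not Chubby's real API.

class LockServiceError(Exception):
    pass

class MockLockService:
    # Simulates a lock service entirely in memory, with hooks that let a test
    # force the failure and session-expiration paths in the code under test.
    def __init__(self):
        self.locks = {}
        self.fail_next_acquire = False

    def acquire(self, name, owner):
        if self.fail_next_acquire:
            self.fail_next_acquire = False
            raise LockServiceError("injected: lock service unavailable")
        holder = self.locks.setdefault(name, owner)
        return holder == owner

    def release(self, name, owner):
        if self.locks.get(name) == owner:
            del self.locks[name]

    def expire_all_sessions(self):
        # Simulates losing every lock at once, so retry/re-acquire logic runs.
        self.locks.clear()

A unittest can then set fail_next_acquire or call expire_all_sessions and assert that the code under test retries or re-acquires its locks correctly.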

There are also cases where we want to simulate components that are actually upstream of the component under test. In these cases we write what is called a test driver. This is very similar to a mock, except that instead of being called by our module (downstream), it calls our module (upstream). For example, if a Bigtable component has some MapReduce-specific handling, we might want to write a test driver to simulate those MapReduce-specific interfaces so we don't have to run the full MapReduce framework inside our unittest framework. The benefits are the same as those of using test mocks. In fact, in many cases it may be desirable to use both drivers and mocks, or perhaps multiple of each.
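
A test driver can be as small as a loop that exercises the component through the same interface the upstream system would use; the class and method names below are hypothetical.

class BatchWriteDriver:
    # Stands in for an upstream batch job: it feeds rows to the component
    # under test using the same write/flush call pattern the real job would.
    def __init__(self, table, rows):
        self.table = table   # the component under test
        self.rows = rows

    def run(self, flush_every=100):
        for n, (key, value) in enumerate(self.rows, start=1):
            self.table.write(key, value)
            if n % flush_every == 0:
                self.table.flush()
        self.table.flush()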

In system tests, where we're more interested in true system behaviors and timings, or in other cases where we can't write drivers or mocks, we might turn to fault injection. Typically this involves either completely failing certain components sporadically in system tests, or injecting particular faults via a fault injection layer that we write. Looking back to Bigtable again: since Bigtable uses GFS, when we run system tests we inject faults into GFS by sporadically failing actual masters and chunkservers and seeing how Bigtable reacts under load, to verify that when we deploy new versions of Bigtable they will keep working given the frequent rate of hardware failures. Another approach we're currently working on is simulating GFS behavior via a fault injection library, so we can reduce the need for private GFS cells, which will result in better use of resources.
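
As a rough illustration, a fault injection layer can be a thin wrapper that fails a configurable fraction of calls before delegating to the real client; the filesystem interface below is a hypothetical stand-in.

import random

class FaultInjectingFS:
    def __init__(self, real_fs, failure_rate=0.01, seed=None):
        self.real_fs = real_fs
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)   # seeded so failures are reproducible

    def _maybe_fail(self, op):
        if self.rng.random() < self.failure_rate:
            raise IOError("injected fault during %s" % op)

    def read(self, path, offset, length):
        self._maybe_fail("read")
        return self.real_fs.read(path, offset, length)

    def write(self, path, data):
        self._maybe_fail("write")
        return self.real_fs.write(path, data)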

Overall, the use of Test Drivers, Test Mocks, and Fault Injection allows developers and test engineers at Google to test components more accurately and quickly, and above all helps improve quality.