Just add servers

The motivation

In a recent blog series, Joakim Recht, senior software developer at Tradeshift, elaborated on how we do UI testing at Tradeshift. The UI tests are part of our integration test suite which, together with our unit tests, runs automatically every time code is changed so we know when regressions are introduced.

As our product grows, the number of tests grows with it. We develop code in a test-driven manner, so every change, be it a new feature or a bug fix, includes a set of tests. At one point it started taking too long to run all of our tests in sequence, even after we scaled up the capacity of our test servers to make the individual tests run faster. When tests are slow, developers don’t get immediate feedback when they commit a change, and more time is wasted switching between tasks and different parts of the code. To solve this problem, we built parallelization into our pipeline, so we only need to add servers to make our test suites run faster and keep overall test time down. Here’s how.

Our scalable build pipeline

Our build pipeline uses Jenkins and Maven to build artifacts and run tests. The pipeline contains a cluster of build slaves that are managed by Puppet and orchestrated by the Jenkins Swarm plugin. When a new slave is initialized, Puppet installs everything necessary to run the tests, and the slave joins the build swarm on the Jenkins master.

We use the Maven Surefire plugin to execute our tests. The plugin accepts a parameter called ‘it.tests’ that controls which tests are run during ‘mvn verify’.
We use two Jenkins jobs to execute our tests: a distribution job and an execution job.
The distribution job distributes all test cases into test sets to be built by the execution job – one set per slave. The objective of the distribution algorithm is to utilize all available build slaves and to minimize the overall run time of the tests. The algorithm is a piece of Groovy code that fetches the test report from the last test run (via the Jenkins API) and distributes tests into sets by their duration using the Longest Processing Time algorithm. New tests are distributed randomly among the sets as they have no known duration.

We define the test sets as buckets:

…scan for ITCase files…

…get last test case durations…

…and sort tests by descending duration into appropriate buckets…
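
For the curious, here is a minimal, self-contained sketch of that bucketing logic, with made-up test names and durations standing in for the report we actually fetch from the Jenkins API (and with new tests simply counted as zero duration, where the real code spreads them randomly):

// Minimal sketch of the Longest Processing Time distribution: biggest tests first,
// each one into the currently lightest bucket. Test names and durations are made up.
def distribute(Map<String, Integer> durations, List<String> allTests, int slaveCount) {
    // one bucket per available slave, tracking the accumulated duration
    def buckets = (1..slaveCount).collect { [tests: [], total: 0] }

    // sort by descending duration; tests without history count as 0 and end up last
    def sorted = allTests.sort(false) { -(durations[it] ?: 0) }

    sorted.each { test ->
        def lightest = buckets.min { it.total }   // always fill the shortest bucket
        lightest.tests << test
        lightest.total += durations[test] ?: 0
    }
    return buckets
}

def buckets = distribute(
        ['CheckoutITCase': 420, 'InvoiceITCase': 300, 'LoginITCase': 60],
        ['CheckoutITCase', 'InvoiceITCase', 'LoginITCase', 'BrandNewITCase'],
        2)
buckets.eachWithIndex { bucket, i -> println "slave ${i + 1}: ${bucket.tests} (~${bucket.total}s)" }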

To run all tests, a build is triggered on the execution job for each test set – one per slave – with each test set passed as the ‘it.tests’ parameter to Maven. This is achieved using the Parameterized Trigger plugin. The number of available slaves is determined using the Jenkins API and injected as an environment parameter using the EnvInject plugin.
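
As a rough illustration of that step, the slave count can be read from the Jenkins REST API along these lines (the Jenkins URL and the handling of the master node are simplified placeholders, not our exact job configuration, and authentication is omitted):

import groovy.json.JsonSlurper

// Sketch: count the online build slaves through the Jenkins REST API.
def jenkinsUrl = System.getenv('JENKINS_URL') ?: 'http://jenkins.example.com'
def json = new JsonSlurper().parseText(
        new URL("${jenkinsUrl}/computer/api/json?tree=computer[displayName,offline]").text)

def slaveCount = json.computer.count { !it.offline && it.displayName != 'master' }
println "Triggering ${slaveCount} execution builds"

// Each triggered execution build then effectively runs something like
//   mvn verify -Dit.tests=CheckoutITCase,InvoiceITCase,...
// with its own bucket of tests passed along as a build parameter.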

The triggered builds are enqueued in Jenkins’ general queue and picked off by slaves as they become available. When all builds have finished, the distribution job gathers the test results and artifacts (screen capture videos, screenshots, HTML pages and console output) into its own workspace and publishes them. It also sends emails to committers and/or requesters about the test results.
We had to aggregate the test results and artifacts in custom Groovy code, as the available plugin functionality was not sufficient. Basically, the execution builds are not understood as downstream builds by Jenkins, and most features that work with linked jobs in Jenkins, such as ‘archive downstream artifacts’, require this understanding to work.
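
The aggregation boils down to walking over the triggered execution builds and copying their reports and artifacts into the distribution job’s workspace. A simplified sketch of the idea (the job name, build numbers and Jenkins URL are illustrative, and authentication is left out):

import groovy.json.JsonSlurper

// Sketch: pull artifacts (reports, videos, screenshots, logs) from each triggered
// execution build into this job's workspace so they can be published in one place.
def jenkinsUrl = 'http://jenkins.example.com'
def executionJob = 'integration-test-execution'
def triggeredBuilds = [101, 102, 103]            // in the real job these come from the trigger step
def workspace = new File(System.getenv('WORKSPACE') ?: '.')

triggeredBuilds.each { buildNumber ->
    def buildUrl = "${jenkinsUrl}/job/${executionJob}/${buildNumber}"
    def build = new JsonSlurper().parseText(
            new URL("${buildUrl}/api/json?tree=artifacts[relativePath]").text)

    build.artifacts.each { artifact ->
        // mirror each artifact under a per-build folder; the publishers pick them up from here
        def target = new File(workspace, "aggregated/${buildNumber}/${artifact.relativePath}")
        target.parentFile.mkdirs()
        target.bytes = new URL("${buildUrl}/artifact/${artifact.relativePath}").bytes
    }
}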

Pros and cons of parallelizing

Since the distribution job queries Jenkins at runtime for the number of available build slaves, we can start and stop slaves almost transparently and scale to accommodate pressure on the pipeline. All slaves pick off builds from the queue, relieving pipeline pressure as soon as they come online. Moreover, they are included in subsequent distribution jobs, allowing for more test sets and thus fewer tests per set, reducing test time further.

We needed to implement a safeguard query against the Jenkins API in the slave stop script that prevents shutdown if the slave is not idle (meaning that jobs are still building on it). In turn, this allows us to automatically stop build slaves when they are not needed, saving money and energy. Currently, we scale between 0 and 18 available slaves on an hourly basis depending on pipeline pressure.
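
Stripped of the surrounding shutdown logic, the safeguard is essentially this check (the Jenkins URL and the way the node name is resolved are placeholders):

import groovy.json.JsonSlurper

// Sketch of the safeguard in the slave stop script: only allow shutdown
// if Jenkins reports the node as idle.
def jenkinsUrl = 'http://jenkins.example.com'
def nodeName = InetAddress.localHost.hostName

def node = new JsonSlurper().parseText(
        new URL("${jenkinsUrl}/computer/${nodeName}/api/json?tree=idle,offline").text)

if (!node.idle) {
    println "Jobs are still building on ${nodeName}; refusing to stop the slave."
    System.exit(1)        // the surrounding stop script aborts on a non-zero exit
}
println "${nodeName} is idle; safe to shut down."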

There is no free lunch, of course, and in the process we discovered that many of our tests depended on other tests having run before them and/or on global state. These tests started failing like wildfire. This forced us to spend a large chunk of time making our tests more independent and self-contained, both with respect to the output of other tests and to shared data such as users, connections and documents. Moreover, since there is less pressure on the individual test slaves, they run the tests faster and as a consequence sometimes run too fast, requiring us to mature our test control code.

All in all, the optimization effort has improved our pipeline speed and, in addition, helped us mature our test code. The future pipeline holds a lot of interesting challenges, such as dynamic swarm scaling based on queue size and utilizing Amazon EC2 Spot instances for build slaves.
The pipeline is part of our continuous deployment framework, which will be described in future blog posts.