
TESTING THE LIMITS OF A TRANSACTIONAL NETWORKED SERVICE

BY BOWEI DU

INTRODUCTION
One of the defining characteristics of a cloud service is scale, and with scale comes the question of performance and cost.
How efficient are the software systems that we run? How many computing resources are required to meet our current
demands, and how much more will be required in the future?
At Instart Logic, we have created a system called Lava that enables us to measure and test the scalability limits of our
systems. Lava focuses on transactional networked services: systems that serve independent requests sent over a
network from a large number of clients. Examples include HTTP frontends, data caches, and API endpoints.
Performance measurement is a deep topic with many facets. Lava seeks to solve a specific slice of the performance
measurement problem: how can we quickly find the maximum load a service can handle? While today there are many open
source tools for stress testing, we found most of them to be too inflexible and slow to use for this purpose. This poses a
problem as we have a large space of experimental parameters to explore during system stress tests.
Lava decomposes this problem into two pieces:
a set of extensible, protocol-specific agents that generate a controllable amount of load on the system under test, and
a control function that uses feedback from metrics generated during the stress test to find the system's limits.
While the ideas used in the Lava system are not novel, we feel that the particular combination of features used will be
interesting to a broader audience.

BACKGROUND
The most important metrics for our Lava use cases are throughput and latency. Throughput is the rate at which requests can
be processed, and latency is the time from the start of a request to the reception of its response.

Figure 1 is a graph of the typical response time behavior with respect to increasing request volume. Service response time
is stable under increasing request rate until we reach a saturation point at which the service cannot keep up with the
request ingress rate. Beyond the saturation point, internal queues overflow and service response times degrade past
acceptable thresholds.
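To make the shape of this curve concrete, a single-server M/M/1 queueing model (a simplification introduced here purely for illustration, not part of the original analysis) gives a mean time in system of 1 / (mu - lambda), which grows without bound as the arrival rate lambda approaches the service rate mu:

#include <cstdio>

// Illustrative only: mean time in system for an M/M/1 queue, W = 1 / (mu - lambda).
// Real services are more complex, but the hockey-stick shape near saturation is similar.
int main() {
  const double mu = 1000.0;  // hypothetical service rate, requests per second
  const int utilizations[] = {50, 80, 90, 95, 99};  // percent of capacity
  for (int pct : utilizations) {
    double lambda = mu * pct / 100.0;            // offered load, requests per second
    double latency_ms = 1000.0 / (mu - lambda);  // mean time in system, milliseconds
    std::printf("utilization %d%% -> mean latency %.1f ms\n", pct, latency_ms);
  }
  return 0;
}

With these hypothetical numbers, latency climbs gently up to moderate utilization and then explodes as the load nears capacity, which is exactly the saturation behavior we want to locate.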
It is important to know what the saturation point is for each of our services. In development, we use the results obtained
from Lava to find performance regressions and guide our performance improvement efforts. In production, we use these
results for capacity planning, as services need to be provisioned with enough headroom to absorb service failures and
request spikes.
There are many existing performance frameworks for network protocols such as HTTP, the main ones being Tsung,
Apache Bench, Siege and JMeter. We encountered the following issues with these frameworks:
First, many of the available frameworks run a fixed workload with no feedback mechanism for load control. Our
stress runs can be sensitive around the saturation point, where slightly too much load causes high variance in the
output, leading to unstable results.
Second, the lack of feedback means that finding the saturation point requires many runs of the stress tool, probing at
different load levels. Even with a guided binary search, this proved too slow to be viable for exploring large sets
of experimental parameters.
Finally, while this is not a fundamental limitation, we found that the mechanisms Lava needed were simple enough that
implementing them in our own framework did not incur undue engineering cost.

DESIGN

The Lava system (Figure 2) consists of two main components:


A set of agents running on worker threads, each implemented as a state machine that generates application-specific
load. For example, in a stress test of an HTTP frontend, each agent executes a sequence of HTTP request/response
interactions. For saturation point measurement, each agent generates a constant number of requests per second, which
makes the load easy to control.
A control function component that receives real-time metrics aggregated from the agents and adjusts the
parameters of the stress run. The control function manages the number of active agents and the overall state of the
Lava run.
Each Lava run consists of three phases: ramp-up, search and measurement. During the ramp-up phase, the Lava control
function steadily increases the number of active agents until a metric threshold has been exceeded. The ramp-up phase is
not strictly necessary; however, we have found it useful to distinguish as a separate phase for debugging purposes. Lava then
transitions to the search phase, in which the number of agents is varied up and down around the saturation point to find the
maximum load that still meets the threshold. When the search phase has stabilized, Lava transitions to the measurement
phase, in which the number of agents is held constant for a configurable time period. During the measurement phase, all
metrics should be stable; high variance is an indication that something is wrong with either the system under test or the
test setup itself. Figure 3 shows the agent count and metric graphs for each of the phases.
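A minimal sketch of this phase logic, assuming hypothetical helper functions and step sizes (the real Lava control loop is not shown in this article), might look like the following:

enum class Phase { RAMP_UP, SEARCH, MEASURE };

// Hypothetical helpers and step sizes; placeholders, not the real Lava code.
bool metric_over_threshold();
bool search_has_stabilized();
constexpr int kRampStep = 10;
constexpr int kSearchStep = 1;

// One control tick: advance the phase and adjust the active agent count.
void tick(Phase& phase, int& active_agents) {
  switch (phase) {
    case Phase::RAMP_UP:
      // Steadily add agents until the metric threshold is first exceeded.
      if (metric_over_threshold()) {
        phase = Phase::SEARCH;
      } else {
        active_agents += kRampStep;
      }
      break;
    case Phase::SEARCH:
      // Vary the agent count up and down around the saturation point.
      active_agents += metric_over_threshold() ? -kSearchStep : kSearchStep;
      if (search_has_stabilized()) {
        phase = Phase::MEASURE;
      }
      break;
    case Phase::MEASURE:
      // Hold the agent count constant for the configured measurement window.
      break;
  }
}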

Each agent in Lava simulates a constant-rate workload from a client. By increasing or decreasing the number of active
agents, Lava can adjust the amount of load placed on the system under test. Each agent has (modulo code transformations to
facilitate non-blocking I/O) the following inner loop:
void Agent::run() {
  while (true) {
    // Create and execute the next protocol-specific operation.
    Operation* op = create_next_operation();
    op->run();
    // Pace requests so the agent generates a constant rate per second.
    sleep(1 / rate);
  }
}

Agents can be implemented as extensions in C++ or via the Lua scripting language. In addition to the system limit
exploration, we have also implemented agents that replay request traces taken from production.
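To illustrate the C++ extension point, a hedged sketch of an agent that issues HTTP GET requests might look like the following. The Operation and Agent base classes and the http_get() helper are simplified stand-ins for interfaces the article only implies, not the actual Lava API:

#include <string>
#include <utility>

// Simplified stand-ins for the interfaces implied by the inner loop above.
class Operation {
 public:
  virtual ~Operation() = default;
  virtual void run() = 0;
};

class Agent {
 public:
  virtual ~Agent() = default;
  virtual Operation* create_next_operation() = 0;
};

// Hypothetical HTTP helper; not a real Lava API.
void http_get(const std::string& url);

// A protocol-specific extension: each loop iteration performs one HTTP GET
// against the service under test.
class HttpGetOperation : public Operation {
 public:
  explicit HttpGetOperation(std::string url) : url_(std::move(url)) {}
  void run() override { http_get(url_); }

 private:
  std::string url_;
};

class HttpGetAgent : public Agent {
 public:
  Operation* create_next_operation() override {
    return new HttpGetOperation("http://service-under-test/");
  }
};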

METRICS AND CONTROL FUNCTIONS


We track an extensible set of metrics from the active agents and feed them to a control function that determines how to
adjust the load. Metrics are tracked by each agent and aggregated by the central control function component.

class Controller {
 public:
  enum Signal { STABLE, DECREASE, INCREASE };

  // Examine the latest aggregated metrics and decide whether the load
  // should be increased, decreased, or held steady.
  virtual Signal update(const Metrics* metrics) = 0;
};

For most applications, we have found that a simple linear controller tracking a moving window of 95th/99th percentile
operation latency suffices:
Controller::Signal LinearController::update(const Metrics* metrics) {
  // Compare the windowed 95th percentile latency against the configured limit,
  // with a small tolerance band (epsilon) to avoid reacting to noise.
  double delta = metrics->p95_latency() - limit;
  if (delta > epsilon) { return DECREASE; }   // threshold exceeded: shed load
  if (delta < -epsilon) { return INCREASE; }  // headroom left: add load
  return STABLE;
}
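Putting the pieces together, the controller's signal might be consumed roughly as follows each time a window of aggregated metrics arrives. The AgentPool interface and the one-agent step size are hypothetical details for illustration, not the actual Lava wiring:

// Hypothetical pool that can start and stop agents.
class AgentPool {
 public:
  void activate(int n);    // start n additional agents
  void deactivate(int n);  // stop n active agents
};

// Translate the controller's signal into a change in the active agent count.
void on_metrics_window(Controller& controller, const Metrics* metrics,
                       AgentPool& agents) {
  switch (controller.update(metrics)) {
    case Controller::INCREASE:
      agents.activate(1);    // headroom left: add load
      break;
    case Controller::DECREASE:
      agents.deactivate(1);  // threshold exceeded: shed load
      break;
    case Controller::STABLE:
      break;                 // hold the current load level
  }
}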

More sophisticated control functions with faster convergence are possible, but we have not yet explored them.

EXAMPLE

Figure 4 shows a sample result from a Lava run testing an HTTP-based service. In this run, we used the linear control
function with a 95th percentile latency threshold of 2 milliseconds. The top graph shows the throughput we are getting
from the system. The middle graph shows the sliding-window metrics we are measuring; note that the metrics can vary due
to inherent system variability and randomness. The bottom graph shows the number of agents that are active throughout the
run. The agent graph shows Lava transitioning through the ramp-up, search and measurement phases.

CONCLUSION
Lava is currently being used to stress test all major systems at Instart Logic, replacing all third-party stress test frameworks.
Adopting Lava has reduced the time required for a single stress test experiment by an order of magnitude. For
example, our HTTP-based stress tests using Tsung and a binary search took around twenty minutes to converge; a similar run
using Lava can converge in under five minutes.
We are in the process of open-sourcing our Lava software as we feel the feedback-control-based stress test framework is
widely applicable and useful.
To read additional technical content from the Instart Logic engineering team, visit our technology blog.
