You are on page 1of 8

Page 1 of 8

Page 1 of 8
Goto Fail, Heartbleed, and
Unit Testing Culture
Two computer securi ty fl aws were di scovered i n earl y 2014: Appl e s goto fai l
bug and OpenSSL s Heartbl eed bug. Both bugs had the potenti al for wi despread
and severe securi ty fai l ures, the ful l extent of whi ch we may never know. Gi ven
thei r severi ty, i t i s i mportant for the software devel opment professi on to refl ect
on how they coul d have been prevented so we can i mprove our abi l i ty to prevent
these ki nds of bugs i n the future. Thi s arti cl e consi ders the r ol e uni t testi ng coul d
pl ay, showi ng how uni t tests, and more i mportantl y a uni t testi ng cul ture, coul d
have i denti fi ed these parti cul ar bugs. I t goes on to l ook at the costs and benefi ts
of such a cul ture and descri bes how such a cul ture was i nsti l l ed at Googl e.
At the beginning of 2014 the security of the Internet was rocked by two serious
flaws: Apples goto fail bug (CVE-2014-1266) and OpenSSLs Heartbleed
bug (CVE-2014-0160). Both were vulnerabilities in the Secure Sockets Layer
technology upon which the majority of secure communications on the Internet
relies. These bugs are as instructive as they were devastating: They were rooted
in the same programmer optimism, overconfidence, and haste that strike
projects of all sizes and domains.
These bugs arouse my passion because I've seen and lived the benefits of unit
testing, and this strongly-imprinted experience compels me to reflect on how
unit testing approaches could prevent defects as high-impact and high-profile as
these SSL bugs. Unit testing is the process of looking for chunks of code that
make for convenient units to which to apply automated Unit Tests, small
programs designed to verify low-level implementation details and detect coding
errors early. The nature of the defects inspired me to write my own proof-of-
concept unit tests to reproduce the errors and verify their fixes. I wrote these
tests to validate my intuition, and to demonstrate to others how unit tests could
have detected these defects early and without heroic effort.
Writing unit tests produces benefits beyond detecting low-level coding errors.
In this article, I explore the question of whether unit testing could have helped
prevent the "goto fail" and Heartbleed bugs. In doing so, I hope to establish a
compelling case for the adoption of unit testing as part of everyday
development, so that the experience of Self Testing Codebecomes universal.
I offer my insights in the hope that they may help avoid similar failures in the
future, in the spirit of a postmortem or project retrospective. My experience
doesn't mean I'm owed deference based on mah authoritah, but I hope I've
earned an audience with those open-minded enough to have a discussion on the
merits of the argument.
Many popular and technical media stories have run with explanations of how
these defects happened, why they slipped past existing safeguards before being
so widely deployed, and what should be done to prevent such bugs from
happening again. It troubles me that most of these analyses fall back on facile
excuses that miss the mark, and promote resigned acceptance of such defects
Page 2 of 8
Page 2 of 8
due to the ever-increasing complexity of modern software systems. It is as
though the software industry at large, as well as the public that depends on it, is
anxious to accept such failures as inevitable fate, the price we pay for the
modern conveniences that technology affords us. It's the easiest possible
explanation that allows us to make sense of a bad situation and move on as a
society.
I don't accept such defects as inevitable. Rather, we must seize this opportunity
to reflect on how we developers can do far better than rely on fate, or more
funding, or any number of external factors to prevent security vulnerabilities or
other high-impact defects caused by low-level coding errors. Bugs will happen,
but neither software developers nor the public should be satisfied with that as a
response to defects this colossal in scope. Deep, genuine reflection is difficult,
and encounters a lot of resistance, as it calls on developers to accept
responsibility for their human limitationswhich is often a challenge to the
very self-image of a programmer. That makes it all the more important to dive
deeply into these two bugs in particular, to search for genuine solutions and to
avoid setting a dangerous precedent: If everything in the short-term turns out
OK in the wake of "goto fail" and Heartbleed, then why bother changing
anything about how software development is currently done?
goto fail
The goto fail bug first shipped to iPhones, iPads, and AppleTVs in September
2012, appeared in iOS 7.0 and OS X Mavericks, and was not fixed until
February 2014seventeen months after it was introduced. A short-circuit
skipping the final step of the SSL/TLS handshake algorithm left users
vulnerable to a man-in-the-middle attack, whereby a malicious system relaying
traffic between an affected system and another system could present the illusion
of a secure connection using false credentials, and subsequently intercept all
communications between the other two systems.
The bug earned its name from this now infamous code snippet:
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
goto fail;
goto fail;
Some have argued that all goto statements are bad, based on Edsger Dijkstra's
famous essay A Case against the GO TO Statement, summarized by the popular
axiom "Goto considered harmful". However, the goto fail statement embodies
an idiom C programmers will recognize. Such statements pass control
immediately to the end of a function in the case of unrecoverable errors, as it is
at the end of a function where memory or other resources allocated by the
function are properly released. Other languages have built-in support for such
abortion clauses, as Dijkstra called them in the conclusion to his essay:
destructors in C++; try/catch/finally in Java; defer()/panic()/recover() in
Go; try/except/ finally and the with statement in Python. In C, there is no
essential problem with or confusion surrounding the use of goto in this context.
In other words, goto is not considered harmful here.
Page 3 of 8
Page 3 of 8
C programmers will also immediately recognize that the first goto
fail statement is bound to the result of the if statement preceding it, but the
second goto fail is not: The matching indentation of the two statements bears
no significance in C, as surrounding curly braces are required to bind more than
one statement to an if condition. If the first goto fail is not executed, the
second one certainly will be. This means that the final step of the handshake
algorithm will never be executed, but any exchange successfully passing this
point will always produce a successful return value even if the final verification
step would have failed. More plainly: The algorithm gets short-circuited by the
extra goto fail statement.
Some have claimed that a coding style requiring the use of curly braces for all
if statements or enabling unreachable-code compiler warnings could have
helped. However, there are deeper problems with the code which unit testing
could help to resolve.
How Could Unit Testing Have Helped?
While looking for units to which to apply unit tests, the entire block of
code containing the buggy algorithm, with its cluster of conditional logic, leaps
out as such a unit (from the SSLVerifySignedServerKeyExchange() function
in version 55471 of Apples Secure Transport library):
if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
goto fail;
goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
goto fail;
Such a block of code is easier to test when extracted into its own function.
Extracting chunks of code like this is habitual practice for people writing unit
tests, and can help get parts of an existing code base under test a piece at a time.
Looking closely at the variables and data types used in the algorithm make it
clear that this block of code is performing a handshake on the hashes. By
looking up the type of SSLHashSHA1, we can also see that it is an instance of a
HashReference jump table, a structure containing function pointers that
enables C programmers to implement virtual function-like behavior (i.e.
substitutability and run-time polymorphism).
We can extract this operation into a function with a name signifying its intent:
static OSStatus HashHandshake(const HashReference* hashRef,
SSLBuffer* clientRandom,
SSLBuffer* serverRandom,
SSLBuffer* exchangeParams,
SSLBuffer* hashOut) {
SSLBuffer hashCtx;
OSStatus err = 0;
hashCtx.data = 0;
Page 4 of 8
Page 4 of 8
if ((err = ReadyHash(hashRef, &hashCtx)) != 0)
goto fail;
if ((err = hashRef->update(&hashCtx, clientRandom)) != 0)
goto fail;
if ((err = hashRef->update(&hashCtx, serverRandom)) != 0)
goto fail;
if ((err = hashRef->update(&hashCtx, exchangeParams)) != 0)
goto fail;
err = hashRef->final(&hashCtx, hashOut);
fail:
SSLFreeBuffer(&hashCtx);
return err;
}
Now the series of statements comprising the previously buggy algorithm can be
replaced by:
if ((err = HashHandshake(&SSLHashSHA1, &clientRandom, &serverRandom,
&signedParams, &hashOut)) != 0) {
goto fail;
}
This function is more easily understood in isolation. Faced with such a self-
contained function like this, a programmer can begin to focus on the external
effects of the code, considering questions such as:
What is the contract fulfilled by the code under test?
What preconditions are required, and how are they enforced?
What postconditions are guaranteed?
What example inputs trigger different behaviors?
What set of tests will trigger each behavior and validate each guarantee?
In the case of HashHandshake(), the contract can be described as: Five steps, all
must pass. Success or failure is propagated to the caller by the return value.
The HashReference is expected to respond correctly to the series of calls;
whether it makes use of any functions or data beyond that passed in by
HashHandshake() is an implementation detail opaque to HashHandshake() itself.
For an algorithm this straightforward, the test cases will rather closely "mirror"
the implemetation: One success case, five failure cases. For higher-level or
more complex operations, such close "mirroring" can make for brittle tests and
should generally be avoided. This is especially important to keep in mind when
using mocks or other Test Doubles to test code in isolation from its
collaborators.
Regardless of the scope of the code under test, it's critical to exhaustively test
failure cases to the extent possible. It is tempting to test that the code does what
it should do and leave it at that, but it's arguably even more important to test
that it doesn't do what it shouldn't do.
Proof-of-Concept Unit Test
Despite the fact that C is not an object-oriented programming language, the
existing code for this algorithm exhibits a clearly object-oriented design that
actually makes for easy unit testing once the code is extracted into its own
function. The tls_digest_test.c proof-of-concept unit test shows how
Page 5 of 8
Page 5 of 8
a HashReference stub can be used to effectively cover every path through the
extracted HashHandshake() algorithm. The actual test cases look like this:
static int TestHandshakeSuccess() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = SUCCESS;
return ExecuteHandshake(fixture);
}

static int TestHandshakeInitFailure() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = INIT_FAILURE;
fixture.ref.init = HashHandshakeTestFailInit;
return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateClientFailure() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = UPDATE_CLIENT_FAILURE;
fixture.client = FAIL_ON_EVALUATION(UPDATE_CLIENT_FAILURE);
return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateServerFailure() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = UPDATE_SERVER_FAILURE;
fixture.server = FAIL_ON_EVALUATION(UPDATE_SERVER_FAILURE);
return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateParamsFailure() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = UPDATE_PARAMS_FAILURE;
fixture.params = FAIL_ON_EVALUATION(UPDATE_PARAMS_FAILURE);
return ExecuteHandshake(fixture);
}

static int TestHandshakeFinalFailure() {
HashHandshakeTestFixture fixture = SetUp(__func__);
fixture.expected = FINAL_FAILURE;
fixture.ref.final = HashHandshakeTestFailFinal;
return ExecuteHandshake(fixture);
}
The HashHandshakeTestFixture holds all the variables needed as input to the
code under test and to check the expected outcome:
typedef struct
{
HashReference ref;
SSLBuffer *client;
SSLBuffer *server;
SSLBuffer *params;
SSLBuffer *output;
const char *test_case_name;
enum HandshakeResult expected;
} HashHandshakeTestFixture;
SetUp() initializes all of the members of a HashHandshakeTestFixture to default
values; each test case overrides only those members pertinent to that specific
test case:
Page 6 of 8
Page 6 of 8
static HashHandshakeTestFixture SetUp(const char *test_case_name) {
HashHandshakeTestFixture fixture;
memset(&fixture, 0, sizeof(fixture));
fixture.ref = SSLHashNull;
fixture.ref.update = HashHandshakeTestUpdate;
fixture.test_case_name = test_case_name;
return fixture;
}
ExecuteHandshake() executes the HashHandshake() function and evaluates the
outcome, printing an error message and returning an error value if the outcome
differs from what was expected:
/* Executes the handshake and returns zero if the result matches
* expected, one otherwise. */
static int ExecuteHandshake(HashHandshakeTestFixture fixture) {

const enum HandshakeResult actual = HashHandshake(
&fixture.ref, fixture.client, fixture.server,
fixture.params, fixture.output);

if (actual != fixture.expected) {
printf("%s failed: expected %s, received %s\n",
fixture.test_case_name,
HandshakeResultString(fixture.expected),
HandshakeResultString(actual));
return 1;
}
return 0;
}
Adding a duplicate goto fail statement anywhere in the HashHandshake()
algorithm before the final() call will cause the test to fail.
This test was written without a testing framework, to demonstrate that an
effective test can be written using tools already in-use by a project. Even
without referencing a standard framework, the explanation in the preceding
paragraph should prove relatively easy to follow: Well-organized test cases
using well-organized objects and functions with well-chosen names mean that if
a test fails, you can usually diagnose the failure from the information in the test
case alone without digging through the full implementation of the test program.
Testing frameworks can help in writing tests more efficiently, but are not a
prerequisite for writing well-organized, thorough unit tests.
Writing a set of tests to exercise this function is straightforward because now
were thinking about concrete examples rather than conditions. Furthermore,
the tests act as a double-check: Its easy to make a mistake with conditional
logic, accidentally reversing one test in the chain; but when you write tests you
are stating the behavior twice, once with examples, once with logic. You have
to make the same mistake in two different representations for a bug to get
through.
Its likely that the programmer who wrote this algorithm the first time did
execute the program to check for errors in the new code. Most programmers
will run a program with some sample inputs to verify that its doing what they
think it should do. The problem is that these runs are often ephemeral and
Page 7 of 8
Page 7 of 8
thrown away once the code is working; an automated test captures those runs as
a permanent double-check.
That permanent double-check is important here: We dont know exactly how
that rogue second goto fail got into the code; a likely reason is that it was the
result of a large merge operation. When merging a branch into the mainline,
large differences can result. Even if a merge compiles, it can still introduce
errors. Inspecting such merge differences can be time-consuming, tedious, and
error-prone, even for experienced developers. In this case the automated
double-check provided by unit tests provides a fast and painstaking (yet
painless!) code review, in the sense that the tests will likely catch potential
merge errors before a human inspects the merged code. Its unlikely the original
author introduced the "goto fail" bug into the code, but a suite of tests doesnt
just help you find your own mistakes: It helps reveal mistakes made by
programmers far into the future.
In the case of the goto fail bug, the unit testing habit of looking for, and
extracting, testable functions has a second benefit.
Dj Vu All Over Again
A copy of the same algorithm with a different HashReference instance appears
immediately above the buggy algorithm in the same function. In total, the
algorithm appears six different times in the same file (sslKeyExchange.c from
Security-55471):
twice in SSLVerifySignedServerKeyExchange(), which contained the bug
once in SSLVerifySignedServerKeyExchangeTls12()
twice in SSLSignServerKeyExchange()
once in SSLSignServerKeyExchangeTls12()
The updated version of sslKeyExchange.c in Security-55471.14 has eliminated
the duplicated goto fail statement from SSLVerifySignedServerKeyExchange(),
but the duplicated algorithm remains.
Code duplication is a Code Smell that is known to increase the likelihood of
software errors. It is also apparent from the function names above that theres
more duplication besides that of the core handshake algorithm. This cut-and-
paste code reuse also supports the hypothesis that the bug might have been
caused by a large merge operation, as duplicate code increases the available
code surface during merges and compounds the potential for undetected
merge errors.
Unit testing introduces pressure to minimize copy/paste, because the
copy/pasted code also has to be unit tested. It could have ensured that only one
copy of this algorithm existed since it would have been easier to test. A unit test
could have easily verified that this algorithm was correct, merge or no, and
could have prevented the goto fail bug from being written in the first place.
Also, the Security-55471 version of ssl_regressions.h, which appears to list a
number of SSL regression tests for this library, remains unchanged in
the Security-55471.14 version of ssl_regressions.h. The only substantial
Page 8 of 8
Page 8 of 8
difference between the two versions of the library is the deletion of the goto
fail statement itself, with no added tests or eliminated duplication:
Cultural Implications
The presence of six separate copies of the same algorithm clearly indicates that
this bug was not due to a one-time programmer error: This was a pattern. This
is evidence of a development culture that tolerates duplicated, untested code.
I've never worked at Apple, nor do I know any Apple developers. I don't know
exactly what the company-wide development culture is there, and whether this
code is representative or exceptional. Even if this code is the exception rather
than the norm, it's still unacceptable. It doesn't matter to me, as someone whose
privacy and security might've been violated by this coding error, what the
circumstances were "excusing" this particular error or what the rest of the
culture looks like. I want to see greater accountability for such mistakes. Not
shaming, not condem-nation, etc., but accountability and the due diligence that
follows it. That's the deeper strategy for how we prevent the next "goto fail"
from happening.
I do know that development cultures can change. Bugs such as these give us an
occasion to reflect upon our own development cultures, if unit testing is not
already a vital part, and begin to appreciate why unit testing is such an
important development practice. I'll discuss my experience with changing a
development culture in detail in a later section of this article, and offer advice
for how to effect change in other development cultures, from a single team to an
entire company.
My proof-of-concept unit test for "goto fail" may be easy to dismiss as a one-
off test written with 20/20 hindsight. I would rather it appear as an example of
the kind of accessible unit testing approach that development teams everywhere
can apply to existing code, right now, to avoid similarly embarrassing (and
potentially catastrophic) bugs. A development culture that values unit testing
and whose members work to improve their craft will produce tests that will
most likely catch programming errors exactly like "goto fail" long before they
have a chance to impact any users.

You might also like