Continuous Reliability: The One
Thing Your CI/CD Workflow is
Missing
Successful teams know that CI/CD isn’t enough. With things
breaking faster than ever, many are adding Continuous
Reliability to their workflows.
Most engineering teams have adopted an agile development practice and are
pushing for shorter and faster release cycles. The difficulties associated with
more frequent deployments to production, not to mention ever-growing code
bases, led to the rise of Continuous Integration and Continuous
Delivery/Deployment (CI/CD) tools and workflows.
CI/CD tools add a lot of automation to the build-test-deploy process, but they
don’t address one of the biggest problems teams face when building and
deploying new code: unpredictable errors and exceptions.
How do you gain a reliable measure of the overall quality of your code? Have
you tested everything? How can you ensure that what you are about to release
is “safe”?
This paper explores answers to these questions and outlines how you can
adopt the concept of Continuous Reliability into your workflows.
Page 1 | Continuous Reliability: The One Thing Your CI/CD Workflow is Missing
By OverOps, Inc. 2018
4 Causes for Failing CI/CD Workflows
Over the years, we’ve spoken with hundreds of development and operations
teams and heard the same thing repeated again and again. We all need to move
quickly, but this sometimes has an adverse effect on the reliability and quality of the
code that makes it to production. CI/CD is great, but we still need deeper test
coverage. Before we can understand what’s really broken about the current
CI/CD release cycle (and how to fix it), we need to understand some of the
challenges that teams face after new code is deployed:
Even with the most thorough testing, staging and QA process, errors slip
through the cracks. After hours of testing, uncaught and unexpected exceptions
are still bound to get through to production. It is simply impossible to
conceptualize and implement a 100% comprehensive set of tests for every
condition/function. On top of that, CI/CD speeds up every part of the release
cycle, and contributes to more errors passing through unseen into production in
a shorter amount of time.
Bottom line: The tests we write are unable to catch unexpected failure
scenarios.
The concepts of CI and CD are predicated on the ability to not only automate
the promotion of code across build, test and deploy, but also on the ability to
automate and measure the functional quality of our code. While no suite of
tests is complete (as mentioned above), there also seems to be very limited
ability to gauge the overall quality of our applications and services as they flow
from dev to production. We can measure failed tests or count errors, but we
lack the ability to understand the nature of these failures in aggregate across a
codebase. How do we know how many critical issues we have? How many new
errors have been introduced? How many issues have resurfaced? Having this
level of detail could help answer the critical question of whether or not it is safe to
promote.
Bottom line: The decision whether or not to promote code to the next step in
the release cycle is ambiguous at best, and random at worst.
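To make the questions above concrete, here is a minimal sketch of a promotion check that counts new, resurfaced and critical errors for a build. It is an illustration only: the `ErrorEvent` shape, the fingerprint strings and the blocking policy in `safe_to_promote` are all assumptions made for this example, not part of any particular CI/CD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One error observed while a build ran in an environment (hypothetical shape)."""
    fingerprint: str   # stable ID for the error, e.g. exception type + code location
    critical: bool     # whether the team treats this error as release-blocking


def summarize(current, known, resolved):
    """Classify the errors seen in the build under evaluation.

    current  -- ErrorEvent objects observed in this build
    known    -- fingerprints seen in earlier builds
    resolved -- fingerprints previously marked as fixed
    """
    seen = {e.fingerprint: e for e in current}  # de-duplicate repeated events
    new = {fp for fp in seen if fp not in known}
    resurfaced = {fp for fp in seen if fp in resolved}
    critical = {fp for fp, e in seen.items() if e.critical}
    return {"new": new, "resurfaced": resurfaced, "critical": critical}


def safe_to_promote(summary, max_new=0):
    """Assumed policy: block on any critical or resurfaced error, or too many new ones."""
    return (
        not summary["critical"]
        and not summary["resurfaced"]
        and len(summary["new"]) <= max_new
    )


if __name__ == "__main__":
    build = [
        ErrorEvent("NullPointerException@CartService.total", critical=True),
        ErrorEvent("TimeoutError@PaymentClient.charge", critical=False),
        ErrorEvent("TimeoutError@PaymentClient.charge", critical=False),
    ]
    known = {"TimeoutError@PaymentClient.charge"}
    resolved = {"TimeoutError@PaymentClient.charge"}
    summary = summarize(build, known, resolved)
    print(summary)
    print("promote:", safe_to_promote(summary))
```

The point of the sketch is that the promotion decision becomes an explicit, repeatable rule rather than a judgment call. The thresholds themselves would need to be set by each team.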
With unknown and unexpected errors getting into production at a faster rate
than before, immediate identification of issues and quick troubleshooting is
more crucial than ever. Unfortunately, the current practice for finding and
handling errors in production is inherently flawed. Customers are the first to
reveal errors in the applications, and engineers end up spending an average of
20-40% of their time digging through log files and bouncing between monitoring
tools trying to understand what went wrong.
Bottom line: When things fail, we don’t always know. Even when we do know,
we don’t have the full context, and have to spend a considerable amount of time
on troubleshooting vs. building new features.
A chain is only as strong as its weakest link, and the same holds true for the
software development lifecycle. If an error slips through and breaks the build, or
goes unnoticed all the way up to production, the CI/CD workflow helps to
automate application failure.
With CI/CD, the quality of your testing determines the quality of your releases.
Because we can’t write a fully comprehensive suite of tests, we need another
way to determine the quality of our code to indicate if it’s safe to promote to the
next environment. Otherwise, we risk pushing an increased number of errors into
the hands of our users.
Bottom line: Automated build-test-deploy often turns into
build-test-deploy-break.
All of this raises the question: If so many teams are experiencing the same
types of failures when code hits production, is there something fundamentally
broken in the way we implement CI/CD and automation?
Each step, from writing the code up to deployment, though automated, is still at
risk for human error. We test our code the best we can but we know that it won’t
be perfect the first or second time (or ever, let’s face it). Automating our testing
frameworks is great, and it helps us deploy faster, but it doesn’t mean that the
tests we wrote will catch all the critical errors. We simply cannot think of all the
corner cases and all the permutations of data that may be involved.
The same unexpected errors and exceptions that eluded us before will continue
to elude us, only at a faster rate.
As the error rates for our applications increase, it becomes clear that automated
deployments require a smarter (and perhaps automated) feedback loop for
issue tracking. It’s not enough to rely on log files or user reports for information
about application errors. You need to know the moment a release introduces
new errors, already have all the information required to fix them, and then use
this information to inform subsequent testing.
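One way to picture "knowing the moment a release introduces new errors" is to give every exception a stable fingerprint and remember when each fingerprint was first seen. The sketch below is a simplified illustration using only the Python standard library. The fingerprint scheme (exception type plus the name of the raising function), the `ReleaseErrorTracker` class and the release labels are assumptions for this example, not a description of any specific product.

```python
import traceback


def fingerprint(exc):
    """Build a stable ID from the exception type and the function that raised it."""
    frames = traceback.extract_tb(exc.__traceback__)
    origin = frames[-1].name if frames else "<unknown>"
    return f"{type(exc).__name__}@{origin}"


class ReleaseErrorTracker:
    """Remembers the release in which each error fingerprint was first seen."""

    def __init__(self):
        self.first_seen = {}  # fingerprint -> release label

    def record(self, exc, release):
        """Return True the first time a fingerprint is seen, False afterwards."""
        fp = fingerprint(exc)
        if fp in self.first_seen:
            return False
        self.first_seen[fp] = release
        return True


def parse_quantity(raw):
    """Hypothetical application code: raises ValueError on non-numeric input."""
    return int(raw)


if __name__ == "__main__":
    tracker = ReleaseErrorTracker()
    for release, raw in [("v1.4", "three"), ("v1.5", "3"), ("v1.5", "x")]:
        try:
            parse_quantity(raw)
        except ValueError as exc:
            if tracker.record(exc, release):
                print(f"new error in {release}: {fingerprint(exc)}")
```

Grouping by fingerprint rather than by log message also cuts down on duplicate alerts, since repeated occurrences of the same failure collapse into one entry.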
How Comcast Moves Fast WITHOUT Breaking Things
Engineering teams who overcome CI/CD obstacles are doing so by building a
strategy that also incorporates the practice of Continuous Reliability.
At first, Comcast's method for identifying issues was semi-automated, but fairly
inconsistent. The team used predefined queries to search the logs in their log
management tool for errors and exceptions, but “one person's set of go-to
queries wasn't the same as the next person”. So, different people would see
different alerts. Beyond that, with the high volume of alerts they received, it
wasn’t always clear which errors and exceptions were worth investigating.
Basically, it ended up being a lot of noisy alerts that required manual effort from
the developers to sort through and handle. Instead of accepting it as a reality of
a fast-paced workflow, Comcast integrated a fully automated error identification
and investigation tool into their automated deployment model.
To learn more about how Comcast practices Continuous Reliability, check out
our recent webinar.
Still, while CI/CD can help accelerate the delivery of innovation, it can also
present an increased risk for issues in production code.
Continuous Reliability (CR) is the practice of evaluating the overall quality of our
code before it is promoted from environment to environment. And this is more
than a simple count of errors. It is applying advanced principles to gauge the
nature of an error or the impact of these issues. Further, CR requires you to
provide feedback into the test framework so you constantly evolve to test not
just new functionality, but also the conditions that have caused issues in the
past. This feedback loop from production back to dev is key to automating the
overall reliability of your application.
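As a rough illustration of the feedback loop described above, the conditions that caused past failures can be stored as data and replayed as regression tests on every build. This sketch uses Python's built-in `unittest` module. The function under test, `normalize_username`, and the entries in `PAST_FAILURES` are hypothetical examples invented for this illustration.

```python
import unittest


def normalize_username(raw):
    """Hypothetical function under test."""
    if raw is None:
        return ""
    return raw.strip().lower()


# Inputs that (in this made-up scenario) caused failures in production,
# paired with the behavior expected now that the bug is fixed.
PAST_FAILURES = [
    {"input": None, "expected": ""},
    {"input": "  Alice ", "expected": "alice"},
]


class ReplayPastFailures(unittest.TestCase):
    def test_past_failures_stay_fixed(self):
        # Each recorded condition becomes its own sub-test, so one
        # regression does not hide the others.
        for case in PAST_FAILURES:
            with self.subTest(case=case):
                self.assertEqual(
                    normalize_username(case["input"]), case["expected"]
                )


if __name__ == "__main__":
    unittest.main()
```

Adding a new entry to the list each time an issue is found in production is what closes the loop: the test suite grows from real failure conditions, not only from the cases the team thought of in advance.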
Final thoughts
The bottom line is that it’s great to be able to reduce time-to-market, push new
innovative features faster and reduce some of the friction in the release cycle.
HOWEVER, it doesn’t solve the problem of errors and exceptions getting into
production code.
If CI/CD is all about automating the building, testing and deploying steps of the
release cycle, Continuous Reliability is all about maintaining the functional
health of our applications. We don’t want to just throw code at the production
environment and see what sticks. We want to proactively address errors and
exceptions before the code is deployed, including the ones that we never
expected to happen in the first place.