Continuous Reliability: The One 
Thing Your CI/CD Workflow is 
Missing 
 
 
Successful teams know that CI/CD isn’t enough. With things 
breaking faster than ever, many are adding Continuous 
Reliability to their workflows. 

Most engineering teams have adopted an agile development practice and are 
pushing for shorter and faster release cycles. The difficulties associated with 
more frequent deployments to production, not to mention ever-growing code 
bases, led to the rise of Continuous Integration and Continuous 
Delivery/Deployment (CI/CD) tools and workflows. 

CI/CD tools add a lot of automation to the build-test-deploy process, but they 
don’t address one of the biggest problems teams face when building and 
deploying new code: unpredictable errors and exceptions.

How do you gain a reliable measure of the overall quality of your code? Have 
you tested everything? How can you ensure that what you are about to release 
is “safe”? 

This paper explores answers to these questions and outlines how you can 
adopt the concept of Continuous Reliability into your workflows. 

Page 1 | ​Continuous Reliability: The One Thing Your CI/CD Workflow is Missing 
By OverOps, Inc. 2018   

 
4 Causes for Failing CI/CD Workflows 

Over the years, we’ve spoken with hundreds of development and operations 
teams and heard the same thing repeated again and again. We all need to move 
quickly, but this sometimes has an adverse effect on reliability and quality of the 
code that makes it to production. CI/CD is great, but we still need deeper test 
coverage. Before we can understand what’s really broken about the current 
CI/CD release cycle (and how to fix it), we need to understand some of the 
challenges that teams face after new code is deployed: 

1. It’s Impossible to Test Everything 

Even with the most thorough testing, staging and QA process, errors slip 
through the cracks. After hours of testing, uncaught and unexpected exceptions 
are still bound to get through to production. It is simply impossible to 
conceptualize and implement a 100% comprehensive set of tests for every 
condition/function. On top of that, CI/CD speeds up every part of the release 
cycle, and contributes to more errors passing through unseen into production in 
a shorter amount of time. 

Bottom line:​ The tests we write are unable to catch unexpected failure 
scenarios. 

2. Limited Insight into the Overall Quality of an Application 

The concepts of CI and CD are predicated not only on the ability to automate 
the promotion of code across build, test and deploy, but also on the ability to 
automate and measure the functional quality of our code. While no suite of 
tests is complete (as mentioned above), there also seems to be very limited 
ability to gauge the overall quality of our applications and services as they flow 
from dev to production. We can measure failed tests or count errors, but we 
lack the ability to understand the nature of these failures in aggregate across a 
codebase. How do we know how many critical issues we have? How many new 
errors have been introduced? How many issues have resurfaced? Having this 
level of detail could help answer the critical question of whether or not it is safe to 
promote. 

Bottom line: ​The decision whether or not to promote code to the next step in 
the release cycle is ambiguous at best, and random at worst. 
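
One way to make those questions concrete is to compare the errors seen in a candidate build against the errors already known from earlier releases. The sketch below is illustrative only: `ErrorEvent`, its `fingerprint` field and `summarize_errors` are hypothetical names, and it assumes every error can be reduced to a stable identifier (for example, exception type plus code location).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One error observed while a build runs (hypothetical shape)."""
    fingerprint: str        # assumed stable ID, e.g. exception type + code location
    critical: bool = False  # assumed severity flag set during triage


def summarize_errors(current, known_open, previously_resolved):
    """Group the errors seen in a build into new, resurfaced and critical.

    current: iterable of ErrorEvent seen in the candidate build
    known_open: set of fingerprints already known and still unresolved
    previously_resolved: set of fingerprints that were marked as fixed
    """
    events = list(current)
    seen = {event.fingerprint for event in events}
    return {
        "new": sorted(seen - known_open - previously_resolved),
        "resurfaced": sorted(seen & previously_resolved),
        "critical": sorted({e.fingerprint for e in events if e.critical}),
    }


if __name__ == "__main__":
    build_errors = [
        ErrorEvent("NullPointer@Cart.add", critical=True),
        ErrorEvent("Timeout@Pay.charge"),
        ErrorEvent("Parse@Feed.load"),
    ]
    summary = summarize_errors(
        build_errors,
        known_open={"Parse@Feed.load"},
        previously_resolved={"Timeout@Pay.charge"},
    )
    print(summary)
```

With a summary like this, the promotion decision can rest on which errors are new, resurfaced or critical rather than on a raw error count.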

3. Issue Resolution Still Takes Forever and a Day 

With unknown and unexpected errors getting into production at a faster rate 
than before, immediate identification of issues and quick troubleshooting are 
more crucial than ever. Unfortunately, the current practice for finding and 
handling errors in production is inherently flawed. Customers are the first to 
find errors in the applications, and engineers end up spending an average of 
20-40% of their time digging through log files and bouncing between monitoring 
tools trying to understand what went wrong. 

Bottom line:​ When things fail, we don’t always know. Even when we do know, 
we don’t have the full context, and have to spend a considerable amount of time 
on troubleshooting vs. building new features. 

4. Automated Application Failure 

A chain is only as strong as its weakest link, and the same holds true for the 
software development lifecycle. If an error slips through and breaks the build, or 
goes unnoticed all the way up to production, the CI/CD workflow helps to 
automate application failure.  

With CI/CD, the quality of your testing determines the quality of your releases. 
Because we can’t write a fully comprehensive suite of tests, we need another 
way to determine the quality of our code to indicate if it’s safe to promote to the 
next environment. Otherwise, we risk putting an increased number of errors into 
the hands of our users. 

Bottom line:​ Automated build-test-deploy often turns into 
build-test-deploy-break. 

All of this raises the question: if so many teams are experiencing the same 
types of failures when code hits production, is there something fundamentally 
broken in the way we implement CI/CD and automation? 

What’s Really Breaking the CI/CD Workflow? 


When people think of CI/CD, they think of it as the same cycle of build, test, 
deploy, repeat, but with everything automated. And for the most part, they’re 
right (depending on the tooling being used). It turns out, though, that the weak 
link in the CI/CD toolchain isn’t the tooling at all. It’s us. 

Each step, from writing the code up to deployment, though automated, is still at 
risk for human error. We test our code the best we can but we know that it won’t 
be perfect the first or second time (or ever, let’s face it). Automating our testing 
frameworks is great, and it helps us deploy faster, but it doesn’t mean that the 
tests we wrote will catch all the critical errors. We simply cannot think of all the 
corner cases and all the permutations of data that may be involved. 

The same unexpected errors and exceptions that eluded us before will continue 
to elude us, only at a faster rate. 

As the error rates for our applications increase, it becomes clear that automated 
deployments require a smarter (and perhaps automated) feedback loop for 
issue tracking. It’s not enough to rely on log files or user reports for information 
about application errors. You need to know the moment a release introduces 
new errors, already have all the information required to fix them, and then use 
this information to inform subsequent testing. 
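
As a rough sketch of what "informing subsequent testing" could look like, the example below records the input that triggered a production failure in a small JSON file and replays every recorded input on later builds. The file layout and the names `record_failure` and `replay_failures` are assumptions made for illustration, not part of any particular tool.

```python
import json
from pathlib import Path


def _load_cases(corpus_path):
    """Read the recorded cases, or return an empty list if none exist yet."""
    path = Path(corpus_path)
    return json.loads(path.read_text()) if path.exists() else []


def record_failure(corpus_path, function_name, failing_input):
    """Add a production failure to the regression corpus, skipping duplicates."""
    cases = _load_cases(corpus_path)
    case = {"function": function_name, "input": failing_input}
    if case not in cases:
        cases.append(case)
        Path(corpus_path).write_text(json.dumps(cases, indent=2))
    return cases


def replay_failures(corpus_path, functions):
    """Re-run every recorded input and return the cases that still raise.

    functions: dict mapping a recorded function name to the callable under test
    """
    still_failing = []
    for case in _load_cases(corpus_path):
        try:
            functions[case["function"]](case["input"])
        except Exception as exc:
            still_failing.append((case, type(exc).__name__))
    return still_failing
```

A test stage could call `replay_failures` on every build and fail when the returned list is not empty, so a condition that broke production once is checked on every release after it.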

How Comcast Moves Fast WITHOUT Breaking Things 
Engineering teams who overcome CI/CD obstacles are doing so by building a 
strategy that also incorporates the practice of Continuous Reliability.  

Comcast’s engineering team working on their flagship X1 XFINITY platform is a 
prime example of this. The team deploys a new version of their application on a 
weekly basis to over 23 million set-top boxes. 

At first, their method for identifying issues was semi-automated, but fairly 
inconsistent. The team used predefined queries to search the logs in their log 
management tool for errors and exceptions, but “one person's set of go-to 
queries wasn't the same as the next person”. So, different people would see 
different alerts. Beyond that, with the high volume of alerts they received, it 
wasn’t always clear which errors and exceptions were worth investigating. 
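
To illustrate the general problem (this is not Comcast's tooling), the sketch below applies one shared normalization rule to every log line so that everyone sees the same grouping, then ranks the groups by volume. The function names and the choice to mask numbers and hex values are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: numbers and hex values are the parts that vary between
# otherwise identical messages, so they are masked before grouping.
_VARIABLE_PARTS = re.compile(r"0x[0-9a-fA-F]+|\d+")


def normalize(message):
    """Mask variable parts so repeated errors collapse into one group."""
    return _VARIABLE_PARTS.sub("<n>", message).strip()


def rank_alerts(log_lines, top=3):
    """Return the most frequent normalized messages with their counts."""
    counts = Counter(normalize(line) for line in log_lines)
    return counts.most_common(top)


if __name__ == "__main__":
    lines = [
        "Timeout after 3000 ms for user 17",
        "Timeout after 4500 ms for user 942",
        "NullPointerException at Cart.java:88",
        "Timeout after 120 ms for user 5",
    ]
    for message, count in rank_alerts(lines):
        print(count, message)
```

Ranking by volume is only one possible signal; a real setup would likely also weigh severity and whether the error is new.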

Basically, it ended up being a lot of noisy alerts that required manual effort from 
the developers to sort through and handle. Instead of accepting this as a reality of 
a fast-paced workflow, Comcast integrated a fully automated error identification 
and investigation tool into their automated deployment model. 

To learn more about how Comcast practices Continuous Reliability, check out 
our ​recent webinar​. 

Introducing Continuous Reliability 


Organizations that can withstand the growing pains of CI/CD are pushing ahead 
of their competition.  

CI/CD helps high-performing engineering teams realize a more direct 
impact on their company’s bottom line, and the satisfaction of their team 
members tends to increase along with the rate of innovation. 

Still, while CI/CD can help accelerate the delivery of innovation, it can also 
present an increased risk of issues in production code. 

Continuous Reliability (CR) is the practice of evaluating the overall quality of our 
code before it is promoted from environment to environment. This is more 
than a simple count of errors. It means gauging the nature of each error and the 
impact of these issues. Further, CR requires you to 
provide feedback into the test framework so you constantly evolve to test not 
just new functionality, but also the conditions that have caused issues in the 
past. This feedback loop from production back to dev is key to automating the 
overall reliability of your application. 
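
A promotion check that looks at more than a raw count could be sketched as follows. The `Finding` type, the three categories, the weights and the threshold are all illustrative assumptions; real values would have to come from a team's own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One distinct error seen in the current environment (hypothetical shape)."""
    kind: str         # assumed categories: "new", "resurfaced" or "known"
    critical: bool    # assumed severity flag
    occurrences: int  # how often the error fired in this environment


# Illustrative weights: a resurfaced error counts more than a new one,
# and both count more than an error that is already known.
KIND_WEIGHT = {"new": 3, "resurfaced": 5, "known": 1}


def promotion_decision(findings, max_score=20):
    """Return (safe_to_promote, reason) for a list of findings."""
    findings = list(findings)
    blockers = [f for f in findings if f.critical and f.kind != "known"]
    # Cap occurrences at 10 so one very noisy error cannot dominate the score.
    score = sum(KIND_WEIGHT[f.kind] * min(f.occurrences, 10) for f in findings)
    if blockers:
        return False, f"{len(blockers)} critical new or resurfaced error(s)"
    if score > max_score:
        return False, f"weighted score {score} exceeds {max_score}"
    return True, f"weighted score {score} within {max_score}"
```

Here a single critical new error blocks promotion, while a single known, non-critical error that fired four times does not, even though both builds contain exactly one distinct error.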

Along with the adoption of Continuous Integration, Delivery and/or Deployment, 
it’s time that we started talking about Continuous Reliability and how we can 
achieve it. 

Final Thoughts 
The bottom line is that it’s great to be able to reduce time-to-market, push new 
innovative features faster and reduce some of the friction in the release cycle. 
However, speed alone doesn’t solve the problem of errors and exceptions getting 
into production code. 

If CI/CD is all about automating the building, testing and deploying steps of the 
release cycle, Continuous Reliability is all about maintaining the functional 
health of our applications. We don’t want to just throw code at the production 
environment and see what sticks. We want to proactively address errors and 
exceptions before the code is deployed, including the ones that we never 
expected to happen in the first place. 
