Thriving with DevOps: An Advanced Checklist

August 26, 2019

This blog post covers advanced DevOps practices. To get started with DevOps you may want to read the preceding blog post on Getting to DevOps: A Checklist.

Once your team has adopted DevOps and is familiar with the basic practices, you can level up your skills and become a high-performing DevOps team. In the spirit of my previous article, this post provides a comprehensive checklist for your team to move beyond basic DevOps practices and establish a culture of continuous growth and improvement.

1) Test Automation

You will not deploy often if you don't have confidence that your deployment will succeed and that there are no issues. If you have an exhaustive range of tests that run for each deployment, you can rest assured that a new release should not cause a major disruption for your customers. While it is rarely possible to test every edge case, it's important to focus on the critical parts of your customers' user journeys.

Do not test cases that are just easy to test, but rather test cases that provide real business value.

First, identify the parts of your application that are used the most. There will be pages and components in your application that provide core functionality and are used with high frequency by users. You need to have some monitoring and analytics set up in order to gather this information.

Once you have identified the critical aspects of your application, assess your test automation suite and how those critical paths in your application are covered. All components involved in those critical paths need to have unit tests, integration tests, and definitely end-to-end tests covering typical user workloads. From those end-to-end tests you can select a handful to be run as smoke tests.

Here is a checklist for your test automation:

Do you know the critical, most business-relevant user journeys through your application?
Do you have unit tests, integration tests and end-to-end (E2E) tests set up for your application? By the way, E2E tests are not merely frontend tests; you can also E2E test your API.
Do you have a handful of meaningful smoke tests that run in less than a minute?
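
To make the smoke-test item a bit more tangible, here is a minimal sketch of a runner that executes a handful of checks and fails the suite if any check fails or the whole run exceeds a time budget. The function name, the check names, and the 60-second default are assumptions for illustration; in a real pipeline the checks would call your deployed application.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_smoke_tests(
    checks: Dict[str, Callable[[], bool]],
    budget_seconds: float = 60.0,  # assumed budget, mirrors "less than a minute"
) -> Tuple[bool, List[str]]:
    """Run every check in order and report which ones failed.

    A check fails if it returns a falsy value or raises an exception.
    The time budget is evaluated once after all checks have run.
    """
    failures: List[str] = []
    started = time.monotonic()
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    if time.monotonic() - started > budget_seconds:
        failures.append("budget exceeded")
    return (not failures, failures)


if __name__ == "__main__":
    # Hypothetical checks; replace the lambdas with real HTTP or API calls.
    passed, failed = run_smoke_tests({
        "login page responds": lambda: True,
        "checkout API responds": lambda: True,
    })
    print("smoke tests passed" if passed else f"failed: {failed}")
```

Keeping the checks as plain callables means the same runner works for frontend journeys and for API-level E2E checks.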

2) Pipelines

Pipelines are your core asset to deliver fresh releases in a short cycle. A lot of operational aspects can be covered and implemented in pipelines, such as zero-downtime deployments, executing migrations or even rolling back to a previous version if needed. Even if it takes longer at the beginning to create the pipeline, a fully automated deployment will save you time down the road.

There are many tools available to implement pipelines. You will definitely want a tool that fosters a pipeline-as-code approach so you can store your pipeline configurations under version control. Additionally, the CI/CD system should run your pipeline scripts within containers, so that you can easily provide tools and an environment tailored to the needs of your project. If you need some inspiration for modern cloud CI/CD tools, make sure to check out AWS CodePipeline, GitLab CI, Buildkite, or Concourse (also check out our blog post on deploying Concourse). If you're interested in comparing even more CI/CD tools, have a look at this list of 50 CI/CD tools currently available. But as usual, do some research, make a decision, and then, more importantly, stick to the tool and leverage its potential to the fullest.

Once you have settled on the tool, build pipelines that help the team ship software at high pace with high quality. Nevertheless, strive to keep pipelines simple and understandable. Write pipelines in languages that are familiar to the team, and keep the amount of technical variability in pipelines to a minimum.

Here is a checklist to challenge your current pipeline setup:

How many people know how the pipelines work and can change them? At least two, better three, people should know the pipelines inside out. The more the better.
Do the pipelines implement a zero-downtime deployment strategy, such as a blue-green or canary deployment?
Do you create version numbers, such that each release is uniquely identifiable and can be traced back to a specific revision in your SCM?
Can your pipelines handle rollbacks?
Do certain monitoring conditions after a deployment automatically roll back a deployment?
Can anyone in the team take a fix to production in less than 1 hour?
Do you put your pipelines under version control?
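
Two of the items above, traceable version numbers and automatic rollbacks, lend themselves to a small sketch. The version format below follows the style of Semantic Versioning build metadata, and the metric names and thresholds are purely hypothetical; your pipeline tool and monitoring system will dictate the real inputs.

```python
import re
from dataclasses import dataclass


def release_version(base_version: str, build_number: int, commit_sha: str) -> str:
    """Build a release identifier that points back to one SCM revision.

    Example result: '1.4.0+build.57.sha.3f2a9c1'
    """
    if not re.fullmatch(r"\d+\.\d+\.\d+", base_version):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {base_version!r}")
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError(f"expected a hex commit SHA, got {commit_sha!r}")
    return f"{base_version}+build.{build_number}.sha.{commit_sha[:7]}"


@dataclass
class PostDeployMetrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time in milliseconds


def should_roll_back(
    metrics: PostDeployMetrics,
    max_error_rate: float = 0.02,       # assumed threshold
    max_p95_latency_ms: float = 800.0,  # assumed threshold
) -> bool:
    """Return True if the metrics observed after a deployment breach a threshold."""
    return (
        metrics.error_rate > max_error_rate
        or metrics.p95_latency_ms > max_p95_latency_ms
    )
```

A pipeline step could call `should_roll_back` a few minutes after the deployment and, if it returns True, redeploy the previous `release_version`.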

3) Infrastructure

If your team deploys to production, it needs to take care of the underlying infrastructure. That infrastructure is usually managed by another team or multiple teams to some degree, but some parts of it are usually provisioned and operated by the DevOps team itself.

Regardless of infrastructure specifics, it is usually possible to automate the parts owned by the team, and it mostly makes sense to do so. A lot of tools have evolved to achieve this in recent years, beginning with server-provisioning tools such as Ansible, Chef or Puppet, or infrastructure tools suited for clouds such as Terraform, CloudFormation (AWS-specific) or BOSH. While tools are always great, stick to as few as possible and don't overengineer.

With containers on the rise, it is also important to talk about immutable infrastructure. In a previous era, servers were maintained, configured, and adjusted in place. In modern cloud environments, infrastructure is rebuilt from a known state whenever changes are required. The automation tools allow you to tear down and spin up infrastructure as required from your pipelines.

Is your infrastructure setup automated?
How many people in your team understand and can change all the infrastructure?
Is your infrastructure immutable?
Is your infrastructure similar for your different environments?
Can you change any part of your infrastructure in less than 1 hour?
How long does it take to spin up a new environment?
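
For the question of whether your environments are similar, a simple drift report can already help. The sketch below compares two flat configuration dictionaries and lists every setting that differs. The keys and values are invented for illustration; in practice you would export them from whatever infrastructure tool you use.

```python
from typing import Any, Dict, Iterable, Tuple

MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_drift(
    reference: Dict[str, Any],
    candidate: Dict[str, Any],
    ignore: Iterable[str] = (),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (reference_value, candidate_value)} for every difference.

    Keys listed in `ignore` are skipped, which is useful for settings that
    are expected to differ, such as the number of replicas.
    """
    ignored = set(ignore)
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(reference) | set(candidate)):
        if key in ignored:
            continue
        ref_value = reference.get(key, MISSING)
        cand_value = candidate.get(key, MISSING)
        if ref_value != cand_value:
            drift[key] = (ref_value, cand_value)
    return drift


if __name__ == "__main__":
    # Hypothetical environment descriptions.
    staging = {"instance_type": "m5.large", "replicas": 2, "tls": True}
    production = {"instance_type": "m5.large", "replicas": 6, "tls": True, "waf": True}
    print(config_drift(staging, production, ignore={"replicas"}))
```

Running such a report regularly from a pipeline makes unintended differences between environments visible before they cause surprises in production.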

4) Security

When it comes to security, a concept proposed by Justin Smith is very powerful: Rotate, Repave, Repair. Rotate all credentials (passwords, certificates, you name it) frequently, ideally every few hours. While shooting for the stars is great, rotating within less than half a year already sets you ahead of a lot of teams. Repave your infrastructure frequently from a known good state. You can do that, for example, by recreating your Docker containers every few hours. Repair any known security vulnerabilities promptly, ideally within a couple of hours after a patch has been released. This applies to OS patches all the way up to security patches for the libraries your application uses.
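
The "Rotate" part is easy to monitor once you keep an inventory of when each credential was last rotated. Here is a minimal sketch, assuming such an inventory exists as a mapping from credential name to a timezone-aware timestamp; the names and the 180-day limit are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def credentials_due_for_rotation(
    last_rotated: Dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the sorted names of credentials older than `max_age`.

    All timestamps are expected to be timezone-aware.
    """
    current = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, rotated_at in last_rotated.items()
        if current - rotated_at > max_age
    )


if __name__ == "__main__":
    # Hypothetical inventory of credentials and their last rotation date.
    inventory = {
        "db-password": datetime(2019, 1, 1, tzinfo=timezone.utc),
        "tls-cert": datetime(2019, 8, 1, tzinfo=timezone.utc),
    }
    print(credentials_due_for_rotation(inventory, max_age=timedelta(days=180)))
```

Start with a generous `max_age` and tighten it as your rotation becomes more automated.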

While the concept gives you a lot of options to improve your security, there is also a principle that you can put into action right away with little effort: least privilege. If you need access to something, create specific technical users and grant them only the specific access you need (for example, read-only access to a database when your application doesn't need to write to it). Also, don't give root access to everyone by default, and if you do need root access, secure it with multi-factor authentication.

Another measure that improves security right away: use a team password manager. While you probably have a single sign-on solution that covers most authentications, there are always some shared secrets somewhere. There are so many password managers available that there is absolutely no reason not to use one. Make sure that you generate long and unique passwords for the various accounts you have everywhere. To programmatically share credentials and certificates between applications, you can use a credentials service such as Vault, Credhub or AWS Secrets Manager.
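
If you ever need to generate such a password in a script instead of in the password manager itself, Python's standard `secrets` module is a reasonable starting point. The alphabet and the minimum length of 16 characters below are assumptions, not a standard; adjust them to the rules of the system the password is for.

```python
import secrets
import string

# Assumed alphabet: letters, digits, and a few symbols most systems accept.
ALPHABET = string.ascii_letters + string.digits + "-_.!@#%^*"


def generate_password(length: int = 32) -> str:
    """Generate a random password using a cryptographically secure source."""
    if length < 16:
        raise ValueError("use at least 16 characters for shared secrets")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_password())
```

Store the result in your password manager or credentials service right away rather than in a chat message or a file.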

Do you have a security checker in place for known vulnerabilities in third-party systems?
Can you roll out a patch to any part of your infrastructure within a couple of hours? Also to production?
Do you have a security contact and a defined process to handle incidents in place?
Are there no access credentials that give the owner more rights than needed?
Do you frequently rotate credentials and certificates?
Do you use multi-factor authentication for team members, where possible?

5) Continuous Improvement

Once a team has truly cultivated DevOps, it strives for continuous improvement in all aspects that are under its control. For continuous improvement to be really effective, it is essential to prioritize improving the right things. Identifying the right things, and the right time to pursue them, takes practice, and you won't always get it right. To get there, always think about the opportunity costs of your choices. Does it really make sense to invest a week into automating a special case when you could update and heavily improve the documentation in the same amount of time?

For a lot of choices you can actually measure whether they were correct by identifying and frequently measuring key performance indicators (KPIs). While KPIs can be requested by stakeholders, it makes sense to come up with KPIs that you as a team want to be measured by. Without measuring important business KPIs as well as team and process KPIs, you won't be able to properly judge your decisions. There are tons of KPIs you could measure, so here are some examples:

Business KPI: What is the drop-out rate of real customers in the parts of the application that you own?
Business KPI: How often is each feature you own used? Does it make sense to keep and develop these features at all?
Business KPI: What is the amount of time users spend on your features? Is that good or bad (e.g. signup process vs. news feed)?
Team KPI: How long does it take for a bug to be fixed on average?
Team KPI: How accurate are your estimations?
Team KPI: How long does it take to deploy a commit to production?
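
The last team KPI, the time from commit to production, is straightforward to compute once you record both timestamps per deployment. The following sketch assumes you can export such pairs from your SCM and pipeline tool; the function names and sample data are made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Dict, List, Tuple


def lead_times(events: List[Tuple[datetime, datetime]]) -> List[timedelta]:
    """Turn (committed_at, deployed_at) pairs into lead times."""
    return [deployed_at - committed_at for committed_at, deployed_at in events]


def summarize(durations: List[timedelta]) -> Dict[str, timedelta]:
    """Return the mean and median of a non-empty list of durations."""
    seconds = [d.total_seconds() for d in durations]
    return {
        "mean": timedelta(seconds=mean(seconds)),
        "median": timedelta(seconds=median(seconds)),
    }


if __name__ == "__main__":
    # Hypothetical deployments: (commit time, time it reached production).
    deployments = [
        (datetime(2019, 8, 26, 9, 0), datetime(2019, 8, 26, 10, 0)),
        (datetime(2019, 8, 26, 9, 0), datetime(2019, 8, 26, 11, 0)),
        (datetime(2019, 8, 26, 9, 0), datetime(2019, 8, 26, 15, 0)),
    ]
    print(summarize(lead_times(deployments)))
```

Reporting the median next to the mean helps, because a single slow deployment can distort the average considerably.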

Another interesting read on KPIs is GitLab Cycle Analytics. Once you have identified some KPIs and are able to measure them, it starts making sense to have some sort of dashboard in place so everyone in the team has easy access to the relevant KPIs. Don't spend too much energy on creating a perfect dashboard - first validate whether the KPIs you identified and measure actually turn out to be of value for the team.

Beyond that, you can also foster efficiency and convenience for common tasks performed by your team. As an example, you could create and share a Postman collection of frequently used REST API calls. That makes it especially easy to onboard new members to the team. You could also create and follow checklists for common tasks such as defining a Definition of Done, rotating server certificates, or crafting a release with meaningful release notes.

Here is a checklist:

How often does your team work on improving its own process?
Which KPIs do you measure?
Which KPIs do you improve?
Do you have simple dashboards in place so KPIs are accessible?
Do you have a prioritized backlog?
Do you have checklists in place for common operations?

Conclusion

There are lots of options for what to improve. Discuss within your team where everyone sees the most potential. You may also seek advice from people outside the team to get a different perspective. Make sure to frequently zoom out to identify the areas that require attention and improvement, and then zoom in to actually improve and make progress in the identified areas. What additional topics come to your mind that are not yet part of the checklist? Let us know in the comments!