Getting to Cloud DevOps: A Basic Checklist

July 31, 2019

This blog post covers basic DevOps practices. If you already feel confident with DevOps, you may want to read Thriving with DevOps: An Advanced Checklist.

With cloud platforms such as Cloud Foundry, more and more teams are moving from a development-only team to a DevOps-enabled team. The aim is to continuously improve the application development lifecycle by moving fast from idea to production. However, with both development and operations work to get done, teams with less operations experience tend to neglect some important operational aspects. This blog post provides a basic checklist that teams can walk through to decide for themselves whether they want to invest more effort in a specific topic. Personally, I believe that any real DevOps team needs to have the following aspects in place. In an upcoming blog post I will provide a checklist for advanced DevOps topics.

1.) Continuous Integration & Delivery (CI/CD)

Let's start with a bold statement:

Taking a commit to production must run through a fully automated pipeline

That means that any change to software or infrastructure should move through a maintained (and understood!) path of CI/CD pipelines. To make this work in practice, developers need a short feedback cycle, so they know as soon as possible whether a new change has any flaws. Failing unit tests need to produce a build failure in less than 5 minutes, and the overall process to deploy to production should take no more than 60 minutes. When pipelines run fast, people actually use them! There is nothing worse than a slow pipeline, because everyone will start to bypass it for the important production hotfix.
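To make the two time budgets tangible, here is a minimal sketch of a check you could run against measured stage durations. The `Stage` class, the stage names and the `check_budgets` function are made up for this illustration and do not belong to any CI system; only the 5 and 60 minute limits come from the text above.

```python
from dataclasses import dataclass

# Budgets from the text: unit-test feedback in under 5 minutes,
# commit to production in no more than 60 minutes.
UNIT_TEST_BUDGET_MIN = 5
TOTAL_BUDGET_MIN = 60


@dataclass
class Stage:
    name: str       # hypothetical stage name, e.g. "unit-tests"
    minutes: float  # measured duration of this stage


def check_budgets(stages, unit_test_stage="unit-tests"):
    """Return a list of budget violations (empty if the pipeline is fast enough)."""
    violations = []
    elapsed = 0.0
    for stage in stages:
        elapsed += stage.minutes
        # Feedback time is everything that ran up to and including the unit tests.
        if stage.name == unit_test_stage and elapsed >= UNIT_TEST_BUDGET_MIN:
            violations.append(
                f"unit test feedback took {elapsed:.0f} min "
                f"(budget: under {UNIT_TEST_BUDGET_MIN} min)"
            )
    if elapsed > TOTAL_BUDGET_MIN:
        violations.append(
            f"pipeline took {elapsed:.0f} min (budget: {TOTAL_BUDGET_MIN} min)"
        )
    return violations
```

You could feed this with the stage timings your CI system reports and fail a nightly job whenever the list is not empty, so a slowly degrading pipeline gets noticed before people start bypassing it.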

Automated checks do not rule out a Git workflow that supports your team and includes code reviews. Once the review passes, however, at most a single manual trigger should be needed to deploy the code.

If you have a product to operate, go the extra mile and automate the deployment itself as well. Even if you have a manual trigger somewhere, a deployment should be no more than one or two clicks for a human. Humans should not be required to hop on a shell or do things on their local machine.

Here is a checklist to assess your team's progress for continuous engineering practices:

Does a continuous integration pipeline exist that builds the software project?
Are unit tests being run?
Are integration tests being run?
Are end-to-end tests being run?
Are deployments of the built artifact automated?

2.) Logging

Application logs are invaluable when it comes to understanding what went wrong. To have them ready when you need them, make sure to emit them in your application and collect them in your infrastructure. Modern cloud log aggregation systems usually collect logs from the standard output stream, as recommended by the Twelve-Factor App methodology. Writing logs to the file system will likely cause logs to be lost when you need them the most, so make sure to follow cloud best practices.
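As an illustration, here is a minimal sketch of an application that writes structured logs to standard out using Python's standard `logging` module. The JSON field names, the `correlation_id` field and the `build_logger` helper are assumptions chosen for this example, not a format that any particular log aggregator requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: lets an aggregator group related log lines.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to standard out, not to files."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as `build_logger("shop").info("order placed", extra={"correlation_id": "req-123"})` then prints a single JSON line to standard out, which the platform can pick up without the application knowing anything about log files or log shipping.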

While having access to the logs is the first step, the real value is in analyzing the logs to extract particular information. To achieve this, it is important that all team members are upskilled such that they can actually use the log aggregation tool to filter and query log messages for specific questions.

Here is a checklist to assess your team's logging setup:

Are all application logs gathered in a centralized log tool?
Are application logs collected via standard out and not written to files?
Does everyone in the team know how to access the logs?
Do you have documentation with examples of common queries to extract valuable information from the logs?
Are you able to correlate related logs from the same or from different applications?

3.) Set Up Monitoring

It is crucial for a team to observe application behavior in production. There are lots of monitoring solutions available that integrate well with whatever platform you are using. If you are unaware of what monitoring solutions exist, chances are your company already has one. Just ask around and get access for your team as well.

Basic metrics you should be monitoring include:

Server capacity (CPU, disk, network): This will show you if there are technical issues with your application landscape. You may need to scale.
Important application metrics that are business relevant (e.g. number of online sales of a shop): Even though your tests are green, you may still be making no money from your application. If there are no online sales, you might have broken the login form with your last deployment. Any failure of core functionality will result in a drop of these metrics. Observe the behavior and expect natural drops to occur (e.g. during Christmas).
Certificate expiration dates: I don't know of any team that has never had to deal with expiring certificates. Certificates are usually valid for 1 to 5 years, so expiry may well happen while you are in charge. Make sure to rotate your certificates before they expire.
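For the certificate metric, a rough sketch using only Python's standard `ssl` and `socket` modules could look like the following. The function names and the 30-day warning window are assumptions made for this example; pick whatever lead time your rotation process needs.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left until a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_rotation(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Calling `needs_rotation(fetch_not_after("example.org"))` on a schedule and exporting `days_until_expiry` as a metric is one way to get a warning well before the expiry date. Note that with the default context an already expired certificate makes the TLS handshake fail, so a monitoring job should treat that exception as an alert too.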

Once you have the metrics in a monitoring solution, the next step is to set up alerts that inform you of critical situations. Some monitoring systems allow you to define hard thresholds, which make sense, for example, at 80% disk utilization. For the number of sales in an online shop, it makes more sense to use dynamic, adaptive alerting that considers the usual trend over time and detects anomalies.
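The difference between the two alerting styles can be sketched in a few lines. This is a deliberately simple illustration, not how any specific monitoring product works: the function names are made up, and the anomaly check uses a plain mean and standard deviation over recent values as a stand-in for the more sophisticated trend models that real tools offer.

```python
from statistics import mean, stdev


def disk_alert(used_percent, threshold=80.0):
    """Hard threshold: alert once disk utilization reaches the limit."""
    return used_percent >= threshold


def sales_anomaly(history, current, max_sigma=3.0):
    """Adaptive check: flag values far below the recent average.

    history holds recent counts for comparable periods
    (for example, the same hour on previous days).
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any drop below it is unusual.
        return current < baseline
    # How many standard deviations below the baseline is the current value?
    return (baseline - current) / spread > max_sigma
```

With a history of `[100, 110, 90, 105, 95]` sales per hour, a current value of 95 is within normal variation, while a value of 20 is flagged. A fixed threshold could not express that, because what counts as "normal" changes with the time of day and the season.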

Here is a checklist:

Are technical application metrics monitored for all systems?
Are relevant business metrics monitored?
Are certificate expiration dates monitored?
Are alerts set up to inform the team?
Do you have an operations cheat sheet in place that clearly states what possible root causes for an alert may be and what to do in such a case?

4.) Documentation

Documentation is important, but often neglected and outdated. When it comes to documentation it's all about documenting the right things, and not just documenting anything. When writing documentation there are two questions with which you can challenge the usefulness of the documentation:

Who is going to read this documentation?
What will the readers learn from this documentation?

Here are some topics that should be part of your documentation:

An operator's handbook explaining concisely how the infrastructure for the various systems is provisioned
An operator's guide for common operational tasks (restarting servers, rotating certificates, ...)
Quality goals that your software should satisfy
How to set up a new workstation to begin development

5.) Culture

Culture in a DevOps team is just as important as the technical challenges, if not more so. The team should adopt an agile culture that strongly aligns with business objectives and visions, and strives to deliver business value. There is just no point in shipping little to no value at a high velocity!

In order to develop a healthy DevOps culture, see if your team needs to improve in any of the following:

Prioritize business objectives
Make small incremental changes, fast
Use an agile process. Agile processes have structure. If you feel that agile means chaos, then that's not agile.
Allow failure. While we don't set out to fail, we can learn a lot from failing. Never point fingers at anyone. However, do show appreciation when someone has done a good job.
Efforts in automation are key to achieving a high velocity. However, only automate if it is worth it. Automation needs to be maintained just as software needs to be.
Organisational processes, decision making and incident management should be transparent and documented.
The team should use an issue tracker to keep track of all issues that need to be tackled. Having a complete backlog is the foundation for good prioritization.
Use checklists for common tasks, operations and processes. Your memory is great, but don't bother it with remembering the exact steps to follow - just write them down.

Conclusion

DevOps is not just about having the right tools in place. It is way more important that every team member knows how the tools work and that everyone has a shared understanding of what DevOps means to the team. DevOps is a journey that can get started by implementing some of the topics mentioned in this blog post. Not every team needs to implement and consider every aspect, but please do discuss these topics in your team and find alignment among all team members about what to do with them.

Do you think an important topic is missing from the list? Then please do write a comment, so we can collect more topics that are valuable for DevOps teams.