The Challenges of Taking Open Source Cloud Foundry to Production
When committing to Cloud Foundry, some companies make the strategic decision to take full ownership of platform deployment and operations, and deploy the open source Cloud Foundry distribution to fully understand all moving parts. While open source Cloud Foundry has become rather easy to deploy with the cf-deployment project, there is still a lot more to do to make an open source deployment production-ready. This blog post highlights the challenges we faced that are unique to deploying open source Cloud Foundry, and gives advice on how to grow through these challenges.
Setting the Stage
During our open source Cloud Foundry journey we went through four phases:
- Planning: As the deployment was on-premises, some initial planning was needed to decide on network topology, address ranges, hardware, firewall configurations and load balancer setup. In the existing organizational structure a different team covered each of these topics, so a lot of communication and thorough planning was required in the beginning.
- Deploying: Once the data center was configured, the actual Cloud Foundry deployment began and we started spinning up all the Cloud Foundry components through BOSH. Once we had it deployed and operating, we ensured that we were able to back up and restore the whole environment, and then went live immediately.
- Operating: Once in production, the platform needs to be maintained and upgraded. So we began to invest in automation, focusing first on the most important operational aspects such as monitoring and backups, and then proceeded with upgrading the platform itself.
- Thriving: As we built more and more automation, the platform's day-to-day operational tasks were eventually taken care of by Concourse. A high degree of automation allows the operations team to focus on strategic platform growth and development efforts.
During those four phases we identified several challenges that we had to overcome while taking open source Cloud Foundry to production. Naturally, there were technological challenges to solve. However, the challenges around organization, processes, and culture were at least as difficult. The following sections provide an overview of the different challenges we faced in those areas that are unique to open source Cloud Foundry.
Technological Challenges
The good thing about Cloud Foundry is that the platform experience for developers is the same regardless of whether they run on the open source distribution or a vendor distribution, because at their core all flavors run the unmodified open source code. Developers therefore get the same developer experience, the same runtime, and the same way of operating apps.
A bigger difference from a technological standpoint is the operator experience (OX). Most vendor distributions hide the complexity of the product behind graphical user interfaces or simplified configuration options, but with open source Cloud Foundry operators are presented with the full set of configuration options. And of course, those are defined in dozens of YAML files. So for the operations team, the first challenge was to deep-dive into BOSH as the deployment tool for Cloud Foundry.
Luckily, cf-deployment already provides lots of operations files out of the box that we can leverage to configure Cloud Foundry for a production setup.
Despite the ready-made operations files, customizations were also necessary, and the team learned how to tailor BOSH deployments by using the documentation that is available for each BOSH release.
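To give a feeling for what an operations file does, here is a minimal sketch in Python. It is not how BOSH implements this; it only handles the "replace" operation type and a small subset of the path syntax (map keys and "key=value" selectors for list items). In the real tooling, operations files are YAML and are applied by the BOSH CLI; the manifest content, the instance counts, and the helper names below are made up for illustration.

```python
import copy


def _descend(node, segment):
    """Follow one path segment into a dict or a list of dicts."""
    if "=" in segment:
        # A segment such as "name=diego-cell" selects the one list item
        # whose "name" field equals "diego-cell".
        key, wanted = segment.split("=", 1)
        matches = [item for item in node if item.get(key) == wanted]
        if len(matches) != 1:
            raise KeyError(f"expected exactly one item with {key}={wanted}")
        return matches[0]
    return node[segment]


def apply_ops(manifest, ops):
    """Return a patched copy of the manifest; the base manifest is untouched."""
    result = copy.deepcopy(manifest)
    for op in ops:
        if op["type"] != "replace":
            # Simplification: other operation types are not handled here.
            raise ValueError(f"unsupported operation type: {op['type']}")
        *parents, last = op["path"].strip("/").split("/")
        node = result
        for segment in parents:
            node = _descend(node, segment)
        node[last] = op["value"]
    return result


# Hypothetical, heavily shortened base manifest.
manifest = {
    "instance_groups": [
        {"name": "diego-cell", "instances": 1},
        {"name": "router", "instances": 1},
    ]
}

# The Python equivalent of a one-entry operations file.
ops = [
    {
        "type": "replace",
        "path": "/instance_groups/name=diego-cell/instances",
        "value": 3,
    }
]

patched = apply_ops(manifest, ops)
print(patched["instance_groups"][0])
```

The important property is that the base manifest stays unmodified and every customization lives in its own small, reviewable file. That is what makes it possible to combine the upstream manifest with our own changes.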
When things got tough, we also learned to dive into the Cloud Foundry core source code and understand what happens there. This allowed us to resolve issues rather quickly ourselves, though it is definitely not easy finding the right location in the source code and building the correct mental model of all the things that happen. A solid understanding of the Cloud Foundry internal architecture was definitely very helpful here!
What we did to overcome these challenges:
- Learn BOSH, learn BOSH, learn BOSH. You never stop learning BOSH.
- Invested in properly architecting our YAML deployment files for simplicity and understandability.
- Documented technical insights we learned along the way.
- Learned from the official cf-deployment operations files.
- Heavily used GitHub search to inspect the Cloud Foundry source code.
Organizational Challenges
At the beginning of the project we wanted to hire an experienced Cloud Foundry operator who is familiar with the whole ecosystem. It turns out there is basically no one available for hire. We were lucky though, and found an experienced freelance operator who joined our team part time and helped us make the right decisions.
So if hiring is not an option, the only valid way to move forward is to upskill every team member. We invested strongly in a culture of learning and sharing and strove for everyone to be able to do everything related to the platform. That meant that especially in the beginning we did a lot of tooling workshops for the team, so everyone knew the specifics of JSON and YAML syntax, how jq can be used to process JSON on the CLI, how git branches work, what S3 is and how to perform basic S3 operations, how Concourse can be used to automate anything, and finally how we use BOSH to actually deploy stuff in the data center. Learning took time, and was also exhausting, but the investment paid off immediately as everyone was on the same page right from the beginning.
To keep the momentum of learning and sharing the knowledge, we also developed most parts of our setup in pair or mob programming sessions. That helped to keep a high velocity as everyone in the team was always aware of what happened and why things were built as they were. Also, everyone developed a great amount of ownership for everything we built.
What we did to overcome these challenges:
- Identify what skills everyone in the team already has.
- Create trainings for all required skills and tools that everyone attends. Experienced team members could then train their colleagues.
- Do an intensive Cloud Foundry architecture training, so everyone understands the moving parts of the platform.
- Leverage pair and mob programming to increase the team's output while sharing knowledge.
Process Challenges
One process that differs from a vendor distribution is the upgrade process for any moving part of the platform. Instead of getting a tested and packaged binary from a vendor that is ready to install, we get new versions of source code being tagged on GitHub. For Cloud Foundry buildpacks, that means compiling the buildpack from the GitHub repository and deploying it to Cloud Foundry. We have automated the compilation of the buildpacks and use a staged deployment to roll out the buildpacks across environments, which also includes testing the buildpack with a sample application.
To upgrade the platform itself, we have built a GitOps-based approach, so all changes to the platform must be done through code in a Git repository. That also means that any pipeline we build follows the same pattern: First, upload required releases and configure any relevant settings, then perform the actual deployment, and finally run a smoke test to see whether things really work. This approach usually collides with processes established companies have in place to take software to production, so be prepared for some discussions.
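As an illustration of that pattern, here is a small Python sketch. Our real pipelines are Concourse pipelines, not Python; the step names, the release names in the example, and the `run_step` callback are assumptions made up for this sketch.

```python
def build_pipeline(releases):
    """Return the ordered steps every deployment pipeline follows."""
    steps = [f"upload-release:{name}" for name in releases]
    steps += ["configure", "deploy", "smoke-test"]
    return steps


def run_pipeline(steps, run_step):
    """Run steps in order and stop at the first failing one.

    Returns a tuple of (completed steps, failed step or None).
    """
    completed = []
    for step in steps:
        if not run_step(step):
            return completed, step
        completed.append(step)
    return completed, None


# Illustrative release names; any list of names works.
steps = build_pipeline(["capi", "diego"])

# Simulate a run in which the deployment itself fails.
completed, failed = run_pipeline(steps, lambda step: step != "deploy")
print("completed:", completed)
print("failed at:", failed)
```

Because every pipeline has the same shape, a failed run immediately tells you in which phase things went wrong, and the smoke test only ever runs against a deployment that actually succeeded.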
Another process that completely changes with open source Cloud Foundry is getting support. As there is no vendor with any SLAs, the only way to get support for any question is to reach out to the Cloud Foundry community. The best channels for this are the Cloud Foundry Slack team or filing an issue in one of the Cloud Foundry repositories.
However, there is no guarantee that anyone can and will answer the specific question you may have. So in that case it is very helpful to learn how to fix and patch Cloud Foundry components on your own. That involves learning Go (as most Cloud Foundry components are written in Go) and understanding how to create a BOSH release from the patched source code. And of course, once you have fixed and patched the issue you had, please submit a pull request to the repository.
What we did to overcome these challenges:
- Automate the upgrade process of any moving parts and be transparent about it with the rest of the company.
- Get involved with the Cloud Foundry community.
- Learn Go so you are able to fix issues yourself.
Cultural Challenges
While we still do operations, the way we work together has changed completely compared to the classical operations teams found in companies over the past decades. As we're using GitOps for all changes, we are effectively doing software development, and it turns out that Scrum works really well for the continuous development of the platform. We curate a backlog of the tasks at hand, prioritize and refine them, plan sprints, and hold retrospectives to continuously improve the way we work together.
When working agile and moving at speed, failure can happen. Especially as we're dealing with heaps of YAML, configuration errors can easily sneak in. While of course everyone tries to avoid these situations, a lot of developers (or even end customers) are affected when the platform goes south. To err is human, and if there is a blame culture with finger-pointing, platform development will slow down tremendously as everyone is afraid to make changes. Instead, everyone needs to help first resolve any bad situation and then ensure that it can't happen again. On the other hand, do show appreciation when someone does a good job, and celebrate what you have accomplished!
Taking and using an open source product means that someone has built great software and made it available at no monetary cost. The more time you spend with the platform, the more proficient you will become with it. When gaining so much value from open source, it's also important to give back to sustain the community and its future development. Even if you're not able to contribute code, there is so much you can do to give back: share what you do and how you did it (think: blog posts, meetups, conferences, ...), file good issues that help developers pinpoint faults and resolve them, or just get involved and help others who are on the same journey by answering their questions.
What we did to overcome these challenges:
- Adopt an agile methodology like Scrum (this takes practice).
- Invest in properly creating a backlog organized in stories and epics. This also makes it easy to talk with stakeholders about where platform development is heading.
- Get rid of a blame culture entirely. Learn and improve from failures.
- Get involved with the community.
Conclusion
Taking open source Cloud Foundry to production is a wild ride that comes with a couple of steep learning curves. While definitely doable, it only makes sense if there is a sincere and strong commitment to investing in the platform itself and the community around it.
At the Cloud Foundry Summit Europe 2019 I gave a talk about the challenges we faced when taking open source Cloud Foundry to a production environment and here is the recording:
Embedded content: https://www.youtube.com/watch?v=k1uPdlW_lP0
You may find the slides of the talk at Speaker Deck.