Taking Open-Source Cloud Foundry to Production

August 30, 2018

Cloud Foundry

Running open-source Cloud Foundry is a challenging task. Compared to vendor-provided distributions, deploying open-source Cloud Foundry requires in-depth knowledge of BOSH and of Cloud Foundry itself. This blog post shows solutions to typical operations topics that come up when taking open-source Cloud Foundry to production and gives an overview of what it takes to run the Cloud Foundry core itself.

Deploying Open-Source Cloud Foundry

The official deployment documentation is comparatively short and gives a good overview of the different steps required to deploy open-source Cloud Foundry. In a nutshell, we first need to stand up a BOSH director, then adjust the BOSH cloud config and finally deploy Cloud Foundry using the cf-deployment manifest.

If the IaaS you wish to deploy Cloud Foundry to is in the cloud, you may leverage the bosh-bootloader tool (bbl) to pave the required infrastructure with the help of Terraform and then stand up the BOSH director. For on-premises installations the bosh-bootloader tool offers little benefit, and you may instead deploy the BOSH director as described in the documentation.

In either case, once you have a BOSH director you need to customize the cf-deployment manifest with operations files to suit your needs. This is where we can adjust important parts of the platform to make it production-ready. The following sections give an overview of some of the most important operations topics and point you in the right direction for addressing them.

If you are unfamiliar with BOSH, make sure to first check out the Ultimate Guide to BOSH to get to know the BOSH terminology. If you have never done a BOSH deployment yourself, some of the wording here may sound strange to you. In that case you can deploy Concourse as a BOSH deployment as an exercise and then come back here to take Cloud Foundry to production. But now let's get started.

Backup and Restore

Of course we want to be able to back up and restore the platform. Luckily, the BOSH Backup and Restore tool (BBR) has proven to take good care of the job. As of this writing, BBR is able to back up the platform itself and, optionally, an external Cloud Controller database and blobstore.

To enable BOSH BBR support you will need to add the respective operations file to your deployment manifest:

bosh -d cf deploy $CF_DEPLOYMENT/cf-deployment.yml \
    -o operations/backup-and-restore/enable-backup-restore.yml \
    -o operations/backup-and-restore/enable-backup-restore-credhub.yml \
    -o ...<ADDITIONAL_OPS_FILES>...

Depending on the IaaS you are using, you may need to add further operations files as described in the backup and restore documentation.

Once deployed, make sure to take backups with the BBR tool regularly. Also try to restore from a backup at least once, as that will be a great exercise for you and your team.
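
A backup run can be wrapped in a small shell function. The following is only a sketch: the function names are made up for this post, and it assumes that BOSH_ENVIRONMENT, BOSH_CLIENT, BOSH_CLIENT_SECRET and BOSH_CA_CERT_PATH are exported beforehand and that bbr picks up the client secret from the BOSH_CLIENT_SECRET environment variable.

```shell
# Sketch: back up the "cf" deployment with BBR.
# Assumptions (not from the steps above): BOSH_ENVIRONMENT, BOSH_CLIENT,
# BOSH_CLIENT_SECRET and BOSH_CA_CERT_PATH are exported beforehand.
# bbr_cf and backup_cf are hypothetical helper names.
bbr_cf() {
  # $1 is the bbr subcommand, e.g. pre-backup-check or backup;
  # any further arguments are passed through unchanged.
  subcommand="$1"
  shift
  bbr deployment \
    --target "$BOSH_ENVIRONMENT" \
    --username "$BOSH_CLIENT" \
    --ca-cert "$BOSH_CA_CERT_PATH" \
    --deployment cf \
    "$subcommand" "$@"
}

backup_cf() {
  # Only take the backup if the pre-backup check succeeds.
  bbr_cf pre-backup-check && bbr_cf backup
}
```

Calling backup_cf from a scheduled job (a Concourse pipeline, for example) is one way to get the regular backups mentioned above.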

Log Management

BOSH supports sending platform logs via the syslog release to a remote syslog endpoint. When operating a production Cloud Foundry foundation, it is essential to have an external, persistent log management system so that you can trace errors when they happen. To enable forwarding of platform logs to an external syslog endpoint, add an operations file to the deployment as follows:

bosh -d cf deploy $CF_DEPLOYMENT/cf-deployment.yml \
    -o operations/addons/enable-component-syslog.yml \
    -v syslog_address="logs4.papertrail.com" \
    -v syslog_port=38559 \
    -v syslog_permitted_peer="*.papertrail.com"

You may have a look at the reference documentation of the syslog_forwarder job from the syslog BOSH release to see all available configuration options that are not covered by the provided syslog operations file.

Monitoring Cloud Foundry

BOSH has an internal component called the BOSH Health Monitor (HM) that is capable of sending monitoring data to external systems. The HM is configured as part of the BOSH director deployment and can of course also be added to an existing BOSH director deployment. To enable it, you need to write a custom operations file for your monitoring backend and add that to the deployment (BOSH supports OpenTSDB, Graphite, PagerDuty, DataDog and AWS CloudWatch out-of-the-box). A custom operations file for sending monitoring data to a Graphite instance might look as follows:

# file: operations/enable-graphite-hm.yml
# enable Graphite as backend for BOSH health monitor
- type: replace
  path: /instance_groups/name=bosh/properties/hm/graphite?
  value:
    address: ((hm_graphite_address))
    port: ((hm_graphite_port))
    prefix: ((hm_graphite_prefix))
- type: replace
  path: /instance_groups/name=bosh/properties/hm/graphite_enabled?
  value: true

Then, you can add the operations file filling in the variables when standing up the BOSH director:

bosh create-env \
    $BOSH_DEPLOYMENT/bosh.yml \
    --state bosh-state.json \
    -o operations/enable-graphite-hm.yml \
    -v hm_graphite_address="graphite.your.org" \
    -v hm_graphite_port=2003 \
    -v hm_graphite_prefix="oscf.dev." \
    ...

The HM also has auto-healing capabilities through the so-called resurrector. The resurrector continuously monitors BOSH-deployed VMs and automatically recreates them if they are not considered healthy. By default, the resurrector is enabled. However, it sometimes makes sense to turn the resurrector off (during upgrades, for example). To do so, you may write another operations file:

# file: operations/toggle-resurrector.yml
# enable or disable the BOSH HM resurrector as needed
- type: replace
  path: /instance_groups/name=bosh/properties/hm/resurrector_enabled?
  value: ((hm_resurrector_enabled))
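
Applying the toggle then means re-running create-env with the desired value. The function below is a sketch only: set_resurrector is a made-up helper name, and it assumes that all other operations files and variables your director needs are passed along as extra arguments.

```shell
# Sketch: re-run create-env with the resurrector switched on or off.
# Assumption: any further ops files and variables your director needs
# are passed as extra arguments; they are not shown here.
# set_resurrector is a hypothetical helper name.
set_resurrector() {
  # $1 must be "true" or "false"
  case "$1" in
    true|false) enabled="$1" ;;
    *)
      echo "usage: set_resurrector true|false [create-env args]" >&2
      return 2
      ;;
  esac
  shift
  bosh create-env \
    "$BOSH_DEPLOYMENT/bosh.yml" \
    --state bosh-state.json \
    -o operations/toggle-resurrector.yml \
    -v hm_resurrector_enabled="$enabled" \
    "$@"
}
```

For example, set_resurrector false before an upgrade and set_resurrector true afterwards, each time followed by the same extra arguments you used when you first created the director.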

Instance Sizing

Cloud Foundry deploys a large number of VMs that all host different jobs, which may need more or fewer resources depending on your usage pattern. Properly sizing instance groups is key to operating a cost-efficient but fast and highly available Cloud Foundry foundation. It requires some understanding of the platform's architecture and some monitoring to get this right.

Scaling Horizontally

By default, Cloud Foundry deploys instance groups to be highly available: some instance groups are deployed 2 times and others 3 times (in case they need a quorum). To change the number of instances, we write an operations file that replaces the instances count of each instance group we want to scale:

- type: replace
  path: /instance_groups/name=router/instances
  value: ((scale_router_instances))
- type: replace
  path: /instance_groups/name=diego-cell/instances
  value: ((scale_diego-cell_instances))

Extracting the relevant instance counts to variables makes sense in order to reuse the scripts across different stages, which may then be scaled individually.
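
One way to reuse the same operations file across stages is a variables file per stage. This is a sketch under assumptions: the file names operations/scale-instances.yml and vars/<stage>.yml as well as the function name deploy_stage are hypothetical, and each variables file is expected to define scale_router_instances and scale_diego-cell_instances.

```shell
# Sketch: one scaling ops file, one variables file per stage.
# Assumptions: operations/scale-instances.yml contains the scaling ops
# file and vars/<stage>.yml defines scale_router_instances and
# scale_diego-cell_instances. All file names are hypothetical.
deploy_stage() {
  stage="$1"
  if [ -z "$stage" ]; then
    echo "usage: deploy_stage <stage>" >&2
    return 2
  fi
  bosh -d cf deploy "$CF_DEPLOYMENT/cf-deployment.yml" \
    -o operations/scale-instances.yml \
    -l "vars/${stage}.yml"
}
```

With this in place, deploy_stage dev and deploy_stage prod differ only in the numbers stored in their variables files.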

Scaling Vertically

In order to scale vertically, we first need to understand how BOSH assigns VM sizes to instance groups. In a deployment manifest, instance groups are sized by the vm_type: <SIZE> property next to the instance count property. The value of that property refers to one of the VM types available in the configured BOSH cloud config. You may view the current VM types by executing:

bosh cloud-config | bosh int --path "/vm_types" -

In the cloud config you can see the actual hardware resources that are allocated to each VM type. In order to choose a specific VM type for an instance group you may provide the VM type name through an operations file:

- type: replace
  path: /instance_groups/name=router/vm_type
  value: ((scale_router_vmtype))
- type: replace
  path: /instance_groups/name=diego-cell/vm_type
  value: ((scale_diego-cell_vmtype))

To change the available VM types in the BOSH cloud config you will need to adjust the cloud config with an operations file similar to how you would change the BOSH deployment itself. As this strongly depends on how you deployed BOSH this is out of scope of this article (for a BBL deployment see the customization guide and for a vanilla BOSH deployment see the cloud config docs).

High Availability

With open-source Cloud Foundry it is easy to achieve a highly available deployment. By default, Cloud Foundry is configured to use two availability zones. If you need to adjust the number of availability zones for your deployment, you need to write some operations file magic, as each instance group is assigned to its availability zones individually. To get started, you may find the relevant sections to adjust with the following command:

$ cat cf-deployment.yml | grep -n -A3 -B1 "azs"
...
215-  - name: diego-bbs
216:  azs:
217-  - z1
218-  - z2
219-  instances: 2
...

You can then decide that you want 3 instances of the diego-bbs instance group and spread it across all availability zones with the following operations file:

- type: replace
  path: /instance_groups/name=diego-bbs/instances
  value: ((scale_diego-bbs_instances))
- type: replace
  path: /instance_groups/name=diego-bbs/azs
  value: [z1, z2, z3]

Of course, the names of the availability zones are defined in the BOSH cloud config specific to your deployment. Although two instances per instance group (the default) provide enough fault tolerance to deal with a single availability zone outage, it is highly important to keep an eye on the average workload per instance group compared to the number of instances deployed. If you have 2 instances of an instance group with an average utilization of 60%, the single remaining instance would have to carry 120% of its capacity during an AZ outage and won't be able to keep up, so make sure to scale the deployment accordingly with the above operations file skeleton.
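
The arithmetic behind this rule of thumb can be sketched in a few lines of shell. The function name is made up, and the model is deliberately simple: it assumes instances are spread as evenly as possible across the AZs and that the load of the lost instances is redistributed evenly over the survivors.

```shell
# Sketch: expected utilization (in percent) of the surviving instances
# after losing one availability zone.
# Assumes instances are spread evenly over the AZs and load is
# redistributed evenly over the survivors. Integer arithmetic only.
# $1 = instances, $2 = availability zones, $3 = average utilization in %
utilization_after_az_outage() {
  instances="$1"
  azs="$2"
  utilization="$3"
  # Worst case: the lost AZ held ceil(instances / azs) instances.
  lost=$(( (instances + azs - 1) / azs ))
  survivors=$(( instances - lost ))
  if [ "$survivors" -le 0 ]; then
    echo "no instance survives the outage" >&2
    return 1
  fi
  # Total load is instances * utilization; round the result up.
  echo $(( (instances * utilization + survivors - 1) / survivors ))
}
```

Calling utilization_after_az_outage 2 2 60 prints 120, which is the overload case described above, while utilization_after_az_outage 3 3 60 prints 90, so three instances across three AZs would still cope.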

Security

You are lucky! A Cloud Foundry deployment has a decent amount of security built-in. BOSH is used to generate strong passwords and certificates that are used to secure the communication between internal components. It is best practice to deploy the BOSH director with a credentials store such as CredHub in order to keep the generated credentials secure. Concerning security, there is nothing you need to do in particular for the open-source version of Cloud Foundry.

To change some security aspects of Cloud Foundry you may use some of the standard operations files already provided by the cf-deployment project:

disable-router-tls-termination.yml: to eliminate keys related to performing TLS termination within the gorouter job
enable-cc-rate-limiting.yml: to enable rate limiting for UAA-authenticated endpoints
stop-skipping-tls-validation.yml: to enforce TLS validation for all components which skip it by default
use-trusted-ca-cert-for-apps.yml: to inject the specified CA into the Diego trusted store

Authentication

The cf-deployment project ships with a pre-configured UAA instance that uses an internal user store. If you want to add an external identity provider such as SAML or LDAP, you can do so by setting the appropriate properties through a custom operations file. The UAA server offers a wide range of configuration options that contribute to a great authentication mechanism for Cloud Foundry users.

Developer Topics

Last but not least, there are more things to be done once Cloud Foundry has been deployed.

Buildpacks: Buildpacks that are available on the platform need to be installed and maintained. Ideally, you set up a fully automated Cloud Foundry buildpack management pipeline running on Concourse that takes care of always installing the latest buildpacks on the platform.

Developer Console: Open-source Cloud Foundry does not ship with a graphical user interface for developers by default. However, there is a great open-source user interface developed by SUSE: Stratos. The UI covers a lot of features and can even manage multiple Cloud Foundry foundations at the same time. There are also various deployment options for Stratos.

Quotas and Security Groups: Make sure to configure good quotas for your Cloud Foundry foundation. For an easy start, think of quotas as t-shirt sizes: for example, a small quota with 20GB of memory, a medium quota with 60GB of memory and a large quota with 150GB of memory. It is best practice to keep the number of quotas low to ease maintenance. Security groups should be configured to control egress traffic from application containers. In the security groups configuration you need to whitelist all resources that need to be accessible.
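
The t-shirt sizes above can be created with the cf CLI. This is a sketch that assumes a cf CLI session logged in with admin rights; the quota names small, medium and large and the function name are only examples.

```shell
# Sketch: create the three t-shirt-sized quotas from the text.
# Assumes a cf CLI session that is logged in with admin rights.
# Quota names and the function name are examples only.
create_tshirt_quotas() {
  cf create-quota small -m 20G &&
  cf create-quota medium -m 60G &&
  cf create-quota large -m 150G
}
```

A quota is then assigned to an organization with cf set-quota, followed by the organization name and the quota name.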

Service Brokers: Cloud Foundry developers need a marketplace filled with useful service offerings! As an operator you can fill the marketplace by registering service brokers with the Cloud Foundry deployment. In most settings I know where open-source Cloud Foundry has been deployed, lots of custom-built service brokers have been added to the foundation. However, there are also some official and some community-driven service brokers available on GitHub.
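
Registering a broker and publishing its offering takes two cf CLI calls. The helper below is a sketch with a made-up name; it assumes an admin cf CLI session and that you know the broker's credentials, URL and the name of the service offering it exposes.

```shell
# Sketch: register a service broker and publish one of its offerings.
# Assumes a cf CLI session that is logged in with admin rights.
# register_broker is a hypothetical helper name.
# $1 = broker name, $2 = broker user, $3 = broker password,
# $4 = broker URL, $5 = service offering to enable in the marketplace
register_broker() {
  if [ "$#" -ne 5 ]; then
    echo "usage: register_broker <name> <user> <password> <url> <service>" >&2
    return 2
  fi
  cf create-service-broker "$1" "$2" "$3" "$4" &&
  cf enable-service-access "$5"
}
```

Note that passing the password as an argument makes it visible in the shell history and process list, so for anything beyond a quick test you may prefer to read it from a credentials store.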

Conclusion

Deploying a production-grade open-source Cloud Foundry distribution is not unrealistic. In fact, the open-source Cloud Foundry distribution addresses all key concerns a production-grade deployment needs to consider. With the above pointers you should be up and running in no time. If not, get in touch with us at mimacom - I'm sure we can help! :-)