
The problem of state: running linked data services in Kubernetes


An important part of our work involves running and managing services for our clients. These range from public-facing data services, such as the Environment Agency Hydrology service or the Food Standards Agency listings of regulated products, to private data services used for data integration and access within a client’s business. Running so many managed services efficiently means adopting common tools and patterns that we can reuse across our services.

Unsurprisingly, in this day and age, one of the key tools we use is kubernetes, and we’ve been successfully running production services on kubernetes clusters for more than three years now. In that time we’ve learned a lot about how to tame kubernetes and where, and where not, to use it. This blog is a dive into one of the key challenges we’ve faced – how to manage state.

If you’ve never heard of kubernetes then this is probably not the blog post for you 🙂 but at heart kubernetes gives us a way to run the many components that make up a typical customer service (applications, APIs, editors, publishing chains, data stores) as separate communicating containers. Kubernetes schedules and runs these containers for us to provide scaling (running resizable sets of replicas of each component that can share the load), resilience (balancing requests across the replicas and restarting failed components) and manageability (a shared tool stack for logging, monitoring and metrics).

Aside: the units of work in kubernetes are actually called pods, where a pod is a small collection of containers – typically one main container, optionally accompanied by others that help initialise things or run as side-cars for management and monitoring.

Kubernetes is absolutely brilliant for running stateless services. Components such as the front end applications or the API layers just need access to configuration information and the in-memory state for each incoming request – there’s no long-term persistent state to store. That means that if an instance of the API dies, kubernetes can safely restart it automatically. If a node in the cluster dies, all the stateless workloads can simply be restarted on other working nodes and the services just carry on running (so long as we have enough replicas to make sure at least one is operating during the transition).

But what about those components that do have persistent state? In particular, given we are all about data services, then what about the databases that are serving the data – typically triple stores in our case. Does it make sense to run those in the cluster?

If so, how best to do that and how to manage those stateful workloads?

While there’s machinery in kubernetes for supporting this kind of situation, it turns out there are a number of different options with different tradeoffs, and choosing the right approach can be tricky.

Option 0 – don’t do it!

There’s an argument that while you can run stateful services, including databases, in a kubernetes cluster, you shouldn’t if you don’t have to.

This is especially true if the database layer you need is available as part of your cloud provider’s “Platform as a Service” offering. We typically run on AWS (Amazon Web Services), and on those occasions when we do need a traditional database such as a PostgreSQL store, the AWS Relational Database Service (RDS) – especially the impressive Amazon Aurora – can be a better choice. Such cloud platform services already provide replication, resilience, scaling, backup and monitoring out of the box.

The rub here is that you do have to pay for this convenience, but if it saves support time those costs may well be worth it.

On the occasions we do go this route, a best practice is to expose the platform database within the kubernetes cluster as if it were also running in the cluster. This way your cluster workloads are decoupled from the choice of how you run your database. This is made possible through the Service/ExternalName mechanism. In kubernetes a Service is an abstraction for how you access a collection of components as a network service. It is mostly used to allow you to access a group of replicated pods, such as API instances, without having to worry how many instances are running or where they are. An ExternalName is a kind of Service that allows you to access an external service, such as a cloud hosted database, as if it were an in-cluster service. This is achieved with a simple service definition like this:

kind: Service
apiVersion: v1
metadata:
  name: data-rds
spec:
  type: ExternalName
  externalName: rds-data-controller.epimorphics.net

where rds-data-controller.epimorphics.net is a DNS name we’ve given to the managed database and data-rds is the internal kubernetes Service which we’ll use to access it.

Then later, if we migrate to a different backend store we simply replace this service definition, without having to reconfigure any of the components that use it.
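
For example, an application pod can be pointed at the database purely through that internal name. Here is a minimal sketch of such a consumer – the Deployment name, image and environment variables are illustrative, not our actual configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: api-implementation
        env:
        # The application only ever sees the in-cluster Service name;
        # swapping the backing store means changing the ExternalName Service,
        # not this Deployment.
        - name: DB_HOST
          value: "data-rds"
        - name: DB_PORT
          value: "5432"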

While this can be a good solution, the costs of a fully managed database aren’t always justified.

More seriously, in our case there isn’t always a cloud provider solution. We mostly use RDF databases, so called triple stores, and most cloud providers don’t offer those. Amazon does offer something, in the form of Amazon Neptune, but that lacks features like integrated text search that are critical to many of our applications. So for those cases we have no choice but to run something ourselves, either on raw machine instances or in the cluster.

Interlude – the elements of kubernetes state management

Kubernetes does provide lots of machinery for running stateful services.

  • Each cluster will have some means to allocate file system storage and attach it to a pod. In setting up a cluster you’ll install one or more storage drivers (CSI drivers) and then declare, to the workloads in the cluster, what types of storage are available as a set of StorageClass objects.
  • When a workload needs persistent storage it asks for some via a PersistentVolumeClaim (or PVC). The claim says how much storage is needed and in which storage class (a minimal claim is sketched just after this list).
  • When the driver for the requested storage class allocates space to match the claim, it represents it as a PersistentVolume (PV). It’s this volume which gets attached to the containers in the pod that asked for it. The claim is said to be bound to the volume.
  • The final element in the chain is the notion of a StatefulSet. This is a kubernetes workload that represents a set of replicas of some component (each replica will be a pod), collectively accessed via some Service. The stateful set includes a template for a PersistentVolumeClaim for each replica. It guarantees consistent naming and ordering of replicas and of the associated PVCs, which means that if a replica has to restart, the new one can safely re-attach the PVC/PV associated with the old one with no change of name.
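
To make the claim side concrete, here is a minimal standalone PersistentVolumeClaim. It assumes a storage class named ebs-gp3-sc has been declared in the cluster (the same class name is used in the examples below); the claim name and size are just illustrative:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: scratch-data
spec:
  accessModes: [ "ReadWriteOnce" ]
  # must name one of the StorageClass objects declared in the cluster
  storageClassName: "ebs-gp3-sc"
  resources:
    requests:
      storage: 10Gi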

Option 1 – StatefulSets with persistent block storage (EBS)

So finally we are ready to look at the first real option for running a data store in kubernetes – run as a stateful set using networked block storage.

All cloud providers offer some form of block storage where you can dynamically allocate some storage and attach it to a machine instance. In AWS that storage is called EBS (Elastic Block Store) and is what you would use as e.g. the root disk on normal EC2 instances. There is a CSI driver for EBS which is a standard addon for kubernetes clusters in AWS.
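
The cluster advertises that driver through one or more StorageClass objects. A minimal sketch of a gp3-backed class, assuming the standard EBS CSI driver is installed (the name ebs-gp3-sc is just the one referenced in the example below):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-gp3-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# delay volume creation until a pod using the claim is scheduled,
# so the volume is allocated in the same availability zone as the pod
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true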

So the standard route to defining a workload with persistent storage is to create a stateful set with a volume claim template that asks for an EBS volume for each pod in the set.

In practice that looks something like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: data-store
spec:
  serviceName: data-store
  replicas: 2
  selector:
    matchLabels:
      app: data-store
  template:
    metadata:
      labels:
        app: data-store
    spec:
      containers:
      - name: store
        image: store-implementation
        volumeMounts:
        - name: data-volume
          mountPath: /var/lib/store/databases
  volumeClaimTemplates:
  - metadata:
      name: data-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ebs-gp3-sc"
      resources:
        requests:
          storage: 50Gi

This will fire up two pods (replicas: 2) running the store-implementation image. The first pod will be called data-store-0 and the PVC for it will be data-volume-data-store-0. That claim asks for 50GiB of block storage from EBS. On first start-up this won’t exist, so the driver will allocate a new 50GiB EBS volume and bind a PersistentVolume describing it to the data-volume-data-store-0 PVC.

This volume will be initially empty, of course, so it’ll be up to our store-implementation to initialise it and to replicate data between the different replicas as updates happen – that’s not kubernetes’ problem.

If that pod dies the PVC and the bound volume will be kept. When kubernetes restarts the pod then, thanks to the consistent naming provided by stateful sets, the new pod will get the same name, will look for the same PVC and find it. So the pod will restart with the same volume attached and all the persistent data will still be there.

Problem solved? Not entirely.

Apart from the question of how all the replication and data management works, there are two problems here.

Firstly, EBS volumes are fine for normal workloads but their I/O performance is not good enough for very high performance databases. You can provision different levels of EBS IOPS, and the EBS approach can be performant enough for modest service levels. We do, in fact, run some of our triple stores off EBS successfully, but the performance may not be good enough for a service with heavy load.

Secondly, there’s the question of availability zones. Running services in AWS we want to spread them across zones, so a failure in one zone doesn’t bring down the service. The AWS regions we use have three availability zones each, and our clusters are set up so there are nodes in each of them. All fine and normal until you realise that EBS volumes are limited to a single zone.

So if data-store-0 first starts up in, say, zone eu-west-1a, that’s where the EBS volume data-volume-data-store-0 will be allocated. If the pod dies and restarts, kubernetes knows the pod can only run in zone eu-west-1a and won’t try to schedule it on a node in a different zone. But if there is too much pressure on our resources in that zone there might not be room for it, and the pod can’t restart even if there is plenty of spare capacity in the other zones.

If you run big clusters this isn’t so much of an issue, but our clusters are small – we might only have six nodes, with just two in each zone. At that point it’s quite possible to run into situations where the inability to start a stateful workload in a different zone leads to a deadlock.

Still, the EBS approach is good enough for some situations. For example, we use Elasticsearch as part of the log management stack for running our services. That’s only needed for operational management, so the load on it is small and the performance of running off EBS is just fine. Elasticsearch manages all the replication and we run with three replicas, so we end up with a replica in each zone; if one goes down and can’t immediately restart, the Elasticsearch service continues to be available.

Option 2 – Elastic File System (EFS)

One way round the limitations of zones is to use an alternative storage system which supports cross-zone access. For AWS the offering here is EFS (Elastic File System).

The big advantage of EFS is that you can network mount an EFS volume across zones and you can mount the same volume in multiple places. So instead of having lots of allocated storage volumes you can have just one, shared across many different workloads. That volume is elastic – you don’t have to allocate a specific size; store as much data in it as you like and you are charged for the amount stored.

There’s been a kubernetes CSI driver for EFS for quite a while, but to begin with it only let you mount the whole EFS volume, so every workload sees exactly the same file system. That would be fine (indeed really useful when building data pipelines, but that’s another story) except that you need some way to tell each replica of the data-store to use a different directory in the mounted volume for its data. That gets awkward.

Modern versions of the EFS driver now support dynamic provisioning, so that each dynamically provisioned persistent volume is given a separate directory in the shared EFS store and that directory is mounted as if it were the whole file system. That makes EFS as easy to work with as EBS but without the issues of moving between zones.

Assuming you have a sufficiently up to date EFS driver in your cluster then you just need to define a StorageClass for dynamic access:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-dyn-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-1234567890
  directoryPerms: "755"
  uid: "1000"
  gid: "1000"
  basePath: "/dynamic"

Then you can use efs-dyn-sc in the volume claim templates in your stateful set and you have effectively unlimited storage with no restriction to a single availability zone.
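
For example, the volume claim template from the earlier stateful set only needs the class name changing – a sketch, with an illustrative size (the EFS store is elastic, but the request field is still required):

volumeClaimTemplates:
- metadata:
    name: data-volume
  spec:
    # EFS-backed volumes can be mounted from pods in any zone
    accessModes: [ "ReadWriteMany" ]
    storageClassName: "efs-dyn-sc"
    resources:
      requests:
        storage: 5Gi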

So what’s the catch?

Well, first there’s cost. EFS costs around three times as much as EBS per GB per month. That’s not as bad as it sounds: with EBS you pay for the size of the volume you allocate, independent of the amount stored, whereas with EFS you only pay for what you actually use. Typically we would size an EBS volume to be safely big enough for peak storage, which can easily be two or three times the average storage use, so the raw storage costs may come out as comparable.

The other issue is performance. By default with EFS, the more you store the better the IO performance you get. (It’s a bit more complicated than that, with burstable IO rates and usage credits that count down, but that’s the essence.) So to get the IO performance you need for a database, the file system needs to be pretty big – or you can buy provisioned IO, but that gets really expensive really quickly.

The bottom line is that this pattern works fine for persisting small amounts of state – state trackers, persistent Redis queues and the like – but generally isn’t cost effective for running large triple stores.

Interlude – managed triple stores

So far we’ve been talking in generalities about the store implementations and skated over how the data gets in there and how replicas are maintained. Now’s the time to fill in a few of those gaps.

For the bulk of our services we represent the information as linked data, which means we want to hold it in an RDF database, aka triple store, that supports the full SPARQL query language and protocols.

We use the Apache Jena open source framework for most of our services, a project we have contributed to in the past. In particular, Jena includes an RDF server (Fuseki) and a couple of persistent storage layers – TDB (Triple DataBase) and TDB2. Out of the box this doesn’t provide for replication, though there is a separate project (RDF Delta) which does.

An important part of our data platform is a tool we call podium, which is a wrapper around Jena providing a cluster-friendly, managed, replicated Fuseki service. A key feature is that it allows us to publish a variety of data updates and replacements and have them distributed to a set of replicas. It also makes it easy to manage checkpoints and backups. When we bring up a new replica it can automatically start from the most recent checkpoint and only needs to replay the log of updates from that point forward, making it quick to catch up. We store those checkpoints and all the update logs in AWS S3, so they are very durable; even if we lose all the replicas we can quickly reinstate a podium cluster from the S3 state.

We might come back to podium, and how it works, in a future blog post but for now that’s enough to continue the persistence story.

Option 3 – ephemeral node storage

When running a high performance database directly on machine instances, the way to get the IO performance needed is to use local ephemeral storage. There are several classes of AWS machines which provide fast NVMe style storage for just this reason.

Can we do the same from within a kubernetes cluster? That would give us most of the benefits of manageability and monitoring without losing database performance.

Obviously the answer is “yes”, otherwise we wouldn’t have reached this point in the story!

The brute force option, which is just about viable for some use cases, is to allocate some storage partitions on the ephemeral disk of the local nodes (by some external scripting) and expose them to the cluster as fixed storage resources associated with those nodes.

To do this we can define our own storage class with no provisioner (the volumes have all been pre-provisioned on the nodes):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

Then we create a static PersistentVolume declaration for each partition, limited to the node it is on (using the “node affinity” machinery in kubernetes):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-$PVNUMBER-$NODE_NAME
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1/$PVNUMBER
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - $NODE_NAME

Then in our stateful set template we just request a persistent volume from the local-storage class; kubernetes will pick a node with an unused matching persistent volume and bind to that.
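
In sketch form, the only change from the earlier EBS example is the claim template. With the WaitForFirstConsumer binding mode the claim is only bound once the pod is scheduled, so the scheduler can pick a node that still has a free matching volume:

volumeClaimTemplates:
- metadata:
    name: data-volume
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "local-storage"
    resources:
      requests:
        # must be no bigger than the pre-provisioned 100Gi partitions
        storage: 100Gi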

We then rely on the store worker’s startup machinery to check whether the volume it has been given is empty or out of date. If it is empty the worker downloads the latest checkpoint image; then, in either case, it replays the update log forward until it has caught up, before signalling ready and going into service.

If a database instance restarts it will be forced back onto the node with the persistent volume it last used, and it starts off with the state from where it left off, so it is usually very quick to catch up on updates.

It would be better to be able to dynamically allocate partitions rather than choosing from amongst some fixed size pre-allocated partitions. There are a variety of emerging tools for doing this, including TopoLVM, CSI LVM and many others. However, in our initial testing of this approach we didn’t find one that was both mature enough and fit our needs well enough, though they continue to improve.

The bigger issue here is the tight binding between a pod and the node that its volume is on. When a node dies and is replaced, or when we are updating a cluster (which entails replacing all the nodes), we need to force the existing store workers to drop their binding to the local volume on the old node and find new bindings. We’ve not been able to reliably automate that and typically manual intervention is needed, even if it’s just cleaning up out-of-date PersistentVolume definitions. So we’ve lost a significant part of the value of running in kubernetes in the first place – reliable, unattended self-repair.

Turns out there’s one more option to explore.

Option 4 – podium with fast emptyDir

Kubernetes has a notion that a pod can request an emptyDir volume. The volume will be storage of some sort, typically local storage, which starts off empty (hence the name). If the pod restarts it can restart on any node and will be given a new, empty volume – even if it lands back on the node it was previously on.

Given the design of podium, with the state stored durably in S3, it is fairly quick to bring a replica up to the current state starting from such an empty volume. So long as we maintain reasonably up to date checkpoint images, we are largely limited by the speed of downloading the checkpoint from S3 which, given the fast networks within the cloud provider, is pretty good. Unless we have a very high update rate the cost of replaying all changes since the last checkpoint is small. So the time to start a new replica from scratch can be quite modest – a few tens of seconds for small data up to a few minutes for bigger datasets. Assuming we have a good replication level that’s fine: the service will still be operating, and a few minutes of recovery after a node failure is acceptable.

The trouble is that by default this emptyDir will just be space on the node’s normal file system which will, in practice, be EBS. So we have convenience but still not the performance we need.

The trick is to map emptyDir onto fast local ephemeral storage. That turns out to be reasonably painless. We use Amazon’s managed kubernetes service, EKS. That allows us to define the configuration for each group of nodes declaratively, and among the settings we can provide is a script to run when the node starts up. We can use that script to create an LVM volume on the fast ephemeral storage and mount it where the node manager (kubelet) places all pod-specific data (/var/lib/kubelet/pods for current EKS).

So in our cluster definition file we can create node groups such as:

- name: managed-db
  instanceType: i3.large
  desiredCapacity: 3
  updateConfig:
    maxUnavailable: 1
  labels: { role: db }
  tags:
    nodegroup-role: managed
  iam:
    attachPolicyARNs: …
    withAddonPolicies: …
  preBootstrapCommands:
    # Create LVM volume group on the ephemeral disk and allocate 250G to emptyDir storage
    - "pvcreate /dev/nvme0n1"
    - "vgcreate vg01 /dev/nvme0n1"
    - "lvcreate -L 250G -n lvol01 vg01"
    - "mkfs -t ext4 /dev/vg01/lvol01"
    - "mkdir -p /var/lib/kubelet/pods"
    - "mount -t ext4 /dev/vg01/lvol01 /var/lib/kubelet/pods"
    - "if ! grep -q /var/lib/kubelet/pods /etc/fstab ; then echo '/dev/vg01/lvol01 /var/lib/kubelet/pods  ext4  defaults 0 0' >> /etc/fstab ; fi"

Then, when configuring a podium service, we can use a nodeSelector to ensure the worker pods only run on nodes labelled with the db role, and we know the pod data, particularly the emptyDir, will be on fast local storage.
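
As a sketch, the relevant fragment of a worker’s pod template might look like this (the container image and mount path are illustrative, not the actual podium configuration):

spec:
  # only schedule onto the node group prepared with fast ephemeral storage
  nodeSelector:
    role: db
  containers:
  - name: store
    image: store-implementation
    volumeMounts:
    - name: scratch
      mountPath: /var/lib/store/databases
  volumes:
  # emptyDir lives under /var/lib/kubelet/pods, which we mounted on the NVMe volume
  - name: scratch
    emptyDir: {}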

Now we have a good solution. Fast storage for the database. No tight binding of pods to nodes so automatic node replacement and repair works, with no manual intervention. 

The cost is that when a worker restarts it always reloads state from S3; it can’t take the shortcut of noticing that it is back on the same node as last time with the local state still intact. So the time to reload that state is the limiting factor. For very large datasets that time can grow from minutes to a significant part of an hour, which would make cluster updates awkwardly slow. So this is not a universal solution, but it works well for many cases.

Wrap up

So there’s no one-size-fits-all but a range of good options.

  • If you do already have access to a managed, resilient triple store then just use it but expose it within your cluster as an ExternalName type Service to ensure good decoupling.
  • If you don’t require absolutely the best performance but resilience and ease of management are your top priorities, then use the “standard” route of EBS persistent volumes. However, make sure you have enough nodes in each availability zone to avoid allocation deadlocks in the event of node failure.
  • Consider using dynamic EFS volumes instead of EBS for extra convenience in zone management and scaling but carefully examine the performance/size/pricing trade offs before committing.
  • If database performance is critical, and you can manage all your data updates through a persistent update manager such as podium, then use the fast emptyDir trick to get the performance of local ephemeral storage with full flexibility. Though if the data volumes get too large it will take a little while for a new replica to start, which could impact ops processes such as cluster updates.
  • If none of those fit then you may have to invent a different solution, but that’s OK, there’s always more options available once you understand the constraints of your particular situation.

Hope this was of interest. If you are faced with the challenge of designing or running data services and would like some help then do please contact us.

A #TechTalk post by Dave (our CTO)