
Epimorphics builds a data publishing platform for the Environment Agency

Epimorphics has developed a linked data publishing system for the UK Environment Agency in support of their Bathing Water data. Ian has described the application; this blog entry describes the data publishing platform.

The publishing system consists of the Bathing Water Data Explorer and a fault-tolerant, scalable data platform. We’d previously built a prototype system which ran on a single machine. Now the requirements are to provide flexible capacity and fault tolerance.

While the bathing water season runs from May to September, the data is available all year round, so flexible capacity is there to meet demand that is not evenly distributed throughout the year – there is more demand during sunny summer weather around national holidays than out of season in the winter.

Fault tolerance matters because the system is changing – the data is updated throughout the season, and systems can and do go wrong. Flexible capacity and fault tolerance both rely on being able to add and remove servers providing the data explorer and direct API access to the data.

To meet this, a clustered solution has been built using Apache Jena™ and its SPARQL server Fuseki, with data stored in TDB and presented through ELDA, which provides a Linked Data API onto the data.
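For a feel of what the platform serves underneath the Linked Data API, here is a minimal sketch of asking Fuseki a question over the standard SPARQL protocol. The endpoint URL, dataset name and vocabulary are placeholders for illustration, not the real service paths.

```python
# Query the SPARQL endpoint that Fuseki exposes over HTTP.
# Endpoint and vocabulary below are illustrative placeholders.
import requests

FUSEKI_QUERY = "http://localhost:3030/bwq/query"  # hypothetical dataset

query = """
PREFIX ex: <http://example.org/bwq/>
SELECT ?beach ?sample WHERE {
  ?beach ex:currentSample ?sample .
} LIMIT 10
"""

resp = requests.post(
    FUSEKI_QUERY,
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["beach"]["value"], row["sample"]["value"])
```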

EA Data Platform

Bathing water data is not static – there are weekly updates at each beach from measurements of water quality against certain predefined criteria, as well as year-by-year changes to designated beaches and details of the information gathered. If it were just static, then simple replication of machines behind a load balancer would meet the requirements. The data is small enough in size to be handled by one machine; it is the query processing and data formatting in the Linked Data API that mean the processing needed may exceed the limits of a single machine at peak times.

The basic idea of replicated, independent machines can still be used – there is no need for a single replicated database because the way the data is updated provides a way to design a simple system based on the open source software used in the prototype.

Updates during the bathing water season are additional triples to be added to the existing data. The core data is growing, not changing, on a week-by-week, or ideally day-by-day, basis as beach sampling analyses are performed. Any reorganization of the data only happens once a year, out of season.
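As a hedged sketch of what such an additive update might look like on the wire, here is a new weekly sample appended through Fuseki's SPARQL Update endpoint; the endpoint and property names are illustrative placeholders rather than the real bathing-water vocabulary.

```python
# A purely additive weekly update: new sample triples join the
# existing data, nothing already published is rewritten.
import requests

FUSEKI_UPDATE = "http://localhost:3030/bwq/update"  # hypothetical endpoint

update = """
PREFIX ex:  <http://example.org/bwq/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
  ex:sample-2012-W23-beach42 a ex:Sample ;
      ex:beach ex:beach42 ;
      ex:sampleDate "2012-06-04"^^xsd:date ;
      ex:complianceLevel ex:Good .
}
"""
requests.post(FUSEKI_UPDATE, data={"update": update}).raise_for_status()
```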

There is one additional piece to an update – there is a “current” view of the data. This changes as new data arrives, removing the old current view and adding in the new current view.
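The swap can be written as a single SPARQL Update, so each replica removes the old pointer and adds the new one in one transaction. A sketch, again with placeholder vocabulary:

```python
# Swap the "current" view in one update: delete the old pointer (if
# any) and insert the new one. Sent with the same kind of POST as the
# insert above.
swap_current = """
PREFIX ex: <http://example.org/bwq/>
DELETE { ex:beach42 ex:currentSample ?old }
INSERT { ex:beach42 ex:currentSample ex:sample-2012-W23-beach42 }
WHERE  { OPTIONAL { ex:beach42 ex:currentSample ?old } }
"""
```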

The master copy of the data resides on a controller server. Given new updates, this calculates the triples to be added, the triples to be removed (for data fixes), and the action needed to remove the old view and add the new view. The nature of the changes is that a change can be re-applied to end up with the same result – in the jargon, the changes are ‘idempotent’: whether you apply a change once or twice, the effect is the same. This is exploited so that when the system is adding machines, there is no need to stop updates for a while – the updates can be reapplied until the controller is sure all machines have received and executed all the updates.
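A quick way to see the idempotence in action is to apply the same swap twice to a small local graph, here with rdflib (my choice for illustration, not something the platform itself is said to use):

```python
# Applying the same "swap the current view" script twice leaves the
# graph in exactly the same state.
from rdflib import Graph

g = Graph()
g.update("""
PREFIX ex: <http://example.org/bwq/>
INSERT DATA { ex:beach42 ex:currentSample ex:sample-2012-W22-beach42 }
""")

swap = """
PREFIX ex: <http://example.org/bwq/>
DELETE { ex:beach42 ex:currentSample ?old }
INSERT { ex:beach42 ex:currentSample ex:sample-2012-W23-beach42 }
WHERE  { OPTIONAL { ex:beach42 ex:currentSample ?old } }
"""
g.update(swap)
g.update(swap)  # the second application changes nothing
assert len(g) == 1  # still exactly one currentSample triple
```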

The final piece of the design jigsaw is that the end user (a member of the public or a data scientist) is not explicitly told when updates happen. It does not matter if their view of the data is out of step for a few moments (at most tens of seconds) because they are not aware the data is being updated.

This all leads to a quite simple design, and simple systems are easier to operate.

There are a number of data publishing servers, typically two, so that there are two copies of the data available at any one time in case one server fails or becomes uncontactable. The servers are fronted on their local machine by Apache httpd, which restricts access to the database, blocks certain known overzealous robots, and routes requests within the machine – to static files and JavaScript, to ELDA, and to Fuseki. The cluster is fronted by a load balancer which spreads requests evenly across the currently active data servers.

There is a separate data controller which provides the administration user interface for data upload and then performs the coordinated changes to the replicas. All updates are performed by sending SPARQL Update scripts to each replica, and a change is applied transactionally at each replica. While an update is happening, the databases on different machines may be slightly different, but since the user is never told exactly what “current” means at any instant, they do not notice. In fact, updates occur quickly and in parallel, so the period of any inconsistency is a few seconds.
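A sketch of that dispatch step, on the assumption (the post does not spell this out) that the scripts travel as plain HTTP POSTs to each replica's update endpoint:

```python
# Send one SPARQL Update script to every replica in parallel, retrying
# any that fail. Idempotence makes a retry safe even if a POST reached
# the replica but the response was lost. Replica URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

REPLICAS = [
    "http://replica-1:3030/bwq/update",
    "http://replica-2:3030/bwq/update",
]

def post_update(url, script):
    # Fuseki applies the whole script transactionally at the replica.
    requests.post(url, data={"update": script}, timeout=30).raise_for_status()
    return url

def broadcast(script, replicas=REPLICAS, retries=3):
    pending = list(replicas)
    for _ in range(retries):
        with ThreadPoolExecutor(max_workers=len(pending)) as pool:
            futures = {pool.submit(post_update, url, script): url
                       for url in pending}
            pending = [url for fut, url in futures.items() if fut.exception()]
        if not pending:
            return
    raise RuntimeError("replicas out of step: %s" % pending)
```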

New data publishing replicas can be added at any time. A new machine is built, the latest backup of the database is installed, and then any updates made since that backup are applied. The machine is then ready to be added to the load balancer. Finally, after the machine has become active, any updates that happened while it was being added are run again, just in case, to ensure the new machine is in step.
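The catch-up step might look like the sketch below; the backup restore itself is a file-level copy of the TDB database and is not shown, and `pending_updates` stands in for the controller's log of scripts issued since that backup was taken.

```python
# Replay every update issued since the replica's backup. Called once
# before the machine joins the load balancer and once after: because
# the scripts are idempotent, replaying ones the replica has already
# seen live cannot corrupt it.
import requests

def catch_up(replica_update_url, pending_updates):
    for script in pending_updates:
        requests.post(replica_update_url, data={"update": script},
                      timeout=60).raise_for_status()
```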

When a busy period is over, a machine can be removed from the cluster simply by taking it out of the load balancer and switching it off.

This is all built on Amazon Web Services, using EC2 and Amazon load balancers. A separate prototype was also built using HAProxy as the load balancer, to ensure the system is not Amazon-specific.
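As one hedged illustration of the load-balancer step (the post names Amazon load balancers but not the API used, and the boto3 library below postdates this work), registering or deregistering an EC2 instance with a Classic ELB could look like:

```python
# Add or remove a data server from the Amazon load balancer. The
# library, load balancer name and region are assumptions for
# illustration only.
import boto3

elb = boto3.client("elb", region_name="eu-west-1")

def set_serving(instance_id, serving, lb_name="bwq-cluster"):
    op = (elb.register_instances_with_load_balancer if serving
          else elb.deregister_instances_from_load_balancer)
    op(LoadBalancerName=lb_name, Instances=[{"InstanceId": instance_id}])
```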