Bathing Water Quality as Linked Data

Is it safe to go back in the water? In the movie Jaws 2 the risk to avoid was a giant shark, but fortunately this is not a threat we have to worry about in the UK. There are, however, other reasons to be careful. One of these is water quality: we would prefer to swim in water that is relatively clean, and not contaminated by sewage. For England and for Wales, the duty of monitoring bathing waters to assess water quality falls to the Environment Agency of England and Wales (EA). Under the aegis of the European Bathing Water Directive, EA staff members collect weekly water samples during the May to September bathing season, and test those samples for compliance with clean water regulations. Currently this data is collated for reports at the EU level, and results are also published on the EA web site. However, the data itself has not been directly available outside the EA organization. EA wanted to make the data available directly, both for transparency reasons and to encourage new and creative uses of the data. We designed and implemented a linked data system that would allow the water quality data to be made available, in a timely fashion, for anyone to re-use.

So why use a Linked Data approach to publishing the EA’s water quality data? To answer that, let’s back up a bit and review what linked data actually is. Data is, when you strip it down, one of two things: numbers or words. For example: 36, 2011 or “Lulworth Cove”. These data elements are hard to interpret by themselves: we need some context, some additional information. 36 could be the number of points in an international music competition, or the age of a TV presenter, or many other things. In this particular case, the value I’m thinking of is the concentration, in colonies per 100ml, of faecal streptococci found in a water sample taken at Lulworth Cove on August 2nd, 2011. I have to convey those values – the units, the type of thing being measured, the time and location – to allow someone else to make sense of the measurement and conclude that it’s a low reading, and indicates nice clean water. The question then is how to convey that information: the association between the data itself (36, etc.) and the metadata that gives it meaning (units, measure type, etc.).

Linked data addresses this central issue not by changing the data itself – at root, data is still words and numbers – but by changing the way that we interact with the data, especially via computer programs. In particular:

  1. every data resource, such as a particular water quality reading, is given a unique identity (known as a URI);
  2. that unique identity is based on HTTP, the widely-used network protocol used by web browsers to fetch web pages for display;
  3. properties of resources, such as their type, scale, units, provenance, last updated time, etc., are represented using the Resource Description Framework, RDF, in which values are connected together in networks of named links, called properties;
  4. the names of the properties, and also resource types, classification codes, etc., are also all given web identifier URIs. This means that they too can be fetched (technically the term we use is resolved) using HTTP. The information returned when resolving, say, an RDF property gives a clear semantic meaning to the property.
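
To make the four points above a little more concrete, here is a minimal sketch in plain Python of how the Lulworth Cove reading might be held as a set of named links. It is only an illustration: the property names under example.org and the sample URI are hypothetical placeholders that I have made up, not the vocabulary EA actually publishes. Only the bathing water URI is the real one.

```python
# Hypothetical vocabulary namespace, used purely for illustration.
EX = "http://example.org/def/"

# The bathing water URI is the real one for Lulworth Cove;
# the sample URI is a made-up placeholder.
LULWORTH = "https://environment.data.gov.uk/id/bathing-water/ukk2204-20000"
SAMPLE = "http://example.org/id/sample/lulworth-2011-08-02"

# Each statement is a (subject, property, value) triple. The bare number 36
# only becomes meaningful through the named links that surround it.
triples = [
    (SAMPLE, EX + "bathingWater", LULWORTH),
    (SAMPLE, EX + "sampleDate", "2011-08-02"),
    (SAMPLE, EX + "determinand", "faecal streptococci"),
    (SAMPLE, EX + "unit", "colonies per 100ml"),
    (SAMPLE, EX + "value", 36),
]


def describe(subject, statements):
    """Collect every property of `subject` into a dict keyed by property URI."""
    return {prop: value for subj, prop, value in statements if subj == subject}


description = describe(SAMPLE, triples)
print(description[EX + "value"], description[EX + "unit"])
```

Because the subject, the properties and the bathing water are all identified by URIs, anyone else can add their own triples about the same resources without any central coordination.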

Using URIs for data has a number of advantages. A resource identifier such as https://environment.data.gov.uk/id/bathing-water/ukk2204-20000 provides a globally unique identifier for a particular bathing water, in this case Lulworth Cove in Dorset. Because the identifier is based on HTTP, I can follow that link, in a program or by clicking on it in a browser, to get a representation of the resource. The link is set up so that the representation returned can be varied; by default it returns human-readable HTML. However, by requesting an alternative representation format, such as JSON or Turtle, other formats can be delivered. The following links present the same resource as above, but in JSON, Turtle, or XML formats. The point of these different formats is that they describe the same underlying resource, but using different encodings that might be easier for a programmer to use. Web developers, for example, often like to use JSON because that format can be easily handled by JavaScript programs. It’s not just the bathing waters themselves that have URIs: the water sample I referred to above has its own URI. It’s quite a long URI, since it has to capture the characteristics that distinguish it from other similar samples at the same site, or at different sites on the same day, so I’ve shortened it to https://goo.gl/DqKCQ. However, if you click through to that page, you’ll see the full URI, and links on the top right of the page to also view the sample data in JSON, Turtle, etc.
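
One common way for a program to ask for an alternative representation is to send an HTTP Accept header with the request. The sketch below, using only Python's standard library, shows the idea. The media types are the standard ones for each format, but exactly which of them the server honours is an assumption on my part, and the helper names are my own.

```python
import urllib.request

# The Lulworth Cove identifier quoted in the text.
RESOURCE = "https://environment.data.gov.uk/id/bathing-water/ukk2204-20000"

# Standard media types; whether the server honours each one is an assumption.
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(uri, fmt="html"):
    """Prepare, but do not send, a request asking for the given format."""
    return urllib.request.Request(uri, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(uri, fmt="html"):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_request(uri, fmt), timeout=30) as response:
        return response.read().decode("utf-8")


# Same URI every time; only the requested encoding differs.
request = build_request(RESOURCE, "turtle")
print(request.full_url, request.get_header("Accept"))
```

Calling `fetch(RESOURCE, "json")` would then attempt the download itself. The point is that the identifier never changes: one URI names the bathing water, and the format is negotiated separately.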

The web domain used in these identifiers is environment.data.gov.uk. It suggests that the domain owner – EA in this case – can make authoritative statements about the resource. But it is also true that anyone can make a statement – using RDF – about that bathing water. I might, for example, publish a list of bathing water sites I have visited with my family and whether we liked them or not. By using the official identity URI, it is much more likely that my data can be found, and re-used, by other people.

It’s this practice of connecting resources together using named, meaningful links that gives linked data both its name and its power. It would be quite possible, for example, for someone to publish data about the times of year that given bathing water sites are open to dog walkers or horse riders, using the same reference identifiers to unambiguously identify the beach locations. Someone else might then write a smartphone app which lists bathing waters that are both clean and dog-friendly (or dog-free), and perhaps further link to weather or tide data. Or recommended nearby pubs and restaurants! Once the reference data has been created, the possibilities to extend and refine it with additional data sources are extensive.

A key enabler to unlocking the potential of linked data is to make life easy for software developers. End-users typically don’t want raw data; they prefer it nicely presented and easy to understand. Their goal, of course, is not just to observe data, but to interpret it and make decisions. The question “what was the most recent faecal streptococci count?” is implicitly part of “how clean is the water?” which is probably really part of “shall we take the kids to the beach today?”. Making it easier for developers to create web pages and apps that help end-users to make those kinds of decisions was an important goal for EA. While the so-called follow your nose approach to linked data – following the links in the data to see where they lead – does work, app writers really need more support than that. The query language SPARQL is a powerful and general-purpose tool for accessing RDF data, but having to learn SPARQL is an extra burden for developers. We elected instead to use an API approach, in which a collection of HTTP-accessible end-points provide a programming interface that web developers can make use of easily. In particular we use the Linked Data API (LDA) to provide a programming interface to the data. The LDA uses well-established conventions for accessing the details of individual data resources, and of collections of resources. For example, the following link returns a list of all bathing waters, in JSON, five at a time:

https://environment.data.gov.uk/id/bathing-water.json?_pageSize=5
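
A developer would rarely type such links by hand. As a rough sketch, the URL above could be assembled like this in Python. The `_pageSize` parameter is taken from the link itself; `_page`, for asking for a later page, is the Linked Data API's paging parameter as I understand it, so treat it as an assumption to check against the API documentation. The function name is my own.

```python
from urllib.parse import urlencode

# The list end-point shown above, with the .json suffix selecting JSON output.
LIST_ENDPOINT = "https://environment.data.gov.uk/id/bathing-water.json"


def page_url(page_size=5, page=None):
    """Build the URL for one page of the bathing water list.

    `_pageSize` comes from the example link; `_page` is assumed to select
    a later page and should be confirmed against the API documentation.
    """
    params = {"_pageSize": page_size}
    if page is not None:
        params["_page"] = page
    return f"{LIST_ENDPOINT}?{urlencode(params)}"


print(page_url(5))
print(page_url(5, page=2))
```

Keeping the paging details in one small function means the rest of an app can simply ask for "the next five bathing waters" without caring how the URL is spelled.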

From easily-understood pieces like this (which, incidentally, are documented at some length), developers can build quite feature-rich user experiences. We are aware of at least one app currently in Apple’s iPhone store which makes use of the data. As part of our work for EA, we also implemented a reference application: the Bathing Water Data Explorer, which allows users to search for bathing waters of interest by name, county and postcode, and then view all of the details, including detailed water sample history, for that location. While this application is intended to be useful to end-users, its main goal is to illustrate to other developers the range of bathing water quality data now available to use – completely free of charge – through the API.

Conclusion

Bathing water quality is one of a number of datasets curated by the Environment Agency, which they would like to make more accessible to the public and the developer community. We created a publishing system so that weekly bathing water quality sample updates can be incrementally published by EA staff, and made available as linked data for consumption via API calls. To illustrate one use of the data, the Bathing Water Data Explorer acts as a reference application. EA hopes that this, together with extensive developer documentation, will encourage a broad range of innovative uses of the data.