Data cube vocabulary

Sample diagram of QB ontology

People often think of linked data as being a way to connect descriptions of things into a graph of relationships, for example to link a school to the organization that oversees it. So how do we deal with data which is naturally presented as tables and charts, things like official statistics or environmental measurements? Is linked data of any use in that case? If so, how do we approach the problem?

Why?

In fact the same values that apply to linked data in general apply to such tabular data.

Firstly, it allows us to integrate, compare and slice across data sets. It enables us to publish the information model along with the raw data and to use common terms (identified by URIs) for the dimensions and units in our data sets and for the non-numeric values (e.g. for geographic regions or organizations). This in turn allows us to compare and link reliably across data sets in ways that are not possible with simple spreadsheets.

Secondly, by publishing as linked data we give web addresses to the data sets, data slices and individual observations. This opens everything out. It makes it possible to annotate, explain and qualify the values: for example, to explain why a given payment is unusual, or to reference the measurement methodologies for a particular year of samples to explain why they might not be comparable with another year. It allows us to link to provenance information about where the data came from, and allows re-users of our data to link back to our sources, laying down a trail of evidence to support decisions based on the data.

How?

We need a vocabulary or ontology which allows us to describe statistical or multi-dimensional data in such a way that we can gain those benefits. In particular, we need one that makes the associated information model clear and inspectable, that allows us to link out to other linked data resources as identifiers for units, dimensions and entities, and that is rich enough to represent the data faithfully yet easy to use and extend.

The best-known option for this in linked data is the SCOVO vocabulary. This is a great vocabulary, very simple and reusable, but it lacks several features that are needed for many of the use cases we find in public sector data sets - features such as an inspectable data model, data attributes, grouping/ordering of dimensions to organize data into slices and options for data compaction.

To address this, the data.gov.uk linked data programme sponsored an expert group to develop such a vocabulary, particularly one that would be suited to linked data publication of statistical and similar datasets. The group comprised Richard Cyganiak of DERI, Jeni Tennison of The Stationery Office and Dave Reynolds of Epimorphics, under the sponsorship of John Sheridan, The National Archives. We built upon the data modelling standards already in use in the worlds of statistical agencies (SDMX) and social and economic micro-statistics (DDI).

The end result was the Data Cube vocabulary - so called because it captures the core information model for a multi-dimensional cube of data, and does so in a way that is compatible with SCOVO, SDMX and DDI. The data cube captures the notion of a data set comprising observations organized along one or more dimensions, where each observation includes one or more measures and the values of those measures are interpreted according to some attributes. For example, a data set on population and demographics might have dimensions of time period, geographic region and sex, a measure of life expectancy and attributes defining the units, in this case years. The set of dimensions, attributes and measures is explicitly represented as a first class information object, the data structure definition that describes the cube. This in turn makes a data cube data set self-describing, enabling us to build tools that can automatically generate APIs and visualizations for a data set.
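To make the model concrete, here is a minimal Python sketch of the life expectancy example. It is an illustration only, not part of the vocabulary or of any tool described here. The qb: namespace and the terms qb:Observation and qb:dataSet are taken from the Data Cube vocabulary; everything else is an assumption made for the example: the example.org namespace, the component names, the class and function names, the sample values (which are not real statistics) and the simplification that attributes are always optional.

```python
from dataclasses import dataclass

QB = "http://purl.org/linked-data/cube#"   # Data Cube vocabulary namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EG = "http://example.org/ns#"              # hypothetical namespace for this sketch


@dataclass(frozen=True)
class DataStructureDefinition:
    """Hypothetical stand-in for a data structure definition."""
    dimensions: tuple
    measures: tuple
    attributes: tuple

    def components(self):
        return self.dimensions + self.measures + self.attributes

    def validate(self, observation):
        """Return a list of problems; an empty list means the observation fits."""
        problems = []
        # Every dimension and measure must be present (simplifying assumption:
        # attributes are treated as optional).
        for name in self.dimensions + self.measures:
            if name not in observation:
                problems.append(f"missing component: {name}")
        # Nothing may appear that the structure does not declare.
        for name in observation:
            if name not in self.components():
                problems.append(f"undeclared component: {name}")
        return problems


def observation_triples(obs_id, dataset_id, observation):
    """Express one observation as (subject, predicate, object) tuples."""
    subject = EG + obs_id
    triples = [
        (subject, RDF_TYPE, QB + "Observation"),
        (subject, QB + "dataSet", EG + dataset_id),
    ]
    for name, value in observation.items():
        triples.append((subject, EG + name, value))
    return triples


# The structure from the example in the text.
dsd = DataStructureDefinition(
    dimensions=("refPeriod", "refArea", "sex"),
    measures=("lifeExpectancy",),
    attributes=("unitMeasure",),
)

# One observation; the values are illustrative only.
obs = {
    "refPeriod": "2004-2006",
    "refArea": "Newport",
    "sex": "male",
    "lifeExpectancy": 76.7,
    "unitMeasure": "years",
}

problems = dsd.validate(obs)
triples = observation_triples("o1", "dataset-le1", obs)
```

Because the structure is an explicit object, the same generic code can check any observation against any cube, which is the sense in which a self-describing data set lets tools be built once and reused. A real publication would use URIs rather than strings for regions, periods and units.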

Uses

The Data Cube approach has proved very broadly applicable and easy to extend. It is straightforward to apply it directly to the original domain of government statistics. For example, we have worked with ESDToolkit to show how Data Cube can be used to represent metric data for local government. It also works very nicely in the area of environmental monitoring; see, for example, our work with the Environment Agency on Bathing Water Quality assessments. In some domains we have found that a useful pattern is to create an extension ontology that specializes the generic Data Cube notions to reflect the terminology of the domain more clearly; see, for example, our work on publication of local government payments.

Given these successes using Data Cube across a broad set of domains, we are actively looking at tooling which can better support Data Cube publication - including data conversion, validation and visualization. Contact us for more information.