The data.gov.uk linked data project found itself in need of an ontology or vocabulary to support linked data publishing of government organizational information. A check for existing ontologies didn’t turn up an ideal solution, so we decided to develop one.
This is the first of a series of posts describing the process we went through to do so. It is not a tutorial on ontology design but hopefully will give some guidance and hints for people in a similar situation. It also documents what we did, which might help to answer “why did you do it that way?” questions in future. 🙂
For anyone looking for a tutorial on ontology design, I’d recommend:
- as an introduction, use the classic Ontology 101
- for high-level design criteria, see the five core principles in Gruber
- for a methodology that is based on philosophical principles but leads to practical issues to consider, see OntoClean
- for design patterns to help keep a large ontology modular and maintainable, see Rector
- for further study, look at Sowa; Semantic Web for the Working Ontologist gets good reviews but I’ve not read it yet
First we need to figure out what the ontology has to cover. For this we need to both scope the domain and develop some informal or formal “competency questions”.
Aside: When developing a data schema you need to know a lot of details of the application so that you can optimize the schema for the data access paths the application needs. An ontology is not a data schema; it is supposed to directly model the domain concepts and be independent of low-level representation issues. So it’s tempting to think that once you know the domain you can just start modeling. However, an ontology is never some neutral Platonic ideal. It is coloured by your modeling perspective, and there are always lots of decisions about how deeply to model something. In this case we are developing a vocabulary which will directly define the RDF representation for some data, not a pure ontology, so ease of query is indeed a factor.
In this case the domain is organizational structure. The intended application is for publishing UK Government information but we want the ontology to be independent of the details of government structure. The aim is to have a simple, reusable core which can be specialized to particular organizational situations.
The concepts we need to cover include:
- organizational structure
  - notion of an organization
  - decomposition into sub-organizations and groups
  - maybe some relationships between organizations
- reporting structure
  - people reporting structure within an organization
  - roles, relationship between person and organization, such as “head of”
  - attributes of those roles
  - people fulfilling the roles and attributes of that (e.g. salary)
- location information
  - sites, buildings, locations within buildings
  - virtual locations (registered addresses, virtual offices)?
We might also want:
- organizational purpose and responsibilities
- organizational history (merger, renaming, repurposing)
Initial competency questions are probably things like:
- who is the head of the organization The National Archives?
- where are they located?
- what was their official salary in 2009?
- who reports to Prith Banerjee within HP Labs?
- who is the head of the IT department within HP Labs Bristol?
- list all geographies in which HP Labs operates
- is Distinguished Technologist in HP Labs a permanent tenured position or a rotating appointment?
- which government department was previously responsible for universities?
- which departments, if any, were merged to create DCSF?
[I’ve used concrete names to make the questions easier to read but the specific organizations and roles will come from the specializations, not the core ontology.]
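One way to keep competency questions honest is to write a few of them down as small queries over toy instance data and check that the vocabulary can answer them. The following is only a minimal sketch in Python, using plain tuples rather than a real RDF store. Every identifier in it (`ex:headedBy`, `ex:basedAt`, `ex:reportsTo`, `ex:subOrganizationOf`, `ex:ExampleOrg` and the people) is a made-up placeholder, not a term from the ontology described in this series.

```python
# Toy instance data as (subject, predicate, object) triples.
# All identifiers are hypothetical placeholders for illustration only.
TRIPLES = {
    ("ex:ExampleOrg", "rdf:type", "ex:Organization"),
    ("ex:ExampleOrg", "ex:headedBy", "ex:alice"),
    ("ex:ExampleOrg", "ex:basedAt", "ex:MainSite"),
    ("ex:ITDept", "ex:subOrganizationOf", "ex:ExampleOrg"),
    ("ex:ITDept", "ex:headedBy", "ex:bob"),
    ("ex:bob", "ex:reportsTo", "ex:alice"),
    ("ex:carol", "ex:reportsTo", "ex:alice"),
}


def objects(subject, predicate):
    """Return all objects for a given subject and predicate, sorted."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def subjects(predicate, obj):
    """Return all subjects for a given predicate and object, sorted."""
    return sorted(s for s, p, o in TRIPLES if p == predicate and o == obj)


# Competency question: who is the head of organization X?
def head_of(org):
    return objects(org, "ex:headedBy")


# Competency question: where is organization X located?
def location_of(org):
    return objects(org, "ex:basedAt")


# Competency question: who reports to person P?
def reports_of(person):
    return subjects("ex:reportsTo", person)


if __name__ == "__main__":
    print(head_of("ex:ExampleOrg"))
    print(location_of("ex:ExampleOrg"))
    print(reports_of("ex:alice"))
```

If a question cannot be phrased as a short lookup like this, that is a hint that either the question or the model needs another look.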
In terms of design criteria we are interested in qualities such as:
- focused and modular, with minimal additional ontological commitments; it should be possible to reuse the ontology with a range of others without conflict
- small and easy to learn
- extensible so that we can derive specialized versions for particular organizational settings
- should lead to instance data that is easy to query, especially via the Linked Data API; this argues against heavy use of n-ary relationships
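To make the last point a little more concrete, here is a rough Python sketch of why n-ary relationships make queries heavier. It compares a direct binary property with the same fact expressed through an intermediate node. The property names (`ex:headedBy`, `ex:organization`, `ex:role`, `ex:heldBy`, `ex:validInYear`) and the instance names are all invented for this illustration and are not the vocabulary we ended up with.

```python
# Hypothetical identifiers throughout; plain tuples stand in for RDF triples.

# Binary style: one triple links the organization directly to its head.
BINARY = {
    ("ex:ExampleOrg", "ex:headedBy", "ex:alice"),
}

# N-ary style: an intermediate node ties together organization, role and
# person, and gives somewhere to hang extra attributes such as a year.
NARY = {
    ("ex:post1", "ex:organization", "ex:ExampleOrg"),
    ("ex:post1", "ex:role", "ex:Head"),
    ("ex:post1", "ex:heldBy", "ex:alice"),
    ("ex:post1", "ex:validInYear", "2009"),
}


def head_binary(triples, org):
    """One-hop lookup: follow a single property from the organization."""
    return sorted(o for s, p, o in triples if s == org and p == "ex:headedBy")


def head_nary(triples, org):
    """Three-step join: find posts in the org, keep head posts, get holders."""
    posts = {s for s, p, o in triples if p == "ex:organization" and o == org}
    head_posts = {
        s for s, p, o in triples
        if s in posts and p == "ex:role" and o == "ex:Head"
    }
    return sorted(o for s, p, o in triples if s in head_posts and p == "ex:heldBy")


if __name__ == "__main__":
    print(head_binary(BINARY, "ex:ExampleOrg"))
    print(head_nary(NARY, "ex:ExampleOrg"))
```

Both functions give the same answer, but the n-ary version needs a join over the intermediate node. The trade-off is that the intermediate node is the natural place for attributes such as a salary in a given year, so some n-ary modeling may be hard to avoid entirely.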
This first pass just gives us a rough sketch. As we start to pin things down with the stakeholders, the domain scoping and competency questions should evolve to reflect their needs.