All articles

Data Autopsy by SPARQL – looking for things that don’t join up

Naval Photo Laboratory Sorting Through the Photos by army.arch (FlickR)

We have worked on a number of projects where we assemble a linked data set from data received in several CSV files. The CSV files are separately processed to generate RDF and the resulting RDF is poured into an RDF store. We generally inspect the RDF arising from each transformed CSV. One of the first things we look for in the resulting RDF assemblage is ‘dangling references’ – broken links.

This post is about the generic SPARQL queries we use to track down ‘dangling references’ and to develop some insight into the systemic causes of the breakage. This may be some misalignment of key URIs formed in the transformations process; whole tables of data missing from the process; aggressive pruning of source data or misaligned data export routines for different tables. We have encountered all of these situations.

The SPARQL queries below are generic in the sense that they are not specialised for any particular data model. Often we will tweak them to remove clutter (e.g. references to vocabularies that have not been included in the data load).

The basic probe

A dangling reference is just like a broken link on the web. In an RDF store it’s a reference to some ‘thing’ that should have a description in the store and, simply, doesn’t. The query below finds ‘things’ that are referenced by data in the store, but that are not described within that data.

PREFIX rdf: <https://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <https://www.w3.org/2000/01/rdf-schema#>


SELECT ?thing
WHERE {
 ## Find all references to 'things' that have URI (i.e. not blanks or literals)
 ?subject ?prop ?thing .
 FILTER isIRI(?thing)


 ## Include only 'things' that not described (i.e. that have no useful properties).
 FILTER NOT EXISTS {
 ?thing ?p ?o
 # include 'thing's that only have a { ?thing a rdfs:Resource } statement.
 FILTER (?p!=rdf:type || ?o!=rdfs:Resource)
 }
}
# LIMIT 10

First time through it’s probably useful to include a limit in order to provoke a quick result, for example in one recent investigation we found over 4 million dangling references to 340k distinct un-described ‘things’ as reported by the following query:

PREFIX rdf: <https://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <https://www.w3.org/2000/01/rdf-schema#>


SELECT (count(distinct ?thing) as ?things) (count(*) as ?references)
WHERE {
 ?subject ?prop ?thing .
 FILTER isIRI(?thing)
 FILTER NOT EXISTS {
 ?thing ?p ?o
 FILTER (?p!=rdf:type || ?o!=rdfs:Resource)
 }
 FILTER (!contains(str(?thing),'/def/'))
}

The heart of this query is unchanged from its predecessor. We’ve modified the SELECT clause to give a count of the number of dangling references and the distinct ‘things’ targeted by those references. We’ve added a filter in the tail of the query that filters out dangling references to vocabulary terms. Many of the vocabularies we work with include a .../def/... segment within their URI. Practices vary and you will likely need to adapt the trailing filters to your own situation.

4 million references and 340k undescribed things are too many to look at individually. We’d need to find some patterns.

Getting a clearer view

One approach is to look for the ‘things’ that are the target of the highest number of dangling references. While the heart of the query is unchanged, it has been altered count the number of dangling references to each ‘thing’. Output is ordered by number of dangling references high to low. It’s also potentially insightful to include a sample of a referring subject resource, and a sample of a property making the reference.

Limit is still handy to restrict the amount of output, but it is the nature of aggregation queries that limiting the output will rarely speed them up – so it may be time for some coffee.

PREFIX rdf: <https://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <https://www.w3.org/2000/01/rdf-schema#>


SELECT ?thing
 (count(?subject) as ?numRefs)
 (sample(?subject) as ?sampleRefferingSubject)
 (sample(?prop) as ?sampleRefferingProp)
WHERE {
 ?subject ?prop ?thing .
 FILTER isIRI(?thing)
 FILTER NOT EXISTS {
 ?thing ?p ?o
 FILTER (?p!=rdf:type || ?o!=rdfs:Resource)
 }
 FILTER ( strstarts(str(?thing),'https://environment.data.gov.uk') &&
 !contains(str(?thing),'/def/'))
} GROUP BY ?thing
 ORDER BY desc(?numRefs)
 LIMIT 10

Note that we’ve extended the trailing filter to further restrict the search to link targets that belong within a particular site.

The top ten items and associated counts will give a feel for the scale of the problem. We have found and corrected situations where there is missing: data set metadata, reference data. We’ve found edge cases where upstream or downstream references have crossed borders between England and Scotland or Wales – and data for entities on one side of the border has not been present.

In our 4m dangling references / 340k referenced things case, without a LIMIT clause this query would return 340k result lines. Still too many to work with. There is a need to summarise still further in order to investigate patterns.

Summarising using Model Concepts (Classes and Properties)

That there are several instances in the previous result of similar kinds of thing suggests that grouping by model/vocabulary constructs will be more revealing of pattern. The query retains the core pattern as all the previous queries. It also retains the filters added to previous queries. It adds:

OPTIONAL {
 ?subject a ?subjectClass
}
BIND (IF(bound(?subjectClass),?subjectClass, "Untyped-Subject") as ?class)

which captures the class (if any) of the entity making dangling reference. If the referring entity has no given class, the BIND assigns a dummy literal for the purposes of reporting. Lastly the SELECT variables and the trailing GROUP BY are altered to group and count class and referring
property.

PREFIX rdf: <https://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <https://www.w3.org/2000/01/rdf-schema#>


SELECT ?class ?prop
 (count(?thing) as ?numRefs)
 (sample(?thing) as ?exampleMissingThing)
 (sample(?subject) as ?exampleReferrer)
WHERE {
 { ?subject ?prop ?thing .
 FILTER isIRI(?thing)
 FILTER NOT EXISTS {
 ?thing ?p ?o
 FILTER (?p!=rdf:type || ?o!=rdfs:Resource)
 } 
 }


 OPTIONAL {
 ?subject a ?subjectClass
 }
 BIND (IF(bound(?subjectClass),?subjectClass, "Untyped-Subject") as ?class)


 FILTER ( strstarts(str(?thing),'https://environment.data.gov.uk/') &&
 !contains(str(?thing),'/def/'))


}
GROUP BY ?class ?prop
ORDER BY desc(?numRefs) ?class ?prop
# LIMIT 20

In our 4m/340k case removing the limit from this query gave a total of 61 rows to investigate further which is now much more tractable for the manual inspection necessary to identify and rectify the problems.

Conclusion

Looking for dangling references, particularly internal references, within a data set is a pretty basic check, akin to running a link checker over a web site. The queries presented here are pretty generic and can be used to investigate and diagnose internal dangling reference problems. They go beyond simply listing broken links and provide for some analysis with reference to the implicit data model as apparent from the data (rather than externally supplied), with reference counts that provide a basis to choose which problems to tackle first.

#TechTalk