Investigating VoID and CKAN

For some experiments we’re doing at the moment, I need some datasets from CKAN that also include VoID markup. Fortunately, the metadata from CKAN has been translated to RDF: there’s no official reference that I can find, but this thread describes the main points. Sadly, the CKAN RDF does not include any elements from VoID. My next hypothesis was that some existing datasets from CKAN might already have VoID metadata elsewhere. There are a couple of VoID browsers around, rkbexplorer from U. Southampton and voiD browser from Talis. Of the two, the Talis browser seems to have a more diverse set of VoID descriptions, so I decided to focus on that.

So my question became: is there any linkage between the data in the voiD browser and the data on CKAN? A SPARQL endpoint for CKAN is also hosted by Talis, so it should be possible to ask a federated query across the two stores. However, the Talis SPARQL endpoints don’t currently permit federated queries.

An alternative is to run one part of the federated query against a local model, making remote calls for the other part using SPARQL’s service keyword. It would work either way round, but since I want the CKAN data for other purposes, I decided to make CKAN the local model. As far as I could tell, there’s nowhere to download the CKAN RDF dataset in one step. However, for models of reasonable size, it’s possible to simulate a download using SPARQL’s construct verb: a query that constructs every triple it matches returns a copy of the whole store. This won’t work for large models if the queries have a time limit (which is the case for the Talis platform, quite reasonably). Here’s a simple Java program to locally cache a dataset from a SPARQL endpoint (most comments stripped for brevity):

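// Package names below are those of Jena 2.x; Apache Jena 3 and later use org.apache.jena instead
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.rdf.model.Model;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GetCKAN
{
    public static final String CKAN_SPARQL = "https://api.talis.com/stores/ckan/services/sparql";

    public static final String DEFAULT_DESTINATION = "src/main/resources/ckan.ttl";

    private static final Logger log = LoggerFactory.getLogger( GetCKAN.class );

    public static void main( String[] args ) {
        System.out.println( "beginning load from " + CKAN_SPARQL );
        // match every triple and construct a copy of it, i.e. download the whole store
        String sparql = "construct {?s ?p ?o} where {?s ?p ?o}";

        Query q = QueryFactory.create( sparql, Syntax.syntaxARQ );
        QueryExecution qe = QueryExecutionFactory.sparqlService( CKAN_SPARQL, q );
        Model m = qe.execConstruct();
        qe.close();
        System.out.println( "got model, starting serialization ..." );

        // manually set some known prefixes
        // (the FOAF and DC Terms namespace IRIs use the http scheme, not https)
        m.setNsPrefix( "foaf", "http://xmlns.com/foaf/0.1/" );
        m.setNsPrefix( "ckan-data", "https://ckan.net/#" );
        m.setNsPrefix( "dcterms", "http://purl.org/dc/terms/" );
        m.setNsPrefix( "ckan-ont", "https://ckan.net/ontology/" );

        String file = (args.length > 0) ? args[0] : DEFAULT_DESTINATION;

        // try-with-resources closes the stream even if serialization fails
        try (OutputStream out = new FileOutputStream( file )) {
            m.write( out, "Turtle" );
            System.out.println( "Serialization completed (" + m.size() + " triples)." );
        }
        catch (IOException e) {
            log.error( e.getMessage(), e );
        }
    }
}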
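For a store too large to fetch within the time limit, one possible workaround is to split the construct query into pages using limit and offset. The sketch below only builds the query strings, so it has no Jena dependency. The class and method names are hypothetical, the page size is arbitrary, and I have not tried this against the Talis platform. The order by clause is there because, without it, the endpoint is free to return triples in a different order for each page; sorting a large store may itself be expensive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the GetCKAN program above.
class PagedConstruct
{
    /** Builds the construct query for one zero-based page of triples. */
    static String pageQuery( int page, int pageSize ) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException( "page must be >= 0 and pageSize > 0" );
        }
        // use long arithmetic so that large page numbers do not overflow
        long offset = (long) page * pageSize;
        return "construct {?s ?p ?o} where {?s ?p ?o} " +
               "order by ?s ?p ?o limit " + pageSize + " offset " + offset;
    }

    /** Builds the queries for pages 0 .. count-1. */
    static List<String> firstPages( int count, int pageSize ) {
        List<String> queries = new ArrayList<String>();
        for (int page = 0; page < count; page++) {
            queries.add( pageQuery( page, pageSize ) );
        }
        return queries;
    }

    public static void main( String[] args ) {
        for (String q : firstPages( 3, 1000 )) {
            System.out.println( q );
        }
    }
}
```

Each query would then be run with QueryExecutionFactory.sparqlService as in GetCKAN, adding every page to a single Model and stopping at the first page that comes back empty.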

Now it’s a simple matter to query against ckan.ttl and match against the VoID datasets in the browser. To begin with, I was interested in any URLs that the two datasets have in common, hoping to refine this later to URLs representing interesting datasets. This query illustrates the use of the service keyword, which is part of ARQ’s extended query language, and which should be standardised under the current round of SPARQL Working Group activity:

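// Package names below are those of Jena 2.x; Apache Jena 3 and later use org.apache.jena instead
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CKANVoidJoin
{
    private static final Logger log = LoggerFactory.getLogger( CKANVoidJoin.class );

    public static void main( String[] args ) {
        // the local copy of CKAN written by GetCKAN
        Model m = FileManager.get().loadModel( GetCKAN.DEFAULT_DESTINATION );

        // the ckan-* prefixes repeat the namespaces set in GetCKAN
        String sparql = "prefix void: <http://rdfs.org/ns/void#>\n" +
                "prefix dc: <http://purl.org/dc/elements/1.1/>\n" +
                "prefix dct: <http://purl.org/dc/terms/>\n" +
                "prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "prefix foaf: <http://xmlns.com/foaf/0.1/>\n" +
                "prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "prefix ckan-ont: <https://ckan.net/ontology/>\n" +
                "prefix ckan-data: <https://ckan.net/#>\n" +
                "\n" +
                "SELECT DISTINCT ?dv ?p1 ?dc ?p2 ?u WHERE { \n" +
                // local part: any value ?u attached to a CKAN package
                "  ?dc a ckan-ont:Package ; \n" +
                "      ?p1 ?u.\n" +
                "\n" +
                // remote part: the same ?u used as an object in the voiD store.
                // The IRI of the voiD browser's SPARQL endpoint belongs between
                // SERVICE and '{', in angle brackets; it is not given here.
                "  SERVICE  {\n" +
                "      ?dv ?p2 ?u\n" +
                "  }\n" +
                "}";

        Query q = QueryFactory.create( sparql, Syntax.syntaxARQ );
        QueryExecution qe = QueryExecutionFactory.create( q, m );
        try {
            ResultSet rs = qe.execSelect();

            if (rs.hasNext()) {
                while (rs.hasNext()) {
                    QuerySolution qs = rs.nextSolution();
                    System.out.println( "Solution: " + qs );
                }
            }
            else {
                System.out.println( "No results!" );
            }
        }
        finally {
            qe.close();
        }
    }
}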
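Since the service keyword needs the remote endpoint’s IRI written out in angle brackets, it can be handy to build the query text from a parameter. Here is a minimal sketch of that idea using only the standard library. The class name, method name and the example.org endpoint are all hypothetical placeholders (the example endpoint is not the real voiD store), and the ckan-ont namespace is simply the one set in GetCKAN.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the CKANVoidJoin program above.
class ServiceQuery
{
    // Placeholder only: NOT the address of the Talis voiD store.
    static final String EXAMPLE_ENDPOINT = "http://example.org/void/sparql";

    /** Builds the join query with the remote pattern sent to the given endpoint. */
    static String joinQuery( String endpoint ) {
        // URI.create throws IllegalArgumentException for malformed input (spaces, '<', ...)
        URI uri = URI.create( endpoint );
        if (!uri.isAbsolute()) {
            throw new IllegalArgumentException( "endpoint must be an absolute IRI: " + endpoint );
        }
        return "prefix ckan-ont: <https://ckan.net/ontology/>\n" +
               "\n" +
               "SELECT DISTINCT ?dv ?p1 ?dc ?p2 ?u WHERE {\n" +
               // evaluated against the local model
               "  ?dc a ckan-ont:Package ;\n" +
               "      ?p1 ?u .\n" +
               // evaluated by the remote endpoint
               "  SERVICE <" + uri + "> {\n" +
               "      ?dv ?p2 ?u\n" +
               "  }\n" +
               "}";
    }

    public static void main( String[] args ) {
        String endpoint = (args.length > 0) ? args[0] : EXAMPLE_ENDPOINT;
        System.out.println( joinQuery( endpoint ) );
    }
}
```

The resulting string can be handed to QueryFactory.create with Syntax.syntaxARQ and executed against the local model exactly as above. Note that the pattern inside the service block is deliberately unconstrained, so the remote store has to cope with a very broad match.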

At the time of writing, the only URIs common to CKAN and the voiD browser were FOAF, MusicBrainz and DBpedia, which was a rather disappointing result. For a next step, we’ll have to look at automating the generation of VoID metadata from a given dataset.
