All articles

Generating RDF from CSV – example using xlwrap

Although there is a dump of the RDF data from CKAN available on-line, it’s the full CKAN dataset and not the UK Government-specific set. For some ongoing experiments, I want some CKAN data in RDF for the published government data listed on data.hmg.gov.uk. The source data is published, however, in CSV and JSON format. This gave me an opportunity to try xlwrap, a tool from Andreas Langegger at JKU, which provides a way to write (mostly-)declarative transforms in an RDF vocabulary, then apply these transforms to a spreadsheet to generate RDF triples. XLwrap operates in two modes: either as a sparql endpoint, allowing you to wrap a spreadsheet and then interrogate it from any sparql client, or in embedded mode. Since I just wanted a one-shot tranformation of the CKAN CSV file to RDF, the embedded mode is what I used. The code is below, but the summary is that the basic process was very straightforward.

I did have one problem initially, which turned out to be XLwrap not parsing the CKAN CSV file correctly. I’ve logged a bug, but in the meantime an easy workaround was to import the CSV file into OpenOffice Calc, then re-export it as a .ods file. XLwrap has input filters for various spreadsheet formats, including Excel and OpenOffice. Once I was reading the right file format, developing the transformation was straightforward.

Other than the input format issue, I did hit a couple of limitations. To be fair, Andreas describes the tool as being early in its development, so I view these (and am going to submit them) as feature/support requests.

  • maven2 support would be nice
  • being able to explode a value into multiple triples (so one field containing tags “tag1 tag2” could become two triples: x dc:subject "tag1" and x dc:subject "tag2"
  • adding datatypes to generated literals (although this may be possible using user-defined functions, it would still be handy to have it built-in to the expression language).

Here’s a simple Java program that performs the transformation (comments stripped for brevity):

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import at.jku.xlwrap.exec.XLWrapMaterializer;
import at.jku.xlwrap.map.MappingParser;
import at.jku.xlwrap.map.XLWrapMapping;

import com.hp.hpl.jena.rdf.model.Model;

public class ConvertCSVtoRDF { 
   private static final Logger log = LoggerFactory.getLogger( ConvertCSVtoRDF.class ); 

   public static void main( String[] args ) {
      if (args.length < 1) { 
         System.out.println( "Usage java ConvertCSVtoRDF []" );
      } 
      else { 
         try {
             String source = args[0];
             OutputStream out = (args.length == 1) ? System.out : new FileOutputStream( args[1] ); 

             XLWrapMapping map = MappingParser.parse( source );

             Model m = new XLWrapMaterializer().generateModel(map);
             setPrefixes( m );

             m.write( out, "Turtle" ); 
         } catch (Exception e) { 
             log.error( e.getMessage(), e ); 
         } 
      } 
   }

   private static void setPrefixes( Model m ) { 
      m.setNsPrefix( "rdfs", "https://www.w3.org/2000/01/rdf-schema#" );
      m.setNsPrefix( "rdf", "https://www.w3.org/1999/02/22-rdf-syntax-ns#" ); 
      m.setNsPrefix( "xsd", "https://www.w3.org/2001/XMLSchema#" );
      m.setNsPrefix( "owl", "https://www.w3.org/2002/07/owl#" );
      m.setNsPrefix( "dc", "https://purl.org/dc/elements/1.1/" );
      m.setNsPrefix( "scv", "https://purl.org/NET/scovo#" );
      m.setNsPrefix( "ckan", "https://ckan.net/ns#" );
      m.setNsPrefix( "sdx", "https://www.epimorphics.com/vocab/sdx#" );
      m.setNsPrefix( "foaf", "https://xmlns.com/foaf/0.1/" );
   }
}

And here’s the current version of the transform itself:

@prefix rdfs: <https://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <https://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <https://www.w3.org/2001/XMLSchema#> .
@prefix owl: <https://www.w3.org/2002/07/owl#> .
@prefix dc: <https://purl.org/dc/terms/> .
@prefix xl: <https://purl.org/NET/xlwrap#> .
@prefix ckan: <https://ckan.net/ns#> .
@prefix sdx: <https://www.epimorphics.com/vocab/sdx#> .
@prefix foaf: <https://xmlns.com/foaf/0.1/> .
@prefix sioc: <https://rdfs.org/sioc/ns#> . 

{ # default graph in TriG file 
    [] a xl:Mapping ;
    xl:template [ 
      xl:fileName "file:src/main/data/hmg.ckan.net-20091204.ods" ;
      xl:templateGraph sdx:ckan ;
      xl:transform [ 
          a xl:RowShift ;
          xl:breakCondition "ALLEMPTY(A2)" 
      ] 
    ] .
} 

sdx:ckan { 
    # resource URI is https://ckan.org/package/rdf/package-name 
    [ xl:uri "'https://ckan.org/package/rdf/' & URLENCODE(A2)"^^xl:Expr ] 
      a ckan:Package ;
      foaf:isPrimaryTopicOf [ xl:uri "'https://ckan.org/package/' & URLENCODE(A2)"^^xl:Expr ] ;
      dc:title "B2"^^xl:Expr ;
      ckan:updateFrequency "C2"^^xl:Expr ;
      ckan:downloadUrl [ xl:uri "J2"^^xl:Expr ] ;
      ckan:source "K2"^^xl:Expr ;
      dc:subject "M2"^^xl:Expr ;
      sioc:has_creator "R2"^^xl:Expr ;
      dc:creator "R2"^^xl:Expr ;
      dc:rights "O2"^^xl:Expr ;
      foaf:homepage [ xl:uri "P2"^^xl:Expr ] ;
      dc:description "Q2"^^xl:Expr 
}

The transform works by repeatedly applying a template graph (sdx:ckan) to the current state of the spreadsheet, and then applying a transform (shift the rows up by one) until the A2 cell is empty. Thus while the core transform from cell data to RDF is fairly declarative, the control loop is rather more imperative in style. Still, it was simple to write (noting the limitations listed above). In summary, IĀ found XLwrap a useful tool, and certainly quicker than writing the transfrorms myself in raw Java or Ruby code.

#TechTalk