Preparation of the data

To transform the Embarkation Roll data from printed form into a digital form that we can query, we worked through the following steps:

  1. Digitisation to TEI XML
  2. Augmentation and conversion to custom XML
  3. Splitting into component XML files
  4. Transform XML to RDF N-Triples
  5. Concatenate the results
  6. Loading into Sesame
  7. Querying Sesame using Open RDF Sesame Workbench

Digitisation to TEI XML

As part of the project to digitise the Embarkation Roll data for inclusion in Cenotaph, TEI files were prepared, with each row on the rolls looking somewhat like this:

<row>
  <cell>8/3186</cell>
  <cell>Private</cell>
  <cell><name type="person"><seg type="family">Berridge</seg><seg type="given">John</seg></name></cell>
  <cell>Eighth</cell>
  <cell>Otago Infantry Batln.</cell>
  <cell>S.</cell>
  <cell>Tauranga</cell>
  <cell>Auckland</cell>
  <cell>W. C. Berridge (father), 24 Seymour St., Ponsonby, Auckland.</cell>
</row>
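
To illustrate how such a row can be read programmatically, here is a minimal Python sketch using only the standard library. The column labels are guesses made for this example (the TEI itself does not name the columns), and real TEI files may carry a namespace that this sample row does not.

```python
import xml.etree.ElementTree as ET

ROW = """<row>
  <cell>8/3186</cell>
  <cell>Private</cell>
  <cell><name type="person"><seg type="family">Berridge</seg><seg type="given">John</seg></name></cell>
  <cell>Eighth</cell>
  <cell>Otago Infantry Batln.</cell>
  <cell>S.</cell>
  <cell>Tauranga</cell>
  <cell>Auckland</cell>
  <cell>W. C. Berridge (father), 24 Seymour St., Ponsonby, Auckland.</cell>
</row>"""

# Hypothetical labels: assumptions for this sketch, not names from the source.
COLUMNS = ["serial", "rank", "name", "reinforcement", "unit",
           "marital_status", "place_1", "place_2", "next_of_kin"]

def parse_row(row_xml):
    """Map the positional cells of one TEI row onto labelled fields."""
    row = ET.fromstring(row_xml)
    record = {}
    for label, cell in zip(COLUMNS, row.findall("cell")):
        if label == "name":
            # The name cell holds typed segments (family, given).
            for seg in cell.iter("seg"):
                record[seg.get("type")] = (seg.text or "").strip()
        else:
            record[label] = "".join(cell.itertext()).strip()
    return record

record = parse_row(ROW)
print(record["serial"], record["family"], record["given"])
```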
        

Augmentation and conversion to custom XML

These TEI files were then processed by a Perl script, which mapped the various cells to named elements and augmented them with other data, notably the latitude and longitude values for any recognised geographical names. The results were described using a custom XML grammar.

<group number="1" status="unmatched" match="">
  <source id="id-35181" file="batch 1.dmp" type="person" completeName="yes">
    <terms/>
    <assembledName>Private Kenneth Ferris Abbot</assembledName>
    <displayName/>
    <given>Kenneth Ferris</given>
    <family>Abbot</family>
    <serials>
      <serial>12/2545</serial>
    </serials>
    <ranks>
      <rank>Private</rank>
    </ranks>		
    <addresses>
      <address>
        <street>20 Dominion Road</street>
        <location></location>
        <area lat="-36.85" long="174.783333">Auckland</area>
        <country lat="-42" long="174">New Zealand</country>
      </address>
    </addresses>
    <datesOfEmbark>
      <dateOfEmbark>13 June 1915</dateOfEmbark>
    </datesOfEmbark>
    <units>
      <unit>Auckland Infantry Battalion</unit>
    </units>
    <transports>
      <transport>HMNZT 24</transport><transport>HMNZT 25</transport><transport>HMNZT 26</transport>
    </transports>
    <dates>
      <datesOfBirth>
        <dateOfBirthTerminusPostQuem>1849</dateOfBirthTerminusPostQuem>
        <dateOfBirthTerminusAnteQuem>11 November 1903</dateOfBirthTerminusAnteQuem>
      </datesOfBirth>
      <datesOfDeath>
        <dateOfDeath>22 September 1916</dateOfDeath>
        <dateOfDeathTerminusPostQuem>28 June 1914</dateOfDeathTerminusPostQuem>
        <dateOfDeathTerminusAnteQuem>22 September 1916</dateOfDeathTerminusAnteQuem>
      </datesOfDeath>
    </dates>		
    <nextOfKins>
      <nextOfKin>
        <assembledNextOfKin>Mr R.T. Abbot (father), 20 Dominion Road, Auckland, New Zealand</assembledNextOfKin>
        <assembledName>Mr R.T. Abbot</assembledName>
        <title>Mr</title>
        <given>R.T.</given>
        <family>Abbot</family>
        <relationship>father</relationship>
        <care></care>
        <street>20 Dominion Road</street>
        <location></location>
        <area lat="-36.85" long="174.783333">Auckland</area>
        <country lat="-42" long="174">New Zealand</country>
      </nextOfKin>
    </nextOfKins>
    <placesOfDeath>
      <placeOfDeath>
        <area></area>
        <country lat="44.633333" long=".45">France</country>
        </placeOfDeath>
    </placesOfDeath>
    <notes>Serials: 12/2545</notes>
    <url>35181.detail</url>
  </source>
</group>
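
As a rough illustration of how the geographical augmentation can be read back out of this custom XML, the following Python sketch collects every element that carries both a lat and a long attribute. The trimmed SAMPLE and the function name geocoded_places are assumptions made for the example; the original processing was done in Perl.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the record above, kept short for the example.
SAMPLE = """<group number="1" status="unmatched" match="">
  <source id="id-35181" type="person">
    <addresses>
      <address>
        <street>20 Dominion Road</street>
        <area lat="-36.85" long="174.783333">Auckland</area>
        <country lat="-42" long="174">New Zealand</country>
      </address>
    </addresses>
    <placesOfDeath>
      <placeOfDeath>
        <area></area>
        <country lat="44.633333" long=".45">France</country>
      </placeOfDeath>
    </placesOfDeath>
  </source>
</group>"""

def geocoded_places(xml_text):
    """Return {place name: (latitude, longitude)} for geocoded elements."""
    root = ET.fromstring(xml_text)
    places = {}
    for element in root.iter():
        lat, lon = element.get("lat"), element.get("long")
        # Skip empty placeholders such as <area></area>.
        if lat is not None and lon is not None and element.text:
            places[element.text.strip()] = (float(lat), float(lon))
    return places

places = geocoded_places(SAMPLE)
for name, (lat, lon) in places.items():
    print(name, lat, lon)
```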
        

Splitting into component XML files

The preceding work produced twenty large XML files, each containing details of around 5,000 personnel. An XSL transform was written to convert these into RDF; however, it quickly became clear that the size of each XML file (around 12MB) made parsing painfully slow (around two days to process one of the twenty files), and this needed to be sped up. The solution was to break each of the twenty XML files into around 100 smaller files of around 100KB each, so that all 2,000 component XML files could be processed within an hour or so.

The script below makes use of xml_split, a useful tool for splitting XML files. If we were to split an XML file arbitrarily at any given line, the result would be malformed; xml_split avoids this by choosing a suitable place to split given the supplied size directive.

#!/bin/bash
#
# Split each of the input XML files into XML files of about 100Kb each.
# This will produce about 100 output XML files for each input XML file,
# and these should be faster to transform using xsltproc.
#
for i in ./xml.in/*.xml
  do echo "Processing $i ..."
  cp "$i" ./split.tmp/
  cd split.tmp

  for i in *.xml
    do echo "Splitting $i ..."
    xml_split -s100Kb "$i"
    rm -f "./$i"
    for f in *.xml
      do echo "Cleaning up $f ..."
      sed 's||http://muse.aucklandmuseum.com/databases/cenotaph|' <"$f" >"$f.tmp"
      sed 's|||' <"$f.tmp" >"$f"
      mv "$f" ../split.out/
    done
  done
  cd ..
done
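
The idea of splitting only at element boundaries can be sketched in a few lines of Python. This is a simplified illustration under a stated assumption (records are direct children of the root element), not a description of how xml_split itself works; the function name split_by_size and the tiny sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

def split_by_size(xml_text, max_bytes):
    """Group the root's children into well-formed chunks of roughly max_bytes."""
    root = ET.fromstring(xml_text)
    chunks = []
    current = ET.Element(root.tag, dict(root.attrib))
    size = 0
    for child in root:
        child_size = len(ET.tostring(child))  # serialised size in bytes
        # Start a new chunk when adding this child would exceed the budget.
        if len(current) > 0 and size + child_size > max_bytes:
            chunks.append(ET.tostring(current, encoding="unicode"))
            current = ET.Element(root.tag, dict(root.attrib))
            size = 0
        current.append(child)
        size += child_size
    if len(current) > 0:
        chunks.append(ET.tostring(current, encoding="unicode"))
    return chunks

# Ten small records standing in for the real <group> elements.
SAMPLE = "<groups>" + "".join(
    f'<group number="{n}"><family>Name{n}</family></group>' for n in range(10)
) + "</groups>"

chunks = split_by_size(SAMPLE, 120)
print(len(chunks), "chunks")
```

Because every chunk is wrapped in its own copy of the root element, each one parses on its own, which is the property a line-based split cannot guarantee.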
      

Transform XML to RDF N-Triples

Although RDF can be serialised in various formats, N-Triples was used: although it is one of the most verbose formats, it is the easiest to both create and parse programmatically. The actual XSL transform can be found here.

#!/bin/bash
#
# Run the transform across each file to convert our XML to N-Triples RDF

cd split.out
for i in *.xml
 do echo "Processing $i ..."
 xsltproc ../make_rdf.xsl "$i" > "../rdf.out/$i.nt"
done
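
To show why N-Triples is easy to create programmatically, here is a small Python sketch that writes one statement per line. It is not the XSL transform used in the project; the helper names iri, literal and triple are invented for the example, and only the most common string escapes are handled.

```python
BASE = "http://muse.aucklandmuseum.com/databases/cenotaph#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def iri(value):
    """Wrap an IRI in angle brackets."""
    return f"<{value}>"

def literal(text, datatype=None):
    """Quote a literal, escaping backslashes, quotes and line breaks."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    quoted = f'"{escaped}"'
    return f"{quoted}^^<{datatype}>" if datatype else quoted

def triple(subject, predicate, obj):
    """One N-Triples statement: three terms followed by a full stop."""
    return f"{subject} {predicate} {obj} ."

person = iri(BASE + "personnel_35181")
lines = [
    triple(person, iri(RDF_TYPE), iri(BASE + "personnel")),
    triple(person, iri(BASE + "family"), literal("Abbot")),
    triple(person, iri(BASE + "serial"), literal("12/2545")),
]
print("\n".join(lines))
```

Each statement is complete on its own line, with no nesting or prefixes to track, which is also what allows the output files to be concatenated safely.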
      

Concatenate the results

To make the files of N-Triples easier to handle during ingest, the 2,000 files from the previous step were then concatenated back into a series of twenty files.

#!/bin/bash
#
# Concatenate the component RDF files back into a series of 20 files.

# Example input file name: unmatched-awm-batch 1-9-67.xml.nt

for j in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
  do echo "Processing files for iteration $j ..."
  for i in ./rdf.out/*\ 1-$j-*.xml.nt
    do
    cat "$i"
  done > "./awm.nt/awm-$j.nt"
done
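
The grouping performed by the shell glob can also be sketched in Python. This assumes, from the glob and the example file name in the script, that the middle number identifies one of the twenty input files and the last number the split piece; the function name group_by_batch and the sample names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches names such as "unmatched-awm-batch 1-9-67.xml.nt".
PIECE = re.compile(r" 1-(\d+)-(\d+)\.xml\.nt$")

def group_by_batch(names):
    """Return {batch number: [file names ordered by piece number]}."""
    batches = defaultdict(list)
    for name in names:
        match = PIECE.search(name)
        if match:
            batch, piece = int(match.group(1)), int(match.group(2))
            batches[batch].append((piece, name))
    return {batch: [name for _, name in sorted(pieces)]
            for batch, pieces in sorted(batches.items())}

names = [
    "unmatched-awm-batch 1-9-67.xml.nt",
    "unmatched-awm-batch 1-9-7.xml.nt",
    "unmatched-awm-batch 1-10-2.xml.nt",
]
grouped = group_by_batch(names)
for batch, files in grouped.items():
    print(batch, files)
```

The sketch orders pieces numerically, whereas the shell glob orders them lexicographically. Either works here, because an RDF graph is a set of statements and the order of lines in an N-Triples file carries no meaning.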
      

The resulting N-Triples

A full set of RDF triples for a given person looks somewhat like the following:

<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#personnel> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#isa> <http://muse.aucklandmuseum.com/databases/cenotaph#person_107580> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph/#personnelUri> <http://muse.aucklandmuseum.com/databases/cenotaph35181.detail> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#personnelId> "35181" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#assembledName> "Private Kenneth Ferris Abbot" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#given> "Kenneth Ferris" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#family> "Abbot" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#serial> "12/2545" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#hasRank> <http://muse.aucklandmuseum.com/databases/cenotaph#private> .
<http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#address> .
<http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#numberStreet> "20" .
<http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#addressConstitutedOf> <http://muse.aucklandmuseum.com/databases/cenotaph#dominion_road> .
<http://muse.aucklandmuseum.com/databases/cenotaph#dominion_road> <http://muse.aucklandmuseum.com/databases/cenotaph#namePosition> "Dominion Road" .
<http://muse.aucklandmuseum.com/databases/cenotaph#dominion_road> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#street> .
<http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#addressConstitutedOf> <http://muse.aucklandmuseum.com/databases/cenotaph#auckland> .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland> <http://muse.aucklandmuseum.com/databases/cenotaph#latitude> "-36.85"^^<http://www.w3.org/2001/XMLSchema#decimal> .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland> <http://muse.aucklandmuseum.com/databases/cenotaph#longitude> "174.783333"^^<http://www.w3.org/2001/XMLSchema#decimal> .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland> <http://muse.aucklandmuseum.com/databases/cenotaph#namePosition> "Auckland" .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#area> .
<http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#addressConstitutedOf> <http://muse.aucklandmuseum.com/databases/cenotaph#new_zealand> .
<http://muse.aucklandmuseum.com/databases/cenotaph#new_zealand> <http://muse.aucklandmuseum.com/databases/cenotaph#latitude> "-42"^^<http://www.w3.org/2001/XMLSchema#decimal> .
<http://muse.aucklandmuseum.com/databases/cenotaph#new_zealand> <http://muse.aucklandmuseum.com/databases/cenotaph#longitude> "174"^^<http://www.w3.org/2001/XMLSchema#decimal> .
<http://muse.aucklandmuseum.com/databases/cenotaph#new_zealand> <http://muse.aucklandmuseum.com/databases/cenotaph#namePosition> "New Zealand" .
<http://muse.aucklandmuseum.com/databases/cenotaph#new_zealand> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#country> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#hasAddress> <http://muse.aucklandmuseum.com/databases/cenotaph#address_35181> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#embarkedOn> <http://muse.aucklandmuseum.com/databases/cenotaph#1915-06-13> .
<http://muse.aucklandmuseum.com/databases/cenotaph#1915-06-13> <http://muse.aucklandmuseum.com/databases/cenotaph#date> "1915-06-13"^^<http://www.w3.org/2001/XMLSchema#date> .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland_infantry_battalion> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#unit> .
<http://muse.aucklandmuseum.com/databases/cenotaph#auckland_infantry_battalion> <http://muse.aucklandmuseum.com/databases/cenotaph#unitTitle> "Auckland Infantry Battalion" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#hasMembership> <http://muse.aucklandmuseum.com/databases/cenotaph#auckland_infantry_battalion> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_24> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#transport> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_24> <http://muse.aucklandmuseum.com/databases/cenotaph#transportName> "HMNZT 24" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#transportedBy> <http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_24> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_25> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#transport> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_25> <http://muse.aucklandmuseum.com/databases/cenotaph#transportName> "HMNZT 25" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#transportedBy> <http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_25> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_26> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://muse.aucklandmuseum.com/databases/cenotaph#transport> .
<http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_26> <http://muse.aucklandmuseum.com/databases/cenotaph#transportName> "HMNZT 26" .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#transportedBy> <http://muse.aucklandmuseum.com/databases/cenotaph#hmnzt_26> .
<http://muse.aucklandmuseum.com/databases/cenotaph#personnel_35181> <http://muse.aucklandmuseum.com/databases/cenotaph#diedOn> <http://muse.aucklandmuseum.com/databases/cenotaph#1916-09-22> .
<http://muse.aucklandmuseum.com/databases/cenotaph#1916-09-22> <http://muse.aucklandmuseum.com/databases/cenotaph#date> "1916-09-22"^^<http://www.w3.org/2001/XMLSchema#date> .
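
Comparing these triples with the custom XML suggests two small normalisations: names become lower-case identifiers with underscores, and dates become ISO 8601 values. The following Python sketch reproduces both for the values shown above. The rules are inferred from this one example rather than taken from the XSL transform, so treat the function names slug and iso_date and their behaviour as assumptions.

```python
import re
from datetime import datetime

def slug(name):
    """Lower-case a name and replace runs of whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip().lower())

def iso_date(text):
    """Convert a full date such as '13 June 1915' to '1915-06-13'."""
    return datetime.strptime(text, "%d %B %Y").date().isoformat()

print(slug("Auckland Infantry Battalion"))
print(iso_date("13 June 1915"))
```

Partial dates, such as the bare year in dateOfBirthTerminusPostQuem, would need separate handling; iso_date raises a ValueError for them.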
      

Loading into Sesame

The files of N-Triples were then loaded into a local instance of Sesame, an open-source triplestore developed by Aduna:

> ./console.sh < ./cenotaph_load_awm.bat
      

The cenotaph_load_awm.bat script is rather simple:

connect http://localhost:8080/openrdf-sesame.
open cenotaph.

load data/awm-1.nt.
load data/awm-2.nt.
load data/awm-3.nt.
load data/awm-4.nt.
load data/awm-5.nt.
load data/awm-6.nt.
load data/awm-7.nt.
load data/awm-8.nt.
load data/awm-9.nt.
load data/awm-10.nt.
load data/awm-11.nt.
load data/awm-12.nt.
load data/awm-13.nt.
load data/awm-14.nt.
load data/awm-15.nt.
load data/awm-16.nt.
load data/awm-17.nt.
load data/awm-18.nt.
load data/awm-19.nt.
load data/awm-20.nt.
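
Since the load script is entirely regular, it could equally be generated rather than typed out. A minimal Python sketch follows; the function name load_script is hypothetical, and the server URL, repository name and file paths are the ones shown above.

```python
def load_script(server, repository, count):
    """Build the console commands: connect, open, then one load per file."""
    lines = [f"connect {server}.", f"open {repository}.", ""]
    lines += [f"load data/awm-{n}.nt." for n in range(1, count + 1)]
    return "\n".join(lines) + "\n"

script = load_script("http://localhost:8080/openrdf-sesame", "cenotaph", 20)
print(script)
```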
      

Querying Sesame using Open RDF Sesame Workbench

Once the local Sesame instance is loaded with our triples, we can use the OpenRDF Sesame Workbench to query it and retrieve results.
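
Queries can also be sent to the repository over HTTP rather than through the Workbench. The Python sketch below only builds the request URL for a SPARQL query and does not contact a server. It assumes the repository is exposed at the /repositories/cenotaph path under the server URL used in the load script, and that the store accepts SPARQL; the predicate names are taken from the sample triples, and the query itself is an illustrative example rather than one used in the project.

```python
from urllib.parse import urlencode

# Assumed endpoint: server URL from the load script plus the repository name.
ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/cenotaph"

# Example query: names of personnel in the Auckland Infantry Battalion.
QUERY = """
PREFIX cen: <http://muse.aucklandmuseum.com/databases/cenotaph#>
SELECT ?person ?given ?family WHERE {
  ?person cen:hasMembership cen:auckland_infantry_battalion ;
          cen:given ?given ;
          cen:family ?family .
}
LIMIT 10
"""

def query_url(endpoint, sparql):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql.strip()})

url = query_url(ENDPOINT, QUERY)
print(url)
```

With a running server, the resulting URL could be fetched with urllib.request, ideally with an Accept header naming the desired result format.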