XQuery/TEI Document Timeline

Motivation

You want to create a timeline of the dates within a single TEI document.

Approach

TEI documents may include date elements in any section of the document: in the metadata, in the publication details, in the front and back matter, and in the body of the text. Let's assume that we want a timeline showing the dates in the text body.

We will use the Simile Timeline JavaScript API to create a browsable timeline in an HTML page.

Extracting timeline dates

TEI documents store dates in the date element in the following format:

  <date when="1861-03-16">March 16</date>

or

  <date when="1861">1861</date>

We will write an XQuery script that will extract all of the date elements in the body of a TEI document and generate a Simile Timeline.

Getting the dates

Dates are used throughout the sections of a TEI document, but we are most likely to be interested in dates in the body of the text.

 let $dates := doc($tei-document)//tei:body//tei:date

For example:

<date when="1642-01">January 1642</date>
<date when="1616">1616</date>
<date when="1642">1642</date>
<date when="1642-08-13">13 August 1642</date>
<date when="1643-05">May</date>
<date when="1643-07">July 1643</date>

Transforming to Simile events

We can then transform this sequence of date elements into the format that is needed by Simile.

<data>{
   for $date in $dates
   return 
     <event start='{$date/@when}' >
       {$date/text()}
     </event>
}</data>

Note that there are two path expressions in the above query. The first, $date/@when, extracts the when attribute of the date element. The second, $date/text(), extracts the text content of the date element, i.e. the text between its start and end tags:

  <date when="1642-08-13">13 August 1642</date>

Sample XQuery to Extract Dates from TEI File

xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: get the file name from the URL parameter :)
let $file := request:get-parameter('file', '')

(: this is where we will get our TEI documents :)
let $data-collection := '/db/Wiki/TEI/docs'
 
(: build the path to the document :)
let $tei-document := concat($data-collection, '/', $file)

(: get all dates in the body of the document :)
let $dates := doc($tei-document)//tei:body//tei:date

return
<data>{
   for $date in $dates
   return 
     <event start='{$date/@when}'>
      {$date/text()}
     </event>
}</data>

For example, the script can be run on the TEI document "The Discovery of New Zealand" by J. C. Beaglehole, produced by the New Zealand Electronic Text Centre.

Discussion

  • TEI dates are generally XML Schema dates, which the Simile Timeline API recognises. However, TEI also allows partial dates that omit the year, such as
<date when="--01-01">New Years Day</date>

so the dates really need to be filtered, for example with a suitable regular expression. Another option is to test the format of the when attribute with the XQuery "castable as" expression.
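
The castable check mentioned above might look like the following minimal sketch. The function name local:is-timeline-date and the choice of accepted types (xs:date, xs:gYearMonth and xs:gYear) are assumptions made for illustration and are not part of the original script. The inline sample dates stand in for the result of the //tei:body//tei:date path.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: hypothetical helper: true if $when is a full date, a year-month or a year :)
declare function local:is-timeline-date($when as xs:string?) as xs:boolean {
    $when castable as xs:date
    or $when castable as xs:gYearMonth
    or $when castable as xs:gYear
};

(: inline sample data standing in for doc($tei-document)//tei:body//tei:date :)
let $dates :=
    <tei:body>
        <tei:date when="1642-08-13">13 August 1642</tei:date>
        <tei:date when="1643-05">May</tei:date>
        <tei:date when="1616">1616</tei:date>
        <tei:date when="--01-01">New Years Day</tei:date>
    </tei:body>//tei:date
return
    <data>{
        (: keep only the dates whose when attribute passes the check :)
        for $date in $dates[local:is-timeline-date(@when)]
        return <event start="{$date/@when}">{$date/text()}</event>
    }</data>
```

With these sample dates, the New Years Day entry is dropped and the other three become events. A date element with no when attribute is dropped as well, because an empty sequence is not castable to any of the three types.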

Creating the Timeline bubbles

Providing Context

We can enhance the timeline by providing some context for the date in the timeline bubble. One approach is to include some of the preceding and following text.

Each date node is part of a parent node, e.g.

<date when="1777-02-12">12 February 1777</date>

is a child node in

   <p>Cook left Queen Charlotte's Sound for the fourth time on <date when="1774-11-10">10 November</date>. 
      He returned for a fifth visit on <date when="1777-02-12">12 February 1777</date> and remained a fortnight; but this
      last voyage contributed nothing to the discovery of New Zealand. The discoverer
      was bound for the northern hemisphere, and for his death.</p>

We need to access the mixture of elements and text nodes on either side of the target date. For example, preceding this node are a text node ("Cook left.."), a date node and another text node ("He returned .."). Following the target date is the text node ("and remained ..."). We can select these nodes using the preceding-sibling and following-sibling axes:

   let $nodesbefore := $date/preceding-sibling::node()
   let $nodesafter := $date/following-sibling::node()

A crude approach to construct a context string is to join the node strings and extract a suitable substring. The text after:

   let $after := string-join($nodesafter, ' ')
   let $afterString := substring($after,1,100)

and the text before:

   let $before := string-join($nodesbefore,' ')
   (: start at length - 99 so that the 100 characters end on the last character :)
   let $beforeString := substring($before,string-length($before) - 99,100)

We can then create an XML fragment with the target date in bold:

    let $context := 
        <div>
          {concat('...', $beforeString,' ')} 
          <b>{$date/text()}</b>
          {concat($afterString,' ...')}
        </div>

Finally the element needs to be serialized and added to the event ($when is assumed here to hold the value of the date's when attribute):

   return 
     <event start='{$when}' title='{$when}' >
       {util:serialize($context,("method=xhtml","media-type=text/html"))}
     </event>
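
As a minimal, self-contained sketch of this crude approach, the query below builds the context for the second date in the sample paragraph. The paragraph is inlined without the TEI namespace for brevity, $width is a hypothetical parameter replacing the fixed 100 characters, and the eXist-specific util:serialize step is left out so that the query should run on any XQuery 1.0 processor.

```xquery
xquery version "1.0";

(: sample paragraph from the text above, shortened and without a namespace :)
let $p :=
    <p>Cook left Queen Charlotte's Sound for the fourth time on <date when="1774-11-10">10 November</date>. He returned for a fifth visit on <date when="1777-02-12">12 February 1777</date> and remained a fortnight; but this last voyage contributed nothing to the discovery of New Zealand.</p>
let $date := $p/date[@when = '1777-02-12']

(: hypothetical parameter: number of characters kept on each side :)
let $width := 40

(: join the sibling nodes on either side of the date into two strings :)
let $before := string-join($date/preceding-sibling::node(), ' ')
let $after := string-join($date/following-sibling::node(), ' ')

(: the last $width characters before the date, the first $width after it :)
let $beforeString := substring($before, string-length($before) - $width + 1, $width)
let $afterString := substring($after, 1, $width)
return
    <div>
        {concat('...', $beforeString, ' ')}
        <b>{$date/text()}</b>
        {concat($afterString, ' ...')}
    </div>
```

Because the cut is made by character count, the context will usually begin and end in the middle of a word.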

Improved Context

The context is extracted from the parent node without regard to word or sentence boundaries. Splitting on word boundaries would be better.

   let $nodesafter := $date/following-sibling::node()
   (: join the nodes, then split on space :)
   let $after := tokenize(string-join($nodesafter, ' '),' ')
   (: get the first $scope words :)
   let $afterwords := subsequence($after,1,$scope)
   (: join the subsequence of words, and suffix with ellipsis if the paragraph text has been truncated :)
   let $afterString := 
           concat (' ',string-join($afterwords,' '),if (count($after) > $scope) then '... ' else '')

Similarly, the text before the target date:

   let $nodesbefore := $date/preceding-sibling::node()
   let $before := tokenize(string-join($nodesbefore,' '),' ')
   let $beforewords := subsequence($before,count($before) - $scope + 1,$scope)
   let $beforeString := 
           concat (if (count($before) > $scope) then '... ' else '',string-join($beforewords,' '),' ')
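
The two halves can be combined into one function, sketched below under some assumptions. The name local:word-context is hypothetical. As a variation on the snippets above, the text is passed through normalize-space and split on the regular expression \s+ rather than on a single space, so that line breaks and runs of spaces in the source do not produce empty tokens.

```xquery
xquery version "1.0";

(: hypothetical helper: up to $scope words on either side of $date within its parent :)
declare function local:word-context($date as element(), $scope as xs:integer) as element(div) {
    (: join the sibling nodes, collapse whitespace, then split into words :)
    let $before := tokenize(normalize-space(string-join($date/preceding-sibling::node(), ' ')), '\s+')
    let $after := tokenize(normalize-space(string-join($date/following-sibling::node(), ' ')), '\s+')
    (: the last $scope words before the date and the first $scope words after it :)
    let $beforewords := subsequence($before, count($before) - $scope + 1, $scope)
    let $afterwords := subsequence($after, 1, $scope)
    return
        <div>{
            (: prefix with an ellipsis only if words were cut off :)
            concat(if (count($before) > $scope) then '... ' else '',
                   string-join($beforewords, ' '), ' ')
        }<b>{string($date)}</b>{
            (: suffix with an ellipsis only if words were cut off :)
            concat(' ', string-join($afterwords, ' '),
                   if (count($after) > $scope) then ' ...' else '')
        }</div>
};

(: sample paragraph, inlined without the TEI namespace for brevity :)
let $p :=
    <p>Cook left Queen Charlotte's Sound for the fourth time on <date when="1774-11-10">10 November</date>. He returned for a fifth visit on <date when="1777-02-12">12 February 1777</date> and remained a fortnight; but this last voyage contributed nothing to the discovery of New Zealand.</p>
return local:word-context($p/date[2], 5)
```

Punctuation that directly follows an earlier date element, such as the full stop after "10 November", becomes a token of its own, so the result is still only an approximation of the original text.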

Splitting on sentence boundaries would be even better. We can use the pattern '\. ' as the marker. This may not be entirely accurate, but false positives will merely shorten the context. The ellipsis is no longer needed, and $scope is now the number of sentences on either side.

   let $nodesafter := $date/following-sibling::node()
   (: join the nodes, then split on the pattern fullstop space :)
   let $after := tokenize(string-join($nodesafter, ' '),'\. ')
   (: get the first $scope sentences :)
   let $afterSentences := subsequence($after,1,$scope)
   (: join the subsequence of sentences :)
   let $afterString := 
           concat (' ',string-join($afterSentences,'. '))

Similarly for $beforeString:

   let $nodesbefore := $date/preceding-sibling::node()
   let $before := tokenize(string-join($nodesbefore,' '),'\. ')
   let $beforeSentences := subsequence($before,count($before) - $scope + 1,$scope)
   let $beforeString := 
           concat (string-join($beforeSentences,'. '),'. ')
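
To see how the '\. ' pattern behaves, here is a small sketch on literal strings rather than on TEI nodes. The sample sentences and the element names result, kept and tokens are made up for illustration.

```xquery
xquery version "1.0";

(: made-up sample text with three sentences :)
let $text := "He returned for a fifth visit. He remained a fortnight. This last voyage contributed nothing."
let $scope := 2
(: split on full stop followed by a space :)
let $sentences := tokenize($text, '\. ')
return
    <result>
        <!-- the first $scope sentences, rejoined with the separator -->
        <kept>{string-join(subsequence($sentences, 1, $scope), '. ')}</kept>
        <!-- initials are false positives: this text splits into 4 tokens, not 2 -->
        <tokens>{count(tokenize("Written by J. C. Beaglehole. Published later.", '\. '))}</tokens>
    </result>
```

The kept element contains "He returned for a fifth visit. He remained a fortnight", and the tokens element contains 4, because the initials "J." and "C." are each treated as the end of a sentence.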

Discussion

In addition, each event could link into the full text of the document. (to do)
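
One possible way to add such a link is sketched below under several assumptions: that the Simile XML event format accepts a link attribute on each event, that a viewer script exists (the name view.xq and its file and id parameters are hypothetical), and that the paragraph containing the date carries an xml:id. The file name and the sample body are made up.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: hypothetical file name and sample body :)
let $file := 'example.xml'
let $body :=
    <tei:body>
        <tei:p xml:id="p12">He returned on <tei:date when="1777-02-12">12 February 1777</tei:date>.</tei:p>
    </tei:body>
return
    <data>{
        for $date in $body//tei:date
        (: the xml:id of the nearest enclosing paragraph, if there is one :)
        let $target := $date/ancestor::tei:p[1]/@xml:id
        return
            <event start="{$date/@when}" title="{$date/@when}">{
                (: add the link attribute only when a target is available :)
                if ($target)
                then attribute link {
                    concat('view.xq?file=', encode-for-uri($file),
                           '&amp;id=', encode-for-uri($target))
                }
                else (),
                $date/text()
            }</event>
    }</data>
```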

Generating an HTML page

Since the event stream is parameterised by the source document, the HTML page containing the timeline also needs to be parameterised, so we will generate it using another XQuery script.

Simile API

The definition of the timeline layout uses the SIMILE Timeline JavaScript API. To define the basic bands:

function onLoad(file,start) {
  var theme = Timeline.ClassicTheme.create();
  theme.event.label.width = 400; // px
  theme.event.bubble.width = 300;
  theme.event.bubble.height = 300;

  var eventSource = new Timeline.DefaultEventSource();

  var bandInfo = [
    Timeline.createBandInfo({
        eventSource:    eventSource,
        theme:          theme,
        trackGap:       0.2,
        trackHeight:    1,
        date:           start,
        width:          "90%", 
        intervalUnit:   Timeline.DateTime.YEAR, 
        intervalPixels: 45
    }),
   Timeline.createBandInfo({
         date:           start,
         width:          "10%", 
         intervalUnit:   Timeline.DateTime.DECADE, 
         intervalPixels: 50
     })

  ];
  bandInfo[1].syncWith = 0;
  bandInfo[1].highlight = true;

  Timeline.create(document.getElementById("my-timeline"), bandInfo);
  Timeline.loadXML("dates.xq?file="+file, function(xml, url) { eventSource.loadXML(xml, url); });

}

Note that the bands are set to YEAR and DECADE, which are appropriate for historical texts. The function has two parameters: the source file and the start date.

The events are generated by a call to the transformation script in the previous section.

 Timeline.loadXML("dates.xq?file="+file, function(xml, url) { eventSource.loadXML(xml, url); });

Setting the Start date

The start date is the earliest date in the sequence of dates. We can find this by ordering the dates using the order by clause and then selecting the first item in the sequence.

let $orderedDates := 
    for $date in $doc//tei:body//tei:date/@when
    order by $date
    return $date
let $start := $orderedDates[1]
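
A small self-contained check of this ordering follows, using some of the sample dates from earlier in place of $doc. Without a schema the when values are compared as strings, which puts four-digit years in the right order even when the dates differ in precision. That observation is an assumption about typical TEI input, not something the original script states.

```xquery
xquery version "1.0";

(: sample dates standing in for $doc//tei:body//tei:date :)
let $dates :=
    (<date when="1642-01">January 1642</date>,
     <date when="1616">1616</date>,
     <date when="1642-08-13">13 August 1642</date>)
(: order the when attributes by their string value :)
let $orderedDates :=
    for $when in $dates/@when
    order by string($when)
    return $when
(: the first item is the earliest date :)
return string($orderedDates[1])
```

This returns the string 1616.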

We can also retrieve the document title and author from the titleStmt element of the TEI header, as the full script below does.

Full script

xquery version "1.0";

declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare option exist:serialize "method=xhtml media-type=text/html";

let $file:= request:get-parameter('file','')
let $data-collection := '/db/Wiki/TEI/docs'
let $tei-document := concat($data-collection, '/', $file)
let $doc := doc($tei-document)
(: get the title and author from the titleStmt element :)
let $header := $doc//tei:titleStmt
(: there may be several titles, differentiated by the type attribute - just take the first :)
let $doc-title :=  string(($header/tei:title)[1])
let $doc-author := string(($header/tei:author/tei:name)[1])

(: get the start date :)
let $orderedDates := 
    for $date in $doc//tei:body//tei:date/@when
    order by $date
    return $date
let $start := $orderedDates[1]

return
<html>
    <head>
        <title>TimeLine: {$doc-title}</title>
        <script src="http://simile.mit.edu/timeline/api/timeline-api.js" type="text/javascript"></script>
        <script  type="text/javascript">
        <![CDATA[
function onLoad(file,start) {
  var theme = Timeline.ClassicTheme.create();
  theme.event.label.width = 400; // px
  theme.event.bubble.width = 300;
  theme.event.bubble.height = 300;

  var eventSource = new Timeline.DefaultEventSource();

  var bandInfo = [
    Timeline.createBandInfo({
        eventSource:    eventSource,
        theme:          theme,
        trackGap:       0.2,
        trackHeight:    1,
        date:           start,
        width:          "90%", 
        intervalUnit:   Timeline.DateTime.YEAR, 
        intervalPixels: 45
    }),
   Timeline.createBandInfo({
         date:           start,
         width:          "10%", 
         intervalUnit:   Timeline.DateTime.DECADE, 
         intervalPixels: 50
     })

  ];
  bandInfo[1].syncWith = 0;
  bandInfo[1].highlight = true;

  Timeline.create(document.getElementById("my-timeline"), bandInfo);
  Timeline.loadXML("dates.xq?file="+file, function(xml, url) { eventSource.loadXML(xml, url); });

}
 ]]>
        </script>  
    </head>
    <body onload="onLoad('{$file}','{$start}');">
        <h1>Timeline of <em>{$doc-title}</em> by {$doc-author}</h1>
        <div id="my-timeline" style="height: 700px; border: 1px solid #aaa"></div>
    </body>
</html>

Examples

  • Beaglehole Timeline
  • Buck [1]
    • Dates in this encoding are confined to the Bibliography and are publication rather than subject events.

Discussion

  • Simile Timeline has a problem displaying many events on closely spaced dates, so not all events may appear on the timeline.