XQuery/DocBook to ePub

Motivation

You want to convert a DocBook 5 document into ePub format.

Method

We will create an XQuery typeswitch transformation that will perform this conversion. Note that no XSLT will be needed. If you are familiar with XQuery you will not need to learn any new transformation languages.

The basis of this transformation will be a central XQuery module with a main dispatch function that uses a typeswitch expression to implement the dispatch pattern. The main function looks at each element and calls the appropriate function for it. This makes the transform easy to write and easy to maintain. The main function will create a single large XML file that will then be converted into a zip file with some additional book metadata. This zip file can then be checked against the ePub formatting rules using an ePub validation tool.
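The dispatch pattern can be sketched as follows. This is only a minimal illustration and not the full module: the function name local:dispatch, the choice of output elements and the inline test chapter are assumptions made for this example.

```xquery
xquery version "1.0";
declare namespace db = "http://docbook.org/ns/docbook";

(: Hypothetical dispatch function: inspects each node and
   decides how it should be rendered in the output. :)
declare function local:dispatch($nodes as node()*) as item()* {
    for $node in $nodes
    return
        typeswitch ($node)
            (: text nodes are copied unchanged :)
            case text() return $node
            case element(db:chapter) return
                <div class="chapter">{local:dispatch($node/node())}</div>
            case element(db:title) return
                <h1>{local:dispatch($node/node())}</h1>
            case element(db:para) return
                <p>{local:dispatch($node/node())}</p>
            (: unknown elements are skipped but their children are still processed :)
            default return local:dispatch($node/node())
};

(: a small inline chapter used only to exercise the function :)
local:dispatch(
    <chapter xmlns="http://docbook.org/ns/docbook"><title>Chapter Title</title><para>Text</para></chapter>
)
```

Each new DocBook element you want to support becomes one more case clause, which is what keeps this style of transform easy to maintain.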

The zip function we will use is compression:zip() from eXist's compression module. One might think that the way to go about this is to put all the correct documents in an eXist collection and then pass this collection to the zip function. Unfortunately, there are two problems with this approach. The first is that the current implementation of the zip function does not allow you to specify relative paths in the collection setting. The second is that the ePub format is very strict about the order in which the files appear in the ePub file. For example, a text file that indicates the MIME type MUST be the first file in the zip container.

For these reasons we must pass the zip function a sequence of <entry> elements that must be in a very strict order. The format is:

 let $entries := (<entry name=""/>, <entry name=""/>, <entry name=""/>, ...)
 return compression:zip($entries , true())

Note: The final step of this transformation will only work on eXist 1.5. It relies on new features of the "zip" compression function that are not available in eXist 1.4.

Sample ePub File Generator

To demonstrate the exact format of a sample ePub file here is a "serialization" of the entire file in a single XML document:

File Entries for ePub Zip File Package:

(: create a sequence of entries for the zip program to use :)
let $entries :=
(
   <entry name="mimetype" type="text" method="store">application/epub+zip</entry>,
   <entry name="META-INF/container.xml" type="xml">
        <container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
              <rootfiles>
                  <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
              </rootfiles>
          </container>
    </entry>,
    <entry name="OEBPS/toc.ncx" type="xml">
          <ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
               <head>
                   <meta name="dtb:uid" content="http://www.danmccreary.com/books/epub-test"/>
                   <meta name="dtb:depth" content="1"/>
                   <meta name="dtb:totalPageCount" content="0"/>
                   <meta name="dtb:maxPageNumber" content="0"/>
               </head>
               <docTitle>
                   <text>My Book Title</text>
               </docTitle>
               <navMap>   
                   <navPoint id="title-page" playOrder="1">
                       <navLabel>
                           <text>Title Page</text>
                       </navLabel>
                       <content src="title-page.xhtml"/>
                   </navPoint>
                   <navPoint id="chapter-1" playOrder="2">
                       <navLabel>
                           <text>Chapter 1</text>
                       </navLabel>
                       <content src="chapter-1.xhtml"/>
                   </navPoint>
               </navMap>
           </ncx>
     </entry>,
      
     <entry name="OEBPS/content.opf" type="xml">
          <package xmlns:dc="http://purl.org/dc/elements/1.1/" 
                  xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0">
            <metadata>
                <dc:title>My Book Title</dc:title>
                <dc:creator>Dan McCreary</dc:creator>
                <dc:identifier id="bookid">http://www.danmccreary.com/books/epub-test</dc:identifier>
                <dc:language>en-US</dc:language>
            </metadata>
            <manifest>
                <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
                <item id="title-page" href="title-page.xhtml" media-type="application/xhtml+xml"/>
                <item id="chapter-1" href="chapter-1.xhtml" media-type="application/xhtml+xml"/>
            </manifest>
            <spine toc="ncx">
                <itemref idref="title-page" />
                <itemref idref="chapter-1" />
            </spine>
        </package>
      </entry>,
      
      <entry name="OEBPS/title-page.xhtml" type="xml">
          <html xmlns="http://www.w3.org/1999/xhtml">
             <head>
                 <title>Title Page</title>
             </head>
             <body>
                 <h1>Title Page</h1>
                 <p>Text for paragraph 1</p>
                 <p>Text for paragraph 2</p>
             </body>
         </html>
      </entry>,
      
      <entry name="OEBPS/chapter-1.xhtml" type="xml">
          <html xmlns="http://www.w3.org/1999/xhtml">
             <head>
                 <title>Chapter 1</title>
             </head>
             <body>
                 <h1>Chapter 1</h1>
                 <p>Text for paragraph 1</p>
                 <p>Text for paragraph 2</p>
             </body>
         </html>
      </entry>
   )

We will not spend a large amount of time in this article explaining the exact format of the ePub file. Suffice it to say that there are several "constants", such as the mimetype file and the container.xml file, that will not change. The content.opf file lists every file in the package (the manifest) and the reading order (the spine), and the toc.ncx file describes what the table of contents should look like. From then on each chapter in a book is essentially an XHTML file with standard elements for head, body, headings and paragraphs. The example above does not include a CSS file but this can also be included.
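As a rough sketch of how a stylesheet could be added, the fragments below show the three pieces that would be involved: one more entry for the zip sequence, one more item in the manifest, and a link element in the head of each XHTML file. The file name style.css, the id value and the CSS rules are assumptions made for this example.

```xquery
xquery version "1.0";

(: Hypothetical stylesheet entry. Curly braces are doubled because
   single braces start an enclosed expression in XQuery. :)
let $css-entry :=
    <entry name="OEBPS/style.css" type="text">
body {{ font-family: serif; }}
h1 {{ text-align: center; }}
    </entry>

(: item to add to the manifest in OEBPS/content.opf :)
let $manifest-item :=
    <item xmlns="http://www.idpf.org/2007/opf"
          id="css" href="style.css" media-type="text/css"/>

(: link to add to the head of each XHTML file :)
let $head-link :=
    <link xmlns="http://www.w3.org/1999/xhtml"
          rel="stylesheet" type="text/css" href="style.css"/>

return ($css-entry, $manifest-item, $head-link)
```

The href values are relative to the OEBPS folder because content.opf and the XHTML files live in that folder.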

Storing your ePub file in a Collection

Once you have created your entry list you can now store the entries directly in a single zip file. Here is a small utility function that will store the entries to a file in a collection:

declare function epub-util:store-entries-in-epub($entries as element(entry)*,  $collection as xs:string, $file-name as xs:string) as node() {
   let $zip-file := compression:zip($entries, true())
   let $file-name-suffix :=
     if (ends-with($file-name, '.epub'))
       then $file-name
       else concat($file-name, '.epub')
   let $store := xmldb:store($collection, $file-name-suffix, $zip-file, 'application/epub+zip')
   return
   <result>
      <message>{count($entries)} entries stored in {$collection}/{$file-name-suffix}</message>
   </result>
};

This version checks that the file name ends with the .epub suffix, adds the suffix if it is missing, and makes sure the file is stored with the correct MIME type.

Rendering the ePub to your browser

There is no need to store your ePub file as a binary file in the database. You can dynamically render any ePub file directly to your web browser on demand, just like generating any web page.

The following XQuery can then be used to view the ePub file in your web browser:

declare function epub-util:render-epub($entries as node()*) {
let $zip-file := compression:zip($entries, true())
return response:stream-binary($zip-file, 'application/epub+zip')
};

Note that you must not put a return type on this function. It returns a binary and must not be cast as item() or node(). This is very important.

This function not only compresses the file but also returns a binary stream to the browser with the MIME type set, so that if your browser has an ePub viewer the file will be rendered directly in the viewer.

It turns out this is actually a very efficient way of delivering documentation to the user. All of the chapters are compressed into a single file and then uncompressed directly in the browser.

Screen Image

The following image is a screen image of the test ePub file being rendered in Firefox after the free EPUBReader plugin has been installed.

Screen image of ePub in Firefox

Example: Transforming DocBook Chapters

Although there is some work that must be done to convert the front and back portions of a book to ePub format, the heart of book creation in this example will be per-chapter processing to build an ePub "book". Note, however, that you do not have to use the DocBook chapter element to create the various sections of an ePub. This can be done with book parts, section, sect1 or any other DocBook elements you want to use. If you do use chapters as this example does, here is the main logic of the conversion to the ePub format:

For each chapter we must add:

to the OEBPS/content.opf file, an <item> XML element in the <manifest> section and an <itemref> in the <spine> element
to the OEBPS/toc.ncx file, a <navPoint> XML element for navigation
to the main sequence, one <entry> containing the XHTML file for that chapter

Here is the pseudo code for these three items:

In the db2epub:package-entry function:

The following will add one item per chapter:

   <manifest>
   ...
   {
   for $chapter at $count in collection($root)//db:chapter
   return
      <item id="chapter-{$count}" href="chapter-{$count}.xhtml" media-type="application/xhtml+xml"/>
   }</manifest>

The following will add a reference to each chapter in the spine, which defines the reading order of the book:

   <spine toc="ncx">
   ...
   {for $chapter at $count in collection($root)//db:chapter
   return
      <itemref idref="chapter-{$count}"/>
   }
   </spine>

In the chapter-navmap() function:

   <navMap>
       ...
       {for $chapter at $count in collection($root)//db:chapter
       return
       (: playOrder 1 is used by the title page, so chapters start at 2 :)
       <navPoint id="chapter-{$count}" playOrder="{$count + 1}">
           <navLabel>
                 <text>Chapter {$count}</text>
           </navLabel>
           <content src="chapter-{$count}.xhtml"/>
       </navPoint>
    }</navMap>

In the chapter-entries() function:

   for $chapter at $count in collection($root)//db:chapter
   return
      <entry name="OEBPS/chapter-{$count}.xhtml" type="xml">{db:chapter-to-xhtml($chapter)}</entry>

Note that you must supply a function that will convert each chapter to an XHTML file. Such a function has often been written already as part of a docbook-to-html() module.
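If no such function is available yet, a very small stand-in could look like the sketch below. It is only an assumption about what a chapter conversion might do: the name local:chapter-to-xhtml is hypothetical, only the chapter title and the para elements are handled, and the inline book is a cut-down test document.

```xquery
xquery version "1.0";
declare namespace db = "http://docbook.org/ns/docbook";

(: Hypothetical helper: wraps one DocBook chapter in a minimal XHTML page.
   Only the chapter title and paragraphs are converted. :)
declare function local:chapter-to-xhtml($chapter as element(db:chapter)) as element() {
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>{string($chapter/db:title)}</title>
        </head>
        <body>
            <h1>{string($chapter/db:title)}</h1>
            {
                for $para in $chapter//db:para
                return <p>{string($para)}</p>
            }
        </body>
    </html>
};

(: a cut-down book used only to exercise the function :)
let $book :=
    <book xmlns="http://docbook.org/ns/docbook">
        <chapter>
            <title>Chapter Title</title>
            <sect1>
                <title>Section1 Title</title>
                <para>Text</para>
            </sect1>
        </chapter>
    </book>

(: one entry per chapter, named to match the manifest hrefs :)
for $chapter at $count in $book//db:chapter
return
    <entry name="OEBPS/chapter-{$count}.xhtml" type="xml">{
        local:chapter-to-xhtml($chapter)
    }</entry>
```

In a real conversion the body of the helper would normally hand the chapter to the typeswitch dispatch function instead of selecting paragraphs directly.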

Sample Input File

<book xmlns="http://docbook.org/ns/docbook"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
    <info>
        <title>How to Transform DocBook to ePub Format with XQuery</title>
        <author>
            <orgname>Kelly McCreary &amp; Associates</orgname>
            <address>
                <city>Minneapolis</city>
                <state>MN</state>
                <country>USA</country>
            </address>
            <email>user@example.com</email>
        </author>
    </info>
    <part>
        <title>First Part</title>
        <subtitle>Subtitle of First Part</subtitle>
        <chapter>
            <title>Chapter Title</title>
            <subtitle>Subtitle of Chapter</subtitle>
            <sect1>
                <title>Section1 Title</title>
                <subtitle>Subtitle of Section 1</subtitle>
                <para>Text</para>
            </sect1>
        </chapter>
    </part>
</book>

Getting a List of Distinct Elements in Each Chapter

We will next show how the elements within each chapter can be transformed into an XHTML ePub section. Our first step is to get a list of all of the element names used in the chapters of the source DocBook document. This can be done using the following XPath expression:

  let $distinct-chapter-element-names := distinct-values(/db:book//db:chapter/descendant-or-self::*/name(.))

The "descendant-or-self" XPath axis is very similar to using //*/name(.) but also includes the context node itself, which here is the chapter element. You can sort this list by putting it in a FLWOR expression with an order by clause added:

  let $sorted-element-names := 
     for $element-name in $distinct-chapter-element-names 
     order by $element-name
     return $element-name

This report gives you the inventory of XML elements that your typeswitch transform must handle. Note that some elements in the front and back matter of the book are not included in this list. Also note that attribute transforms are handled inside the element-level functions.
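The two steps can also be combined into one small report that shows how often each element occurs. The following is a sketch under the assumption that the book is available as an in-memory document; the inline book and the inventory and element output elements are made up for this example.

```xquery
xquery version "1.0";
declare namespace db = "http://docbook.org/ns/docbook";

(: a cut-down book used only to exercise the query :)
let $book :=
    <book xmlns="http://docbook.org/ns/docbook">
        <chapter>
            <title>Chapter Title</title>
            <subtitle>Subtitle of Chapter</subtitle>
            <sect1>
                <title>Section1 Title</title>
                <para>Text</para>
            </sect1>
        </chapter>
    </book>

(: every element in every chapter, including the chapter element itself :)
let $elements := $book//db:chapter/descendant-or-self::*

return
    <inventory>{
        for $name in distinct-values($elements/name(.))
        order by $name
        return
            <element name="{$name}" count="{count($elements[name(.) = $name])}"/>
    }</inventory>
```

Elements with a high count are usually the ones worth adding to the typeswitch first.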