XQuery/eXist Replication

This document is outdated and should be either updated or removed.

Please check the pages on GitHub for the most recent information.


Motivation

You want to configure two or more eXist instances to work together to automatically synchronize collection-specific data sets. This allows you to scale your eXist server capacity. For example, with multiple eXist servers configured to stay in sync as described below, you could add a load-balancer to distribute the load of incoming queries across the pool of servers and still maintain high performance.

Method

We will use the new eXist clustering options, which are available only in the eXist-db 2.1dev (bleeding edge) developer edition. This feature uses collection triggers to send "update" messages from a master server to one or more slave systems on remote hosts.

NOTE: This page is under development.

Terminology

In the long term there will be many different ways that the eXist clustering system might be configured. In this tutorial we will only cover collection-based replication from a single master to multiple slave systems. The figure below describes the relationship between the nodes in this replication configuration.

 

We will use the following terms in this document:

  • Master - a specially configured eXist instance where active changes to documents and collections happen, for example when documents are stored, updated, or deleted. The master is considered the "publisher" of change events. This is also the server that must have an ActiveMQ server running.
  • Slaves - a collection of one or more specially configured eXist instances that automatically receive updates when changes occur on the master. Each slave is considered a "subscriber" to the change events on the master.
  • Message Store - a location outside the master server's eXist instance where all update events are stored and from which they are forwarded to remote systems when those systems are ready to receive them. Although the term "Message Queue" is frequently used, in our case we will be using a "Topic", not a "Queue", to distribute messages to remote systems. ActiveMQ provides this function on the master server.

When any document changes in a configured collection, an update message is placed in the message store (loosely called the "message queue", although a topic is used here). The update stays there until all subscribers have received it.
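
The behaviour described above can be illustrated with a small in-memory toy model. This is only a sketch of the publish/subscribe idea, not ActiveMQ or eXist code; the class name ToyTopic and its methods are hypothetical names chosen for this example.

```python
from collections import deque


class ToyTopic:
    """In-memory stand-in for a publish/subscribe topic (NOT ActiveMQ)."""

    def __init__(self):
        # One pending-message queue per subscriber name.
        self._pending = {}

    def subscribe(self, name):
        # Register a subscriber; it only sees messages published afterwards.
        self._pending.setdefault(name, deque())

    def publish(self, message):
        # Every registered subscriber gets its own copy of the message.
        for queue in self._pending.values():
            queue.append(message)

    def receive(self, name):
        # Hand the oldest pending message to one subscriber, or None.
        queue = self._pending[name]
        return queue.popleft() if queue else None

    def outstanding(self):
        # Number of copies not yet received by their subscriber.
        return sum(len(queue) for queue in self._pending.values())
```

With two subscribers registered, one published update yields two outstanding copies, and the count drops to zero only after both subscribers have received it. A JMS "Queue", by contrast, would hand each message to just one consumer, which is why a topic is used for replication to several slaves.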

NOTE: To be confirmed with the developers: any collection on any system can be configured as a master or a slave to other collections on other systems.

NOTE: Once "durable" messages are implemented, slave systems will not need to be running when the changes are made. When a slave that was down comes back, it will automatically be notified of all the changes made since it last communicated with the master.

 

How Replication Works

The clustering replication configuration uses a standards-compliant messaging system built around the Java Message Service (JMS) standard. The implementation eXist uses is based on the Apache ActiveMQ system. ActiveMQ is widely used as "middleware" to help applications communicate in a reliable manner.

Configuration Steps

Download and Configure Apache ActiveMQ


  1. Download a recent version of ActiveMQ from http://activemq.apache.org/download.html. Note that the TGZ file has additional Unix (Linux, Mac OS X) support; the ZIP file is for Windows.
  2. Extract the content to disk; this location is referred to as $ACTIVEMQ_HOME
  3. Copy the $ACTIVEMQ_HOME/activemq-all-X.Y.Z.jar file to $EXIST_HOME/lib/user

For testing, I used activemq-all-5.6.0.jar.

Create eXist With Clustering Configuration


Note: The current work on clustering is being done in a branch of the eXist-db Subversion repository. To build this branch, check out the following URL with a Subversion client:

  https://exist.svn.sourceforge.net/svnroot/exist/branches/dizzzz/clustering

This code will be moved into the main trunk at a future time.

Note that extensions/local.properties in this branch contains the following setting, which is not yet in the main trunk:

  # Clustering extension for reliable document replication
  include.feature.clustering = true

You can then build eXist by running $EXIST_HOME/build.sh or build.bat.

When the build is done, you will see the following file:

 $EXIST_HOME/lib/extensions/exist-clustering.jar

Configure the Master Server

Add a collection.xconf file for the collection whose content must be distributed. For example, for /db/mycollection/ the .xconf file must be stored in /db/system/config/db/mycollection/.
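
The path rule above (prefix the collection path with /db/system/config) can be sketched as a small helper. This is an illustration only; the function name xconf_path is a hypothetical name used for this example and is not part of eXist.

```python
import posixpath

# Root under which eXist keeps per-collection configuration, per the rule above.
CONFIG_ROOT = "/db/system/config"


def xconf_path(collection):
    """Return where the collection.xconf for `collection` must be stored."""
    # Normalise "/db/mycollection/", "db/mycollection" etc. to "/db/mycollection".
    collection = "/" + collection.strip("/")
    if collection != "/db" and not collection.startswith("/db/"):
        raise ValueError("collection path must start with /db: " + collection)
    return posixpath.join(CONFIG_ROOT + collection, "collection.xconf")
```

For the collection used in this tutorial, xconf_path("/db/mycollection/") gives /db/system/config/db/mycollection/collection.xconf.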

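Create the collection '/db/mycollection'.

Fill in the hostname and port of the ActiveMQ message broker in the java.naming.provider.url parameter. The sample below uses "localhost:61616"; replace it with your broker's address, e.g. "server.local:61616".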

file: /db/system/config/db/mycollection/collection.xconf

  <collection xmlns="http://exist-db.org/collection-config/1.0">
    <triggers>
      <trigger class="org.exist.replication.jms.publish.ReplicationTrigger">
        <parameter name="java.naming.factory.initial" value="org.apache.activemq.jndi.ActiveMQInitialContextFactory"/>
        <parameter name="java.naming.provider.url" value="tcp://localhost:61616"/>
        <parameter name="connectionfactory" value="ConnectionFactory"/>
        <parameter name="destination" value="dynamicTopics/eXistdb"/>
        <parameter name="client-id" value="id1"/>
      </trigger>
    </triggers>
  </collection>

In the figure below, the collection is named "mycollection".

Figure: Location of collection trigger configuration

Configure the Slave Servers

For each 'Slave', a job must be started via conf.xml; its connection parameters (in particular the destination) must match those of the 'Master' configuration:

<!--
Start JMS listener for clustering feature. 
   Parameters:
      java.naming.factory.initial  Initial context provider
      java.naming.provider.url     URL of message broker
      connectionfactory            Name of connection factory
      destination                  Name of destination (Topic or Queue)
      client-id                    (optional) ClientID. Leave out or set ""
                                             for default behaviour
-->
<job type="startup" name="clustering"  class="org.exist.replication.jms.subscribe.MessageReceiverJob">
        <parameter name="java.naming.factory.initial" value="org.apache.activemq.jndi.ActiveMQInitialContextFactory"/>
        <parameter name="java.naming.provider.url" value="tcp://localhost:61616"/>
        <parameter name="connectionfactory" value="ConnectionFactory"/>
        <parameter name="destination" value="dynamicTopics/eXistdb"/>
        <parameter name="client-id" value="id2"/>
        <parameter name="subscriber-name" value="sub_name"/>
</job>
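
Because the master trigger and the slave job are configured in two different files, it is easy for them to drift apart. The following sketch compares the two fragments; it is a hypothetical helper, not part of eXist. It assumes that the destination must be identical on both sides and that each connection should use its own client-id (the samples above use id1 and id2). The broker URL is deliberately not compared, since master and slave may reach the same broker under different host names.

```python
import xml.etree.ElementTree as ET

# Shortened copies of the master and slave samples from this page.
MASTER_XCONF = """
<collection xmlns="http://exist-db.org/collection-config/1.0">
  <triggers>
    <trigger class="org.exist.replication.jms.publish.ReplicationTrigger">
      <parameter name="destination" value="dynamicTopics/eXistdb"/>
      <parameter name="client-id" value="id1"/>
    </trigger>
  </triggers>
</collection>
"""

SLAVE_JOB = """
<job type="startup" name="clustering"
     class="org.exist.replication.jms.subscribe.MessageReceiverJob">
  <parameter name="destination" value="dynamicTopics/eXistdb"/>
  <parameter name="client-id" value="id2"/>
  <parameter name="subscriber-name" value="sub_name"/>
</job>
"""


def jms_parameters(xml_text):
    """Collect all <parameter name=... value=...> pairs, ignoring namespaces."""
    params = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.rsplit("}", 1)[-1] == "parameter":
            params[element.get("name")] = element.get("value")
    return params


def check_pair(master_xconf, slave_job):
    """Return a list of human-readable problems; an empty list means none found."""
    master = jms_parameters(master_xconf)
    slave = jms_parameters(slave_job)
    problems = []
    if master.get("destination") != slave.get("destination"):
        problems.append("destination differs between master and slave")
    if master.get("client-id") and master.get("client-id") == slave.get("client-id"):
        problems.append("master and slave share the same client-id")
    return problems
```

Calling check_pair(MASTER_XCONF, SLAVE_JOB) on the samples returns an empty list.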

Start up the Servers

Start ActiveMQ server

Start ActiveMQ server:

   cd $ACTIVEMQ_HOME
   ./bin/activemq start
   (on Mac OS X, use the bin/macosx wrapper)

Start eXist-db server on slave(s) and master

Start eXist on each slave server and create the collection that will mirror the master collection:

   cd EXISTSLAVE_HOME
   ./bin/startup.sh
   (Create the receiving collection '/db/mycollection')

Start the master:

   cd EXISTMASTER_HOME
   ./bin/startup.sh
   (No need to create the collection, since it was already created when configuring the master)

Test Document Distribution

ActiveMQ queues

When you use dynamic topics or dynamic queues, you can check whether the master or a slave has connected to the topic by opening the ActiveMQ admin page at http://localhost:8161/admin/topics.jsp. Refresh the page (F5) to see the current state.

Testing

On the 'Master', create a document in /db/mycollection/ (e.g. using the Java client or eXide, logged in as admin). The document will be automatically replicated to all of the slaves in the system.
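
As an alternative to the Java client or eXide, a test document could also be stored from a script. The sketch below builds an HTTP PUT request for eXist's REST interface. It makes several assumptions that are not taken from this page: that the REST interface is reachable at http://localhost:8080/exist/rest, that the admin account has an empty password, and the helper names build_store_request and store_document, which are hypothetical.

```python
import base64
import urllib.request


def build_store_request(xml_bytes, name, collection="/db/mycollection",
                        rest_base="http://localhost:8080/exist/rest",
                        user="admin", password=""):
    """Build (but do not send) a PUT request that stores one XML document."""
    url = rest_base.rstrip("/") + "/" + collection.strip("/") + "/" + name
    request = urllib.request.Request(url, data=xml_bytes, method="PUT")
    request.add_header("Content-Type", "application/xml")
    # HTTP Basic authentication header for the assumed account.
    token = base64.b64encode((user + ":" + password).encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


def store_document(request):
    """Send the request to the master and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Example (requires a running master):
#   store_document(build_store_request(b"<test/>", "test.xml"))
```

After the call succeeds on the master, the same document should appear in /db/mycollection on each slave.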

Performance

With eXide, we can upload an XML document of approximately 50 KB to the master, e.g., /db/mydoc.xml. Then, when we execute the following query on the master, 2001 files (mydoc1000.xml to mydoc3000.xml) will be created on that server and replicated to the slaves.

 (: Store copies of the uploaded document as mydoc1000.xml ... mydoc3000.xml :)
 let $doc := doc('/db/mydoc.xml')
 for $i in (1000 to 3000)
 return
   xmldb:store('/db/mycollection', concat('mydoc', $i, '.xml'), $doc)

Debugging Tips

Set the Log4j logging level to DEBUG.

On the Master system you should see the following lines:

 2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (Collection.java [storeXMLInternal]:1339) - document stored. 
 2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (ClusterTrigger.java [afterCreateDocument]:63) - /db/mycollection/mydoc1000.xml 
 2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (NativeSerializer.java [serializeToReceiver]:112) - serializing document 1430 (/db/mycollection/mydoc1000.xml) to SAX took 0 msec 
 2012-06-19 13:26:43,419 [eXistThread-90] DEBUG (JMSMessageSender.java [sendMessage]:156) - Message sent with id: ID:Dan-PC12-51166-1340109804913-3:1:1:1:1 


On the Slave system you should see the following:

 2012-06-19 13:48:05,875 [DefaultQuartzScheduler_Worker-2] DEBUG (NotificationService.java [debug]:94) - Registered UpdateListeners: 
 2012-06-19 13:50:06,218 [ActiveMQ Session Task-1] DEBUG (eXistJMSListener.java [onMessage]:138) - CREATE_UPDATE : DOCUMENT from /db/mycollection/mydoc1000.xml
 2012-06-19 13:50:06,234 [ActiveMQ Session Task-1] DEBUG (ConfigurationHelper.java [getExistHome]:55) - Got eXist home from broker: C:\ws\exist-trunk\eXist

Other Configurations

Because messaging is such a general-purpose way to communicate between computer systems, many other business problems can be solved by variations of this first example. Replication can be used not only for increased reliability but also in conjunction with load balancing and auto-scaling to increase performance when a system is under heavy load.

Messages can also be used to distribute queries among many nodes, each with its own data collection. The results of the queries are placed on a results queue and returned to the user as though they were using a single very fast server.

Because the master eXist system only needs to place an update event on a message queue, you are then free to use message stores in many different configurations to distribute both data and programs to remote sites with varying degrees of reliability.

Static or Dynamic Queues

There are two options for creating message topics or queues:

  1. Static: you define in ActiveMQ's configuration file which topics or queues must be created
  2. Dynamic: the topics or queues are created when you use them for the first time
