You want to configure two or more eXist instances to work together to automatically synchronize collection-specific data sets. This allows you to scale your eXist server capacity. For example, with multiple eXist servers configured to stay in sync as described below, you could add a load-balancer to distribute the load of incoming queries across the pool of servers and still maintain high performance.
We will use the new eXist clustering options available only in eXist-db 2.1dev (bleeding edge) developer edition. This feature is based on using collection triggers to trigger "update" messages from a master server to one or more slave systems on remote hosts.
NOTE: This page is under development.
In the long term there will be many different ways that the eXist clustering system might be configured. In this tutorial we will only cover collection-based replication from a single master to multiple slave systems. The figure below describes the relationship between the nodes in this replication configuration.
We will use the following terms in this document:
- Master - a specially configured eXist instance where active changes to documents and collections are happening, for example XML updates or documents being stored, updated, or deleted. The master is considered the "publisher" of change events. This is also the server that must have an ActiveMQ server running.
- Slaves - a collection of one or more specially configured eXist instances that automatically receive updates when changes occur on the master. Each slave is considered a "subscriber" to the change events on the master.
- Message Store - a location outside the master server's eXist instance where all update events are stored and then forwarded to remote systems when they are ready to receive them. Although the term "Message Queue" is frequently used, in our case we will be using a "Topic", not a "Queue", for distributing messages to remote systems. ActiveMQ provides this function on the master server.
When any document changes in a configured collection, an update message is placed in the message store. The update stays there until all subscribers have received it.
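The difference between a "Topic" and a "Queue" can be illustrated with a small in-memory model. This is only a sketch of the general JMS delivery semantics, not of ActiveMQ or eXist internals; the class names `Topic` and `Queue` and the message text are made up for the illustration.

```python
from collections import deque


class Topic:
    """Publish/subscribe: every subscriber receives its own copy of each message."""

    def __init__(self):
        self.inboxes = {}  # subscriber name -> pending messages

    def subscribe(self, name):
        self.inboxes[name] = deque()

    def publish(self, message):
        # Fan out: one copy per subscriber.
        for inbox in self.inboxes.values():
            inbox.append(message)

    def receive(self, name):
        inbox = self.inboxes[name]
        return inbox.popleft() if inbox else None


class Queue:
    """Point-to-point: each message is consumed by exactly one receiver."""

    def __init__(self):
        self.messages = deque()

    def publish(self, message):
        self.messages.append(message)

    def receive(self, name):
        # Whoever asks first gets the message; it is then gone for everyone else.
        return self.messages.popleft() if self.messages else None


topic = Topic()
topic.subscribe("slave1")
topic.subscribe("slave2")
topic.publish("store /db/mycollection/a.xml")

queue = Queue()
queue.publish("store /db/mycollection/a.xml")
```

With the topic, both `slave1` and `slave2` can receive the update; with the queue, only the first receiver would see it, which is why a topic is the natural fit for replicating to several slaves.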
NOTE: To be confirmed with the developers: Any collection on any system can be configured as a master or a slave to other collections on other systems.
NOTE: Once "durable" messages are implemented, slave systems will not need to be running when the changes are made. When a slave comes back up after being down, it will automatically be notified of all the changes made since it last communicated with the master.
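The idea behind durable subscriptions can be sketched as follows. This is a toy model of the general JMS concept, assuming a hypothetical `Broker` class; it does not describe how eXist or ActiveMQ implement it.

```python
class Broker:
    """Toy broker: durable subscribers keep accumulating messages while offline."""

    def __init__(self):
        self.backlog = {}     # subscriber name -> undelivered messages
        self.online = set()   # subscribers currently connected
        self.durable = set()  # subscribers registered as durable

    def subscribe(self, name, durable=False):
        self.backlog.setdefault(name, [])
        self.online.add(name)
        if durable:
            self.durable.add(name)

    def disconnect(self, name):
        self.online.discard(name)

    def publish(self, message):
        for name, pending in self.backlog.items():
            # Non-durable subscribers miss anything published while they are offline.
            if name in self.online or name in self.durable:
                pending.append(message)

    def reconnect_and_drain(self, name):
        self.online.add(name)
        pending, self.backlog[name] = self.backlog[name], []
        return pending


broker = Broker()
broker.subscribe("durable-slave", durable=True)
broker.subscribe("plain-slave")
broker.disconnect("durable-slave")
broker.disconnect("plain-slave")
broker.publish("update 1")
broker.publish("update 2")
```

After reconnecting, `durable-slave` would receive both updates, while `plain-slave` would receive nothing from the time it was offline.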
How Replication Works
The clustering replication configuration uses a standards-compliant messaging system built around the Java Message Service (JMS) standard. The implementation eXist uses is based on Apache ActiveMQ. ActiveMQ is widely used as "middleware" to help applications communicate in a reliable manner.
Download and Configure Apache ActiveMQ
- Download a recent version of ActiveMQ from http://activemq.apache.org/download.html. Note that the TGZ file has additional Unix (Linux, Mac OS X) support; the ZIP file is for Windows.
- Extract the content to disk; this location is referred to as $ACTIVEMQ_HOME
- Copy the $ACTIVEMQ_HOME/activemq-all-X.Y.Z.jar file to $EXIST_HOME/lib/user
For testing, I used activemq-all-5.6.0.jar.
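The copy step can also be scripted. The following is a minimal sketch, assuming the two directories are passed in explicitly; the function name `install_activemq_jar` is made up for this example.

```python
import glob
import os
import shutil


def install_activemq_jar(activemq_home, exist_home):
    """Copy activemq-all-X.Y.Z.jar from the ActiveMQ directory into eXist's lib/user."""
    pattern = os.path.join(activemq_home, "activemq-all-*.jar")
    jars = sorted(glob.glob(pattern))
    if not jars:
        raise FileNotFoundError(f"no activemq-all-*.jar found in {activemq_home}")
    target_dir = os.path.join(exist_home, "lib", "user")
    os.makedirs(target_dir, exist_ok=True)
    # If several jars match, the lexicographically last file name is used.
    return shutil.copy(jars[-1], target_dir)
```

It could be called as `install_activemq_jar(os.environ["ACTIVEMQ_HOME"], os.environ["EXIST_HOME"])` once both environment variables are set.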
Create eXist With Clustering Configuration
Note: The current work on clustering is being done in a branch of the eXist-db subversion repository. To build this branch, check out the following URL with a subversion client:
This code will be moved into the main trunk at a future time. Note that the extensions/local.properties has the following line in it that is not yet in the main trunk:
    # Clustering extension for reliable document replication
    include.feature.clustering = true
You can then build eXist by using the $EXIST_HOME/build.sh or build.bat.
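Before starting a build it can be useful to confirm that the clustering feature is switched on in the properties file. The helper below is a small sketch that reads simple `key = value` lines; the function name `feature_enabled` is an assumption for this example, and the parser does not handle every detail of the Java properties format (line continuations, escapes).

```python
def feature_enabled(properties_text, key):
    """Return True if the properties text sets `key` to true."""
    enabled = False
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            # A later definition overrides an earlier one.
            enabled = value.strip().lower() == "true"
    return enabled


sample = (
    "# Clustering extension for reliable document replication\n"
    "include.feature.clustering = true\n"
)
print(feature_enabled(sample, "include.feature.clustering"))  # True
```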
When you are done with the build you will see the following file:
Configure the Master Server
Add a collection.xconf file for the collection whose content must be distributed, e.g., for /db/mycollection/ the .xconf file must be stored in /db/system/config/db/mycollection/.
Create the collection '/db/mycollection'.
Fill in the hostname and port of the ActiveMQ message broker in the java.naming.provider.url parameter (here, "localhost:61616").
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <triggers>
            <trigger class="org.exist.replication.jms.publish.ReplicationTrigger">
                <parameter name="java.naming.factory.initial" value="org.apache.activemq.jndi.ActiveMQInitialContextFactory"/>
                <parameter name="java.naming.provider.url" value="tcp://localhost:61616"/>
                <parameter name="connectionfactory" value="ConnectionFactory"/>
                <parameter name="destination" value="dynamicTopics/eXistdb"/>
                <parameter name="client-id" value="id1"/>
            </trigger>
        </triggers>
    </collection>
In the sample above the collection is named "mycollection".
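The rule for where the configuration file lives (the collection path appended to /db/system/config) can be written down as a small helper. This is only an illustration of the path mapping described above; the function names are made up.

```python
CONFIG_ROOT = "/db/system/config"


def config_collection_for(collection_path):
    """Return the configuration collection that belongs to a data collection."""
    if collection_path != "/db" and not collection_path.startswith("/db/"):
        raise ValueError("collection paths are expected to start with /db")
    # Drop any trailing slash, then append the full path to the config root.
    return CONFIG_ROOT + collection_path.rstrip("/")


def config_document_for(collection_path):
    """Return the full path of the collection.xconf document."""
    return config_collection_for(collection_path) + "/collection.xconf"


print(config_document_for("/db/mycollection/"))
# /db/system/config/db/mycollection/collection.xconf
```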
Configure the Slave Servers
For each 'Slave', a startup job must be configured in conf.xml; the destination name must match the one used in the 'Master' configuration:
    <!--
        Start JMS listener for clustering feature.

        Parameters:
            java.naming.factory.initial   Initial context provider
            java.naming.provider.url      URL of message broker
            connectionfactory             Name of connection factory
            destination                   Name of destination (Topic or Queue)
            client-id                     (optional) ClientID. Leave out or set "" for default behaviour
    -->
    <job type="startup" name="clustering" class="org.exist.replication.jms.subscribe.MessageReceiverJob">
        <parameter name="java.naming.factory.initial" value="org.apache.activemq.jndi.ActiveMQInitialContextFactory"/>
        <parameter name="java.naming.provider.url" value="tcp://localhost:61616"/>
        <parameter name="connectionfactory" value="ConnectionFactory"/>
        <parameter name="destination" value="dynamicTopics/eXistdb"/>
        <parameter name="client-id" value="id2"/>
        <parameter name="subscriber-name" value="sub_name"/>
    </job>
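A mismatch between the master trigger and a slave job is easy to overlook. The sketch below compares the two XML fragments and reports two conditions that follow from general JMS behaviour: publisher and subscriber must use the same destination, and client IDs should differ between connections. The function names and the shortened sample fragments are assumptions for this illustration, not part of eXist.

```python
import xml.etree.ElementTree as ET


def parameters(xml_text):
    """Collect name/value pairs from all <parameter> elements."""
    params = {}
    for element in ET.fromstring(xml_text).iter():
        # Strip any "{namespace}" prefix so both files can be read the same way.
        if element.tag.rsplit("}", 1)[-1] == "parameter":
            params[element.get("name")] = element.get("value")
    return params


def check_pair(master_xconf, slave_job):
    """Return a list of human-readable problems (empty if none were found)."""
    master, slave = parameters(master_xconf), parameters(slave_job)
    problems = []
    if master.get("destination") != slave.get("destination"):
        problems.append("destination differs between master and slave")
    if master.get("client-id") and master.get("client-id") == slave.get("client-id"):
        problems.append("client-id should be unique per connection")
    return problems


MASTER = """<collection xmlns="http://exist-db.org/collection-config/1.0">
  <triggers>
    <trigger class="org.exist.replication.jms.publish.ReplicationTrigger">
      <parameter name="destination" value="dynamicTopics/eXistdb"/>
      <parameter name="client-id" value="id1"/>
    </trigger>
  </triggers>
</collection>"""

SLAVE = """<job type="startup" name="clustering"
     class="org.exist.replication.jms.subscribe.MessageReceiverJob">
  <parameter name="destination" value="dynamicTopics/eXistdb"/>
  <parameter name="client-id" value="id2"/>
</job>"""

print(check_pair(MASTER, SLAVE))  # []
```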
Start up the Servers
Start ActiveMQ server
    cd $ACTIVEMQ_HOME
    ./bin/activemq start
(On Mac OS X, use the wrapper in bin/macosx.)
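Before starting the eXist servers, it can help to confirm that the broker is accepting connections. The following is a minimal sketch that simply tries to open a TCP connection; the host and port defaults mirror the "tcp://localhost:61616" URL used in the configuration, and the function name is hypothetical. A successful connection only shows that something is listening on that port, not that it is a healthy ActiveMQ broker.

```python
import socket


def broker_is_listening(host="localhost", port=61616, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```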
Start eXist-db server on slave(s) and master
Start eXist on each slave server and create the collection that will mirror the master's collection:
    cd $EXISTSLAVE_HOME
    ./bin/startup.sh
Create the receiving collection '/db/mycollection'.
Then start eXist on the master server:
    cd $EXISTMASTER_HOME
    ./bin/startup.sh
(There is no need to create the collection on the master, since it was already created above.)
Test Document Distribution
On the 'Master', create a document in /db/mycollection/ (e.g. using the Java client or eXide; log in as admin). The document will be automatically replicated to all of the slaves in the system.
With eXide, we can upload an XML document of roughly 50 KB to the master, e.g., /db/mydoc.xml. Then, when we execute the following query, 2001 files (mydoc1000.xml to mydoc3000.xml) will be created on the master and replicated to the slaves.
    let $doc := doc('/db/mydoc.xml')
    for $i in (1000 to 3000)
    return
        xmldb:store('/db/mycollection', concat('mydoc', $i, ".xml"), $doc)
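To know how many documents to expect on each slave, note that the XQuery range `1000 to 3000` includes both endpoints. The short sketch below reproduces the file names the query generates; the function name is made up for this example.

```python
def replicated_names(first=1000, last=3000):
    """List the document names created by the test query."""
    # XQuery's "1000 to 3000" includes both endpoints, so add 1 to the upper bound.
    return [f"mydoc{i}.xml" for i in range(first, last + 1)]


names = replicated_names()
print(len(names), names[0], names[-1])  # 2001 mydoc1000.xml mydoc3000.xml
```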
To follow the replication in the log files, set the Log4j log level to DEBUG.
On the Master system you should see the following lines:
    2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (Collection.java [storeXMLInternal]:1339) - document stored.
    2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (ClusterTrigger.java [afterCreateDocument]:63) - /db/mycollection/mydoc1000.xml
    2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (NativeSerializer.java [serializeToReceiver]:112) - serializing document 1430 (/db/mycollection/mydoc1000.xml) to SAX took 0 msec
    2012-06-19 13:26:43,419 [eXistThread-90] DEBUG (JMSMessageSender.java [sendMessage]:156) - Message sent with id: ID:Dan-PC12-51166-1340109804913-3:1:1:1:1
On the Slave system you should see the following:
    2012-06-19 13:48:05,875 [DefaultQuartzScheduler_Worker-2] DEBUG (NotificationService.java [debug]:94) - Registered UpdateListeners:
    2012-06-19 13:50:06,218 [ActiveMQ Session Task-1] DEBUG (eXistJMSListener.java [onMessage]:138) - CREATE_UPDATE : DOCUMENT from /db/mycollection/mydoc1000.xml
    2012-06-19 13:50:06,234 [ActiveMQ Session Task-1] DEBUG (ConfigurationHelper.java [getExistHome]:55) - Got eXist home from broker: C:\ws\exist-trunk\eXist
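With thousands of documents it is impractical to compare the logs by eye. The sketch below extracts document paths from both logs and lists those the master announced but the slave never logged. The regular expressions are derived only from the sample log lines shown above and are an assumption; other eXist versions may log different text.

```python
import re

# Patterns based on the sample DEBUG lines; adjust if your log format differs.
MASTER_PATTERN = re.compile(r"\(ClusterTrigger\.java \[afterCreateDocument\]:\d+\) - (\S+)")
SLAVE_PATTERN = re.compile(r"CREATE_UPDATE : DOCUMENT from (\S+)")


def missing_on_slave(master_log, slave_log):
    """Return sorted paths created on the master that do not appear in the slave log."""
    sent = set(MASTER_PATTERN.findall(master_log))
    received = set(SLAVE_PATTERN.findall(slave_log))
    return sorted(sent - received)


master_log = (
    "2012-06-19 13:26:43,406 [eXistThread-90] DEBUG (ClusterTrigger.java "
    "[afterCreateDocument]:63) - /db/mycollection/mydoc1000.xml\n"
    "2012-06-19 13:26:43,506 [eXistThread-90] DEBUG (ClusterTrigger.java "
    "[afterCreateDocument]:63) - /db/mycollection/mydoc1001.xml\n"
)
slave_log = (
    "2012-06-19 13:50:06,218 [ActiveMQ Session Task-1] DEBUG (eXistJMSListener.java "
    "[onMessage]:138) - CREATE_UPDATE : DOCUMENT from /db/mycollection/mydoc1000.xml\n"
)
print(missing_on_slave(master_log, slave_log))  # ['/db/mycollection/mydoc1001.xml']
```

The second master line (mydoc1001.xml) is invented test data, included only to show a document that was sent but not yet received.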
Because messaging is such a general-purpose way to communicate between computer systems, there are many other possible business problems that can be solved by variations of this first example. Replication can be used not only for increased reliability but also in conjunction with load balancing and auto-scaling to increase performance when a system is under heavy load.
Messages can also be used to distribute queries among many nodes, each with its own data collection. The results of the queries are placed on a results queue and returned to the user as though they were using a single very fast server.
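The scatter-gather pattern described here can be sketched in a few lines. This is purely an illustration using in-process threads and Python's standard `queue.Queue` as the results queue; it is not an eXist feature, and the node contents and the `find_b` query are invented stand-ins.

```python
import queue
import threading


def scatter_gather(nodes, run_query):
    """Run the same query against every node and merge the partial results."""
    results = queue.Queue()

    def worker(node_data):
        # Each node places its partial result on the shared results queue.
        results.put(run_query(node_data))

    threads = [threading.Thread(target=worker, args=(data,)) for data in nodes]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    merged = []
    while not results.empty():
        merged.extend(results.get())
    # Sort so the caller sees a stable order regardless of which node answered first.
    return sorted(merged)


# Each inner list stands in for the documents held by one node.
nodes = [["a.xml", "b.xml"], ["c.xml"], ["b2.xml"]]


def find_b(documents):
    return [name for name in documents if name.startswith("b")]


print(scatter_gather(nodes, find_b))  # ['b.xml', 'b2.xml']
```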
Because the master eXist system only needs to place an update event on a message queue, you are then free to use message stores in many different configurations to distribute both data and programs to remote sites with varying degrees of reliability.