How to Run AEM with TarMK Cold Standby
The Cold Standby capacity of the Tar Micro Kernel allows one or more standby AEM instances to connect to a primary instance. The sync process is one way only, meaning that it is only done from the primary to the standby instances.
The purpose of the standby instances is to guarantee a live data copy of the master repository and ensure a quick switch without data loss in case the master is unavailable for any reason.
Content is synced linearly between the primary instance and the standby instances without any integrity checks for file or repository corruption. Because of this design, standby instances are exact copies of the primary instance and cannot help to mitigate inconsistencies on primary instances.
The Cold Standby feature is meant to secure scenarios where high availability is required on author instances. For situations where high availability is required on publish instances using the Tar Micro Kernel, Adobe recommends using a publish farm.
For info on more available deployments, see the Recommended Deployments page.
How it works
On the primary AEM instance, a TCP port is opened and listens for incoming messages. Currently, there are two types of messages that the standby instances send to the primary:
- a message requesting the segment ID of the current head
- a message requesting segment data with a specified ID
The standby periodically requests the segment ID of the current head of the primary. If the segment is locally unknown it will be retrieved. If it's already present the segments are compared and referenced segments will be requested too, if necessary.
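The sync cycle described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `Segment`, `Primary`, and `sync_once` names are hypothetical and do not come from Oak, the real protocol runs over TCP, and the comparison of already-present segments is simplified here to walking the references of the local copy.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical segment: an ID, a payload, and the IDs it references."""
    segment_id: str
    data: bytes
    references: list = field(default_factory=list)


class Primary:
    """Stand-in for the primary; answers the two message types."""

    def __init__(self, segments, head_id):
        self.segments = {s.segment_id: s for s in segments}
        self.head_id = head_id

    def get_head_id(self):
        # Message type 1: segment ID of the current head.
        return self.head_id

    def get_segment(self, segment_id):
        # Message type 2: segment data for a specified ID.
        return self.segments[segment_id]


def sync_once(primary, local_store):
    """Run one sync cycle and return the IDs of the segments fetched."""
    fetched = []
    pending = [primary.get_head_id()]
    seen = set()
    while pending:
        segment_id = pending.pop()
        if segment_id in seen:
            continue
        seen.add(segment_id)
        if segment_id not in local_store:
            # Locally unknown segment: retrieve it from the primary.
            local_store[segment_id] = primary.get_segment(segment_id)
            fetched.append(segment_id)
        # Follow references so missing referenced segments are requested too.
        pending.extend(local_store[segment_id].references)
    return fetched


demo_primary = Primary(
    [Segment("head", b"h", ["child"]), Segment("child", b"c")], "head"
)
demo_store = {}
print(sync_once(demo_primary, demo_store))  # first cycle fetches both segments
print(sync_once(demo_primary, demo_store))  # second cycle has nothing to fetch
```

A standby would repeat `sync_once` at its configured interval; when the head has not changed, a cycle transfers no data.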
Standby instances do not receive requests of any type, because they run in sync-only mode. The only section available on a standby instance is the Web Console, which facilitates bundle and services configuration.
A typical TarMK Cold Standby deployment:
The data flow is designed to detect and handle connection and network-related problems automatically. All packets are bundled with checksums, and as soon as problems with the connection or damaged packets occur, retry mechanisms are triggered.
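The general idea of checksummed packets with retries can be illustrated as follows. This is a minimal sketch, not the wire format used by TarMK Cold Standby: the CRC32 checksum, the 4-byte prefix layout, and the `pack`, `unpack`, and `receive_with_retry` names are all assumptions made for the example.

```python
import zlib


class CorruptPacket(Exception):
    """Raised when a packet's checksum does not match its payload."""


def pack(payload: bytes) -> bytes:
    # Assumed layout: 4-byte big-endian CRC32 followed by the payload.
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack(packet: bytes) -> bytes:
    expected = int.from_bytes(packet[:4], "big")
    payload = packet[4:]
    if zlib.crc32(payload) != expected:
        raise CorruptPacket("checksum mismatch")
    return payload


def receive_with_retry(fetch, max_attempts=3):
    """Call fetch() until a packet verifies or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return unpack(fetch())
        except (CorruptPacket, ConnectionError) as error:
            # Damaged packet or connection problem: trigger a retry.
            last_error = error
    raise last_error
```

Here `fetch` is any callable that returns one raw packet, so the same loop covers both damaged data and dropped connections.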
Enabling TarMK Cold Standby on the primary instance has almost no measurable impact on performance. The additional CPU consumption is very low and the extra hard disk and network IO should not produce any performance issues.
On the standby you can expect high CPU consumption during the sync process. Due to the fact that the procedure is not multithreaded it cannot be sped up by using multiple cores. If no data is changed or transferred there will be no measurable activity. The connection speed will vary depending on the hardware and network environment but it does not depend on the size of the repository or SSL use. You should keep this in mind when estimating the time needed for an initial sync or when much data was changed in the meantime on the primary node.
Assuming that all the instances run in the same intranet security zone, the risk of a security breach is greatly reduced. Nevertheless, you can add an extra security layer by enabling SSL connections between the standby instances and the primary. Doing so reduces the possibility that the data is compromised by a man-in-the-middle.
Furthermore, you can specify the standby instances that are allowed to connect by restricting the IP address of incoming requests. This should help to guarantee that no one in the intranet can copy the repository.
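To make the idea of restricting incoming IP addresses concrete, here is a small sketch of an allow-list check. The entry syntax (single addresses and `start-end` ranges) and the `is_allowed` helper are assumptions for illustration; check the service's own configuration description for the syntax it actually accepts.

```python
import ipaddress


def is_allowed(client_ip, allowed_ranges):
    """Return True if client_ip matches any entry in allowed_ranges.

    Assumed entry forms: "10.0.0.5" (single address) or
    "10.0.0.1-10.0.0.50" (inclusive range).
    """
    address = ipaddress.ip_address(client_ip)
    for entry in allowed_ranges:
        if "-" in entry:
            start_text, end_text = entry.split("-", 1)
            start = ipaddress.ip_address(start_text.strip())
            end = ipaddress.ip_address(end_text.strip())
            # Only compare addresses of the same IP version.
            if start.version == address.version == end.version:
                if start <= address <= end:
                    return True
        elif address == ipaddress.ip_address(entry.strip()):
            return True
    return False


print(is_allowed("10.20.30.40", ["10.20.30.1-10.20.30.254"]))
```

An empty list rejects every client in this sketch; whether the real service treats an empty setting as "allow all" is not covered here.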
It is recommended that a load balancer be added between the Dispatcher and the servers that are part of the Cold Standby setup. The load balancer should be configured to direct user traffic only to the primary instance in order to ensure consistency and prevent content from getting copied to the standby instance by means other than the Cold Standby mechanism.
Creating an AEM TarMK Cold Standby setup
The PID for the Segment node store and the Standby store service has changed in AEM 6.3 compared to the previous versions as follows:
- from org.apache.jackrabbit.oak.plugins.segment.standby.store.StandbyStoreService to org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService
- from org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStoreService to org.apache.jackrabbit.oak.segment.SegmentNodeStoreService
Make sure you make the necessary configuration adjustments to reflect this change.
In order to create a TarMK cold standby setup, you first need to create the standby instances by performing a file system copy of the entire installation folder of the primary to a new location. You can then start each instance with a runmode that will specify its role ( primary or standby ).
Below is the procedure that needs to be followed in order to create a setup with one master and one standby instance:
- Install AEM.
- Shut down your instance, and copy its installation folder to the location where the cold standby instance will run from. Even if run from different machines, make sure to give each folder a descriptive name (like aem-primary or aem-standby ) to differentiate between the instances.
- Go to the installation folder of the primary instance and:
- Check for and delete any previous OSGi configurations you might have under aem-primary/crx-quickstart/install
- Create a folder called install.primary under aem-primary/crx-quickstart/install
- Create the required configurations for the preferred node store and data store under aem-primary/crx-quickstart/install/install.primary
- Create a file called org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config in the same location and configure it accordingly. For more information on the configuration options, see Configuration.
- If you are using an AEM TarMK instance with an external data store, create a folder named crx3 under aem-primary/crx-quickstart/install
- Place the data store configuration file in the crx3 folder.
If, for example, you are running an AEM TarMK instance with an external File Data Store, you need configuration files for the Segment node store service, the Standby store service, and the File Data Store.
Below you'll find sample configurations for the primary instance:
Sample of org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
customBlobStore=B"true"
standby=B"false"
Sample of org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
mode="primary"
port=I"8023"
Sample of org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
org.apache.sling.installer.configuration.persist=B"false"
path="./crx-quickstart/repository/datastore"
minRecordLength=I"16384"
- Start the primary making sure you specify the primary runmode:
java -jar quickstart.jar -r primary,crx3,crx3tar
- Create a new Apache Sling Logging Logger for the org.apache.jackrabbit.oak.segment package. Set log level to “Debug” and point its log output to a separate logfile, like /logs/tarmk-coldstandby.log . For more information, see Logging .
- Go to the location of the standby instance and start it by running the jar.
- Create the same logging configuration as for the primary. Then, stop the instance.
- Next, prepare the standby instance. You can do this by performing the same steps as for the primary instance:
- Delete any files you might have under aem-standby/crx-quickstart/install.
- Create a new folder called install.standby under aem-standby/crx-quickstart/install
- Create two configuration files called:
- org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
- org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
- Create a new folder called crx3 under aem-standby/crx-quickstart/install
- Create the data store configuration and place it under aem-standby/crx-quickstart/install/crx3. For this example, the file you need to create is:
- org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
- Edit the files and create the necessary configurations.
Below are sample configuration files for a typical standby instance:
Sample of org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
name="Oak-Tar"
service.ranking=I"100"
standby=B"true"
customBlobStore=B"true"
Sample of org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config
org.apache.sling.installer.configuration.persist=B"false"
mode="standby"
primary.host="127.0.0.1"
port=I"8023"
secure=B"false"
interval=I"5"
standby.autoclean=B"true"
Sample of org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config
org.apache.sling.installer.configuration.persist=B"false"
path="./crx-quickstart/repository/datastore"
minRecordLength=I"16384"
- Start the standby instance by using the standby runmode:
java -jar quickstart.jar -r standby,crx3,crx3tar
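The folder layout and sample configuration files from the procedure above can also be generated with a script. The following is a sketch only: the property values are copied from the samples in this section, while the `write_configs` helper and the use of a scratch directory are assumptions for illustration and should be adapted to your actual installation paths.

```python
import tempfile
from pathlib import Path

PERSIST = "org.apache.sling.installer.configuration.persist"
NODE_STORE = "org.apache.jackrabbit.oak.segment.SegmentNodeStoreService.config"
STANDBY_SERVICE = (
    "org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService.config"
)
DATA_STORE = "org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config"

DATA_STORE_PROPS = {
    PERSIST: 'B"false"',
    "path": '"./crx-quickstart/repository/datastore"',
    "minRecordLength": 'I"16384"',
}

# Paths are relative to <instance>/crx-quickstart/install.
PRIMARY_FILES = {
    f"install.primary/{NODE_STORE}": {
        PERSIST: 'B"false"',
        "customBlobStore": 'B"true"',
        "standby": 'B"false"',
    },
    f"install.primary/{STANDBY_SERVICE}": {
        PERSIST: 'B"false"',
        "mode": '"primary"',
        "port": 'I"8023"',
    },
    f"crx3/{DATA_STORE}": DATA_STORE_PROPS,
}

STANDBY_FILES = {
    f"install.standby/{NODE_STORE}": {
        PERSIST: 'B"false"',
        "name": '"Oak-Tar"',
        "service.ranking": 'I"100"',
        "standby": 'B"true"',
        "customBlobStore": 'B"true"',
    },
    f"install.standby/{STANDBY_SERVICE}": {
        PERSIST: 'B"false"',
        "mode": '"standby"',
        "primary.host": '"127.0.0.1"',
        "port": 'I"8023"',
        "secure": 'B"false"',
        "interval": 'I"5"',
        "standby.autoclean": 'B"true"',
    },
    f"crx3/{DATA_STORE}": DATA_STORE_PROPS,
}


def write_configs(install_dir, files):
    """Write one key=value line per property; return the written paths."""
    written = []
    for relative_path, properties in files.items():
        target = Path(install_dir) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        body = "".join(f"{key}={value}\n" for key, value in properties.items())
        target.write_text(body, encoding="utf-8")
        written.append(target)
    return written


# Demonstration in a scratch directory rather than a real installation.
with tempfile.TemporaryDirectory() as scratch:
    install_dir = Path(scratch) / "aem-standby" / "crx-quickstart" / "install"
    for path in write_configs(install_dir, STANDBY_FILES):
        print(path.relative_to(scratch))
```

Keeping the role-specific files under install.primary and install.standby means the same script can prepare either instance, with the runmode deciding which folder is picked up.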
The service can also be configured via the Web Console, by:
- Going to the Web Console at: https://serveraddress:serverport/system/console/configMgr
- Looking for a service called Apache Jackrabbit Oak Segment Tar Cold Standby Service and double-clicking it to edit the settings.
- Saving the settings, and restarting the instances so the new settings can take effect.
You can check the role of an instance at any time by checking the presence of the primary or standby runmodes in the Sling Settings Web Console.
This can be done by going to http://localhost:4502/system/console/status-slingsettings and checking the "Run Modes" line.
First time synchronization
After the preparation is complete and the standby is started for the first time there will be heavy network traffic between the instances as the standby catches up to the primary. You can consult the logs to observe the status of the synchronization.
In the standby tarmk-coldstandby.log , you will see entries such as these:
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.standby.store.StandbyStore trying to read segment ec1f739c-0e3c-41b8-be2e-5417efc05266
*DEBUG* [nioEventLoopGroup-3-1] org.apache.jackrabbit.oak.segment.standby.codec.SegmentDecoder received type 1 with id ec1f739c-0e3c-41b8-be2e-5417efc05266 and size 262144
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.standby.store.StandbyStore got segment ec1f739c-0e3c-41b8-be2e-5417efc05266 with size 262144
*DEBUG* [defaultEventExecutorGroup-2-1] org.apache.jackrabbit.oak.segment.file.TarWriter Writing segment ec1f739c-0e3c-41b8-be2e-5417efc05266 to /mnt/crx/author/crx-quickstart/repository/segmentstore/data00016a.tar
In the standby’s error.log , you should see an entry such as this:
*INFO* [FelixStartLevel] org.apache.jackrabbit.oak.segment.standby.store.StandbyStoreService started standby sync with 10.20.30.40:8023 at 5 sec.
In the above log snippet, 10.20.30.40 is the IP address of the primary.
In the primary tarmk-coldstandby.log , you will see entries such as these:
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.store.CommunicationObserver got message ‘s.d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd’ from client c7a7ce9b-1e16-488a-976e-627100ddd8cd
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.server.StandbyServerHandler request segment id d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.server.StandbyServerHandler sending segment d45f53e4-0c33-4d4d-b3d0-7c552c8e3bbd to /10.20.30.40:34998
*DEBUG* [nioEventLoopGroup-3-2] org.apache.jackrabbit.oak.segment.standby.store.CommunicationObserver did send segment with 262144 bytes to client c7a7ce9b-1e16-488a-976e-627100ddd8cd
In this case, the "client" mentioned in the log is the standby instance.
Once these entries stop appearing in the log, you can safely assume that the syncing process is complete.
While the above entries show that the polling mechanism is functioning properly, it is often useful to understand if there is any data being synchronized as polling is occurring. To do so, look for entries like the following:
*DEBUG* [defaultEventExecutorGroup-156-1] org.apache.jackrabbit.oak.segment.file.TarWriter Writing segment 3a03fafc-d1f9-4a8f-a67a-d0849d5a36d5 to /<<CQROOTDIRECTORY>>/crx-quickstart/repository/segmentstore/data00014a.tar
Additionally, when running with a non shared FileDataStore , messages like the following will confirm that the binary files are being properly transmitted:
*DEBUG* [nioEventLoopGroup-228-1] org.apache.jackrabbit.oak.segment.standby.codec.ReplyDecoder received blob with id eb26faeaca7f6f5b636f0ececc592f1fd97ea1a9#169102 and size 169102
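If you want a quick tally of what the standby has received, the log entries shown above can be summarized with a short script. This is a sketch that assumes the message wording matches the samples in this section exactly; the `summarize` function and its result keys are hypothetical names, and the patterns may need adjusting for other AEM or Oak versions.

```python
import re

# Patterns derived from the sample standby log entries in this section.
SEGMENT_RE = re.compile(
    r"StandbyStore got segment (?P<id>[0-9a-f-]{36}) with size (?P<size>\d+)"
)
BLOB_RE = re.compile(
    r"ReplyDecoder received blob with id (?P<id>\S+) and size (?P<size>\d+)"
)


def summarize(lines):
    """Count received segments and blobs and total their sizes in bytes."""
    summary = {"segments": 0, "segment_bytes": 0, "blobs": 0, "blob_bytes": 0}
    for line in lines:
        segment = SEGMENT_RE.search(line)
        if segment:
            summary["segments"] += 1
            summary["segment_bytes"] += int(segment.group("size"))
            continue
        blob = BLOB_RE.search(line)
        if blob:
            summary["blobs"] += 1
            summary["blob_bytes"] += int(blob.group("size"))
    return summary


def summarize_file(log_path):
    """Summarize a log file such as the standby's tarmk-coldstandby.log."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)
```

Running `summarize_file` twice a few minutes apart and comparing the totals gives a rough indication of whether data is still being transferred.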
The following OSGi settings are available for the Cold Standby service:
- Persist Configuration: if enabled, this will store the configuration in the repository instead of the traditional OSGi configuration files. It is recommended to keep this setting disabled on production systems so that the primary configuration will not be pulled by the standby.
- Mode ( mode ): this will choose the runmode of the instance.
- Port (port): the port to use for communication. The default is 8023 .
- Primary host ( primary.host ): the host of the primary instance. This setting is only applicable for the standby.
- Sync interval ( interval ): this setting determines the interval, in seconds, between sync requests and is only applicable for the standby instance.
- Allowed IP-Ranges ( primary.allowed-client-ip-ranges ): the IP ranges that the primary will allow connections from.
- Secure ( secure ): Enable SSL encryption. In order to make use of this setting, it must be enabled on all instances.
- Standby Read Timeout ( standby.readtimeout ): Timeout, in milliseconds, for requests issued from the standby instance. The recommended setting is 43200000, which corresponds to 12 hours. It is generally advised that you set the timeout to at least this value.
- Standby Automatic Cleanup ( standby.autoclean ): Call the cleanup method if the size of the store increases on a sync cycle.
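Because the read timeout is expressed in milliseconds, it is easy to be off by a factor when entering it. The short sketch below converts hours to milliseconds and renders a configuration line. The `I"..."` type prefix for standby.readtimeout is an assumption made by analogy with the port and interval samples above, and both helper names are hypothetical.

```python
def hours_to_millis(hours):
    """Convert a duration in hours to whole milliseconds."""
    return int(hours * 60 * 60 * 1000)


def read_timeout_line(hours):
    """Render a standby.readtimeout config line (type prefix assumed)."""
    millis = hours_to_millis(hours)
    # Assumption: the value is stored as a 32-bit integer, so guard the range.
    if millis > 2**31 - 1:
        raise ValueError("timeout does not fit in a 32-bit integer")
    return f'standby.readtimeout=I"{millis}"'


print(read_timeout_line(12))  # prints standby.readtimeout=I"43200000"
```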
It is highly recommended that the primary and standby have different repository IDs in order to make them separately identifiable for services like Offloading.
The best way to make sure this is covered is by deleting the sling.id file on the standby and restarting the instance.
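Locating the file before deleting it can be scripted. The sketch below searches an installation folder for files whose name starts with sling.id and removes them only when explicitly asked to. The exact file name and location vary between versions, so the `sling.id*` pattern is an assumption, and `remove_sling_id` is a hypothetical helper; run it only while the standby instance is stopped.

```python
from pathlib import Path


def find_sling_id_files(install_dir):
    """Return files below install_dir whose name starts with 'sling.id'."""
    return sorted(
        path for path in Path(install_dir).rglob("sling.id*") if path.is_file()
    )


def remove_sling_id(install_dir, dry_run=True):
    """List matching files; delete them only when dry_run is False."""
    matches = find_sling_id_files(install_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_sling_id("aem-standby")` first with the default `dry_run=True` shows what would be deleted, which is a reasonable precaution before touching a repository copy.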
In case the primary instance fails for any reason, you can set one of the standby instances to take the role of the primary by changing the start runmode as detailed below:
The configuration files also need to be modified so that they match the settings used for the primary instance.
- Go to the location where the standby instance is installed, and stop it.
- In case you have a load balancer configured with the setup, you can remove the primary from the load balancer's configuration at this point.
- Back up the crx-quickstart folder from the standby installation folder. It can be used as a starting point when setting up a new standby.
- Restart the instance using the primary runmode:
java -jar quickstart.jar -r primary,crx3,crx3tar
- Add the new primary to the load balancer.
- Create and start a new standby instance. For more info, see the procedure above on Creating an AEM TarMK Cold Standby Setup .
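The backup and restart steps of the failover procedure above can be sketched as two small helpers. This is an illustration only: `backup_crx_quickstart` and `start_command` are hypothetical names, the timestamped folder naming is an assumption, and the command is only built, not executed.

```python
import shutil
import time
from pathlib import Path


def backup_crx_quickstart(install_dir, backup_root, label=None):
    """Copy <install_dir>/crx-quickstart to a labelled folder in backup_root."""
    source = Path(install_dir) / "crx-quickstart"
    if not source.is_dir():
        raise FileNotFoundError(source)
    label = label or time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"crx-quickstart-{label}"
    # copytree fails if the target exists, which avoids overwriting a backup.
    shutil.copytree(source, target)
    return target


def start_command(role):
    """Build the start command with the runmodes used in this document."""
    if role not in ("primary", "standby"):
        raise ValueError(f"unknown role: {role}")
    return ["java", "-jar", "quickstart.jar", "-r", f"{role},crx3,crx3tar"]


print(" ".join(start_command("primary")))
```

The instance must be stopped before the copy is taken, as in the procedure; the sketch does not check for a running process.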
Applying Hotfixes to a Cold Standby Setup
The recommended way to apply hotfixes to a cold standby setup is by installing them to the primary instance and then cloning it into a new cold standby instance with the hotfixes installed.
You can do this by following the steps outlined below:
- Stop the synchronization process on the cold standby instance by going to the JMX Console and using the org.apache.jackrabbit.oak: Status ("Standby") bean. For more information on how to do this, see the section on Monitoring .
- Stop the cold standby instance.
- Install the hotfix on the primary instance. For more details on how to install a hotfix, see How to Work With Packages .
- Test the instance for issues after the installation.
- Remove the cold standby instance by deleting its installation folder.
- Stop the primary instance and clone it by performing a file system copy of its entire installation folder to the location of the cold standby.
- Reconfigure the newly created clone to act as a cold standby instance. For additional details, see Creating an AEM TarMK Cold Standby Setup.
- Start both the primary and the cold standby instances.
The feature exposes information through JMX MBeans, so you can inspect the current state of the standby and the primary using the JMX console. The information can be found in an MBean of type org.apache.jackrabbit.oak:type="Standby" named Status.
When you observe a standby instance, one node is exposed. The ID is usually a generic UUID.
This node has five read-only attributes:
- Running: boolean value indicating whether the sync process is running or not.
- Mode: Client: followed by the UUID used to identify the instance. Note that this UUID will change every time the configuration is updated.
- Status: a textual representation of the current state (like running or stopped ).
- FailedRequests: the number of consecutive errors.
- SecondsSinceLastSuccess: the number of seconds since the last successful communication with the server. It will display -1 if no successful communication has been made.
There are also three invokable methods:
- start(): starts the sync process.
- stop(): stops the sync process.
- cleanup(): runs the cleanup operation on the standby.
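The read-only attributes listed above lend themselves to a simple health check. The sketch below takes the attribute values as a plain dictionary and reports problems. How the values are read from JMX is left out, and the `standby_problems` name and the 60-second threshold are assumptions chosen for the example.

```python
def standby_problems(attributes, max_seconds_since_success=60):
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    if not attributes.get("Running", False):
        problems.append("sync process is not running")
    failed = attributes.get("FailedRequests", 0)
    if failed > 0:
        problems.append(f"{failed} consecutive failed requests")
    seconds = attributes.get("SecondsSinceLastSuccess", -1)
    if seconds == -1:
        # -1 means no successful communication has been made.
        problems.append("no successful communication with the primary yet")
    elif seconds > max_seconds_since_success:
        problems.append(f"last success was {seconds} seconds ago")
    return problems


sample = {"Running": True, "FailedRequests": 0, "SecondsSinceLastSuccess": 4}
print(standby_problems(sample))  # prints []
```

A monitoring job could call this with freshly read attribute values and raise an alert whenever the returned list is not empty.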
Observing the primary exposes some general information via an MBean whose ID value is the port number the TarMK standby service is using (8023 by default). Most of the methods and attributes are the same as for the standby, but some differ:
- Mode: will always show the value primary .
Furthermore, information for up to 10 clients (standby instances) that are connected to the primary can be retrieved. The MBean ID is the UUID of the instance. There are no invokable methods for these MBeans, but there are some very useful read-only attributes:
- Name: the ID of the client.
- LastSeenTimestamp: the timestamp of the last request in a textual representation.
- LastRequest: the last request of the client.
- RemoteAddress: the IP address of the client.
- RemotePort: the port the client used for the last request.
- TransferredSegments: the total number of segments transferred to this client.
- TransferredSegmentBytes: the total number of bytes transferred to this client.
Cold Standby Repository Maintenance
Do not run offline revision cleanup on the standby. It is not needed and it will not reduce the segmentstore size.
If you run Online Revision Cleanup on the primary instance, the manual procedure presented below is not needed. Additionally, if you are using Online Revision Cleanup, the cleanup() operation on the standby instance will be performed automatically.
Adobe recommends running maintenance on a regular basis to prevent excessive repository growth over time. To manually perform cold standby repository maintenance, follow the steps below:
- Stop the standby process on the standby instance by going to the JMX Console and using the org.apache.jackrabbit.oak: Status ("Standby") bean. For more info on how to do this, see the above section on Monitoring .
- Stop the primary AEM instance.
- Run the oak compaction tool on the primary instance. For more details, see Maintaining the Repository .
- Start the primary instance.
- Start the standby process on the standby instance using the same JMX bean as described in the first step.
- Watch the logs and wait for synchronization to complete. It is possible that substantial growth in the standby repository will be seen at this time.
- Run the cleanup() operation on the standby instance, using the same JMX bean as described in the first step.
It may take longer than usual for the standby instance to complete synchronization with the primary as offline compaction effectively rewrites the repository history, thus making computation of the changes in the repositories take more time. It should also be noted that once this process completes, the size of the repository on the standby will be roughly the same size as the repository on the primary.
As an alternative, the primary repository can be copied over to the standby manually after running compaction on the primary, essentially rebuilding the standby each time compaction runs.
Data Store Garbage Collection
It is important to run garbage collection on file datastore instances from time to time as otherwise, deleted binaries will remain on the filesystem, eventually filling up the drive. To run garbage collection, follow the below procedure:
- Run cold standby repository maintenance as described in the section above .
- After the maintenance process has completed and the instances have been restarted:
- On the primary, run the data store garbage collection via the relevant JMX bean as described in this article.
- On the standby, the data store garbage collection is available only via the BlobGarbageCollection MBean - startBlobGC() . The RepositoryManagement MBean is not available on the standby.
In case you are not using a shared data store, garbage collection will first have to be run on the primary and then on the standby.