Data Store Garbage Collection


When a conventional WCM asset is removed, the reference to the underlying datastore record may be removed from the node hierarchy, but the datastore record itself remains. This unreferenced datastore record then becomes "garbage" that need not be retained. When a significant number of such garbage records accumulate, removing them frees storage space and improves backup and filesystem maintenance performance.

For the most part, a WCM application tends to collect information and not delete it very often. Although new images are added, even superseding old versions, the version control system retains the old ones and supports reverting to them if needed. Thus the majority of the content we think of as adding to the system is effectively stored permanently. So what is the typical source of "garbage" in the repository that we might want to clean up?

AEM uses the repository as the storage for a number of internal and housekeeping activities:

  • packages built and downloaded
  • temporary files created for publish replication
  • workflow payloads
  • assets created temporarily during DAM rendering

When any of these temporary objects is large enough to require storage in the datastore, and when the object eventually passes out of use, the datastore record itself remains as "garbage". In a typical WCM author/publish application, the largest source of garbage of this type is commonly the process of publish activation. When data is replicated to the publish instance, it is first gathered into collections in an efficient data format called "Durbo" and stored in the repository under /var/replication/data. The data bundles are often larger than the critical size threshold for the datastore and therefore end up stored as datastore records. When the replication is complete, the node in /var/replication/data is deleted, but the datastore record remains as "garbage".

Another source of recoverable garbage is packages. Package data, like everything else, is stored in the repository and thus, for packages larger than 4KB, in the datastore. In the course of a development project, or over time while maintaining a system, packages may be built and rebuilt many times, with each build resulting in a new datastore record and orphaning the previous build's record.

How does garbage collection work?

Datastore garbage collection is performed manually by the system administrator on an as-needed basis. In general, it is recommended that garbage collection be performed periodically, following an offline revision cleanup, but the following factors should be taken into account when planning garbage collection:

  • garbage collections take time and may impact performance, so they should be planned accordingly
  • removal of garbage records does not affect normal performance, so this is not a performance optimization
  • if storage utilization and related factors, such as backup times, are not a concern, then garbage collection can safely be deferred

The garbage collector first makes a note of the current timestamp when the process begins. The GC is then carried out using a mark/sweep algorithm, in two phases.

In the first phase, the garbage collector performs a comprehensive traversal of all of the repository content. For each content object that has a reference to a datastore record, it locates the file in the filesystem and performs a metadata update, modifying the "last modified" (MTIME) attribute. Files touched in this phase therefore become newer than the initial baseline timestamp.

In the second phase, the garbage collector traverses the physical directory structure of the datastore in much the same way as a "find". It examines the "last modified" (MTIME) attribute of each file and makes the following determination:

  • if the MTIME is newer than the initial baseline timestamp, then either the file was found in the first phase, or it is an entirely new file that was added to the repository while the GC process was ongoing. In either of these cases the record is taken to be active and the file shall not be deleted.
  • if the MTIME is prior to the initial baseline timestamp, then the file is not an actively referenced file and it is considered removable garbage.
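
The timestamp-based principle behind these two phases can be illustrated with ordinary shell commands. The sketch below is purely conceptual: it is not how Oak implements the collector, and the datastore path, baseline marker file, and list of referenced records are all hypothetical placeholders.

# Conceptual sketch only; paths and file names are hypothetical placeholders.
# Baseline: record the GC start time in a marker file.
touch /tmp/gc-baseline

# Phase 1 (mark): touch every datastore file that is still referenced by
# repository content, so its MTIME becomes newer than the baseline.
while read -r referenced_file; do
  touch "$referenced_file"
done < /tmp/referenced-records.txt

# Phase 2 (sweep): walk the datastore like "find" and list every file whose
# MTIME is still older than the baseline, i.e. unreferenced garbage.
find /path/to/datastore -type f ! -newer /tmp/gc-baseline -print
# A real sweep would delete these files; -print is used here for safety.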

This approach works well for a single node with a private datastore. However, the datastore may be shared, in which case potentially active references to datastore records from other repositories are not checked, and actively referenced files may be mistakenly removed. It is imperative that the system administrator understand whether the datastore is shared before planning any garbage collection, and only use the simple built-in GC process when it is known that the datastore is not shared.

Running Garbage Collection

There are two ways of running garbage collection, depending on the data store setup AEM is running on:

  1. Via offline revision cleanup.
  2. Via the JMX Console.

Revision cleanup is needed when TarMK is being used as both the node store and data store, while explicit garbage collection via the JMX console is used when the binary data is stored in an external data store like File System Data Store.

The below table shows the garbage collection type that needs to be used for all the supported data store deployments in AEM 6:

Node Store   Data Store            Garbage Collection Mechanism
TarMK        TarMK                 Revision Cleanup
TarMK        External Filesystem   JMX Console
MongoDB      MongoDB               Not needed
MongoDB      External Filesystem   JMX Console

Running Data Store Garbage Collection via the JMX Console

This section is about running data store garbage collection via the JMX Console. For instructions on how to run Revision Cleanup, see the documentation on Maintaining the Repository.

Note

If you are running TarMK with an external data store, you must run offline revision cleanup first in order for garbage collection to be effective.
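
As a rough sketch, offline revision cleanup on TarMK is typically run with the oak-run tool while the AEM instance is stopped. The oak-run version must match the Oak version of your instance, and the segment store path below is only the default location:

# Run with the AEM instance stopped; the oak-run version must match your Oak version.
java -jar oak-run-<version>.jar compact crx-quickstart/repository/segmentstore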

To run garbage collection:

  1. In the Apache Felix OSGi Management Console, open the Main tab and select JMX from the menu.

  2. Search for and click the Repository Manager MBean (or go to http://host:port/system/console/jmx/org.apache.jackrabbit.oak%3Aid%3D14%2Cname%3D%22repository+manager%22%2Ctype%3D%22RepositoryManagement%22).

  3. Click startDataStoreGC(boolean markOnly).

  4. Enter "true" for the markOnly parameter if required:

    Option            Description
    boolean markOnly  Set to true to only mark references and not sweep in the mark and sweep operation. This mode is to be used when the underlying BlobStore is shared between multiple different repositories. For all other cases set it to false to perform full garbage collection.
  5. Click Invoke. CRX runs the garbage collection and indicates when it has completed.

Note

The data store garbage collection will not collect files that have been deleted in the last 24 hours.

Automating Garbage Collection

If possible, garbage collection should be run when there is little load on the system, for example in the morning. 

Garbage collection can be automated using the wget or curl HTTP clients. The following is an example of how to automate data store garbage collection by using curl:

Caution

In the following example curl command, various parameters might need to be adjusted for your instance; for example, the hostname (localhost), the port (4502), the credentials (admin:admin), and the parameters for the actual garbage collection.

Run the garbage collection; for example:

curl -u admin:admin -X POST --data markOnly=true "http://localhost:4502/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean"

Code samples are intended for illustration purposes only.

The curl command returns immediately.
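
As a sketch of how such a run might be scheduled, the following crontab entry (a hypothetical example assuming a Unix-like host, default admin credentials on localhost:4502, and a log file location of your choosing) triggers the mark-only collection every Sunday at 02:00:

# Hypothetical crontab entry: run data store garbage collection every Sunday at 02:00.
# Adjust host, port, credentials, the markOnly flag, and the log path for your setup.
0 2 * * 0 curl -u admin:admin -X POST --data markOnly=true "http://localhost:4502/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3Drepository+manager%2Ctype%3DRepositoryManagement/op/startDataStoreGC/boolean" >> /var/log/aem-datastore-gc.log 2>&1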
