When a conventional WCM asset is removed, the reference to the underlying datastore record may be removed from the node hierarchy, but the datastore record itself remains. This unreferenced datastore record then becomes "garbage" that need not be retained. When many such garbage records accumulate, removing them saves space and speeds up backup and filesystem maintenance.
For the most part, a WCM application collects information and rarely deletes it. Even when a new image supersedes an old version, the version control system retains the old one and supports reverting to it if needed. Thus most of the content we think of as adding to the system is effectively stored permanently. So what is the typical source of "garbage" in the repository that we might want to clean up?
AEM uses the repository as the storage for a number of internal and housekeeping activities:
- packages built and downloaded
- temporary files created for publish replication
- workflow payloads
- assets created temporarily during DAM rendering
When any of these temporary objects is large enough to require storage in the datastore, and the object eventually passes out of use, the datastore record itself remains as "garbage". In a typical WCM author/publish application, the largest source of garbage of this type is commonly the process of publish activation. When data is being replicated to Publish, it is first gathered into collections in an efficient data format called "Durbo" and stored in the repository under /var/replication/data. The data bundles are often larger than the critical size threshold for the datastore and therefore end up stored as datastore records. When the replication is complete, the node in /var/replication/data is deleted, but the datastore record remains as "garbage".
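The lifecycle described above can be illustrated with a small toy model. This is only a sketch and not AEM or Jackrabbit code: the `ToyRepository` class, the `bundle-1` node name, the hash-based record IDs, and the 4 KB threshold (borrowed from the figure quoted for packages) are all assumptions made for illustration.

```python
import hashlib

# Assumed size threshold: larger values go to the datastore (illustrative value).
INLINE_THRESHOLD = 4096


class ToyRepository:
    """Hypothetical model: a node hierarchy plus a separate datastore."""

    def __init__(self):
        self.nodes = {}      # path -> ("inline", bytes) or ("blob", record_id)
        self.datastore = {}  # record_id -> bytes

    def write(self, path, data):
        if len(data) > INLINE_THRESHOLD:
            # Large values are stored as a datastore record; the node keeps a reference.
            record_id = hashlib.sha256(data).hexdigest()
            self.datastore[record_id] = data
            self.nodes[path] = ("blob", record_id)
        else:
            self.nodes[path] = ("inline", data)

    def delete(self, path):
        # Only the node is removed; the datastore record is left untouched.
        del self.nodes[path]

    def unreferenced_records(self):
        referenced = {ref for kind, ref in self.nodes.values() if kind == "blob"}
        return set(self.datastore) - referenced


repo = ToyRepository()
repo.write("/var/replication/data/bundle-1", b"x" * 10_000)  # large replication bundle
repo.write("/content/site/title", b"Home")                   # small value, kept inline
repo.delete("/var/replication/data/bundle-1")                # replication complete

garbage = repo.unreferenced_records()
print(len(garbage), "unreferenced datastore record(s)")
```

After the replication node is deleted, the model still holds one datastore record that no node points to, which is the "garbage" this section is concerned with.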
Another source of recoverable garbage is packages. Package data, like everything else, is stored in the repository, and packages larger than 4KB are therefore stored in the datastore. In the course of a development project, or over time while maintaining a system, packages may be built and rebuilt many times. Each build results in a new datastore record and orphans the previous build's record.
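To get a feel for how rebuilds add up, the following sketch simulates rebuilding one package several times and then measures the orphaned records. It is a simplified illustration under stated assumptions: the `build_package` helper, the package path, and the payload contents are hypothetical, and real record IDs and sizes will differ.

```python
import hashlib

datastore = {}  # record_id -> bytes (stand-in for datastore records)
packages = {}   # package path -> record_id currently referenced by the node


def build_package(path, payload):
    """Hypothetical build step: store the new payload and repoint the node."""
    record_id = hashlib.sha256(payload).hexdigest()
    datastore[record_id] = payload
    packages[path] = record_id  # the previous build's record loses its only reference
    return record_id


path = "/etc/packages/my_packages/site-content.zip"  # hypothetical package path
for build in range(1, 6):
    # Each build produces slightly different content, hence a new record.
    build_package(path, f"build {build}: ".encode() + b"\0" * 8192)

orphaned = set(datastore) - set(packages.values())
reclaimable = sum(len(datastore[record_id]) for record_id in orphaned)
print(f"{len(orphaned)} orphaned records, {reclaimable} bytes reclaimable")
```

In this model, five builds leave one referenced record and four orphaned ones, so the reclaimable space grows with every rebuild of a large package.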