Observation

Similarly, anything that happens asynchronously, such as acting on observation events, cannot be guaranteed to execute locally and must therefore be used with care. This is true for both JCR events and Sling resource events. At the time a change happens, the instance may be taken down and replaced by a different instance. Other instances in the topology that are active at that time can react to the event. In this case, however, it will not be a local event, and there might even be no active leader if a leader election is in progress when the event is issued.

Background Tasks and Long Running Jobs

Code executed as a background task must assume that the instance it is running in can be brought down at any time. Therefore, the code must be resilient and, most importantly, resumable. That means that if the code is re-executed, it should not start from the beginning again but rather close to where it left off. While this is not a new requirement for this kind of code, in AEM as a Cloud Service it is more likely that an instance will be taken down.

To minimize the trouble, long-running jobs should be avoided if possible, and they should be resumable at a minimum. For executing such jobs, use Sling Jobs, which have an at-least-once guarantee: if a job is interrupted, it is re-executed as soon as possible. Even then, the job should continue near where it stopped rather than start from the beginning again. For scheduling such jobs, it is best to use the Sling Jobs scheduler, as this again ensures at-least-once execution.
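The resumability requirement can be illustrated with a minimal, dependency-free sketch. The names `ResumableTask`, `checkpoints`, and `jobId` are hypothetical and not part of AEM or Sling. The in-memory map only stands in for durable storage; in a real deployment the checkpoint would have to be persisted somewhere that survives an instance restart, and `run` would presumably be called from a Sling Job consumer.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper: processes a list of items and records progress after each one.
final class ResumableTask {
    // Assumption: stands in for durable storage (job id -> index of next unprocessed item).
    private final Map<String, Integer> checkpoints;

    ResumableTask(Map<String, Integer> checkpoints) {
        this.checkpoints = checkpoints;
    }

    void run(String jobId, List<String> items, Consumer<String> work) {
        // Resume from the last recorded position, or from the start on the first run.
        int start = checkpoints.getOrDefault(jobId, 0);
        for (int i = start; i < items.size(); i++) {
            work.accept(items.get(i));
            // Record progress only after the item has been fully processed.
            checkpoints.put(jobId, i + 1);
        }
        // All items are done, so the checkpoint is no longer needed.
        checkpoints.remove(jobId);
    }
}
```

Because the checkpoint is written after each item, an interruption between processing an item and recording it means that item is processed again on the next run. This matches the at-least-once semantics described above, so the per-item work should be idempotent.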

Do not use the Sling Commons Scheduler for scheduling, as execution cannot be guaranteed. A task is merely likely to be scheduled; it is not certain to run.

Similarly, anything that happens asynchronously, such as acting on observation events (whether JCR events or Sling resource events), cannot be guaranteed to be executed and must therefore be used with care. This is already true for existing AEM deployments.

Outgoing HTTP Connections

It is strongly recommended that any outgoing HTTP connections set reasonable connect and read timeouts; suggested values are 1 second for the connection timeout and 5 seconds for read timeout. The exact numbers must be determined based on the performance of the backend system handling these requests.

For code that does not apply these timeouts, AEM instances running on AEM as a Cloud Service enforce global timeouts: 10 seconds for connect calls and 60 seconds for read calls.

Adobe recommends the use of the provided Apache HttpComponents Client 4.x library for making HTTP connections.
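As a minimal sketch of setting explicit timeouts, the following example applies the suggested values of 1 second (connect) and 5 seconds (read). To stay dependency-free it uses the JDK's `java.net.HttpURLConnection` rather than the recommended Apache HttpComponents Client; with that library the equivalent settings would be the connect and socket timeouts on `RequestConfig`. The class name `OutgoingHttp` and its constants are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper that never hands out a connection without timeouts.
final class OutgoingHttp {
    // Suggested starting values; tune them to the backend's actual performance.
    static final int CONNECT_TIMEOUT_MS = 1_000;
    static final int READ_TIMEOUT_MS = 5_000;

    static HttpURLConnection open(String url) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
            conn.setReadTimeout(READ_TIMEOUT_MS);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Centralizing connection creation in one place like this is one way to make sure no outgoing call silently falls back to the platform's global timeouts.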

Alternatives that are known to work, but may require providing the dependency yourself are:

In addition to setting timeouts, code should properly handle timeouts and unexpected HTTP status codes.
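One possible shape for such handling is sketched below, again using the JDK's `java.net.HttpURLConnection` only to keep the example dependency-free. The class `SafeFetch` and the choice to return an empty `Optional` on any failure are assumptions made for illustration; real code would typically log the failure and apply whatever fallback suits the calling feature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: returns the response body, or empty if the call did not succeed.
final class SafeFetch {
    static Optional<String> readBody(HttpURLConnection conn) {
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                // Unexpected status code: do not try to interpret the body.
                return Optional.empty();
            }
            try (InputStream in = conn.getInputStream()) {
                return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } catch (SocketTimeoutException e) {
            // Connect or read timeout: the backend is slow or unreachable.
            return Optional.empty();
        } catch (IOException e) {
            // Any other I/O failure (connection reset, DNS failure, and so on).
            return Optional.empty();
        } finally {
            conn.disconnect();
        }
    }
}
```

The caller can then decide explicitly what to do when no body is available, instead of letting an exception or an error page from the backend propagate into the rendered response.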

Handling request rate limits

When the rate of incoming requests to AEM exceeds healthy levels, AEM responds to new requests with HTTP error code 429. Applications making programmatic calls to AEM can consider coding defensively, retrying after a few seconds with an exponential backoff strategy. Before mid-August 2023, AEM responded to the same condition with HTTP error code 503.
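A client-side retry with exponential backoff could look roughly like the following sketch. The class `RetryOn429`, its method names, and the parameters are hypothetical; the request itself is abstracted as an `IntSupplier` that returns the HTTP status code, so any HTTP client can be plugged in.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: retries a request while the server answers 429.
final class RetryOn429 {
    static final int TOO_MANY_REQUESTS = 429;

    // Delay doubles with each attempt (base, 2*base, 4*base, ...) up to maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    // Performs at most maxAttempts calls and returns the last status code seen.
    static int callWithRetry(IntSupplier request, int maxAttempts,
                             long baseMillis, long maxMillis) {
        int status = request.getAsInt();
        for (int attempt = 0;
             status == TOO_MANY_REQUESTS && attempt < maxAttempts - 1;
             attempt++) {
            try {
                Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

For example, `RetryOn429.callWithRetry(() -> sendRequest(), 5, 2_000, 30_000)` would wait 2, 4, 8, and then 16 seconds between attempts, where `sendRequest` is a placeholder for the caller's own HTTP call. Bounding both the number of attempts and the maximum delay keeps a throttled client from retrying indefinitely.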

No Classic UI Customizations

AEM as a Cloud Service only supports the Touch UI for third-party customer code. Classic UI is not available for customization.

No Native Binaries or Native Libraries

Native binaries and libraries must not be deployed to or installed in cloud environments.

In addition, code should not attempt to download native binaries or native Java extensions (for example, JNI) at runtime.

No Streaming Binaries through AEM as a Cloud Service

Binaries should be accessed through the CDN, which will serve binaries outside of the core AEM services.

For example, do not use asset.getOriginal().getStream(), which triggers downloading a binary onto the AEM service’s ephemeral disk.

No Reverse Replication Agents

Reverse replication from Publish to Author is not supported in AEM as a Cloud Service. If such a strategy is needed, you can use an external persistence store that is shared amongst the farm of Publish instances and potentially the Author cluster.

Forward Replication Agents Might Need to be Ported

Content is replicated from Author to Publish through a pub-sub mechanism. Custom replication agents are not supported.

No Overloading Development Environments

Production environments are sized higher to ensure stable operation, while Stage environments are sized like Production environments to ensure realistic testing under production conditions.

Dev environments and Rapid Dev environments should be limited to development, error analysis, and functional tests. They are not designed to process high workloads or large amounts of content.

As an example, changing an index definition on a large content repository in a Dev environment can trigger re-indexing, which causes excessive processing. Tests that require substantial content should be run on Stage environments.