Generate list of content to pre-extract
Execute Steps 1a-b during a maintenance window or low-use period, because this operation traverses the Node Store and may place significant load on the system.
1a. Execute oak-run.jar --generate to create a list of nodes that will have their text pre-extracted.
1b. The list of nodes from (1a) is stored on the file system as a CSV file.
Note that the Node Store is traversed (for the paths specified in the oak-run command) every time --generate is executed, and a new CSV file is created each time. The CSV file is not re-used between discrete executions of the text pre-extraction process (Steps 1 - 2).
Pre-extract text to file system
Steps 2a-c can be executed during normal operation of AEM, as they only interact with the Data Store.
2a. Execute oak-run.jar --tika to pre-extract text for the binary nodes enumerated in the CSV file generated in (1b)
2b. The process initiated in (2a) directly accesses the binary nodes listed in the CSV file in the Data Store, and extracts their text.
2c. The extracted text is stored on the file system in a format ingestible by the Oak re-indexing process (3a).
Pre-extracted text is identified in the CSV by the binary fingerprint. If the binary file is the same, the same pre-extracted text can be used across AEM instances. Since AEM Publish is usually a subset of AEM Author, the pre-extracted text from AEM Author can often be used to re-index AEM Publish as well (assuming the AEM Publish instances have file-system access to the extracted text files).
Pre-extracted text can be added to incrementally over time. Text pre-extraction skips binaries that have already been extracted, so it is best practice to keep the pre-extracted text in case re-indexing must happen again in the future, assuming the extracted content is not prohibitively large. If it is, evaluate zipping the content in the interim, since text compresses well.
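The incremental behavior described above can be illustrated with a minimal Python sketch. It is not how oak-run stores extracted text: the SHA-256 fingerprint, the one-file-per-fingerprint layout, and the function names (fingerprint, pre_extract, archive_store, demo) are all assumptions made for illustration. It only shows the idea that text keyed by a binary's fingerprint is extracted once, skipped on later runs, and can be zipped for interim storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def fingerprint(data: bytes) -> str:
    """Hypothetical fingerprint: SHA-256 of the bytes (an assumption, not Oak's scheme)."""
    return hashlib.sha256(data).hexdigest()


def pre_extract(binaries, store_dir: Path, extract_text) -> int:
    """Extract text only for binaries whose fingerprint is not yet in store_dir.

    Returns the number of binaries actually extracted in this run.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    extracted = 0
    for data in binaries:
        target = store_dir / f"{fingerprint(data)}.txt"
        if target.exists():
            continue  # same fingerprint seen in an earlier run: skip
        target.write_text(extract_text(data), encoding="utf-8")
        extracted += 1
    return extracted


def archive_store(store_dir: Path, archive_base: Path) -> Path:
    """Zip the store for interim storage; archive_base must lie outside store_dir."""
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(store_dir)))


def demo():
    """Run two extraction passes over overlapping binaries, then archive the store."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "store"
        decode = lambda b: b.decode("utf-8")  # stand-in for a real text extractor
        first = pre_extract([b"alpha", b"beta"], store, decode)
        second = pre_extract([b"alpha", b"beta", b"gamma"], store, decode)
        archive = archive_store(store, Path(tmp) / "store-backup")
        return first, second, archive.name
```

In this sketch, the first pass extracts both binaries and the second pass extracts only the new one, because the two earlier fingerprints are already present in the store.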
Re-index Oak indexes, sourcing full-text from Extracted Text files
Execute re-indexing (Steps 3a-b) during a maintenance window or low-use period, because this operation traverses the Node Store and may place significant load on the system.
3a. Re-indexing of the Lucene indexes is invoked in AEM.
3b. The Apache Jackrabbit Oak DataStore PreExtractedTextProvider OSGi config (configured to point at the extracted text via a file system path) instructs Oak to source full-text from the extracted text files, which avoids directly reading and processing the binary data stored in the repository.