
About the Filtering menu

Use the Filtering menu to manage scripts that change the content of a web document before it is indexed.

About Filtering Script

You can use Filtering Script to change the content of a Web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document's URL, MIME type, and existing content. The filtering script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the filtering script with an initialization script, termination script, URL masks script, and test URL.
The filtering script is run each time a document is read from your website. The script runs as a standard filter. In other words, it reads data from STDIN, transforms that data in some way, and writes the results to STDOUT. You can also print status messages from the filtering script to the index log, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
In Expert (diff) mode on the Staged Filtering Script page, you can use GNU diff options that include the following:
  • -b: Ignores changes in the amount of white space.
  • -B: Ignores changes that insert or delete blank lines.
  • -c: Uses the context output format, showing three lines of context.
  • -C lines: Uses the context output format, showing lines (an integer) lines of context, or three if lines is not given.
  • -i: Ignores changes in case; considers uppercase and lowercase letters equivalent.
  • -f: Makes output that looks similar to an ed script but has changes in the order that they appear in the file.
  • -n: Outputs RCS-format diffs; like -f, except that each command specifies the number of lines that are affected.
  • -u: Uses the unified output format, showing three lines of context.
  • -U lines: Uses the unified output format, showing lines (an integer) lines of context, or three if lines is not given.
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace "main::". When the filtering script is started, its environment contains the following standard file handles:
  • STDIN - the document as it was downloaded from your website
  • STDOUT - replacement HTML (if data is printed to STDOUT, it is used in place of the original document)
  • STDERR - data printed to STDERR is printed to the Index Log as an error
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log 
_search_debug_log("Done processing document: " . $main::search_url);

These messages appear with the word DEBUG as a preface, and are not logged as errors.
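The file handle behavior described above can be sketched as a small stand-alone Perl program. This is a minimal sketch rather than product code: run_filter is a hypothetical helper, the sample input is invented, and the stand-in _search_debug_log() exists only so that the sketch runs outside the indexer, where the real subroutine is not available.

```perl
use strict;
use warnings;

# Stand-in for the indexer's logging subroutine so the sketch runs
# stand-alone; during indexing the real _search_debug_log() is supplied.
unless (defined &_search_debug_log) {
    *_search_debug_log = sub { print STDERR "DEBUG: $_[0]\n" };
}

# Hypothetical helper: read a whole document from $in, apply $transform,
# and write the result to $out only when the text actually changed.
sub run_filter {
    my ($in, $out, $transform) = @_;
    local $/;                          # slurp mode: read everything at once
    my $original = <$in>;
    $original = "" unless defined $original;
    my $filtered = $transform->($original);
    if ($filtered ne $original) {
        print {$out} $filtered;
        return 1;
    }
    return 0;                          # nothing printed: document used as is
}

# Stand-alone demonstration with in-memory file handles (invented input).
my $input = "<p>Hello   world</p>";
open(my $in,  '<', \$input)     or die "cannot open input: $!";
open(my $out, '>', \my $output) or die "cannot open output: $!";
my $changed = run_filter($in, $out, sub {
    my ($text) = @_;
    $text =~ s/ {2,}/ /g;              # collapse runs of spaces
    return $text;
});
close($out);
_search_debug_log("changed=$changed output=$output");
```

In a filtering script, the same helper would presumably be called as run_filter(\*STDIN, \*STDOUT, ...), with the subroutine itself defined in the initialization script.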
The following is an example of filtering. Web page <title> fields often begin with the company name. Although this information is useful for site navigation, it is not relevant when searching. Suppose that the titles of all MegaCorp web pages start with a common string, such as the following:
<title>MegaCorp -- meaningful title here</title>

You can remove "MegaCorp -- " from the beginning of each document title, and count each document that the filtering script processes, with the following script:
# Make sure this is an HTML document. 
if ($main::search_content_type =~ /^text\/html/) { 
    # Read the entire document into a local scalar variable. 
    my @docarray = <>; 
    my $doc = join("", @docarray); 
 
    # Remove "MegaCorp -- " from the title. 
    $doc =~ s/(<TITLE>)MegaCorp -- /$1/gis; 
 
    # Print the resulting document. 
    print $doc; 
 
    # Count that we've filtered one more document. 
    $main::doc_count++; 
}
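
The same substitution can be wrapped in a subroutine so that it can be tried outside the indexer. The following is a minimal sketch under stated assumptions: strip_title_prefix is a hypothetical name, the sample HTML is invented, and the prefix is treated as literal text rather than as a regular expression.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading prefix from the <title> element.
sub strip_title_prefix {
    my ($doc, $prefix) = @_;
    # \Q...\E quotes any regex metacharacters in the prefix.
    # /i matches <title> or <TITLE>; \s* tolerates white space after the tag.
    $doc =~ s/(<title[^>]*>)\s*\Q$prefix\E/$1/i;
    return $doc;
}

# Invented sample input for a stand-alone run.
my $sample = "<html><head><title>MegaCorp -- Contact us</title></head></html>";
print strip_title_prefix($sample, "MegaCorp -- "), "\n";
# prints <html><head><title>Contact us</title></head></html>
```

Defining such a subroutine in the initialization script keeps the filtering script itself short: it reads the document, calls the subroutine, and prints the result.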

Global Variables

You can use the following variables in any filtering script:
  • $main::search_crawl_type - The value of $main::search_crawl_type indicates the type of index operation underway. Deprecated form: $main::ws_crawl_type. The index operations and associated values include the following:
    • Full Index: Manual - manual
    • Full Index: Scheduled - auto
    • Full Index: Remote Control - CGI
    • Incremental Index: Manual - manual-incremental
    • Incremental Index: Scheduled - auto-incremental
    • Incremental Index: Remote Control - CGI-incremental
    • Scripted Index: Manual - manual-indexlist.txt
    • Scripted Index: Scheduled - auto-indexlist.txt
    • Scripted Index: Remote Control - CGI-indexlist.txt
    • Regenerate - manual-upgrade
  • $main::search_clear_cache - The value indicates whether the "Clear index cache" indexing option was requested for the current index operation. If "Clear index cache" was requested, the value of $main::search_clear_cache is "1". Deprecated form: $main::ws_clear_cache
  • $main::search_fields - The value contains a tab-separated list of the metadata fields that are defined in the account. By default, the value is (tab-separated): url title desc keys target body alt date charset language. Deprecated form: $main::ws_fields
  • $main::search_collections - The value contains a tab-separated list of the Collections that are defined in the account. Deprecated form: $main::ws_collections
  • $main::search_url - The value is the fully qualified URL of the document. Deprecated form: $main::ws_url
  • $main::search_content_type - The value is the content-type of the document as fetched from the http-equiv meta tag. A typical value is "text/html; charset=iso-8859-1". Deprecated form: $main::ws_content_type
  • $main::search_content_class - The value is the content class of the document, as derived from the content-type field. Deprecated form: $main::ws_content_class
  • $main::search_syntax_check - The value reflects the use of the "Check Syntax" button. If clicked, the value is 1 (one); otherwise, its value is 0 (zero). Deprecated form: $main::ws_syntax_check
  • $main::search_last_mod_date - If provided by the web server, this value contains the Epoch representation (seconds since January 1, 1970) of the document's last-modified date. You can format this value by using the Perl localtime() library call.
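
As an illustration of how these variables might be read, the following sketch splits the field list, checks for an incremental index operation, and formats the last-modified date. It is a sketch under assumptions: is_incremental and format_mod_date are hypothetical helper names, the assigned values are stand-ins so that the code runs outside the indexer, and gmtime() is used in place of localtime() only to make the output independent of the time zone.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Stand-in values so the sketch runs outside the indexer. During a real
# index operation these globals are already set.
$main::search_crawl_type    = "auto-incremental";
$main::search_fields        = join("\t", qw(url title desc keys));
$main::search_last_mod_date = 86400;    # one day after the Epoch

# Split the tab-separated field list into an array.
my @fields = split /\t/, $main::search_fields;

# Hypothetical helper: true for any incremental index operation.
sub is_incremental {
    my ($crawl_type) = @_;
    return $crawl_type =~ /-incremental$/ ? 1 : 0;
}

# Hypothetical helper: format an Epoch value as YYYY-MM-DD in UTC.
sub format_mod_date {
    my ($epoch) = @_;
    return strftime("%Y-%m-%d", gmtime($epoch));
}

print "Fields: @fields\n";
print "Incremental: ", is_incremental($main::search_crawl_type), "\n";
print "Last modified: ", format_mod_date($main::search_last_mod_date), "\n";
```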

Quick tips

  • All global variables are prefaced with the namespace "main::": $main::doc_count = 0;
  • All local variables are declared with "my": my $i = 0;
  • Subroutines are defined in the initialization script. They do not need an explicit "main::" namespace: sub my_sub { ...
    }
  • Test the $main::search_content_type before you make changes to a file. Testing can help you avoid making careless changes to binary files, like SWF files or PDF files:
    if ($main::search_content_type =~ /^text\/html/) { ...
  • The $main::search_content_type is the full Content-Type header delivered by your server. It can sometimes contain a simple MIME type, such as "text/html". Or, it can contain a MIME type followed by other information, like the document's character set encoding, such as "text/html; charset=iso-8859-1".
  • For each type of non-HTML document, $main::search_content_type can take various values, and testing for each value in your script becomes cumbersome. For example, some Word documents have content type values of "application/msword", "application/vnd.ms-word", or "application/x-msword". In such cases, test $main::search_content_class instead, which can take the following values:
    • html
    • pdf
    • word
    • excel
    • powerpoint
    • mp3
    • text
  • In the example, testing $main::search_content_class for "word" would match any of the three possible content-type values.
  • If nothing is printed to STDOUT from the filtering script, then the document is used exactly as it was downloaded. That is, if you do not need to change anything in a document, then you do not need to copy STDIN to STDOUT for that document.
  • If you want to remove all text from a document, print a valid file to STDOUT. For example, to completely remove all text from an HTML document, do the following: print "<html></html>";
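
The tips above can be combined into one decision step. The following is a minimal sketch, not a recommended policy: filter_action and apply_action are hypothetical helpers, and the choices made here (rewrite HTML, blank out MP3 text, leave everything else untouched) are invented purely to show the three possible outcomes.

```perl
use strict;
use warnings;

# Hypothetical helper: decide how to treat a document based on its content
# class, instead of testing every raw content-type value.
sub filter_action {
    my ($content_class) = @_;
    return "rewrite" if $content_class eq "html";
    return "blank"   if $content_class eq "mp3";
    return "keep";    # print nothing, so the document is used as downloaded
}

# Hypothetical helper: apply the action to the document text.
# Returns undef when nothing should be printed to STDOUT.
sub apply_action {
    my ($action, $doc) = @_;
    return "<html></html>" if $action eq "blank";
    if ($action eq "rewrite") {
        $doc =~ s/<!--.*?-->//gs;    # example edit: drop HTML comments
        return $doc;
    }
    return;
}

# Stand-alone demonstration with an invented document.
for my $class (qw(html word mp3)) {
    my $out = apply_action(filter_action($class), "<p>Hi<!-- note --></p>");
    print "$class: ", (defined $out ? $out : "(unchanged)"), "\n";
}
```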

Adding a filtering script

The filtering script is a Perl script that is run for each document that is downloaded from your website.
You use the filtering script in conjunction with an initialization script, termination script, and URL masks script.
Be sure that you rebuild your site index so that the results of your filtering script are visible to your customers.
To add a filtering script
  1. On the product menu, click Settings > Filtering > Filtering Script.
  2. (Optional) On the Filtering Script page, in the Test URL field, enter the URL of a document on your website.
    Click a testing option to see changes to the raw HTML text.
    • Test URL field - Lets you enter the URL of a document on your website.
    • Test - Tests the URL against the filtering scripts and URL masks. The test URL document is downloaded and used as the STDIN input to the filtering script. The initialization, filtering, and termination scripts are then run. If there is any STDOUT output from the filtering script, that output is displayed in a new browser window.
    • Test only - Tests the script's operation only.
    • Preview - Lets you view the page.
    • Full visual - Generates a full before-and-after table view of the documents.
    • Short visual - Shows only the differences between the before-and-after views.
    • Expert (diff) - Displays the raw output of the GNU diff command that is used to compare the files, using the supplied command line options.
    • Filtering Script - Lets you paste your filtering script in the field provided.
    • Save Changes - Saves the filtering script.
    • Check Syntax - Lets you do a quick syntax check of your script by running the initialization, filtering, and termination scripts. It does not update and save your script. All Perl compiler errors and warnings, and all STDERR output, are printed. Before the effects of the script are visible to customers, you must rebuild your site index.
    • GNU diff command line options - In Expert (diff) mode on the Staged Filtering Script page, you can use GNU diff options that include the following:
      • -b: Ignores changes in the amount of white space.
      • -B: Ignores changes that insert or delete blank lines.
      • -c: Uses the context output format, showing three lines of context.
      • -C lines: Uses the context output format, showing lines (an integer) lines of context, or three if lines is not given.
      • -i: Ignores changes in case; considers uppercase and lowercase letters equivalent.
      • -f: Makes output that looks similar to an ed script but has changes in the order that they appear in the file.
      • -n: Outputs RCS-format diffs; like -f, except that each command specifies the number of lines that are affected.
      • -u: Uses the unified output format, showing three lines of context.
      • -U lines: Uses the unified output format, showing lines (an integer) lines of context, or three if lines is not given.
  3. Click Test to test against the filtering scripts and URL masks.
    Clicking Test does not update and save your filtering script.
  4. In the Filtering Script field, paste your script.
  5. (Optional) Click Check Syntax to perform a quick syntax check of your script by running the filtering, initialization, and termination scripts.
    Check Syntax does not update and save your script.
  6. Click Save Changes.
  7. (Optional) Rebuild your staged site index if you want to preview the results.
  8. (Optional) On the Filtering Script page, do any of the following:

About Initialization Script

You can use Initialization Script to change the content of a Web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document's URL, MIME type, and existing content. The initialization script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the initialization script with a filtering script, termination script, URL masks script, and test URL.
The initialization script is run once before indexing begins. Use this script to initialize any global variables and subroutines that are used by your filtering script. You can also print status messages to the index log from the initialization script, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
In Expert (diff) mode on the Staged Initialization Script page, you can use GNU diff options that include the following:
  • -b: Ignores changes in the amount of white space.
  • -B: Ignores changes that insert or delete blank lines.
  • -c: Uses the context output format, showing three lines of context.
  • -C lines: Uses the context output format, showing lines (an integer) lines of context, or three if lines is not given.
  • -i: Ignores changes in case; considers uppercase and lowercase letters equivalent.
  • -f: Makes output that looks similar to an ed script but has changes in the order that they appear in the file.
  • -n: Outputs RCS-format diffs; like -f, except that each command specifies the number of lines that are affected.
  • -u: Uses the unified output format, showing three lines of context.
  • -U lines: Uses the unified output format, showing lines (an integer) lines of context, or three if lines is not given.
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace "main::". When the initialization script is started, its environment contains the following standard file handles:
  • STDIN - nothing (immediately returns EOF when read)
  • STDOUT - nothing (if data is printed to STDOUT, it is discarded)
  • STDERR - data printed to STDERR is printed to the Index Log as an error
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log 
_search_debug_log("Done processing document: " . $main::search_url);

These messages appear with the word DEBUG as a preface, and are not logged as errors.
An example of an initialization script is the following:
# My subroutine to do something. 
sub my_sub_for_the_filtering_script { 
    my ($param1, $param2) = @_; 
    ... 
} 
 
# Initialize the document counter. 
$main::doc_count = 0;
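
To make the division of labor concrete, the following sketch shows the kind of subroutines an initialization script might define for later use by the filtering script, here creating new HTML metadata from a document's URL. The names section_meta_tag and insert_into_head, the "section" metadata name, and the sample URL are all hypothetical, and the sketch does not HTML-escape the extracted path segment.

```perl
use strict;
use warnings;

# Hypothetical subroutine: build a <meta> tag from the first path segment
# of a URL, for example "faqs" in https://www.mysite.com/faqs/a.html.
sub section_meta_tag {
    my ($url) = @_;
    my ($section) = $url =~ m{^https?://[^/]+/([^/?#]+)/};
    return "" unless defined $section;
    return qq{<meta name="section" content="$section">};
}

# Hypothetical subroutine: insert a tag directly after the opening <head>.
sub insert_into_head {
    my ($doc, $tag) = @_;
    return $doc if $tag eq "";
    $doc =~ s/(<head\b[^>]*>)/$1$tag/i;
    return $doc;
}

# Initialize the document counter, as in the example above.
$main::doc_count = 0;

# Stand-alone demonstration of what a filtering script could then do.
my $tag = section_meta_tag("https://www.mysite.com/faqs/returns.html");
print insert_into_head("<html><head></head></html>", $tag), "\n";
$main::doc_count++;
# prints <html><head><meta name="section" content="faqs"></head></html>
```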

Quick tips

  • All global variables are prefaced with the namespace "main::": $main::doc_count = 0;
  • All local variables are declared with "my": my $i = 0;
  • Subroutines are defined in the initialization script. They do not need an explicit "main::" namespace: sub my_sub { ...
    }
  • Test the $main::search_content_type before you make changes to a file. Testing can help you avoid making careless changes to binary files, like SWF files or PDF files:
    if ($main::search_content_type =~ /^text\/html/) { ...
  • The $main::search_content_type is the full Content-Type header delivered by your server. It can sometimes contain a simple MIME type, such as "text/html". Or, it can contain a MIME type followed by other information, like the document's character set encoding, such as "text/html; charset=iso-8859-1".
  • For each type of non-HTML document, $main::search_content_type can take various values, and testing for each value in your script becomes cumbersome. For example, some Word documents have content type values of "application/msword", "application/vnd.ms-word", or "application/x-msword". In such cases, test $main::search_content_class instead, which can take the following values:
    • html
    • pdf
    • word
    • excel
    • powerpoint
    • mp3
    • text
  • In the example, testing $main::search_content_class for "word" would match any of the three possible content-type values.
  • If nothing is printed to STDOUT from the filtering script, then the document is used exactly as it was downloaded. That is, if you do not need to change anything in a document, then you do not need to copy STDIN to STDOUT for that document.
  • If you want to remove all text from a document, print a valid file to STDOUT. For example, to completely remove all text from an HTML document, do the following: print "<html></html>";

Adding an initialization script

The initialization script is a Perl script that is run once before any documents are indexed.
You use the initialization script in conjunction with a filtering script, termination script, and URL masks script.
Be sure that you rebuild your site index so that the results of your initialization script are visible to your customers.
To add an initialization script
  1. On the product menu, click Settings > Filtering > Initialization Script.
  2. (Optional) On the Initialization Script page, in the Test URL field, enter the URL of a document on your website.
    Click a testing option to see changes to the raw HTML text.
    See the filtering options under Adding a filtering script.
    Click Test to test against the filtering scripts and URL masks.
    Clicking Test does not update and save your initialization script.
  3. In the Initialization Script field, paste your script.
  4. (Optional) Click Check Syntax to perform a quick syntax check of your script by running the filtering, initialization, and termination scripts.
    Check Syntax does not update and save your script.
  5. Click Save Changes.
  6. (Optional) Rebuild your staged site index if you want to preview the results.
  7. (Optional) On the Initialization Script page, do any of the following:

About Termination Script

You can use Termination Script to change the content of a Web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document's URL, MIME type, and existing content. The termination script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the termination script with an initialization script, filtering script, URL masks script, and test URL.
The termination script is run once after all the documents are indexed. You can also print status messages to the index log from the termination script, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
In Expert (diff) mode on the Staged Termination Script page, you can use GNU diff command line options that include the following:
  • -b: Ignores changes in the amount of white space.
  • -B: Ignores changes that insert or delete blank lines.
  • -c: Uses the context output format, showing three lines of context.
  • -C lines: Uses the context output format, showing lines (an integer) lines of context, or three if lines is not given.
  • -i: Ignores changes in case; considers uppercase and lowercase letters equivalent.
  • -f: Makes output that looks similar to an ed script but has changes in the order that they appear in the file.
  • -n: Outputs RCS-format diffs; like -f, except that each command specifies the number of lines that are affected.
  • -u: Uses the unified output format, showing three lines of context.
  • -U lines: Uses the unified output format, showing lines (an integer) lines of context, or three if lines is not given.
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace "main::". When the termination script is started, its environment contains the following standard file handles:
  • STDIN - nothing (immediately returns EOF when read)
  • STDOUT - nothing (if data is printed to STDOUT, it is discarded)
  • STDERR - data printed to STDERR is printed to the index log as an error
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log 
_search_debug_log("Done processing document: " . $main::search_url);

These messages appear with the word DEBUG as a preface, and are not logged as errors.
To display the number of documents that were processed by your filtering script as an error line in the index log, you can use the following termination script:
# Print the value of the document counter. 
print STDERR "Total docs: $main::doc_count\n"; 
# Or, using the log subroutine: 
_search_debug_log("Total docs: " . $main::doc_count);
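
A termination script can also report more than a raw count. The following sketch builds a one-line summary that could be passed to either logging method. It rests on assumptions that are not part of the product: summary_line is a hypothetical helper, and $main::start_time is an invented global that the initialization script would have to set (for example, with time()).

```perl
use strict;
use warnings;

# Hypothetical helper: build a one-line summary for the index log.
sub summary_line {
    my ($doc_count, $start_time, $end_time) = @_;
    my $elapsed = $end_time - $start_time;
    # Avoid dividing by zero when the run took less than one second.
    my $rate = $elapsed > 0 ? sprintf("%.1f", $doc_count / $elapsed) : "n/a";
    return "Filtered $doc_count docs in ${elapsed}s ($rate docs/s)";
}

# Stand-in values so the sketch runs outside the indexer.
$main::doc_count  = 120;
$main::start_time = 1000;
print summary_line($main::doc_count, $main::start_time, 1060), "\n";
# prints Filtered 120 docs in 60s (2.0 docs/s)
```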

Quick tips

  • All global variables are prefaced with the namespace "main::": $main::doc_count = 0;
  • All local variables are declared with "my": my $i = 0;
  • Subroutines are defined in the initialization script. They do not need an explicit "main::" namespace: sub my_sub { ...
    }
  • Test the $main::search_content_type before you make changes to a file. Testing can help you avoid making careless changes to binary files, like SWF files or PDF files:
    if ($main::search_content_type =~ /^text\/html/) { ...
  • The $main::search_content_type is the full Content-Type header delivered by your server. It can sometimes contain a simple MIME type, such as "text/html". Or, it can contain a MIME type followed by other information, like the document's character set encoding, such as "text/html; charset=iso-8859-1".
  • For each type of non-HTML document, $main::search_content_type can take various values, and testing for each value in your script becomes cumbersome. For example, some Word documents have content type values of "application/msword", "application/vnd.ms-word", or "application/x-msword". In such cases, test $main::search_content_class instead, which can take the following values:
    • html
    • pdf
    • word
    • excel
    • powerpoint
    • mp3
    • text
  • In the example, testing $main::search_content_class for "word" would match any of the three possible content-type values.
  • If nothing is printed to STDOUT from the filtering script, then the document is used exactly as it was downloaded. That is, if you do not need to change anything in a document, then you do not need to copy STDIN to STDOUT for that document.
  • If you want to remove all text from a document, print a valid file to STDOUT. For example, to completely remove all text from an HTML document, do the following: print "<html></html>";

Adding a termination script

The termination script is a Perl script that is run once after all documents are indexed.
You use the termination script in conjunction with an initialization script, filtering script, and URL masks script.
Be sure that you rebuild your site index so that the results of your termination script are visible to your customers.
To add a termination script
  1. On the product menu, click Settings > Filtering > Termination Script.
  2. (Optional) On the Termination Script page, in the Test URL field, enter the URL of a document on your website.
    Click a testing option to see changes to the raw HTML text.
    See the filtering options under Adding a filtering script.
    Click Test to test against the filtering scripts and URL masks.
    Clicking Test does not update and save your termination script.
  3. In the Termination Script field, paste your script.
  4. (Optional) Click Check Syntax to perform a quick syntax check of your script by running the initialization, filtering, and termination scripts.
    Check Syntax does not update and save your script.
  5. Click Save Changes.
  6. (Optional) Rebuild your staged site index if you want to preview the results.
  7. (Optional) On the Termination Script page, do any of the following:

About URL Masks script

With filtering, you can change the content of a web document before it is indexed. You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document's URL, MIME type, and existing content. The URL masks script is a Perl script that provides powerful string handling and the flexibility of regular expression matching.
To change the contents of documents that exist only in a specific portion of your website, you can specify include URL masks, exclude URL masks, or both, to define the appropriate pages.
If you want to change only the documents under "https://www.mysite.com/faqs/" , you can use the following set of masks:
include https://www.mysite.com/faqs/ 
exclude *

You can also use a regular expression in a URL masks script, as in the following example:
include regexp ^https://www\.mysite\.com.*/faqs/.*$ 
exclude *

Scripted URL masks are considered in the order that you entered them in the URL Masks field. When a document URL matches a mask, that document is included or excluded based on the type of mask. If a document's URL does not match any URL mask, the document is included only if its MIME type is "text/html". All other MIME types are excluded.
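
The evaluation order described above can be modeled in a few lines of Perl. This is only a sketch of the stated rules, not the product's implementation: parse_masks, mask_matches, and url_is_filtered are hypothetical names, and the sketch assumes that a plain mask matches when the URL starts with the mask text and that "*" matches every URL.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "include ..." / "exclude ..." lines into a list.
sub parse_masks {
    my ($text) = @_;
    my @masks;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*(include|exclude)\s+(regexp\s+)?(\S+)\s*$/;
        push @masks, { type => $1, regexp => defined $2 ? 1 : 0, pattern => $3 };
    }
    return @masks;
}

# Assumption: plain masks are URL prefixes, "*" matches everything, and
# "regexp" masks are Perl regular expressions.
sub mask_matches {
    my ($mask, $url) = @_;
    my $pattern = $mask->{pattern};
    if ($mask->{regexp}) {
        return $url =~ /$pattern/ ? 1 : 0;
    }
    return 1 if $pattern eq "*";
    return index($url, $pattern) == 0 ? 1 : 0;
}

# The first matching mask decides. With no match, only text/html is included.
sub url_is_filtered {
    my ($url, $mime_type, @masks) = @_;
    for my $mask (@masks) {
        next unless mask_matches($mask, $url);
        return $mask->{type} eq "include" ? 1 : 0;
    }
    return $mime_type =~ m{^text/html} ? 1 : 0;
}

my @masks = parse_masks("include https://www.mysite.com/faqs/\nexclude *\n");
print url_is_filtered("https://www.mysite.com/faqs/a.html", "text/html", @masks), "\n";  # prints 1
print url_is_filtered("https://www.mysite.com/news/a.html", "text/html", @masks), "\n";  # prints 0
```

Because the first match wins, placing "exclude *" before the include mask in this model would exclude every document.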

Adding a URL mask script

Specify URL include masks and exclude masks to change the contents of documents that exist only in a specific portion of your website.
Before the effects of the URL Masks settings are visible to visitors, rebuild your site index.
To add a URL mask script
  1. On the product menu, click Settings > Filtering > URL Masks.
  2. (Optional) On the URL Masks page, in the Test URL field, enter a URL of a document on your website, and then click Test to test the URL against the filtering scripts and masks.
    The test URL document is downloaded and used as the STDIN input to the filtering script. Then the filtering, initialization, and termination scripts are run. If there is any STDOUT output from the filtering script, that output is displayed in a new browser window.
    Clicking Test does not update and save your script.
  3. In the URL Masks field, enter one URL mask per line.
  4. (Optional) Click Check Syntax to perform a quick syntax check of your URL masks by running the filtering, initialization, and termination scripts.
    Check Syntax does not update and save your script.
  5. Click Save Changes.
  6. (Optional) Rebuild your staged site index if you want to preview the results.
  7. (Optional) On the URL Masks page, do any of the following:

About Content Types in Filtering

Use Content Types to select the content types that you want filtered for this account.
The text found within the selected content types is converted to HTML and then processed using the script that is specified in Filtering Script.
The Content Types that you can select from include the following:
  • PDF documents
  • Text documents
  • Adobe Flash movies
  • Microsoft Word files
  • Microsoft Office files (OpenXML)
  • Microsoft Excel files
  • Microsoft PowerPoint files
  • Text in MP3 music files
Before the effects of the Content Types settings or changes to the settings are visible to customers, you must rebuild your site index.

Selecting the content types that are filtered

Select the content types that you want to pass to the script that is specified in Filtering Script.
To select the content types that are filtered
  1. On the product menu, click Settings > Filtering > Content Types.
  2. On the Content Types page, check the content types that you want to pass to the filtering script.
  3. Click Save Changes.
  4. (Optional) Rebuild your staged site index if you want to preview the results.
  5. (Optional) On the Content Types page, do any of the following: