
About the Crawling menu

Use the Crawling menu to set date and URL masks, passwords, content types, connections, form definitions, and URL entrypoints.

About URL Entrypoints

Most websites have one primary entry point or home page that a customer initially visits. This main entry point is the URL address from which the search robot begins index crawling. However, if your website has multiple domains or subdomains, or if portions of your site are not linked from the primary entry point, you can use URL Entrypoints to add more entry points.
All website pages below each specified URL entry point are indexed. You can combine URL entry points with masks to control exactly which portions of a website that you want to index. You must rebuild your website index before the effects of URL Entrypoints settings are visible to customers.
The main entry point is typically the URL of the website that you want to index and search. You configure this main entry point in Account Settings.
After you have specified the main URL entry point, you can optionally specify additional entry points that you want crawled, in the order that they are listed. Most often, you specify additional entry points for web pages that are not linked from pages under the main entry point, such as when your website spans more than one domain, as in the following example:
https://www.domain.com/
https://www.domain.com/not_linked/but_search_me_too/
https://more.domain.com/
You can qualify each entry point with one or more of the space-separated keywords in the table below. These keywords affect how the page is indexed.
Important : Be sure that you separate each keyword from the entry point, and from other keywords, with a space; a comma is not a valid separator.
Keyword
Description
noindex
If you do not want to index the text on the entry point page, but you do want to follow the page's links, add noindex after the entry point.
Separate the keyword from the entry point with a space as in the following example:
https://www.my-additional-domain.com/more_pages/main.html noindex
This keyword is equivalent to a robots meta tag with content="noindex" between the <head> ... </head> tags of the entry point page.
nofollow
If you want to index the text in the entry point page but you do not want to follow any of the page's links, add nofollow after the entry point.
Separate the keyword from the entry point with a space as in the following example:
https://www.domain.com/not_linked/directory_listing nofollow
This keyword is equivalent to a robots meta tag with content="nofollow" between the <head> ... </head> tags of the entry point page.
form
When the entry point is a login page, form is typically used so that the search robot can submit the login form and receive the appropriate cookies before crawling the website. When the "form" keyword is used, the entry point page is not indexed and the search robot does not mark the entry point page as crawled. Use nofollow if you do not want the search robot to follow the page's links.
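For example, the following hypothetical entry point (the URL and keyword combination are illustrative only) submits the login form before crawling begins and also prevents the search robot from following the login page's links:
https://www.mydomain.com/login.html form nofollow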

Adding multiple URL entry points that you want indexed

If your website has multiple domains or subdomains and you want them crawled, you can use URL Entrypoints to add more URLs.
To set your website's main URL entry point, you use Account Settings.
To add multiple URL entry points that you want indexed
  1. On the product menu, click Settings > Crawling > URL Entrypoints .
  2. On the URL Entrypoints page, in the Entrypoints field, enter one URL address per line.
  3. (Optional) In the Add Index Connector Configurations drop-down list, select an index connector that you want to add as an entry point for indexing.
    The drop-down list is only available if you have previously added one or more index connector definitions.
  4. Click Save Changes .
  5. (Optional) Do any of the following:

About URL Masks

URL masks are patterns that determine which documents on your website the search robot indexes or does not index.
Be sure that you rebuild your site index so that the results of your URL masks are visible to your customers.
The following are two kinds of URL masks that you can use:
  • Include URL masks
  • Exclude URL masks
Include URL masks tell the search robot to index any documents that match the mask's pattern.
Exclude URL masks tell the search robot not to index matching documents.
As the search robot travels from link to link through your website, it encounters URLs and looks for masks that match those URLs. The first match determines whether to include or exclude that URL from the index. If no mask matches an encountered URL, that URL is discarded from the index.
Include URL masks for your entrypoint URLs are automatically generated. This behavior ensures that all encountered documents on your website are indexed. It also conveniently excludes links that "leave" your website. For example, if an indexed page links to https://www.yahoo.com, the search robot does not index that URL because it does not match the include mask automatically generated by the entrypoint URL.
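For example, if your main entry point is https://www.mydomain.com/ (a hypothetical site), the automatically generated mask has the same effect as listing the following include mask yourself:
include https://www.mydomain.com/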
Each URL mask that you specify must be on a separate line.
The mask can specify any of the following:
  • A full path as in https://www.mydomain.com/products.html .
  • A partial path as in https://www.mydomain.com/products .
  • A URL that uses wild cards as in https://www.mydomain.com/*.html .
  • A regular expression (for advanced users).
    To make a mask a regular expression, insert the keyword regexp between the mask type ( exclude or include ) and the URL mask.
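    For example, the following hypothetical mask uses a regular expression to exclude any URL that contains a /private/ path segment:
    exclude regexp ^.*/private/.*$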
The following is a simple exclude URL mask example:
exclude https://www.mydomain.com/photos

Because this example is an exclude URL mask, any document that matches the pattern is not indexed. The pattern matches any encountered item, both files and folders, so that https://www.mydomain.com/photos.html and https://www.mydomain.com/photos/index.html , both of which match the exclude URL, are not indexed. To match only files in the /photos/ folder, the URL mask must contain a trailing slash as in the following example:
exclude https://www.mydomain.com/photos/

The following exclude mask example uses a wild card. It tells the search robot to overlook files with the ".pdf" extension. The search robot does not add these files to your index.
exclude *.pdf

A simple include URL mask is the following:
include https://www.mydomain.com/news/

Only documents that are linked by way of a series of links from a URL entrypoint, or that are used as a URL entrypoint themselves, are indexed. Solely listing a document's URL as an include URL mask does not index an unlinked document. To add unlinked documents to your index, you can use the URL Entrypoints feature.
Include masks and exclude masks can work together. You can exclude a large portion of your website from indexing by creating an exclude URL mask yet include one or more of those excluded pages with an include URL mask. For example, suppose your entrypoint URL is the following:
https://www.mydomain.com/photos/

The search robot crawls and indexes all of the pages under /photos/summer/ , /photos/spring/ and /photos/fall/ (assuming that there are links to at least one page in each directory from the photos folder). This behavior occurs because the link paths enable the search robot to find the documents in the /summer/ , /spring/ , and /fall/ folders, and the folder URLs match the include mask that is automatically generated by the entrypoint URL.
You can choose to exclude all pages in the /fall/ folder with an exclude URL mask as in the following example:
exclude https://www.mydomain.com/photos/fall/

Or, selectively include only /photos/fall/redleaves4.html as part of the index with the following URL mask:
include https://www.mydomain.com/photos/fall/redleaves4.html

For the above two mask examples to work as intended, the include mask must be listed first, as in the following:
include https://www.mydomain.com/photos/fall/redleaves4.html 
exclude https://www.mydomain.com/photos/fall/

Because the search robot follows directions in the order that they are listed, the search robot first includes /photos/fall/redleaves4.html , and then excludes the rest of the files in the /fall folder.
If the instructions are specified in the opposite way as in the following:
exclude https://www.mydomain.com/photos/fall/ 
include https://www.mydomain.com/photos/fall/redleaves4.html

Then /photos/fall/redleaves4.html is not included, even though the mask specifies that it is included.
A URL mask that appears first always takes precedence over a URL mask that appears later in the mask settings. Additionally, if the search robot encounters a page that matches an include URL mask and an exclude URL mask, the mask that is listed first always takes precedence.

About using keywords with URL masks

You can qualify each include mask with one or more space-separated keywords, which affect how the matched pages are indexed.
A comma is not valid as a separator between the mask and the keyword; you can only use spaces.
Keyword
Description
noindex
If you do not want to index the text on the pages that match the URL mask, but you want to follow the matched pages links, add noindex after the include URL mask. Be sure that you separate the keyword from the mask with a space as in the following example:
include *.swf noindex
The above example specifies that the search robot follow all links from files with the .swf extension, but disables indexing of all text contained within those files.
The noindex keyword is equivalent to a robot meta tag with content="noindex" between the <head>...</head> tags of matched pages.
nofollow
If you want to index the text on the pages that match the URL mask, but you do not want to follow the matched page's links, add nofollow after the include URL mask. Be sure that you separate the keyword from the mask with a space as in the following example:
include https://www.mydomain.com/photos nofollow
The nofollow keyword is equivalent to a robot meta tag with content="nofollow" between the <head>...</head> tags of matched pages.
regexp
Used for both include and exclude masks.
Any URL mask preceded with regexp is treated as a regular expression. If the search robot encounters documents that match an exclude regular expression URL mask, those documents are not indexed. If the search robot encounters documents that match an include regular expression URL mask, those documents are indexed. For example, suppose you have the following URL mask:
exclude regexp ^.*/products/.*\.html$
The search robot excludes matching files such as https://www.mydomain.com/products/page1.html .
If you had the following exclude regular expression URL mask:
exclude regexp ^.*\?..*$
The search robot does not include any URL containing a CGI parameter such as https://www.mydomain.com/cgi/prog/?arg1=val1&arg2=val2 .
If you had the following include regular expression URL mask:
include regexp ^.*\.swf$ noindex
The search robot follows all links from files with the ".swf" extension. The noindex keyword also specifies that the text of matched files is not indexed.

Adding URL masks to index or not index parts of your website

You can use URL Masks to define which parts of your website you want or do not want crawled and indexed.
Use the Test URL Masks field to test whether a document is or is not included after you index.
Be sure that you rebuild your site index so that the results of your URL masks are visible to your customers.
To add URL masks to index or not index parts of your website
  1. On the product menu, click Settings > Crawling > URL Masks .
  2. (Optional) On the URL Masks page, in the Test URL Masks field, enter a test URL mask from your website, and then click Test .
  3. In the URL Masks field, type include (to add a website that you want crawled and indexed), or type exclude (to block a website from getting crawled and indexed), followed by the URL mask address.
    Enter one URL mask per line. Example:
    include https://www.mycompany.com/summer 
    include https://www.mycompany.com/spring 
    exclude regexp .*\.xml 
    exclude https://www.mycompany.com/fall
    
    
  4. Click Save Changes .
  5. (Optional) Do any of the following:

About Date Masks

You can use Date Masks to include or exclude files from your search results based on the age of the file.
Be sure that you rebuild your site index so that the results of your date masks are visible to your customers.
The following are two kinds of date masks that you can use:
  • Include date masks ("include-days" and "include-date")
    Include date masks index files that are dated on or before the specified date.
  • Exclude date masks ("exclude-days" and "exclude-date")
    Exclude date masks exclude files from the index that are dated on or before the specified date.
By default, the file date is determined from meta tag information. If no Meta tag is found, the date of a file is determined from the HTTP header that is received from the server when the search robot downloads a file.
Each date mask that you specify must be on a separate line.
The mask can specify any of the following:
  • A full path as in https://www.mydomain.com/products.html
  • A partial path as in https://www.mydomain.com/products
  • A URL that uses wild cards, as in https://www.mydomain.com/*.html
  • A regular expression. To make a mask a regular expression, insert the keyword regexp before the URL.
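    For example, the following hypothetical date mask uses a regular expression to exclude matching PDF files that are 30 days old or older:
    exclude-days 30 regexp ^.*\.pdf$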
Both include and exclude date masks can specify a date in one of the two following ways. The masks are only applied if the matched files were created on or before the specified date:
  1. A number of days. For example, suppose your date mask is the following:
    exclude-days 30 https://www.mydomain.com/docs/archive/
    
    
    The number of specified days is counted back from today to arrive at a date. If the file is dated on or before that date, the mask is applied.
  2. An actual date using the format YYYY-MM-DD. For example, suppose your date mask is the following:
    include-date 2011-02-15 https://www.mydomain.com/docs/archive/
    
    
    If the matched document is dated on or before the specified date, the date mask is applied.
The following is a simple exclude date mask example:
exclude-days 90 https://www.mydomain.com/docs/archive

Because this is an exclude date mask, any file that matches the pattern and is 90 days old or older is not indexed. When you exclude a document, no text is indexed and no links are followed from that file. The file is effectively ignored. In this example, both files and folders might match the specified URL pattern. Notice that both https://www.mydomain.com/docs/archive.html and https://www.mydomain.com/docs/archive/index.html match the pattern and are not indexed if they are 90 days old or older. To match only files in the /docs/archive/ folder, the date mask must contain a trailing slash as in the following:
exclude-days 90 https://www.mydomain.com/docs/archive/

Date masks can also be used with wild cards. The following exclude mask tells the search robot to overlook files with the ".pdf" extension that are dated on or before 2011-02-15. The search robot does not add any matched files to your index.
exclude-date 2011-02-15 *.pdf

An include date mask looks similar, except that matched files are added to the index. The following include date mask example tells the search robot to index the text from any files that are zero days old or older in the /docs/archive/manual/ area of the website.
include-days 0 https://www.mydomain.com/docs/archive/manual/

Include masks and exclude masks can work together. For example, you can exclude a large portion of your website from indexing by creating an exclude date mask yet include one or more of those excluded pages with an include date mask. If your entrypoint URL is the following:
https://www.mydomain.com/archive/

The search robot crawls and indexes all of the pages under /archive/summer/ , /archive/spring/ , and /archive/fall/ (assuming that there are links to at least one page in each folder from the archive folder). This behavior occurs because the link paths enable the search robot to "find" the files in the /summer/ , /spring/ , and /fall/ folders and the folder URLs match the include mask automatically generated by the entrypoint URL.
You may choose to exclude all pages over 90 days old in the /fall/ folder with an exclude date mask as in the following:
exclude-days 90 https://www.mydomain.com/archive/fall/

You can selectively include only /archive/fall/index.html (regardless of how old it is; any file that is zero days old or older is matched) as part of the index with the following date mask:
include-days 0 https://www.mydomain.com/archive/fall/index.html

For the above two mask examples to work as intended, you must list the include mask first as in the following:
include-days 0 https://www.mydomain.com/archive/fall/index.html 
exclude-days 90 https://www.mydomain.com/archive/fall/

Because the search robot follows directions in the order they are specified, the search robot first includes /archive/fall/index.html , and then excludes the rest of the files in the /fall folder.
If the instructions are specified in the opposite way as in the following:
exclude-days 90 https://www.mydomain.com/archive/fall/ 
include-days 0 https://www.mydomain.com/archive/fall/index.html 

Then /archive/fall/index.html is not included, even though the mask specifies that it should be. A date mask that appears first always takes precedence over a date mask that might appear later in the mask settings. Additionally, if the search robot encounters a page that matches both an include date mask and an exclude date mask, the mask that is listed first always takes precedence.

About using keywords with date masks

You can qualify each include mask with one or more space-separated keywords, which affect how the matched pages are indexed.
A comma is not valid as a separator between the mask and the keyword; you can only use spaces.
Keyword
Description
noindex
If you do not want to index the text on the pages that are dated on or before the date that is specified by the include mask, add noindex after the include date mask as in the following:
include-days 10 *.swf noindex
Be sure you separate the keyword from the mask with a space.
The above example specifies that the search robot follow all links from files with the ".swf" extension that are 10 days old or older. However, it disables indexing of all text contained in those files.
You may want to make sure that the text of older files is not indexed while still following all links from those files. In such cases, use an include date mask with the "noindex" keyword instead of an exclude date mask.
nofollow
If you want to index the text on the pages that are dated on or before the date that is specified by the include mask, but you do not want to follow the matched page's links, add nofollow after the include date mask as in the following:
include-days 8 https://www.mydomain.com/photos nofollow
Be sure you separate the keyword from the mask with a space.
The nofollow keyword is equivalent to a robot meta tag with content="nofollow" between the <head>...</head> tags of matched pages.
server-date
Used for both include and exclude masks.
The search robot generally downloads and parses every file before checking the date masks. This behavior occurs because some file types can specify a date inside the file itself. For example, an HTML document can include meta tags that set the date of the file.
If you are going to exclude many files based on their date, and you do not want to put an unnecessary load on your servers, you can use server-date after the URL in the date mask.
This keyword instructs the search robot to trust the date of the file that is returned by your server instead of parsing each file. For example, the following exclude date mask ignores pages that match the URL if the documents are 90 days or older, according to the date that is returned by the server in the HTTP headers:
exclude-days 90 https://www.mydomain.com/docs/archive server-date
If the date that is returned by the server is 90 days or more in the past, server-date specifies that the excluded documents are not downloaded from your server. The result is faster indexing time for your documents and a reduced load on your servers. If server-date is not specified, the search robot ignores the date that is returned by the server in the HTTP headers. Instead, each file is downloaded and checked to see if a date is specified within it. If no date is specified in the file, the search robot uses the date that is returned by the server.
You should not use server-date if your files contain commands that override the server date.
regexp
Use for both include and exclude masks.
Any date mask that is preceded by regexp is treated as a regular expression.
If the search robot encounters files that match an exclude regular expression date mask, it does not index those files.
If the search robot encounters files that match an include regular expression date mask, it indexes those documents.
For example, suppose you have the following date mask:
exclude-days 180 regexp .*archive.*
The mask tells the search robot to exclude matching files (files that contain the word "archive" in their URL) that are 180 days old or older.

Adding date masks to index or not index parts of your website

You can use Date Masks to include or exclude files from customer search results based on the age of the files.
Use the Test Date and Test URL fields to test whether a file is or is not included after you index.
Be sure that you rebuild your site index so that the results of your date masks are visible to your customers.
To add date masks to index or not index parts of your website
  1. On the product menu, click Settings > Crawling > Date Masks .
  2. (Optional) On the Date Masks page, in the Test Date field, enter a date formatted as YYYY-MM-DD (for example, 2011-07-25 ); in the Test URL field, enter a URL mask from your website, and then click Test .
  3. In the Date Masks field, enter one date mask per line.
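    Example (the paths and day counts are illustrative only):
    include-days 0 https://www.mycompany.com/news/ 
    exclude-days 180 https://www.mycompany.com/news/archive/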
  4. Click Save Changes .
  5. (Optional) Do any of the following:

About Passwords

To access portions of your website that are protected with HTTP Basic Authentication, you can add one or more passwords.
Before the effects of the Password settings are visible to customers, you must rebuild your site index.
On the Passwords page, you type each password on a single line. The password consists of a URL or realm, a user name, and a password, as in the following example:
https://www.mydomain.com/ myname mypassword

Instead of using a URL path, as shown above, you can also specify a realm.
To determine the correct realm to use, open a password-protected web page with a browser and look at the "Enter Network Password" dialog box.
The realm name, in this case, is "My Site Realm."
Using the realm name above, your password might look like the following:
My Site Realm myusername mypassword

If your web site has multiple realms, you can create multiple passwords by entering a user name and password for each realm on a separate line as in the following example:
Realm1 name1 password1 
Realm2 name2 password2 
Realm3 name3 password3

You can intermix passwords that contain URLs or realms so that your password list might look like the following:
Realm1 name1 password1 
https://www.mysite.com/path1/path2 name2 password2 
Realm3 name3 password3 
Realm4 name4 password4 
https://www.mysite.com/path1/path5 name5 password5 
https://www.mysite.com/path6 name6 password6

In the list above, the first password whose realm or URL matches the server's authentication request is used. Even if the file at https://www.mysite.com/path1/path2/index.html is in Realm3 , for example, name2 and password2 are used because the password that is defined with the URL is listed above the one defined with the realm.

Adding passwords for accessing areas of your website that require authentication

You can use Passwords to access password-protected areas of your website for crawling and indexing purposes.
Before the effects of your password additions are visible to customers, be sure that you rebuild your site index.
To add passwords for accessing areas of your website that require authentication
  1. On the product menu, click Settings > Crawling > Passwords .
  2. On the Passwords page, in the Passwords field, enter a realm or URL, followed by its associated user name and password, separated by spaces.
    Example of a realm password and a URL password on separate lines:
    Realm1 name1 password1 
    https://www.mysite.com/path1/path2 name2 password2
    
    
    Only add one password per line.
  3. Click Save Changes .
  4. (Optional) Do any of the following:

About Content Types

You can use Content Types to select which types of files you want to crawl and index for this account.
Content types that you can choose to crawl and index include PDF documents, text documents, Adobe Flash movies, files from Microsoft Office applications such as Word, Excel, and PowerPoint, and text in MP3 files. The text that is found within the selected content types is searched along with all of the other text on your website.
Before the effects of the Content Types settings are visible to customers, you must rebuild your site index.

About indexing MP3 music files

If you select the option Text in MP3 Music Files on the Content Types page, an MP3 file is crawled and indexed in one of two ways. The first and most common way is from an anchor href tag in an HTML file as in the following:
<a href="MP3-file-URL"></a>

The second way is to enter the URL of the MP3 file as a URL entrypoint.
An MP3 file is recognized by its MIME type "audio/mpeg".
Be aware that MP3 music file sizes can be quite large, even though they usually contain only a small amount of text. For example, MP3 files can optionally store such things as the album name, artist name, song title, song genre, year of release, and a comment. This information is stored at the very end of the file in what is called the TAG. MP3 files containing TAG information are indexed in the following way:
  • The song title is treated like the title of an HTML page.
  • The comment is treated like a description that is defined for an HTML page.
  • The genre is treated like a keyword that is defined for an HTML page.
  • The artist name, album name, and year of release are treated like the body of an HTML page.
Note that each MP3 file that is crawled and indexed on your website counts as one page.
If your website contains many large MP3 files, you may exceed the indexing byte limit for your account. If this happens, you can deselect Text in MP3 Music Files on the Content Types page to prevent the indexing of all MP3 files on your website.
If you only want to prevent the indexing of certain MP3 files on your website, you can do one of the following:
  • Surround the anchor tags that link to the MP3 files with <nofollow> and </nofollow> tags, as shown in the example after this list. The search robot does not follow links between those tags.
  • Add the URLs of the MP3 files as exclude masks.
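The following sketch shows the first approach; the file URL is hypothetical:
<nofollow><a href="https://www.mydomain.com/music/song1.mp3">Song 1</a></nofollow>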

Selecting content types to crawl and index

You can use Content Types to select which types of files you want to crawl and index for this account.
Content types that you can choose to crawl and index include PDF documents, text documents, Adobe Flash movies, files from Microsoft Office applications such as Word, Excel, and PowerPoint, and text in MP3 files. The text that is found within the selected content types is searched along with all of the other text on your website.
Before the effects of the Content Types settings are visible to customers, you must rebuild your site index.
To crawl and index Chinese, Japanese, or Korean MP3 files, complete the steps below. Then, in Settings > Metadata > Injections , specify the character set that is used to encode the MP3 files.
To select content types to crawl and index
  1. On the product menu, click Settings > Crawling > Content Types .
  2. On the Content Types page, check the file types that you want to crawl and index on your website.
  3. Click Save Changes .
  4. (Optional) Do any of the following:

About Connections

You can use Connections to add up to ten HTTP connections that the search robot uses to index your website.
Increasing the number of connections can significantly reduce the amount of time that it takes to complete a crawl and index. However, be aware that each additional connection increases the load on your server.

Adding connections to increase indexing speed

You can reduce the amount of time it takes to index your website by using Connections to increase the number of simultaneous HTTP connections that the crawler uses. You can add up to ten connections.
Be aware that each additional connection increases the load that is placed on your server.
To add connections to increase indexing speed
  1. On the product menu, click Settings > Crawling > Connections .
  2. On the Parallel Indexing Connections page, in the Number of Connections field, enter the number of connections (1-10) that you want to add.
  3. Click Save Changes .
  4. (Optional) Do any of the following:

About Form Submission

You can use Form Submission to help you recognize and process forms on your website.
During the crawling and indexing of your website, each encountered form is compared with the form definitions that you have added. If a form matches a form definition, the form is submitted for indexing. If a form matches more than one definition, the form is submitted once for each matched definition.

Adding form definitions for indexing forms on your website

You can use Form Submission to help process forms that are recognized on your website for indexing purposes.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
To add form definitions for indexing forms on your website
  1. On the product menu, click Settings > Crawling > Form Submission .
  2. On the Form Submission page, click Add New Form .
  3. On the Add Form Definition page, set the Form Recognition and Form Submission options.
    The five options in the Form Recognition section of the Form Definition page are used to identify forms in your web pages that can be processed.
    The three options in the Form Submission section are used to specify the parameters and values that are submitted with a form to your web server.
    Enter one recognition or submission parameter per line. Each parameter must include a name and a value.
    Option
    Description
    Form Recognition
    Page URL Mask
    Identify the web page or pages that contain the form. To identify a form that appears on a single page, enter the URL for that page as in the following example:
    https://www.mydomain.com/login.html
    To identify forms that appear on multiple pages, specify a URL mask that uses wildcards to describe the pages. To identify forms encountered on any ASP page under https://www.mydomain.com/register/ , for example, you would specify the following:
    https://www.mydomain.com/register/*.asp
    You can also use a regular expression to identify multiple pages. Just specify the regexp keyword before the URL mask as in the following example:
    regexp ^https://www\.mydomain\.com/.*/login\.html$
    Action URL Mask
    Identifies the action attribute of the <form> tag.
    Like the page URL mask, the action URL mask can take the form of a single URL, a URL with wildcards, or a regular expression.
    The URL mask can be any of the following:
    • A full path as in the following: https://www.mydomain.com/products.html
    • A partial path as in the following: https://www.mydomain.com/products
    • A URL that uses wild cards as in the following: https://www.mydomain.com/*.html
    • A regular expression as in the following: regexp ^https://www\.mydomain\.com/.*/login\.html$
    If you do not want to index the text on pages that are identified by a URL mask or by an action URL mask, or if you do not want links followed on those pages, you can use the noindex and nofollow keywords. You can add these keywords to your masks using URL masks or entrypoints.
    Form Name Mask
    Identifies forms if the <form> tags in your web pages contain a name attribute.
    You can use a simple name ( login_form ), a name with a wildcard ( form* ), or a regular expression ( regexp ^.*authorize.*$ ).
    You can usually leave this field empty because forms typically do not have a name attribute.
    Form ID Mask
    Identifies forms if the <form> tags in your web pages contain an id attribute.
    You can use a simple name ( login_form ), a name with a wildcard ( form* ), or a regular expression ( regexp ^.*authorize.*$ ).
    You can usually leave this field empty because forms typically do not have an id attribute.
    Parameters
    Identify forms that contain, or do not contain, a named parameter or a named parameter with a specific value.
    For example, to identify a form that contains an e-mail parameter that is preset to rick_brough@mydomain.com, a password parameter, but not a first-name parameter, you would specify the following parameter settings, one per line:
    email=rick_brough@mydomain.com
    password
    not first-name
    Form Submission
    Override Action URL
    Specify this option when the target of the form submission is different from what is specified in the form's action attribute.
    For example, you might use this option when the form is submitted by way of a JavaScript function that constructs a URL value that is different from what is found in the form.
    Override Method
    Specify this option when the target of the form submission is different from what is used in the form's action attribute and when the submitting JavaScript has changed the method.
    The default values for all form parameters ( <input> tags, including hidden fields, the default <option> from a <select> tag, and the default text between <textarea>...</textarea> tags) are read from the web page. However, any parameter that is listed in the Parameters field of the Form Submission section overrides the corresponding default value that was read.
    Parameters
    You can prefix form submission parameters with the not keyword.
    When you prefix a parameter with not , it is not submitted as part of the form submission. This behavior is useful for check boxes that should be submitted deselected.
    For example, suppose you want to submit the following parameters:
    • The e-mail parameter with the value nobody@mydomain.com
    • The password parameter with the value tryme
    • The mycheckbox parameter as deselected.
    • All other <form> parameters as their default values
    Your form submission parameters would look like the following, one per line:
    email=nobody@mydomain.com
    password=tryme
    not mycheckbox
    The method attribute of the <form> tag on the web page is used to decide if the data is sent to your server using the GET method or the POST method.
    If the <form> tag does not contain a method attribute, the form is submitted using the GET method.
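    As an illustration only, the following hypothetical definition shows how the options might be combined for a simple login form (all URLs, parameter names, and values are examples, not required settings):
    Page URL Mask: https://www.mydomain.com/login.html
    Action URL Mask: https://www.mydomain.com/cgi-bin/login.cgi
    Form Name Mask: (empty)
    Form ID Mask: (empty)
    Parameters (Form Recognition):
    email
    password
    Override Action URL: (empty)
    Override Method: (empty)
    Parameters (Form Submission):
    email=searchrobot@mydomain.com
    password=tryme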
  4. Click Add .
  5. (Optional) Do any of the following:

Editing a form definition

You can edit an existing form definition if a form on your website has changed or if you just need to change the definition.
Be aware that there is no History feature on the Form Submission page to revert any changes that you make to a form definition.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
To edit a form definition
  1. On the product menu, click Settings > Crawling > Form Submission .
  2. On the Form Submission page, click Edit to the right of a form definition that you want to update.
  3. On the Edit Form Definition page, set the Form Recognition and Form Submission options.
  4. Click Save Changes .
  5. (Optional) Do any of the following:

Deleting a form definition

You can delete an existing form definition if the form no longer exists on your website, or if you no longer want to process and index a particular form.
Be aware that there is no History feature on the Form Submission page to revert any changes that you make to a form definition.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
To delete a form definition
  1. On the product menu, click Settings > Crawling > Form Submission .
  2. On the Form Submission page, click Delete to the right of a form definition that you want to remove.
    Make sure you choose the right form definition to delete. There is no delete confirmation dialog box when you click Delete in the next step.
  3. On the Delete Form Definition page, click Delete .
  4. (Optional) Do any of the following:

About Index Connector

Use Index Connector to define additional input sources for indexing XML pages or any kind of feed.
You can use a data feed input source to access content that is stored in a form that is different from what is typically discovered on a website using one of the available crawl methods. Each document that is crawled and indexed directly corresponds to a content page on your website. However, a data feed either comes from an XML document or from a comma- or tab-delimited text file, and contains the content information to index.
An XML data source consists of XML stanzas, or records, that contain information that corresponds to individual documents. These individual documents are added to the index. A text data feed contains individual new-line-delimited records that correspond to individual documents. These individual documents are also added to the index. In either case, an index connector configuration describes how to interpret the feed. Each configuration describes where the file resides and how the servers access it. The configuration also describes "mapping" information. That is, how each record's items are used to populate the metadata fields in the resulting index.
After you add an Index Connector definition to the Staged Index Connector Definitions page, you can change any configuration setting, except for the Name or Type values.
The Index Connector page shows you the following information:
  • The name of defined index connectors that you have configured and added.
  • One of the following data source types for each connector that you have added:
    • Text - Simple "flat" files, comma-delimited, tab-delimited, or other consistently delimited formats.
    • Feed - XML feeds.
    • XML - Collections of XML documents.
  • Whether or not the connector is enabled for the next crawl and index operation.
  • The address of the data source.

How the indexing process works for Text and Feed configurations in Index Connector

Step
Process
Description
1
Download the data source.
For Text and Feed configurations, it is a simple file download.
2
Break down the downloaded data source into individual pseudo-documents.
For Text , each newline-delimited line of text corresponds to an individual document, and is parsed using the specified delimiter, such as a comma or tab.
For Feed , each document's data is extracted using a regular expression pattern in the following form:
<${Itemtag}>(.*?)</${Itemtag}>
Using Map on the Index Connector Add page, create a cached copy of the data and then create a list of links for the crawler. The data is stored in a local cache and is populated with the configured fields.
The parsed data is written to the local cache.
This cache is read later to create the simple HTML documents that the crawler needs. For example,
<html><head> <title>{title}</title> <meta name="{field}" content="{data}" /> ... </head><body> {body} </body></html>
The <title> element is only generated when a mapping exists to the Title metadata field. Similarly, the <body> element is only generated when a mapping exists to the Body metadata field.
Important : There is no support for the assignment of values to the pre-defined URL meta tag.
For all other mappings, <meta> tags are generated for each field that has data found in the original document.
The fields for each document are added to the cache. For each document that is written to the cache, a link is also generated as in the following examples:
<a href="index:Adobe?key=<primary key field>\" /> <a href="index:Adobe?key=<primary key field>\" /> ....
The configuration's mapping must have one field identified as the Primary Key. This mapping forms the key that is used when data is fetched from the cache.
The crawler recognizes the index: URL scheme prefix, which lets it access the locally cached data.
3
Crawl the cached document set.
The index: links are added to the crawler's pending list, and are processed in the normal crawl sequence.
4
Process each document.
Each link’s key value corresponds to an entry in the cache, so crawling each link results in that document’s data being fetched from the cache. It is then “assembled” into an HTML image that is processed and added to the index.
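As a hypothetical sketch of this process for a Text configuration (the connector name, file contents, and mappings below are illustrative only): suppose a connector named MyConnector downloads a comma-delimited file whose columns 1 through 4 are mapped to a key field (the Primary Key), title, description, and body. A source line such as the following:
101,Red Bicycle,A lightweight road bike,Full product copy goes here
would be cached roughly as the following document, and a link such as <a href="index:MyConnector?key=101" /> would be queued for the crawl:
<html><head> <title>Red Bicycle</title> <meta name="description" content="A lightweight road bike" /> </head><body> Full product copy goes here </body></html>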

How the indexing process works for XML configurations in Index Connector

The indexing process for XML configurations is similar to the process for Text and Feed configurations, with the following minor changes and exceptions.
Because the documents for XML crawls are already separated into individual files, steps 1 and 2 in the table above do not directly apply. If you specify a URL in the Host Address and File Path fields of the Index Connector Add page, it is downloaded and processed as a normal HTML document. The expectation is that the downloaded document contains a collection of <a href="{url}"... links, each of which points to an XML document that is processed. Such links are converted to the following form:
<a href="index:<ic_config_name>?url={url}">

For example, if the Adobe setup returned the following links:
<a href="https://www.adobe.com/somepath/doc1.xml">doc 1</a> 
<a href="https://www.adobe.com/otherpath/doc2.xml">doc 2</a>

In the table above, step 3 does not apply and step 4 is completed at the time of crawling and indexing.
Alternatively, you can intermix your XML documents with other documents that were discovered naturally through the crawl process. In such cases, you can use rewrite rules ( Settings > Rewrite Rules > Crawl List Retrieve URL Rules ) to change the XML documents' URLs to direct them to Index Connector.
For example, suppose you have the following rewrite rule:
RewriteRule (^http.*[.]xml$) index:Adobe?key=$1

This rule translates any URL ending with .xml into an Index Connector link. The crawler recognizes and rewrites the index: URL scheme. The download process is redirected through the Index Connector Apache server on the master. Each downloaded document is examined using the same regular expression pattern that is used with Feeds. In this case, however, the manufactured HTML document is not saved in the cache. Instead, it is handed directly to the crawler for index processing.

How to configure multiple Index Connectors

You can define multiple Index Connector configurations for any account. The configurations are automatically added to the drop-down list in Settings > Crawling > URL Entrypoints .
Selecting a configuration from the drop-down list adds the value to the end of the list of URL entry points.
While disabled Index Connector configurations are added to the drop-down list, you cannot select them. If you select the same Index Connector configuration a second time, it is added to the end of the list, and the previous instance is deleted.
To specify an Index Connector entry point for an incremental crawl, you can add entries using the following format:
index:<indexconnector_configuration_name>

The crawler processes each added entry if it is found on the Index Connectors page and is enabled.
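For example, a URL Entrypoints list might combine ordinary URLs with Index Connector entries as in the following sketch (the connector name is hypothetical):
https://www.mydomain.com/
index:my_feed_connector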
Note: Because each document's URL is constructed using the Index Connector configuration name and the document's primary key, be sure you use the same Index Connector configuration name when performing Incremental updates. Doing so permits Adobe Search&Promote to correctly update previously indexed documents.
The use of Setup Maps when you add an Index Connector
At the time you add an Index Connector, you can optionally use the feature Setup Maps to download a sample of your data source. The data is examined for indexing suitability.
If you chose the Index Connector type...
The Setup Maps feature...
Text
Determines the delimiter value by trying tabs first, then vertical-bars ( | ), and finally commas ( , ). If you already specified a delimiter value before you clicked Setup Maps , that value is used instead.
The best-fit scheme results in the Map fields being filled out with guesses at the appropriate Tag and Field values. Additionally, a sampling of the parsed data is displayed. Be sure to select Headers in First Row if you know that the file includes a header row. The setup function uses this information to better identify the resulting map entries.
Feed
Downloads the data source and performs simple XML parsing.
The resulting XPath identifiers are displayed in the Tag rows of the Map table, and similar values are placed in the Field rows. These rows only identify the available data, and do not generate the more complicated XPath definitions. However, it is still helpful because it describes the XML data and identifies Itemtag values.
Note: The Setup Maps function downloads the entire XML source to perform its analysis. If the file is large, this operation could time out.
When successful, this function identifies all possible XPath items, many of which are not desirable to use. Be sure that you examine the resulting Map definitions and remove the ones that you do not need or want.
XML
Downloads the URL of a representative individual document, not the master link list. This single document is parsed using the same mechanism that is used with Feeds, and the results are displayed.
Before you click Add to save the configuration, be sure that you change the URL back to the master link list document.
Important : The Setup Maps feature may not work for large XML data sets because its file parser attempts to read the entire file into memory. As a result, you could experience an out-of-memory condition. However, when the same document is processed at the time of indexing, it is not read into memory. Instead, large documents are processed “on the go,” and are not read entirely into memory first.
The use of Preview when you add an Index Connector
At the time you add an Index Connector, you can optionally use the feature Preview to validate the data, as though you were saving it. It runs a test against the configuration, but without saving the configuration to the account. The test accesses the configured data source. However, it writes the download cache to a temporary location; it does not conflict with the main cache folder that the indexing crawler uses.
Preview only processes a default of five documents as controlled by Acct:IndexConnector-Preview-Max-Documents. The previewed documents are displayed in source form, as they are presented to the indexing crawler. The display is similar to a "View Source" feature in a Web browser. You can navigate the documents in the preview set using standard navigation links.
Preview does not support XML configurations because such documents are processed directly and not downloaded to the cache.

Adding an Index Connector definition

Each Index Connector configuration defines a data source and mappings to relate the data items defined for that source to metadata fields in the index.
Before the effects of the new and enabled definition are visible to customers, rebuild your site index.
To add an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Staged Index Connector Definitions page, click Add New Index Connector .
  3. On the Index Connector Add page, set the connector options that you want. The options that are available depend on the Type that you selected.
    Option
    Description
    Name
    The unique name of the Index Connector configuration. You can use alphanumeric characters. The characters "_" and "-" are also allowed.
    Type
    The source of your data. The data source type that you select affects the resulting options that are available on the Index Connector Add page. You can choose from the following:
    • Text
      Simple flat text files, comma-delimited, tab-delimited, or other consistently delimited formats. Each newline-delimited line of text corresponds to an individual document, and is parsed using the specified delimiter.
      You can map each value, or column, to a metadata field, referenced by the column number, starting at 1 (one).
    • Feed
      Downloads a master XML document that contains multiple "rows" of information.
    • XML
      Downloads a master XML document that contains links ( <a> ) to individual XML documents.
    Data source type: Text
    Enabled
    Turns the configuration "on" to crawl and index. Or, you can turn "off" the configuration to prevent crawling and indexing.
    Note : Disabled Index Connector configurations are ignored if they are found in an entrypoint list.
    Host Address
    Specifies the address of the server host where your data is located.
    If desired, you can specify a full URI (Uniform Resource Identifier) path to the data source document as in the following examples:
    https://www.somewhere.com/some_path/some_file.xml
    or
    ftp://user:password@ftpserver.somewhere.com/some_path/some_file.xml
    The URI is broken down into the appropriate entries for the Host Address, File Path, Protocol, and, optionally, the Username and Password fields.
    Specifies the IP address or the URL address of the host system where the data source file is found.
    File Path
    Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file.
    The path is relative to the root of the host address.
    Incremental File Path
    Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Incremental Index operations. If no file is specified, the file listed under File Path is used instead.
    Vertical File Path
    Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file to be used during a Vertical Update.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Vertical Update operations.
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    Deletes File Path
    Specifies the path to the simple flat text file, containing a single document identifier value per line.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Incremental Index operations. The values found in this file are used to construct "delete" requests to remove previously indexed documents. The values in this file must correspond to the values found in the Full or Incremental File Path files, in the column identified as the Primary Key .
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    Protocol
    Specifies the protocol that is used to access the file. You can choose from the following:
    • HTTP
      If necessary, you may enter proper authentication credentials to access the HTTP server.
    • HTTPS
      If necessary, you may enter proper authentication credentials to access the HTTPS server.
    • FTP
      You must enter proper authentication credentials to access the FTP server.
    • SFTP
      You must enter proper authentication credentials to access the SFTP server.
    • File
    Timeout
    Specifies the timeout, in seconds, for FTP, SFTP, HTTP or HTTPS connections. This value must be between 30 and 300.
    Retries
    Specifies the maximum number of retries for failed FTP, SFTP, HTTP or HTTPS connections. This value must be between 0 and 10.
    A value of zero (0) will prevent retry attempts.
    Encoding
    Specifies the character encoding system that is used in the specified data source file.
    Delimiter
    Specifies the character that you want to use to delineate each field in the specified data source file.
    The comma character ( , ) is an example of a delimiter. The comma acts as a field delimiter that helps to separate data fields in your specified data source file.
    Select Tab? to use the horizontal-tab character as the delimiter.
    Headers in First Row
    Indicates that the first row in the data source file contains header information only, not data.
    Minimum number of documents for indexing
    If set to a positive value, this specifies the minimum number of records expected in the file downloaded. If fewer records are received, the index operation is aborted.
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    Note : This feature is only used during full Index operations.
    Map
    Specifies column-to-metadata mappings, using column numbers.
    • Column
      Specifies a column number, with the first column being 1 (one). To add new map rows for each column, under Action , click + .
      You do not need to reference each column in the data source. Instead, you can choose to skip values.
    • Field
      Defines the name attribute value that is used for each generated <meta> tag.
    • Metadata?
      Causes Field to become a drop-down list from which you can select defined metadata fields for the current account.
      The Field value can be an undefined metadata field, if desired. An undefined metadata field is sometimes useful to create content used by Filtering Script .
      When Index Connector processes XML documents with multiple hits on any map field, the multiple values are concatenated into a single value in the resulting cached document. By default, these values are combined using a comma delimiter. However, suppose that the corresponding Field value is a defined metadata field. In addition, that field has the Allow Lists attribute set. In this case, the field's List Delimiters value, which is the first delimiter defined, is used in the concatenation.
    • Primary Key?
      Only one map definition is identified as the primary key. This field becomes the unique reference that is presented when this document is added to the index. This value is used in the document’s URL in the Index.
      The Primary Key values must be unique across all of the documents represented by the Index Connector configuration; any duplicates encountered are ignored. If your source documents do not contain a single unique value for use as the Primary Key , but two or more fields taken together can form a unique identifier, you can define the Primary Key by combining multiple Column values with a vertical bar ("|") delimiting the values.
    • Strip HTML?
      When this option is checked, any HTML tags found in this field's data are removed.
    • Action
      Lets you add rows to the map or remove rows from the map. The order of the rows is not important.
    Data source type: Feed
    Enabled
    Turns the configuration "on" to crawl and index. Or, you can turn "off" the configuration to prevent crawling and indexing.
    Note : Disabled Index Connector configurations are ignored if they are found in an entrypoint list.
    Host Address
    Specifies the IP address or the URL address of the host system where the data source file is found.
    File Path
    Specifies the path to the master XML document that contains multiple "rows" of information.
    The path is relative to the root of the host address.
    Incremental File Path
    Specifies the path to the incremental XML document that contains multiple "rows" of information.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Incremental Index operations. If no file is specified, the file listed under File Path is used instead.
    Vertical File Path
    Specifies the path to the XML document that contains multiple sparse "rows" of information to be used during a Vertical Update.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Vertical Update operations.
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    Deletes File Path
    Specifies the path to a simple flat text file that contains one document identifier value per line.
    The path is relative to the root of the host address.
    This file, if specified, is downloaded and processed during Incremental Index operations. The values found in this file are used to construct "delete" requests to remove previously indexed documents. The values in this file must correspond to the values found in the Full or Incremental File Path files, in the column identified as the Primary Key .
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
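    For example, if the column identified as the Primary Key holds product SKU values, a deletes file would simply list one such value per line (the values shown are illustrative only):
      AB-1001
      AB-1002
      AB-2017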
    Protocol
    Specifies the protocol that is used to access the file. You can choose from the following:
    • HTTP
      If necessary, you may enter proper authentication credentials to access the HTTP server.
    • HTTPS
      If necessary, you may enter proper authentication credentials to access the HTTPS server.
    • FTP
      You must enter proper authentication credentials to access the FTP server.
    • SFTP
      You must enter proper authentication credentials to access the SFTP server.
    • File
    Itemtag
    Identifies the XML element that marks an individual record ("row") in the data source file that you specified.
    For example, in the following Feed fragment of an Adobe XML document, the Itemtag value is record :
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
    <gsafeed>
      <header>
        <datasource>marketplace</datasource>
        <feedtype>incremental</feedtype>
      </header>
      <group action="add">
        <record url="https://www.adobe.com/cfusion/marketplace_gsa/index.cfm?event=marketplace.home&amp;marketplaceid=1" action="add" mimetype="text/html" displayurl="https://www.adobe.com/cfusion/marketplace/index.cfm?event=marketplace.home&amp;marketplaceid=1">
          <metadata>
            <meta name="mp_mkt" content="1"/>
            <meta name="mp_logo" content="/images/marketplace/dbreferenced/marketplaceicons/icn_air.png"/>
            <meta name="title" content="Adobe AIR Marketplace"/>
            <meta name="description" content="Discover new applications ..."/>
          </metadata>
          <content><![CDATA[<html><head><title>Adobe AIR Marketplace</title></head><body>Discover new applications ...</body></html>]]></content>
        </record>
        <record url="https://www.adobe.com/cfusion/marketplace_gsa/index.cfm?event=marketplace.home&amp;marketplaceid=2" action="add" mimetype="text/html" displayurl="https://www.adobe.com/cfusion/marketplace/index.cfm?event=marketplace.home&amp;marketplaceid=2">
          <metadata>
            <meta name="mp_mkt" content="2"/>
            <meta name="mp_logo" content="/images/marketplace/dbreferenced/marketplaceicons/icn_photoshop.png"/>
            <meta name="title" content="Adobe Photoshop Marketplace"/>
            <meta name="description" content="Extend your creative possibilities ..."/>
          </metadata>
          <content><![CDATA[<html><head><title>Adobe Photoshop Marketplace</title></head><body>Extend your creative possibilities ...</body></html>]]></content>
        </record>
        ...
        <record>
        ...
        </record>
      </group>
    </gsafeed>
    Minimum number of documents for indexing
    If set to a positive value, this option specifies the minimum number of records expected in the downloaded file. If fewer records are received, the index operation is aborted.
    Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    Note : This feature is only used during full Index operations.
    Map
    Lets you specify XML-element-to-metadata mappings, using XPath expressions.
    • Tag
      Specifies an XPath representation of the parsed XML data. Using the example Adobe XML document shown above under the Itemtag option, it could be mapped using the following syntax:
      /record/@displayurl -> page-url /record/metadata/meta[@name='title']/@content -> title /record/metadata/meta[@name='description']/@content -> desc /record/metadata/meta[@name='description']/@content -> body
      The above syntax translates as the following:
      • /record/@displayurl -> page-url
        The displayurl attribute of the record element maps to the metadata field page-url .
      • /record/metadata/meta[@name='title']/@content -> title
        The content attribute of any meta element whose name attribute is title , contained inside a metadata element within a record element, maps to the metadata field title .
      • /record/metadata/meta[@name='description']/@content -> desc
        The content attribute of any meta element whose name attribute is description , contained inside a metadata element within a record element, maps to the metadata field desc .
      • /record/metadata/meta[@name='description']/@content -> body
        The content attribute of any meta element whose name attribute is description , contained inside a metadata element within a record element, maps to the metadata field body .
      XPath is a relatively complicated notation; consult an XPath reference or tutorial for more information.
    • Field
      Defines the name attribute value that is used for each generated <meta> tag.
    • Metadata?
      Causes Field to become a drop-down list from which you can select defined metadata fields for the current account.
      The Field value can be an undefined metadata field, if desired. An undefined metadata field is sometimes useful to create content used by Filtering Script .
      When Index Connector processes XML documents with multiple hits on any map field, the multiple values are concatenated into a single value in the resulting cached document. By default, these values are combined using a comma delimiter. However, if the corresponding Field value is a defined metadata field that has the Allow Lists attribute set, the field's List Delimiters value (the first delimiter defined) is used in the concatenation instead.
    • Primary Key?
      Only one map definition is identified as the primary key. This field becomes the unique reference that is presented when this document is added to the index. This value is used in the document’s URL in the Index.
      The Primary Key values must be unique across all of the documents represented by the Index Connector configuration; any duplicates encountered are ignored. If your source documents do not contain a single unique value to use as the Primary Key , but two or more fields taken together form a unique identifier, you can define the Primary Key by combining multiple Tag definitions with a vertical bar ("|") delimiting the values, as shown in the sketch after this list.
    • Strip HTML?
      When this option is checked, any HTML tags found in this field's data are removed.
    • Use for Delete?
      Used during Incremental Index operations only. Records matching this XPath pattern identify items for deletion. The Primary Key value for each such record is used to construct "delete" requests, as with the Deletes File Path (see the sketch after this list).
      Note : This feature is not enabled by default. Contact Technical Support to activate the feature for your use.
    • Action
      Lets you add rows to the map or remove rows from the map. The order of the rows is not important.
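    For illustration only, continuing the Adobe feed fragment shown above, suppose that neither the mp_mkt value nor the title value is unique on its own, and that records to be removed carry action="delete" . A sketch of two additional map rows might then look like the following (the field name mkt-title and the action="delete" convention are assumptions, not taken from the example document):
      /record/metadata/meta[@name='mp_mkt']/@content|/record/metadata/meta[@name='title']/@content -> mkt-title   (Primary Key? checked)
      /record[@action='delete']   (Use for Delete? checked)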
    Data source type: XML
    Enabled
    Turns the configuration "on" so that it is crawled and indexed, or "off" to prevent crawling and indexing.
    Note : Disabled Index Connector configurations are ignored if they are found in an entrypoint list.
    Host Address
    Specifies the URL address of the host system where the data source file is found.
    File Path
    Specifies the path to the master XML document that contains links ( <a> ) to individual XML documents.
    The path is relative to the root of the host address.
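    For example, a minimal master document might look like the following, where each link points to one individual XML document that is downloaded and parsed in turn (the host name and file names are hypothetical):
      <html>
           <body>
                <a href="https://www.example.com/feeds/product-1001.xml">product-1001</a>
                <a href="https://www.example.com/feeds/product-1002.xml">product-1002</a>
           </body>
      </html>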
    Protocol
    Specifies the protocol that is used to access the file. You can choose from the following:
    • HTTP
      If necessary, you may enter proper authentication credentials to access the HTTP server.
    • HTTPS
      If necessary, you may enter proper authentication credentials to access the HTTPS server.
    • FTP
      You must enter proper authentication credentials to access the FTP server.
    • SFTP
      You must enter proper authentication credentials to access the SFTP server.
    • File
    Note : The Protocol setting is only used when there is information specified in the Host Address and/or File Path fields. Individual XML documents are downloaded using either HTTP or HTTPS, according to their URL specifications.
    Itemtag
    Identifies the XML element that defines a "row" in the data source file that you specified.
    Map
    Lets you specify XML-element-to-metadata mappings, using XPath expressions.
    • Tag
      Specifies an XPath representation of the parsed XML data. Using the example Adobe XML document shown above under the Itemtag option, you can map it using the following syntax:
      /record/@displayurl -> page-url /record/metadata/meta[@name='title']/@content -> title /record/metadata/meta[@name='description']/@content -> desc /record/metadata/meta[@name='description']/@content -> body
      The above syntax translates as the following:
      • /record/@displayurl -> page-url
        The displayurl attribute of the record element maps to the metadata field page-url .
      • /record/metadata/meta[@name='title']/@content -> title
        The content attribute of any meta element whose name attribute is title , contained inside a metadata element within a record element, maps to the metadata field title .
      • /record/metadata/meta[@name='description']/@content -> desc
        The content attribute of any meta element whose name attribute is description , contained inside a metadata element within a record element, maps to the metadata field desc .
      • /record/metadata/meta[@name='description']/@content -> body
        The content attribute of any meta element whose name attribute is description , contained inside a metadata element within a record element, maps to the metadata field body .
      XPath is a relatively complicated notation; consult an XPath reference or tutorial for more information.
    • Field
      Defines the name attribute value that is used for each generated <meta> tag.
    • Metadata?
      Causes Field to become a drop-down list from which you can select defined metadata fields for the current account.
      The Field value can be an undefined metadata field, if desired. An undefined metadata field is sometimes useful to create content used by Filtering Script .
      When Index Connector processes XML documents with multiple hits on any map field, the multiple values are concatenated into a single value in the resulting cached document. By default, these values are combined using a comma delimiter. However, if the corresponding Field value is a defined metadata field that has the Allow Lists attribute set, the field's List Delimiters value (the first delimiter defined) is used in the concatenation instead.
    • Primary Key?
      Only one map definition is identified as the primary key. This field becomes the unique reference that is presented when this document is added to the index. This value is used in the document’s URL in the Index.
      The Primary Key values must be unique across all of the documents represented by the Index Connector configuration; any duplicates encountered are ignored. If your source documents do not contain a single unique value to use as the Primary Key , but two or more fields taken together form a unique identifier, you can define the Primary Key by combining multiple Tag definitions with a vertical bar ("|") delimiting the values.
    • Strip HTML?
      When this option is checked, any HTML tags found in this field's data are removed.
    • Action
      Lets you add rows to the map or remove rows from the map. The order of the rows is not important.
  4. (Optional) Click Setup Maps to download a sample of your data source. The data is examined for indexing suitability. This feature is available only for the Text and Feed types.
  5. (Optional) Click Preview to test how the configuration actually behaves. This feature is available only for the Text and Feed types.
  6. Click Add to add the configuration to the Index Connector Definitions page and to the Index Connector Configurations drop-down list on the URL Entrypoints page.
  7. On the Index Connector Definitions page, click rebuild your staged site index .
  8. (Optional) On the Index Connector Definitions page, do any of the following:

Editing an Index Connector definition

You can edit an existing Index Connector that you have defined.
Not all options are available for you to change, such as the Index Connector Name or the Type that is selected in the Type drop-down list.
To edit an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Index Connector page, under the Actions column heading, click Edit for an Index Connector definition name whose settings you want to change.
  3. On the Index Connector Edit page, set the options you want.
    See the table of options under Adding an Index Connector definition .
  4. Click Save Changes .
  5. (Optional) On the Index Connector Definitions page, click rebuild your staged site index .
  6. (Optional) On the Index Connector Definitions page, do any of the following:

Viewing the settings of an Index Connector definition

You can review the configuration settings of an existing index connector definition.
After an Index Connector definition is added to the Index Connector Definitions page, you cannot change its Type setting. Instead, you must delete the definition and then add a new one.
To view the settings of an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Index Connector page, under the Actions column heading, click Edit for an Index Connector definition name whose settings you want to review or edit.

Copying an Index Connector definition

You can copy an existing Index Connector definition to use as the basis for a new Index Connector that you want to create.
When you copy an Index Connector definition, the copied definition is disabled by default. To enable or "turn on" the definition, edit it on the Index Connector Edit page and select Enable .
To copy an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Index Connector page, under the Actions column heading, click Copy for an Index Connector definition name whose settings you want to duplicate.
  3. On the Index Connector Copy page, enter the new name of the definition.
  4. Click Copy .
  5. (Optional) On the Index Connector Definitions page, do any of the following:

Renaming an Index Connector definition

You can change the name of an existing Index Connector definition.
After you rename the definition, check Settings > Crawling > URL Entrypoints to ensure that the new definition name is reflected in the drop-down list on the URL Entrypoints page.
To rename an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Index Connector page, under the Actions column heading, click Rename for the Index Connector definition name that you want to change.
  3. On the Index Connector Rename page, enter the new name of the definition in the Name field.
  4. Click Rename .
  5. Click Settings > Crawling > URL Entrypoints . If the previous Index Connector's name is present in the list, remove it, and add the newly renamed entry.
    See Adding multiple URL entry points that you want indexed .
  6. (Optional) On the Index Connector Definitions page, do any of the following:

Deleting an Index Connector definition

You can delete an existing Index Connector definition that you no longer need or use.
To delete an Index Connector definition
  1. On the product menu, click Settings > Crawling > Index Connector .
  2. On the Index Connector Definitions page, under the Actions column heading, click Delete for the Index Connector definition name you want to remove.
  3. On the Index Connector Delete page, click Delete .