Frequently asked questions

When is a SWF file crawled and indexed?

A SWF file is crawled and indexed if it is contained in an embed or object tag on an HTML page, as in the following example:
<embed src="Flash-file-URL">  
 
<object>  
<param name=movie value="Flash-file-URL">  
</object> 

A SWF file is also recognized if you list the file URL as an entrypoint.
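As a rough illustration of the rule above, the following Python sketch collects the SWF URLs that an HTML page references through embed and object/param tags. The class and function names are hypothetical, and the sketch does not check that a param tag is nested inside an object tag. It is not the search robot's actual implementation.

```python
from html.parser import HTMLParser


class SwfReferenceFinder(HTMLParser):
    """Collect SWF URLs referenced from <embed> and <param name=movie> tags."""

    def __init__(self):
        super().__init__()
        self.swf_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "embed" and attributes.get("src"):
            # <embed src="Flash-file-URL">
            self.swf_urls.append(attributes["src"])
        elif tag == "param" and (attributes.get("name") or "").lower() == "movie":
            # <param name=movie value="Flash-file-URL">
            if attributes.get("value"):
                self.swf_urls.append(attributes["value"])


def find_swf_references(html):
    """Return the SWF URLs found in the given HTML text, in document order."""
    finder = SwfReferenceFinder()
    finder.feed(html)
    return finder.swf_urls


page = '<embed src="/a.swf"><object><param name=movie value="/b.swf"></object>'
print(find_swf_references(page))
```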

What do I have to do to index a SWF file?

To crawl and index SWF files, select the content type Adobe Flash Movies ( Settings > Crawling > Content Types ).
As long as your Flash file is referenced from an <embed> tag or an <object> tag in an HTML document, the text is indexed and all of the URLs listed in the file are crawled.
If your file is not referenced from either an <embed> tag or an <object> tag, you can list the SWF file in an <a href=...> tag in an HTML document or as a URL entrypoint.

How are SWF files recognized?

SWF files are identified by the following MIME type:
application/x-shockwave-flash
SWF files are also recognized with the application/octet-stream or text/plain MIME types, provided that the file extension is .swf.
A misconfigured server might use a different MIME type for SWF files. Be sure that you check your server configuration if you are having problems crawling and indexing SWF files.
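The recognition rule described above can be sketched in Python as follows. The function name is hypothetical, and treating the .swf extension as case-insensitive is an assumption that the text above does not confirm.

```python
from urllib.parse import urlparse

FLASH_MIME = "application/x-shockwave-flash"
FALLBACK_MIMES = {"application/octet-stream", "text/plain"}


def is_recognized_swf(url, content_type):
    """Return True if a response would be treated as a SWF file under the stated rule."""
    # Drop parameters such as "; charset=..." from the Content-Type value.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == FLASH_MIME:
        return True
    # Assumption: the extension check ignores case.
    has_swf_extension = urlparse(url).path.lower().endswith(".swf")
    return mime in FALLBACK_MIMES and has_swf_extension


print(is_recognized_swf("https://www.mydomain.com/intro.swf", "application/octet-stream"))
print(is_recognized_swf("https://www.mydomain.com/intro.swf", "text/html"))
```

If a server returns a MIME type such as text/html for a .swf URL, a check like this one fails, which is the kind of misconfiguration to look for.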

How are SWF files indexed?

Text contained in a SWF file is indexed as if it were <body> text in the enclosing HTML page. If a search result finds text contained in an embedded SWF file, the result actually links to the enclosing HTML page and not the SWF file. This way, the SWF file is displayed in the correct context.
If a SWF file contains a URL as a "Load Movie" action, the text in the referenced SWF file is indexed as a part of the enclosing HTML page.
If a SWF file contains a URL as a "Get URL" action, the URL is crawled and indexed later, just as an HTML <a href=...> reference is crawled and indexed later.
If a SWF file is listed as a URL entrypoint, the SWF file text is indexed as a single page. A search result that finds text from an entrypoint SWF links directly to the movie, not to an enclosing HTML page.

Does a SWF file count as a page?

No. A SWF file is considered to be a part of its enclosing HTML page. All "Load Movie" URLs contained in SWF files are also considered part of the enclosing HTML page. Therefore, SWF files that are referenced from an HTML page do not count as a "page" for the account's page total.
If a SWF file is listed as a URL entrypoint, then that SWF file and all "Load Movie" URLs listed in that SWF file are counted as one "page" for the account's page total.

How do I prevent the indexing of individual SWF files?

To prevent the indexing of a SWF file, you can add either a robots meta tag ( <meta name="ROBOTS" content="NOINDEX"> ) or a <noindex> tag to the enclosing HTML document, that is, the document that contains the <embed> or <object> tag.
You can also use the robots meta tag ( <meta name="ROBOTS" content="NOFOLLOW"> ) to prevent following URLs that are contained in the SWF file. If the enclosing HTML document has following disabled, the URLs listed as "Get URL" actions in the SWF file are not followed.

How do I prevent SWF files from being indexed on my website?

To disable SWF indexing, deselect the content type Adobe Flash Movies ( Settings > Crawling > Content Types ).
You can also choose to use URL Masks to disable the indexing of SWF files.
To disable SWF indexing, enter one of the following URL masks:
  • exclude *.swf (if you are not using regular expressions)
  • exclude regexp ^.*\.swf$ (if you are using regular expressions)
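To check which URLs the two masks above would match, you can try the patterns locally. The following Python sketch uses the standard fnmatch and re modules as stand-ins. Whether the product matches masks against the full URL, including any query string, is an assumption here, and the function names are hypothetical.

```python
import fnmatch
import re


def excluded_by_simple_mask(url, mask="*.swf"):
    """Wildcard-style match, similar in spirit to 'exclude *.swf'."""
    return fnmatch.fnmatchcase(url, mask)


def excluded_by_regexp_mask(url, pattern=r"^.*\.swf$"):
    """Regular-expression match, as in 'exclude regexp ^.*\\.swf$'."""
    return re.search(pattern, url) is not None


for url in ("https://www.mydomain.com/flash/intro.swf",
            "https://www.mydomain.com/flash/intro.html"):
    print(url, excluded_by_simple_mask(url), excluded_by_regexp_mask(url))
```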

How come I cannot search the Chinese, Japanese, or Korean SWF files on my website?

Site search/merchandising obtains UTF-8 from SWF files created with Adobe Flash. The UTF-8 contains no indication of language. If you selected the content type Adobe Flash Movies ( Settings > Crawling > Content Types ), you must use metadata injections to specify the language that is used by the SWF file.
Older SWF files do not specify a character set either. If you selected the content type Adobe Flash Movies ( Settings > Crawling > Content Types ), you must use metadata injections to specify the character set that is used in the SWF file.

General search

A frequently asked questions page that discusses how site search/merchandising helps customers who visit your website find what they are looking for.
The following are common questions regarding general searching and search features:

Do I have to install any software to use site search/merchandising?

No. This is the primary advantage of site search/merchandising. The engine is a professional application hosted and maintained entirely on our high-performance servers. This makes the software easier to use than other search solutions. The only thing you have to do is add a small amount of HTML code to your pages so that customers to your website can enter searches. Site search/merchandising takes care of all the rest.

What happens when my site exceeds the page limit?

We keep serving your searches so your visitors can search your website without interruption. To see if your website exceeds the page limit, review your Full Index status or Live Log.

How do I change the email address where the weekly reports are sent?

Weekly reports are sent to the owner of each active account. You can change the email address by clicking Settings > My Profile > Personal Information . If you have more than one active search account, then all newsletters are sent to the new address.

How secure is my customer information on site search/merchandising?

Site search/merchandising is secure, fast, stable, and easy to use. You are not forced to use cookies (although you can if you want) to use our products, and sensitive information, such as passwords, is never put on any URL link that can be later retrieved from your browser.

What about the privacy of my customer information?

Adobe is committed to honoring the privacy of its customers and visitors. See the Adobe Privacy Center.

Can I show my own banner ads on the search results pages?

Yes. You control the appearance and the content of the search results. Within the search results template for your website, you can create links to your own banner exchange network such as LinkExchange or SmartClicks. Any hits made by your visitors are properly credited to your banner exchange account.

Can I customize the search results for my site?

Yes. This is an exclusive feature of site search/merchandising. With our advanced template technology and a little knowledge of HTML, you can control exactly how the search results appear.
The transition between your own servers and site search/merchandising servers is completely seamless and invisible to your customers. If you do not know HTML or you do not have time to create a custom template, you can choose from an assortment of attractive, ready-to-use templates that Adobe's in-house team of professional web developers create.

Can I see what customers are searching for on my site?

Yes. We keep search statistics for searches made by visitors on your website over the last two months. You can review these statistics at any time under Reports on the product menu. Search reports give you vital information regarding exactly what visitors are looking for on your website. You can use this information to improve the design or to tune the site search/merchandising engine to better serve your visitors.

How can I control which content types (PDF, text, Flash, MP3 and Microsoft Office) are indexed and searched?

You can easily configure accounts to enable or disable the indexing and searching of text found within PDF documents, plain text documents, Flash movies, MP3 files, or Microsoft Office documents.
These settings are controlled on the Staged Content Types page.

Are dynamically generated web pages by way of ASP, JSP, PHP, CFM, or Perl-based content supported?

Static or dynamically generated HTML web pages are indexed, including pages built from databases or any other back-end process. Because the HTML code that a browser sees is indexed, you can use site search/merchandising on websites as long as these back-end architectures result in HTML pages.
The search robot crawls your website by starting with the first page at the website address that is specified in Account Settings, and follows links from page to page.
When the search robot crawls and indexes all the pages of your website, then you can use the search engine to search your site. In other words, if dynamically generated documents are woven into your website with links from other pages, the search robot can still crawl and index the dynamic content.
After your website content is crawled and indexed, customers to your website can search for information within the indexed content.

How can I use synonyms to improve the search results for my site?

You can use synonyms when you want visitors to find pages that are related to their search query.
For example, suppose that you have a page that contains a price list of products for sale on your site. However, after examining the search reports that are provided by site search/merchandising, you find that customers are looking for the word "cost," "expense," "charge," or "fee" in their searches. These words do not display your price list page in the search results. With the Add Synonyms feature in Dictionaries, you can specify that these words are all synonyms, and your customers can find your price list, regardless of which search term they use.

Do I have control over the ordering of search results?

Yes. Using the advanced relevancy interface, you can control which pages are returned for a specific search query. This feature is useful if you want to be sure that customers see a specific page when they query for certain words.

Can I change the language of the search results page?

Yes. The site search/merchandising template is flexible when it comes to letting you construct a results page that uses the language of your choice and matches the appearance of your website.
The template consists of a combination of text, standard HTML tags, and special tags that are defined to display the search results. When a customer performs a search, the search robot reads the template, outputs the text using standard HTML tags, and inserts the results links based on the special template tags.
If you want to change the results language, you can edit the English text that appears on the template.

Can I have more than one site on my Adobe Customer Login?

Yes. With a single Adobe Customer Login, you can manage a different search engine for many different websites. Select and manage accounts under "Accounts."

Can I search more than one domain?

Yes. You can configure access to more than one domain by using URL Entrypoints. Provide URL entrypoints for additional domains that you own. Remember that you must have permission to index domains that you do not own.

Can I subdivide my site into separate sections so that customers can search any of these areas individually or the whole site?

Yes. A "Collections" feature is included that lets customers search specific areas of your website to quickly find what they are looking for.
For example, customers can search a collection of URLs related to product sales information or a collection of URLs related to support services. You can set up collections so that your customers see a drop-down list of collections or a group of check boxes.

How do I exclude parts of my website from being searched?

Specify URL masks to determine which website pages you want to include in or exclude from indexing. URL masks determine whether website pages appear in your search results.
To prevent parts of individual web pages from being searched, you can exclude portions of a page from indexing. Surround the text with <noindex> and </noindex> tags. This method is useful if you want to exclude navigation text from searches.
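The effect of the noindex tags can be pictured with a short Python sketch that removes the marked regions from a page before its text is considered for indexing. This is an illustration only: the function name is hypothetical, and the search robot's real parsing may differ, for example in how it handles nested or unclosed tags.

```python
import re

# Non-greedy match so that each <noindex>...</noindex> pair is removed separately.
NOINDEX_BLOCK = re.compile(r"<noindex>.*?</noindex>", re.IGNORECASE | re.DOTALL)


def strip_noindex(html):
    """Return the HTML with every <noindex>...</noindex> region removed."""
    return NOINDEX_BLOCK.sub("", html)


page = "<body><noindex>Home | Products | Support</noindex><p>Price list</p></body>"
print(strip_noindex(page))
```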

What character sets are supported?

Web pages typically specify the character set with a meta tag similar to the following:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
The site search/merchandising engine properly indexes web pages using all of the common character sets in use on the Internet today. Some of the supported character sets include the following:
Arabic (ISO-8859-6)
Chinese (Traditional; Big5)
Japanese (Shift_JIS)
Arabic (Windows-1256)
Chinese (Traditional; EUC-TW)
Russian (KOI8-R)
Baltic (ISO-8859-4)
Cyrillic (ISO-8859-5)
Southern European (ISO-8859-3)
Baltic (Windows-1257)
Cyrillic (Windows-1251)
Turkish (ISO-8859-9)
Central European (ISO-8859-2)
Greek (ISO-8859-7)
Turkish (Windows-1254)
Central European (Windows-1250)
Greek (Windows-1253)
Unicode (UTF-8)
Chinese (ISO-2022-CN)
Hebrew (ISO-8859-8)
US-ASCII (us-ascii)
Chinese (ISO-2022-CN-EXT)
Hebrew (Windows-1255)
Western European (ISO-8859-1)
Chinese (Simplified; EUC-CN)
Japanese (EUC-JP)
Western European (ISO-8859-15)
Chinese (Simplified; GB2312)
Japanese (ISO-2022-JP)
Western European (Windows-1252)
Chinese (Simplified; GBK)
Japanese (ISO-2022-JP-1)
Western European (x-mac-roman)
Chinese (Simplified; HZ-GB-2312)
Japanese (ISO-2022-JP-2)
Contact Technical Support to ask about character sets that are not listed above.

What if I change or update my website?

After you have changed the content of your website, you can perform either a full index or an incremental index. Site search/merchandising downloads and indexes any changed website content. After indexing is complete, your customers can search the new content. You can also schedule an automatic indexing of your site at a certain time and on a specific day.

Can my site be automatically indexed?

Yes. You can schedule an automatic index of your site every day.
Besides daily automatic indexing, you can choose to have frequently changed portions of your site incrementally indexed. On days that you have an automatic index scheduled, you can control the time of day the index takes place. Also, you can always manually initiate a site index whenever you want.

I use passwords on my website. Can I still use site search/merchandising?

If you use HTTP Basic Authentication to password-protect certain portions of your website, you can specify realms and passwords that site search/merchandising can use to index your site.

Do you support the crawling and indexing of https or secure server content?

Yes. You can crawl and index content on secure servers (https).

Does site search/merchandising honor the robots.txt file on my website?

Yes. The site search/merchandising robot complies with the Robots Exclusion Protocol. The search robot examines the robots.txt file if it is present on your website. If your robots.txt file excludes all robots from crawling your site, the site search/merchandising robot is also excluded. To allow only the site search/merchandising robot to crawl your site, set the contents of your robots.txt file to the following:
User-agent: Atomz/1.0 
Disallow:

User-agent: * 
Disallow: /
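To see how a robots.txt file like the one above is read, the following Python sketch implements a deliberately simplified evaluator: a group that names the robot exactly takes precedence over the "*" group, and an empty Disallow line allows everything. The function names are hypothetical, and real robots may apply more elaborate matching rules.

```python
def parse_robots(text):
    """Parse robots.txt text into a list of (agents, disallowed_prefixes) groups."""
    groups = []
    agents, rules, seen_rule = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a new group starts after the previous group's rules
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field == "disallow":
            seen_rule = True
            if value:  # an empty Disallow excludes nothing
                rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def is_allowed(groups, user_agent, path):
    """Return True if user_agent may fetch path under the parsed groups."""
    name = user_agent.lower()
    specific = [rules for agents, rules in groups if name in agents]
    fallback = [rules for agents, rules in groups if "*" in agents]
    chosen = specific or fallback
    return not any(path.startswith(prefix) for rules in chosen for prefix in rules)


ROBOTS_TXT = "User-agent: Atomz/1.0\nDisallow:\n\nUser-agent: *\nDisallow: /\n"
parsed = parse_robots(ROBOTS_TXT)
print(is_allowed(parsed, "Atomz/1.0", "/index.html"))
print(is_allowed(parsed, "OtherBot/2.0", "/index.html"))
```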

You can learn more about web robots and the Robots Exclusion Protocol at the following:

Certain portions of my website must be updated frequently so that my customers get the most accurate search results. Does incremental indexing help with this issue?

Yes. This scenario is what the incremental indexing feature of site search/merchandising was built to handle. Incremental indexing's primary benefit is that it allows companies to frequently index dynamically changing portions of their website. Such functionality ensures that you are displaying search results with "up to the minute" accuracy.

Are dynamically generated web pages supported from a back-end database, such as product catalogs or inventory management systems?

Static or dynamically generated HTML web pages, including pages built from databases, or any other back-end process are indexed. Because the HTML code, as viewed by a browser, is indexed, you can use site search/merchandising on websites as long as the back-end database information results in HTML pages.
The search robot crawls your website by starting with the first page at the website address that is specified in Account Settings, and follows links from page to page.
When the search robot crawls and indexes all the pages of your website, then you can use the search engine to search your site. In other words, if dynamically generated documents are woven into your website with links from other pages, the search robot can still crawl and index the dynamic database content.
After your website content is crawled and indexed, customers to your website can search for information within the indexed content.
You can easily enable full-content searching or a more narrow topic-based search restricted to information in the title, or the meta-description, or the meta-keywords document tags, or all three. Using metadata definitions, you can also create custom display fields, such as a product image, in the actual search results.

Can I use scripts or programs to initiate an incremental index of my site?

Yes. You can use scripts or programs to initiate an incremental index of your website, as well as to ping the servers to index the site whenever content is changed or updated.

Feature implementations

A frequently asked questions page that discusses various feature implementations in Search&Promote.
The following are common questions regarding feature implementations in Search&Promote on a website:

Why are my business rules not running?

Configure business rules to control when banners appear, or to help decide what results appear and in what order. You can also configure the position of an item in your facet, and what template is used for a given search. Reorder business rules to change the order in which they run on presentation templates. Business rules run in the order that they were defined: the higher a rule's order number, the later it runs in the process, and a later rule overrides earlier rules. To reorder rules, enter a new number in the Order column of the table on the Business Rules page.
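The ordering behavior can be pictured with a small Python sketch. The rule structure shown here (an "order" number and a dictionary of "changes") is purely hypothetical and is not how the product stores business rules. The sketch only shows why a rule with a higher order number wins when two rules set the same thing.

```python
def apply_rules(rules, search_state):
    """Apply rules in ascending order number; later rules override earlier ones."""
    state = dict(search_state)
    for rule in sorted(rules, key=lambda r: r["order"]):
        state.update(rule["changes"])
    return state


rules = [
    {"order": 2, "changes": {"banner": "sale"}},
    {"order": 1, "changes": {"banner": "default", "template": "grid"}},
]
print(apply_rules(rules, {"template": "list"}))
```

In this sketch both rules set the banner, and the rule with order number 2 runs last, so its banner is the one that remains.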

Why do I have problems scheduling indexing, errors starting indexing, and issues starting staged indexing?

When you generate an index, whether it is full or incremental, index crawl status information is displayed in real-time. For example, you can view the start time, elapsed time, and any errors that have occurred during the indexing process. Information about the status of your last index is also displayed. Use this information to troubleshoot any indexing errors you encounter.

My index size limit exceeds my permitted boundary. Why is this happening and how do I fix it?

A website tends to grow, and over time Search&Promote "discovers" more documents and web pages that were added. Eventually, your account may exceed your indexing size limit. In such cases, you can consider using URL masks. This feature hides documents and web pages that you do not want or do not need to have indexed from index crawling, thereby reducing your index size. Another option is to contact Technical Support to have the indexing size limit for your account raised.
If you are unsure what to do, you should contact Technical Support. There may be many other variables affecting your index size that, if adjusted, may also affect the billing of your account.

What controls the character set encoding of the search query?

The "Web Forms" section of your Search account contains sample search forms that you use to add search functionality to your website. If you look at the search form code, you can find a line similar to the following:
<input type=hidden name="sp_f" value="iso-8859-1">
This line of code tells the search engine that the incoming query is encoded in iso-8859-1, a common encoding for Western European languages. You can change this setting by going to the product menu and clicking Settings > My Profile > Personal Information . On the Personal Information page, in the Character Encoding drop-down list, select a new encoding.
You can also manually change the encoding value on your web pages by editing the sp_f line of the search form. Remember that the sp_f value of the search form must match the character set encoding of the page in which it appears.
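The reason the two values must match becomes clearer when you look at how a browser encodes a query. The following Python sketch builds the query string that a form submission would produce for the same search term under two encodings. The sp_f parameter comes from the form line above, while the sp_q parameter name for the query text is an assumption made for this illustration.

```python
from urllib.parse import urlencode


def build_query_string(query, encoding):
    """Percent-encode a search query the way a page in the given encoding would."""
    # sp_q is a hypothetical name for the query field; sp_f declares the encoding.
    return urlencode({"sp_q": query, "sp_f": encoding}, encoding=encoding)


print(build_query_string("caf\u00e9", "iso-8859-1"))
print(build_query_string("caf\u00e9", "utf-8"))
```

The accented character is sent as one byte (%E9) in iso-8859-1 but as two bytes (%C3%A9) in UTF-8, so a search engine that is told the wrong encoding decodes a different word.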

Are only pages searched whose encoding matches the encoding of the search query?

By default, no. As long as your website pages correctly identify their character set encoding, the necessary conversions are made between the encoding of the search query and that of the pages, even when pages use multiple encodings.

What encoding is used for the search results page?

The character set encoding of your account determines the default encoding for your results template.
You can learn more about specifying a character set in an HTML template.

Can I use site search/merchandising on Unicode, UTF-8, encoded pages?

Yes. However, Unicode character sets, such as UTF-8, do not provide enough information to determine the language that the pages are written in. To correctly search these pages, it is necessary to specify the language. To determine the document language, information is processed in the following order:
  • Content-Language HTTP header delivered for the document by your server.
  • META elements (for example, META HTTP-EQUIV="Content-Language" Content="ja_JP" ) in the <HEAD> section of the document.
  • LANG attribute of the <HTML> tag (for example, <HTML LANG="ja_JP"> ).
If your server is not configured to deliver the Content-Language HTTP header, and your documents contain neither the language META element, nor the language attribute for the <HTML> tag, you can use metadata injections to specify the appropriate language.
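The lookup order listed above can be sketched in Python as follows. The function name is hypothetical, the regular expressions assume the attribute order shown in the examples above, and the real robot's parsing is likely to be more tolerant.

```python
import re

META_LANGUAGE = re.compile(
    r'<meta\s+http-equiv=["\']?content-language["\']?\s+content=["\']?([^"\'>\s]+)',
    re.IGNORECASE,
)
HTML_LANG = re.compile(r'<html[^>]*\slang=["\']?([^"\'>\s]+)', re.IGNORECASE)


def detect_language(headers, html):
    """Return the document language using the stated order, or None if unspecified."""
    # 1. Content-Language HTTP header delivered by the server.
    header_value = headers.get("Content-Language")
    if header_value:
        return header_value.strip()
    # 2. META element in the <HEAD> section.
    meta = META_LANGUAGE.search(html)
    if meta:
        return meta.group(1)
    # 3. LANG attribute of the <HTML> tag.
    lang = HTML_LANG.search(html)
    if lang:
        return lang.group(1)
    # Nothing found: this is the case where metadata injections are needed.
    return None


doc = '<HTML LANG="ja_JP"><HEAD><META HTTP-EQUIV="Content-Language" Content="zh_CN"></HEAD></HTML>'
print(detect_language({}, doc))
```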

How come I cannot search the Chinese, Japanese, or Korean PDF files on my website?

Site search/merchandising obtains UTF-8 from Adobe PDF files with no indication of language. If you selected PDF Documents ( Settings > Crawling > Content Types ), you must use metadata injections to specify the language that is used in the PDF file.

How come I cannot search the Chinese, Japanese, or Korean SWF files on my website?

Site search/merchandising obtains UTF-8 from Adobe Flash movie files that were created with Adobe Flash with no indication of language. If you selected the content type Adobe Flash Movies ( Settings > Crawling > Content Types ), you must use metadata injections to specify the language that is used in the SWF file.
For Flash version 4 or older versions of SWF files, the character set of the characters in the file is not specified. If you selected the content type Adobe Flash Movies ( Settings > Crawling > Content Types ), you must use metadata injections to specify the character set that is used in the SWF file.

How come I cannot search the Chinese, Japanese, or Korean Microsoft Office files on my website?

Site search/merchandising obtains UTF-8 from Microsoft Office files (Microsoft Word, Microsoft Excel, and Microsoft PowerPoint) with no indication of language. If you selected the content type Microsoft Office Files ( Settings > Crawling > Content Types ), you must use metadata injections to specify the language used in the Microsoft Office files.

How come I cannot search the Chinese, Japanese, or Korean MP3 files on my website?

If you select the content type Text in MP3 Music Files ( Settings > Crawling > Content Types ), you must use metadata injections to specify the character set that is used to encode the MP3 files.

Do I need to do anything special to get the .txt files on my website to index correctly?

If you selected the content type Text Documents ( Settings > Crawling > Content Types ), you must use metadata injections to specify the character set used to encode the .txt files.

How come Chinese, Japanese, or Korean fonts do not appear correctly in search results under Netscape 4.7 and earlier?

If your account uses the default template, one of the ready-to-use templates, or a template based on any of those templates, it may contain font tags that specify Arial or Helvetica as font faces. For example, <font face="arial, helvetica" size="+1"> . Netscape 4.7 and earlier versions do not display Chinese, Japanese, or Korean characters when the Arial or Helvetica font face is used. Remove the face attribute or replace the font face with one that is more appropriate for Chinese, Japanese, or Korean.

Have you examined your index log?

The index log contains detailed information that the site search/merchandising robot collects as it indexes your website. The log includes a list of links crawled and errors encountered. Examining the index log is the best place to start to determine why all the pages on your website are not indexed.

Do you have typing mistakes in your URL?

Typing lengthy URLs into HTML forms can introduce one or more typographical errors. Remember that URLs should not contain any spaces. Also, be aware that some web servers handle URLs in a case-sensitive manner.
On the product menu, click Settings > Crawling > URL Entrypoints . On the Staged URL Entrypoints page, verify the following:
  • You do not have any typographical errors in your URLs.
  • The characters in the URLs are all using the correct casing.
  • There are no space characters in the URLs.
To test your URL entrypoints, copy and paste a URL into a web browser to see if your website appears. If it does not appear, check again to ensure that you have not made any mistakes in your URL path.
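Some of these checks can be automated before you paste a URL into the entrypoints page. The following Python sketch flags space characters and a missing scheme or host name. The function name is hypothetical, accepting http:// as well as https:// is an assumption, and casing problems cannot be detected this way because they depend on your web server.

```python
from urllib.parse import urlparse


def entrypoint_problems(url):
    """Return a list of problems found in a candidate URL entrypoint."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("contains a space character")
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("does not start with http:// or https://")
    if not parsed.netloc:
        problems.append("has no host name")
    return problems


print(entrypoint_problems("https://www.mydomain.com/mydir/mypag1.html"))
print(entrypoint_problems("https://www.mydomain.com/my dir/mypag1.html"))
```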

Does the entrypoint web page have links to other pages on your website?

The site search/merchandising robot crawls your website just like your customers do: by following links from page to page. Links must be present in the entrypoint web page before the search robot can find and index other pages on your site.

Are links to other pages on your website embedded in JavaScript?

You can use sophisticated navigation techniques on your website, such as roll-over actions and menus, that use JavaScript to link to other pages. However, the site search/merchandising robot cannot follow links embedded in JavaScript.
One solution you can use to overcome this issue is to place hidden links to other pages in the HTML that contains the JavaScript. Although customers to your website do not see these links, the search robot still finds and crawls them. You can place hidden tags at the bottom of the page just before the </body> tag. They might look like the following:
<a href="/mydir/mypag1.html"></a> 
<a href="/mydir/mypag2.html"></a>

Another solution is to list the URLs of the additional pages on your website as entrypoints to crawl and index. Begin the URLs with https:// as shown in the following:
https://www.mydomain.com/mydir/mypag1.html 
https://www.mydomain.com/mydir/mypag2.html
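To see why the hidden links help, consider a crawler that only reads href attributes of anchor tags and never executes scripts. The following Python sketch is such a link collector. It is an illustration with hypothetical names, not the search robot's actual code, and the mypag3.html URL exists only in this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags; script contents are never executed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawlable_links(html):
    """Return the links that a non-JavaScript crawler could discover."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


page = """<body>
<script>function go() { window.location = "/mydir/mypag3.html"; }</script>
<a href="/mydir/mypag1.html"></a>
<a href="/mydir/mypag2.html"></a>
</body>"""
print(crawlable_links(page))
```

The page reached only through the script is not discovered, while the two empty anchors are.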

Are the HTML tags on your web page in an invalid sequence?

The HTML specification requires that the <html> , <head> , and <body> tags follow a specific sequence in an HTML document. Tags in all your web pages must have the following sequence:
<html>
<head>
... head tags go here ...
</head>
<body>
... body tags go here ...
</body>
</html>

If the HTML tags are not in the proper order, the site search/merchandising robot is unable to properly parse and index your web page. The following is an example of tags that are not in the right sequence:
<body>
<head>
... head tags are here ...
</head>
... body tags are here ...
</body>

In such a case, place the <html> , <head> , and <body> tags into the proper sequence on your web page.
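A quick way to check a page for this problem is to record the order in which the structural tags occur. The following Python sketch does this with the standard html.parser module. The names are hypothetical, and the check is stricter than most browsers because it expects each of the six tags exactly once.

```python
from html.parser import HTMLParser

STRUCTURAL_TAGS = ("html", "head", "body")
EXPECTED_SEQUENCE = ["html", "head", "/head", "body", "/body", "/html"]


class StructureRecorder(HTMLParser):
    """Record the order of <html>, <head>, and <body> start and end tags."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self.sequence.append(tag)

    def handle_endtag(self, tag):
        if tag in STRUCTURAL_TAGS:
            self.sequence.append("/" + tag)


def has_valid_structure(html):
    """Return True if the structural tags appear in the required sequence."""
    recorder = StructureRecorder()
    recorder.feed(html)
    return recorder.sequence == EXPECTED_SEQUENCE


print(has_valid_structure("<html><head></head><body><p>text</p></body></html>"))
print(has_valid_structure("<body><head></head><p>text</p></body>"))
```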

Do you have improperly formed HTML comment tags in your web page?

Be sure that you carefully review and correct any invalid HTML comments in your web pages.
The HTML specification requires that an HTML comment begin with the characters <!-- and end with the characters --> . It is easy to overlook incorrectly formatted comments that cause the site search/merchandising robot to improperly parse the tags on your web page. An improperly formed comment can cause the site search/merchandising robot to miss other important tags that must be parsed. Be mindful of comments just before the <body> tag in your web page.
The following is an example of a properly formed comment:
<!-- This HTML comment is OK. -->
The following are examples of improperly formed comments:
<!- This HTML comment is improperly formed. -> 
<! This HTML comment is also improperly formed. >

Does your web page contain links to pages on another domain?

Often a website can consist of pages that actually exist on a web server with a different domain address. For example, if your main website address is the following:
https://www.mydomain.com/
Your website may also have pages on another domain such as the following:
https://www.otherdomain.com/
By default, the site search/merchandising robot does not follow links on a domain other than the main one. However, by setting additional entrypoints for your search account, you can easily index multiple domains.
On the product menu, click Settings > Crawling > URL Entrypoints . Add the "main website entrypoint" URL of your site. Then, add additional URL entrypoints to any other domains that contain site pages. For example, you would set your main URL entrypoint to:
https://www.mydomain.com/
and add the following additional site URL entrypoint:
https://www.otherdomain.com/

Are you using a virtual domain service for your URL?

You might be using a virtual domain service (sometimes called a "domain redirection service") to provide a better URL for customers to get to your website. For example, suppose the real address of your website is the following:
https://www.myispdomain.com/~myname/mywebpages/
However, you use a virtual domain service so customers can get to your site at the following addresses:
https://myname.adomain.com/
or
https://adomain.com/myname/
By default, the site search/merchandising robot does not follow links on a domain other than the main one. However, by setting additional entrypoints for your search account, you can easily index multiple domains.
On the product menu, click Settings > Crawling > URL Entrypoints . Add the "main website URL entrypoint" to the virtual domain name of your site. Then, add additional entrypoints to the domain where your website actually lives.
For example, you would set your main URL entrypoint to the following:
https://myname.adomain.com/
And add the following additional website URL entrypoint:
https://www.myispdomain.com/~myname/mywebpages/

Does your web page use a meta refresh tag?

Many websites have a front page that includes a meta refresh tag between the <head>...</head> tags similar to the following:
<meta http-equiv="Refresh" content="0;URL=https://www.adomain.com/apath/afile.html">
Under certain circumstances, the site search/merchandising robot is unable to follow the meta refresh URL to index the content of your website. This issue is easy to work around by setting additional entrypoints.
On the product menu, click Settings > Crawling > URL Entrypoints . Add another entrypoint to the URL of the meta refresh tag.
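If you have many pages to check, you can extract the target URL of a meta refresh tag with a short script and then add that URL as an entrypoint. The following Python sketch assumes the attribute order shown in the example tag above. The function name is hypothetical.

```python
import re

META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']?refresh["\']?\s+'
    r'content=["\']\s*\d+\s*;\s*url=([^"\']+)["\']',
    re.IGNORECASE,
)


def meta_refresh_target(html):
    """Return the URL of the first meta refresh tag, or None if there is none."""
    match = META_REFRESH.search(html)
    return match.group(1).strip() if match else None


front_page = ('<head><meta http-equiv="Refresh" '
              'content="0;URL=https://www.adomain.com/apath/afile.html"></head>')
print(meta_refresh_target(front_page))
```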

Does your web page use a meta robots tag?

Sometimes web pages use meta robots tags to control web robots that periodically attempt to crawl a website. Meta robots tags appear between the <head>...</head> tags of a web page and look similar to the following tag:
<meta name="robots" content="noindex, nofollow">
Because the site search/merchandising robot is itself a web robot, it follows the directions of the meta robots tag. By excluding other robots in this manner you also exclude the site search/merchandising robot.
You can learn more about web robots and the Robots Exclusion Protocol at the following:
Remove or modify the meta robots tag on the web pages that you want indexed on your website.

Does your website use a robots exclusion file?

Sometimes a website has a page called robots.txt that excludes all or certain robots from crawling it. To see if your website has a robots.txt file, look for it just under the top-level domain as shown in the following:
https://www.yourdomain.com/robots.txt
The contents of the robots.txt file look similar to the following text:
User-agent: * 
Disallow: /

Because the site search/merchandising robot is itself a web robot, it follows the directions in the robots.txt file, so a file like the one above also excludes the site search/merchandising robot. To work around this issue, edit the robots exclusion file (robots.txt) to permit the site search/merchandising robot to crawl and index your website as follows:
User-agent: Atomz/1.0 
Disallow: 
 
User-agent: * 
Disallow: /
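The difference between the two robots.txt files above can be checked with a small script. The following Python sketch is a deliberately minimal parser that handles only User-agent and Disallow lines. It assumes that a robot is matched on the literal token Atomz/1.0, as in the example above. The function names are hypothetical, and a real robot may match user agents more loosely.

```python
def parse_robots(text):
    """Split robots.txt text into (user_agents, disallowed_prefixes) records."""
    records, agents, rules, in_rules = [], [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new record
                records.append((agents, rules))
                agents, rules, in_rules = [], [], False
            agents.append(value.lower())
        elif field.lower() == "disallow":
            in_rules = True
            if value:  # an empty Disallow value excludes nothing
                rules.append(value)
    if agents:
        records.append((agents, rules))
    return records


def is_allowed(records, agent, path):
    """Prefer the record naming this agent exactly; otherwise fall back to '*'."""
    for wanted in (agent.lower(), "*"):
        for agents, rules in records:
            if wanted in agents:
                return not any(path.startswith(prefix) for prefix in rules)
    return True


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ATOMZ = "User-agent: Atomz/1.0\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

print(is_allowed(parse_robots(BLOCK_ALL), "Atomz/1.0", "/index.html"))
print(is_allowed(parse_robots(ALLOW_ATOMZ), "Atomz/1.0", "/index.html"))
print(is_allowed(parse_robots(ALLOW_ATOMZ), "OtherBot/2.0", "/index.html"))
```

Under these assumptions, the first file blocks every robot, while the edited file lets Atomz/1.0 through and still blocks the others.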

Microsoft Office

A frequently asked questions page that discusses support of the indexing and searching of Microsoft® Office files on a website.
The following are common questions regarding Microsoft Office files:

What gets indexed in a Microsoft Office file?

The full content of Microsoft Word, Microsoft Excel, and Microsoft PowerPoint files is indexed.
The following parts of a Microsoft Word file are indexed:
  • Title
  • Keywords
  • Subject (Description)
  • Text-based content
  • Hyperlinks to other documents
The following parts of a Microsoft Excel file are indexed:
  • Title
  • Keywords
  • Subject (Description)
  • Text in cells
  • Values from numeric formulas in cells
The following parts of a Microsoft PowerPoint file are indexed:
  • Title
  • Keywords
  • Subject (Description)
  • Text on each slide

What does not get indexed in a Microsoft Office file?

Graphics that are contained in Microsoft Office files, and any text that is part of a contained graphic, are not indexed. Custom property definitions are not indexed as metadata. Some text in special fields, such as headers and footers in a PowerPoint file, is also not indexed.

How are Microsoft Office files indexed differently from HTML pages?

The search robot indexes Microsoft Office files differently from HTML files because each HTML file is an individual page, whereas a single Microsoft Office file can represent hundreds of pages. For this reason, each page within a Microsoft Office file is counted as a separate page under your search account.

How do I prevent Microsoft Office files from being indexed on my website?

If you do not want the search robot to crawl and index Microsoft Office files, deselect the content type Microsoft Office Files ( Settings > Crawling > Content Types ).
You can also use URL Masks to disable the indexing of Microsoft Office files.
Enter the following URL masks:
If you are not using regular expressions
  • exclude *.doc
  • exclude *.xls
  • exclude *.ppt
If you are using regular expressions
  • exclude regexp ^.*\.doc$
  • exclude regexp ^.*\.xls$
  • exclude regexp ^.*\.ppt$
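To see which URLs the regular-expression masks above would catch, you can try them against sample URLs. The following Python sketch assumes that the product's regular expressions behave like Python's re module for these patterns (case-sensitive, with $ anchoring the end of the URL). The sample URLs and the is_excluded helper are hypothetical.

```python
import re

# The three regular-expression masks listed above.
OFFICE_MASKS = [r"^.*\.doc$", r"^.*\.xls$", r"^.*\.ppt$"]


def is_excluded(url, patterns=OFFICE_MASKS):
    """Return True if any exclude pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


sample_urls = [
    "https://www.adomain.com/reports/q1.doc",
    "https://www.adomain.com/reports/q1.docx",
    "https://www.adomain.com/reports/q1.doc?download=1",
    "https://www.adomain.com/index.html",
]
for url in sample_urls:
    print(url, is_excluded(url))
```

Under these assumptions, only the first URL is excluded. A URL that ends in .docx or carries a query string does not match these masks, so check your index log if such files still appear.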

When is an MP3 file crawled and indexed?

MP3 files are crawled and indexed in one of two ways. The most common way is from an anchor href tag in an HTML file:
<a href="MP3-file-URL"></a>
A second way is to enter the URL of the MP3 file as a URL entrypoint.

What do I have to do to crawl and index the MP3 files on my site?

To activate MP3 crawling and indexing for your account, on the product menu, click Settings > Crawling > Content Types . On the Staged Content Types page, select Text in MP3 Music Files .

How is an MP3 file recognized?

An MP3 file is recognized by its MIME type, which is "audio/mpeg".

What is indexed in an MP3 file?

MP3 files optionally store a small amount of textual information. That information can include the album name, artist name, song title, song genre, year of release, and a comment. This information is stored at the very end of the file in what is called the TAG. MP3 files that contain TAG information are indexed in the following way:
  • The song title is treated like the title of an HTML page.
  • The comment is treated like a description defined for an HTML page.
  • The genre is treated like a keyword defined for an HTML page.
  • The artist name, album name, and year of release are treated like the body of an HTML document.
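The mapping above can be sketched in Python. The sketch assumes that the TAG block follows the common ID3v1 layout: the last 128 bytes of the file, starting with the characters TAG, with the genre stored as a one-byte index. The function read_tag, the dictionary keys, and the sample bytes are illustrative assumptions, not the indexer's actual implementation.

```python
import struct

# Assumed ID3v1 layout of the last 128 bytes:
# "TAG", title, artist, album, year, comment, one-byte genre index.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def read_tag(data, encoding="latin-1"):
    """Return the TAG fields grouped as the list above describes, or None."""
    if len(data) < ID3V1.size:
        return None
    marker, title, artist, album, year, comment, genre = ID3V1.unpack(data[-ID3V1.size:])
    if marker != b"TAG":
        return None

    def text(raw):
        # Fields are fixed width, padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode(encoding, errors="replace").strip()

    return {
        "title": text(title),            # treated like the HTML page title
        "description": text(comment),    # treated like the page description
        "keyword_genre_index": genre,    # genre index, treated like a keyword
        "body": " ".join(part for part in map(text, (artist, album, year)) if part),
    }


def field(value, width):
    """Encode a value and pad it with NUL bytes to the fixed field width."""
    return value.encode("latin-1").ljust(width, b"\x00")


# Hypothetical file contents: some audio bytes followed by a TAG block.
sample = (b"\xff\xfb" + b"\x00" * 64 + b"TAG" + field("Song", 30) + field("Artist", 30)
          + field("Album", 30) + b"1999" + field("A comment", 30) + bytes([17]))
tag = read_tag(sample)
print(tag)
```

The encoding parameter matters for files whose text is not Latin-1. Decoding with the wrong character set produces text that cannot be searched as expected.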

Does an MP3 file count as a page?

Yes, each MP3 file that is crawled and indexed on your website is counted as one page.

How do I prevent the indexing of individual MP3 files?

Surround the anchor tags that link to the MP3 files with <nofollow> and </nofollow> tags. The search robot does not follow links between those tags.
Another method is to add the URLs of the MP3 files as exclude masks.

How do I prevent MP3 files from being indexed?

The easiest way to control MP3 indexing for your account is by deselecting Text in MP3 Music Files on the Staged Content Types page.
You can also use the URL Masks feature to disable MP3 indexing by file extension. To do this, on the product menu, click Settings > Crawling > URL Masks . Enter one of the following masks:
  • exclude *.mp3 (if your account does not use regular expressions)
  • exclude regexp ^.*\.mp3$ (if your account uses regular expressions)

Why can't I search the Chinese, Japanese, or Korean MP3 files on my site?

To search Chinese, Japanese, or Korean MP3 files, on the product menu, click Settings > Crawling > Content Types > Text in MP3 Music Files . Then, click Settings > Metadata > Injections , and specify the character set that is used to encode the MP3 files.

What gets indexed in a PDF file?

The full content of PDF files is indexed. The following parts of a PDF file are indexed:
  • Title
  • Keywords
  • Subject (Description)
  • Text-based content

What does not get indexed in a PDF file?

The PDF table of contents, any graphics within the file, and any text that is part of a contained graphic are not indexed.

How are indexed PDF files counted?

Each PDF file, including a PDF that contains multiple pages, is counted as a single document.

Can the search results display a PDF icon?

Yes. Use the <search-if-link-extension> tag within your template to include a PDF icon, or other graphics or text, in the search results:
<search-results> 
  ... 
  <search-if-link-extension value=".pdf"> 
    <img src="/search/i/pdficon.gif"> 
  </search-if-link-extension> 
  ... 
</search-results>

PDF icons help your customers know that a search result links to a PDF file that might be very large. File size may matter to customers who are accessing your website over a modem or on a mobile device.

Can the search results link to a particular page in a PDF file?

Yes. Using the smart links template tag ( <search-smart-link>...</search-smart-link> ), customers can click to open the first PDF page that contains the search result.
To use smart links, replace the <search-link>...</search-link> tags in the search results section of your template with <search-smart-link>...</search-smart-link> tags. When a customer clicks a link that the smart-link tags generate, they go to the first PDF page relevant to their search query.
To use this feature, the customer must use a recent version of Adobe Acrobat or Adobe Acrobat Reader that includes the highlight plug-in and the External Window Handler (EWH) plug-in. In addition, their web browser must use either the Adobe Acrobat plug-in for Netscape Navigator (any browser that accepts this Netscape Navigator plug-in works) or the Acrobat ActiveX control for Internet Explorer 4.0 and later.

How do I prevent PDF files from being indexed on my website?

If you do not want the search robot to crawl and index PDF files, deselect the content type PDF Documents ( Settings > Crawling > Content Types ).
You can also choose to use URL Masks to disable PDF indexing.
To disable PDF indexing, enter one of the following URL masks:
  • exclude *.pdf (if you are not using regular expressions)
  • exclude regexp ^.*\.pdf$ (if you are using regular expressions)

Why can't I search the Chinese, Japanese, or Korean PDF files on my website?

Site search/merchandising obtains UTF-8 from PDF files with no indication of language. If you selected the content type PDF Documents ( Settings > Crawling > Content Types ), you must use metadata injections to specify the language that is used in the PDF file.

Too many pages

A frequently asked questions page that explains some of the reasons why the indexer has counted more pages than you actually have, and what the solution is in each case.
If you are certain your website is below your page limit, but the indexer is telling you that the limit is reached, you should review these common questions and answers for possible solutions.

Have you examined your various index logs?

The index log contains detailed information collected by the site search/merchandising robot as it indexes your website. The log includes a list of all crawled links and encountered errors. Examining the index log is the best place to start when you are trying to determine which pages are getting indexed.

Are CGI programs being indexed on your website?

CGI programs use URL parameters that sometimes cause the indexer to crawl multiple "fake" URLs. If site search/merchandising is reading your CGI programs and following URLs with CGI parameters in them, many of the pages being crawled and indexed are probably duplicates that are not useful for your search index. Typical CGI parameters appear in URLs with ? or & characters.
You can use the URL Masks feature to prevent CGI programs from being indexed. You can mask a URL prefix or use regular expressions to mask your CGI scripts.

Does your server have directory browsing enabled?

When a web server has directory browsing enabled and there is no index.html file present in a given directory, a visit to that directory can show the listing of files in that directory. Usually, there are links at the top of the page to let you sort the list in different ways just by clicking Name , Last modified , Size , and so on. Typically, these appear in the site search/merchandising index log as URLs with characters such as ?M=A at the end. The site search/merchandising indexer follows these as links, and this can lead to indexing multiple "fake" URLs.
Typically, a well-designed website either has index files located in every directory, or it has directory browsing disabled for those directories without index files. Fortunately, there is an easy way to mask these "fake" URLs if you are unable to change your pages or disable directory listings on the server side.
To accomplish this task, click Settings > Crawling > URL Masks . Add a mask that excludes any URL that contains the ? character. You can do this by entering the following regular expression mask:
exclude regexp ^.*\?.*$
After you have created the mask, be sure that you reindex your website.
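Before you reindex, you can preview the effect of the ? mask on URLs taken from your index log. The following Python sketch assumes that the mask behaves like the same pattern in Python's re module. The sample URLs are hypothetical.

```python
import re

# The regular-expression mask shown above.
QUERY_MASK = re.compile(r"^.*\?.*$")

urls = [
    "https://www.adomain.com/files/",
    "https://www.adomain.com/files/?M=A",
    "https://www.adomain.com/files/?S=D",
    "https://www.adomain.com/files/report.html",
]

# Keep only the URLs that the exclude mask does not match.
kept = [url for url in urls if not QUERY_MASK.search(url)]
print(kept)
```

Note that this mask excludes every URL that contains a query string, including dynamic pages that you may want indexed. If your website has such pages, a narrower mask may be a better fit.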

Are there forums or newsgroups on your website?

If forums or newsgroups are being crawled on your website, the robot might be following URLs for different display options or sort options. This behavior means that the same page is indexed multiple times.
Typically, forums or newsgroups come with their own search engines. In that case, you can use URL Masks to mask the forums from site search/merchandising.
On the product menu, click Settings > Crawling > URL Masks . On the Staged URL Masks page, mask your forums by entering their URLs as exclude URL masks.
After you have created the masks, be sure that you reindex your website.

Are there PDF or Microsoft Office files on your website?

If you have PDF files or Microsoft Office files on your website, you might notice that only a few files account for many pages in the index. More pages are indexed than the number of documents you have because each page in a PDF or Microsoft Office file is counted as a separate page.
On the product menu, click Index > Full Index > Live Index . On the Full Index page, select Count All Pages , and then click Full Index Now to see a total page count. If you do not want PDF files or Microsoft Office files indexed, you can disable this content type under Settings > Crawling > Content Types .

Do you have multiple URL entrypoints?

The site search/merchandising robot begins crawling at specified URL entrypoints and follows all found links to all content in that particular domain. If you have specified many URL entrypoints, a significant number of pages may be crawled.
Use the Robots Exclusion Protocol's nofollow tag in the headers of the entrypoint documents on the additional domains as follows:
<html> 
<head> 
<meta name="robots" content="nofollow"> 
</head>

The code above tells the site search/merchandising robot to index the contents of the page, but not to follow the links to additional pages.
You can learn more about web robots and the Robots Exclusion Protocol at the following:
If you do not have access to the source of the pages on additional domains, you can remove the multiple URL entrypoints. Doing so helps you to limit indexing activity only to those domains whose content you want customers to be able to search.

Have you exceeded the internal bytes or time limits of site search/merchandising?

Check the "Full Index Status" screen to see if your account has reached its limit. If the status reports that your index is larger than allowed or that it took more time than allowed, your website is not fully indexed. You can correct this error so that you get proper coverage and count of website pages.
To protect site search/merchandising servers, there are internal limits on bytes and time. These limits are reached only when crawled files are very large, or when the server that site search/merchandising is trying to reach is slow.
If you hit a time limit, make sure that your server is online, and attempt the index again at a later time. If you hit a bytes limit, check the crawled files by viewing your index log. Are they unusually large? Contact Technical Support if you see either of these messages.