Admin Console Help
Content Sources > Web Crawl > Case-Insensitive Patterns

Use the Content Sources > Web Crawl > Case-Insensitive Patterns page to specify URL patterns to be treated case-insensitively. All URLs that match the patterns entered on this page are converted to lowercase before crawling or feeding.

For example, suppose your company has documents under http://example.com/Folder1/ that are also linked by means of http://example.com/folder1/. By entering http://example.com/folder1/ on this page, you ensure that both forms of the URLs that match the pattern are treated as the same URL (all lowercase). Note that the patterns entered on this page are themselves matched case-insensitively, so entering either http://example.com/folder1/ or http://example.com/Folder1/ works.

When you set patterns on this page, the entire URL is converted to lowercase. Therefore, make sure that the Follow and Crawl URL patterns on the Content Sources > Web Crawl > Start and Block URLs page include a URL pattern that matches the lowercase version of the URL. For example, suppose the search appliance is crawling http://example.com/Folder1/, which has links to a number of pages. The linked pages are all crawled as http://example.com/folder1/page.html. In this case, you need to make sure that the Follow and Crawl URL patterns match the full host (http://example.com/) or a lowercase version of that URL (http://example.com/folder1/).

Characters that are escaped values of other characters are not converted to lowercase. For example, the URL http://example.com/a|B is converted to http://example.com/a%7Cb. In this example, | becomes %7C (not %7c), and B becomes b.

To include all URLs for case-insensitive crawling, specify the following URL pattern:

regexp:.*

You can also enter exception patterns. To specify an exception pattern, prefix the expression with a hyphen (-). For example, the following configuration transforms all URLs under website.com to lowercase except everything under website.com/importantstuff/:
website.com/
-website.com/importantstuff/

URL patterns entered on this page do not affect documents that are already indexed. To remove incorrect URLs, either add appropriate Do Not Crawl URL patterns on the Content Sources > Web Crawl > Start and Block URLs page or reset the index by using the Index > Reset Index page.

Specifying URL Patterns as Case-Insensitive

To specify URL patterns as case-insensitive:
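
Apart from the console procedure, the conversion rules described on this page can be illustrated with a short Python sketch. It is not the search appliance's implementation. The function names (`lowercase_url`, `is_case_insensitive`, `normalize`) are hypothetical, and the matching logic is a deliberate simplification: a plain pattern is treated as a case-insensitive substring match, a `regexp:` pattern as a regular expression, and a leading hyphen as an exception. The appliance's real URL pattern rules are richer than this.

```python
import re
from urllib.parse import quote

# Matches an existing percent-escape such as %7C or %7c.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def lowercase_url(url: str) -> str:
    """Lowercase a URL while leaving percent-escape sequences untouched."""
    # Escape unsafe characters such as "|" first. quote() emits uppercase
    # hex digits; "%" is kept safe so existing escapes are not re-encoded.
    escaped = quote(url, safe=":/?#[]@!$&'()*+,;=%~")
    parts = []
    last = 0
    for match in _ESCAPE.finditer(escaped):
        parts.append(escaped[last:match.start()].lower())  # ordinary text
        parts.append(match.group(0))                       # escape, as-is
        last = match.end()
    parts.append(escaped[last:].lower())
    return "".join(parts)


def _matches(pattern: str, url: str) -> bool:
    # Assumption: simplified matching, not the appliance's pattern engine.
    if pattern.startswith("regexp:"):
        return re.search(pattern[len("regexp:"):], url, re.IGNORECASE) is not None
    return pattern.lower() in url.lower()


def is_case_insensitive(url: str, patterns: list[str]) -> bool:
    """Return True if the URL matches a pattern and no exception pattern."""
    include = [p for p in patterns if not p.startswith("-")]
    exclude = [p[1:] for p in patterns if p.startswith("-")]
    if any(_matches(p, url) for p in exclude):
        return False
    return any(_matches(p, url) for p in include)


def normalize(url: str, patterns: list[str]) -> str:
    """Lowercase the URL only when the pattern list selects it."""
    return lowercase_url(url) if is_case_insensitive(url, patterns) else url
```

Under these assumptions, `lowercase_url("http://example.com/a|B")` returns `http://example.com/a%7Cb`, matching the escaping example on this page. With the patterns `website.com/` and `-website.com/importantstuff/`, `normalize` lowercases `http://website.com/Docs/Page.html` but returns `http://website.com/importantstuff/Report.PDF` unchanged.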
© Google Inc.