The following applies when creating rules for SharePoint sites (including a Portal Site):
- The SharePoint indexer parses sites in an internal format. If you look at the logs (I strongly recommend enabling logging of all events while configuring indexing rules), you will see URIs like Sts2://amonlanc:8080/sites/kb/webid=000/listid={5ED439BA-05A2-42A6-BE40-527E7619D82A}/itemid=12 (amonlanc:8080 is the Portal Server) or sps://amonlanc:8080/site$$$site/scope=$$$default$$$/bucketid=3/.
- The SharePoint indexer also parses the URL associated with these resources for certain entities: Sites (http://amonlanc:8080/sites/kb) and List Items (http://amonlanc:8080/sites/kb/Templates/Template%202.html).
You can write rules to include or exclude content from being indexed or searched by specifying Site and Item URL patterns. Here are some examples (amonlanc:8080 is a SharePoint Portal Server):
- http://amonlanc:8080/ - Exclude. http://amonlanc:8080/ (with a slash) is the root site (or Portal Site). You can decide to include or exclude it.
- http://amonlanc:8080/sites/*/templates/* - Exclude. Do not index items following this URL pattern.
- http://amonlanc:8080/sites/test/* - Exclude. Do not index items following this URL pattern.
- http://amonlanc:8080/sites/archive/ - Exclude. Do not index this site.
- http://amonlanc:8080/* - Include. Index all items following this URL Pattern.
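To check which of these patterns a given URL falls under before touching the server, a small script can help. The following is only a sketch, not the indexer's real matching code: it assumes that * stands for any run of characters (including slashes) and that comparison ignores case. The helper name wildcard_match is made up for this example.

```python
import re

def wildcard_match(pattern, url):
    """True if url matches pattern, where '*' stands for any run of characters.

    Assumptions (not confirmed SharePoint behaviour): '*' also spans '/',
    and the comparison is case-insensitive.
    """
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url, re.IGNORECASE) is not None

# The five example patterns from the list above, in the same order.
patterns = [
    "http://amonlanc:8080/",
    "http://amonlanc:8080/sites/*/templates/*",
    "http://amonlanc:8080/sites/test/*",
    "http://amonlanc:8080/sites/archive/",
    "http://amonlanc:8080/*",
]

url = "http://amonlanc:8080/sites/kb/Templates/Template%202.html"
matching = [p for p in patterns if wildcard_match(p, url)]
print(matching)
```

Under these assumptions, the list item URL matches both the templates pattern and the catch-all pattern, so more than one rule can apply to the same URL.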
The index management page shows the number of documents in the index. This is not really the number of documents you can search, but the number of documents processed during indexing. In the example above, no items were processed for the rule [http://amonlanc:8080/sites/archive/ - Exclude] other than the site itself; however, the count does include all the items excluded by the rule [http://amonlanc:8080/sites/test/* - Exclude].
Rules are applied from first to last, with the first rule having the highest priority. This is important, as:
- http://amonlanc:8080/sites/kb* - Excluded
- http://amonlanc:8080/* - Included
means that one site (kb) is excluded, while
- http://amonlanc:8080/* - Included
- http://amonlanc:8080/sites/kb* - Excluded
means that all content is indexed, including the kb site, because the catch-all rule comes first and already matches the kb URLs. So order matters.
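The effect of rule order can be reproduced with a few lines of Python. This is a rough model of the "first rule wins" behaviour described above, not SharePoint's implementation: the function name first_match, the case-insensitive wildcard matching, and the default for URLs that match no rule are all assumptions made for the example.

```python
import re

def first_match(url, rules, default="Excluded"):
    """Return the action of the first rule whose pattern matches url."""
    for pattern, action in rules:
        # Assumption: '*' matches any run of characters, ignoring case.
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.fullmatch(regex, url, re.IGNORECASE):
            return action
    # Assumption: a URL that matches no rule is treated as excluded.
    return default

exclude_first = [
    ("http://amonlanc:8080/sites/kb*", "Excluded"),
    ("http://amonlanc:8080/*", "Included"),
]
# Same two rules, opposite order.
include_first = list(reversed(exclude_first))

url = "http://amonlanc:8080/sites/kb/Templates/Template%202.html"
print(first_match(url, exclude_first))  # Excluded
print(first_match(url, include_first))  # Included
```

With the exclude rule first, the kb item is excluded. With the catch-all rule first, the kb rule is never reached.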
I'm still working on this, so I may publish more details. If you have any questions or comments, please add them to the article and hopefully I will be able to answer them.
