Crawler Settings
The crawler can be configured separately for each project (website). Changes are saved only for the current project.
General Settings
User-Agent - Choose the User-Agent the crawler will use while crawling. You can also enter your own User-Agent.
Crawling Depth - Enter a depth level if you want to limit how deep the crawler goes.
Example values: 0 - no limit, 1 - only the root page, 2 - all pages linked from the root page, etc.
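The depth values above can be illustrated with a small breadth-first sketch. It is not the crawler's actual implementation: the function name crawl_order and the links dictionary (a stand-in for fetching pages) are assumptions made for this example.

```python
from collections import deque


def crawl_order(links, root, depth_limit=0):
    """Return {url: level} for every page reachable within depth_limit.

    links is a hypothetical dict mapping a URL to the URLs it links to.
    depth_limit follows the values above: 0 = no limit, 1 = root page only,
    2 = root page plus the pages linked from it, and so on.
    """
    levels = {root: 1}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        level = levels[url]
        if depth_limit and level >= depth_limit:
            continue  # pages at the limit are kept, but their links are not followed
        for target in links.get(url, []):
            if target not in levels:
                levels[target] = level + 1
                queue.append(target)
    return levels
```

With a limit of 2, for example, only the root page and the pages it links to directly are returned.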
Default Priorities - Here you can set the default priorities that the crawler assigns to the pages it finds.
How it works: 0 - the homepage of the website, 1 - all pages linked from the homepage, 2 - all pages linked from those pages, and so on; you can add levels 3, 4, etc.
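As a rough illustration of the level scheme above, the sketch below looks up a priority by link level. The priority numbers, the fallback value, and the function name default_priority are all hypothetical; the only thing assumed from outside this page is that priorities follow the Sitemaps convention of values between 0.0 and 1.0.

```python
def default_priority(level, table, fallback=0.5):
    """Return the priority configured for a link level.

    level: 0 = homepage, 1 = pages linked from the homepage, and so on.
    table: hypothetical mapping of level -> priority.
    fallback: value used for levels that have no entry (an assumption).
    """
    return table.get(level, fallback)


# Hypothetical configuration: deeper pages get lower priorities.
example_table = {0: 1.0, 1: 0.8, 2: 0.6}
```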
File Extension
The list of file extensions the spider should crawl. For instance, if your website's pages have a specific extension such as .file, select file in the list of extensions so that the spider crawls them. Enter the exact extension, without dots or asterisks. You can add your own extensions or remove unnecessary ones.
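A minimal sketch of how such an extension check could work is shown below. The function name has_allowed_extension is made up for this example, and the decision to accept URLs that have no extension at all (such as /forum/) is an assumption, not documented behavior.

```python
import posixpath
from urllib.parse import urlsplit


def has_allowed_extension(url, allowed):
    """Check the URL path's extension against a set like {"html", "file"}.

    Extensions in `allowed` are written without dots or asterisks,
    matching the format described above.
    """
    path = urlsplit(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if not ext:
        return True  # assumption: extensionless URLs are always crawled
    return ext in allowed
```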
Exceptions
The spider skips all pages whose addresses contain any of the words or symbols you list here. See the screenshot for examples.
Exceptions can also be set up from the site's robots.txt file. To do so, press the Import from robots.txt button and select the address from the robots.txt file.
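The sketch below shows one way the two ideas above could fit together: collecting Disallow paths from robots.txt text and then skipping URLs that contain an exception. Both function names are hypothetical, plain substring matching is an assumption, and the parser ignores User-agent groups for brevity.

```python
def disallowed_paths(robots_txt):
    """Collect the non-empty Disallow values from robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def is_excluded(url, exceptions):
    """Return True if the URL contains any exception string (assumed rule)."""
    return any(item in url for item in exceptions)
```

The paths returned by disallowed_paths can be passed directly as the exceptions list of is_excluded.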
Inclusions
The spider indexes only those pages whose addresses contain text from this list. See the usage example in the screenshot.
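The inclusion rule can be sketched as a simple substring test. The function name is_included is made up for this example, and treating an empty list as "no restriction" is an assumption.

```python
def is_included(url, inclusions):
    """Return True if the URL contains at least one inclusion string.

    Assumption: an empty inclusion list places no restriction on URLs.
    """
    return not inclusions or any(text in url for text in inclusions)
```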
Remove Parameters
If any of the listed parameters is found in a URL, it is removed before the URL is processed further. This function can be used to discard session IDs and similar one-time parameters.
Example:
If the spider finds a link such as http://community.invisionpower.com/forum/297-ips-company-feedback/?session=02e0a436b7555ee760af1a1a70c266cb and session is selected in the list, the program deletes ?session=02e0a436b7555ee760af1a1a70c266cb from the link and writes the clean link to the Sitemap file: http://community.invisionpower.com/forum/297-ips-company-feedback/.
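The same clean-up can be reproduced with Python's standard urllib.parse module. This is only a sketch of the idea, not the program's own code; the function name remove_parameters is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url, names):
    """Return the URL with the query parameters listed in `names` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # An empty query is dropped together with its "?" by urlunsplit.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Applied to the link from the example with names set to {"session"}, it returns the clean forum address, while any other parameters in the URL are kept.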
Content Types
Enter the content types of the files the spider should index. Example: text/html, text/plain.
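A content-type filter of this kind could look like the sketch below. It assumes the check is made against the HTTP Content-Type header with parameters such as charset ignored; the function name is_allowed_content_type is hypothetical.

```python
def is_allowed_content_type(header_value, allowed):
    """Compare the media type of a Content-Type header with an allowed set.

    Parameters after ";" (for example "charset=utf-8") are ignored.
    """
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in allowed
```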
Ready-to-use Settings
We have prepared complete spider settings for popular CMS and forum engines. These settings keep the spider from indexing the unnecessary pages that those engines usually contain. When you apply one of these presets, the program automatically adds all the necessary entries to the Remove Parameters and Exceptions sections. If you would like other popular engines added to the list, contact us and we will consider your request.
Processing Attributes
Choose the attributes the spider should process when looking for links to other pages of the site.
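To show what processing a chosen set of attributes might mean in practice, here is a short sketch built on Python's standard html.parser module. The class and function names are made up for this example, and href and src are only sample attribute names, not the program's actual list.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the selected attributes of any tag."""

    def __init__(self, base_url, attributes):
        super().__init__()
        self.base_url = base_url
        self.attributes = set(attributes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.attributes and value:
                # Resolve relative references against the page address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url, attributes=("href", "src")):
    parser = LinkExtractor(base_url, attributes)
    parser.feed(html)
    return parser.links
```

Passing only ("href",) as attributes would make the sketch ignore image sources and similar references.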