Creates a project by crawling your website. This project provides you with an inventory of pages and a website content audit.
Wizard
The Wizard will ask you a series of questions you will need to answer to crawl your website.
Answer each question, then click NEXT until you have completed all questions.
Follow this configuration wizard to be guided through each option. If you are familiar with the manual settings and want more control over the options, you can close the wizard. If you do not complete all of the questions, your settings will not be saved for you to refine later.
To exit the Wizard to see and use the manual settings, click the CLOSE WIZARD button.
Manual Settings
Start by entering the URL. Customize the crawler behavior and refine the results with basic and advanced options.
Enter your URL
Basic Options - use default or customize your settings.
Follow robots.txt
- A robots.txt file can provide bots and crawlers with instructions on what pages to ignore while crawling your project. A robots.txt file's primary purpose is to block search engine indexing; this can be useful while developing a new website or to keep some pages private. Turn this setting ON to ignore the file's instructions.
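To illustrate (the rules and URLs here are hypothetical), a robots.txt file is just a plain-text list of User-agent and Disallow lines; this sketch uses Python's standard robotparser to show how a compliant crawler would interpret such rules:

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block all crawlers from /private/ and /drafts/.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /drafts/",
]
rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "http://www.example.com/private/page.html"))  # False (blocked)
print(rp.can_fetch("*", "http://www.example.com/about.html"))         # True  (allowed)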
Follow subdomains
- A sub-domain is a domain that is part of another primary domain. For example, north.example.com and south.example.com are subdomains of the example.com domain. When crawling a primary domain, any linked pages that belong to a subdomain will be included in your project by default. Turn this feature ON if you would like the crawler to also visit subdomains for full inclusion into your project. If you would like to exclude all subdomains, block each using the Filter Links / Skip Links.
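As a minimal sketch (not the crawler's own logic), this is roughly how a link's host can be tested against the primary domain to decide whether it belongs to a subdomain:

def is_subdomain(host: str, primary_domain: str) -> bool:
    # north.example.com and south.example.com are subdomains of example.com;
    # example.com itself is not.
    return host != primary_domain and host.endswith("." + primary_domain)

print(is_subdomain("north.example.com", "example.com"))  # True
print(is_subdomain("www.other.com", "example.com"))      # False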
Ignore trailing slash
- If your website code is handwritten, your code may be inconsistent and contain certain problems even though the page will still appear to work properly in a web browser. This feature is set to ON by default. Leave it ON to let the crawler be more forgiving and overlook such problems, similar to how web browsers operate. In particular, URLs that differ only by a trailing slash (for example, www.example.com/about and www.example.com/about/) can then be treated as the same page.
Ignore query string
- A query string is information within a URL that follows a '?' - for example, www.example.com/index.html?thisis=thequerystring. This option is important if you have dynamic sections within your website, such as calendars. Dynamic sections that use query strings can consist of many pages that you might want to omit from your sitemap. Turn this feature ON to ignore query string links.
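As an illustration of what ignoring the query string means (a sketch, not the crawler's implementation), URLs that differ only after the '?' can be reduced to the same address:

from urllib.parse import urlsplit, urlunsplit

def strip_query_string(url: str) -> str:
    # Drop everything after the '?' so query-string variants of a page
    # collapse into a single URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query_string("http://www.example.com/index.html?thisis=thequerystring"))
# http://www.example.com/index.html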
Include redirects
- A redirect can send both users and search engines to a different URL than the one that is clicked. Turn this feature ON if you would like both the redirected link and the destination link included in your project.
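For example (hypothetical URL), a quick way to see both the redirected link and its destination is to follow the redirect and inspect the response history; this sketch uses the Python requests library, not the crawler itself:

import requests

# Follow redirects and report both the original hops and the final URL.
resp = requests.get("http://www.example.com/old-page", allow_redirects=True, timeout=10)
for hop in resp.history:                 # each redirect response (e.g. 301, 302)
    print(hop.status_code, hop.url)
print("final destination:", resp.url)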
Include bad links
Using Include Bad Links will include all pages with a 4xx or 5xx status code. The 4xx class of status codes indicates that a request for a server resource either contains bad syntax or cannot be fulfilled for one reason or another. Here are some of the most common HTTP 4xx codes:
- 400 - Bad Request
- 401 - Unauthorized
- 402 - Payment Required
- 403 - Forbidden
- 404 - File Not Found
The 5xx class of codes are responses to requests that servers cannot fulfill. Here are some of the most common HTTP 5xx codes:
- 500 - Internal Server Error
- 501 - Not Implemented
- 502 - Bad Gateway
- 503 - Service Unavailable
- 504 - Gateway Time-out
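As a rough sketch of the idea (not the crawler's own code, and with a hypothetical URL), a page can be classified as a bad link by checking whether its status code falls in the 4xx or 5xx range:

import requests

def is_bad_link(url: str) -> bool:
    # Any 4xx (client error) or 5xx (server error) response counts as a bad link.
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return True  # the page could not be reached at all
    return 400 <= resp.status_code <= 599

print(is_bad_link("https://www.example.com/missing-page"))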
Render JavaScript
- Use this feature to render a page's JavaScript, CSS, images, and other assets during the crawl. If your domain uses a lot of JavaScript, this feature can be turned ON to aid the crawling process.
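Rendering means loading the page in a real browser engine so that content generated by JavaScript becomes visible to the crawler. A minimal sketch of the idea, using the Playwright browser-automation library (not necessarily what this tool uses internally) and a hypothetical URL:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.example.com/")
    rendered_html = page.content()  # HTML after the page's scripts have run
    browser.close()
print(len(rendered_html))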
Include Content
- Download the entire body of a page. You can override and refine the content by setting Import Content by XPath rules in Advanced Options
Unique page title
- If your website's pages all have a unique title, then turning this feature ON will give you a very accurate crawl. Some content management systems will dynamically create multiple URLs with the same title, and these published pages can cause duplicates within your project. Only use this feature if you are certain that all pages have a unique page title.
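As a purely illustrative sketch of why unique titles matter (this is not how the crawler works internally), if pages are identified by title, only the first URL seen for each title survives, so distinct pages that share a title would collapse into one:

def dedupe_by_title(pages):
    # pages: iterable of (url, title) pairs; keep the first URL per title.
    first_seen = {}
    for url, title in pages:
        first_seen.setdefault(title, url)
    return list(first_seen.values())

pages = [
    ("http://www.example.com/a", "Home"),
    ("http://www.example.com/b", "Home"),     # same title, different page
    ("http://www.example.com/c", "Contact"),
]
print(dedupe_by_title(pages))  # ['http://www.example.com/a', 'http://www.example.com/c']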
Ignore scheme
- Depending on a server's configuration, a website can have duplicate pages if multiple schemes are allowed for individual pages. For example, if you do not ignore the scheme of www.example.com, the pages http://www.example.com and https://www.example.com would be considered two different pages. This feature is set to ON by default.
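As a sketch of the same idea in code (not the crawler's implementation), ignoring the scheme means stripping it before comparing URLs:

from urllib.parse import urlsplit, urlunsplit

def ignore_scheme(url: str) -> str:
    # Drop the scheme so the http:// and https:// versions of a page compare equal.
    parts = urlsplit(url)
    return urlunsplit(("", parts.netloc, parts.path, parts.query, parts.fragment))

print(ignore_scheme("http://www.example.com/") == ignore_scheme("https://www.example.com/"))  # True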
Arrange links by URL
- Keep this feature ON to arrange your sitemap based on the parent/child relationship that exists between URLs of a domain or subdomain.
Domain as root
- If you crawl a subdomain or subdirectory, setting this feature to ON will make the primary domain the root of the sitemap tree.
Include PDF
- The PDF or Portable Document Format is a file format developed by Adobe to present documents, including text formatting and images. PDFs are independent of application software, hardware, and operating systems. Turn this feature ON to include all PDF links in your sitemap. Use this feature if you plan to test PDFs for accessibility.
Start URL as root
- If you are crawling a subdirectory or subdomain, you can set the root of your hierarchy tree to the URL that you designate as the starting URL. The "Arrange Links by URL" option must also be set to ON to use this feature.
Authentication Options
Website Authentication
Use website authentication if you want to include private front-end pages in your sitemap or to test private front-end pages for accessibility.
- Create a Custom System Login
- Click the Advanced Custom System Login icon
- Click CREATE or IMPORT
- Contact Support via ticket with the login URL and a temporary username/password, and we will send you an import code if your site is compatible. Read these best practices before sending your login information.
Basic Authentication
The Basic authentication scheme is a widely used, industry-standard method for collecting user name and password information. Basic authentication transmits user names and passwords across the network in an unencrypted form.
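For illustration only (hypothetical URL and credentials), this is how a client such as a crawler can send Basic authentication credentials with each request using the Python requests library; the user name and password travel base64-encoded rather than encrypted, which is why HTTPS is recommended:

import requests

# The credentials are sent in the Authorization header of every request.
resp = requests.get(
    "https://www.example.com/private/",
    auth=("username", "password"),  # hypothetical credentials
    timeout=10,
)
print(resp.status_code)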
Advanced Options
Filter Links After Crawl (Most Accurate)
Only Subdirectories - Crawl only links that match the subdirectory part of the URL.
Skip Links - Skip links that match the following rules. This matches the full URL of the sitemap links.
Include/Exclude Rules
Include Links - Exclude everything except links that match the following rules. This matches the full URL of the sitemap links.
- Matching
- Select Sitemap Folder or Item
Exclude Links - Exclude links that match the following rules. This matches the full URL of the sitemap links.
- Matching
- Select Sitemap Folder or Item
Import Content by XPath - Configure XPath rules for scraping the content of a page body.
- Include XPath
- Exclude XPath
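As a hypothetical example of how XPath rules select content (the element names here are assumptions about your markup, not requirements of the tool), an Include XPath such as //article could keep the main content while an Exclude XPath such as //nav | //footer drops navigation and footer sections. The sketch below uses Python's lxml library to show the effect:

from lxml import html

# Hypothetical page markup used only for illustration.
page = html.fromstring(
    "<html><body>"
    "<nav>site menu</nav>"
    "<article><h1>Title</h1><p>Main content.</p></article>"
    "<footer>copyright</footer>"
    "</body></html>"
)

for element in page.xpath("//nav | //footer"):   # Exclude XPath
    element.drop_tree()                          # remove unwanted sections

content = page.xpath("//article")                # Include XPath
print(html.tostring(content[0], pretty_print=True).decode())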
Other Options
- Maximum Pages - Select the maximum number of pages to crawl
- Maximum Depth
Speed
- Default
- Slower
- Slowest
- User Agent