Update crawler configuration
Updates the configuration of the specified crawler.
editSettings
Authorizations
Basic authentication header of the form Basic <encoded-value>, where <encoded-value> is the base64-encoded string username:password.
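As a minimal sketch of the header format described above, the value can be built with Python's standard library. The function name and the sample credentials are illustrative only.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build a header value of the form 'Basic <encoded-value>'."""
    # <encoded-value> is the base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials, not real ones.
    print(basic_auth_header("user", "pass"))
```

Send the returned string as the value of the Authorization header.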
Path Parameters
Crawler ID. Universally unique identifier (UUID) of the crawler.
"e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
Body
Crawler configuration to update.
You can only update top-level configuration properties.
To update a nested configuration, such as actions.recordExtractor,
you must provide the complete top-level object such as actions.
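The rule above can be sketched as follows: to change one nested value, copy the whole top-level object, modify the copy, and send the complete object. The helper name, the sample configuration, and the pathsToMatch field are assumptions made for illustration. Only the "complete top-level object" rule comes from this page.

```python
import copy
import json


def build_update_body(current_config, top_level_key, modify):
    """Return a request body holding the complete top-level object."""
    # Copy the whole top-level value so the nested change is sent
    # together with all of its sibling properties.
    value = copy.deepcopy(current_config[top_level_key])
    modify(value)
    return {top_level_key: value}


# Hypothetical existing configuration.
current = {
    "rateLimit": 8,
    "actions": [
        {"indexName": "docs", "pathsToMatch": ["https://www.example.com/**"]},
    ],
}


def rename_index(actions):
    # The nested change: only indexName of the first action differs.
    actions[0]["indexName"] = "docs_v2"


body = build_update_body(current, "actions", rename_index)

if __name__ == "__main__":
    print(json.dumps(body, indent=2))
```

The body contains every action in full, not just the changed property, and omits top-level properties that stay the same.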
A list of actions.
1 - 30 elements
Algolia application ID where the crawler creates and updates indices.
Determines the number of concurrent tasks per second that can run for this configuration.
A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring the number of URLs added in the last second and the number of URLs being processed is less than the rate limit:
max(new_urls_added, active_urls_processing) <= rateLimit
Start with a low value (for example, 2) and increase it if you need faster crawling.
A high rateLimit can significantly increase bandwidth cost and server resource consumption.
The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL.
For a given rateLimit, if fetching, processing, and uploading URLs takes (on average):
- Less than a second, your crawler processes up to rateLimit pages per second.
- Four seconds, your crawler processes up to rateLimit / 4 pages per second.
In the latter case, increasing rateLimit improves performance up to a point.
If the processing time remains at four seconds, increasing rateLimit won't increase the number of pages processed per second.
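The two cases above suggest a simple upper bound on throughput. The sketch below assumes the bound is rateLimit divided by the average handling time, capped at rateLimit when handling takes less than a second. It is an estimate derived from the examples on this page, not the Crawler's actual scheduler, and the function name is hypothetical.

```python
def max_pages_per_second(rate_limit: int, avg_seconds_per_url: float) -> float:
    """Estimate the upper bound of pages processed per second."""
    if rate_limit < 1:
        raise ValueError("rate_limit must be at least 1")
    if avg_seconds_per_url <= 0:
        raise ValueError("avg_seconds_per_url must be positive")
    # Below one second per URL the rate limit itself is the cap;
    # above it, each task slot stays busy for avg_seconds_per_url.
    return rate_limit / max(1.0, avg_seconds_per_url)


if __name__ == "__main__":
    print(max_pages_per_second(8, 0.5))  # 8.0
    print(max_pages_per_second(8, 4.0))  # 2.0
```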
1 <= x <= 100
4
The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.
The API key must have:
- These rights and restrictions: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse
- Access to the correct set of indices, based on the crawler's indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.
Don't use your Admin API key.
URLs to exclude from crawling.
100
Use micromatch for negation, wildcards, and more.
[
"https://www.example.com/excluded",
"!https://www.example.com/this-one-url",
"https://www.example.com/exclude/**"
]
References to external data sources for enriching the extracted records.
10
For more information, see Enrich extracted records with external data.
The Crawler treats extraUrls the same as startUrls.
Specify extraUrls if you want to differentiate between URLs you manually added to fix site crawling from those you initially specified in startUrls.
9999
Determines if the crawler should extract records from a page with a canonical URL.
If ignoreCanonicalTo is set to:
- true, all canonical URLs are ignored.
- One or more URL patterns, the crawler will ignore the canonical URL if it matches a pattern.
Determines if the crawler should follow links with a nofollow directive.
If true, the crawler will ignore the nofollow directive and crawl links on the page.
The crawler always ignores links that don't match your configuration settings.
ignoreNoFollowTo applies to:
- Links that are ignored because the robots meta tag contains nofollow or none.
- Links with a rel attribute containing the nofollow directive.
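The two cases above can be sketched as a single decision function. This is only an illustration of the rule as described here, assuming the rel attribute is a space-separated token list and the robots meta content is comma-separated. The function name is hypothetical, and the Crawler's real link filtering also applies your other configuration settings.

```python
def should_follow(link_rel: str, robots_meta: str, ignore_no_follow_to: bool) -> bool:
    """Decide whether a link is followed under the nofollow rules."""
    rel_tokens = set(link_rel.lower().split())
    robots_tokens = {token.strip() for token in robots_meta.lower().split(",")}
    # A link counts as nofollow if its rel attribute says so, or if the
    # page-level robots meta tag contains "nofollow" or "none".
    is_no_follow = "nofollow" in rel_tokens or bool(robots_tokens & {"nofollow", "none"})
    # With ignoreNoFollowTo set to true, the directive is ignored.
    return ignore_no_follow_to or not is_no_follow


if __name__ == "__main__":
    print(should_follow("nofollow noopener", "", False))   # False
    print(should_follow("nofollow noopener", "", True))    # True
```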
Whether to ignore the noindex robots meta tag.
If true, pages with this meta tag will be crawled.
Whether the crawler should follow rel="prev" and rel="next" pagination links in the <head> section of an HTML page.
- If true, the crawler ignores the pagination links.
- If false, the crawler follows the pagination links.
Query parameters to ignore while crawling.
All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
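To illustrate how ignoring query parameters makes URLs identical, the sketch below strips ignored parameters before comparing URLs. It assumes shell-style wildcards (for example, utm_*) as implemented by Python's fnmatch, which may not match the Crawler's own pattern rules exactly. The function name is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ignored_params(url, ignored_patterns):
    """Return the URL without query parameters matching any pattern."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(fnmatchcase(name, pattern) for pattern in ignored_patterns)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    patterns = ["ref", "utm_*"]
    first = strip_ignored_params("https://www.example.com/page?ref=home&id=7", patterns)
    second = strip_ignored_params("https://www.example.com/page?id=7&utm_source=news", patterns)
    # Both URLs reduce to the same string, so they count as one page.
    print(first == second)
```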
9999
Use wildcards to match multiple query parameters.
["ref", "utm_*"]
Whether to ignore rules defined in your robots.txt file.
A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.
64
"crawler_"
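A small sketch of how the prefix and an action's indexName might combine, and how that relates to the API key's index access pattern (for example, crawler_*). It assumes plain concatenation and shell-style wildcard matching. Both helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def full_index_name(index_prefix: str, index_name: str) -> str:
    """Combine the crawler's indexPrefix with an action's indexName."""
    # Assumption: the complete name is the prefix followed by the name.
    return f"{index_prefix}{index_name}"


def key_covers_index(key_index_patterns, index_name: str) -> bool:
    """Check whether any of the key's index patterns matches the index."""
    return any(fnmatchcase(index_name, pattern) for pattern in key_index_patterns)


if __name__ == "__main__":
    name = full_index_name("crawler_", "docs")
    print(name)
    print(key_covers_index(["crawler_*"], name))
```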
Crawler index settings.
These index settings are only applied during the first crawl of an index.
Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
Function for extracting URLs from links on crawled pages.
For more information, see the linkExtractor documentation.
Authorization method and credentials for crawling protected content.
The Crawler supports these authentication methods:
- Basic authentication. The Crawler obtains a session cookie from the login page.
- OAuth 2.0 authentication (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.
Basic authentication
The Crawler extracts the Set-Cookie response header from the login page, stores that cookie,
and sends it in the Cookie header when crawling all pages defined in the configuration.
This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed.
The Crawler can obtain the session cookie in one of two ways:
- HTTP request authentication (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
- Browser-based authentication (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.
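The cookie handling described above (read Set-Cookie from the login response, then send the value back in a Cookie header) can be sketched with Python's standard library. This illustrates the general mechanism, not the Crawler's implementation. The function name and the sample cookie values are made up.

```python
from http.cookies import SimpleCookie
from typing import Iterable


def cookie_header_from_set_cookie(set_cookie_values: Iterable[str]) -> str:
    """Turn Set-Cookie response header values into a Cookie request header."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        # load() parses the name=value pair and drops nothing; attributes
        # such as Path or HttpOnly are stored on the morsel, not sent back.
        jar.load(value)
    return "; ".join(f"{name}={morsel.coded_value}" for name, morsel in jar.items())


if __name__ == "__main__":
    login_response_headers = [
        "session=abc123; Path=/; HttpOnly",
        "csrf=xyz; Secure",
    ]
    print(cookie_header_from_set_cookie(login_response_headers))
```

Because the cookie is retrieved only at the start of each full crawl, a session that expires mid-crawl is not refreshed by this kind of flow.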
OAuth 2.0
The crawler supports OAuth 2.0 client credentials grant flow:
- It performs an access token request with the provided credentials.
- It stores the fetched token in an Authorization header.
- It sends the token when crawling site pages.
This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed.
Client authentication passes the credentials (client_id and client_secret) in the request body.
The Azure AD v1.0 provider is supported.
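As a rough sketch of the client credentials flow above, the token request body carries client_id and client_secret as form fields, and the fetched token is then placed in an Authorization header. The Bearer scheme is an assumption based on common OAuth 2.0 usage, since this page doesn't name the scheme. Function names and credentials are placeholders.

```python
from urllib.parse import urlencode


def client_credentials_body(client_id: str, client_secret: str) -> str:
    """Form-encode an OAuth 2.0 client credentials token request body."""
    return urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    )


def authorization_header(access_token: str) -> dict:
    """Build the header sent with each crawled page (Bearer is assumed)."""
    return {"Authorization": f"Bearer {access_token}"}


if __name__ == "__main__":
    print(client_credentials_body("my-id", "my-secret"))
    print(authorization_header("example-token"))
```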
HTTP request example:
{
"url": "https://example.com/secure/login-with-post",
"requestOptions": {
"method": "POST",
"headers": {
"Content-Type": "application/x-www-form-urlencoded"
},
"body": "id=my-id&password=my-password",
"timeout": 5000
}
}
Determines the maximum path depth of crawled URLs.
Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1).
For example:
- 1: http://example.com
- 1: http://example.com/
- 1: http://example.com/foo
- 2: http://example.com/foo/
- 2: http://example.com/foo/bar
- 3: http://example.com/foo/bar/
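The depth rule and the examples above can be reproduced with a short function. This sketch assumes depth is the number of slash characters in the URL path with a minimum of 1, which is consistent with every example listed. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Count slash characters after the domain, starting at 1."""
    path = urlsplit(url).path
    # "http://example.com" has an empty path, so clamp the result to 1.
    return max(1, path.count("/"))


if __name__ == "__main__":
    for url in (
        "http://example.com",
        "http://example.com/foo",
        "http://example.com/foo/bar/",
    ):
        print(path_depth(url), url)
```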
URLs added with startUrls and sitemaps aren't checked for maxDepth.
1 <= x <= 100
5
Limits the number of URLs your crawler processes.
Change it to a low value, such as 100, for short crawling tests.
Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures.
Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.
1 <= x <= 15000000
250
If true, use a Chrome headless browser to crawl pages.
Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
Lets you add options to HTTP requests made by the crawler.
Checks to ensure the crawl was successful.
For more information, see the Safety checks documentation.
Whether to back up your index before the crawler overwrites it with new records.
Schedule for running the crawl.
Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls.
Use the visual UI or add the schedule parameter to your configuration.
schedule uses Later.js syntax to specify when to crawl your site.
Here are some key things to keep in mind when using Later.js syntax with the Crawler:
- The interval between two scheduled crawls must be at least 24 hours.
- To crawl daily, use "every 1 day" instead of "everyday" or "every day".
- If you don't specify a time, the crawl can happen any time during the scheduled day.
- Specify times in the UTC (GMT+0) timezone.
- Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm".
- Use "at 12:00 am" to specify midnight, not "at 00:00 am".
"every weekday at 12:00 pm"
Sitemaps with URLs from where to start crawling.
9999
URLs from where to start crawling.
9999
Response
OK
Universally unique identifier (UUID) of the task.
"98458796-b7bb-4703-8b1b-785c1080b110"