# Crawl behavior
By the end of this page you'll know exactly how many pages MimicBot will fetch, whether it honors robots.txt, and how to trigger a manual re-crawl when your content changes.
## Before you start
- A bot you've already created (see Create your first bot).
- An API token with write access to that bot.
## The crawl fields

These live under `crawlConfig` on the bot record and are shallow-merged on every PATCH.
| Field | Type | Default | Description |
|---|---|---|---|
| `schedule` | `'manual' \| 'daily' \| 'weekly'` | `manual` | How often automatic re-crawls run. Only `manual` is currently wired. |
| `maxPages` | integer (1–5000) | `500` | Ceiling on pages fetched per crawl, across all sources on this bot. |
| `respectRobotsTxt` | boolean | `true` | Whether the crawler obeys `Disallow` rules in your site's robots.txt. |
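Put together, a bot that has never had its crawl settings changed carries the defaults above. As a JSON fragment (field names and defaults from the table; the surrounding bot record is omitted):

```json
{
  "crawlConfig": {
    "schedule": "manual",
    "maxPages": 500,
    "respectRobotsTxt": true
  }
}
```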
## Picking `maxPages`
- Start at the default of 500 for a typical marketing site or small documentation corpus.
- Push toward 1000–2000 for large docs sites or blogs. Expect the first crawl to take longer.
- The hard ceiling is 5000 per bot. If you need more, split content across multiple bots by audience.
The counter applies across every source on the bot: adding a second source doesn't double the budget. Sources are covered on the next page.
## `respectRobotsTxt`: when to turn it off

Leave this at `true` unless you own the site and have a specific reason. Disabling it is useful when:

- Your staging site blocks all crawlers via robots.txt, but you want MimicBot to index it.
- You control the site and serve it from an environment where robots.txt is misconfigured.

Never disable this for sites you don't own.
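For example, to let MimicBot index a staging site you own whose robots.txt blocks all crawlers, PATCH only that one field; because of the shallow merge, the other `crawlConfig` fields keep their values. The request body is just:

```json
{
  "crawlConfig": {
    "respectRobotsTxt": false
  }
}
```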
## A note on schedule

The `schedule` enum accepts `manual`, `daily`, and `weekly`, but only `manual` is wired end-to-end today. Setting `daily` or `weekly` is accepted and stored, but no automatic re-crawl will fire yet. Trigger re-crawls via the API or dashboard until scheduled crawls ship in a future release.
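If you want to record a future preference anyway, PATCHing the field is harmless: the value is stored and should take effect once scheduled crawls ship. The request body is just:

```json
{
  "crawlConfig": {
    "schedule": "weekly"
  }
}
```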
## Updating crawl behavior

```bash
curl -X PATCH https://api.mimicbot.app/api/bots/{botId} \
  -H "Authorization: Bearer $MIMICBOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlConfig": {
      "maxPages": 1000,
      "respectRobotsTxt": true
    }
  }'
```
The PATCH endpoint shallow-merges `crawlConfig`: fields you omit keep their previous values.
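If the shallow-merge semantics are unclear, they behave like a top-level dictionary update. A minimal sketch (this mimics the behavior described above, not the server's actual implementation):

```python
def shallow_merge(stored: dict, patch: dict) -> dict:
    """Top-level keys in `patch` overwrite `stored`; omitted keys survive."""
    merged = dict(stored)
    merged.update(patch)
    return merged

# Config as stored, at the documented defaults.
stored = {"schedule": "manual", "maxPages": 500, "respectRobotsTxt": True}

# Body of a PATCH that only sets maxPages.
patch = {"maxPages": 1000}

result = shallow_merge(stored, patch)
# maxPages changes; schedule and respectRobotsTxt keep their previous values.
print(result)
```

Note that the merge is shallow, not deep: it operates on the top-level keys of `crawlConfig`, so sending the whole object with one field changed and sending just that one field are equivalent.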
## Triggering a manual re-crawl

```bash
curl -X POST https://api.mimicbot.app/api/bots/{botId}/crawl \
  -H "Authorization: Bearer $MIMICBOT_TOKEN"
```
This enqueues a Temporal workflow that re-crawls every source on the bot. The bot's status flips to `indexing` for the duration and back to `ready` when the new pages are embedded. Track progress with `GET /api/bots/{botId}/crawl-jobs`.
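To watch a crawl from the command line, poll the jobs endpoint with the same token (the exact response shape isn't documented here, so inspect it before scripting against it):

```bash
curl https://api.mimicbot.app/api/bots/{botId}/crawl-jobs \
  -H "Authorization: Bearer $MIMICBOT_TOKEN"
```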