Crawl behavior

By the end of this page you'll know exactly how many pages MimicBot will fetch, whether it honors robots.txt, and how to trigger a manual re-crawl when your content changes.

The crawl fields

These live under crawlConfig on the bot record and are shallow-merged on every PATCH.

  • schedule — 'manual' | 'daily' | 'weekly' (default: manual). How often automatic re-crawls run. Only manual is currently wired.
  • maxPages — integer, 1–5000 (default: 500). Ceiling on pages fetched per crawl, across all sources on this bot.
  • respectRobotsTxt — boolean (default: true). Whether the crawler obeys Disallow rules in your site's robots.txt.

Picking maxPages

  • Start at the default of 500 for a typical marketing site or small documentation corpus.
  • Push toward 1000–2000 for large docs sites or blogs. Expect the first crawl to take longer.
  • The hard ceiling is 5000 per bot. If you need more, split content across multiple bots by audience.

The maxPages budget applies across every source on the bot — adding a second source doesn't double it. Sources are covered on the next page.
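Since maxPages must be an integer between 1 and 5000, it can be worth validating the value locally before sending a PATCH. A minimal sketch — validate_max_pages is a hypothetical helper for illustration, not part of the API:

```shell
# Hypothetical pre-flight check: reject non-integers and values
# outside the documented 1–5000 range before calling the API.
validate_max_pages() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;   # not a non-negative integer
  esac
  if [ "$1" -ge 1 ] && [ "$1" -le 5000 ]; then
    echo "ok"
  else
    echo "out-of-range"; return 1
  fi
}

validate_max_pages 1000   # prints "ok"
```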

respectRobotsTxt — when to turn it off

Leave this at true unless you own the site and have a specific reason. Disabling it is useful when:

  • Your staging site blocks all crawlers via robots.txt but you want MimicBot to index it.
  • You control the site and serve it from an environment where robots.txt is misconfigured.

Never disable this for sites you don't own.
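If you do need to disable it for a staging site you own, a PATCH along these lines would do it (a sketch using the endpoint shown below; {botId} is your bot's ID):

```shell
# Disable robots.txt enforcement for a staging bot you own.
# Only respectRobotsTxt changes; other crawlConfig fields are untouched.
curl -X PATCH https://api.mimicbot.app/api/bots/{botId} \
  -H "Authorization: Bearer $MIMICBOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"crawlConfig": {"respectRobotsTxt": false}}'
```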

A note on schedule

The schedule enum accepts manual, daily, and weekly, but only manual is wired end-to-end today. Setting daily or weekly is accepted and stored, but no automatic re-crawl will fire yet. Trigger re-crawls via the API or dashboard until scheduled crawls ship in a future release.

Updating crawl behavior

curl -X PATCH https://api.mimicbot.app/api/bots/{botId} \
  -H "Authorization: Bearer $MIMICBOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlConfig": {
      "maxPages": 1000,
      "respectRobotsTxt": true
    }
  }'

The PATCH endpoint shallow-merges crawlConfig — fields you omit keep their previous values.
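Shallow merge means keys present in your PATCH body overwrite the stored values, while omitted keys are left alone. You can simulate the assumed semantics locally with jq — this is a sketch, not the actual server code:

```shell
# Stored config vs. a partial PATCH body: jq's object addition (+)
# performs exactly a shallow merge — right-hand keys win.
stored='{"schedule":"manual","maxPages":500,"respectRobotsTxt":true}'
patch='{"maxPages":1000}'
printf '%s %s' "$stored" "$patch" | jq -cs '.[0] + .[1]'
```

Only maxPages changes; schedule and respectRobotsTxt keep their stored values.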

Triggering a manual re-crawl

curl -X POST https://api.mimicbot.app/api/bots/{botId}/crawl \
  -H "Authorization: Bearer $MIMICBOT_TOKEN"

This enqueues a Temporal workflow that re-crawls every source on the bot. The bot's status flips to indexing for the duration and back to ready when the new pages are embedded. Track progress under GET /api/bots/{botId}/crawl-jobs.
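A crawl on a large site can take a while, so one way to wait for completion is to poll the crawl-jobs endpoint. A sketch, assuming BOT_ID is set and each job in the response carries a status field — the exact response shape isn't documented on this page:

```shell
# Poll until the newest crawl job reports a terminal status.
# The per-job "status" field and its values are assumptions for illustration.
while true; do
  status=$(curl -s "https://api.mimicbot.app/api/bots/$BOT_ID/crawl-jobs" \
    -H "Authorization: Bearer $MIMICBOT_TOKEN" | jq -r '.[0].status')
  echo "crawl job status: $status"
  [ "$status" != "running" ] && break
  sleep 10
done
```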

Next

→ Sources