Crawl behavior

By the end of this page you'll know exactly how many pages MimicBot will fetch, whether it honors robots.txt, and how to trigger a manual re-crawl when your content changes.

The crawl fields

These live under crawlConfig on the bot record and are shallow-merged on every PATCH.

  • schedule — 'manual' | 'daily' | 'weekly' (default: manual). How often automatic re-crawls run. Only manual is currently wired.
  • maxPages — integer, 1–5000 (default: 500). Ceiling on pages fetched per crawl, across all sources on this bot.
  • respectRobotsTxt — boolean (default: true). Whether the crawler obeys Disallow rules in your site's robots.txt.

Picking maxPages

  • Start at the default of 500 for a typical marketing site or small documentation corpus.
  • Push toward 1000–2000 for large docs sites or blogs. Expect the first crawl to take longer.
  • The hard ceiling is 5000 per bot. If you need more, split content across multiple bots by audience.

The maxPages budget applies across every source on the bot — adding a second source doesn't double it. Sources are covered on the next page.
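Since maxPages must be an integer between 1 and 5000, it can be worth validating the value locally before sending a PATCH. A minimal sketch — validate_max_pages is a hypothetical helper for illustration, not part of the API:

```shell
# Hypothetical pre-flight check: reject non-integers and values
# outside the documented 1–5000 range before calling the API.
validate_max_pages() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;   # not a non-negative integer
  esac
  if [ "$1" -ge 1 ] && [ "$1" -le 5000 ]; then
    echo "ok"
  else
    echo "out-of-range"; return 1
  fi
}

validate_max_pages 1000   # prints "ok"
```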

respectRobotsTxt — when to turn it off

Leave this at true unless you own the site and have a specific reason. Disabling it is useful when:

  • Your staging site blocks all crawlers via robots.txt but you want MimicBot to index it.
  • You control the site and serve it from an environment where robots.txt is misconfigured.

Never disable this for sites you don't own.
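If you do need to disable it for a staging site you own, a PATCH along these lines would do it (a sketch using the endpoint shown below; {botId} is your bot's ID):

```shell
# Disable robots.txt enforcement for a staging bot you own.
# Only respectRobotsTxt changes; other crawlConfig fields are untouched.
curl -X PATCH https://api.mimicbot.app/api/bots/{botId} \
  -H "Authorization: Bearer $MIMICBOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"crawlConfig": {"respectRobotsTxt": false}}'
```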

A note on schedule

The schedule enum accepts manual, daily, and weekly, but only manual is wired end-to-end today. Setting daily or weekly is accepted and stored, but no automatic re-crawl will fire yet. Trigger re-crawls via the API or dashboard until scheduled crawls ship in a future release.

Updating crawl behavior

curl -X PATCH https://api.mimicbot.app/api/bots/{botId} \
  -H "Authorization: Bearer $MIMICBOT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlConfig": {
      "maxPages": 1000,
      "respectRobotsTxt": true
    }
  }'

The PATCH endpoint shallow-merges crawlConfig — fields you omit keep their previous values.
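Shallow merge means keys present in your PATCH body overwrite the stored values, while omitted keys are left alone. You can simulate the assumed semantics locally with jq — this is a sketch, not the actual server code:

```shell
# Stored config vs. a partial PATCH body: jq's object addition (+)
# performs exactly a shallow merge — right-hand keys win.
stored='{"schedule":"manual","maxPages":500,"respectRobotsTxt":true}'
patch='{"maxPages":1000}'
printf '%s %s' "$stored" "$patch" | jq -cs '.[0] + .[1]'
```

Only maxPages changes; schedule and respectRobotsTxt keep their stored values.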

Triggering a manual re-crawl

curl -X POST https://api.mimicbot.app/api/bots/{botId}/crawl \
  -H "Authorization: Bearer $MIMICBOT_TOKEN"

This enqueues a Temporal workflow that re-crawls every source on the bot. The bot's status flips to indexing for the duration and back to ready when the new pages are embedded. Track progress under GET /api/bots/{botId}/crawl-jobs.
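A crawl on a large site can take a while, so one way to wait for completion is to poll the crawl-jobs endpoint. A sketch, assuming BOT_ID is set and each job in the response carries a status field — the exact response shape isn't documented on this page:

```shell
# Poll until the newest crawl job reports a terminal status.
# The per-job "status" field and its values are assumptions for illustration.
while true; do
  status=$(curl -s "https://api.mimicbot.app/api/bots/$BOT_ID/crawl-jobs" \
    -H "Authorization: Bearer $MIMICBOT_TOKEN" | jq -r '.[0].status')
  echo "crawl job status: $status"
  [ "$status" != "running" ] && break
  sleep 10
done
```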

Next

→ Sources