
Recrawl (Force Fresh GitHub Fetch)

Recrawl tells the crawler to throw away its “this file is already indexed” tracking and fetch repos and docs from GitHub again. Use it when content is gone from Meilisearch and PostgreSQL doesn’t have it either.

The crawler ships doc content (markdown files in your repos) directly to Meilisearch. PostgreSQL only stores crawl-tracking metadata: which file was last seen, what its SHA was, whether it’s already indexed. So if Meilisearch loses data, Backfill can rebuild entities, repositories, K8s workloads, and molds from PostgreSQL — but docs are gone. Recrawl is the recovery path for docs.
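For orientation, here is a hypothetical sketch of the two tracking tables, inferred from the column names this page mentions. Only indexed, status, and next_crawl_at are confirmed names; every other column, and all the types, are illustrative assumptions:

-- Hypothetical sketch of the crawl-tracking schema; not the real DDL.
CREATE TABLE document_crawl_status (
    tenant_id  uuid    NOT NULL,   -- RLS scoping
    repository text    NOT NULL,   -- owner/repo
    path       text    NOT NULL,   -- file path within the repo
    sha        text,               -- last-seen blob SHA
    indexed    boolean NOT NULL DEFAULT false
);

CREATE TABLE repository_crawl_status (
    tenant_id     uuid NOT NULL,
    repository    text NOT NULL,
    status        text NOT NULL,   -- e.g. 'pending'
    next_crawl_at timestamptz
);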

Reach for Recrawl when:

  • Meilisearch was wiped or corrupted, and you need docs back
  • Doc content in search is stale or missing entire repos
  • After a Meilisearch upgrade where you didn’t import a dump

If only metadata is missing (entities, repositories, etc.), use Backfill — it’s faster and doesn’t hit the GitHub API.

The endpoint runs three steps:

  1. Sets document_crawl_status.indexed = false for the caller’s tenant. The crawler short-circuits on indexed = true rows, so this unblocks doc re-indexing.
  2. Sets repository_crawl_status.next_crawl_at = NOW() and status = 'pending' so the scheduler picks the repos up immediately.
  3. Calls the crawler over gRPC to start a doc crawl right away. Without this kick the next scheduler tick can be up to an hour out.

Both DB updates run in a single transaction under the caller’s tenant context (RLS-scoped). The crawler runs the actual fetch async.
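A rough SQL equivalent of steps 1 and 2, assuming the table and column names above. RLS scopes both statements to the caller’s tenant, and the real queries also apply the repository filter when one is supplied:

BEGIN;

-- Step 1: unblock doc re-indexing
UPDATE document_crawl_status
SET indexed = false
WHERE indexed = true;

-- Step 2: make the scheduler pick the repos up immediately
UPDATE repository_crawl_status
SET next_crawl_at = NOW(),
    status = 'pending';

COMMIT;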

To recrawl everything in the caller’s tenant:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json"

Response on success:

{
  "message": "Recrawl started",
  "scope": "tenant",
  "reset_documents": 7629,
  "reset_repositories": 57,
  "operation_id": "recrawl-f6b08153-..."
}

reset_documents counts rows where indexed was flipped from true to false. On a second call right after the first, this number is 0 because nothing else needs flipping.
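So an immediate second call returns something like this (values illustrative; only the reset_documents behavior is guaranteed by the semantics above):

{
  "message": "Recrawl started",
  "scope": "tenant",
  "reset_documents": 0,
  "reset_repositories": 57,
  "operation_id": "recrawl-..."
}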

To limit the recrawl to specific repositories, pass a repositories filter:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"repositories": ["adaptive-labs/demo-1", "adaptive-labs/docs-site"]}'

Response:

{
  "message": "Recrawl started",
  "scope": "filtered",
  "repositories": ["adaptive-labs/demo-1", "adaptive-labs/docs-site"],
  "reset_documents": 412,
  "reset_repositories": 2,
  "operation_id": "recrawl-..."
}

Repository names use owner/repo form. Validation enforces a permissive regex (letters, digits, dots, hyphens, underscores) and rejects anything else with 400 INVALID_REQUEST.
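For example, a name containing a space is rejected. The 400 status and INVALID_REQUEST code are from the rule above; the exact shape of the error body is an assumption:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"repositories": ["adaptive-labs/bad repo"]}'

# HTTP/1.1 400 Bad Request
# {"error": "INVALID_REQUEST", "message": "invalid repository name: adaptive-labs/bad repo"}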

Limit                        Value
Max request body size        64 KiB
Max repositories per call    500
Max repository name length   255 chars

If you need to recrawl more than 500 repos, omit the repositories field — that resets every repo in the caller’s tenant in one call.

Code   Meaning
202    Reset committed. Crawler triggered (or queued for the next tick if the trigger failed).
400    Invalid body, invalid repo name, or repo cap exceeded.
401    No tenant context. The token is missing or malformed.
403    Token isn’t authorized for admin:write.
500    Reset failed. Check API logs.

If the immediate gRPC trigger to the crawler fails (network blip, crawler down), the response still returns 202 but with:

{
  "message": "Recrawl state reset; immediate trigger failed (will run on next scheduler tick)",
  "trigger_error": "could not reach crawler service"
}

The DB state is reset regardless, so the crawler picks the work up on its next periodic tick (within an hour). You don’t need to retry — but you can if you want immediate execution.

The DB reset only affects the caller’s tenant. RLS enforces this at the database layer, so even an attempt to specify another tenant’s repository would resolve to zero rows.

The crawler-side trigger currently runs against the crawler’s process tenant context. In a single-tenant-per-crawler deployment this is correct. Multi-tenant crawler deployments are not yet supported by this endpoint — track this on the platform repo if you need it.

Recrawl issues fresh GitHub API calls. A whole-tenant recrawl with 1,000 repos can consume a few thousand requests against your GitHub rate limit. Watch the crawler’s Rate limits refreshed log lines if you’re close to the cap.
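How you tail those lines depends on your deployment. Assuming the crawler runs as a Kubernetes Deployment named crawler (a hypothetical name for illustration):

# Follow crawler logs and surface the rate-limit refresh lines
kubectl logs deploy/crawler -f | grep "Rate limits refreshed"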

There’s no per-tenant cooldown today — be deliberate about when you call this.

After a Meilisearch wipe, the recovery sequence is:

# 1. Rebuild entity/repo/k8s/mold indexes from PostgreSQL (fast, no GitHub calls)
curl -X POST https://your-domain/api/v1/admin/backfill \
  -H "Authorization: Bearer <admin-token>"

# 2. Force docs to re-fetch from GitHub (slower, hits rate limit)
curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>"

Run them in that order. Backfill is fast and doesn’t depend on the crawler. Recrawl can take minutes to hours depending on how many docs you have.
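To watch the recovery progress, you can poll Meilisearch’s index stats endpoint and watch the document count climb. This assumes the docs land in an index named docs and that you have a Meilisearch API key; both are assumptions, so substitute your actual index uid and key:

curl -H "Authorization: Bearer <meili-api-key>" \
  https://your-meilisearch:7700/indexes/docs/stats

# => {"numberOfDocuments": 7629, "isIndexing": true, "fieldDistribution": {...}}

Once isIndexing settles to false and numberOfDocuments stops growing, the recrawl has caught up.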