
Recrawl (Force Fresh GitHub Fetch)

Recrawl tells the crawler to throw away its “this file is already indexed” tracking and fetch repos and docs from GitHub again. Use it when content is gone from Meilisearch and PostgreSQL doesn’t have it either.

The crawler ships doc content (markdown files in your repos) directly to Meilisearch. PostgreSQL only stores crawl-tracking metadata: which file was last seen, what its SHA was, whether it’s already indexed. So if Meilisearch loses data, Backfill can rebuild entities, repositories, K8s workloads, and molds from PostgreSQL — but docs are gone. Recrawl is the recovery path for docs.
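For orientation, here is a hypothetical sketch of the two tracking tables, inferred from the column names this page mentions. Only indexed, status, and next_crawl_at are confirmed names; every other column, and all the types, are illustrative assumptions:

-- Hypothetical sketch of the crawl-tracking schema; not the real DDL.
CREATE TABLE document_crawl_status (
    tenant_id  uuid    NOT NULL,   -- RLS scoping
    repository text    NOT NULL,   -- owner/repo
    path       text    NOT NULL,   -- file path within the repo
    sha        text,               -- last-seen blob SHA
    indexed    boolean NOT NULL DEFAULT false
);

CREATE TABLE repository_crawl_status (
    tenant_id     uuid NOT NULL,
    repository    text NOT NULL,
    status        text NOT NULL,   -- e.g. 'pending'
    next_crawl_at timestamptz
);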

Reach for Recrawl when:

  • Meilisearch was wiped or corrupted, and you need docs back
  • Doc content in search is stale or missing entire repos
  • After a Meilisearch upgrade where you didn’t import a dump

If only metadata is missing (entities, repositories, etc.), use Backfill — it’s faster and doesn’t hit the GitHub API.

The endpoint runs three steps:

  1. Sets document_crawl_status.indexed = false for the caller’s tenant. The crawler short-circuits on indexed = true rows, so this unblocks doc re-indexing.
  2. Sets repository_crawl_status.next_crawl_at = NOW() and status = 'pending' so the scheduler picks the repos up immediately.
  3. Calls the crawler over gRPC to start a doc crawl right away. Without this kick the next scheduler tick can be up to an hour out.

Both DB updates run in a single transaction under the caller’s tenant context (RLS-scoped). The crawler runs the actual fetch async.
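A rough SQL equivalent of steps 1 and 2, assuming the table and column names above. RLS scopes both statements to the caller’s tenant, and the real queries also apply the repository filter when one is supplied:

BEGIN;

-- Step 1: unblock doc re-indexing
UPDATE document_crawl_status
SET indexed = false
WHERE indexed = true;

-- Step 2: make the scheduler pick the repos up immediately
UPDATE repository_crawl_status
SET next_crawl_at = NOW(),
    status = 'pending';

COMMIT;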

To recrawl everything in the caller’s tenant:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json"

Response on success:

{
  "message": "Recrawl started",
  "scope": "tenant",
  "reset_documents": 7629,
  "reset_repositories": 57,
  "operation_id": "recrawl-f6b08153-..."
}

reset_documents counts rows where indexed was flipped from true to false. On a second call right after the first, this number is 0 because nothing else needs flipping.
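So an immediate second call returns something like this (values illustrative; only the reset_documents behavior is guaranteed by the semantics above):

{
  "message": "Recrawl started",
  "scope": "tenant",
  "reset_documents": 0,
  "reset_repositories": 57,
  "operation_id": "recrawl-..."
}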

To limit the recrawl to specific repositories, pass a repositories filter:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"repositories": ["adaptive-labs/demo-1", "adaptive-labs/docs-site"]}'

Response:

{
  "message": "Recrawl started",
  "scope": "filtered",
  "repositories": ["adaptive-labs/demo-1", "adaptive-labs/docs-site"],
  "reset_documents": 412,
  "reset_repositories": 2,
  "operation_id": "recrawl-..."
}

Repository names use owner/repo form. Validation enforces a permissive regex (letters, digits, dots, hyphens, underscores) and rejects anything else with 400 INVALID_REQUEST.
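For example, a name containing a space is rejected. The 400 status and INVALID_REQUEST code are from the rule above; the exact shape of the error body is an assumption:

curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{"repositories": ["adaptive-labs/bad repo"]}'

# HTTP/1.1 400 Bad Request
# {"error": "INVALID_REQUEST", "message": "invalid repository name: adaptive-labs/bad repo"}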

Limit                        Value
Max request body size        64 KiB
Max repositories per call    500
Max repository name length   255 chars

If you need to recrawl more than 500 repos, omit the repositories field — that resets every repo in the caller’s tenant in one call.

Code   Meaning
202    Reset committed. Crawler triggered (or queued for the next tick if the trigger failed).
400    Invalid body, invalid repo name, or repo cap exceeded.
401    No tenant context. The token is missing or malformed.
403    Token isn’t authorized for admin:write.
500    Reset failed. Check API logs.

If the immediate gRPC trigger to the crawler fails (network blip, crawler down), the response still returns 202 but with:

{
  "message": "Recrawl state reset; immediate trigger failed (will run on next scheduler tick)",
  "trigger_error": "could not reach crawler service"
}

The DB state is reset regardless, so the crawler picks the work up on its next periodic tick (within an hour). You don’t need to retry — but you can if you want immediate execution.

The DB reset only affects the caller’s tenant. RLS enforces this at the database layer, so even an attempt to specify another tenant’s repository would resolve to zero rows.

The crawler-side trigger currently runs against the crawler’s process tenant context. In a single-tenant-per-crawler deployment this is correct. Multi-tenant crawler deployments are not yet supported by this endpoint — track this on the platform repo if you need it.

Recrawl issues fresh GitHub API calls. A whole-tenant recrawl with 1,000 repos can consume a few thousand requests against your GitHub rate limit. Watch the crawler’s Rate limits refreshed log lines if you’re close to the cap.
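How you tail those lines depends on your deployment. Assuming the crawler runs as a Kubernetes Deployment named crawler (a hypothetical name for illustration):

# Follow crawler logs and surface the rate-limit refresh lines
kubectl logs deploy/crawler -f | grep "Rate limits refreshed"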

There’s no per-tenant cooldown today — be deliberate about when you call this.

After a Meilisearch wipe, the recovery sequence is:

# 1. Rebuild entity/repo/k8s/mold indexes from PostgreSQL (fast, no GitHub calls)
curl -X POST https://your-domain/api/v1/admin/backfill \
  -H "Authorization: Bearer <admin-token>"

# 2. Force docs to re-fetch from GitHub (slower, hits rate limit)
curl -X POST https://your-domain/api/v1/admin/recrawl \
  -H "Authorization: Bearer <admin-token>"

Run them in that order. Backfill is fast and doesn’t depend on the crawler. Recrawl can take minutes to hours depending on how many docs you have.
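To watch the recovery progress, you can poll Meilisearch’s index stats endpoint and watch the document count climb. This assumes the docs land in an index named docs and that you have a Meilisearch API key; both are assumptions, so substitute your actual index uid and key:

curl -H "Authorization: Bearer <meili-api-key>" \
  https://your-meilisearch:7700/indexes/docs/stats

# => {"numberOfDocuments": 7629, "isIndexing": true, "fieldDistribution": {...}}

Once isIndexing settles to false and numberOfDocuments stops growing, the recrawl has caught up.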