Module: database/repositories/crawlBaselineRepo

AUTO-002 persistence layer for per-project page fingerprints. Two write strategies are intentionally exposed:

  • replaceProjectBaselines — full DELETE + re-INSERT. Use only when the caller is certain the new fingerprint set is complete (e.g. after a fresh first-ever crawl), because any URL absent from fingerprints is treated as removed from the site.
  • mergeProjectBaselines — upsert + targeted-delete. Preferred for every diff-aware crawl: a partial crawl (page N fails with a transient 503) won't silently drop page N's baseline and force an unnecessary regen on the next run.
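
The difference between the two strategies can be sketched with an in-memory model. This is an illustration only: the Map stands in for the baseline table, and replaceBaselines / mergeBaselines are hypothetical helpers, not the repository's actual methods or SQL.

```javascript
// In-memory stand-in for the baseline table: Map of pageUrl -> fingerprint.
// Both helpers are hypothetical; the real repository persists to a database.
function replaceBaselines(table, fingerprints) {
  // Full wipe, then re-insert: anything missing from `fingerprints` is lost.
  table.clear();
  for (const [url, fp] of Object.entries(fingerprints)) table.set(url, fp);
}

function mergeBaselines(table, fingerprints, removedPageUrls = []) {
  // Upsert observed pages, then delete only the URLs explicitly listed.
  for (const [url, fp] of Object.entries(fingerprints)) table.set(url, fp);
  for (const url of removedPageUrls) table.delete(url);
}

const previous = { "/a": "fp-a1", "/b": "fp-b1" };
// Partial crawl: "/b" returned a transient 503, so only "/a" was fingerprinted.
const partialCrawl = { "/a": "fp-a2" };

const replaced = new Map(Object.entries(previous));
replaceBaselines(replaced, partialCrawl);
console.log(replaced.has("/b")); // prints: false (baseline for "/b" was dropped)

const merged = new Map(Object.entries(previous));
mergeBaselines(merged, partialCrawl);
console.log(merged.get("/b")); // prints: fp-b1 (baseline for "/b" survives)
```

With the replace strategy, the next crawl sees "/b" as a brand-new page; with the merge strategy, its old fingerprint is still available for comparison.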

Methods

(static) mergeProjectBaselines(projectId, fingerprints, [removedPageUrls])

Upsert the current crawl's fingerprints into the baseline table without wiping pages that were not observed this time. Baseline rows for removedPageUrls (URLs the diff reported as removedPages) are explicitly deleted. This is the only path that drops a baseline row, and it requires the caller to prove the URL is genuinely gone: absent from the current crawl AND present in the previous baseline. Transient failures that produce a subset crawl do not hit this branch, because their URLs never reach the removedPages list.
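
One way a caller might build the removal list is sketched below. The source names diffCrawlSnapshots but does not show it, so findRemovedPageUrls, its signature, and the failedUrls parameter are all assumptions made for illustration. The sketch treats a URL as removed only if it is present in the previous baseline, absent from the current crawl, and not known to have failed transiently.

```javascript
// Hypothetical helper, NOT the real diffCrawlSnapshots.
// previousBaseline / currentFingerprints: plain objects of pageUrl -> fingerprint.
// failedUrls: URLs whose fetch failed transiently (assumed to be tracked by the crawler).
function findRemovedPageUrls(previousBaseline, currentFingerprints, failedUrls = []) {
  const failed = new Set(failedUrls);
  const observed = (url) =>
    Object.prototype.hasOwnProperty.call(currentFingerprints, url);
  // Start from the previous baseline so every candidate is "present before".
  return Object.keys(previousBaseline).filter(
    (url) => !observed(url) && !failed.has(url)
  );
}

const previousBaseline = { "/a": "fp-a1", "/b": "fp-b1", "/c": "fp-c1" };
const currentFingerprints = { "/a": "fp-a2" };

// "/b" failed with a transient error; "/c" was not found by the crawl at all.
const removedPageUrls = findRemovedPageUrls(previousBaseline, currentFingerprints, ["/b"]);
console.log(removedPageUrls); // prints: [ '/c' ]
```

Under these assumptions only "/c" would be passed as removedPageUrls, so the baseline row for "/b" stays in place until a later crawl observes the page again.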

Parameters:
Name             Type                     Attributes  Description
projectId        string
fingerprints     Record.<string, string>              URL → new fingerprint for pages observed in the current crawl.
removedPageUrls  Array.<string>           <optional>  URLs classified as removedPages by diffCrawlSnapshots. Optional; defaults to none.
