Module: pipeline/crawlDiff

AUTO-002 diff-aware crawling primitive. Compares the current crawl's page snapshots against the persisted baseline map and classifies each URL into added / changed / unchanged / removed buckets.

Fingerprinting reuses stateFingerprint.js (no new hashing scheme) so a page's fingerprint is stable across the state-explorer and link-crawl discovery paths.

Methods

(static) buildPageFingerprint(snapshot) → {string}

Parameters:
Name Type Description
snapshot object

page snapshot { url, title, elements[], ... }.

Returns:

content-addressed fingerprint for the page.

Type
string
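
Example (illustrative sketch only): the hashing scheme in stateFingerprint.js is not shown in this reference, so the snippet below assumes a SHA-256 over a canonical JSON form of the snapshot, with elements treated as plain strings. The name sketchPageFingerprint and the example URLs are hypothetical. It only demonstrates what "content-addressed" means here: equal content yields an equal fingerprint, and any content change yields a different one.

```javascript
// Hypothetical sketch; the real buildPageFingerprint delegates to
// stateFingerprint.js, whose scheme may differ from this one.
const { createHash } = require("node:crypto");

function sketchPageFingerprint(snapshot) {
  // Serialize the content-defining fields in a fixed key order so that
  // equal content always produces the same string.
  const canonical = JSON.stringify({
    url: snapshot.url,
    title: snapshot.title ?? "",
    elements: snapshot.elements ?? [], // assumed to be plain strings here
  });
  return createHash("sha256").update(canonical).digest("hex");
}

const a = sketchPageFingerprint({ url: "https://example.test/", title: "Home", elements: ["h1", "a"] });
const b = sketchPageFingerprint({ url: "https://example.test/", title: "Home", elements: ["h1", "a"] });
const c = sketchPageFingerprint({ url: "https://example.test/", title: "Home v2", elements: ["h1", "a"] });

console.log(a === b); // true: same content, same fingerprint
console.log(a === c); // false: the title changed
```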

(static) diffCrawlSnapshots(previousByUrl, currentSnapshots, [opts]) → {Object}

Classify each URL in the current crawl against the previous baseline.

Parameters:
Name Type Attributes Description
previousByUrl Record.<string, {fingerprint: string}> | null | undefined

URL → baseline row ({ fingerprint, capturedAt, ... }) as returned by crawlBaselineRepo.getMapByProjectId(). null, undefined, and {} all mean "no previous baseline"; in that case every current URL is classified as added.

currentSnapshots Array.<{url: string}> | null | undefined

Raw snapshots from the crawl. null and undefined are treated as a crawl with no URLs.

opts object <optional>
Properties
Name Type Attributes Description
fingerprintOf function <optional>

AUTO-002b: optional override for fingerprint computation. State-mode callers pass a function that returns a pre-computed fingerprint keyed off the original snapshot identity. This is needed because the default, buildPageFingerprint, recomputes from snap.url, which would embed the composite url#fp=<fp> key in the new fingerprint and make every state-mode re-crawl look "changed". Link-crawl callers omit this option and get the default URL-derived fingerprint.

Returns:
Type
Object
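
Example (illustrative sketch only): the snippet below shows one way the four-bucket classification and the fingerprintOf override could behave. It is not the module's code. The function name sketchDiff, the return shape ({ added, changed, unchanged, removed } as arrays of URLs), the stand-in default fingerprint, and all URLs are assumptions made for illustration.

```javascript
// Hypothetical sketch of the classification; the return shape is assumed.
function sketchDiff(previousByUrl, currentSnapshots, opts = {}) {
  const previous = previousByUrl ?? {};   // null / undefined: no baseline
  const snapshots = currentSnapshots ?? []; // null / undefined: no URLs
  // Stand-in default derived from snap.url; the real default is
  // buildPageFingerprint.
  const fingerprintOf =
    opts.fingerprintOf ?? ((snap) => `${snap.url}|${snap.title ?? ""}`);

  const result = { added: [], changed: [], unchanged: [], removed: [] };
  const seen = new Set();
  for (const snap of snapshots) {
    if (seen.has(snap.url)) continue; // classify each URL once
    seen.add(snap.url);
    if (!Object.hasOwn(previous, snap.url)) {
      result.added.push(snap.url);
    } else if (previous[snap.url].fingerprint === fingerprintOf(snap)) {
      result.unchanged.push(snap.url);
    } else {
      result.changed.push(snap.url);
    }
  }
  // Baseline URLs that the current crawl never visited.
  for (const url of Object.keys(previous)) {
    if (!seen.has(url)) result.removed.push(url);
  }
  return result;
}

// Link-crawl style: no override, default fingerprint.
const baseline = {
  "/a": { fingerprint: "/a|A" },
  "/b": { fingerprint: "/b|B" },
  "/gone": { fingerprint: "/gone|Gone" },
};
const current = [
  { url: "/a", title: "A" },
  { url: "/b", title: "B2" },
  { url: "/new", title: "New" },
];
const linkDiff = sketchDiff(baseline, current);
// added: ["/new"], changed: ["/b"], unchanged: ["/a"], removed: ["/gone"]

// State-mode style: the URL is a composite key that already carries the
// fingerprint, so a URL-derived default never matches the stored value.
const stateSnap = { url: "/app#fp=abc", title: "App" };
const stateBaseline = { "/app#fp=abc": { fingerprint: "abc" } };
const precomputed = new Map([[stateSnap, "abc"]]); // keyed by snapshot identity

const defaultStateDiff = sketchDiff(stateBaseline, [stateSnap]);
const overriddenStateDiff = sketchDiff(stateBaseline, [stateSnap], {
  fingerprintOf: (snap) => precomputed.get(snap),
});
console.log(defaultStateDiff.changed);      // [ '/app#fp=abc' ]
console.log(overriddenStateDiff.unchanged); // [ '/app#fp=abc' ]

// No baseline: every current URL is added.
const noBaseline = sketchDiff(null, current);
```

In this sketch the override looks up the fingerprint by object identity in a Map, which mirrors the "keyed off the original snapshot identity" wording above; a real caller may key it differently.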