Module: utils/robotsSitemap

Fetches and parses robots.txt rules and sitemap.xml URLs for crawl compliance. Zero external dependencies — uses the global fetch() API available in Node 18+.

Design decisions

  • Only the Sentri and * user-agent groups are evaluated.
  • Allow directives take precedence over Disallow when both match at the same specificity (longest prefix wins), matching Google's interpretation.
  • Sitemap parsing handles both <sitemapindex> (recursive) and <urlset> formats. Gzip sitemaps are NOT supported (would require zlib); they are silently skipped.
  • All network errors are swallowed — a missing or unreachable robots.txt means "allow everything", per the standard.

Exports

  • loadRobotsRules — fetch + parse robots.txt → rules object
  • isAllowed — check a URL against parsed rules
  • loadSitemapUrls — fetch + parse sitemap.xml → URL list
  • parseRobotsTxt — parse raw robots.txt text → rules object
  • parseSitemapXml — extract URLs from a sitemap XML string
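
A typical crawl-compliance flow chains the loaders together. A minimal sketch, assuming the module lives at ./utils/robotsSitemap.js:

    // Sketch: a typical crawl-compliance flow. The import path is an assumption.
    import { loadRobotsRules, isAllowed, loadSitemapUrls } from './utils/robotsSitemap.js';

    const base = 'https://example.com';
    const rules = await loadRobotsRules(base, { timeoutMs: 5000 });
    const pageUrls = await loadSitemapUrls(base, rules.sitemaps, { maxUrls: 200 });

    // Crawl only the sitemap URLs that robots.txt permits.
    const crawlable = pageUrls.filter((u) => isAllowed(u, rules));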

Methods

(static) isAllowed(url, robotsRules) → {boolean}

Check whether a URL is allowed by the parsed robots.txt rules.

Uses longest-prefix matching: the rule whose pattern is the longest prefix of the URL path wins. If no rule matches, the URL is allowed (default).

Parameters:
  • url (string) — full URL to check
  • robotsRules (RobotsRules) — from loadRobotsRules

Returns: boolean
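
For example, with rules produced by parseRobotsTxt (documented below), the longest matching prefix decides; note the default-allow fallthrough:

    const rules = parseRobotsTxt(
      'User-agent: *\nDisallow: /private/\nAllow: /private/docs/'
    );

    isAllowed('https://example.com/private/a.html', rules);      // false: /private/ is the longest match
    isAllowed('https://example.com/private/docs/a.html', rules); // true: /private/docs/ is longer
    isAllowed('https://example.com/about.html', rules);          // true: no rule matches, default allow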

(static) loadRobotsRules(baseUrl, opts) → {Promise.<RobotsRules>}

Fetch and parse robots.txt from a base URL.

Parameters:
  • baseUrl (string) — site origin (e.g. "https://example.com")
  • opts (object, optional):
      • timeoutMs (number, optional, default 5000) — fetch timeout in milliseconds
Returns: Promise.<RobotsRules>
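
A minimal sketch of the documented behavior, assuming AbortSignal.timeout (Node 18+) for the timeout; per the design notes, every failure path degrades to empty rules, i.e. "allow everything":

    // Sketch only; mirrors the documented error-swallowing design.
    async function loadRobotsRules(baseUrl, opts = {}) {
      const { timeoutMs = 5000 } = opts;
      const empty = { rules: [], sitemaps: [] };
      try {
        const res = await fetch(new URL('/robots.txt', baseUrl), {
          signal: AbortSignal.timeout(timeoutMs),
        });
        if (!res.ok) return empty; // missing robots.txt: allow everything
        return parseRobotsTxt(await res.text());
      } catch {
        return empty; // unreachable host or timeout: allow everything
      }
    }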

(static) loadSitemapUrls(baseUrl, declaredSitemaps, opts) → {Promise.<Array.<string>>}

Fetch and parse sitemap URLs from a base URL.

Tries URLs declared in robots.txt Sitemap: directives first, then falls back to the conventional /sitemap.xml location. Follows one level of sitemap index indirection.

Parameters:
  • baseUrl (string) — site origin
  • declaredSitemaps (Array.<string>, optional) — Sitemap URLs from robots.txt
  • opts (object, optional):
      • timeoutMs (number, optional, default 5000) — per-fetch timeout in milliseconds
      • maxUrls (number, optional, default 200) — cap to avoid memory issues on huge sitemaps

Returns: Promise.<Array.<string>> — deduplicated list of page URLs
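
Typical usage feeds the sitemaps declared in robots.txt straight into discovery (a sketch using the documented defaults):

    const rules = await loadRobotsRules('https://example.com');
    const pages = await loadSitemapUrls('https://example.com', rules.sitemaps, {
      timeoutMs: 5000,
      maxUrls: 200,
    });
    console.log(`discovered ${pages.length} page URLs`);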

(static) parseRobotsTxt(text) → {RobotsRules}

Parse raw robots.txt content into structured rules.

Only rules for User-agent: Sentri or User-agent: * are kept. The Sentri-specific group takes priority if present.

Parameters:
  • text (string) — raw robots.txt content

Returns: RobotsRules
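
For example, when both a Sentri group and a wildcard group are present, the Sentri group wins:

    const rules = parseRobotsTxt([
      'User-agent: *',
      'Disallow: /',
      '',
      'User-agent: Sentri',
      'Disallow: /admin/',
      'Sitemap: https://example.com/sitemap.xml',
    ].join('\n'));

    // rules.rules reflects only the Sentri group (Disallow: /admin/);
    // rules.sitemaps is ['https://example.com/sitemap.xml'].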

(static) parseSitemapXml(xml) → {Object}

Extract URLs from a sitemap XML string.

Handles both <urlset> (leaf sitemap) and <sitemapindex> (index pointing to child sitemaps). Uses regex extraction instead of a full XML parser to avoid adding a dependency.

Parameters:
  • xml (string) — raw sitemap XML content

Returns: Object
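
A regex-based extraction consistent with this description might look like the sketch below; the return shape { urls, childSitemaps } is an assumption, since only {Object} is documented:

    // Sketch; the { urls, childSitemaps } shape is an assumption.
    function parseSitemapXml(xml) {
      const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map((m) => m[1].trim());
      return /<sitemapindex[\s>]/i.test(xml)
        ? { urls: [], childSitemaps: locs }  // index: <loc> entries are child sitemaps
        : { urls: locs, childSitemaps: [] }; // leaf <urlset>: <loc> entries are page URLs
    }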

Type Definitions

RobotsRules

Type: Object

Properties:
  • rules (Array.<{pattern: string, allow: boolean}>) — sorted longest-first
  • sitemaps (Array.<string>) — Sitemap URLs declared in robots.txt

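An illustrative value (assumed data; note the longest-first ordering of rules):

    const example = {
      rules: [
        { pattern: '/private/docs/', allow: true }, // longer pattern listed first, so it wins
        { pattern: '/private/', allow: false },
      ],
      sitemaps: ['https://example.com/sitemap.xml'],
    };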