Module: utils/robotsSitemap

Fetches and parses robots.txt rules and sitemap.xml URLs for crawl compliance. Zero external dependencies — uses the global fetch() API available in Node 18+.

Design decisions

  • Only the Sentri and * user-agent groups are evaluated.
  • Allow directives take precedence over Disallow when both match at the same specificity (longest prefix wins), matching Google's interpretation.
  • Sitemap parsing handles both <sitemapindex> (recursive) and <urlset> formats. Gzip sitemaps are NOT supported (would require zlib); they are silently skipped.
  • All network errors are swallowed — a missing or unreachable robots.txt means "allow everything", per the standard.

Exports

  • loadRobotsRules — fetch + parse robots.txt → rules object
  • isAllowed — check a URL against parsed rules
  • loadSitemapUrls — fetch + parse sitemap.xml → URL list
  • parseRobotsTxt — parse raw robots.txt text → rules object
  • parseSitemapXml — extract URLs from a sitemap XML string
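
A typical crawl-compliance flow chains the loaders together. A minimal sketch, assuming the module lives at ./utils/robotsSitemap.js:

    // Sketch: a typical crawl-compliance flow. The import path is an assumption.
    import { loadRobotsRules, isAllowed, loadSitemapUrls } from './utils/robotsSitemap.js';

    const base = 'https://example.com';
    const rules = await loadRobotsRules(base, { timeoutMs: 5000 });
    const pageUrls = await loadSitemapUrls(base, rules.sitemaps, { maxUrls: 200 });

    // Crawl only the sitemap URLs that robots.txt permits.
    const crawlable = pageUrls.filter((u) => isAllowed(u, rules));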

Methods

(static) isAllowed(url, robotsRules) → {boolean}

Check whether a URL is allowed by the parsed robots.txt rules.

Uses longest-prefix matching: the rule whose pattern is the longest prefix of the URL path wins. If no rule matches, the URL is allowed (default).

Parameters:
  • url (string) — full URL to check
  • robotsRules (RobotsRules) — from loadRobotsRules

Returns: boolean
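
For example, with rules produced by parseRobotsTxt (documented below), the longest matching prefix decides; note the default-allow fallthrough:

    const rules = parseRobotsTxt(
      'User-agent: *\nDisallow: /private/\nAllow: /private/docs/'
    );

    isAllowed('https://example.com/private/a.html', rules);      // false: /private/ is the longest match
    isAllowed('https://example.com/private/docs/a.html', rules); // true: /private/docs/ is longer
    isAllowed('https://example.com/about.html', rules);          // true: no rule matches, default allow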

(static) loadRobotsRules(baseUrl, opts) → {Promise.<RobotsRules>}

Fetch and parse robots.txt from a base URL.

Parameters:
  • baseUrl (string) — site origin (e.g. "https://example.com")
  • opts (object, optional):
      • timeoutMs (number, optional, default 5000) — fetch timeout in milliseconds
Returns: Promise.<RobotsRules>
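
A minimal sketch of the documented behavior, assuming AbortSignal.timeout (Node 18+) for the timeout; per the design notes, every failure path degrades to empty rules, i.e. "allow everything":

    // Sketch only; mirrors the documented error-swallowing design.
    async function loadRobotsRules(baseUrl, opts = {}) {
      const { timeoutMs = 5000 } = opts;
      const empty = { rules: [], sitemaps: [] };
      try {
        const res = await fetch(new URL('/robots.txt', baseUrl), {
          signal: AbortSignal.timeout(timeoutMs),
        });
        if (!res.ok) return empty; // missing robots.txt: allow everything
        return parseRobotsTxt(await res.text());
      } catch {
        return empty; // unreachable host or timeout: allow everything
      }
    }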

(static) loadSitemapUrls(baseUrl, declaredSitemaps, opts) → {Promise.<Array.<string>>}

Fetch and parse sitemap URLs from a base URL.

Tries URLs declared in robots.txt Sitemap: directives first, then falls back to the conventional /sitemap.xml location. Follows one level of sitemap index indirection.

Parameters:
  • baseUrl (string) — site origin
  • declaredSitemaps (Array.<string>, optional) — Sitemap URLs from robots.txt
  • opts (object, optional):
      • timeoutMs (number, optional, default 5000) — per-fetch timeout in milliseconds
      • maxUrls (number, optional, default 200) — cap to avoid memory issues on huge sitemaps

Returns: Promise.<Array.<string>> — deduplicated list of page URLs
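
Typical usage feeds the sitemaps declared in robots.txt straight into discovery (a sketch using the documented defaults):

    const rules = await loadRobotsRules('https://example.com');
    const pages = await loadSitemapUrls('https://example.com', rules.sitemaps, {
      timeoutMs: 5000,
      maxUrls: 200,
    });
    console.log(`discovered ${pages.length} page URLs`);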

(static) parseRobotsTxt(text) → {RobotsRules}

Parse raw robots.txt content into structured rules.

Only rules for User-agent: Sentri or User-agent: * are kept. The Sentri-specific group takes priority if present.

Parameters:
  • text (string) — raw robots.txt content

Returns: RobotsRules
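
For example, when both a Sentri group and a wildcard group are present, the Sentri group wins:

    const rules = parseRobotsTxt([
      'User-agent: *',
      'Disallow: /',
      '',
      'User-agent: Sentri',
      'Disallow: /admin/',
      'Sitemap: https://example.com/sitemap.xml',
    ].join('\n'));

    // rules.rules reflects only the Sentri group (Disallow: /admin/);
    // rules.sitemaps is ['https://example.com/sitemap.xml'].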

(static) parseSitemapXml(xml) → {Object}

Extract URLs from a sitemap XML string.

Handles both <urlset> (leaf sitemap) and <sitemapindex> (index pointing to child sitemaps). Uses regex extraction instead of a full XML parser to avoid adding a dependency.

Parameters:
  • xml (string) — raw sitemap XML content

Returns: Object
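
A regex-based extraction consistent with this description might look like the sketch below; the return shape { urls, childSitemaps } is an assumption, since only {Object} is documented:

    // Sketch; the { urls, childSitemaps } shape is an assumption.
    function parseSitemapXml(xml) {
      const locs = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/gi)].map((m) => m[1].trim());
      return /<sitemapindex[\s>]/i.test(xml)
        ? { urls: [], childSitemaps: locs }  // index: <loc> entries are child sitemaps
        : { urls: locs, childSitemaps: [] }; // leaf <urlset>: <loc> entries are page URLs
    }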

Type Definitions

RobotsRules

Type: Object

Properties:
  • rules (Array.<{pattern: string, allow: boolean}>) — sorted longest-first
  • sitemaps (Array.<string>) — Sitemap URLs declared in robots.txt

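An illustrative value (assumed data; note the longest-first ordering of rules):

    const example = {
      rules: [
        { pattern: '/private/docs/', allow: true }, // longer pattern listed first, so it wins
        { pattern: '/private/', allow: false },
      ],
      sitemaps: ['https://example.com/sitemap.xml'],
    };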