Fetches and parses robots.txt rules and sitemap.xml URLs for
crawl compliance. Zero external dependencies — uses the global fetch() API
available in Node 18+.
Design decisions
- Only the `Sentri` and `*` user-agent groups are evaluated. `Allow` directives take precedence over `Disallow` when both match at the same specificity (longest prefix wins), matching Google's interpretation.
- Sitemap parsing handles both `<sitemapindex>` (recursive) and `<urlset>` formats. Gzip sitemaps are NOT supported (that would require `zlib`); they are silently skipped.
- All network errors are swallowed: a missing or unreachable robots.txt means "allow everything", per the standard.
Exports
- `loadRobotsRules`: fetch + parse robots.txt → rules object
- `isAllowed`: check a URL against parsed rules
- `loadSitemapUrls`: fetch + parse sitemap.xml → URL list
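A minimal end-to-end sketch of how the three exports fit together (the import path `./robots.js` is an assumption, not part of this doc):

```js
// Sketch: check crawl permission and collect sitemap URLs for one site.
import { loadRobotsRules, isAllowed, loadSitemapUrls } from './robots.js';

const origin = 'https://example.com';

// A missing or unreachable robots.txt degrades to "allow everything".
const rules = await loadRobotsRules(origin);

// Gate individual URLs before crawling them.
if (isAllowed(`${origin}/pricing`, rules)) {
  console.log('OK to crawl /pricing');
}

// Sitemap URLs declared in robots.txt are tried first, then /sitemap.xml.
const pageUrls = await loadSitemapUrls(origin, rules.sitemaps);
console.log(`Found ${pageUrls.length} page URLs`);
```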
Methods
(static) isAllowed(url, robotsRules) → {boolean}
Check whether a URL is allowed by the parsed robots.txt rules.
Uses longest-prefix matching: the rule whose pattern is the longest prefix of the URL path wins. If no rule matches, the URL is allowed (default).
Parameters:
| Name | Type | Description |
|---|---|---|
| url | string | full URL to check |
| robotsRules | RobotsRules | parsed rules from loadRobotsRules() |
Returns:
- Type
- boolean
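For illustration, a hand-built rules object (hypothetical patterns) showing how longest-prefix matching and the allow-by-default fallback play out:

```js
// Hypothetical rules: /private/ is disallowed, but /private/docs/ is re-allowed.
// Per the RobotsRules typedef, rules are sorted longest-first.
const rules = {
  rules: [
    { pattern: '/private/docs/', allow: true },
    { pattern: '/private/',      allow: false },
  ],
  sitemaps: [],
};

isAllowed('https://example.com/private/docs/guide', rules); // true  (longest matching prefix allows)
isAllowed('https://example.com/private/admin', rules);      // false (matched only by /private/)
isAllowed('https://example.com/about', rules);              // true  (no rule matches, default allow)
```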
(static) loadRobotsRules(baseUrl, [opts]) → {Promise.<RobotsRules>}
Fetch and parse robots.txt from a base URL.
Parameters:
| Name | Type | Attributes | Description |
|---|---|---|---|
| baseUrl | string | | site origin (e.g. "https://example.com") |
| opts | object | <optional> | |
Returns:
- Type
- Promise.<RobotsRules>
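The doc does not show the implementation or the `opts` properties, but the documented behaviour amounts to roughly this sketch (the function name is illustrative only):

```js
// Rough sketch of the documented behaviour, not the actual implementation:
// fetch /robots.txt from the origin and parse it; any network error or
// non-OK response degrades to empty rules, i.e. "allow everything".
async function loadRobotsRulesSketch(baseUrl) {
  try {
    const res = await fetch(new URL('/robots.txt', baseUrl));
    if (!res.ok) return { rules: [], sitemaps: [] };
    return parseRobotsTxt(await res.text());
  } catch {
    return { rules: [], sitemaps: [] };
  }
}
```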
(static) loadSitemapUrls(baseUrl, [declaredSitemaps], [opts]) → {Promise.<Array.<string>>}
Fetch and parse sitemap URLs from a base URL.
Tries URLs declared in robots.txt Sitemap: directives first, then falls
back to the conventional /sitemap.xml location. Follows one level of
sitemap index indirection.
Parameters:
| Name | Type | Attributes | Description |
|---|---|---|---|
| baseUrl | string | | site origin |
| declaredSitemaps | Array.<string> | <optional> | Sitemap URLs from robots.txt |
| opts | object | <optional> | |
Returns:
— deduplicated list of page URLs
- Type
- Promise.<Array.<string>>
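A usage sketch showing the declared-sitemaps-first behaviour (the sitemap URL in the comment is invented):

```js
const rules = await loadRobotsRules('https://example.com');

// rules.sitemaps might contain e.g. 'https://example.com/sitemap-pages.xml';
// if it is empty, the conventional /sitemap.xml location is tried instead.
const urls = await loadSitemapUrls('https://example.com', rules.sitemaps);

// One level of <sitemapindex> indirection is followed, and the result is a
// deduplicated list of page URLs.
console.log(urls.length, urls.slice(0, 5));
```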
(static) parseRobotsTxt(text) → {RobotsRules}
Parse raw robots.txt content into structured rules.
Only rules for User-agent: Sentri or User-agent: * are kept.
The Sentri-specific group takes priority if present.
Parameters:
| Name | Type | Description |
|---|---|---|
| text | string | raw robots.txt content |
Returns:
- Type
- RobotsRules
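A hypothetical robots.txt and the result shape the RobotsRules typedef implies:

```js
const text = `
User-agent: *
Disallow: /private/
Allow: /private/docs/

Sitemap: https://example.com/sitemap.xml
`;

const parsed = parseRobotsTxt(text);
// Expected shape per the RobotsRules typedef (rules sorted longest-first):
// {
//   rules: [
//     { pattern: '/private/docs/', allow: true },
//     { pattern: '/private/',      allow: false },
//   ],
//   sitemaps: ['https://example.com/sitemap.xml'],
// }
```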
(static) parseSitemapXml(xml) → {Object}
Extract URLs from a sitemap XML string.
Handles both <urlset> (leaf sitemap) and <sitemapindex> (index pointing
to child sitemaps). Uses regex extraction instead of a full XML parser to
avoid adding a dependency.
Parameters:
| Name | Type | Description |
|---|---|---|
| xml | string | raw sitemap XML content |
Returns:
- Type
- Object
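Illustrative inputs for the two formats handled; the doc only specifies that the return type is Object, so the comments describe what is extracted rather than the exact shape:

```js
// Leaf sitemap: <urlset> with <loc> entries, yielding page URLs.
const leaf = `<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Sitemap index: <sitemapindex> pointing at child sitemaps, yielding the
// child sitemap URLs so the caller can follow one level of indirection.
const index = `<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>`;

parseSitemapXml(leaf);  // page URLs from the <urlset>
parseSitemapXml(index); // child sitemap URLs from the <sitemapindex>
```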
Type Definitions
RobotsRules
Type:
- Object
Properties:
| Name | Type | Description |
|---|---|---|
| rules | Array.<{pattern: string, allow: boolean}> | sorted longest-first |
| sitemaps | Array.<string> | Sitemap URLs declared in robots.txt |