Fix sitemap URL detection to require .xml extension (#611)

Resolves an issue where URLs containing 'sitemap' in the path (such as
https://nx.dev/see-also/sitemap) were incorrectly treated as XML
sitemaps, causing XML parsing errors.

- Changed detection to require both .xml extension AND 'sitemap' in path
- Fixes XML parsing error: "not well-formed (invalid token)"
- Maintains compatibility with existing test cases
- Now correctly identifies only actual XML sitemap files

Fixes #607

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
DIY Smart Code 2025-09-12 17:07:22 +02:00 committed by GitHub
parent 3d5753f8a7
commit ce2f871ebb
GPG Key ID: B5690EEEBB952194

@@ -29,7 +29,10 @@ class URLHandler:
             True if URL is a sitemap, False otherwise
         """
         try:
-            return url.endswith("sitemap.xml") or "sitemap" in urlparse(url).path
+            parsed = urlparse(url)
+            path = parsed.path.lower()
+            # Only match URLs that end with .xml and contain sitemap in the filename
+            return path.endswith(".xml") and "sitemap" in path
         except Exception as e:
             logger.warning(f"Error checking if URL is sitemap: {e}")
             return False
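
The behavior change in this hunk can be sketched as a standalone comparison. This is a minimal illustration, not the project's code: the function names `is_sitemap_old` and `is_sitemap_new` are hypothetical stand-ins for the method body before and after the change, the `example.com` URLs are made up, and the try/except and logging from the original are omitted.

```python
from urllib.parse import urlparse


def is_sitemap_old(url: str) -> bool:
    """Check as it was before this commit (hypothetical name)."""
    return url.endswith("sitemap.xml") or "sitemap" in urlparse(url).path


def is_sitemap_new(url: str) -> bool:
    """Check as it is after this commit (hypothetical name)."""
    # The query string and fragment are not part of .path, so they are ignored.
    path = urlparse(url).path.lower()
    return path.endswith(".xml") and "sitemap" in path


# Illustrative URLs; only the nx.dev one comes from the commit message.
examples = [
    "https://nx.dev/see-also/sitemap",
    "https://example.com/sitemap.xml",
    "https://example.com/sitemap.xml?page=2",
    "https://example.com/SITEMAP_index.XML",
    "https://example.com/sitemap/feed.xml",
]

for url in examples:
    print(f"{url}: old={is_sitemap_old(url)} new={is_sitemap_new(url)}")
```

Two points the sketch makes visible: lowercasing the path means an upper-case `SITEMAP_index.XML` now matches, and because the substring test runs against the whole path, a URL such as `/sitemap/feed.xml` still matches even though 'sitemap' appears in a directory name and not in the filename.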