5 Commits

Author SHA1 Message Date
Rasmus Widing
8792a1b0dd Fix crawler timeout for JavaScript-heavy documentation sites
Remove wait_for='body' selector from documentation site crawling config.
The body element exists immediately in HTML, causing unnecessary timeouts
for JavaScript-rendered content. Now relies on domcontentloaded event
and delay_before_return_html for proper JavaScript execution.
2025-08-22 08:56:03 +03:00
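The change described in this commit might be sketched as a small settings builder. This is only an illustration: the function name `docs_site_crawl_kwargs`, the `wait_until` key, and the 0.5 second default are assumptions. Only `wait_for`, `domcontentloaded`, and `delay_before_return_html` come from the commit message.

```python
from typing import Any


def docs_site_crawl_kwargs(render_delay_s: float = 0.5) -> dict[str, Any]:
    """Build hypothetical crawl settings for a JavaScript-heavy docs site."""
    return {
        # Hand control back once the initial HTML has been parsed
        # (the "wait_until" key name is an assumption for this sketch).
        "wait_until": "domcontentloaded",
        # Then pause so client-side scripts have time to render content
        # (0.5 s is an illustrative value, not the project's setting).
        "delay_before_return_html": render_delay_s,
        # Deliberately no "wait_for": "body" entry; the commit removes it.
    }


kwargs = docs_site_crawl_kwargs()
```

The resulting dictionary could be passed as keyword arguments to whatever configuration object the crawler uses, provided that object accepts these names.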
Wirasm
8743c059bb Merge pull request #218 from coleam00/fix/filter-binary-files-from-crawl
Fix crawler attempting to navigate to binary files
2025-08-16 00:39:17 +03:00
Rasmus Widing
8157670936 Fix crawler attempting to navigate to binary files
- Add is_binary_file() method to URLHandler to detect 40+ binary extensions
- Update RecursiveCrawlStrategy to filter binary URLs before crawl queue
- Add comprehensive unit tests for binary file detection
- Prevents net::ERR_ABORTED errors when crawler encounters ZIP, PDF, etc.

This fixes the issue where the crawler was treating binary file URLs
(like .zip downloads) as navigable web pages, causing errors in crawl4ai.
2025-08-15 17:24:46 +03:00
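A minimal sketch of the kind of check this commit describes, assuming detection is based on the file extension in the URL path. The extension set below is a small illustrative subset (the commit mentions 40+ extensions but does not list them), and `filter_crawlable` is a hypothetical helper standing in for the filtering step in `RecursiveCrawlStrategy`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative subset only; the real list is not given in the commit message.
BINARY_EXTENSIONS = frozenset(
    {".zip", ".pdf", ".tar", ".gz", ".exe", ".png", ".jpg", ".mp4"}
)


def is_binary_file(url: str) -> bool:
    """Return True if the URL path ends in a known binary extension."""
    # Use only the path so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in BINARY_EXTENSIONS


def filter_crawlable(urls: list[str]) -> list[str]:
    """Drop binary URLs before they reach the crawl queue (hypothetical helper)."""
    return [url for url in urls if not is_binary_file(url)]
```

For example, `filter_crawlable` keeps `https://example.com/docs/intro` and drops `https://example.com/files/archive.ZIP?dl=1`, because the extension comparison is case-insensitive and ignores the query string.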
Rasmus Widing
e98f52aa57 Address code review feedback: improve error handling and documentation
- Implement fail-fast error handling for configuration errors
- Distinguish between critical config errors (fail) and network issues (use defaults)
- Add detailed error logging with stack traces for debugging
- Document new crawler settings in .env.example
- Add inline comments explaining safe defaults

Critical configuration errors (ValueError, KeyError, TypeError) now fail fast
as per alpha principles, while transient errors still fall back to safe defaults
with prominent error logging.
2025-08-15 16:02:00 +03:00
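The split between fail-fast and fall-back handling described in this commit could look roughly like the sketch below. The function name `load_crawler_settings`, the setting keys, and the default values are all hypothetical. Only the exception classes and the overall policy come from the commit message.

```python
import logging

logger = logging.getLogger("crawler.config")

# Hypothetical safe defaults used when settings cannot be fetched at all.
SAFE_DEFAULTS = {"page_timeout_ms": 30000, "batch_size": 50}


def load_crawler_settings(fetch_settings):
    """Load settings via the fetch_settings callable and validate them."""
    try:
        raw = fetch_settings()
        return {
            "page_timeout_ms": int(raw["page_timeout_ms"]),
            "batch_size": int(raw["batch_size"]),
        }
    except (ValueError, KeyError, TypeError):
        # Critical configuration error: log with stack trace, then fail fast.
        logger.error("Invalid crawler configuration", exc_info=True)
        raise
    except Exception:
        # Transient problem (for example, the settings store is unreachable):
        # log prominently and continue with safe defaults.
        logger.error("Could not load crawler settings; using safe defaults", exc_info=True)
        return dict(SAFE_DEFAULTS)
```

With this shape, a malformed value such as `"abc"` for `page_timeout_ms` raises `ValueError` to the caller, while a `ConnectionError` from the fetch step yields a copy of `SAFE_DEFAULTS`.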
Cole Medin
59084036f6 The New Archon (Beta) - The Operating System for AI Coding Assistants!
2025-08-13 07:58:24 -05:00