- Transform URLs to their raw-content equivalents (e.g., GitHub blob -> raw) before sending them to the crawler (see the sketch after this list)
- Maintain a mapping dictionary so results are reported under their original URLs
- Align progress callback signatures between batch and recursive strategies
- Add a safety guard for a missing links attribute on crawl results (a sketch follows the summary below)
- Remove unused loop counter in batch strategy
- Deduplicate binary file checks so each URL is checked only once
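
A minimal sketch of the blob -> raw rewrite and the mapping that preserves original URLs, assuming a Python crawler; the helper names (`to_raw_url`, `prepare_urls`) and the regex are illustrative assumptions, not the project's actual API:

```python
import re

# Matches https://github.com/<owner>/<repo>/blob/<branch>/<path>
GITHUB_BLOB_RE = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/blob/(.+)$")

def to_raw_url(url: str) -> str:
    """Rewrite a GitHub blob URL to its raw.githubusercontent.com equivalent."""
    match = GITHUB_BLOB_RE.match(url)
    if not match:
        return url  # leave non-GitHub (or non-blob) URLs untouched
    owner, repo, path = match.groups()
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{path}"

def prepare_urls(urls: list[str]) -> tuple[list[str], dict[str, str]]:
    """Return the URLs to crawl plus a raw -> original mapping for reporting."""
    url_mapping: dict[str, str] = {}
    crawl_urls: list[str] = []
    for original in urls:
        raw = to_raw_url(original)
        url_mapping[raw] = original  # remember the URL the caller submitted
        crawl_urls.append(raw)
    return crawl_urls, url_mapping
```

After crawling, results keyed by the raw URL can be re-keyed through `url_mapping` so callers still see the URLs they originally submitted.
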
This ensures GitHub files are crawled as raw content instead of HTML pages,
fixing the issue where content extraction was degraded by the HTML wrapping of the blob view.
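
For the links safety guard, a minimal sketch assuming crawl results expose an optional `links` attribute shaped like `{"internal": [...], "external": [...]}`; the result shape and the helper name are assumptions:

```python
def extract_internal_links(result) -> list[str]:
    """Return internal link hrefs from a crawl result, tolerating a missing attribute."""
    links = getattr(result, "links", None) or {}          # guard: attribute may be absent
    internal = links.get("internal", []) if isinstance(links, dict) else []
    return [link.get("href", "") for link in internal if link.get("href")]
```

The recursive strategy can call a helper like this instead of touching `result.links` directly, so a result without links no longer raises.
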