---
title: Archon Knowledge Features
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import Admonition from '@theme/Admonition';
# 🧠 Archon Knowledge: Complete Feature Reference
<div className="hero hero--secondary">
<div className="container">
<h2 className="hero__subtitle">
Everything you can do with Archon Knowledge: from web crawling to semantic search, document processing to code extraction.
</h2>
</div>
</div>
## 🎯 Core Knowledge Management Features
### Content Acquisition
- **Smart Web Crawling**: Intelligent crawling with automatic content type detection
- **Document Upload**: Support for PDF, Word, Markdown, and text files
- **Sitemap Processing**: Crawl entire documentation sites efficiently
- **Recursive Crawling**: Follow links to specified depth
- **Batch Processing**: Handle multiple URLs simultaneously
### Content Processing
- **Smart Chunking**: Intelligent text splitting preserving context
- **Contextual Embeddings**: Enhanced embeddings for better retrieval
- **Code Extraction**: Automatic detection and indexing of code examples
  - Only extracts substantial code blocks (≥ 1000 characters by default)
  - Supports markdown code blocks (triple backticks)
  - Falls back to HTML extraction for `<pre>` and `<code>` tags
  - Generates AI summaries for each code example
- **Metadata Extraction**: Headers, sections, and document structure
- **Source Management**: Track and manage content sources
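
The code-extraction rules above can be illustrated with a minimal sketch. This is not Archon's actual implementation; the function name, the regex, and the `MIN_CODE_LENGTH` constant are illustrative stand-ins for the documented behavior (fenced blocks, ≥ 1000 characters by default):

```python
import re

MIN_CODE_LENGTH = 1000  # matches the documented default threshold

def extract_code_blocks(markdown: str, min_length: int = MIN_CODE_LENGTH) -> list[str]:
    """Find triple-backtick fenced blocks and keep only substantial ones."""
    # Match a fence with an optional language tag, capturing the body lazily.
    pattern = re.compile(r"```[\w+-]*\n(.*?)```", re.DOTALL)
    blocks = [m.group(1).strip() for m in pattern.finditer(markdown)]
    return [b for b in blocks if len(b) >= min_length]
```

A real extractor would additionally fall back to parsing HTML `<pre>`/`<code>` tags when no fenced blocks are found, as described above.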
### Search & Retrieval
- **Semantic Search**: Vector similarity search with pgvector
- **Code Search**: Specialized search for code examples
- **Source Filtering**: Search within specific sources
- **Result Reranking**: Optional ML-based result reranking
- **Hybrid Search**: Combine keyword and semantic search
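
Semantic search boils down to ranking stored chunks by vector similarity to the query embedding (in production this is a pgvector query; the pure-Python sketch below, with hypothetical names, just shows the scoring idea):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3):
    """Rank (text, embedding) chunks by similarity to the query; keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Source filtering and faceted search correspond to restricting the candidate chunk set before ranking; reranking re-scores this top-k list with a heavier ML model.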
## 🤖 AI Integration Features
### MCP Tools Available
<div className="row">
<div className="col col--6">
<div className="card">
<div className="card__header">
<h4>Knowledge Tools</h4>
</div>
<div className="card__body">
<ul>
<li><code>perform_rag_query</code> - Semantic search</li>
<li><code>search_code_examples</code> - Code-specific search</li>
<li><code>crawl_single_page</code> - Index one page</li>
<li><code>smart_crawl_url</code> - Intelligent crawling</li>
<li><code>get_available_sources</code> - List sources</li>
<li><code>upload_document</code> - Process documents</li>
<li><code>delete_source</code> - Remove content</li>
</ul>
</div>
</div>
</div>
<div className="col col--6">
<div className="card">
<div className="card__header">
<h4>AI Capabilities</h4>
</div>
<div className="card__body">
<ul>
<li>AI can search your entire knowledge base</li>
<li>Automatic query refinement for better results</li>
<li>Context-aware answer generation</li>
<li>Code example recommendations</li>
<li>Source citation in responses</li>
</ul>
</div>
</div>
</div>
</div>
## 🎨 User Interface Features
### Knowledge Table View
- **Source Overview**: See all indexed sources at a glance
- **Document Count**: Track content volume per source
- **Last Updated**: Monitor content freshness
- **Quick Actions**: Delete or refresh sources
- **Search Interface**: Test queries directly in UI
### Upload Interface
- **Drag & Drop**: Easy file uploads
- **Progress Tracking**: Real-time processing status
- **Batch Upload**: Multiple files simultaneously
- **Format Detection**: Automatic file type handling
### Crawl Interface
- **URL Input**: Simple URL entry with validation
- **Crawl Options**: Configure crawl depth and URL filters
- **Progress Monitoring**: Live crawl status via Socket.IO
- **Stop Functionality**: Cancel crawls immediately with proper cleanup
- **Result Preview**: See crawled content immediately
## 🔧 Technical Capabilities
### Backend Services
#### RAG Services
- **Crawling Service**: Web page crawling and parsing
- **Document Storage Service**: Chunking and storage
- **Search Service**: Query processing and retrieval
- **Source Management Service**: Source metadata handling
#### Embedding Services
- **Embedding Service**: OpenAI embeddings with rate limiting
- **Contextual Embedding Service**: Context-aware embeddings
#### Storage Services
- **Document Storage Service**: Parallel document processing
- **Code Storage Service**: Code extraction and indexing
### Processing Pipeline
1. **Content Acquisition**: Crawl or upload
2. **Text Extraction**: Parse HTML/PDF/DOCX
3. **Smart Chunking**: Split into optimal chunks
4. **Embedding Generation**: Create vector embeddings
5. **Storage**: Save to Supabase with pgvector
6. **Indexing**: Update search indices
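
Steps 3–6 of the pipeline can be sketched as follows. The overlap-based splitter is a deliberately simplified stand-in for Archon's smart chunking, and `embed`/`store` are hypothetical callables representing the embedding and Supabase storage services:

```python
from typing import Callable

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so context is
    preserved across chunk boundaries (simplified 'smart' chunking)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def ingest(text: str, embed: Callable, store: Callable) -> int:
    """Chunk the text, embed each chunk, and hand (chunk, vector) to storage."""
    chunks = chunk_text(text)
    for chunk in chunks:
        store(chunk, embed(chunk))
    return len(chunks)
```

In the real pipeline, contextual embedding enriches each chunk with surrounding document context before the embedding call, and storage writes to Supabase with pgvector.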
### Performance Features
- **Rate Limiting**: Throttles embedding requests to stay under OpenAI's 200k tokens/minute limit
- **Parallel Processing**: ThreadPoolExecutor optimization
- **Batch Operations**: Process multiple documents efficiently
- **Socket.IO Progress**: Real-time status updates
- **Caching**: Intelligent result caching
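
The rate-limiting idea can be sketched as a simple per-window token budget (class and method names are illustrative, not Archon's actual service API):

```python
import time

class TokenRateLimiter:
    """Track token usage against a per-minute quota, e.g. 200,000 tokens/min."""

    def __init__(self, tokens_per_minute: int = 200_000):
        self.capacity = tokens_per_minute
        self.window = 60.0
        self.used = 0
        self.window_start = time.monotonic()

    def acquire(self, tokens: int) -> float:
        """Reserve `tokens`; return how many seconds the caller should sleep first."""
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # A full window has elapsed: reset the budget.
            self.used = 0
            self.window_start = now
        if self.used + tokens > self.capacity:
            # Over budget: wait out the rest of the window, then start fresh.
            wait = max(self.window - (now - self.window_start), 0.0)
            self.used = tokens
            self.window_start = now + wait
            return wait
        self.used += tokens
        return 0.0
```

A caller would `time.sleep()` for the returned duration before issuing the embedding request.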
### Crawl Cancellation
- **Immediate Response**: Stop button provides instant UI feedback
- **Proper Cleanup**: Cancels both orchestration service and asyncio tasks
- **State Persistence**: Uses localStorage to prevent zombie crawls on refresh
- **Socket.IO Events**: Real-time cancellation status via `crawl:stopping` and `crawl:stopped`
- **Resource Management**: Ensures all server resources are properly released
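
On the server side, cancelling an asyncio task is what makes the Stop button take effect between crawl steps. The sketch below (with hypothetical function names) shows the cooperative pattern: each `await` is a cancellation point, and cleanup runs before the `CancelledError` propagates:

```python
import asyncio

async def crawl_pages(urls, on_progress):
    """Long-running crawl that cooperates with cancellation."""
    done = []
    try:
        for url in urls:
            await asyncio.sleep(0)  # stand-in for the real async fetch
            done.append(url)
            on_progress(url)
    except asyncio.CancelledError:
        # Cleanup happens here, before cancellation reaches the caller.
        on_progress("cancelled")
        raise
    return done

async def stop_after_first(urls):
    events = []
    task = asyncio.create_task(crawl_pages(urls, events.append))
    await asyncio.sleep(0)   # let the crawl start
    task.cancel()            # server-side equivalent of the Stop button
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events
```

In Archon the same moment is when the `crawl:stopping` and `crawl:stopped` Socket.IO events would be emitted to the UI.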
## 🚀 Advanced Features
### Content Types
#### Web Crawling
- **Sitemap Support**: Process sitemap.xml files
- **Text File Support**: Direct .txt file processing
- **Markdown Support**: Native .md file handling
- **Dynamic Content**: JavaScript-rendered pages
- **Authentication**: Basic auth support
#### Document Processing
- **PDF Extraction**: Full text and metadata
- **Word Documents**: .docx and .doc support
- **Markdown Files**: Preserve formatting
- **Plain Text**: Simple text processing
- **Code Files**: Syntax-aware processing
### Search Features
- **Query Enhancement**: Automatic query expansion
- **Contextual Results**: Include surrounding context
- **Code Highlighting**: Syntax highlighting in results
- **Similarity Scoring**: Relevance scores for ranking
- **Faceted Search**: Filter by metadata
### Monitoring & Analytics
- **Logfire Integration**: Real-time performance monitoring
- **Search Analytics**: Track popular queries
- **Content Analytics**: Usage patterns
- **Performance Metrics**: Query times, throughput
### Debugging Tips
#### Code Extraction Not Working?
1. **Check code block size**: Only blocks ≥ 1000 characters are extracted
2. **Verify markdown format**: Code must be fenced with triple backticks
3. **Check crawl results**: Look for `backtick_count` in logs
4. **HTML fallback**: System will try HTML if markdown has no code blocks
#### Crawling Issues?
1. **Timeouts**: The system uses simple crawling without complex waits
2. **JavaScript sites**: Content should render without special interactions
3. **Progress stuck**: Check logs for batch processing updates
4. **Different content types**:
- `.txt` files: Direct text extraction
- `sitemap.xml`: Batch crawl all URLs
- Regular URLs: Recursive crawl with depth limit
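
The content-type dispatch above amounts to routing on the URL's path. A minimal sketch (the strategy names are illustrative, not Archon's internals):

```python
from urllib.parse import urlparse

def classify_url(url: str) -> str:
    """Route a URL to the documented crawl strategy."""
    path = urlparse(url).path.lower()
    if path.endswith(".txt"):
        return "direct_text"      # .txt: extract the text directly
    if path.endswith("sitemap.xml"):
        return "sitemap_batch"    # sitemap.xml: batch-crawl all listed URLs
    return "recursive_crawl"      # anything else: recursive crawl with depth limit
```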
## 📊 Usage Examples
### Crawling Documentation
```
AI Assistant: "Crawl the React documentation"
```
The system will:
1. Detect it's a documentation site
2. Find and process the sitemap
3. Crawl all pages with progress updates
4. Extract code examples
5. Create searchable chunks
### Searching Knowledge
```
AI Assistant: "Find information about React hooks"
```
The search will:
1. Generate embeddings for the query
2. Search across all React documentation
3. Return relevant chunks with context
4. Include code examples
5. Provide source links
### Document Upload
```
User uploads "system-design.pdf"
```
The system will:
1. Extract text from PDF
2. Identify sections and headers
3. Create smart chunks
4. Generate embeddings
5. Make searchable immediately
## 🔗 Real-time Features
### Socket.IO Progress
- **Crawl Progress**: Percentage completion with smooth updates
- **Current URL**: What's being processed in real-time
- **Batch Updates**: Processing batches with accurate progress
- **Stop Support**: Cancel crawls with immediate feedback
- **Error Reporting**: Failed URLs with detailed errors
- **Completion Status**: Final statistics and results
### Live Updates
```javascript
// Connect to Socket.IO and subscribe to progress
import { io } from 'socket.io-client';

const socket = io('/crawl');
socket.emit('subscribe', { progress_id: 'uuid' });

// Handle progress updates
socket.on('progress_update', (data) => {
  console.log(`Progress: ${data.percentage}%`);
  console.log(`Status: ${data.status}`);
  console.log(`Current URL: ${data.currentUrl}`);
  console.log(`Pages: ${data.processedPages}/${data.totalPages}`);
});

// Handle completion
socket.on('progress_complete', (data) => {
  console.log('Crawl completed!');
  console.log(`Chunks stored: ${data.chunksStored}`);
  console.log(`Word count: ${data.wordCount}`);
});

// Stop a crawl
socket.emit('crawl_stop', { progress_id: 'uuid' });

// Handle stop events
socket.on('crawl:stopping', (data) => {
  console.log('Crawl is stopping...');
});
socket.on('crawl:stopped', (data) => {
  console.log('Crawl cancelled successfully');
});
```
## 🔗 Related Documentation
- [Knowledge Overview](./knowledge-overview) - High-level introduction
- [API Reference](./api-reference#knowledge-management-api) - Detailed API documentation
- [MCP Tools Reference](./mcp-tools#knowledge-tools) - MCP tool specifications
- [RAG Agent Documentation](./agent-rag) - How the AI searches knowledge
- [Server Services](./server-services#rag-services) - Technical service details