285 lines
9.7 KiB
Plaintext
285 lines
9.7 KiB
Plaintext
---
|
|
title: Archon Knowledge Features
|
|
sidebar_position: 2
|
|
---
|
|
|
|
import Tabs from '@theme/Tabs';
|
|
import TabItem from '@theme/TabItem';
|
|
import Admonition from '@theme/Admonition';
|
|
|
|
# 🧠 Archon Knowledge: Complete Feature Reference
|
|
|
|
<div className="hero hero--secondary">
|
|
<div className="container">
|
|
<h2 className="hero__subtitle">
|
|
Everything you can do with Archon Knowledge - from web crawling to semantic search, document processing to code extraction.
|
|
</h2>
|
|
</div>
|
|
</div>
|
|
|
|
## 🎯 Core Knowledge Management Features
|
|
|
|
### Content Acquisition
|
|
- **Smart Web Crawling**: Intelligent crawling with automatic content type detection
|
|
- **Document Upload**: Support for PDF, Word, Markdown, and text files
|
|
- **Sitemap Processing**: Crawl entire documentation sites efficiently
|
|
- **Recursive Crawling**: Follow links to specified depth
|
|
- **Batch Processing**: Handle multiple URLs simultaneously
|
|
|
|
### Content Processing
|
|
- **Smart Chunking**: Intelligent text splitting preserving context
|
|
- **Contextual Embeddings**: Enhanced embeddings for better retrieval
|
|
- **Code Extraction**: Automatic detection and indexing of code examples
|
|
- Only extracts substantial code blocks (≥ 1000 characters by default)
|
|
- Supports markdown code blocks (triple backticks)
|
|
- Falls back to HTML extraction for `<pre>` and `<code>` tags
|
|
- Generates AI summaries for each code example
|
|
- **Metadata Extraction**: Headers, sections, and document structure
|
|
- **Source Management**: Track and manage content sources
|
|
|
|
### Search & Retrieval
|
|
- **Semantic Search**: Vector similarity search with pgvector
|
|
- **Code Search**: Specialized search for code examples
|
|
- **Source Filtering**: Search within specific sources
|
|
- **Result Reranking**: Optional ML-based result reranking
|
|
- **Hybrid Search**: Combine keyword and semantic search
|
|
|
|
## 🤖 AI Integration Features
|
|
|
|
### MCP Tools Available
|
|
|
|
<div className="row">
|
|
<div className="col col--6">
|
|
<div className="card">
|
|
<div className="card__header">
|
|
<h4>Knowledge Tools</h4>
|
|
</div>
|
|
<div className="card__body">
|
|
<ul>
|
|
<li><code>perform_rag_query</code> - Semantic search</li>
|
|
<li><code>search_code_examples</code> - Code-specific search</li>
|
|
<li><code>crawl_single_page</code> - Index one page</li>
|
|
<li><code>smart_crawl_url</code> - Intelligent crawling</li>
|
|
<li><code>get_available_sources</code> - List sources</li>
|
|
<li><code>upload_document</code> - Process documents</li>
|
|
<li><code>delete_source</code> - Remove content</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
<div className="col col--6">
|
|
<div className="card">
|
|
<div className="card__header">
|
|
<h4>AI Capabilities</h4>
|
|
</div>
|
|
<div className="card__body">
|
|
<ul>
|
|
<li>AI can search your entire knowledge base</li>
|
|
<li>Automatic query refinement for better results</li>
|
|
<li>Context-aware answer generation</li>
|
|
<li>Code example recommendations</li>
|
|
<li>Source citation in responses</li>
|
|
</ul>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
## 🎨 User Interface Features
|
|
|
|
### Knowledge Table View
|
|
- **Source Overview**: See all indexed sources at a glance
|
|
- **Document Count**: Track content volume per source
|
|
- **Last Updated**: Monitor content freshness
|
|
- **Quick Actions**: Delete or refresh sources
|
|
- **Search Interface**: Test queries directly in UI
|
|
|
|
### Upload Interface
|
|
- **Drag & Drop**: Easy file uploads
|
|
- **Progress Tracking**: Real-time processing status
|
|
- **Batch Upload**: Multiple files simultaneously
|
|
- **Format Detection**: Automatic file type handling
|
|
|
|
### Crawl Interface
|
|
- **URL Input**: Simple URL entry with validation
|
|
- **Crawl Options**: Configure depth, filters
|
|
- **Progress Monitoring**: Live crawl status via Socket.IO
|
|
- **Stop Functionality**: Cancel crawls immediately with proper cleanup
|
|
- **Result Preview**: See crawled content immediately
|
|
|
|
## 🔧 Technical Capabilities
|
|
|
|
### Backend Services
|
|
|
|
#### RAG Services
|
|
- **Crawling Service**: Web page crawling and parsing
|
|
- **Document Storage Service**: Chunking and storage
|
|
- **Search Service**: Query processing and retrieval
|
|
- **Source Management Service**: Source metadata handling
|
|
|
|
#### Embedding Services
|
|
- **Embedding Service**: OpenAI embeddings with rate limiting
|
|
- **Contextual Embedding Service**: Context-aware embeddings
|
|
|
|
#### Storage Services
|
|
- **Document Storage Service**: Parallel document processing
|
|
- **Code Storage Service**: Code extraction and indexing
|
|
|
|
### Processing Pipeline
|
|
1. **Content Acquisition**: Crawl or upload
|
|
2. **Text Extraction**: Parse HTML/PDF/DOCX
|
|
3. **Smart Chunking**: Split into optimal chunks
|
|
4. **Embedding Generation**: Create vector embeddings
|
|
5. **Storage**: Save to Supabase with pgvector
|
|
6. **Indexing**: Update search indices
|
|
|
|
### Performance Features
|
|
- **Rate Limiting**: 200k tokens/minute OpenAI limit
|
|
- **Parallel Processing**: ThreadPoolExecutor optimization
|
|
- **Batch Operations**: Process multiple documents efficiently
|
|
- **Socket.IO Progress**: Real-time status updates
|
|
- **Caching**: Intelligent result caching
|
|
|
|
### Crawl Cancellation
|
|
- **Immediate Response**: Stop button provides instant UI feedback
|
|
- **Proper Cleanup**: Cancels both orchestration service and asyncio tasks
|
|
- **State Persistence**: Uses localStorage to prevent zombie crawls on refresh
|
|
- **Socket.IO Events**: Real-time cancellation status via `crawl:stopping` and `crawl:stopped`
|
|
- **Resource Management**: Ensures all server resources are properly released
|
|
|
|
## 🚀 Advanced Features
|
|
|
|
### Content Types
|
|
|
|
#### Web Crawling
|
|
- **Sitemap Support**: Process sitemap.xml files
|
|
- **Text File Support**: Direct .txt file processing
|
|
- **Markdown Support**: Native .md file handling
|
|
- **Dynamic Content**: JavaScript-rendered pages
|
|
- **Authentication**: Basic auth support
|
|
|
|
#### Document Processing
|
|
- **PDF Extraction**: Full text and metadata
|
|
- **Word Documents**: .docx and .doc support
|
|
- **Markdown Files**: Preserve formatting
|
|
- **Plain Text**: Simple text processing
|
|
- **Code Files**: Syntax-aware processing
|
|
|
|
### Search Features
|
|
- **Query Enhancement**: Automatic query expansion
|
|
- **Contextual Results**: Include surrounding context
|
|
- **Code Highlighting**: Syntax highlighting in results
|
|
- **Similarity Scoring**: Relevance scores for ranking
|
|
- **Faceted Search**: Filter by metadata
|
|
|
|
### Monitoring & Analytics
|
|
- **Logfire Integration**: Real-time performance monitoring
|
|
- **Search Analytics**: Track popular queries
|
|
- **Content Analytics**: Usage patterns
|
|
- **Performance Metrics**: Query times, throughput
|
|
|
|
### Debugging Tips
|
|
|
|
#### Code Extraction Not Working?
|
|
1. **Check code block size**: Only blocks ≥ 1000 characters are extracted
|
|
2. **Verify markdown format**: Code must be in triple backticks (```)
|
|
3. **Check crawl results**: Look for `backtick_count` in logs
|
|
4. **HTML fallback**: System will try HTML if markdown has no code blocks
|
|
|
|
#### Crawling Issues?
|
|
1. **Timeouts**: The system uses simple crawling without complex waits
|
|
2. **JavaScript sites**: Content should render without special interactions
|
|
3. **Progress stuck**: Check logs for batch processing updates
|
|
4. **Different content types**:
|
|
- `.txt` files: Direct text extraction
|
|
- `sitemap.xml`: Batch crawl all URLs
|
|
- Regular URLs: Recursive crawl with depth limit
|
|
|
|
## 📊 Usage Examples
|
|
|
|
### Crawling Documentation
|
|
```
|
|
AI Assistant: "Crawl the React documentation"
|
|
```
|
|
The system will:
|
|
1. Detect it's a documentation site
|
|
2. Find and process the sitemap
|
|
3. Crawl all pages with progress updates
|
|
4. Extract code examples
|
|
5. Create searchable chunks
|
|
|
|
### Searching Knowledge
|
|
```
|
|
AI Assistant: "Find information about React hooks"
|
|
```
|
|
The search will:
|
|
1. Generate embeddings for the query
|
|
2. Search across all React documentation
|
|
3. Return relevant chunks with context
|
|
4. Include code examples
|
|
5. Provide source links
|
|
|
|
### Document Upload
|
|
```
|
|
User uploads "system-design.pdf"
|
|
```
|
|
The system will:
|
|
1. Extract text from PDF
|
|
2. Identify sections and headers
|
|
3. Create smart chunks
|
|
4. Generate embeddings
|
|
5. Make searchable immediately
|
|
|
|
## 🔗 Real-time Features
|
|
|
|
### Socket.IO Progress
|
|
- **Crawl Progress**: Percentage completion with smooth updates
|
|
- **Current URL**: What's being processed in real-time
|
|
- **Batch Updates**: Processing batches with accurate progress
|
|
- **Stop Support**: Cancel crawls with immediate feedback
|
|
- **Error Reporting**: Failed URLs with detailed errors
|
|
- **Completion Status**: Final statistics and results
|
|
|
|
### Live Updates
|
|
```javascript
|
|
// Connect to Socket.IO and subscribe to progress
|
|
import { io } from 'socket.io-client';
|
|
|
|
const socket = io('/crawl');
|
|
socket.emit('subscribe', { progress_id: 'uuid' });
|
|
|
|
// Handle progress updates
|
|
socket.on('progress_update', (data) => {
|
|
console.log(`Progress: ${data.percentage}%`);
|
|
console.log(`Status: ${data.status}`);
|
|
console.log(`Current URL: ${data.currentUrl}`);
|
|
console.log(`Pages: ${data.processedPages}/${data.totalPages}`);
|
|
});
|
|
|
|
// Handle completion
|
|
socket.on('progress_complete', (data) => {
|
|
console.log('Crawl completed!');
|
|
console.log(`Chunks stored: ${data.chunksStored}`);
|
|
console.log(`Word count: ${data.wordCount}`);
|
|
});
|
|
|
|
// Stop a crawl
|
|
socket.emit('crawl_stop', { progress_id: 'uuid' });
|
|
|
|
// Handle stop events
|
|
socket.on('crawl:stopping', (data) => {
|
|
console.log('Crawl is stopping...');
|
|
});
|
|
|
|
socket.on('crawl:stopped', (data) => {
|
|
console.log('Crawl cancelled successfully');
|
|
});
|
|
```
|
|
|
|
## 🔗 Related Documentation
|
|
|
|
- [Knowledge Overview](./knowledge-overview) - High-level introduction
|
|
- [API Reference](./api-reference#knowledge-management-api) - Detailed API documentation
|
|
- [MCP Tools Reference](./mcp-tools#knowledge-tools) - MCP tool specifications
|
|
- [RAG Agent Documentation](./agent-rag) - How the AI searches knowledge
|
|
- [Server Services](./server-services#rag-services) - Technical service details |