---
title: Archon Knowledge Features
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import Admonition from '@theme/Admonition';
# 🧠 Archon Knowledge: Complete Feature Reference
<div className="hero hero--secondary">
<div className="container">
<h2 className="hero__subtitle">
Everything you can do with Archon Knowledge: from web crawling to semantic search, document processing to code extraction.
</h2>
</div>
</div>
## 🎯 Core Knowledge Management Features
### Content Acquisition
- **Smart Web Crawling**: Intelligent crawling with automatic content type detection
- **Document Upload**: Support for PDF, Word, Markdown, and text files
- **Sitemap Processing**: Crawl entire documentation sites efficiently
- **Recursive Crawling**: Follow links to specified depth
- **Batch Processing**: Handle multiple URLs simultaneously
### Content Processing
- **Smart Chunking**: Intelligent text splitting preserving context
- **Contextual Embeddings**: Enhanced embeddings for better retrieval
- **Code Extraction**: Automatic detection and indexing of code examples
  - Only extracts substantial code blocks (≥ 1000 characters by default)
  - Supports markdown code blocks (triple backticks)
  - Falls back to HTML extraction for `<pre>` and `<code>` tags
  - Generates AI summaries for each code example
- **Metadata Extraction**: Headers, sections, and document structure
- **Source Management**: Track and manage content sources
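
The code-extraction rules above can be illustrated with a minimal sketch. This is not Archon's actual implementation; the function name, the regex, and the `MIN_CODE_LENGTH` constant are illustrative stand-ins for the documented behavior (fenced blocks, ≥ 1000 characters by default):

```python
import re

MIN_CODE_LENGTH = 1000  # matches the documented default threshold

def extract_code_blocks(markdown: str, min_length: int = MIN_CODE_LENGTH) -> list[str]:
    """Find triple-backtick fenced blocks and keep only substantial ones."""
    # Match a fence with an optional language tag, capturing the body lazily.
    pattern = re.compile(r"```[\w+-]*\n(.*?)```", re.DOTALL)
    blocks = [m.group(1).strip() for m in pattern.finditer(markdown)]
    return [b for b in blocks if len(b) >= min_length]
```

A real extractor would additionally fall back to parsing HTML `<pre>`/`<code>` tags when no fenced blocks are found, as described above.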
### Search & Retrieval
- **Semantic Search**: Vector similarity search with pgvector
- **Code Search**: Specialized search for code examples
- **Source Filtering**: Search within specific sources
- **Result Reranking**: Optional ML-based result reranking
- **Hybrid Search**: Combine keyword and semantic search
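
Semantic search boils down to ranking stored chunks by vector similarity to the query embedding (in production this is a pgvector query; the pure-Python sketch below, with hypothetical names, just shows the scoring idea):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3):
    """Rank (text, embedding) chunks by similarity to the query; keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Source filtering and faceted search correspond to restricting the candidate chunk set before ranking; reranking re-scores this top-k list with a heavier ML model.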
## 🤖 AI Integration Features
### MCP Tools Available
<div className="row">
<div className="col col--6">
<div className="card">
<div className="card__header">
<h4>Knowledge Tools</h4>
</div>
<div className="card__body">
<ul>
<li><code>perform_rag_query</code> - Semantic search</li>
<li><code>search_code_examples</code> - Code-specific search</li>
<li><code>crawl_single_page</code> - Index one page</li>
<li><code>smart_crawl_url</code> - Intelligent crawling</li>
<li><code>get_available_sources</code> - List sources</li>
<li><code>upload_document</code> - Process documents</li>
<li><code>delete_source</code> - Remove content</li>
</ul>
</div>
</div>
</div>
<div className="col col--6">
<div className="card">
<div className="card__header">
<h4>AI Capabilities</h4>
</div>
<div className="card__body">
<ul>
<li>AI can search your entire knowledge base</li>
<li>Automatic query refinement for better results</li>
<li>Context-aware answer generation</li>
<li>Code example recommendations</li>
<li>Source citation in responses</li>
</ul>
</div>
</div>
</div>
</div>
## 🎨 User Interface Features
### Knowledge Table View
- **Source Overview**: See all indexed sources at a glance
- **Document Count**: Track content volume per source
- **Last Updated**: Monitor content freshness
- **Quick Actions**: Delete or refresh sources
- **Search Interface**: Test queries directly in UI
### Upload Interface
- **Drag & Drop**: Easy file uploads
- **Progress Tracking**: Real-time processing status
- **Batch Upload**: Multiple files simultaneously
- **Format Detection**: Automatic file type handling
### Crawl Interface
- **URL Input**: Simple URL entry with validation
- **Crawl Options**: Configure crawl depth and URL filters
- **Progress Monitoring**: Live crawl status via Socket.IO
- **Stop Functionality**: Cancel crawls immediately with proper cleanup
- **Result Preview**: See crawled content immediately
## 🔧 Technical Capabilities
### Backend Services
#### RAG Services
- **Crawling Service**: Web page crawling and parsing
- **Document Storage Service**: Chunking and storage
- **Search Service**: Query processing and retrieval
- **Source Management Service**: Source metadata handling
#### Embedding Services
- **Embedding Service**: OpenAI embeddings with rate limiting
- **Contextual Embedding Service**: Context-aware embeddings
#### Storage Services
- **Document Storage Service**: Parallel document processing
- **Code Storage Service**: Code extraction and indexing
### Processing Pipeline
1. **Content Acquisition**: Crawl or upload
2. **Text Extraction**: Parse HTML/PDF/DOCX
3. **Smart Chunking**: Split into optimal chunks
4. **Embedding Generation**: Create vector embeddings
5. **Storage**: Save to Supabase with pgvector
6. **Indexing**: Update search indices
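
Steps 3–6 of the pipeline can be sketched as follows. The overlap-based splitter is a deliberately simplified stand-in for Archon's smart chunking, and `embed`/`store` are hypothetical callables representing the embedding and Supabase storage services:

```python
from typing import Callable

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so context is
    preserved across chunk boundaries (simplified 'smart' chunking)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def ingest(text: str, embed: Callable, store: Callable) -> int:
    """Chunk the text, embed each chunk, and hand (chunk, vector) to storage."""
    chunks = chunk_text(text)
    for chunk in chunks:
        store(chunk, embed(chunk))
    return len(chunks)
```

In the real pipeline, contextual embedding enriches each chunk with surrounding document context before the embedding call, and storage writes to Supabase with pgvector.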
### Performance Features
- **Rate Limiting**: Throttles embedding requests to stay under OpenAI's 200k tokens/minute limit
- **Parallel Processing**: ThreadPoolExecutor optimization
- **Batch Operations**: Process multiple documents efficiently
- **Socket.IO Progress**: Real-time status updates
- **Caching**: Intelligent result caching
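
The rate-limiting idea can be sketched as a simple per-window token budget (class and method names are illustrative, not Archon's actual service API):

```python
import time

class TokenRateLimiter:
    """Track token usage against a per-minute quota, e.g. 200,000 tokens/min."""

    def __init__(self, tokens_per_minute: int = 200_000):
        self.capacity = tokens_per_minute
        self.window = 60.0
        self.used = 0
        self.window_start = time.monotonic()

    def acquire(self, tokens: int) -> float:
        """Reserve `tokens`; return how many seconds the caller should sleep first."""
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # A full window has elapsed: reset the budget.
            self.used = 0
            self.window_start = now
        if self.used + tokens > self.capacity:
            # Over budget: wait out the rest of the window, then start fresh.
            wait = max(self.window - (now - self.window_start), 0.0)
            self.used = tokens
            self.window_start = now + wait
            return wait
        self.used += tokens
        return 0.0
```

A caller would `time.sleep()` for the returned duration before issuing the embedding request.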
### Crawl Cancellation
- **Immediate Response**: Stop button provides instant UI feedback
- **Proper Cleanup**: Cancels both orchestration service and asyncio tasks
- **State Persistence**: Uses localStorage to prevent zombie crawls on refresh
- **Socket.IO Events**: Real-time cancellation status via `crawl:stopping` and `crawl:stopped`
- **Resource Management**: Ensures all server resources are properly released
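
On the server side, cancelling an asyncio task is what makes the Stop button take effect between crawl steps. The sketch below (with hypothetical function names) shows the cooperative pattern: each `await` is a cancellation point, and cleanup runs before the `CancelledError` propagates:

```python
import asyncio

async def crawl_pages(urls, on_progress):
    """Long-running crawl that cooperates with cancellation."""
    done = []
    try:
        for url in urls:
            await asyncio.sleep(0)  # stand-in for the real async fetch
            done.append(url)
            on_progress(url)
    except asyncio.CancelledError:
        # Cleanup happens here, before cancellation reaches the caller.
        on_progress("cancelled")
        raise
    return done

async def stop_after_first(urls):
    events = []
    task = asyncio.create_task(crawl_pages(urls, events.append))
    await asyncio.sleep(0)   # let the crawl start
    task.cancel()            # server-side equivalent of the Stop button
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events
```

In Archon the same moment is when the `crawl:stopping` and `crawl:stopped` Socket.IO events would be emitted to the UI.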
## 🚀 Advanced Features
### Content Types
#### Web Crawling
- **Sitemap Support**: Process sitemap.xml files
- **Text File Support**: Direct .txt file processing
- **Markdown Support**: Native .md file handling
- **Dynamic Content**: JavaScript-rendered pages
- **Authentication**: Basic auth support
#### Document Processing
- **PDF Extraction**: Full text and metadata
- **Word Documents**: .docx and .doc support
- **Markdown Files**: Preserve formatting
- **Plain Text**: Simple text processing
- **Code Files**: Syntax-aware processing
### Search Features
- **Query Enhancement**: Automatic query expansion
- **Contextual Results**: Include surrounding context
- **Code Highlighting**: Syntax highlighting in results
- **Similarity Scoring**: Relevance scores for ranking
- **Faceted Search**: Filter by metadata
### Monitoring & Analytics
- **Logfire Integration**: Real-time performance monitoring
- **Search Analytics**: Track popular queries
- **Content Analytics**: Usage patterns
- **Performance Metrics**: Query times, throughput
### Debugging Tips
#### Code Extraction Not Working?
1. **Check code block size**: Only blocks ≥ 1000 characters are extracted
2. **Verify markdown format**: Code must be fenced with triple backticks
3. **Check crawl results**: Look for `backtick_count` in logs
4. **HTML fallback**: System will try HTML if markdown has no code blocks
#### Crawling Issues?
1. **Timeouts**: The system uses simple crawling without complex waits
2. **JavaScript sites**: Content should render without special interactions
3. **Progress stuck**: Check logs for batch processing updates
4. **Different content types**:
- `.txt` files: Direct text extraction
- `sitemap.xml`: Batch crawl all URLs
- Regular URLs: Recursive crawl with depth limit
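
The content-type dispatch above amounts to routing on the URL's path. A minimal sketch (the strategy names are illustrative, not Archon's internals):

```python
from urllib.parse import urlparse

def classify_url(url: str) -> str:
    """Route a URL to the documented crawl strategy."""
    path = urlparse(url).path.lower()
    if path.endswith(".txt"):
        return "direct_text"      # .txt: extract the text directly
    if path.endswith("sitemap.xml"):
        return "sitemap_batch"    # sitemap.xml: batch-crawl all listed URLs
    return "recursive_crawl"      # anything else: recursive crawl with depth limit
```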
## 📊 Usage Examples
### Crawling Documentation
```
AI Assistant: "Crawl the React documentation"
```
The system will:
1. Detect it's a documentation site
2. Find and process the sitemap
3. Crawl all pages with progress updates
4. Extract code examples
5. Create searchable chunks
### Searching Knowledge
```
AI Assistant: "Find information about React hooks"
```
The search will:
1. Generate embeddings for the query
2. Search across all React documentation
3. Return relevant chunks with context
4. Include code examples
5. Provide source links
### Document Upload
```
User uploads "system-design.pdf"
```
The system will:
1. Extract text from PDF
2. Identify sections and headers
3. Create smart chunks
4. Generate embeddings
5. Make searchable immediately
## 🔗 Real-time Features
### Socket.IO Progress
- **Crawl Progress**: Percentage completion with smooth updates
- **Current URL**: What's being processed in real-time
- **Batch Updates**: Processing batches with accurate progress
- **Stop Support**: Cancel crawls with immediate feedback
- **Error Reporting**: Failed URLs with detailed errors
- **Completion Status**: Final statistics and results
### Live Updates
```javascript
// Connect to Socket.IO and subscribe to progress
import { io } from 'socket.io-client';

const socket = io('/crawl');
socket.emit('subscribe', { progress_id: 'uuid' });

// Handle progress updates
socket.on('progress_update', (data) => {
  console.log(`Progress: ${data.percentage}%`);
  console.log(`Status: ${data.status}`);
  console.log(`Current URL: ${data.currentUrl}`);
  console.log(`Pages: ${data.processedPages}/${data.totalPages}`);
});

// Handle completion
socket.on('progress_complete', (data) => {
  console.log('Crawl completed!');
  console.log(`Chunks stored: ${data.chunksStored}`);
  console.log(`Word count: ${data.wordCount}`);
});

// Stop a crawl
socket.emit('crawl_stop', { progress_id: 'uuid' });

// Handle stop events
socket.on('crawl:stopping', (data) => {
  console.log('Crawl is stopping...');
});
socket.on('crawl:stopped', (data) => {
  console.log('Crawl cancelled successfully');
});
```
## 🔗 Related Documentation
- [Knowledge Overview](./knowledge-overview) - High-level introduction
- [API Reference](./api-reference#knowledge-management-api) - Detailed API documentation
- [MCP Tools Reference](./mcp-tools#knowledge-tools) - MCP tool specifications
- [RAG Agent Documentation](./agent-rag) - How the AI searches knowledge
- [Server Services](./server-services#rag-services) - Technical service details