* Fix race condition in concurrent crawling with unique source IDs
  - Add unique hash-based source_id generation to prevent conflicts (see the first sketch after this changelog)
  - Separate source identification from display with three fields:
    - source_id: 16-char SHA256 hash for unique identification
    - source_url: Original URL for tracking
    - source_display_name: Human-friendly name for UI
  - Add comprehensive test suite validating the fix
  - Migrate existing data with backward compatibility

* Fix title generation to use source_display_name for better AI context
  - Pass source_display_name to title generation function
  - Use display name in AI prompt instead of hash-based source_id
  - Results in more specific, meaningful titles for each source

* Skip AI title generation when display name is available
  - Use source_display_name directly as title to avoid unnecessary AI calls
  - More efficient and predictable than AI-generated titles
  - Keep AI generation only as fallback for backward compatibility

* Fix critical issues from code review
  - Add missing os import to prevent NameError crash
  - Remove unused imports (pytest, Mock, patch, hashlib, urlparse, etc.)
  - Fix GitHub API capitalization consistency
  - Reuse existing DocumentStorageService instance
  - Update test expectations to match corrected capitalization

  Addresses CodeRabbit review feedback on PR #472

* Add safety improvements from code review
  - Truncate display names to 100 chars when used as titles
  - Document hash collision probability (negligible for <1M sources)

  Simple, pragmatic fixes per KISS principle

* Fix code extraction to use hash-based source_ids and improve display names
  - Fixed critical bug where code extraction was using old domain-based source_ids
  - Updated code extraction service to accept source_id as a parameter instead of extracting it from the URL
  - Added special handling for llms.txt and sitemap.xml files in display names
  - Added comprehensive tests for source_id handling in code extraction
  - Removed unused urlparse import from code_extraction_service.py

  This fixes the foreign key constraint errors that were preventing code examples from being stored after the source_id architecture refactor.

  Co-Authored-By: Claude <noreply@anthropic.com>

* Fix critical variable shadowing and source_type determination issues
  - Fixed variable shadowing in document_storage_operations.py where the source_url parameter was being overwritten by document URLs, causing incorrect source_url values in the database
  - Fixed source_type determination to use actual URLs instead of hash-based source_id
  - Added comprehensive tests for source URL preservation
  - Ensure source_type is correctly set to "file" for file uploads and "url" for web crawls

  The variable shadowing bug was causing sitemap sources to have the wrong source_url (the last crawled page instead of the sitemap URL). The source_type bug would mark all sources as "url" even for file uploads, because hash-based IDs do not start with "file_".

  Co-Authored-By: Claude <noreply@anthropic.com>

* Fix URL canonicalization and document metrics calculation
  - Implement proper URL canonicalization to prevent duplicate sources (sketched after this changelog):
    - Remove trailing slashes (except root)
    - Remove URL fragments
    - Remove tracking parameters (utm_*, gclid, fbclid, etc.)
    - Sort query parameters for consistency
    - Remove default ports (80 for HTTP, 443 for HTTPS)
    - Normalize scheme and domain to lowercase
  - Fix avg_chunks_per_doc calculation to avoid division by zero
  - Track processed_docs count separately from total crawl_results
  - Handle all-empty document sets gracefully
  - Show processed/total in logs for better visibility
  - Add comprehensive tests for both fixes:
    - 10 test cases for URL canonicalization edge cases
    - 4 test cases for document metrics calculation

  This prevents database constraint violations when crawling the same content with URL variations and provides accurate metrics in logs.

* Fix synchronous extract_source_summary blocking async event loop
  - Run extract_source_summary in a thread pool using asyncio.to_thread (sketched after this changelog)
  - Prevents blocking the async event loop during AI summary generation
  - Preserves exact error handling and fallback behavior
  - Variables (source_id, combined_content) properly passed to thread

  Added comprehensive tests verifying:
  - Function runs in thread without blocking
  - Error handling works correctly with fallback
  - Multiple sources can be processed
  - Thread safety with variable passing

* Fix synchronous update_source_info blocking async event loop
  - Run update_source_info in a thread pool using asyncio.to_thread
  - Prevents blocking the async event loop during database operations
  - Preserves exact error handling and fallback behavior
  - All kwargs properly passed to thread execution

  Added comprehensive tests verifying:
  - Function runs in thread without blocking
  - Error handling triggers fallback correctly
  - All kwargs are preserved when passed to thread
  - Existing extract_source_summary tests still pass

* Fix race condition in source creation using upsert
  - Replace INSERT with UPSERT for new sources to prevent PRIMARY KEY violations (see the sketch after this changelog)
  - Handles concurrent crawls attempting to create the same source
  - Maintains existing UPDATE behavior for sources that already exist

  Added comprehensive tests verifying:
  - Concurrent source creation doesn't fail
  - Upsert is used for new sources (not insert)
  - Update is still used for existing sources
  - Async concurrent operations work correctly
  - Race conditions with delays are handled

  This prevents database constraint errors when multiple crawls target the same URL simultaneously.

* Add migration detection UI components

  Add MigrationBanner component with clear user instructions for database schema updates. Add useMigrationStatus hook for periodic health check monitoring with graceful error handling.

* Integrate migration banner into main app

  Add migration status monitoring and banner display to App.tsx. Shows the migration banner when database schema updates are required.

* Enhance backend startup error instructions

  Add detailed Docker restart instructions and migration script guidance. Improves the user experience when encountering startup failures.

* Add database schema caching to health endpoint

  Implement smart caching for schema validation to prevent repeated database queries (sketched after this changelog). Cache successful validations permanently and throttle failures to 30-second intervals. Replace debug prints with proper logging.

* Clean up knowledge API imports and logging

  Remove duplicate import statements and redundant logging. Improves code clarity and reduces log noise.

* Remove unused instructions prop from MigrationBanner

  Clean up the component API by removing an instructions prop that was accepted but never rendered. Simplifies the interface and eliminates dead code while keeping the functional hardcoded migration steps.
* Add schema_valid flag to migration_required health response

  Add a schema_valid: false flag to the health endpoint response when a database schema migration is required. Improves API consistency without changing existing behavior.

---------

Co-authored-by: Claude <noreply@anthropic.com>
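
The changelog refers several times to 16-char SHA256-based source_ids. A minimal sketch of that idea, assuming the hash is taken over the canonical source URL; the helper name generate_source_id and its input are illustrative, not the project's actual API:

import hashlib

def generate_source_id(canonical_url: str) -> str:
    """Derive a stable 16-character identifier from a canonical URL."""
    # 16 hex chars = 64 bits, so accidental collisions stay negligible
    # for well under a million sources.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()[:16]

# The same URL always maps to the same source_id, so concurrent crawls
# of one site agree on the ID instead of racing on a derived name.
assert generate_source_id("https://example.com/docs") == generate_source_id("https://example.com/docs")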
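
A rough sketch of the canonicalization rules listed under "Fix URL canonicalization", using only urllib.parse; the function name, the tracking-parameter list, and the exact ordering of steps are assumptions rather than the repository's implementation:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"gclid", "fbclid"}  # plus anything starting with "utm_"

def canonicalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()          # normalize scheme
    host = (parts.hostname or "").lower()  # normalize domain
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"      # keep only non-default ports
    path = parts.path if parts.path in ("", "/") else parts.path.rstrip("/")
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith("utm_")
    ]
    query = urlencode(sorted(pairs))       # sort query parameters for consistency
    return urlunsplit((scheme, host, path, query, ""))  # drop the fragment

# "https://Example.com:443/docs/?utm_source=x&b=2&a=1#top" -> "https://example.com/docs?a=1&b=2"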
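
The two "blocking async event loop" fixes use the same pattern: push the synchronous call into a worker thread with asyncio.to_thread so the event loop keeps serving other crawls. A minimal sketch, assuming extract_source_summary takes (source_id, combined_content) as the changelog states; the wrapper function and fallback string are illustrative:

import asyncio
from typing import Callable

async def summarize_source(
    extract_source_summary: Callable[[str, str], str],  # the existing sync helper
    source_id: str,
    combined_content: str,
) -> str:
    try:
        # Run the blocking AI summary call in a worker thread; the loop stays free.
        return await asyncio.to_thread(extract_source_summary, source_id, combined_content)
    except Exception:
        # Illustrative fallback; the real code preserves its existing error handling.
        return f"Content from {source_id}"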
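
The upsert change exercised by the test file below can be pictured roughly like this with the supabase-py query builder; the table name "sources" and the surrounding function are assumptions:

def save_source(client, source_data: dict) -> None:
    """Create or refresh a source row without racing other crawls."""
    source_id = source_data["source_id"]
    existing = (
        client.table("sources")  # table name assumed
        .select("source_id")
        .eq("source_id", source_id)
        .execute()
    )
    if existing.data:
        # Existing sources keep the original UPDATE path.
        client.table("sources").update(source_data).eq("source_id", source_id).execute()
    else:
        # UPSERT instead of INSERT: if a concurrent crawl already created the row,
        # this overwrites it with equivalent data instead of raising a
        # duplicate-key / PRIMARY KEY violation.
        client.table("sources").upsert(source_data).execute()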
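
The health-endpoint schema caching described above ("cache successful validations permanently, throttle failures to 30-second intervals") can be sketched with a module-level cache; the names and the 30-second constant mirror the changelog, everything else is illustrative:

import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)

_schema_valid = False                    # once True, never re-checked
_last_failed_check = -float("inf")       # monotonic time of the last failed check
_FAILURE_RETRY_SECONDS = 30.0

def schema_is_valid(check_schema: Callable[[], bool]) -> bool:
    """Return the cached schema status, re-querying failures at most every 30s."""
    global _schema_valid, _last_failed_check
    if _schema_valid:
        return True
    now = time.monotonic()
    if now - _last_failed_check < _FAILURE_RETRY_SECONDS:
        return False  # throttled: don't hammer the database
    try:
        _schema_valid = bool(check_schema())  # the actual database query
    except Exception:
        logger.exception("Schema validation failed")
        _schema_valid = False
    if not _schema_valid:
        _last_failed_check = now
    return _schema_valid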
"""
|
|
Test race condition handling in source creation.
|
|
|
|
This test ensures that concurrent source creation attempts
|
|
don't fail with PRIMARY KEY violations.
|
|
"""
|
|
|
|
import asyncio
|
|
import threading
|
|
from concurrent.futures import ThreadPoolExecutor
|
|
from unittest.mock import Mock, patch
|
|
import pytest
|
|
|
|
from src.server.services.source_management_service import update_source_info
|
|
|
|
|
|
class TestSourceRaceCondition:
|
|
"""Test that concurrent source creation handles race conditions properly."""
|
|
|
|
def test_concurrent_source_creation_no_race(self):
|
|
"""Test that concurrent attempts to create the same source don't fail."""
|
|
# Track successful operations
|
|
successful_creates = []
|
|
failed_creates = []
|
|
|
|
def mock_execute():
|
|
"""Mock execute that simulates database operation."""
|
|
return Mock(data=[])
|
|
|
|
def track_upsert(data):
|
|
"""Track upsert calls."""
|
|
successful_creates.append(data["source_id"])
|
|
return Mock(execute=mock_execute)
|
|
|
|
# Mock Supabase client
|
|
mock_client = Mock()
|
|
|
|
# Mock the SELECT (existing source check) - always returns empty
|
|
mock_client.table.return_value.select.return_value.eq.return_value.execute.return_value.data = []
|
|
|
|
# Mock the UPSERT operation
|
|
mock_client.table.return_value.upsert = track_upsert
|
|
|
|
def create_source(thread_id):
|
|
"""Simulate creating a source from a thread."""
|
|
try:
|
|
update_source_info(
|
|
client=mock_client,
|
|
source_id="test_source_123",
|
|
summary=f"Summary from thread {thread_id}",
|
|
word_count=100,
|
|
content=f"Content from thread {thread_id}",
|
|
knowledge_type="documentation",
|
|
tags=["test"],
|
|
update_frequency=0,
|
|
source_url="https://example.com",
|
|
source_display_name=f"Example Site {thread_id}" # Will be used as title
|
|
)
|
|
except Exception as e:
|
|
failed_creates.append((thread_id, str(e)))
|
|
|
|
# Run 5 threads concurrently trying to create the same source
|
|
with ThreadPoolExecutor(max_workers=5) as executor:
|
|
futures = []
|
|
for i in range(5):
|
|
futures.append(executor.submit(create_source, i))
|
|
|
|
# Wait for all to complete
|
|
for future in futures:
|
|
future.result()
|
|
|
|
# All should succeed (no failures due to PRIMARY KEY violation)
|
|
assert len(failed_creates) == 0, f"Some creates failed: {failed_creates}"
|
|
assert len(successful_creates) == 5, "All 5 attempts should succeed"
|
|
assert all(sid == "test_source_123" for sid in successful_creates)
|
|
|
|
def test_upsert_vs_insert_behavior(self):
|
|
"""Test that upsert is used instead of insert for new sources."""
|
|
mock_client = Mock()
|
|
|
|
# Track which method is called
|
|
methods_called = []
|
|
|
|
def track_insert(data):
|
|
methods_called.append("insert")
|
|
# Simulate PRIMARY KEY violation
|
|
raise Exception("duplicate key value violates unique constraint")
|
|
|
|
def track_upsert(data):
|
|
methods_called.append("upsert")
|
|
return Mock(execute=Mock(return_value=Mock(data=[])))
|
|
|
|
# Source doesn't exist
|
|
mock_client.table.return_value.select.return_value.eq.return_value.execute.return_value.data = []
|
|
|
|
# Set up mocks
|
|
mock_client.table.return_value.insert = track_insert
|
|
mock_client.table.return_value.upsert = track_upsert
|
|
|
|
update_source_info(
|
|
client=mock_client,
|
|
source_id="new_source",
|
|
summary="Test summary",
|
|
word_count=100,
|
|
content="Test content",
|
|
knowledge_type="documentation",
|
|
source_display_name="Test Display Name" # Will be used as title
|
|
)
|
|
|
|
# Should use upsert, not insert
|
|
assert "upsert" in methods_called, "Should use upsert for new sources"
|
|
assert "insert" not in methods_called, "Should not use insert to avoid race conditions"
|
|
|
|
def test_existing_source_uses_update(self):
|
|
"""Test that existing sources still use UPDATE (not affected by change)."""
|
|
mock_client = Mock()
|
|
|
|
methods_called = []
|
|
|
|
def track_update(data):
|
|
methods_called.append("update")
|
|
return Mock(eq=Mock(return_value=Mock(execute=Mock(return_value=Mock(data=[])))))
|
|
|
|
def track_upsert(data):
|
|
methods_called.append("upsert")
|
|
return Mock(execute=Mock(return_value=Mock(data=[])))
|
|
|
|
# Source exists
|
|
existing_source = {
|
|
"source_id": "existing_source",
|
|
"title": "Existing Title",
|
|
"metadata": {"knowledge_type": "api"}
|
|
}
|
|
mock_client.table.return_value.select.return_value.eq.return_value.execute.return_value.data = [existing_source]
|
|
|
|
# Set up mocks
|
|
mock_client.table.return_value.update = track_update
|
|
mock_client.table.return_value.upsert = track_upsert
|
|
|
|
update_source_info(
|
|
client=mock_client,
|
|
source_id="existing_source",
|
|
summary="Updated summary",
|
|
word_count=200,
|
|
content="Updated content",
|
|
knowledge_type="documentation"
|
|
)
|
|
|
|
# Should use update for existing sources
|
|
assert "update" in methods_called, "Should use update for existing sources"
|
|
assert "upsert" not in methods_called, "Should not use upsert for existing sources"
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_async_concurrent_creation(self):
|
|
"""Test concurrent source creation in async context."""
|
|
mock_client = Mock()
|
|
|
|
# Track operations
|
|
operations = []
|
|
|
|
def track_upsert(data):
|
|
operations.append(("upsert", data["source_id"]))
|
|
return Mock(execute=Mock(return_value=Mock(data=[])))
|
|
|
|
# No existing sources
|
|
mock_client.table.return_value.select.return_value.eq.return_value.execute.return_value.data = []
|
|
mock_client.table.return_value.upsert = track_upsert
|
|
|
|
async def create_source_async(task_id):
|
|
"""Async wrapper for source creation."""
|
|
await asyncio.to_thread(
|
|
update_source_info,
|
|
client=mock_client,
|
|
source_id=f"async_source_{task_id % 2}", # Only 2 unique sources
|
|
summary=f"Summary {task_id}",
|
|
word_count=100,
|
|
content=f"Content {task_id}",
|
|
knowledge_type="documentation"
|
|
)
|
|
|
|
# Create 10 tasks, but only 2 unique source_ids
|
|
tasks = [create_source_async(i) for i in range(10)]
|
|
await asyncio.gather(*tasks)
|
|
|
|
# All operations should succeed
|
|
assert len(operations) == 10, "All 10 operations should complete"
|
|
|
|
# Check that we tried to upsert the two sources multiple times
|
|
source_0_count = sum(1 for op, sid in operations if sid == "async_source_0")
|
|
source_1_count = sum(1 for op, sid in operations if sid == "async_source_1")
|
|
|
|
assert source_0_count == 5, "async_source_0 should be upserted 5 times"
|
|
assert source_1_count == 5, "async_source_1 should be upserted 5 times"
|
|
|
|
def test_race_condition_with_delay(self):
|
|
"""Test race condition with simulated delay between check and create."""
|
|
import time
|
|
|
|
mock_client = Mock()
|
|
|
|
# Track timing of operations
|
|
check_times = []
|
|
create_times = []
|
|
source_created = threading.Event()
|
|
|
|
def delayed_select(*args):
|
|
"""Return a mock that simulates SELECT with delay."""
|
|
mock_select = Mock()
|
|
|
|
def eq_mock(*args):
|
|
mock_eq = Mock()
|
|
mock_eq.execute = lambda: delayed_check()
|
|
return mock_eq
|
|
|
|
mock_select.eq = eq_mock
|
|
return mock_select
|
|
|
|
def delayed_check():
|
|
"""Simulate SELECT execution with delay."""
|
|
check_times.append(time.time())
|
|
result = Mock()
|
|
# First thread doesn't see the source
|
|
if not source_created.is_set():
|
|
time.sleep(0.01) # Small delay to let both threads check
|
|
result.data = []
|
|
else:
|
|
# Subsequent checks would see it (but we use upsert so this doesn't matter)
|
|
result.data = [{"source_id": "race_source", "title": "Existing", "metadata": {}}]
|
|
return result
|
|
|
|
def track_upsert(data):
|
|
"""Track upsert and set event."""
|
|
create_times.append(time.time())
|
|
source_created.set()
|
|
return Mock(execute=Mock(return_value=Mock(data=[])))
|
|
|
|
# Set up table mock to return our custom select mock
|
|
mock_client.table.return_value.select = delayed_select
|
|
mock_client.table.return_value.upsert = track_upsert
|
|
|
|
errors = []
|
|
|
|
def create_with_error_tracking(thread_id):
|
|
try:
|
|
update_source_info(
|
|
client=mock_client,
|
|
source_id="race_source",
|
|
summary="Race summary",
|
|
word_count=100,
|
|
content="Race content",
|
|
knowledge_type="documentation",
|
|
source_display_name="Race Display Name" # Will be used as title
|
|
)
|
|
except Exception as e:
|
|
errors.append((thread_id, str(e)))
|
|
|
|
# Run 2 threads that will both check before either creates
|
|
with ThreadPoolExecutor(max_workers=2) as executor:
|
|
futures = [
|
|
executor.submit(create_with_error_tracking, 1),
|
|
executor.submit(create_with_error_tracking, 2)
|
|
]
|
|
for future in futures:
|
|
future.result()
|
|
|
|
# Both should succeed with upsert (no errors)
|
|
assert len(errors) == 0, f"No errors should occur with upsert: {errors}"
|
|
assert len(check_times) == 2, "Both threads should check"
|
|
assert len(create_times) == 2, "Both threads should attempt create/upsert" |