Archon/K8S_COMPLETE_ADJUSTMENTS.md
Luis Erlacher e2e1201d62
2025-11-04 15:38:32 -03:00

Kubernetes Complete Adjustments Guide

Executive Summary

This document describes all the changes required to run Archon in production on Kubernetes, not only the Playwright fix. The changes cover:

  • Playwright browser binaries (ALREADY FIXED)
  • ⚠️ Environment variables in the K8s manifests
  • ⚠️ Resource limits for crawling
  • ⚠️ Nginx permissions and configuration
  • ⚠️ Advanced security contexts
  • ⚠️ Tuned health checks
  • ⚠️ Init containers for warm-up

1. Playwright Browser Binaries (ALREADY FIXED)

Problem Identified

Playwright installed its browser binaries in /root/.cache/ms-playwright (owned by root), but the container runs as appuser (UID 1001), which had no access to that directory.

Solution Applied

Dockerfile.k8s.server:

# Install Playwright browsers in a location accessible to appuser
ENV PATH=/venv/bin:$PATH
ENV PLAYWRIGHT_BROWSERS_PATH=/app/ms-playwright
RUN mkdir -p /app/ms-playwright && \
    playwright install chromium && \
    chown -R appuser:appuser /app/ms-playwright

# Runtime environment
ENV PLAYWRIGHT_BROWSERS_PATH=/app/ms-playwright

Dockerfile.server (Docker Compose):

ENV PLAYWRIGHT_BROWSERS_PATH=/tmp/ms-playwright
RUN mkdir -p /tmp/ms-playwright && \
    playwright install chromium && \
    chmod -R 777 /tmp/ms-playwright

ENV PLAYWRIGHT_BROWSERS_PATH=/tmp/ms-playwright

⚠️ ACTION REQUIRED: Add to the K8s Manifests

Add to the archon-server deployment in k8s-manifests-complete.yaml:

spec:
  template:
    spec:
      containers:
      - name: server
        env:
        # ... other variables ...

        # ADD THIS ENTRY:
        - name: PLAYWRIGHT_BROWSERS_PATH
          value: "/app/ms-playwright"

2. Resource Limits for Crawling with Chromium

Problem

Chromium consumes significant memory and CPU while crawling. The current limits may be insufficient:

Current:

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

Recommended Solution

Update the archon-server deployment in k8s-manifests-complete.yaml:

resources:
  requests:
    memory: "768Mi"      # Raised from 512Mi
    cpu: "500m"
  limits:
    memory: "2Gi"        # Raised from 1Gi (Chromium can use 1.5Gi at peak)
    cpu: "2000m"         # Raised from 1000m (parallel crawling)

    # ADD: cap ephemeral storage usage
    ephemeral-storage: "5Gi"

Rationale

  • Headless Chromium uses roughly 300-600MB per instance
  • Parallel crawling can run multiple instances at once
  • Processing large documents needs additional memory
  • A safety margin helps avoid OOMKilled pods
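
The sizing reasoning above can be sketched as a small calculation. This is only a rough estimate under stated assumptions: the 600 MiB per-instance figure is the upper end of the range quoted above, while `base_mib` (memory for the Python server itself) and `headroom` (safety margin) are hypothetical values chosen for illustration, not measurements from Archon.

```python
import math


def estimate_memory_mib(instances: int,
                        per_instance_mib: int = 600,
                        base_mib: int = 400,
                        headroom: float = 0.25) -> int:
    """Rough memory limit (MiB) for `instances` parallel Chromium processes.

    per_instance_mib: upper end of the 300-600MB range from this guide.
    base_mib, headroom: assumed values for illustration only.
    """
    if instances < 0:
        raise ValueError("instances must be non-negative")
    raw = base_mib + instances * per_instance_mib
    # Add the safety margin and round up to a whole MiB.
    return math.ceil(raw * (1 + headroom))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "instance(s):", estimate_memory_mib(n), "MiB")
```

Under these assumptions, two parallel instances fit just under the 2Gi limit (2048 MiB), while three would exceed it, so the crawler's concurrency and the memory limit should be tuned together.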

3. Nginx Configuration and Permissions

Current Status

Nginx is already configured to run as non-root (user nginx, UID 101).

Dockerfile.k8s.production:

RUN chown -R nginx:nginx /usr/share/nginx/html /var/cache/nginx /var/log/nginx /etc/nginx/conf.d && \
    touch /var/run/nginx.pid && \
    chown -R nginx:nginx /var/run/nginx.pid

USER nginx

⚠️ Recommended Improvements

Add to the archon-frontend deployment in k8s-manifests-complete.yaml:

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 101       # nginx user
        runAsGroup: 101
        fsGroup: 101
        # ADD:
        seccompProfile:
          type: RuntimeDefault

      containers:
      - name: frontend
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL
            # Nginx needs no special capabilities on port 3737
          readOnlyRootFilesystem: true  # CHANGE to true

        # ADD volumes for the directories nginx needs to write to:
        volumeMounts:
        - name: nginx-cache
          mountPath: /var/cache/nginx
        - name: nginx-run
          mountPath: /var/run
        - name: nginx-logs
          mountPath: /var/log/nginx

      volumes:
      - name: nginx-cache
        emptyDir: {}
      - name: nginx-run
        emptyDir: {}
      - name: nginx-logs
        emptyDir: {}

4. Advanced Security Contexts

Problem

The security contexts are basic and can be hardened.

Solution: Pod Security Standards

Add to ALL deployments:

spec:
  template:
    metadata:
      labels:
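        app: archon-server  # or mcp, frontend, etc.
        # NOTE: Pod Security Admission reads the pod-security.kubernetes.io/*
        # labels from the Namespace object, not from pod templates, so set
        # these on the archon Namespace instead of here:
        #   pod-security.kubernetes.io/enforce: baseline
        #   pod-security.kubernetes.io/audit: restricted
        #   pod-security.kubernetes.io/warn: restricted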

    spec:
      # Pod-level security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        runAsGroup: 1001
        fsGroup: 1001
        # ADD:
        seccompProfile:
          type: RuntimeDefault
        supplementalGroups: []

      # Container-level security context
      containers:
      - name: server
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1001
          capabilities:
            drop:
              - ALL
          # ADD (if possible; test first):
          readOnlyRootFilesystem: false  # true once the volumes are configured
          seccompProfile:
            type: RuntimeDefault

Paths That Need Write Access

archon-server:

  • /app/ms-playwright - Playwright browser cache (already configured with the correct ownership)
  • /tmp - Temporary files (already writable by appuser)
  • No persistent volume required (everything is stored in Supabase)

archon-mcp and archon-agents:

  • No local files required
  • Can use readOnlyRootFilesystem: true

5. Tuned Health Checks

Current Problem

Health checks may be too aggressive during heavy operations (crawling).

Solution

Update the archon-server deployment in k8s-manifests-complete.yaml:

livenessProbe:
  httpGet:
    path: /health
    port: 8181
  initialDelaySeconds: 60      # Raised from 40 (time for Playwright to initialize)
  periodSeconds: 30            # OK
  timeoutSeconds: 15           # Raised from 10 (crawling can slow the server down)
  failureThreshold: 5          # Raised from 3 (more tolerant)
  successThreshold: 1

readinessProbe:
  httpGet:
    path: /health
    port: 8181
  initialDelaySeconds: 15      # Raised from 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

# ADD a startup probe so the pod is not killed during a slow startup:
startupProbe:
  httpGet:
    path: /health
    port: 8181
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 12         # 12 x 10s = 2 minutes for startup (after the 10s initial delay)
  successThreshold: 1
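
The timing implied by these probe settings can be worked out with a short calculation. The sketch below is an approximation: it ignores `timeoutSeconds` and kubelet scheduling jitter, and the `Probe` class and function names are hypothetical helpers for illustration, not part of any Kubernetes client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int


def startup_budget_seconds(probe: Probe) -> int:
    """Approximate time a container gets to start before it is killed."""
    return probe.initial_delay_seconds + probe.period_seconds * probe.failure_threshold


def restart_after_seconds(probe: Probe) -> int:
    """Approximate time of continuous failures before a liveness restart."""
    return probe.period_seconds * probe.failure_threshold


if __name__ == "__main__":
    startup = Probe(initial_delay_seconds=10, period_seconds=10, failure_threshold=12)
    liveness = Probe(initial_delay_seconds=60, period_seconds=30, failure_threshold=5)
    print("startup budget:", startup_budget_seconds(startup), "s")
    print("liveness restart after:", restart_after_seconds(liveness), "s of failures")
```

With the values above, the server has about 130 seconds to start, and once running it has to fail the liveness check for about 150 seconds before it is restarted.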

6. Init Container for Playwright Warm-up (Optional but Recommended)

Problem

The first crawl request is slow because Playwright has to initialize.

Solution

Add to the archon-server deployment in k8s-manifests-complete.yaml:

spec:
  template:
    spec:
      # ADD before containers:
      initContainers:
      - name: playwright-warmup
        image: git.automatizase.com.br/luis.erlacher/archon/server:k8s-latest
        imagePullPolicy: Always
        command:
        - sh
        - -c
        - |
          echo "Verifying Playwright installation..."
          python -c "from playwright.sync_api import sync_playwright; print('Playwright OK')" || exit 1
          echo "Playwright initialized successfully"
        env:
        - name: PLAYWRIGHT_BROWSERS_PATH
          value: "/app/ms-playwright"
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL
          readOnlyRootFilesystem: false

      containers:
      - name: server
        # ... rest of the configuration ...

7. ConfigMap Updates

Add Playwright and other settings

Update the ConfigMap in k8s-manifests-complete.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: archon-config
  namespace: archon
data:
  # Existing configs...
  SERVICE_DISCOVERY_MODE: "kubernetes"
  LOG_LEVEL: "INFO"
  ARCHON_SERVER_PORT: "8181"
  ARCHON_MCP_PORT: "8051"
  ARCHON_UI_PORT: "3737"
  ARCHON_HOST: "localhost"
  TRANSPORT: "sse"
  AGENTS_ENABLED: "false"

  # ADD:
  PLAYWRIGHT_BROWSERS_PATH: "/app/ms-playwright"

  # MCP Public URL - IMPORTANT: set this to your own domain!
  # Format: "domain.com:8051" or "localhost:8051"
  # Examples:
  #   - Development: localhost:8051
  #   - Production: archon.automatizase.com.br:8051
  #   - Custom: mcp.mycompany.com:8051
  # This is used to generate MCP client configuration JSON
  MCP_PUBLIC_URL: "archon.automatizase.com.br:8051"  # ← CHANGE THIS!

  # Chromium optimization flags (already set in the code, but can be overridden):
  CHROMIUM_DISABLE_DEV_SHM: "true"
  CHROMIUM_HEADLESS: "true"
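
To make the `MCP_PUBLIC_URL` format concrete, here is a minimal sketch of how a "host:port" value could be parsed before building client configuration. It is an illustration only: `parse_mcp_public_url` is a hypothetical helper, not Archon's actual implementation, and falling back to port 8051 when no port is given is an assumption based on `ARCHON_MCP_PORT` above.

```python
def parse_mcp_public_url(value: str, default_port: int = 8051) -> dict:
    """Split a "host:port" string into its parts (hypothetical helper)."""
    value = value.strip()
    if not value:
        raise ValueError("MCP_PUBLIC_URL is empty")
    host, sep, port_text = value.rpartition(":")
    if not sep:
        # No colon present: treat the whole value as the host.
        return {"host": value, "port": default_port}
    if not host or not port_text.isdigit():
        raise ValueError(f"expected 'host:port', got {value!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {"host": host, "port": port}


if __name__ == "__main__":
    print(parse_mcp_public_url("archon.automatizase.com.br:8051"))
    print(parse_mcp_public_url("localhost"))
```

Validating the value at startup surfaces a misconfigured ConfigMap early, rather than when a client later receives an unusable configuration.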

8. Network Policies (Additional Security)

Create Network Policies to isolate pods

Create the file k8s-network-policies.yaml:

---
# Network Policy - Archon Server
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: archon-server-netpol
  namespace: archon
spec:
  podSelector:
    matchLabels:
      app: archon-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow traffic from the frontend
  - from:
    - podSelector:
        matchLabels:
          app: archon-frontend
    ports:
    - protocol: TCP
      port: 8181
  # Allow traffic from MCP
  - from:
    - podSelector:
        matchLabels:
          app: archon-mcp
    ports:
    - protocol: TCP
      port: 8181
  egress:
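  # Allow DNS (UDP and TCP 53)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow Supabase (internet); external addresses need ipBlock, since an
  # empty namespaceSelector only matches pods inside the cluster
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
  # Allow communication with MCP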
  - to:
    - podSelector:
        matchLabels:
          app: archon-mcp
    ports:
    - protocol: TCP
      port: 8051

---
# Network Policy - Archon MCP
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: archon-mcp-netpol
  namespace: archon
spec:
  podSelector:
    matchLabels:
      app: archon-mcp
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow traffic from the server
  - from:
    - podSelector:
        matchLabels:
          app: archon-server
    ports:
    - protocol: TCP
      port: 8051
  egress:
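  # Allow DNS (UDP and TCP 53)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow Supabase (external addresses need ipBlock; an empty
  # namespaceSelector only matches pods inside the cluster)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
  # Allow communication with the server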
  - to:
    - podSelector:
        matchLabels:
          app: archon-server
    ports:
    - protocol: TCP
      port: 8181

---
# Network Policy - Archon Frontend
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: archon-frontend-netpol
  namespace: archon
spec:
  podSelector:
    matchLabels:
      app: archon-frontend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Allow traffic from anywhere (public-facing)
  - {}
  egress:
  # Allow DNS (UDP and TCP 53)
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow communication with the server (for API calls)
  - to:
    - podSelector:
        matchLabels:
          app: archon-server
    ports:
    - protocol: TCP
      port: 8181
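
After applying the policies, it can help to confirm from inside a pod that allowed ports are reachable and blocked ones are not. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and the host names passed to it depend on how the Services are named in the cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # archon-server-service is the Service name used in the deploy commands
    # of this guide; adjust to match the cluster.
    print("server:8181 reachable:", can_connect("archon-server-service", 8181))
```

Running this from a pod that has Python available (for example via `kubectl exec` into archon-server) gives a quick yes/no answer per port. A timeout rather than an immediate refusal usually indicates that a policy is dropping the traffic.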

9. Horizontal Pod Autoscaling (HPA)

Configure autoscaling for the server

Create the file k8s-hpa.yaml:

---
# HPA - Archon Server (crawling can cause load spikes)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: archon-server-hpa
  namespace: archon
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: archon-server
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30   # Scale up quickly
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

---
# HPA - Frontend (less critical, can stay fixed at 2 replicas)
# Optional, for when there is a lot of traffic
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: archon-frontend-hpa
  namespace: archon
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: archon-frontend
  minReplicas: 2
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
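
To see how the utilization targets above translate into replica counts, the sketch below applies the replica formula documented for the HorizontalPodAutoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, with the default 10% tolerance and clamping to min/max. It is a simplification: it handles a single metric and ignores the `behavior` policies and stabilization windows. When several metrics are configured, the HPA uses the largest result.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified single-metric HPA calculation (ignores behavior policies)."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: the HPA leaves the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Server HPA from this guide: CPU target 70%, 2 to 5 replicas.
    print(desired_replicas(2, 140, 70, 2, 5))  # CPU at twice the target
    print(desired_replicas(4, 210, 70, 2, 5))  # capped at maxReplicas
```

For example, two server replicas running at 140% of requested CPU against the 70% target would be scaled to four, subject to the scale-up policy.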

10. PodDisruptionBudget (High Availability)

Ensure availability during voluntary disruptions (node drains, cluster upgrades)

Create the file k8s-pdb.yaml:

---
# PDB - Archon Server
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: archon-server-pdb
  namespace: archon
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: archon-server
  unhealthyPodEvictionPolicy: AlwaysAllow

---
# PDB - Archon Frontend
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: archon-frontend-pdb
  namespace: archon
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: archon-frontend
  unhealthyPodEvictionPolicy: AlwaysAllow
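
The effect of `minAvailable: 1` depends on how many healthy replicas exist. As a small illustration, the hypothetical helper below computes how many pods may be evicted at once under a `minAvailable` budget.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    print(allowed_disruptions(healthy_pods=2, min_available=1))
    print(allowed_disruptions(healthy_pods=1, min_available=1))
```

With two healthy replicas, one pod can be evicted at a time. With a single replica the budget allows none, which blocks node drains until another replica is available, so these PDBs are best paired with at least two replicas per deployment.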

11. Persistent Volumes - NOT REQUIRED

Needs Analysis

Archon does NOT need persistent volumes because:

  1. Document uploads: Processed in memory and saved to Supabase
  2. Crawling results: Saved directly to Supabase
  3. Playwright browsers: Baked into the image at build time, so nothing has to persist (stateless)
  4. Logs: Sent to stdout/stderr (captured by K8s)
  5. Credentials: Stored in Supabase (encrypted)
  6. Session data: Managed in memory by Socket.IO

📊 Stateless Architecture:

Pod → Processes data → Saves to Supabase → Pod dies → New pod works the same way

⚠️ Exception: If a local cache is needed for performance:

# Optional: ephemeral volume for an embeddings cache (does not persist across restarts)
volumes:
- name: embedding-cache
  emptyDir:
    sizeLimit: 1Gi

12. Monitoring and Observability

Prometheus Metrics (Recommended)

Add annotations to the deployments:

spec:
  template:
    metadata:
      annotations:
        # ADD:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8181"    # or 8051 for MCP
        prometheus.io/path: "/metrics"  # If the endpoint is implemented
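
The annotations only help once the service actually exposes `/metrics`. As a starting point, this sketch renders values in the Prometheus text exposition format using plain Python. The `render_metrics` helper and the metric name `archon_active_crawls` are hypothetical, and wiring the output into the server's HTTP framework is left out.

```python
def render_metrics(metrics: dict) -> str:
    """Render {name: (help_text, metric_type, value)} as Prometheus text format."""
    lines = []
    for name, (help_text, metric_type, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({
        # Hypothetical metric, for illustration only.
        "archon_active_crawls": ("Crawls currently running", "gauge", 3),
    }), end="")
```

In practice an established client library is usually preferable to hand-rendering, but the sketch shows what a scraper expects to receive at the annotated path.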

Logfire Integration

Check the Secret in k8s-manifests-complete.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: archon-secrets
  namespace: archon
type: Opaque
stringData:
  SUPABASE_URL: "https://your-project.supabase.co"
  SUPABASE_SERVICE_KEY: "your-service-role-key-here"
  OPENAI_API_KEY: "your-openai-key-here"
  LOGFIRE_TOKEN: "your-logfire-token-here"  # SET this if you use Logfire

Implementation Checklist

🔴 CRITICAL PRIORITY (Blocks operation)

  • Fix the Playwright browser path in the Dockerfiles
  • Add the PLAYWRIGHT_BROWSERS_PATH env var to the K8s deployment
  • Add MCP_PUBLIC_URL to the ConfigMap and K8s deployment
  • Raise resource limits (memory: 2Gi, cpu: 2000m)
  • ⚠️ Set MCP_PUBLIC_URL to the correct domain in the ConfigMap
  • ⚠️ Rebuild and push the K8s images

🟡 HIGH PRIORITY (Security and stability)

  • Update health checks (startup probe, failureThreshold)
  • ⚠️ Add advanced security contexts (seccompProfile, readOnlyRootFilesystem)
  • ⚠️ Configure volumes for nginx (cache, run, logs)
  • ⚠️ Implement Network Policies

🟢 MEDIUM PRIORITY (Performance and observability)

  • 🔄 Add an init container for Playwright warm-up
  • 🔄 Configure HPA for the server
  • 🔄 Configure PodDisruptionBudget
  • 🔄 Add Prometheus annotations

🔵 LOW PRIORITY (Continuous improvement)

  • 📝 Implement a /metrics endpoint for Prometheus
  • 📝 Configure the Logfire token
  • 📝 Test readOnlyRootFilesystem: true on the server
  • 📝 Consider resource quotas per namespace

Deploy Commands

1. Rebuild and Push the Images

# Server
cd /home/lperl/Archon
docker build -f python/Dockerfile.k8s.server -t git.automatizase.com.br/luis.erlacher/archon/server:k8s-latest python/
docker push git.automatizase.com.br/luis.erlacher/archon/server:k8s-latest

# MCP (unchanged, but rebuilt to be safe)
docker build -f python/Dockerfile.k8s.mcp -t git.automatizase.com.br/luis.erlacher/archon/mcp:k8s-latest python/
docker push git.automatizase.com.br/luis.erlacher/archon/mcp:k8s-latest

# Frontend (unchanged, but rebuilt to be safe)
docker build -f archon-ui-main/Dockerfile.k8s.production -t git.automatizase.com.br/luis.erlacher/archon/frontend:k8s-latest archon-ui-main/
docker push git.automatizase.com.br/luis.erlacher/archon/frontend:k8s-latest

# Agents (if used)
docker build -f python/Dockerfile.k8s.agents -t git.automatizase.com.br/luis.erlacher/archon/agents:k8s-latest python/
docker push git.automatizase.com.br/luis.erlacher/archon/agents:k8s-latest

2. Apply the K8s Manifests

# Namespace and secrets (if they do not exist yet)
kubectl apply -f k8s-manifests-complete.yaml

# Network policies (create the file first)
kubectl apply -f k8s-network-policies.yaml

# HPA (create the file first)
kubectl apply -f k8s-hpa.yaml

# PDB (create the file first)
kubectl apply -f k8s-pdb.yaml

3. Rolling Restart

# Restart server (picks up the new image)
kubectl rollout restart deployment/archon-server -n archon
kubectl rollout status deployment/archon-server -n archon

# Restart MCP
kubectl rollout restart deployment/archon-mcp -n archon
kubectl rollout status deployment/archon-mcp -n archon

# Restart frontend
kubectl rollout restart deployment/archon-frontend -n archon
kubectl rollout status deployment/archon-frontend -n archon

4. Verify Status

# Watch pods
kubectl get pods -n archon -w

# Follow server logs
kubectl logs -f deployment/archon-server -n archon

# List events
kubectl get events -n archon --sort-by='.lastTimestamp'

# Test crawling
kubectl port-forward -n archon svc/archon-server-service 8181:8181
# In another terminal:
curl -X POST http://localhost:8181/api/knowledge/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Troubleshooting

Problem: Pod crashing with OOMKilled

Solution: Raise the memory limit to 2Gi or more

Problem: Playwright still cannot find the browser

Check:

kubectl exec -it deployment/archon-server -n archon -- bash
echo $PLAYWRIGHT_BROWSERS_PATH
ls -la /app/ms-playwright
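
The same checks can be scripted so they run as the container's own user. The sketch below is a hypothetical diagnostic helper, not part of Archon or Playwright. It reports whether `PLAYWRIGHT_BROWSERS_PATH` is set, exists, is readable by the current user, and contains anything.

```python
import os
from pathlib import Path


def check_browsers_path(env=None) -> list[str]:
    """Return a list of problems found with PLAYWRIGHT_BROWSERS_PATH (empty if none)."""
    env = os.environ if env is None else env
    raw = env.get("PLAYWRIGHT_BROWSERS_PATH")
    if not raw:
        return ["PLAYWRIGHT_BROWSERS_PATH is not set"]
    path = Path(raw)
    if not path.is_dir():
        return [f"{path} does not exist or is not a directory"]
    if not os.access(path, os.R_OK | os.X_OK):
        # Typical symptom of browsers installed as root but run as appuser.
        return [f"{path} is not readable by uid {os.getuid()}"]
    if not any(path.iterdir()):
        return [f"{path} is empty (no browsers installed)"]
    return []


if __name__ == "__main__":
    problems = check_browsers_path()
    print("OK" if not problems else "\n".join(problems))
```

Running it inside the pod as appuser reproduces the permission situation the server itself sees, which a root shell would hide.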

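Problem: Health check failing

Solution: Increase initialDelaySeconds and failureThreshold

Problem: Rolling update with downtime

Solution: Run at least 2 replicas and check the Deployment's rolling update strategy (maxUnavailable) and readiness probe. The PodDisruptionBudget (minAvailable: 1) only guards against voluntary evictions such as node drains; it does not control rolling updates.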


References