Debug Problemi Complessi: Troubleshooting Sistemico

Fonte: Capitolo 6 - Settore Developer & Tech
Categoria: Domini Specialistici
Livello: Avanzato
URL: prmpt.onl/203

Quando usarlo

Per debugging sistematico di problemi complessi che richiedono investigazione metodica e root cause analysis. Trasforma debug session chaotiche in processi strutturati con resolution efficace.

Ideale per:

Production bugs difficili da replicare
Performance issues con multiple cause potenziali
System failures distributed/multi-component
Intermittent problems che appaiono random

💡 PERCHÉ QUESTO TEMPLATE È IN INGLESE Il debugging segue metodologie internazionali standardizzate. Error messages, stack traces, e debugging tools sono tipicamente in inglese. Mantenere il linguaggio tecnico originale facilita la correlazione con logs e documentazione tecnica.

Template

ASSISTENTE DEBUGGING - TROUBLESHOOTING SISTEMATICO

Prima di analizzare il problema, ho bisogno di queste informazioni per il contesto completo:

DOMANDE DI CHIARIMENTO:

1. **SPECIFICHE ERRORE**: Messaggio di errore esatto/sintomi? Quando è iniziato?
2. **AMBIENTE**: Production/staging/locale? Specifiche OS/browser/dispositivo?
3. **RIPRODUZIONE**: Passi per riprodurre? Costante o intermittente?
4. **CAMBIAMENTI RECENTI**: Deploy, modifiche config, migrazioni database?
5. **SCOPE IMPATTO**: Tutti gli utenti o subset specifico? Altri sistemi coinvolti?
6. **LOG DISPONIBILI**: Log degli errori, log applicazione, metriche di sistema?

Dopo le tue risposte, applicherò la metodologia Chain of Thought per il debugging:

METODOLOGIA DEBUGGING:

1. **Formazione Ipotesi**: Basata sui sintomi e il contesto raccolto
2. **Raccolta Evidenze**: Analisi log, review metriche, esecuzione test
3. **Analisi Causa Radice**: Eliminazione sistematica delle possibilità
4. **Validazione Soluzione**: Test della correzione in ambiente controllato
5. **Strategia Prevenzione**: Come evitare problemi simili in futuro

Descrivimi il tuo problema e inizierò con le domande di chiarimento.

Varianti

Per Performance Issues

Aggiungi: “Include profiling data analysis, bottleneck identification methodology, load testing approach, e scalability assessment framework.”

Per Distributed Systems

Aggiungi: “Focus su distributed tracing, service mesh analysis, network latency investigation, e cross-service dependency mapping.”

Per Data Corruption Issues

Aggiungi: “Include data integrity checks, backup validation, transaction log analysis, e data recovery planning methodology.”

Esempio di risultato

Input: “Production API intermittently returning 500 errors, ~5% of requests, started 3 days ago, no recent deployments”

Output AI (estratto):

DEBUGGING SESSION: Intermittent API 500 Errors

## PROBLEM CHARACTERIZATION

**EXACT SYMPTOMS:**
- HTTP 500 Internal Server Error on various API endpoints
- Frequency: ~5% of all requests (200-300 errors/hour)
- Pattern: No obvious correlation with specific endpoints or user types
- Duration: Started 72 hours ago, consistent rate since then

**SCOPE ANALYSIS:**
- Affected: All API endpoints, all user types
- NOT affected: Static content serving, background jobs
- Environment: Production only (staging shows no issues)
- Impact: ~2000 affected requests daily, customer complaints increasing

**TIMELINE CORRELATION:**
- Started: January 21, 2:30 PM UTC
- No deployments in past week
- Infrastructure changes: None reported
- External dependencies: Need to verify third-party service status

## INFORMATION GATHERING PRIORITIES

**CRITICAL DATA NEEDED:**
1. **Application logs**: Full stack traces for 500 errors
2. **System metrics**: CPU, memory, disk I/O during error spikes
3. **Database performance**: Query times, connection pool status
4. **External API calls**: Response times, failure rates from dependencies
5. **Load balancer logs**: Request distribution, backend health checks

**QUESTIONS TO INVESTIGATE:**
- Are errors evenly distributed across app servers?
- Any correlation with request payload size or complexity?
- Database connection pool exhaustion?
- Third-party API rate limiting or timeouts?

## HYPOTHESIS FORMATION

**HYPOTHESIS A: Database Connection Pool Exhaustion (70% probability)**
- Evidence: 5% error rate suggests resource limitation, not code bug
- Test: Monitor connection pool metrics, check for connection leaks
- Expected: Connection count spikes correlate with error occurrences

**HYPOTHESIS B: External API Timeout/Rate Limiting (60% probability)**  
- Evidence: Intermittent pattern suggests dependency issue
- Test: Analyze third-party API response times and error codes
- Expected: 500 errors correlate with upstream API failures

**HYPOTHESIS C: Memory Leak Causing GC Pressure (40% probability)**
- Evidence: Started 3 days ago, could be gradual memory accumulation
- Test: Memory usage trends, GC frequency and duration
- Expected: Memory usage trending upward, GC pauses during errors

**HYPOTHESIS D: Load Balancer Health Check Issues (30% probability)**
- Evidence: Random distribution could be failing backend servers
- Test: Load balancer logs, server health check status
- Expected: Specific backends intermittently marked unhealthy

## SYSTEMATIC INVESTIGATION PLAN

**PHASE 1 (Next 30 minutes): Quick Wins**
1. Check application logs for consistent stack trace patterns
2. Monitor database connection pool metrics in real-time
3. Verify external API status pages and response times
4. Review load balancer backend health status

**PHASE 2 (Next 60 minutes): Deep Analysis**
1. Correlation analysis: error timestamps vs system metrics
2. Database query performance analysis during error periods
3. Application memory usage trend analysis (past 5 days)
4. Network connectivity tests to external dependencies

**PHASE 3 (If needed): Advanced Investigation**
1. Enable detailed request tracing on subset of traffic
2. Application profiling during high error rate periods
3. Database query execution plan analysis
4. Full external dependency health check

## IMMEDIATE MITIGATION OPTIONS

**OPTION 1: Increase Resource Limits (Low Risk)**
- Increase database connection pool size
- Adjust application memory limits
- Impact: May reduce errors if resource exhaustion is cause

**OPTION 2: Circuit Breaker for External APIs (Medium Risk)**
- Implement timeouts and fallback behavior
- Graceful degradation for non-critical external calls
- Impact: Prevents cascade failures from dependencies

**OPTION 3: Load Balancer Adjustment (High Risk)**
- Adjust health check parameters
- Remove potentially failing backends temporarily
- Impact: May reduce capacity but improve success rate

## RESOLUTION TRACKING

**SUCCESS CRITERIA:**
- Error rate drops below 1% consistently for 4+ hours
- No new error patterns introduced
- System performance remains stable under normal load

**ROLLBACK TRIGGERS:**
- Error rate increases above current 5%
- New error types introduced
- System performance degradation (response time >2x baseline)

**MONITORING ALERTS TO ADD:**
- Database connection pool utilization >80%
- External API response time >5 seconds
- Application error rate >2% for 10+ minutes
- Memory usage trend upward >10% per day

Would you like me to start with Phase 1 investigation, or do you have additional context that might help prioritize the hypotheses?

Tips per risultati migliori

Comprehensive context: Fornisci error messages completi, timeline precisi, e system architecture details
Hypothesis prioritization: Rank theories per probability e impact per efficient investigation
Structured testing: Define clear success/failure criteria per ogni hypothesis test
Documentation focus: Capture learning per future similar issues e team knowledge sharing

Template correlati

prmpt.onl/201 - Architecture Review per system design context
prmpt.onl/202 - Code Review per code quality issues
prmpt.onl/101 - Chain of Thought per systematic reasoning

📖 Da “Prompt Engineering: Il Nuovo Skill” di Marco Milani
🔗 Tutti i template: prmpt.onl/index
📱 Accesso rapido: [QR Code]