Debugging Complex Problems: Systematic Troubleshooting
Source: Chapter 6 - Developer & Tech Sector
Category: Specialist Domains
Level: Advanced
URL: prmpt.onl/203
When to use it
For systematic debugging of complex problems that require methodical investigation and root cause analysis. It turns chaotic debug sessions into structured processes with effective resolution.
Ideal for:
- Production bugs that are hard to reproduce
- Performance issues with multiple potential causes
- Distributed/multi-component system failures
- Intermittent problems that appear random
💡 WHY THIS TEMPLATE IS IN ENGLISH Debugging follows standardized international methodologies. Error messages, stack traces, and debugging tools are typically in English. Keeping the original technical language makes it easier to correlate with logs and technical documentation.
Template
SENIOR DEBUGGING SPECIALIST
Problem context: [system/application type, environment, criticality]
Issue description: [what's happening vs expected behavior]
Business impact: [user impact, revenue impact, urgency level]
Available resources: [logs, monitoring, reproduction steps]
SYSTEMATIC DEBUGGING FRAMEWORK:
## 1. PROBLEM CHARACTERIZATION
- Exact symptom description and error manifestation
- Frequency pattern: consistent, intermittent, specific conditions
- Scope analysis: affected users, components, environments
- Timeline: when started, pattern changes, related deployments
- Success/failure boundaries: what works vs what doesn't
## 2. INFORMATION GATHERING
- Error logs and stack traces analysis
- System metrics during problem occurrence
- Recent changes: code deployments, configuration, infrastructure
- Environmental factors: load, network, dependencies
- User behavior patterns when issue occurs
## 3. HYPOTHESIS FORMATION
Generate top 3-5 most probable root causes:
- Hypothesis A: [specific cause] - Evidence: [supporting data]
- Hypothesis B: [alternative cause] - Evidence: [indicators]
- Hypothesis C: [third possibility] - Evidence: [clues]
Priority ranking based on probability and evidence strength
## 4. SYSTEMATIC INVESTIGATION
For each hypothesis, define:
- Specific tests to validate/eliminate the hypothesis
- Required tools, access, or reproduction steps
- Expected results that confirm/deny the theory
- Time estimate for investigation
- Risk assessment of testing approach
## 5. RESOLUTION STRATEGY
Once root cause identified:
- Immediate mitigation to reduce impact
- Permanent fix implementation plan
- Testing strategy to validate resolution
- Rollback plan if fix creates new issues
- Prevention measures for future occurrences
## 6. KNOWLEDGE CAPTURE
- Problem summary and resolution documentation
- Lessons learned and process improvements
- Monitoring alerts to catch similar issues early
- Team knowledge sharing recommendations
Focus on a methodical approach rather than random trial and error.
Variants
For Performance Issues
Add: “Include profiling data analysis, bottleneck identification methodology, load testing approach, and scalability assessment framework.”
For Distributed Systems
Add: “Focus on distributed tracing, service mesh analysis, network latency investigation, and cross-service dependency mapping.”
For Data Corruption Issues
Add: “Include data integrity checks, backup validation, transaction log analysis, and data recovery planning methodology.”
Example result
Input: “Production API intermittently returning 500 errors, ~5% of requests, started 3 days ago, no recent deployments”
AI output (excerpt):
DEBUGGING SESSION: Intermittent API 500 Errors
## PROBLEM CHARACTERIZATION
**EXACT SYMPTOMS:**
- HTTP 500 Internal Server Error on various API endpoints
- Frequency: ~5% of all requests (200-300 errors/hour)
- Pattern: No obvious correlation with specific endpoints or user types
- Duration: Started 72 hours ago, consistent rate since then
**SCOPE ANALYSIS:**
- Affected: All API endpoints, all user types
- NOT affected: Static content serving, background jobs
- Environment: Production only (staging shows no issues)
- Impact: roughly 5,000-7,000 failed requests per day (consistent with 200-300 errors/hour), customer complaints increasing
**TIMELINE CORRELATION:**
- Started: January 21, 2:30 PM UTC
- No deployments in past week
- Infrastructure changes: None reported
- External dependencies: Need to verify third-party service status
## INFORMATION GATHERING PRIORITIES
**CRITICAL DATA NEEDED:**
1. **Application logs**: Full stack traces for 500 errors
2. **System metrics**: CPU, memory, disk I/O during error spikes
3. **Database performance**: Query times, connection pool status
4. **External API calls**: Response times, failure rates from dependencies
5. **Load balancer logs**: Request distribution, backend health checks
**QUESTIONS TO INVESTIGATE:**
- Are errors evenly distributed across app servers?
- Any correlation with request payload size or complexity?
- Database connection pool exhaustion?
- Third-party API rate limiting or timeouts?
## HYPOTHESIS FORMATION
**HYPOTHESIS A: Database Connection Pool Exhaustion (70% probability)**
- Evidence: 5% error rate suggests a resource limitation, not a code bug
- Test: Monitor connection pool metrics, check for connection leaks
- Expected: Connection count spikes correlate with error occurrences
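Purely as an illustration of how Hypothesis A could be tested, the sketch below shows instrumentation you might add inside the service itself (each process owns its pool, so an external script would see nothing). It assumes SQLAlchemy with the default QueuePool; the function name and logging interval are placeholders.

```python
# Hypothetical in-process probe for the connection-pool-exhaustion hypothesis.
# Assumes SQLAlchemy with the default QueuePool; adapt to your actual stack.
import logging
import threading
import time

from sqlalchemy.engine import Engine

log = logging.getLogger("pool-probe")

def start_pool_probe(engine: Engine, interval_s: float = 5.0) -> None:
    """Periodically log pool usage; sustained checked_out at size + max_overflow
    while requests queue up supports the exhaustion hypothesis."""
    def _probe() -> None:
        while True:
            pool = engine.pool
            log.warning(
                "pool checked_out=%s size=%s overflow=%s",
                pool.checkedout(), pool.size(), pool.overflow(),
            )
            time.sleep(interval_s)

    threading.Thread(target=_probe, daemon=True, name="pool-probe").start()
```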
**HYPOTHESIS B: External API Timeout/Rate Limiting (60% probability)**
- Evidence: Intermittent pattern suggests dependency issue
- Test: Analyze third-party API response times and error codes
- Expected: 500 errors correlate with upstream API failures
**HYPOTHESIS C: Memory Leak Causing GC Pressure (40% probability)**
- Evidence: Started 3 days ago, could be gradual memory accumulation
- Test: Memory usage trends, GC frequency and duration
- Expected: Memory usage trending upward, GC pauses during errors
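If the service happens to be a Python process, the GC-pressure check can be as simple as the sampler sketched below, run inside the service (it needs in-process access to GC counters and uses the third-party psutil package). For a JVM or .NET service the same evidence comes from GC logs, JMX, or dotnet-counters instead.

```python
# Illustrative memory/GC sampler for a Python service (hypothetical stack).
# Requires the third-party psutil package; run inside the service process.
import gc
import time

import psutil

def sample_memory(interval_s: float = 60.0, samples: int = 30) -> None:
    """Print current RSS and GC collection counts; a steady upward RSS trend
    with rising gen-2 collections supports the leak/GC-pressure theory."""
    process = psutil.Process()
    for _ in range(samples):
        rss_mb = process.memory_info().rss / (1024 * 1024)
        gen0, gen1, gen2 = (s["collections"] for s in gc.get_stats())
        print(f"{time.strftime('%H:%M:%S')} rss_mb={rss_mb:.1f} "
              f"gc_collections=({gen0}, {gen1}, {gen2})")
        time.sleep(interval_s)
```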
**HYPOTHESIS D: Load Balancer Health Check Issues (30% probability)**
- Evidence: Random distribution could be failing backend servers
- Test: Load balancer logs, server health check status
- Expected: Specific backends intermittently marked unhealthy
## SYSTEMATIC INVESTIGATION PLAN
**PHASE 1 (Next 30 minutes): Quick Wins**
1. Check application logs for consistent stack trace patterns
2. Monitor database connection pool metrics in real-time
3. Verify external API status pages and response times
4. Review load balancer backend health status
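For step 1 above, a quick way to see whether the 500s share a single failure signature is to group stack traces from the error log. A minimal sketch, assuming JSON-lines application logs with `level` and `stack_trace` fields (the field names and file name are assumptions; adapt them to your log format):

```python
# Hypothetical log-grouping helper: counts distinct stack-trace signatures
# in a JSON-lines error log. Field names are assumptions; adapt as needed.
import json
from collections import Counter
from pathlib import Path

def top_error_signatures(log_path: str, limit: int = 5) -> list[tuple[str, int]]:
    """Group ERROR entries by the last frame of their stack trace."""
    signatures: Counter[str] = Counter()
    for line in Path(log_path).read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") != "ERROR":
            continue
        trace = entry.get("stack_trace", "")
        # Use the final frame as a cheap grouping key.
        signature = trace.strip().splitlines()[-1] if trace else "<no trace>"
        signatures[signature] += 1
    return signatures.most_common(limit)

if __name__ == "__main__":
    for sig, count in top_error_signatures("app-error.log"):
        print(f"{count:6d}  {sig}")
```

If one signature dominates, you are most likely chasing a single root cause; many unrelated signatures point toward an infrastructure-level suspect such as Hypothesis A or D.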
**PHASE 2 (Next 60 minutes): Deep Analysis**
1. Correlation analysis: error timestamps vs system metrics
2. Database query performance analysis during error periods
3. Application memory usage trend analysis (past 5 days)
4. Network connectivity tests to external dependencies
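The correlation analysis in step 1 of Phase 2 can start as simply as bucketing errors and a candidate metric into the same time windows. An illustrative sketch using pandas; the CSV file and column names are assumptions about what your monitoring system can export:

```python
# Illustrative correlation check: per-minute error counts vs a system metric.
# File names and column names are assumptions about your monitoring export.
import pandas as pd

def error_metric_correlation(errors_csv: str, metric_csv: str) -> float:
    """Return the Pearson correlation between per-minute error counts
    and the per-minute mean of a metric (e.g., db_connections_in_use)."""
    errors = pd.read_csv(errors_csv, parse_dates=["timestamp"])
    metric = pd.read_csv(metric_csv, parse_dates=["timestamp"])

    error_counts = errors.set_index("timestamp").resample("1min").size()
    metric_mean = metric.set_index("timestamp")["value"].resample("1min").mean()

    joined = pd.concat(
        {"errors": error_counts, "metric": metric_mean}, axis=1
    ).dropna()
    return joined["errors"].corr(joined["metric"])

if __name__ == "__main__":
    print(error_metric_correlation("api_500_errors.csv", "db_connections.csv"))
```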
**PHASE 3 (If needed): Advanced Investigation**
1. Enable detailed request tracing on subset of traffic
2. Application profiling during high error rate periods
3. Database query execution plan analysis
4. Full external dependency health check
## IMMEDIATE MITIGATION OPTIONS
**OPTION 1: Increase Resource Limits (Low Risk)**
- Increase database connection pool size
- Adjust application memory limits
- Impact: May reduce errors if resource exhaustion is cause
**OPTION 2: Circuit Breaker for External APIs (Medium Risk)**
- Implement timeouts and fallback behavior
- Graceful degradation for non-critical external calls
- Impact: Prevents cascade failures from dependencies
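Option 2 can be prototyped without new infrastructure as a small in-process circuit breaker around non-critical outbound calls. A minimal sketch, not a production-ready implementation (no thread safety or half-open call limits; thresholds and the fallback are placeholders):

```python
# Minimal illustrative circuit breaker for outbound calls (not production-grade:
# no thread safety, no half-open probing limits; thresholds are placeholders).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, fallback=None, **kwargs):
        # While open, short-circuit to the fallback until the reset timeout passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self.failure_count = 0
            return fallback
        self.failure_count = 0
        return result
```

Wrapping only non-critical calls keeps the degradation graceful: critical paths still fail loudly instead of silently returning the fallback.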
**OPTION 3: Load Balancer Adjustment (High Risk)**
- Adjust health check parameters
- Remove potentially failing backends temporarily
- Impact: May reduce capacity but improve success rate
## RESOLUTION TRACKING
**SUCCESS CRITERIA:**
- Error rate drops below 1% consistently for 4+ hours
- No new error patterns introduced
- System performance remains stable under normal load
**ROLLBACK TRIGGERS:**
- Error rate increases above current 5%
- New error types introduced
- System performance degradation (response time >2x baseline)
**MONITORING ALERTS TO ADD:**
- Database connection pool utilization >80%
- External API response time >5 seconds
- Application error rate >2% for 10+ minutes
- Memory usage trend upward >10% per day
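In a real setup these alerts would be written in your monitoring system's own rule language (Prometheus, Datadog, and so on). Purely as an illustration, the third alert ("error rate >2% for 10+ minutes") translates to a sliding-window check like this, with the window size and threshold mirroring the values above:

```python
# Illustrative sliding-window check for "error rate >2% for 10+ minutes";
# real alerts belong in your monitoring system's rule language.
from collections import deque
from dataclasses import dataclass

@dataclass
class MinuteSample:
    requests: int
    errors: int

def error_rate_alert(samples: list[MinuteSample],
                     threshold: float = 0.02,
                     window_minutes: int = 10) -> bool:
    """Fire when every minute in the trailing window exceeds the threshold."""
    window: deque[MinuteSample] = deque(maxlen=window_minutes)
    for sample in samples:
        window.append(sample)
        if len(window) == window_minutes and all(
            s.errors / max(s.requests, 1) > threshold for s in window
        ):
            return True
    return False
```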
Would you like me to start with Phase 1 investigation, or do you have additional context that might help prioritize the hypotheses?
Tips for better results
- Comprehensive context: Provide complete error messages, precise timelines, and system architecture details
- Hypothesis prioritization: Rank theories by probability and impact for efficient investigation
- Structured testing: Define clear success/failure criteria for each hypothesis test
- Documentation focus: Capture learnings for future similar issues and team knowledge sharing
Related templates
- prmpt.onl/201 - Architecture Review for system design context
- prmpt.onl/202 - Code Review for code quality issues
- prmpt.onl/101 - Chain of Thought for systematic reasoning
📖 From “Prompt Engineering: Il Nuovo Skill” by Marco Milani
🔗 All templates: prmpt.onl/index
📱 Quick access: [QR Code]
Last updated: January 24, 2025