Debugging Complex Problems: Systematic Troubleshooting
Source: Chapter 6 - Developer & Tech Sector
Category: Specialist Domains
Level: Advanced
URL: prmpt.onl/203
When to use it
For systematic debugging of complex problems that require methodical investigation and root cause analysis. It turns chaotic debug sessions into structured processes with effective resolution.
Ideal for:
- Production bugs that are hard to reproduce
- Performance issues with multiple potential causes
- Distributed/multi-component system failures
- Intermittent problems that appear random
💡 WHY THIS TEMPLATE IS IN ENGLISH Debugging follows standardized international methodologies. Error messages, stack traces, and debugging tools are typically in English. Keeping the original technical language makes it easier to correlate with logs and technical documentation.
Template
SENIOR DEBUGGING SPECIALIST
Problem context: [system/application type, environment, criticality]
Issue description: [what's happening vs expected behavior]
Business impact: [user impact, revenue impact, urgency level]
Available resources: [logs, monitoring, reproduction steps]
SYSTEMATIC DEBUGGING FRAMEWORK:
## 1. PROBLEM CHARACTERIZATION
- Exact symptom description and error manifestation
- Frequency pattern: consistent, intermittent, specific conditions
- Scope analysis: affected users, components, environments
- Timeline: when started, pattern changes, related deployments
- Success/failure boundaries: what works vs what doesn't
## 2. INFORMATION GATHERING
- Error logs and stack traces analysis
- System metrics during problem occurrence
- Recent changes: code deployments, configuration, infrastructure
- Environmental factors: load, network, dependencies
- User behavior patterns when issue occurs
## 3. HYPOTHESIS FORMATION
Generate top 3-5 most probable root causes:
- Hypothesis A: [specific cause] - Evidence: [supporting data]
- Hypothesis B: [alternative cause] - Evidence: [indicators]
- Hypothesis C: [third possibility] - Evidence: [clues]
Priority ranking based on probability and evidence strength
## 4. SYSTEMATIC INVESTIGATION
For each hypothesis, define:
- Specific tests to validate/eliminate the hypothesis
- Required tools, access, or reproduction steps
- Expected results that confirm/deny the theory
- Time estimate for investigation
- Risk assessment of testing approach
## 5. RESOLUTION STRATEGY
Once root cause identified:
- Immediate mitigation to reduce impact
- Permanent fix implementation plan
- Testing strategy to validate resolution
- Rollback plan if fix creates new issues
- Prevention measures for future occurrences
## 6. KNOWLEDGE CAPTURE
- Problem summary and resolution documentation
- Lessons learned and process improvements
- Monitoring alerts to catch similar issues early
- Team knowledge sharing recommendations
Focus on a methodical approach rather than random trial and error.
Variants
For Performance Issues
Add: “Include profiling data analysis, bottleneck identification methodology, load testing approach, and scalability assessment framework.”
For Distributed Systems
Add: “Focus on distributed tracing, service mesh analysis, network latency investigation, and cross-service dependency mapping.”
For Data Corruption Issues
Add: “Include data integrity checks, backup validation, transaction log analysis, and data recovery planning methodology.”
Example result
Input: “Production API intermittently returning 500 errors, ~5% of requests, started 3 days ago, no recent deployments”
AI output (excerpt):
DEBUGGING SESSION: Intermittent API 500 Errors
## PROBLEM CHARACTERIZATION
**EXACT SYMPTOMS:**
- HTTP 500 Internal Server Error on various API endpoints
- Frequency: ~5% of all requests (200-300 errors/hour)
- Pattern: No obvious correlation with specific endpoints or user types
- Duration: Started 72 hours ago, consistent rate since then
**SCOPE ANALYSIS:**
- Affected: All API endpoints, all user types
- NOT affected: Static content serving, background jobs
- Environment: Production only (staging shows no issues)
- Impact: roughly 5,000-7,000 failed requests per day (consistent with 200-300 errors/hour), customer complaints increasing
**TIMELINE CORRELATION:**
- Started: January 21, 2:30 PM UTC
- No deployments in past week
- Infrastructure changes: None reported
- External dependencies: Need to verify third-party service status
## INFORMATION GATHERING PRIORITIES
**CRITICAL DATA NEEDED:**
1. **Application logs**: Full stack traces for 500 errors
2. **System metrics**: CPU, memory, disk I/O during error spikes
3. **Database performance**: Query times, connection pool status
4. **External API calls**: Response times, failure rates from dependencies
5. **Load balancer logs**: Request distribution, backend health checks
**QUESTIONS TO INVESTIGATE:**
- Are errors evenly distributed across app servers?
- Any correlation with request payload size or complexity?
- Database connection pool exhaustion?
- Third-party API rate limiting or timeouts?
## HYPOTHESIS FORMATION
**HYPOTHESIS A: Database Connection Pool Exhaustion (70% probability)**
- Evidence: 5% error rate suggests a resource limitation, not a code bug
- Test: Monitor connection pool metrics, check for connection leaks
- Expected: Connection count spikes correlate with error occurrences
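Purely as an illustration of how Hypothesis A could be tested, the sketch below shows instrumentation you might add inside the service itself (each process owns its pool, so an external script would see nothing). It assumes SQLAlchemy with the default QueuePool; the function name and logging interval are placeholders.

```python
# Hypothetical in-process probe for the connection-pool-exhaustion hypothesis.
# Assumes SQLAlchemy with the default QueuePool; adapt to your actual stack.
import logging
import threading
import time

from sqlalchemy.engine import Engine

log = logging.getLogger("pool-probe")

def start_pool_probe(engine: Engine, interval_s: float = 5.0) -> None:
    """Periodically log pool usage; sustained checked_out at size + max_overflow
    while requests queue up supports the exhaustion hypothesis."""
    def _probe() -> None:
        while True:
            pool = engine.pool
            log.warning(
                "pool checked_out=%s size=%s overflow=%s",
                pool.checkedout(), pool.size(), pool.overflow(),
            )
            time.sleep(interval_s)

    threading.Thread(target=_probe, daemon=True, name="pool-probe").start()
```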
**HYPOTHESIS B: External API Timeout/Rate Limiting (60% probability)**
- Evidence: Intermittent pattern suggests dependency issue
- Test: Analyze third-party API response times and error codes
- Expected: 500 errors correlate with upstream API failures
**HYPOTHESIS C: Memory Leak Causing GC Pressure (40% probability)**
- Evidence: Started 3 days ago, could be gradual memory accumulation
- Test: Memory usage trends, GC frequency and duration
- Expected: Memory usage trending upward, GC pauses during errors
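If the service happens to be a Python process, the GC-pressure check can be as simple as the sampler sketched below, run inside the service (it needs in-process access to GC counters and uses the third-party psutil package). For a JVM or .NET service the same evidence comes from GC logs, JMX, or dotnet-counters instead.

```python
# Illustrative memory/GC sampler for a Python service (hypothetical stack).
# Requires the third-party psutil package; run inside the service process.
import gc
import time

import psutil

def sample_memory(interval_s: float = 60.0, samples: int = 30) -> None:
    """Print current RSS and GC collection counts; a steady upward RSS trend
    with rising gen-2 collections supports the leak/GC-pressure theory."""
    process = psutil.Process()
    for _ in range(samples):
        rss_mb = process.memory_info().rss / (1024 * 1024)
        gen0, gen1, gen2 = (s["collections"] for s in gc.get_stats())
        print(f"{time.strftime('%H:%M:%S')} rss_mb={rss_mb:.1f} "
              f"gc_collections=({gen0}, {gen1}, {gen2})")
        time.sleep(interval_s)
```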
**HYPOTHESIS D: Load Balancer Health Check Issues (30% probability)**
- Evidence: Random distribution could be failing backend servers
- Test: Load balancer logs, server health check status
- Expected: Specific backends intermittently marked unhealthy
## SYSTEMATIC INVESTIGATION PLAN
**PHASE 1 (Next 30 minutes): Quick Wins**
1. Check application logs for consistent stack trace patterns
2. Monitor database connection pool metrics in real-time
3. Verify external API status pages and response times
4. Review load balancer backend health status
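For step 1 above, a quick way to see whether the 500s share a single failure signature is to group stack traces from the error log. A minimal sketch, assuming JSON-lines application logs with `level` and `stack_trace` fields (the field names and file name are assumptions; adapt them to your log format):

```python
# Hypothetical log-grouping helper: counts distinct stack-trace signatures
# in a JSON-lines error log. Field names are assumptions; adapt as needed.
import json
from collections import Counter
from pathlib import Path

def top_error_signatures(log_path: str, limit: int = 5) -> list[tuple[str, int]]:
    """Group ERROR entries by the last frame of their stack trace."""
    signatures: Counter[str] = Counter()
    for line in Path(log_path).read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") != "ERROR":
            continue
        trace = entry.get("stack_trace", "")
        # Use the final frame as a cheap grouping key.
        signature = trace.strip().splitlines()[-1] if trace else "<no trace>"
        signatures[signature] += 1
    return signatures.most_common(limit)

if __name__ == "__main__":
    for sig, count in top_error_signatures("app-error.log"):
        print(f"{count:6d}  {sig}")
```

If one signature dominates, you are most likely chasing a single root cause; many unrelated signatures point toward an infrastructure-level suspect such as Hypothesis A or D.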
**PHASE 2 (Next 60 minutes): Deep Analysis**
1. Correlation analysis: error timestamps vs system metrics
2. Database query performance analysis during error periods
3. Application memory usage trend analysis (past 5 days)
4. Network connectivity tests to external dependencies
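The correlation analysis in step 1 of Phase 2 can start as simply as bucketing errors and a candidate metric into the same time windows. An illustrative sketch using pandas; the CSV file and column names are assumptions about what your monitoring system can export:

```python
# Illustrative correlation check: per-minute error counts vs a system metric.
# File names and column names are assumptions about your monitoring export.
import pandas as pd

def error_metric_correlation(errors_csv: str, metric_csv: str) -> float:
    """Return the Pearson correlation between per-minute error counts
    and the per-minute mean of a metric (e.g., db_connections_in_use)."""
    errors = pd.read_csv(errors_csv, parse_dates=["timestamp"])
    metric = pd.read_csv(metric_csv, parse_dates=["timestamp"])

    error_counts = errors.set_index("timestamp").resample("1min").size()
    metric_mean = metric.set_index("timestamp")["value"].resample("1min").mean()

    joined = pd.concat(
        {"errors": error_counts, "metric": metric_mean}, axis=1
    ).dropna()
    return joined["errors"].corr(joined["metric"])

if __name__ == "__main__":
    print(error_metric_correlation("api_500_errors.csv", "db_connections.csv"))
```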
**PHASE 3 (If needed): Advanced Investigation**
1. Enable detailed request tracing on subset of traffic
2. Application profiling during high error rate periods
3. Database query execution plan analysis
4. Full external dependency health check
## IMMEDIATE MITIGATION OPTIONS
**OPTION 1: Increase Resource Limits (Low Risk)**
- Increase database connection pool size
- Adjust application memory limits
- Impact: May reduce errors if resource exhaustion is cause
**OPTION 2: Circuit Breaker for External APIs (Medium Risk)**
- Implement timeouts and fallback behavior
- Graceful degradation for non-critical external calls
- Impact: Prevents cascade failures from dependencies
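Option 2 can be prototyped without new infrastructure as a small in-process circuit breaker around non-critical outbound calls. A minimal sketch, not a production-ready implementation (no thread safety or half-open call limits; thresholds and the fallback are placeholders):

```python
# Minimal illustrative circuit breaker for outbound calls (not production-grade:
# no thread safety, no half-open probing limits; thresholds are placeholders).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failure_count = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, fallback=None, **kwargs):
        # While open, short-circuit to the fallback until the reset timeout passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self.failure_count = 0
            return fallback
        self.failure_count = 0
        return result
```

Wrapping only non-critical calls keeps the degradation graceful: critical paths still fail loudly instead of silently returning the fallback.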
**OPTION 3: Load Balancer Adjustment (High Risk)**
- Adjust health check parameters
- Remove potentially failing backends temporarily
- Impact: May reduce capacity but improve success rate
## RESOLUTION TRACKING
**SUCCESS CRITERIA:**
- Error rate drops below 1% consistently for 4+ hours
- No new error patterns introduced
- System performance remains stable under normal load
**ROLLBACK TRIGGERS:**
- Error rate increases above current 5%
- New error types introduced
- System performance degradation (response time >2x baseline)
**MONITORING ALERTS TO ADD:**
- Database connection pool utilization >80%
- External API response time >5 seconds
- Application error rate >2% for 10+ minutes
- Memory usage trend upward >10% per day
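In a real setup these alerts would be written in your monitoring system's own rule language (Prometheus, Datadog, and so on). Purely as an illustration, the third alert ("error rate >2% for 10+ minutes") translates to a sliding-window check like this, with the window size and threshold mirroring the values above:

```python
# Illustrative sliding-window check for "error rate >2% for 10+ minutes";
# real alerts belong in your monitoring system's rule language.
from collections import deque
from dataclasses import dataclass

@dataclass
class MinuteSample:
    requests: int
    errors: int

def error_rate_alert(samples: list[MinuteSample],
                     threshold: float = 0.02,
                     window_minutes: int = 10) -> bool:
    """Fire when every minute in the trailing window exceeds the threshold."""
    window: deque[MinuteSample] = deque(maxlen=window_minutes)
    for sample in samples:
        window.append(sample)
        if len(window) == window_minutes and all(
            s.errors / max(s.requests, 1) > threshold for s in window
        ):
            return True
    return False
```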
Would you like me to start with Phase 1 investigation, or do you have additional context that might help prioritize the hypotheses?
Tips for better results
- Comprehensive context: Provide complete error messages, precise timelines, and system architecture details
- Hypothesis prioritization: Rank theories by probability and impact for efficient investigation
- Structured testing: Define clear success/failure criteria for each hypothesis test
- Documentation focus: Capture learnings for future similar issues and team knowledge sharing
Related templates
- prmpt.onl/201 - Architecture Review for system design context
- prmpt.onl/202 - Code Review for code quality issues
- prmpt.onl/101 - Chain of Thought for systematic reasoning
📖 From “Prompt Engineering: Il Nuovo Skill” by Marco Milani
🔗 All templates: prmpt.onl/index
📱 Quick access: [QR Code]
Last updated: January 24, 2025