Postmortem - November 10, 2025
Between October 31 and November 4, we experienced three separate webhook processing outages that disrupted GitHub webhook delivery to Cirun. This caused delays and failures in CI/CD pipeline triggers for our customers. Total affected time was approximately 7 hours across the three incidents.
What happened:
Our webhook processing system became unstable following infrastructure changes we deployed on October 30. The issues manifested as intermittent worker crashes, causing webhooks to queue up or fail entirely.
Root cause:
We made three simultaneous changes to our infrastructure:
Migrated our Redis cluster to new infrastructure
Upgraded Celery (our distributed task queue library) to the latest version
Switched worker concurrency model from pre-fork (multi-process) to gevent (single-process, green threads) for better memory efficiency
The gevent migration exposed critical bugs in Celery (#8030, #9091) that weren't caught in our testing. Under production load, workers would crash and fail to restart properly, leaving webhooks unprocessed.
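For context, the change amounted to switching the Celery worker pool. Below is a minimal sketch of the two configurations involved, assuming a standard Celery setup; the app name, broker URL, and concurrency values are illustrative, not our exact deployment.

```python
# Sketch of the worker pool change (illustrative names and values only).
from celery import Celery

app = Celery(
    "cirun_webhooks",                       # hypothetical app name
    broker="redis://redis.internal:6379/0",  # hypothetical broker URL
)

# Pre-fork (multi-process) worker -- the stable configuration we rolled back to:
#   celery -A cirun_webhooks worker --pool=prefork --concurrency=8
#
# Gevent (green-thread, single-process) worker -- the configuration that
# exposed the Celery bugs under production load:
#   celery -A cirun_webhooks worker --pool=gevent --concurrency=200
```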
Why it happened multiple times:
Our initial fixes on October 31 were partial rollbacks that didn't address the core issue. We were trying to keep some performance improvements while fixing stability, which backfired. We should have done a complete rollback immediately.
Final resolution:
Complete rollback to pre-fork worker model
Reverted Celery to previous stable version
Isolated the Redis migration as a separate change (which wasn't causing issues)
Improved worker health checks and automatic restart mechanisms
Implemented better alerting for webhook queue depth
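The queue-depth alerting is a simple periodic check. Here is a minimal sketch, assuming Celery's default Redis transport (where pending tasks sit in a Redis list keyed by the queue name); the queue name, threshold, and paging hook are illustrative, not our exact configuration.

```python
# Sketch of a webhook queue-depth check (names and threshold are illustrative).
import redis

QUEUE_NAME = "webhooks"   # hypothetical Celery queue name
DEPTH_THRESHOLD = 500     # hypothetical alert threshold

def page_on_call(message: str) -> None:
    # Placeholder for the actual paging integration.
    print(f"ALERT: {message}")

def check_webhook_queue_depth() -> None:
    r = redis.Redis(host="redis.internal", port=6379, db=0)
    # With Celery's default Redis transport, pending tasks are stored in a
    # Redis list named after the queue, so LLEN gives the backlog size.
    depth = r.llen(QUEUE_NAME)
    if depth > DEPTH_THRESHOLD:
        page_on_call(f"Webhook queue depth is {depth} (threshold {DEPTH_THRESHOLD})")
```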
What we're doing differently:
No more simultaneous infrastructure changes - one at a time, with proper bake time
Production-scale load testing before any concurrency model changes
Faster rollback decision-making - don't try to fix forward during an active incident
Better status page messaging that explains customer impact, not just system status
Automatic escalation when webhook processing falls behind by more than 10 minutes
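The escalation rule itself is a lag check against the oldest unprocessed webhook. A minimal sketch, with a hypothetical timestamp field and escalation hook:

```python
# Sketch of the 10-minute escalation rule (field names are hypothetical).
import time
from typing import Optional

ESCALATION_LAG_SECONDS = 10 * 60  # escalate once processing falls 10+ minutes behind

def should_escalate(oldest_enqueued_at: float, now: Optional[float] = None) -> bool:
    """Return True when the oldest pending webhook has waited more than 10 minutes."""
    now = time.time() if now is None else now
    return (now - oldest_enqueued_at) > ESCALATION_LAG_SECONDS
```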
This was our worst operational week in 5 years. We apologize for the inconvenience caused. We've learned from this and have made changes to ensure it doesn't happen again.
If you have questions about this incident or how it affected your workflows, reach out to us directly.