Postmortem - November 10, 2025
Between October 31 and November 4, we experienced three separate webhook processing outages that disrupted GitHub webhook delivery to Cirun. This caused delays and failures in CI/CD pipeline triggers for our customers. Total affected time was approximately 7 hours across the three incidents.
What happened:
Our webhook processing system became unstable following infrastructure changes we deployed on October 30. The issues manifested as intermittent worker crashes, causing webhooks to queue up or fail entirely.
Root cause:
We made three simultaneous changes to our infrastructure:
Migrated our Redis cluster to new infrastructure
Upgraded Celery (our distributed task queue library) to the latest version
Switched worker concurrency model from pre-fork (multi-process) to gevent (single-process, green threads) for better memory efficiency
The gevent migration exposed critical bugs in Celery (#8030, #9091) that weren't caught in our testing. Under production load, workers would crash and fail to restart properly, leaving webhooks unprocessed.
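For context, the change amounted to switching the Celery worker pool. Below is a minimal sketch of the two configurations involved, assuming a standard Celery setup; the app name, broker URL, and concurrency values are illustrative, not our exact deployment.

```python
# Sketch of the worker pool change (illustrative names and values only).
from celery import Celery

app = Celery(
    "cirun_webhooks",                       # hypothetical app name
    broker="redis://redis.internal:6379/0",  # hypothetical broker URL
)

# Pre-fork (multi-process) worker -- the stable configuration we rolled back to:
#   celery -A cirun_webhooks worker --pool=prefork --concurrency=8
#
# Gevent (green-thread, single-process) worker -- the configuration that
# exposed the Celery bugs under production load:
#   celery -A cirun_webhooks worker --pool=gevent --concurrency=200
```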
Why it happened multiple times:
Our initial fixes on October 31 were partial rollbacks that didn't address the core issue. We were trying to keep some performance improvements while fixing stability, which backfired. We should have done a complete rollback immediately.
Final resolution:
Complete rollback to pre-fork worker model
Reverted Celery to previous stable version
Isolated the Redis migration as a separate change (which wasn't causing issues)
Improved worker health checks and automatic restart mechanisms
Implemented better alerting for webhook queue depth
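The queue-depth alerting is a simple periodic check. Here is a minimal sketch, assuming Celery's default Redis transport (where pending tasks sit in a Redis list keyed by the queue name); the queue name, threshold, and paging hook are illustrative, not our exact configuration.

```python
# Sketch of a webhook queue-depth check (names and threshold are illustrative).
import redis

QUEUE_NAME = "webhooks"   # hypothetical Celery queue name
DEPTH_THRESHOLD = 500     # hypothetical alert threshold

def page_on_call(message: str) -> None:
    # Placeholder for the actual paging integration.
    print(f"ALERT: {message}")

def check_webhook_queue_depth() -> None:
    r = redis.Redis(host="redis.internal", port=6379, db=0)
    # With Celery's default Redis transport, pending tasks are stored in a
    # Redis list named after the queue, so LLEN gives the backlog size.
    depth = r.llen(QUEUE_NAME)
    if depth > DEPTH_THRESHOLD:
        page_on_call(f"Webhook queue depth is {depth} (threshold {DEPTH_THRESHOLD})")
```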
What we're doing differently:
No more simultaneous infrastructure changes - one at a time, with proper bake time
Production-scale load testing before any concurrency model changes
Faster rollback decision-making - don't try to fix forward during an active incident
Better status page messaging that explains customer impact, not just system status
Automatic escalation when webhook processing falls behind by more than 10 minutes
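The escalation rule itself is a lag check against the oldest unprocessed webhook. A minimal sketch, with a hypothetical timestamp field and escalation hook:

```python
# Sketch of the 10-minute escalation rule (field names are hypothetical).
import time
from typing import Optional

ESCALATION_LAG_SECONDS = 10 * 60  # escalate once processing falls 10+ minutes behind

def should_escalate(oldest_enqueued_at: float, now: Optional[float] = None) -> bool:
    """Return True when the oldest pending webhook has waited more than 10 minutes."""
    now = time.time() if now is None else now
    return (now - oldest_enqueued_at) > ESCALATION_LAG_SECONDS
```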
This was our worst operational week in 5 years. We apologize for the inconvenience caused. We've learned from this and have made changes to ensure it doesn't happen again.
If you have questions about this incident or how it affected your workflows, reach out to us directly.