Cirun.io - Notice history

Website - Operational

Uptime: Oct 2025 · 99.97% · Nov 2025 · 100.0% · Dec 2025 · 100.0%

Webhook - Operational

Uptime: Oct 2025 · 99.22% · Nov 2025 · 99.86% · Dec 2025 · 100.0%

Notice history

Dec 2025

No notices reported this month

Nov 2025

Webhook Processing Outages - October 31 - November 4, 2025
  • Postmortem

    Postmortem - November 10, 2025

    Between October 31 and November 4, we experienced three separate webhook processing outages that disrupted GitHub webhook delivery to Cirun. This caused delays and failures in CI/CD pipeline triggers for our customers. The total affected time was approximately 7 hours across the three incidents.

    What happened:

    Our webhook processing system became unstable following infrastructure changes we deployed on October 30. The issues manifested as intermittent worker crashes, causing webhooks to queue up or fail entirely.

    Root cause:

    We made three simultaneous changes to our infrastructure:

    1. Migrated our Redis cluster to new infrastructure

    2. Upgraded Celery (our distributed task queue library) to the latest version

    3. Switched worker concurrency model from pre-fork (multi-process) to gevent (single-process, green threads) for better memory efficiency

    The gevent migration exposed critical bugs in Celery (#8030, #9091) that weren't caught in our testing. Under production load, workers would crash and fail to restart properly, leaving webhooks unprocessed.
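
    To illustrate the concurrency change, here is a minimal sketch of the kind of Celery setup involved. The module, task, and queue names are hypothetical; only the --pool switch at worker start-up reflects the prefork-versus-gevent change described above.

      # app.py - hypothetical webhook processing app on a Redis broker
      from celery import Celery

      app = Celery(
          "webhooks",
          broker="redis://localhost:6379/0",   # Redis as the message broker
          backend="redis://localhost:6379/1",  # optional result backend
      )

      @app.task(bind=True, max_retries=3, acks_late=True)
      def process_webhook(self, delivery_id, payload):
          """Handle one GitHub webhook delivery (hypothetical task)."""
          # ... trigger the CI/CD pipeline run for this delivery ...
          return delivery_id

      # Pre-fork pool: one OS process per worker slot (the model we rolled back to):
      #   celery -A app worker --pool=prefork --concurrency=8
      #
      # Gevent pool: a single process multiplexing many green threads
      # (the model that exposed the Celery bugs under production load):
      #   celery -A app worker --pool=gevent --concurrency=200

    The trade-off is that the pre-fork pool isolates each task in its own OS process, so one crash takes down a single child, while the gevent pool gives up that isolation in exchange for lower memory use.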

    Why it happened multiple times:

    Our initial fixes on October 31 were partial rollbacks that didn't address the core issue. We were trying to keep some performance improvements while fixing stability, which backfired. We should have done a complete rollback immediately.

    Final resolution:

    • Complete rollback to pre-fork worker model

    • Reverted Celery to previous stable version

    • Isolated the Redis migration as a separate change (which wasn't causing issues)

    • Improved worker health checks and automatic restart mechanisms

    • Implemented better alerting for webhook queue depth
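
    As an illustration of the queue-depth alerting, here is a minimal sketch assuming Celery's default Redis broker layout, where pending tasks sit in a Redis list named after the queue ("celery" by default). The threshold and the alert hook are illustrative, not our production values.

      # monitor_queue.py - hypothetical webhook queue-depth check
      import redis

      QUEUE_NAME = "celery"       # default Celery queue name on a Redis broker
      DEPTH_THRESHOLD = 100       # illustrative alert threshold

      def check_queue_depth(client: redis.Redis) -> int:
          """Return the number of webhook tasks waiting in the broker."""
          return client.llen(QUEUE_NAME)

      def main():
          client = redis.Redis(host="localhost", port=6379, db=0)
          depth = check_queue_depth(client)
          if depth > DEPTH_THRESHOLD:
              # In production this would page on-call / update the status page.
              print(f"ALERT: {depth} webhooks queued (threshold {DEPTH_THRESHOLD})")
          else:
              print(f"OK: {depth} webhooks queued")

      if __name__ == "__main__":
          main()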

    What we're doing differently:

    1. No more simultaneous infrastructure changes - one at a time, with proper bake time

    2. Production-scale load testing before any concurrency model changes

    3. Faster rollback decision-making - don't try to fix forward during an active incident

    4. Better status page messaging that explains customer impact, not just system status

    5. Automatic escalation when webhook processing falls behind by more than 10 minutes
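
    The 10-minute escalation rule could look roughly like the sketch below. It assumes, hypothetically, that the webhook receiver records each delivery's enqueue time in a Redis sorted set so the age of the oldest pending delivery can be measured; the bookkeeping in production may differ.

      # escalate_lag.py - hypothetical check for webhook processing lag
      import time
      import redis

      LAG_ESCALATION_SECONDS = 10 * 60      # escalate past 10 minutes of lag
      PENDING_SET = "webhooks:pending"      # hypothetical sorted set of enqueue times

      def oldest_pending_age(client: redis.Redis) -> float:
          """Age in seconds of the oldest webhook still waiting to be processed."""
          oldest = client.zrange(PENDING_SET, 0, 0, withscores=True)
          if not oldest:
              return 0.0
          _, enqueued_at = oldest[0]
          return time.time() - enqueued_at

      def main():
          client = redis.Redis(host="localhost", port=6379, db=0)
          lag = oldest_pending_age(client)
          if lag > LAG_ESCALATION_SECONDS:
              # In production this would trigger automatic on-call escalation.
              print(f"ESCALATE: webhook processing is {lag:.0f}s behind")
          else:
              print(f"OK: webhook processing lag is {lag:.0f}s")

      if __name__ == "__main__":
          main()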

    This was our worst operational week in 5 years. We apologize for the inconvenience caused. We've learned from this and have made changes to ensure it doesn't happen again.

    If you have questions about this incident or how it affected your workflows, reach out to us directly.

  • Resolved

    Webhook is now operational! This update was created by an automated monitoring service.

  • Investigating

    Webhook cannot be accessed at the moment. This incident was created by an automated monitoring service.
