Intermittent 502 Errors for POST Requests
Incident Report for Digibee
Monitoring
We have identified the root cause of the intermittent 502 errors affecting POST requests in both the Test and Production environments. The issue was traced to an incompatibility between the Google Cloud Load Balancer and the NGINX controller in their handling of the keepalive connection window: the gateway was sending a TCP FIN (finish) signal to the load balancer while the load balancer still considered the connections active.

This created a race condition in which the load balancer could send a request over a connection that was already being closed, resulting in 502 errors.

To address this, we have increased the NGINX keepalive timeout to exceed the load balancer's configured timeout interval. This adjustment ensures that the keepalive window in NGINX remains open longer than the load balancer's, preventing premature connection terminations.
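
For illustration only, the change is of the following form. This is a minimal sketch, assuming the load balancer's documented 600-second backend keepalive timeout; the exact directive placement and values in our environment may differ:

    http {
        # Keep idle connections open longer than the load balancer's
        # 600-second backend keepalive timeout, so NGINX never closes a
        # connection the load balancer still considers usable.
        keepalive_timeout 620s;
    }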

We continue to monitor the system for stability and will provide further updates as needed. Thank you for your patience and understanding as we worked to resolve this issue.

We are pleased to report that in the past hour we have observed a significant reduction in the number of 502 errors compared to our baseline. While this is an encouraging sign that the fix is effective, we will continue to monitor the system closely to ensure sustained improvement and stability.
Posted Nov 10, 2024 - 15:38 GMT-03:00
Identified
We have identified an underlying structural issue affecting our platform running on Google infrastructure (SaaS BR and US). This issue causes some POST requests originating from the internet to sporadically return 502 errors in both the Test and Production environments.

While the overall impact affects a very low percentage of total requests, customers with high traffic volumes may experience this error more frequently. It is important to note that the impact is primarily limited to POST requests. Despite this, our platform’s availability remains above 99.95%, which exceeds the contracted SLA.

Current Status: We are actively working on a permanent resolution. In the meantime, we recommend that customers experiencing a higher impact implement a fast retry mechanism, as a subsequent retry will successfully process the request.
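
As a minimal sketch of such a retry, assuming a Python client using the requests library (the URL and payload below are placeholders, not values from this report):

    import requests

    def post_with_fast_retry(url, payload, retries=2, timeout=10):
        """Send a POST and immediately retry on a 502 response."""
        response = requests.post(url, json=payload, timeout=timeout)
        for _ in range(retries):
            if response.status_code != 502:
                break
            # Retry right away; per the guidance above, a retried request
            # is expected to be processed successfully.
            response = requests.post(url, json=payload, timeout=timeout)
        return response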

Please note that not all 502 errors are related to this specific issue.
Posted Nov 08, 2024 - 15:08 GMT-03:00
This incident affects: SaaS BR (BR - Test Environment, BR - Prod Environment, BR - Core APIs) and SaaS US (US - Core APIs, US - Prod Environment, US - Test Environment).