Elevated connection errors
Incident Report for Community Boss
Postmortem

Parking Boss relies heavily on Microsoft Azure cloud services, including our primary database, storage, and API application nodes. On Thursday afternoon, we began to see elevated error rates as communications with those services were intermittently disrupted.

Because the disruption was at the cloud layer, there was little we could do but wait for Microsoft’s engineers to restore service. Here is their initial summary of what happened:

Summary of impact: Between 19:43 and 22:35 UTC on 02 May 2019, customers may have experienced intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc). Most services were recovered by 21:30 UTC with the remaining recovered by 22:35 UTC.

Preliminary root cause: Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.

Mitigation: To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired.

Next steps: Engineers will continue to investigate to establish the full root cause and prevent future occurrences. A detailed RCA will be provided within approximately 72 hours.
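
To illustrate the caching effect described in Microsoft's mitigation note, here is a minimal sketch of a time-to-live (TTL) cache in Python. It is a toy model, not Microsoft's or our actual resolver code: the `TtlCache` class, the `resolve` helper, the host name, and the 300-second TTL are all hypothetical names and values chosen for illustration. It shows why a client that cached a bad answer keeps using it until the entry expires, even after the upstream record has been corrected.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy resolver cache (hypothetical): entries are served until their TTL elapses."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, name: str, value: str, ttl_seconds: float) -> None:
        # Store the answer together with the time at which it stops being valid.
        self._entries[name] = (value, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The entry has expired, so drop it and force a fresh lookup.
            del self._entries[name]
            return None
        return value


def resolve(cache: TtlCache, name: str,
            upstream: Callable[[str], str], ttl_seconds: float) -> str:
    """Return the cached answer if still valid, otherwise ask upstream and cache it."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    value = upstream(name)
    cache.put(name, value, ttl_seconds)
    return value


if __name__ == "__main__":
    now = [0.0]                      # fake clock so the demo is deterministic
    answer = ["bad-nameserver"]      # what the upstream currently returns
    cache = TtlCache(clock=lambda: now[0])

    def upstream(name: str) -> str:
        return answer[0]

    print(resolve(cache, "db.example.invalid", upstream, 300))  # caches the bad answer
    answer[0] = "good-nameserver"    # upstream is fixed
    now[0] = 120.0
    print(resolve(cache, "db.example.invalid", upstream, 300))  # still the cached bad answer
    now[0] = 300.0
    print(resolve(cache, "db.example.invalid", upstream, 300))  # cache expired, fresh answer
```

In this model the client recovers only once the clock reaches the stored expiry time, which matches the longer restoration time Microsoft describes for applications that had cached the incorrect information.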

Posted May 02, 2019 - 17:57 PDT

Resolved
We're continuing to monitor, but everything appears to be back to normal.
Posted May 02, 2019 - 14:35 PDT
Monitoring
We're seeing the error rate decrease across the board. Closely monitoring...
Posted May 02, 2019 - 14:29 PDT
Update
Microsoft is continuing to investigate this at the infrastructure level and we're awaiting a status update from them. In terms of impact, most activity appears to be normal.
Posted May 02, 2019 - 13:40 PDT
Identified
There appear to be overall Microsoft Azure cloud connectivity issues. More info as it becomes available...
Posted May 02, 2019 - 13:20 PDT
Update
We are continuing to investigate this issue.
Posted May 02, 2019 - 13:15 PDT
Investigating
We're investigating elevated database connection errors. Most requests are being processed as normal, but we're continuing to monitor.
Posted May 02, 2019 - 13:13 PDT
This incident affected: API and Manager.