Severity: degraded performance | Status: resolved
Created: 2nd September 2016 08:45:07am | Updated: 6th September 2016 03:53:52pm
SendGrid have successfully mitigated last week’s service disruption, and end-to-end delivery times are back within normal ranges.
Original status update from SendGrid:
I wanted you to know directly from me that all of us at SendGrid are deeply sorry for the email delivery issues we’ve been having and for the resulting adverse impact to your business.
I am reaching out to provide a more complete update on the service disruptions you’re experiencing, including more details on the root cause and our next steps. I want to keep you as up to date as possible on the latest details, so you can expect more frequent updates as we work around the clock to resolve these issues.
Here are the details on why you’re experiencing these disruptions with our service:
- Last week we implemented a series of planned networking changes in our data centers ahead of a scheduled network upgrade.
- As part of the planned changes, we shifted our traffic away from the data center where the changes were to take place.
- We have shifted traffic in this manner many times this year without issue.
- This time, the added load on one of our data centers was enough to uncover a previously unknown scaling bottleneck in the mail sending system.
- This scaling bottleneck has been isolated to a single component within our software system.
- We’re working on a number of solutions to address the scaling bottleneck and we believe we have isolated the problem.
- The solutions are related to the way we balance the load of the email flow across multiple servers and to the way we cache information. We are putting these changes into production over the next 12 hours.
- Independent of these fixes, because of our email volume patterns, you should see improvements starting this evening as the load subsides and the sending system “catches up” or dequeues.
- The next key milestone will be tomorrow morning when volume picks back up again and we are able to assess if the fixes we put into place have worked.
- You can expect an update tomorrow with additional details once we’ve made that assessment.
We have our best engineers working on the problem overnight. If for some reason the fix we put into place doesn’t work as we expect, we have three additional, specific optimizations ready to test right behind it.
Again, on behalf of SendGrid, I want to apologize for the inconvenience this is causing.
SVP of Engineering
More information is available on the SendGrid status pages.
resolved on 6 Sep 2016