Starting 2019-09-17 11:17 UTC+2 we noticed that one type of our internal services showed unusually high response times and ultimately stopped serving requests. The affected service instances are set up in a high-availability configuration across geo-redundant data centers. Nevertheless, all of them showed the same behavior. The impact of the outage was that no identifications could be performed, created, or retrieved during this time.
Further analysis showed that the storage system provided by our data centers and used by these services was malfunctioning. The storage system itself is also set up in a high-availability configuration across multiple data centers to provide geo-redundancy. Continuing the analysis showed that all nodes of the storage cluster exhibited the same erratic behavior. The storage system was then restarted in a controlled fashion, and corrective configuration measures - as advised by the vendor of the storage system - were applied, which ultimately resolved the issue.
As of 2019-09-17 14:03 UTC+2 the storage system and the services depending on it have been available again and are serving requests normally. The vendor of the storage system is currently analyzing the exact root cause of the erratic behavior across all storage nodes.
As we have seen from this outage, using a single technology, even if it is redundant, can still lead to outages if the same issue appears on all nodes. As our geo-redundancy did not prevent the outage in this case, IDnow has started the following activities in order to prevent similar issues in the future:
We sincerely apologize for the incident and for not providing the services with the availability we aim for.