European Datacenter - Service Disruption
Incident Report for IDnow GmbH
Postmortem

Starting at 2019-09-17 11:17 UTC+2, we noticed that one type of our internal services showed unusually high response times and ultimately stopped serving requests. The affected service instances are set up in a high-availability configuration in geo-redundant data centers. Nevertheless, all of them showed the same behavior. As a result of the outage, no identifications could be created, performed, or retrieved during this time.

Further analysis showed that the storage system used by these services, which is provided by our data centers, was malfunctioning. The storage system itself is also set up in a high-availability configuration across multiple data centers to provide geo-redundancy. Continued analysis showed that all nodes of the storage cluster exhibited the same erratic behavior. The storage system was then restarted in a controlled fashion, and corrective configuration measures, as advised by the vendor of the storage system, were applied, which ultimately solved the issue.

As of 2019-09-17 14:03 UTC+2, the storage system and the services using it have been available again and are serving requests normally. The vendor of the storage system is currently analyzing the exact root cause of the erratic behavior of all storage nodes.

As this outage has shown, relying on a single technology, even if it is redundant, can still lead to outages if the same issue appears on all nodes. Because our geo-redundancy did not prevent the outage in this case, IDnow has started the following activities to prevent similar issues in the future:

  • Set up idle fail-over services which can operate on an auxiliary storage system; this has already been completed
  • Reduce the dependencies of its services on this storage system to avoid relying on a single technology for storage

We sincerely apologize for the incident and for not providing our services with the availability we aim for.

Posted Sep 20, 2019 - 18:35 CEST

Resolved
We monitored the system extensively and can confirm that the applied resolution has been successful. The system is available.

Incident start: 2019-09-17 11:17 UTC+2
Incident end: 2019-09-17 14:03 UTC+2
Posted Sep 17, 2019 - 16:02 CEST
Monitoring
We applied corrective measures and are seeing the first calls complete successfully. We are monitoring the system closely.

Incident start: 2019-09-17 11:17 UTC+2
Posted Sep 17, 2019 - 14:16 CEST
Update
The team is still working on restoring service availability.

Incident start: 2019-09-17 11:17 UTC+2
Posted Sep 17, 2019 - 12:23 CEST
Identified
We are experiencing issues with the service availability of our platform. The team is working on restoring the service availability.
Posted Sep 17, 2019 - 11:20 CEST
This incident affected: Europe - IDnow (Video-Ident, eSigning QES, eSigning AES, Photo-Ident, API, AutoIdent).