POST MORTEM ANALYSIS of Incidents on 2019-10-01 and 2019-10-09
Between 2019-10-01 15:44 UTC+2 and 16:31 UTC+2, as well as between 2019-10-09 11:35 UTC+2 and 12:59 UTC+2, we experienced partial service degradation.
During these time frames the IDnow system served incoming requests only partially or with degraded performance, and only a limited number of identifications were processed.
The root cause analysis initiated on 2019-10-01 showed that database queries were not executed in a timely manner and that thousands of SQL statements were waiting for execution. On 2019-10-01 we decided to isolate the systems executing the SQL statements and then restarted the database, which restored service availability. During this time we collected additional logs and status information from the database and from the services interacting with it to aid the root cause analysis.

The analysis of the collected evidence showed that the database did not process some of the SQL statements in a timely manner, even though these statements are trivial and, when executed manually, returned their results within a few milliseconds. We added additional logging and prepared analysis scripts, which helped us collect the information we needed during the incident on 2019-10-09 to better understand the root cause.

During the incident on 2019-10-09 it became apparent that we were affected by a bug in the database: the query optimizer calculating the execution plan for SQL queries would sometimes lock up and perform its calculations only very slowly for some of the SQL statements. This caused the related services to serve requests slowly or not at all.
As a corrective measure, at 2019-10-09 12:50 UTC+2 we changed database configuration parameters related to the query optimizer, forcing it to spend less time examining alternative execution plans. In addition, we released an emergency fix for our system that simplified the SQL statement causing the query optimizer to lock up. This emergency fix was applied to our systems at 2019-10-09 16:30 UTC+2.
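For illustration only: the report does not name the database engine or the exact parameters that were changed. Assuming a MySQL-family database, where the `optimizer_search_depth` system variable bounds how many join orders the optimizer examines when choosing an execution plan, a mitigation of this kind could be sketched as:

```
-- Hypothetical sketch; the parameter name assumes MySQL and is not taken
-- from the incident report. Capping the search depth limits how long the
-- optimizer may spend comparing alternative execution plans.
SET GLOBAL optimizer_search_depth = 4;  -- MySQL default is 62 (near-exhaustive search)

-- Confirm the new value, which applies to subsequently opened sessions.
SHOW GLOBAL VARIABLES LIKE 'optimizer_search_depth';
```

A lower search depth trades potentially less optimal plans for a bounded, predictable planning time, which matches the intent described above.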
We are also initiating activities to upgrade our database to a newer version. The upgrade plan includes a thorough testing phase and a staged roll-out in our production environment.
We sincerely apologize for these incidents and for not providing the service availability we aim for.