POST MORTEM ANALYSIS of Incidents on 2019-10-01 and 2019-10-09
Between 2019-10-01 15:44 UTC+2 and 16:31 UTC+2, as well as between 2019-10-09 11:35 UTC+2 and 12:59 UTC+2, we experienced partial service degradation.
During these time frames the IDnow system served incoming requests only partially or with degraded performance, and only a limited number of identifications were processed.
The root cause analysis initiated on 2019-10-01 showed that database queries were not executed in a timely manner and that thousands of SQL statements were waiting for execution. On 2019-10-01 we decided to isolate the systems executing the SQL statements and then restarted the database, which restored service availability. During this time we collected additional logs and status information from the database and from the services interacting with it to aid the root cause analysis.

The analysis of the collected evidence showed that the database did not process some of the SQL statements in a timely manner, even though these statements are trivial and, when executed manually, returned their results within a few milliseconds. We added additional logging and prepared analysis scripts, which helped us collect the information we needed during the incident on 2019-10-09 to better understand the root cause.

During the incident on 2019-10-09 it became apparent that we were affected by a bug in the database: the query optimizer calculating the execution plan for SQL queries would sometimes lock up and perform its calculations only very slowly for some of the SQL statements. This caused the related services to serve requests slowly or not at all.
As a corrective measure, at 2019-10-09 12:50 UTC+2 we changed database configuration parameters related to the query optimizer, forcing it to spend less time examining alternative execution plans. In addition, we released an emergency fix for our system that simplified the SQL statement causing the query optimizer to lock up. This emergency fix was applied to our systems at 2019-10-09 16:30 UTC+2.
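For illustration only: the report does not name the database engine or the exact parameters that were changed. Assuming a MySQL-family database, where the `optimizer_search_depth` system variable bounds how many join orders the optimizer examines when choosing an execution plan, a mitigation of this kind could be sketched as:

```
-- Hypothetical sketch; the parameter name assumes MySQL and is not taken
-- from the incident report. Capping the search depth limits how long the
-- optimizer may spend comparing alternative execution plans.
SET GLOBAL optimizer_search_depth = 4;  -- MySQL default is 62 (near-exhaustive search)

-- Confirm the new value, which applies to subsequently opened sessions.
SHOW GLOBAL VARIABLES LIKE 'optimizer_search_depth';
```

A lower search depth trades potentially less optimal plans for a bounded, predictable planning time, which matches the intent described above.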
We are also initiating activities to upgrade our database to a newer version. The upgrade plan includes a thorough testing phase and a staged roll-out in our production environment.
We sincerely apologize for these incidents and for not providing the service availability we aim for.