POST MORTEM ANALYSIS of Incidents on 2019-11-15
Between 10:06 and 11:00 UTC+1 on 2019-11-15, we experienced a full outage of our system.
During this time frame, the IDnow system served neither API calls nor video calls.
The root cause analysis showed that the master node of our database cluster failed because of a physical hardware fault in a DIMM. The ECC memory could not correct the errors because multi-bit errors were detected. As this hardware node had been in service for more than 1.5 years without any hardware issue, we decided to restart the node to restore service and to initiate a master node switch-over. This process will move the master role to another node. Once the switch-over is completed, we will be able to have the DIMM replaced.
We have also begun rearranging our current slave nodes into a tree-cascade topology in order to allow faster switch-overs in the future.
We sincerely apologize for the incident and for not providing our services with the availability we aim for.