GitLab Data Loss Incident Report
Related issue: #47 (closed)
Status
1st April 2019: Resolution complete; some action items remain to be completed.
Summary
On 29/03/2019 at approximately 14:15 UTC a full backup of the GitLab database, repositories and Docker image registry, taken at an earlier time, was restored to the production instance instead of a test instance created for this purpose. An unrelated load-spike issue masked the early warning signs, so the restore was not noticed until several minutes later, when the data loss became apparent. Confusion over the timeline of events meant that the existence of a later backup was not discovered until approximately an hour after the original incident, by which time the database histories had diverged.
Users were notified of the data loss, and a parallel instance of GitLab restored from our most recent backup was made available to them. Future mitigating actions were determined and are in the process of being carried out.
Impact
GitLab data from 28/03/2019, 16:09:50 UTC to approximately 29/03/2019, 14:05:00 UTC was lost from the production site, although data up until 29/03/2019, 13:54:17 UTC was later restored to a testing instance.
Root Causes
As part of the technical work required to transition from alpha to beta, work was progressing on backup and restore mechanisms in order to refine the process from being ad hoc to being automated.
Due to a programming error, the default target release for the restore process was changed from "test" to "production".
An unrelated load-spike incident was happening at the same time, meaning that early warning signs of a backup restore in progress, such as 502 responses from the web interface and increased load times, were misattributed to the load-spike incident.
A technical community event was taking place in UIS at the same time, which meant that fewer Engineers were present. As a result, Engineer attention was split between the disaster recovery test and mitigating the load-spike incident.
Since we were still developing the disaster recovery process, it was unclear whether a backup with a timestamp around that of the incident was a "true" pre-incident backup or a backup taken of the site post-incident.
Trigger
While finalising their changes to the restore process, an Engineer ran a test of the disaster recovery process, not noticing that their programming change had altered the default target to the production release.
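The hazard here is the kind of silently defaulted parameter sketched below. This is an illustration only, assuming hypothetical script, function and release names rather than our actual tooling:

```python
# Illustrative sketch (not our actual restore tooling): a restore entry point
# whose target release falls back to a default. The names used here are
# hypothetical.
import argparse


def restore_backup(release: str, backup_id: str) -> None:
    # Stand-in for the real restore logic (database, repositories, registry).
    print(f"Restoring backup {backup_id} into release '{release}'...")


def main() -> None:
    parser = argparse.ArgumentParser(description="Restore a GitLab backup")
    parser.add_argument("backup_id", help="identifier of the backup to restore")
    # The default below is the hazard: a routine test run that omits
    # --release silently targets whatever the default happens to be, so a
    # one-line change of the default is enough to point a test at production.
    parser.add_argument("--release", default="production",
                        help="target release (silently defaulted if omitted)")
    args = parser.parse_args()
    restore_backup(args.release, args.backup_id)


if __name__ == "__main__":
    main()
```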
Resolution
Users were alerted as soon as the GitLab team became aware of the issue. There was initial confusion about which backup had been restored and whether the latest backup timestamp corresponded to the pre- or post-incident state of the site. We decided not to risk further data loss and to keep the post-incident state of the site.
Subsequent investigation showed that the backup with the later timestamp was indeed pre-incident. By that point we judged that users would already have begun to restore their projects and that restoring the backup again would be needlessly disruptive.
To give users access to the data lost from production, we created a parallel instance of GitLab from the most recent backup and directed users to that site so that they could determine what data was lost and manually migrate it back to the production site.
Detection
Once the load-spike incident had been mitigated, the various warning signs of the backup having been restored to the wrong site were recognised for what they were. At around the same time we received user reports of data loss.
Action Items
The following actions have been completed as of the date of this report.
- Our backup and restore scripts now take an explicit release name rather than having a default one.
- Our restore process now requires manual confirmation of the target release: before restoration proceeds, the user is asked to type in the name of the release they intend to restore. If this does not match the release specified when the process was started, the process is aborted. Both changes are illustrated in the sketch after this list.
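As a rough sketch of how these two mitigations fit together, assuming hypothetical script and release names rather than our actual tooling:

```python
# Illustrative sketch of the two completed mitigations: the target release is
# a required argument, and the operator must re-type it before the restore
# proceeds. The names used here are hypothetical.
import argparse
import sys


def restore_backup(release: str, backup_id: str) -> None:
    # Stand-in for the real restore logic (database, repositories, registry).
    print(f"Restoring backup {backup_id} into release '{release}'...")


def main() -> None:
    parser = argparse.ArgumentParser(description="Restore a GitLab backup")
    parser.add_argument("backup_id", help="identifier of the backup to restore")
    # No default: omitting --release is now an error rather than a silent choice.
    parser.add_argument("--release", required=True,
                        help="target release, e.g. 'test' or 'production'")
    args = parser.parse_args()

    # Manual confirmation: the typed name must match the release requested
    # above, otherwise the restore is aborted before any data is touched.
    confirmation = input(f"Type the name of the release to restore ({args.release}): ")
    if confirmation.strip() != args.release:
        print("Confirmation did not match the requested release; aborting.",
              file=sys.stderr)
        sys.exit(1)

    restore_backup(args.release, args.backup_id)


if __name__ == "__main__":
    main()
```

Requiring the release twice, once as an argument and once typed at the prompt, means a stale default or a copy-pasted command can no longer silently select the production release.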
The following actions remain to be completed.
- Specify that, when working on the disaster recovery process, a full site backup is always taken before work is started.
- Extend monitoring to distinguish between load spikes and backup processes (a possible approach is sketched after this list).
- Extend our existing knowledge-sharing process so that we can be confident that an individual Engineer's attention need not be split when multiple GitLab-related tasks are ongoing and some Engineers are not present.
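One possible shape for the monitoring extension, assuming a Prometheus-based setup and an illustrative metric name (neither of which is specified in this report), is for the restore tooling to advertise when it is running, so that 502 responses and raised load can be cross-referenced against backup activity rather than assumed to be a load spike:

```python
# Illustrative sketch only: export an explicit "restore in progress" gauge so
# dashboards and alerts can tell restore activity apart from an unrelated
# load spike. The metric and release names are assumptions.
import time
from contextlib import contextmanager

from prometheus_client import Gauge, start_http_server

RESTORE_IN_PROGRESS = Gauge(
    "gitlab_restore_in_progress",
    "Set to 1 while a backup restore is running",
    ["release"],
)


@contextmanager
def restore_marker(release: str):
    # Mark the restore window for the duration of the work inside the block.
    RESTORE_IN_PROGRESS.labels(release=release).set(1)
    try:
        yield
    finally:
        RESTORE_IN_PROGRESS.labels(release=release).set(0)


if __name__ == "__main__":
    start_http_server(9100)      # expose metrics for scraping
    with restore_marker("test"):
        time.sleep(5)            # stand-in for the actual restore work
```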