A defective livepatch for the 4.4 kernel in Ubuntu 16.04 LTS (Xenial) was not caught by our internal testing because the defect was a race condition, triggered by workload-specific behaviour under load. The livepatch caused the madvise system call to block indefinitely, which in turn locked up the processes using the call. These conditions were not replicated in our test environment.
After passing internal testing, the livepatch was published to our free tier users (typically personal systems). Canonical services also run in this tier as an early warning system, and the defect was noticed at that stage. Customers with systems configured for this tier were also impacted. The livepatch causing the defect was retracted one and a half hours after publication; however, the standard update process is designed to patch all online systems within one hour.
The faulty livepatch addressed a Medium severity CVE (CVE-2020-29372). This CVE fix came in as part of our normal SRU processes. The livepatch was tested in combination with an embargoed High severity CVE, and at no time did we see any issues with systems as we tested the combined livepatch. Our analysis of the lockup found that systems running under load may hold a lock that is not handled correctly after the livepatch is applied. If the lock was obtained after the livepatch was applied, no issue occurred. Linux kernel livepatching is a complex process that we stabilize through our testing and our tiered deployment process.
In addition to our internal testing processes, we follow a tiered deployment policy that releases livepatches to our customers gradually, to reduce the risks associated with kernel livepatching. Customers should only receive a livepatch after all internal embargoed test systems and the free tier have successfully applied it.
Root cause analysis
The faulty livepatch addressed a Medium severity CVE (CVE-2020-29372) and progressed through the following tiers of our process: (a) testing, (b) internal deployment [tier name: proposed], and (c) free subscription deployment [tier name: updates], where the problem with the patch was identified. The livepatch was removed and never reached the customers’ tier [tier name: stable].
The livepatch failed under certain race conditions that appear under load, causing the patched madvise system call to block indefinitely. In turn, that locked up the processes using the call. We found that systems running under load may hold a lock that is not handled correctly after the livepatch is applied; if the lock was obtained after the livepatch was applied, no issue was observed. The systems on which the livepatch misbehaved were not in a quiescent state when it was applied.
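The failure mode described above can be sketched as a toy model: a lock taken through the pre-patch code path is later released through the post-patch path, which operates on different state and frees the wrong lock, leaving the original one held forever. This is an illustrative sketch only, not the actual kernel code; all names are hypothetical.

```python
import threading

# Toy model of the failure mode (illustrative only, not kernel code).
pre_patch_lock = threading.Lock()
post_patch_lock = threading.Lock()
active = {"lock": pre_patch_lock}  # which lock the current code path uses

def enter_madvise():   # hypothetical stand-in for the patched syscall path
    active["lock"].acquire()

def exit_madvise():
    active["lock"].release()

# 1. A loaded system: a process is inside the call, holding the lock.
enter_madvise()
# 2. The livepatch lands; the code now uses the new lock.
active["lock"] = post_patch_lock
# 3. The in-flight process exits through the *patched* path, which
#    takes and releases its own lock; pre_patch_lock is never freed.
post_patch_lock.acquire()
exit_madvise()
# pre_patch_lock is still held: anything still keyed to it blocks forever.
assert pre_patch_lock.locked()
```

A process that took the lock only after the patch was applied uses the new lock consistently on entry and exit, which matches the observation that quiescent systems were unaffected.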
The offending livepatch passed our testing for the following reasons:
- Our testing is done on newly provisioned systems, and reaching the lockup required a long-running system with processes holding the offending lock in a different state.
- Internal deployment testing at the ‘proposed’ tier also failed to catch the defect because no workload there could simulate the failure conditions.
The impact on the free subscription tier was high because:
- There is no mechanism in place to gradually deploy within a tier to reduce the impact of a faulty livepatch. All systems in a tier receive the patch within an hour; since the livepatch in question was not retracted until one and a half hours after publication, we estimate that all active free tier systems received it.
- There is no automatic mechanism in place to recover from a livepatch that applies successfully but fails later.
- The manual mechanisms (see the section “How a customer can fix the issue”) were tedious and required disabling the livepatch service; otherwise the offending livepatch would be reapplied after reboot.
- The token this customer had was assigned to the free subscription tier.
- Neither the customer, via the livepatch interface, nor our operations team could identify that the systems were on the free subscription tier. Our operations team is prevented from mapping livepatch tokens to identities as part of our Personally Identifiable Information protections.
Addressing affected systems
- Via the grub2 boot menu:
- Select and boot your backup kernel.
- The livepatch client will remove the failed patch, and pick up the most recent livepatch for your backup kernel.
- If kernel command line can be accessed:
- If the system is running:
Improvements on the Canonical livepatch system
The recommendations are listed and classified by urgency. We intend to implement these recommendations to prevent similar issues from recurring.
- Restrict the CVEs addressed by livepatch to Critical and High severity; do not include Medium severity CVEs via this system, to balance the risk of a faulty patch against the impact on operations.
- Enhance the internal testing tier [tier: proposed] with production or production-replicating systems within Canonical, or introduce a new tier in between. Include the status of these systems after patch application in the qualification process that allows the patch to advance to the next tier.
- Short term
- Enhance our livepatch testing phase to include long-running systems, rather than relying only on newly provisioned ones.
- Make the removal of a potentially faulty patch an easy process to perform across multiple systems, and easy to remember; document it publicly. It is performed during a high-stress period and must be simple for our customers to carry out successfully.
- Improve the all-at-once deployment strategy within the free subscription and customer tiers to include:
- Not providing the patches to all systems at the same time, to prevent the whole tier from applying a potentially faulty patch within an hour. Roll patches out to different systems with enough delay to protect very large groups of users from receiving a potentially faulty patch simultaneously.
- In addition to the above, enhance the livepatch client so that it not only blocks the application of known faulty livepatches but also allows configuring deployment factors such as working hours, to prevent late-night calls to IT staff.
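The staggered rollout and maintenance-window gating described above can be sketched with deterministic hash bucketing: each machine is assigned a stable slot within a rollout window, so only a fraction of a tier applies a new patch each hour. This is a minimal sketch under assumed names (`machine_token`, `patch_id`, `window_hours`); it is not the livepatch server or client API.

```python
import hashlib

def rollout_delay_hours(machine_token: str, patch_id: str,
                        window_hours: int = 24) -> int:
    """Deterministically spread a tier's machines across a rollout
    window (hypothetical scheme, not the actual livepatch service)."""
    digest = hashlib.sha256(f"{machine_token}:{patch_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % window_hours

def should_apply(machine_token: str, patch_id: str,
                 hours_since_publication: float,
                 window_hours: int = 24,
                 in_working_hours: bool = True) -> bool:
    # Client-side gating: respect both the machine's staggered slot
    # and a customer-configured maintenance window.
    if not in_working_hours:
        return False
    delay = rollout_delay_hours(machine_token, patch_id, window_hours)
    return hours_since_publication >= delay
```

With a 24-hour window, a patch retracted 1.5 hours after publication would have reached only the earliest buckets of a tier rather than, as in this incident, effectively all of it.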
- Long term
- Automate faulty-patch detection, alerting, and blacklisting. As this will be a heuristic detection, report and use data from the central service to make it reliable.
- Automatically detect and notify customers whose subscriptions are incorrectly enrolled in the free subscription tier.
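One possible shape for the heuristic detection mentioned above: the central service compares the error-report rate from machines that applied a given livepatch against the fleet baseline, and blacklists the patch when the rate exceeds a multiple of that baseline. All thresholds and names here are hypothetical assumptions, not an existing Canonical mechanism.

```python
def should_blacklist(reports: list, machines_patched: int,
                     baseline_rate: float,
                     min_reports: int = 5,
                     multiplier: float = 3.0) -> bool:
    """Hypothetical central-service heuristic: blacklist a livepatch
    when the error-report rate from machines that applied it exceeds
    a multiple of the fleet baseline error rate."""
    if machines_patched == 0:
        return False
    rate = len(reports) / machines_patched
    # Require a minimum report count to avoid reacting to noise.
    return len(reports) >= min_reports and rate > multiplier * baseline_rate
```

Because a single unhealthy machine can file spurious reports, the `min_reports` floor and the baseline comparison both guard against false positives; tuning them would rely on the central-service data the recommendation calls for.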