Ensuring the reliability of an embedded Linux system
Over the last decade, the adoption of Linux in the industrial world has grown steadily, driven by Moore’s famous law: ever more capable processors are available on the market at increasingly competitive prices.
The creation of build systems such as Buildroot and Yocto, which can produce a complete embedded Linux system with a handful of commands, has further stimulated this process. Nowadays, practically every manufacturer of processors and microcontrollers for embedded use provides some form of support for the integration of their products with Linux.
In the following paragraphs, we will analyse a problem that is particularly relevant for the systems that work in this context: the implementation of mechanisms that allow a system operating in the field to continue to work in (almost) every situation.
The critical points of a system in the field
One of the most desirable properties for an industrial system is reliability, that is, the ability of the system to maintain its operating conditions over time, especially in response to extraordinary events such as failures, anomalies or human errors.
Many problems can arise during the lifetime of a system and interfere with its normal functioning. Some are intrinsically linked to the infrastructure and organisation of the system itself, and are unavoidable if particular functionalities are required. To cite some common examples:
- hardware problems related to design defects, component deterioration or malfunction, corruption of memory elements, etc.
- software errors that undermine the stability of the system (bugs, incorrect configurations, etc.)
- unforeseen events that the system does not tolerate (sudden shutdowns due to loss of power, human error, etc.)
A classic example is the set of problems that arise from implementing an update system, or part of one. Updating is a feature that can rarely be neglected, and it is often necessary for several reasons:
- fixes for bugs that emerge during normal system operation
- patches for critical vulnerabilities discovered in third-party components
- addition of features not foreseen in the system definition phase
However, the implementation of an update system brings with it a series of problems that, if not addressed, can later cause unforeseen and often significant maintenance costs.
A failed update can, in fact, leave the system in an inconsistent state that prevents it from working, or can cause unexpected and, in the worst of cases, dangerous behaviours. In other cases an update, even if it applies correctly, can introduce unexpected problems that cannot be corrected because no fallback mechanism has been provided for these eventualities.
In both these scenarios, the only possible solution involves the intervention of an operator who must manually restore the system to a valid state, or even to its original state. This operation is not always feasible in the field, and often requires returning the device for servicing, a procedure that in particular cases can be expensive in terms of both time and money (just think of the classic case in which the malfunctioning device is part of a wider mechanical system).
The presence of mechanisms that make it possible to prevent these problems or in any case to restore the system to a consistent state is therefore often a necessity, and there are various techniques in this regard.
Typically the first step in implementing a reliable Linux system is to configure the system partitions in read-only mode. This arrangement, although very simple to implement, provides a first quite effective protection mechanism against common events such as sudden shutdowns and corruption due to software bugs.
In this configuration, a separate writable partition is generally provided to hold the application data and all those parts of the system that must be modified at runtime.
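As a quick sanity check for this kind of layout, the following is a minimal Python sketch that parses mount-table text in the format of /proc/mounts and reports whether each mount point is read-only or writable. The device names and the `/data` mount point are assumptions chosen for illustration, not part of any specific system.

```python
def mount_modes(mounts_text):
    """Map each mount point to 'ro' or 'rw' from /proc/mounts-style text."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        modes[mount_point] = "ro" if "ro" in options else "rw"
    return modes


# Hypothetical layout: read-only root plus a writable data partition.
SAMPLE = """\
/dev/mmcblk0p2 / ext4 ro,relatime 0 0
/dev/mmcblk0p3 /data ext4 rw,noatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    for mount_point, mode in mount_modes(SAMPLE).items():
        print(f"{mount_point}: {mode}")
```

On a real target, the same function could be fed the contents of /proc/mounts instead of the sample text.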
Given its cost-effectiveness, this technique is often used in conjunction with the others that will be explained below.
Multiple partition systems
Keeping redundant copies of the system on the same storage is a common technique, often simple to implement, that mainly guards against corruption of the system partition or of its content. It is particularly useful in cases where:
- the system is typically turned off by removing power from the electronic board
- there is an update mechanism that operates on the system partition
- the system is installed in a location that is difficult to reach, so manual intervention is particularly complicated
This mechanism has many advantages:
- very low implementation costs (no additional hardware required)
- it does not require the development of ad-hoc software components (except for a small part of partition management)
- it has no maintenance costs
This technique is generally implemented at the level of the second-stage bootloader (typically U-Boot or Barebox for ARM-based Linux systems), which provides functionality for accessing and manipulating filesystems.
The basic idea is to divide up the storage used for the system (usually a flash memory or eMMC, but the same applies to any type of bootable device) into two partitions.
The first partition contains the filesystem marked as active, i.e. the one from which the next boot will be made. It is assumed that the active filesystem is in a consistent state (how to guarantee this will be presented later). The second partition contains the backup filesystem, which is also consistent and is used as a fallback if a destructive event of some kind occurs on the active filesystem. The image illustrates the classic partitioning scheme used in these cases.
At each boot, the bootloader checks the status of the active partition: if the check is successful, the partition is considered valid. Otherwise, the partition is marked as invalid, and the backup partition is marked as active.
This simple mechanism is therefore used to tolerate at least one error that occurred during the normal operation of the system, providing the possibility of detecting its presence and resolving it, if possible, at the next start-up (for example by restoring the filesystem remotely).
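The selection rule described above can be sketched in a few lines of Python. This only models the decision logic; a real implementation would live in the bootloader. The `Slot` type, the `check` callable and the extra verification of the backup partition before booting it are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    valid: bool = True


def select_boot_slot(active, backup, check):
    """Return (slot_to_boot, other_slot) following the check-and-fallback rule.

    `check` is a callable returning True when a slot passes verification.
    """
    if check(active):
        return active, backup
    active.valid = False  # mark the failed partition as invalid
    # Assumption: the backup is verified too, so a second fault is detected.
    if not backup.valid or not check(backup):
        backup.valid = False
        raise RuntimeError("no valid partition left to boot")
    return backup, active  # the backup becomes the active partition


if __name__ == "__main__":
    a, b = Slot("A"), Slot("B")
    corrupted = {"A"}  # hypothetical: pretend slot A fails its check
    booted, other = select_boot_slot(a, b, lambda s: s.name not in corrupted)
    print("booting from", booted.name, "| other slot valid:", other.valid)
```

Raising an error when both slots fail makes the limit of the scheme explicit: with two copies and no recovery step, only one fault can be absorbed.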
Several techniques can be used to check the partition state:
- a state file in a known location on the filesystem, including information on the last known state of the system (startup, boot completed, shutdown completed, reboot)
- a state variable in the bootloader environment (if also accessible from Linux)
- a small dedicated partition
- a small external memory (e.g. an EEPROM)
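To illustrate the first option, here is a minimal Python sketch of a state file that is updated atomically, so that a power loss during the write cannot leave a half-written file behind. The JSON layout, the function names and the choice of which states count as a clean shutdown are assumptions; the state names are the ones listed above. On a system with a read-only root, the file would have to live on the writable partition.

```python
import json
import os
import tempfile

STATES = ("startup", "boot completed", "shutdown completed", "reboot")


def write_state(path, state):
    """Atomically record the last known system state in a JSON file."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"state": state}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to storage before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_state(path):
    """Return the recorded state, or None if the file is missing or corrupt."""
    try:
        with open(path) as f:
            state = json.load(f).get("state")
    except (OSError, ValueError, AttributeError):
        return None
    return state if state in STATES else None


def last_run_was_clean(path):
    """Assumption: only an orderly shutdown or reboot counts as clean."""
    return read_state(path) in ("shutdown completed", "reboot")
```

A missing or unreadable file is deliberately treated the same way as an unclean shutdown, which is the conservative choice for a check of this kind.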
This technique obviously has limitations to be considered:
- it is not resilient to hardware problems of the storage used (single point of failure)
- in its simplest form, it does not provide any recovery functionality, so it can tolerate only a single fault event
- it effectively halves the availability of storage space for the system
Systems with multiple storage
A further step beyond what we have seen so far is to move the backup of the system out of the storage containing the main partition and into a dedicated memory.
This solution solves two of the problems set out above: the available disk space is no longer limited by the presence of an “unused” partition and can instead be fully exploited, and the single point of failure is also eliminated, since both memories would have to fail for the system to become unusable.
There are basically two approaches to this solution, which differ in the type of storage used and the functionality they provide.
In the first approach, the backup memory contains an exact copy of the system present in the first. In this case, typically the memories are of the same type, and the bootloader selects one or the other at the time of booting, in a manner very similar to what we saw previously.
In the second approach, instead, the backup memory is only used as a container for a minimal system image, usually compressed, which is loaded into RAM and executed from there. It is then the task of this minimal system to restore the main partition, for example by downloading a copy of the system from a remote server. In this case, the backup memory is generally much smaller and sits on a “slow” communication bus (usually SPI).
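The restore step performed by such a recovery system might look like the following Python sketch, which takes an image that has already been downloaded, verifies it and unpacks it onto the target. The gzip format, the SHA-256 checksum and the function name are assumptions for illustration; the download itself is left out, and in the demo below the target would be a regular file rather than a real partition.

```python
import gzip
import hashlib
import os
import shutil


def restore_image(compressed_path, expected_sha256, target_path, chunk_size=1 << 20):
    """Verify a gzip-compressed image, then unpack it onto the target.

    Raises ValueError before touching the target if the checksum differs.
    """
    digest = hashlib.sha256()
    with open(compressed_path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError("image checksum mismatch, refusing to write")

    with gzip.open(compressed_path, "rb") as src, open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data reaches the storage
```

Checking the image before opening the target means that a corrupted or truncated download never overwrites whatever is left of the main partition.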
Both approaches present pros and cons: the first requires more work on the hardware part, which will also be more expensive; the second saves on hardware, but shifts the complexity to software and maintenance of a recovery infrastructure.
Overall this solution, although more robust than a simple dual partition system, is not without its disadvantages:
- higher costs, in terms of hardware or recovery infrastructure
- greater complexity in managing partitions and booting
- the need to support (potentially) different storage technologies
Remote recovery systems
This is a revised version of the second approach set out above. In this configuration, the restore of the main partition, which was previously entrusted to a recovery system loaded from a small ad-hoc memory, is instead performed by the bootloader itself. The copy of the system is still retrieved from a remote server, but the formatting of the main partition and the unpacking of the image are managed entirely by the bootloader.
This solution is considerably cheaper than the previous one from the hardware point of view, as it further reduces hardware complexity, but it requires the bootloader to implement a fairly rich set of features, depending on the infrastructure used:
- support for the network hardware used
- a complete TCP/IP network stack
- support for the application protocol used to download the image (HTTP, FTP, etc.) and any authentication methods (SSL, HTTP Basic Auth, etc.)
- decompression and decoding software for the downloaded image
As we have just seen, there are several techniques to maximise the reliability of a system, each with its pros and cons. The list presented here is by no means exhaustive, and the best solution must always be evaluated on the basis of hardware and software needs and constraints on the one hand, and of the desired functionality and the degree of fault tolerance required of the system on the other.