Ensuring the reliability of an embedded Linux system

27 March 2017 Pietro Lorefice · Tech Leader

Introduction

Over the past decade, Linux has steadily gained ground in the industrial world, fueled by Moore’s Law: increasingly powerful processors have become available at ever more competitive prices.

The advent of build systems like Buildroot and Yocto, which can generate a complete embedded Linux system from a single configuration, has further accelerated this adoption. Today, nearly every manufacturer of embedded processors and microcontrollers offers some form of support for Linux integration.

In this article, we’ll look at a key challenge for embedded systems in the field: how to ensure that they remain operational in (almost) any scenario.

Challenges for Systems in the Field

One of the most crucial features of any industrial system is reliability—the ability to maintain normal operations over time, especially in the face of unexpected events like failures, anomalies, or human error.

Several issues can arise over a system’s lifetime that may impact its functionality. Some of these problems are even inherent to the way the system is built and operated. Common examples include:

Hardware issues due to design flaws, component degradation, or memory corruption
Software bugs or misconfigurations that compromise stability
Unexpected events such as power loss or incorrect handling by users

These issues often come to light during system updates. Updating an embedded system is rarely optional—it’s often necessary to:

Fix bugs discovered during operation
Address security vulnerabilities introduced by third-party components
Add features that were not part of the original specification

However, updates can also introduce their own problems. If not managed correctly, for example when power is lost midway through, they can leave the system in an inconsistent state, rendering it non-functional or, worse, causing erratic or dangerous behavior. Even an update that installs correctly may introduce regressions, and without a fallback mechanism there is no way to return to the previous working version.

In such cases, the only solution is manual intervention to restore the system—an option that might be costly or even impossible, especially if the device is deployed in the field or embedded in a product. This is why it’s essential to design systems with built-in safeguards and recovery mechanisms.

Read-Only Root Filesystems

A common first step toward system reliability is mounting the root file system as read-only. This simple but effective strategy helps protect against file system corruption caused by power failures or software bugs.

In this setup, all writable data—such as logs and application data—is stored on a separate writable partition. While this approach doesn’t prevent all failures, it offers a solid foundation and is often combined with other techniques described below.
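
As a quick illustration, the check below reads mount information in the format of /proc/mounts and reports whether the root file system is mounted read-only. It is a minimal Python sketch: the sample device names and the /data mount point are assumptions chosen for the example, not a prescribed layout.

```python
# Sample text in /proc/mounts format (device names and /data are assumptions).
SAMPLE_MOUNTS = """\
/dev/root / ext4 ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
/dev/mmcblk0p3 /data ext4 rw,noatime 0 0
"""


def root_mount_options(mounts_text):
    """Return the option list of the last "/" entry, or None if there is none."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == "/":
            options = fields[3].split(",")  # a later entry shadows an earlier one
    return options


def root_is_read_only(mounts_text):
    """True only if a root entry exists and carries the "ro" option."""
    options = root_mount_options(mounts_text)
    return options is not None and "ro" in options


print("read-only root:", root_is_read_only(SAMPLE_MOUNTS))
```

On a running target, the same function can be fed the contents of /proc/mounts, for instance from a start-up health check.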

Dual-Partition Systems

Maintaining multiple copies of the system on the same storage device is another widely used technique. It helps protect against corruption of the system partition, especially when:

The system is often shut down by cutting power
System updates modify the root partition
The device is difficult to access physically for maintenance

Benefits of this approach include:

Low implementation cost (no extra hardware needed)
Minimal custom software requirements (basic partition switching logic)
No additional maintenance burden

This mechanism is usually implemented in the second-stage bootloader (e.g. U-Boot or Barebox for ARM platforms), which can manage partitions and file systems.

The storage (typically raw flash or eMMC) is split into two partitions:

One contains the active file system used at boot time
The other serves as a backup, kept in a consistent state

At boot, the bootloader checks the status of the active partition. If the check fails, the system switches to the backup partition.

Several techniques can be used to validate the active partition:

A status file stored in a known location
Bootloader environment variables accessible from both Linux and the bootloader
A small dedicated status partition
An external memory device (e.g., EEPROM)
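
To make the switching logic concrete, here is a minimal Python simulation of one possible policy: a boot counter that the bootloader increments on every attempt and that Linux resets once the system is up. The names (BootState, BOOT_LIMIT, select_slot, confirm_boot) and the limit of three attempts are hypothetical; in a real system this logic would live in the bootloader, and the state in one of the locations listed above.

```python
from dataclasses import dataclass

BOOT_LIMIT = 3  # hypothetical: unconfirmed attempts allowed before falling back


@dataclass
class BootState:
    active: str = "A"    # slot the bootloader tries first
    boot_count: int = 0  # attempts since the last confirmed-good boot


def other(slot):
    """Return the name of the opposite slot."""
    return "B" if slot == "A" else "A"


def select_slot(state):
    """Bootloader side: count this attempt and fall back past the limit."""
    state.boot_count += 1
    if state.boot_count > BOOT_LIMIT:
        state.active = other(state.active)
        state.boot_count = 1  # this boot is the first attempt on the new slot
    return state.active


def confirm_boot(state):
    """Linux side: called once the application is up and judged healthy."""
    state.boot_count = 0


# Simulate four boots in a row that never reach confirm_boot().
state = BootState()
print([select_slot(state) for _ in range(4)])
```

With these assumptions, three consecutive unconfirmed boots of slot A are followed by a switch to slot B. U-Boot's boot count limit feature (the bootcount, bootlimit and altbootcmd variables) follows a similar idea.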

Limitations:

Does not protect against hardware failures in the storage device (single point of failure)
Provides limited recovery: once the backup partition is in use, there is no further fallback until the failed partition has been restored
Halves available storage space

Systems with Multiple Storage Devices

To overcome the limitations of dual-partition systems, the next step is to store the backup system on a separate memory device. This brings two major advantages:

The main storage is fully available for the primary system
Redundancy across devices removes the storage device as a single point of failure

There are two main approaches:

Full system copy

The backup memory contains a complete system image. Both memories are usually of the same type, and the bootloader selects one at boot.

Minimal recovery system

The backup memory holds a small, compressed system image that’s loaded into RAM and executed at boot. Its job is to restore the main partition, for example by downloading a clean system image from a remote server. These memories are typically smaller and connected via SPI or similar low-speed buses.
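
Whatever the source of the clean image, the recovery system should check it before overwriting the main partition. The Python sketch below assumes the image has been staged somewhere seekable (RAM or a temporary file) and that a trusted SHA-256 digest is available, for example from a manifest published alongside the image. Both are assumptions made for illustration, as are the function names.

```python
import hashlib
import io
import shutil

CHUNK_SIZE = 64 * 1024


def sha256_of_stream(stream):
    """Hash a file-like object in chunks so large images need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def restore_if_valid(image, expected_sha256, target):
    """Copy image to target only if its digest matches; return True on success."""
    if sha256_of_stream(image) != expected_sha256:
        return False   # leave the target untouched
    image.seek(0)      # rewind: hashing consumed the stream
    shutil.copyfileobj(image, target, CHUNK_SIZE)
    return True


# Demo with in-memory stand-ins for the staged image and the main partition.
data = b"rootfs-image-bytes"
partition = io.BytesIO()
ok = restore_if_valid(io.BytesIO(data), hashlib.sha256(data).hexdigest(), partition)
print("restored:", ok)
```

A plain digest only detects corruption. Protecting against a tampered image would additionally require a cryptographic signature, which this sketch does not cover.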

Pros and Cons

The first approach increases hardware cost and complexity
The second saves on hardware but requires a more sophisticated recovery infrastructure and software stack

Remote Recovery Systems

An evolution of the previous model moves the entire recovery process into the bootloader itself. Instead of booting a minimal recovery system from a separate memory, the bootloader is responsible for restoring the main system by:

Formatting the main partition
Downloading a new system image from a server
Decompressing and, if needed, decrypting the image

This approach simplifies hardware design but adds requirements for the bootloader, which must support:

Network hardware
A full TCP/IP stack
Transfer protocols such as HTTP or FTP, plus transport security and authentication (e.g., TLS, HTTP Basic Auth)
Image decompression and decryption
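
The download and decompression steps can be combined so that the compressed image never has to be held in memory as a whole. The following Python sketch only models the data flow, assuming a gzip-compressed image delivered in chunks. A real bootloader would implement this in its own language on top of its own network stack, and fake_download is a hypothetical stand-in for the actual transfer.

```python
import gzip
import io
import zlib


def restore_from_chunks(chunks, target):
    """Decompress a gzip stream chunk by chunk and write it to target.

    Returns the number of decompressed bytes written. Raises ValueError if
    the stream ends before the gzip trailer, i.e. the download was cut short.
    """
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 selects gzip framing
    written = 0
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        target.write(data)
        written += len(data)
    tail = decompressor.flush()
    target.write(tail)
    written += len(tail)
    if not decompressor.eof:
        raise ValueError("image stream ended before the gzip trailer")
    return written


def fake_download(payload, chunk_size=4):
    """Stand-in for a network transfer: yield the payload in small pieces."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


# Demo: a tiny "image" compressed on the server side, restored on the device.
image = b"x" * 1000
partition = io.BytesIO()
size = restore_from_chunks(fake_download(gzip.compress(image)), partition)
print("bytes written:", size)
```

In this sketch a truncated download leaves partial data on the target, so the caller is expected to treat the ValueError as a failed restore and retry from the formatting step.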

Conclusion

There are many ways to improve the reliability of embedded Linux systems, each with trade-offs in complexity, cost, and capabilities. The techniques discussed here are among the most common, but the list is by no means exhaustive.

The best approach depends on your specific hardware and software environment, the system’s intended use, and the level of fault tolerance required. Carefully evaluating these factors will help you build robust systems that keep running in (almost) any scenario they meet in the field.