Linux 6.1 introduces new features that make it easier to identify faulty CPUs
For production Linux systems that run many CPUs at once (such as large servers), Linux 6.1 adds a very useful feature: when a program crashes with a segmentation fault, the kernel's error message now tells you which CPU is likely at fault.
The feature comes from a patch in the x86/cpu branch, merged during the Linux 6.1 merge window: when a segfault occurs, the failure message also prints the number of the "suspected" CPU.
Patch author Rik van Riel explains the motivation and how the feature works:
In a large enough fleet of computers, there are usually a few bad CPUs. They can often be identified by noticing that kernel code which runs fine everywhere else keeps crashing on the same CPU core.
However, the failure modes of CPUs that have gone bad over the years are often very specific: the only symptom may be segmentation faults in bash, python, or various system daemons, and the failure message does not tell you which CPU is at fault. The patch therefore adds a printk() to show_signal_msg() that prints the CPU, core, and socket at the time of the segmentation fault.
The feature is not perfect and can produce false positives. Between the moment the fault occurs and the moment the error message is printed, the task may be rescheduled onto another CPU, so the wrong CPU number can be reported.
In practice, though, it is good enough to help people identify most faulty CPU cores.
Here is an example of the resulting message:
segfault: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] likely on CPU 0 (core 0, socket 0)
This printk can be controlled by.
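To act on these messages across many crashes, you can tally them per CPU. The following is a minimal sketch, not part of the patch, assuming the log lines end with the "likely on CPU N (core N, socket N)" suffix from the example message above. The function name `count_segfaults_by_cpu` and all sample lines other than the first are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Matches the "likely on CPU N (core N, socket N)" suffix of the example
# message; whitespace around the numbers is tolerated.
CPU_SUFFIX = re.compile(
    r"likely on CPU\s*(\d+)\s*\(\s*core\s*(\d+)\s*,\s*socket\s*(\d+)\s*\)"
)

# First line is the example from the article; the others are made-up
# samples (hypothetical program names and addresses).
SAMPLE_LINES = [
    "segfault: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 "
    "error 4 in segfault[401000+1000] likely on CPU 0 (core 0, socket 0)",
    "bash: segfault at 8 ip 0000000000401200 sp 00007ffc00000000 "
    "error 4 in bash[401000+1000] likely on CPU 0 (core 0, socket 0)",
    "python: segfault at 10 ip 0000000000402000 sp 00007ffc00001000 "
    "error 6 in python[401000+2000] likely on CPU 3 (core 1, socket 0)",
    "unrelated kernel message without a CPU suffix",
]


def count_segfaults_by_cpu(log_lines):
    """Count segfault messages per (cpu, core, socket) tuple."""
    counts = Counter()
    for line in log_lines:
        match = CPU_SUFFIX.search(line)
        if match:
            cpu, core, socket = (int(group) for group in match.groups())
            counts[(cpu, core, socket)] += 1
    return counts


if __name__ == "__main__":
    # Print the most frequently reported CPUs first.
    for (cpu, core, socket), n in count_segfaults_by_cpu(SAMPLE_LINES).most_common():
        print(f"CPU {cpu} (core {core}, socket {socket}): {n} segfault(s)")
```

Because the reported CPU can be wrong when the task is rescheduled before the message is printed, a single hit means little. A CPU that keeps appearing at the top of the tally across unrelated programs is the one worth investigating.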
According to Phoronix, the feature will be officially enabled in the Linux 6.1 stable release in October.
segfault: short for segmentation fault, an error that is often encountered in software development and is among the most common errors seen on Linux systems.
The error is caused by illegal memory accesses such as dereferencing a null pointer, writing to a read-only memory region, or accessing a protected memory region.
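As an illustration of the null pointer case, the sketch below starts a child Python process that reads from address 0 through `ctypes` and then inspects how the child ended. It assumes a Linux system, where an invalid read like this is delivered as SIGSEGV. The helper name `run_null_read` is hypothetical.

```python
import signal
import subprocess
import sys


def run_null_read():
    """Run a child process that reads from address 0; return its exit status."""
    # ctypes.string_at(0) tries to read a C string starting at the NULL address.
    child_code = "import ctypes; ctypes.string_at(0)"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        stderr=subprocess.DEVNULL,
    )
    # subprocess reports death by signal N as a return code of -N.
    return result.returncode


if __name__ == "__main__":
    status = run_null_read()
    if status == -signal.SIGSEGV:
        print("child was killed by SIGSEGV")
    else:
        print(f"child exited with status {status}")
```

Running the faulting code in a child process keeps the crash away from the parent, which only observes the negative return code.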