Everything a computer does at some point involves RAM ("memory"). Typical home computers and small servers have between 4 and 16 gigabytes of it. Defective RAM is not easily detected because the symptoms are very unspecific. But in almost all cases bad RAM corrupts data that is saved to disk at some point. Corrupted data can be anything from a pixel in a video having a slightly different color than it should, to documents or databases that are no longer readable, or -- worst of all -- data that you or your company is liable for changing its meaning without anyone noticing until it is too late.
In mid-2009 Google and Bianca Schroeder from the University of Toronto published a study showing that Google's -- albeit commodity-grade -- server hardware is prone to memory errors: each year a third of the studied systems suffered at least one correctable memory error (CE). According to their study, a system that had a CE in the past is very likely to have many more CEs or even uncorrectable errors (UEs) in the near future. It is safe to assume that consumer-grade hardware is likely to show even worse behaviour. These results emphasize the importance of early detection of defective memory modules.
Google is not the only one to have observed this. From the NASA website, updated May 17, 2010 at 5:00 PT:
One flip of a bit in the memory of an onboard computer appears to have caused the change in the science data pattern returning from Voyager 2, engineers at NASA’s Jet Propulsion Laboratory said Monday, May 17. A value in a single memory location was changed from a 0 to a 1.
Broken RAM is bad. Finding defective RAM is a non-trivial task; it can be done in hardware (e.g. ECC memory), in software (e.g. memtest86+), or with a combination of both. Hardware tests do not find all errors, and conventional software tests require downtimes of many hours. Other software tests can run while the computer is in normal use, e.g. editing spreadsheets or serving web pages. These programs have a different problem: they just poke around randomly in the computer's memory and often test only very small parts of the system memory.
In my diploma thesis I developed an online memory test for the Linux kernel. It is a program (a userspace part and a kernel module) that can test substantial parts (~70%, more with some additional programming) of the computer's memory while the computer is in normal use, thus solving the biggest problem of current memory tests.
If you want to know more, read my blog post. If that is still not enough, read the thesis or the papers based on it.
A detailed description, and measurements on both effectiveness and efficiency, can be found in the following two publications:
- H. Schirmeier, J. Neuhalfen, I. Korb, O. Spinczyk, and M. Engel. RAMpage: Graceful degradation management for memory errors in commodity Linux servers. In Proceedings of the 17th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC '11), pages 89-98, Pasadena, CA, USA, Dec. 2011. IEEE Computer Society Press.
- H. Schirmeier, I. Korb, O. Spinczyk, and M. Engel. Efficient online memory error assessment and circumvention for Linux with RAMpage. International Journal of Critical Computer-Based Systems, 4(3):227-247, 2013. Special Issue on PRDC 2011 Dependable Architecture and Analysis.
The implementation has three major parts (a rough sketch of how they interact follows the list):
- A kernel module that allows userspace programs to request and isolate specific page frames.
- A userspace implementation that drives the test: it decides which frame to test, when, and how.
- Userspace implementations of the test algorithms that test a single page frame.
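To make the division of labour a bit more concrete, here is a rough sketch in C of how such an online test loop could look. The functions claim_page_frame, release_page_frame and mark_frame_bad are placeholders invented for this illustration (stubbed with malloc/free so the sketch runs on its own); they merely stand in for the actual kernel-module interface, which works differently.

```c
/* Rough sketch of an online page-frame test loop (illustration only).
 * claim_page_frame()/release_page_frame()/mark_frame_bad() are invented
 * stand-ins for the kernel-module interface. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

static uint8_t *claim_page_frame(unsigned long pfn)
{
    (void)pfn;                    /* real code: ask the kernel module  */
    return malloc(PAGE_SIZE);     /* to isolate and map this frame     */
}

static void release_page_frame(unsigned long pfn, uint8_t *frame)
{
    (void)pfn;                    /* real code: return the frame to    */
    free(frame);                  /* the kernel's page allocator       */
}

static void mark_frame_bad(unsigned long pfn, uint8_t *frame)
{
    printf("page frame %lu is defective, keeping it isolated\n", pfn);
    free(frame);                  /* real code: the frame stays claimed */
}

/* Minimal per-frame test: write a pattern, read it back, repeat with a
 * few patterns. A real implementation would use a proper march test
 * (see the second sketch below). */
static int test_single_frame(volatile uint8_t *frame)
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0x55, 0xAA };
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < PAGE_SIZE; i++)
            frame[i] = patterns[p];
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (frame[i] != patterns[p])
                return -1;        /* bit error detected */
    }
    return 0;
}

int main(void)
{
    /* The "driver" part: decide which frame to test next. A real
     * scheduler would spread this over time and skip frames that are
     * currently in use and cannot be claimed. */
    for (unsigned long pfn = 0; pfn < 1024; pfn++) {
        uint8_t *frame = claim_page_frame(pfn);
        if (!frame)
            continue;             /* frame not available right now */
        if (test_single_frame(frame) != 0)
            mark_frame_bad(pfn, frame);
        else
            release_page_frame(pfn, frame);
    }
    return 0;
}
```

The important point is that only a few page frames are taken out of normal use at any time, so the system keeps running while the test slowly works its way through memory.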
Furthermore, there are some tools to analyse the memory management.
Horst Schirmeier greatly extended the base laid by the thesis, e.g. by implementing the test algorithms in C, and by claiming more types of memory.
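The test algorithms themselves follow classic patterns from the memory-testing literature. As an illustration, here is a sketch of a March C- style pass over one page frame; this is the generic textbook formulation, not necessarily the exact variant used in RAMpage.

```c
/* Sketch of a March C- style test over one page frame (textbook
 * formulation; not necessarily the exact algorithm used in RAMpage).
 * Notation: up = ascending address order, down = descending order,
 * r0/w1 = read expecting 0x00 / write 0xFF, and so on. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

int march_c_minus_page(volatile uint8_t *m)
{
    size_t i;

    /* up(w0) */
    for (i = 0; i < PAGE_SIZE; i++) m[i] = 0x00;
    /* up(r0, w1) */
    for (i = 0; i < PAGE_SIZE; i++) { if (m[i] != 0x00) return -1; m[i] = 0xFF; }
    /* up(r1, w0) */
    for (i = 0; i < PAGE_SIZE; i++) { if (m[i] != 0xFF) return -1; m[i] = 0x00; }
    /* down(r0, w1) */
    for (i = PAGE_SIZE; i-- > 0; ) { if (m[i] != 0x00) return -1; m[i] = 0xFF; }
    /* down(r1, w0) */
    for (i = PAGE_SIZE; i-- > 0; ) { if (m[i] != 0xFF) return -1; m[i] = 0x00; }
    /* down(r0) */
    for (i = PAGE_SIZE; i-- > 0; ) { if (m[i] != 0x00) return -1; }

    return 0;   /* no error detected in this frame */
}
```

The fixed ascending and descending address orders are what distinguish a march test from simple pattern poking: they allow it to detect coupling faults between neighbouring cells that a plain write-and-read-back test misses.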