NOTE: Over the first half of 2014 we will be adding more detail on how applications can be augmented to be more "Recovery Aware".
1) Introduction
Today’s datacenters run mission-critical enterprise applications, such as stock trading or corporate finance and billing, across many servers. Server failures can cause data loss and downtime, which raise service costs and can compromise data integrity. To minimize these effects, Intel introduced advanced Reliability, Availability and Serviceability (RAS) features in the Intel® Xeon® processor E7 product family. More information about advanced RAS can be found in previously authored whitepapers [1][2]. This article describes the new advanced RAS features added to the Intel® Xeon® processor E7 v2 family in 2014 and marketed as part of “Intel® Run Sure Technology”. This product family supports 2-, 4- and 8-socket platforms, is based on the Intel® Core™ microarchitecture (formerly codenamed Ivy Bridge), and is manufactured on 22-nanometer process technology.
2) New RAS Features
Many of the new advanced RAS features introduced here are implemented in hardware and firmware and require no changes to software programs. However, some require Operating System (OS) or Virtual Machine Manager (VMM) support, as well as recovery mechanisms in software.
PCIe Live Error Recovery (LER)
This feature allows the system to bring down the PCIe [3] link associated with the PCIe root port where an uncorrected (fatal or non-fatal) fault is detected in either an incoming or outgoing transaction, without resetting the entire system. It also allows firmware- or software-assisted link retraining and recovery, and it protects against the transfer of potentially corrupt data to disk.
Enhanced Machine Check Architecture Gen 1 (eMCA1)
This feature enhances the existing Machine Check Architecture (MCA) by implementing the Firmware First Model [2] (FFM) of error reporting (logging and signaling). FFM is a server RAS paradigm in which all error events are first signaled to platform-specific firmware. The firmware processes the error logs and decides if and when to notify the Operating System or application software layers of an error. eMCA1 [4] can be configured to provide enhanced error log information to the OS and VMM, which can be used to implement advanced diagnostics and predictive failure analysis [7] (PFA) for the system. Legacy MCA provides the physical address of the memory location when a corrected fault occurs, but it is challenging for PFA software to map that address to an actual physical DIMM. eMCA1 allows this additional error log information to be provided to the PFA software.
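As a rough illustration of what PFA software might do once an error log identifies the physical DIMM, the C sketch below keeps a leaky-bucket count of corrected errors per DIMM and flags a DIMM when the count reaches a threshold. Nothing here is an Intel or eMCA1 interface: the `pfa_*` names, the DIMM index, the threshold, and the leak period are all hypothetical, and how the DIMM index is obtained from the firmware-provided log is outside the scope of the sketch.

```c
#include <stdint.h>
#include <string.h>

#define PFA_MAX_DIMMS 64 /* hypothetical upper bound on DIMM slots */

/* Hypothetical per-DIMM leaky bucket of corrected errors. */
struct pfa_bucket {
    uint32_t level;       /* corrected errors still "in the bucket" */
    uint64_t last_leak_s; /* time (seconds) up to which leaking was applied */
};

struct pfa_tracker {
    struct pfa_bucket dimm[PFA_MAX_DIMMS];
    uint32_t threshold;     /* flag a DIMM once its level reaches this */
    uint32_t leak_period_s; /* one error drains out every this many seconds */
};

static void pfa_init(struct pfa_tracker *t, uint32_t threshold,
                     uint32_t leak_period_s)
{
    memset(t, 0, sizeof *t);
    t->threshold = threshold;
    t->leak_period_s = leak_period_s;
}

/*
 * Record one corrected error on a DIMM at time now_s (seconds).
 * Returns 1 if the DIMM should be flagged for replacement,
 * 0 if not, and -1 if the DIMM index is out of range.
 */
static int pfa_record_corrected_error(struct pfa_tracker *t, unsigned dimm,
                                      uint64_t now_s)
{
    if (dimm >= PFA_MAX_DIMMS)
        return -1;

    struct pfa_bucket *b = &t->dimm[dimm];

    /* Drain one error per elapsed leak period, keeping the remainder. */
    if (t->leak_period_s > 0 && now_s > b->last_leak_s) {
        uint64_t leaked = (now_s - b->last_leak_s) / t->leak_period_s;
        b->last_leak_s += leaked * t->leak_period_s;
        b->level = (leaked >= b->level) ? 0 : b->level - (uint32_t)leaked;
    }

    b->level++;
    return b->level >= t->threshold;
}
```

With a threshold of 3 and a one-hour leak period, three corrected errors on the same DIMM within a few seconds would flag it, while the same three errors spread over several hours would not.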
Machine Check Architecture (MCA) recovery for I/O
MCA recovery for I/O allows uncorrected I/O errors, both fatal and non-fatal, to be reported through the MCA mechanism. Intel® Xeon® Processor E7 families incorporate the PCI Express* Advanced Error Reporting [5] (AER) architecture to report (log and signal) uncorrected and corrected I/O errors. Normally, uncorrected I/O errors are signaled to the system software either as an AER Message Signaled Interrupt (MSI) or via platform-specific mechanisms such as a System Management Interrupt (SMI) and/or physical error pins. The signaling mechanism is controlled by the BIOS and/or platform firmware. As part of this new feature, the processor adds a new Machine Check bank called IOMCA, which allows uncorrected I/O errors to be logged and signaled through the standard Machine Check Architecture. The bank logs the Bus, Device, and Function information associated with the PCI Express port, allowing error handling software to identify the source of the error faster. By signaling uncorrected I/O errors through the MCA mechanism, this feature lets the errors be communicated to the software layers (OS, VMM and DBMS) to improve error identification and recovery.
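The Bus, Device, and Function triple is conventionally packed into a 16-bit PCI Express ID (8-bit bus, 5-bit device, 3-bit function). The C sketch below decodes such an ID. It assumes the error handler has already extracted that 16-bit value; the layout of the IOMCA bank registers is not described in this article, so the `pci_bdf` type and the `pci_bdf_*` helpers are hypothetical names used for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for a decoded PCI Express ID. */
struct pci_bdf {
    uint8_t bus;      /* bits 15:8 */
    uint8_t device;   /* bits 7:3  */
    uint8_t function; /* bits 2:0  */
};

/* Split a standard 16-bit PCI Express ID into bus/device/function. */
static struct pci_bdf pci_bdf_decode(uint16_t id)
{
    struct pci_bdf bdf;
    bdf.bus = (uint8_t)(id >> 8);
    bdf.device = (uint8_t)((id >> 3) & 0x1F);
    bdf.function = (uint8_t)(id & 0x07);
    return bdf;
}

/*
 * Format as the "bb:dd.f" string commonly used to name PCI devices.
 * Returns the snprintf result: the number of characters that the
 * full string requires, excluding the terminating NUL.
 */
static int pci_bdf_format(struct pci_bdf bdf, char *buf, size_t len)
{
    return snprintf(buf, len, "%02x:%02x.%x",
                    (unsigned)bdf.bus, (unsigned)bdf.device,
                    (unsigned)bdf.function);
}
```

For example, decoding the ID `0x1A0B` gives bus `0x1a`, device `1`, function `3`, which formats as `1a:01.3`.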
Machine Check Architecture (MCA) recovery – Execution Path
The MCA recovery – Execution Path feature allows a system to continue operating even when the processor cannot correct data errors within the memory sub-system, and allows the software layers (OS, VMM, DBMS, and applications) to participate in system recovery. It handles uncorrected hardware errors occurring within the memory sub-system, including main memory, last-level caches, and mid-level caches.
When the processor detects a fault within the memory sub-system, it attempts to correct the fault, and in most cases it succeeds. If the error cannot be corrected, the processor notifies the operating system (OS) using a Machine Check Exception [6] (MCE) and logs the error as an uncorrected recoverable (UCR) error. The OS analyzes the log and verifies that recovery is possible. If it is, the OS un-maps the affected page(s) and triggers a SIGBUS event to the application.
If the error is detected in instruction code, the instruction fetch unit (IFU) is notified and triggers the MCE. In this case, the OS reloads the affected page containing the instruction into a new physical page and resumes normal execution. If the error is detected within the data space, the Data Cache Unit (DCU) is notified and triggers the MCE. In this case, the OS notifies the application through the SIGBUS event, and it is up to the application to take further action: the affected application is responsible for reloading the data. If the data had already been modified and the application cannot reload it from disk, the affected application is terminated; a system reset is not required, and other applications continue to operate normally. To take full advantage of the MCA recovery – Execution Path feature, applications must be ‘Recovery Aware’.
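To make the application side of this flow more concrete, the following C sketch shows one way a Linux application might react to the SIGBUS event. It relies on the Linux `si_code` values `BUS_MCEERR_AR` and `BUS_MCEERR_AO`, which the kernel uses for memory-error SIGBUS signals. Everything else is an assumption made for illustration: the `region` structure, the single tracked buffer `g_cache`, the `decide_recovery` policy, and the use of `siglongjmp` to return to a recovery point are hypothetical design choices, not a description of how any particular application or Intel software implements recovery.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Linux si_code values for memory-error SIGBUS; defined here only if
 * the system headers are too old to provide them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4 /* action required: the faulting access hit the error */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5 /* action optional: error was found asynchronously */
#endif

/* Hypothetical description of one buffer the application can rebuild. */
struct region {
    uintptr_t start;
    size_t length;
    int dirty; /* nonzero: modified since load, so the disk copy is stale */
};

enum recovery_action { RECOVERY_NOT_MCE, RECOVERY_RELOAD, RECOVERY_TERMINATE };

/* Policy: reload only clean data that lies inside the tracked region. */
static enum recovery_action decide_recovery(int si_code, uintptr_t addr,
                                            const struct region *r)
{
    if (si_code != BUS_MCEERR_AR && si_code != BUS_MCEERR_AO)
        return RECOVERY_NOT_MCE;
    if (addr < r->start || addr - r->start >= r->length)
        return RECOVERY_TERMINATE;
    return r->dirty ? RECOVERY_TERMINATE : RECOVERY_RELOAD;
}

static struct region g_cache;             /* filled in before installing */
static sigjmp_buf g_recovery_point;       /* set by the caller via sigsetjmp */
static volatile sig_atomic_t g_recovery_armed;

static void on_sigbus(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    enum recovery_action action =
        decide_recovery(info->si_code, (uintptr_t)info->si_addr, &g_cache);

    /* Jump back to the recovery point instead of returning, because
     * returning would re-execute the access to the affected page. */
    if (action == RECOVERY_RELOAD && g_recovery_armed)
        siglongjmp(g_recovery_point, 1);

    /* Unrecoverable here: end only this process, not the system. */
    _exit(EXIT_FAILURE);
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A caller would fill in `g_cache`, call `install_sigbus_handler()`, and guard accesses to the buffer with `sigsetjmp(g_recovery_point, 1)`, setting `g_recovery_armed` once the recovery point is valid. On the recovery path the application would still need to replace the affected mapping and reload the data from disk before retrying; that step depends on how the buffer was allocated and is omitted here.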
3) Conclusion
The additional advanced RAS features allow the Intel® Xeon® processor E7 v2 family to increase resiliency within the memory sub-system and I/O sub-system, so that when uncorrected hardware errors occur, the system can detect them, recover, and continue to operate instead of suffering fatal events that require a system reset. They also provide enhanced error reporting to expedite fault diagnosis.
4) References
[2] https://noggin.intel.com/content/autonomic-foundation-for-fault-diagnosis
[3] http://en.wikipedia.org/wiki/PCI_Express
[5] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-297-304.pdf
[6] http://en.wikipedia.org/wiki/Machine-check_exception
[7] http://en.wikipedia.org/wiki/Predictive_failure_analysis