Introduction
Intel's Cache Monitoring Technology (CMT) feature launched with the Xeon E5-2600 v3 line of server processors.
The previous blog posts listed below provide an overview of various aspects of the feature:
- Product page: https://software.intel.com/en-us/articles/intel-xeon-e5-2600-v3-product-family
- Blog 1: Introduction to CMT: https://software.intel.com/en-us/blogs/2014/06/18/benefit-of-cache-monitoring
- Blog 2: Discussion of RMIDs and CMT Software Interfaces: https://software.intel.com/en-us/blogs/2014/12/11/intel-s-cache-monitoring-technology-software-visible-interfaces
- Blog 3: Use Models and Example Data: https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-use-models-and-data
This blog, the fourth in the series, discusses details of available Operating System (OS) support, and software packages which can be used to test the feature.
Key topics in this installment include Linux* support through perf, and a software package that can be used on POSIX operating systems to monitor L3 cache usage on a per-application or per-VM basis by pinning applications or VMs to cores.
Standalone vs Scheduler based monitoring
Using CMT is straightforward from a code development perspective: model specific registers (MSRs) provide the interface for setting up the feature. All modern operating systems provide APIs that let suitably privileged users read and write MSRs. Linux* provides the msr-tools package, which includes the rdmsr and wrmsr commands. MS Windows has a similar interface. There are two approaches to Cache Monitoring:
- Standalone Cache Monitoring looks at the Last Level Cache usage from a Core or Logical Thread (referred to as CPU hereafter) perspective, regardless of what task is executing. An RMID is statically assigned to a CPU and periodically the occupancy is read back. If the platform has been statically configured and applications have been pinned to resources then this method will yield appropriate results. If system administrators are interested in whether the platform is suitably balanced and there are no misbehaving applications, this approach is acceptable.
- Scheduler based Cache Monitoring involves the operating system scheduler. When RMIDs are assigned statically, as in the Standalone method, they do not track the process or thread ID, so occupancy cannot be reported on a per-application basis. Tracking an application's occupancy therefore requires scheduler changes. Software must assign an RMID to the process, and the scheduler must then associate the CPU with that RMID whenever the application of interest is scheduled to run on it. When the application is de-scheduled or migrated to a different core or thread, the scheduler must update the RMID assignment so that occupancy is only attributed while the application is executing. System software is also responsible for any remapping of CMT settings required across processor sockets. Because RMIDs are local to each socket, if an application with a given RMID is moved to another processor, the OS or VMM is responsible for finding an available RMID on the destination socket to track the migrated application (if monitoring is still required).
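The register traffic behind the standalone approach can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: it assumes the MSR addresses and bit layouts documented in the Intel Software Developer's Manual (IA32_QM_EVTSEL at 0xC8D, IA32_QM_CTR at 0xC8E, IA32_PQR_ASSOC at 0xC8F, event ID 1 for L3 occupancy, a 10-bit RMID field) and the Linux msr driver's /dev/cpu/N/msr files. The helper names are illustrative, and the upscaling factor has to be obtained from CPUID on the target machine.

```python
import os
import struct

# MSR addresses and bit layouts as documented in the Intel SDM (assumed here).
IA32_QM_EVTSEL = 0xC8D   # selects the (event ID, RMID) pair to read
IA32_QM_CTR = 0xC8E      # returns the counter for that selection
IA32_PQR_ASSOC = 0xC8F   # associates the current CPU with an RMID
L3_OCCUPANCY_EVENT = 0x1
RMID_MASK = 0x3FF        # assumption: 10-bit RMID field


def evtsel_value(rmid, event_id=L3_OCCUPANCY_EVENT):
    """Encode IA32_QM_EVTSEL: event ID in bits 7:0, RMID in bits 41:32."""
    return ((rmid & RMID_MASK) << 32) | (event_id & 0xFF)


def pqr_assoc_value(rmid, old_value=0):
    """Replace the RMID field (low bits) and preserve all other bits."""
    return (old_value & ~RMID_MASK) | (rmid & RMID_MASK)


def decode_qm_ctr(raw, upscale_bytes):
    """Return occupancy in bytes, or None if Error (bit 63) or
    Unavailable (bit 62) is set in the raw counter value."""
    if raw & (1 << 63) or raw & (1 << 62):
        return None
    return (raw & ((1 << 62) - 1)) * upscale_bytes


def read_msr(cpu, addr):
    """Read a 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def write_msr(cpu, addr, value):
    """Write a 64-bit MSR through the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), addr)
    finally:
        os.close(fd)


def assign_rmid(cpu, rmid):
    """Statically bind an RMID to a CPU."""
    old = read_msr(cpu, IA32_PQR_ASSOC)
    write_msr(cpu, IA32_PQR_ASSOC, pqr_assoc_value(rmid, old))


def read_occupancy(cpu, rmid, upscale_bytes):
    """Select the RMID's L3 occupancy event, then read it back in bytes."""
    write_msr(cpu, IA32_QM_EVTSEL, evtsel_value(rmid))
    return decode_qm_ctr(read_msr(cpu, IA32_QM_CTR), upscale_bytes)
```

A standalone monitor would call assign_rmid once per CPU at start-up and then poll read_occupancy periodically from a CPU on the same socket. The encode and decode helpers are pure functions, so they can be checked without touching hardware.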
Several software development initiatives are in progress to enable standalone and scheduler based monitoring; they are described in the following sections.
Scheduler Based Monitoring – Linux* Operating System Support Overview
As explained in the above section, Scheduler based Cache Monitoring makes sure that the application of interest will be tracked with appropriate core and RMID association. This is achieved by integrating CMT into perf and its kernel support which is tightly bound to Linux* scheduler functionality.
On supported platforms (where both the processor and the OS support CMT), perf is used to specify which process or thread is to be monitored and assigns it an RMID. All threads not being monitored are assigned a default RMID, which captures the occupancy of everything not explicitly monitored. Once perf has configured the system for monitoring, context switches for the monitored threads result in a callback into the perf_events subsystem. When the CMT callback from the scheduler occurs (during the ‘context_switch’ kernel function), the perf_events subsystem selects the RMID associated with the thread being scheduled and assigns it to the CPU. That RMID is either one allocated for explicit monitoring or the default RMID if the scheduled thread has not been configured for monitoring. From this point until the next context switch, memory read requests from this logical processor and the cache loads they cause are attributed to the RMID just set up.
When a process or thread that is tracked for cache occupancy terminates, or the sched_out function call occurs, the perf CMT callback selects a new RMID. In this case the default RMID is selected so that cache loads are not counted towards any explicitly monitored thread. After the monitored process terminates, its RMID is returned to a pool of unused RMIDs and is recycled for new monitoring requests. Mainstream support for these capabilities is trending to kernel version 3.19.
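The RMID selection and recycling described above can be modelled in user space. The following is a simplified simulation, not the kernel code: the class names are hypothetical, RMID 0 is assumed to be the default RMID, and a plain attribute stands in for the per-CPU MSR write.

```python
DEFAULT_RMID = 0  # assumption: RMID 0 collects all unmonitored activity


class RmidPool:
    """Hands out per-socket RMIDs and recycles them when monitoring ends."""

    def __init__(self, max_rmid):
        # RMIDs 1..max_rmid are available; pop() yields the lowest first.
        self._free = list(range(max_rmid, 0, -1))
        self._by_task = {}

    def start_monitoring(self, tid):
        """Allocate an RMID for a thread (idempotent per thread)."""
        if tid not in self._by_task:
            if not self._free:
                raise RuntimeError("no free RMID on this socket")
            self._by_task[tid] = self._free.pop()
        return self._by_task[tid]

    def stop_monitoring(self, tid):
        """Return the thread's RMID to the pool for reuse."""
        rmid = self._by_task.pop(tid, None)
        if rmid is not None:
            self._free.append(rmid)

    def rmid_for(self, tid):
        """Monitored threads get their own RMID, all others the default."""
        return self._by_task.get(tid, DEFAULT_RMID)


class Cpu:
    """Models one logical CPU and the RMID currently associated with it."""

    def __init__(self, pool):
        self.pool = pool
        self.active_rmid = DEFAULT_RMID

    def context_switch(self, next_tid):
        # Stands in for writing the RMID association MSR on this CPU.
        self.active_rmid = self.pool.rmid_for(next_tid)
        return self.active_rmid
```

In this model, switching to an unmonitored thread always lands on the default RMID, so a monitored thread's occupancy only grows while that thread is actually running.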
Perf Implementation
The perf Linux* application provides an interface to kernel based performance counters. An extension has been developed to support the Cache Monitoring Technology (CMT) feature, allowing users to monitor last level cache occupancy on a per-process or per-thread basis. The new event is named intel_cqm/llc_occupancy/ and returns the occupancy in bytes. The patches to perf and the Linux* kernel are available on the mailing list here:
https://lkml.kernel.org/r/1415999712-5850-1-git-send-email-matt@console-pimps.org
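On a kernel and perf build that include these patches, the event can be requested through the standard perf stat options. The helper below is a small sketch that only builds the command line; the function name is illustrative, and whether the event is actually available depends on the processor and kernel in use.

```python
import shlex

CQM_EVENT = "intel_cqm/llc_occupancy/"  # event name from the perf extension


def perf_stat_command(pid, interval_ms=None):
    """Build a perf stat command that reports LLC occupancy for one PID.

    -e selects the event, -p attaches to an existing process, and -I
    prints a reading every interval_ms milliseconds when given.
    """
    cmd = ["perf", "stat", "-e", CQM_EVENT, "-p", str(pid)]
    if interval_ms is not None:
        cmd += ["-I", str(interval_ms)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to
    # execute on machines without CMT support.
    print(shlex.join(perf_stat_command(1234, interval_ms=1000)))
```

The returned list can be passed directly to subprocess.run on a supported system.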
The perf driver module checks for CMT hardware availability using the CPUID instruction (see blog #2). If CMT is detected, a number of callback functions are registered with perf. Some of the registered callbacks and their functions are listed below:
- .event_init:
  - Event handled – Start perf monitoring on a PID or TID
  - CMT callback – Allocates and sets a unique RMID per PID/TID
- .start:
  - Event handled – Start perf monitoring on a PID or TID after event_init
  - CMT callback – Starts the monitoring capability
- .add:
  - Event handled – Schedule-in of the monitored PID/TID
  - CMT callback – Sets the monitoring capability on the scheduled core
- .del:
  - Event handled – Schedule-out of the monitored PID/TID
  - CMT callback – Resets the monitoring capability on the scheduled-out core
- .read:
  - Event handled – Read monitoring counters for the PID/TID
  - CMT callback – Reads the CMT occupancy value from the MSR using the RMID associated with the PID/TID
- .stop:
  - Event handled – End of perf monitoring on a PID or TID
  - CMT callback – Resets the monitoring capability and clears all the allocated RMIDs
To make sure that the occupancy attributed to a CPU is accurate, the perf kernel component associates the RMID with the CPU only while the specific application thread is running on it. As explained in the previous section, when the Linux* scheduler switches the process out, the RMID is no longer associated with the core. In addition to RMID tracking, perf also supports process or thread inheritance (any child process inherits the RMID of its parent).
Figure: Basic operation of perf with CMT.
User Space CMT APIs
The motivation for proposing a limited set of user space CMT APIs is to make CMT easier to use and to integrate into applications. Developers can retrieve cache occupancy information from within their applications through a small set of calls. A unified access API of this kind would also allow better management of shared platform resources such as RMIDs and MSR access.
The proposed functions below would wrap the perf_event system calls and help track cache occupancy for tasks/PIDs.
- pqos_register_cmt(taskid, cpu): This API takes the PID/TID and the CPU id to be tracked by CMT. Internally, perf takes care of RMID assignment and RMID recycling through the scheduler implementation.
- pqos_get_cmt_occupancy(taskid, cpu): This API reports the last level cache occupancy for a registered task.
- pqos_unregister_cmt(taskid, cpu): This API unregisters a task and releases all RMIDs that were tracking its last level cache occupancy.
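The intended calling pattern for these three proposed functions can be illustrated with a toy model. This is only a sketch of the register, read, unregister flow: the CmtSession class and its pluggable read_occupancy backend are hypothetical, and a real implementation would open and read a perf_event file descriptor for intel_cqm/llc_occupancy/ instead.

```python
class CmtSession:
    """Hypothetical registry behind the three proposed calls."""

    def __init__(self, read_occupancy):
        # read_occupancy(taskid, cpu) -> occupancy in bytes. It stands in
        # for reading a perf_event descriptor; perf and the scheduler
        # would handle RMID assignment and recycling underneath.
        self._read = read_occupancy
        self._registered = set()

    def pqos_register_cmt(self, taskid, cpu):
        """Start tracking a task on a CPU."""
        self._registered.add((taskid, cpu))

    def pqos_get_cmt_occupancy(self, taskid, cpu):
        """Report last level cache occupancy for a registered task."""
        if (taskid, cpu) not in self._registered:
            raise KeyError(f"task {taskid} on cpu {cpu} is not registered")
        return self._read(taskid, cpu)

    def pqos_unregister_cmt(self, taskid, cpu):
        """Stop tracking the task and release its monitoring resources."""
        self._registered.discard((taskid, cpu))


if __name__ == "__main__":
    # Fake backend returning a fixed value, for illustration only.
    session = CmtSession(lambda taskid, cpu: 2 * 1024 * 1024)
    session.pqos_register_cmt(1234, 0)
    print(session.pqos_get_cmt_occupancy(1234, 0))
    session.pqos_unregister_cmt(1234, 0)
```

Keeping RMID handling out of the caller's view is the point of the proposal: the application only ever names a task and a CPU.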
Research is ongoing to provide a user space library that allows developers or system administrators to take advantage of CMT to track a number of applications without having to worry about RMID management.
Figure: Proposed design for the above implementation.
Virtual Machine Monitor Support (KVM & Xen)
Since KVM is a type 2 hypervisor, it inherits the scheduler enhancements discussed in the previous section. Administrators or developers can use perf to track the last level cache occupancy of a virtual machine. The process or thread IDs of the virtual machines can be retrieved from the operating system through top or the QEMU monitor.
Since Xen is a type 1 hypervisor, scheduler enhancements have to be made in the hypervisor itself to track last level cache occupancy. Xen 4.5 will be the first version of Xen supporting CMT. The hypervisor implementation associates an RMID with each domain (DomU or guest VM). Domains that have been specified for monitoring are associated with their own RMID, while those not specified are associated with the default RMID used to collect all non-monitored occupancy data. As the hypervisor schedules each domain onto a CPU and performs the context switch, it also writes the RMID to the CPU specific MSR, thus associating this CPU with the RMID and its domain. As long as the domain continues to run on the CPU, the LLC cache loads caused by the domain's memory reads from that CPU are tracked against its RMID. When the next domain is scheduled on this CPU and the currently monitored domain is switched out, its RMID is replaced on the CPU so that no further association exists.
Xen’s xl command tool has a few additions to support CMT. The additions allow users to attach monitoring to a domain, detach monitoring and to show the LLC occupancy information. They have the following form:
$ xl psr-cmt-attach domid
$ xl psr-cmt-detach domid
$ xl psr-cmt-show cache_occupancy
where domid is the id number of the domain (guest VM) of interest.
Multi-OS Support via the Standalone Cache Monitoring Technology Library
This standalone library enables developers to monitor last level cache occupancy on a per-CPU basis. When the library or application first starts, it checks for Cache Monitoring support. Once initialization is complete, the monitoring functionality provides a “top”-like interface listing the last level cache occupancy of each CPU. The library implements a number of APIs that let developers take advantage of CMT without having to program the MSRs that configure RMID assignment or retrieve the occupancy data. Developers can also use the library from within a virtual machine; however, either PV support or MSR bitmaps might be required to gain access to the CMT model specific registers.
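The library's own support check would use CPUID, as described in blog #2. On Linux, a quick approximation of the same check is to look for the CMT feature flags in /proc/cpuinfo. The sketch below assumes the kernel exposes the flags cqm, cqm_llc and cqm_occup_llc, which is only the case on kernels that know about the feature; the function names are illustrative.

```python
# Assumption: flag names as exposed by Linux kernels with CMT awareness.
CMT_FLAGS = {"cqm", "cqm_llc", "cqm_occup_llc"}


def cmt_flags_present(cpuinfo_text):
    """Return the CMT-related flags found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return CMT_FLAGS & set(value.split())
    return set()


def cmt_supported(path="/proc/cpuinfo"):
    """True if every expected CMT flag is listed for the first CPU."""
    with open(path) as f:
        return cmt_flags_present(f.read()) == CMT_FLAGS
```

A missing flag does not prove the hardware lacks CMT, since an older kernel may simply not report it; the CPUID check remains the authoritative test.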
Other Operating Systems and VMMs
Additional OSes and VMMs will be enabled over time. Check the documentation or feature list for your preferred OS/VMM to determine if CMT is supported on a particular version.
If your preferred OS/VMM doesn’t yet support CMT, its customer support organization may be able to track the feature request and provide an estimate of when support will be ready.
Conclusion
Several mainstream OSes and VMMs now include support for Intel's Cache Monitoring Technology (CMT), and for non-enabled OSes a software library will be available soon to enable experimentation, prototyping of resource management heuristics and deployment of the features.
Authors
Andrew Herdrich
Edwin Verplanke
Priya Autee
Will Auld