Counting CPU Cycles with ARM PMU

Exercises 3, 4, and 5 from Washington University CSE 522S: “Advanced Operating Systems”, Studio 3: “Hardware Counters and Loadable Kernel Modules”. For the Linux kernel version used in the course and the Raspberry Pi configuration, please refer to this post. A month ago, I tried to get timing measurements from userspace on my Raspberry Pi, but failed. This post provides a solution of sorts. The programs shown here are largely adapted from the starter code provided by Dr. Marion Sudvarg.

The Arm® Cortex-A53 Performance Monitor Unit (“PMU” for short, a co-processor separate from the main processor) enables us to gather various statistics on the operation of the processor and its memory system during runtime. The PMU provides \(32\)-bit event counters, which count selected events once they are enabled, and a \(64\)-bit cycle counter. The PMU counters and their associated control registers are accessible in the AArch32 Execution state from the internal CP15 system register interface with MCR (move to co-processor from ARM register) and MRC (move to ARM register from co-processor) instructions for \(32\)-bit registers and MCRR and MRRC for \(64\)-bit registers.

The PMCR register controls accesses to the PMU in general:

CRn  Op1  CRm  Op2  Name  Type  Width   Description
c9   0    c12  0    PMCR  RW    \(32\)  Performance Monitors Control Register

PMUSERENR is the User Enable Register that needs to be configured to allow user code to access the PMU:

CRn  Op1  CRm  Op2  Name       Type  Width   Description
c9   0    c14  0    PMUSERENR  RW    \(32\)  Performance Monitors User Enable Register

These two registers are of particular interest. More implementation details can be found in Arm® Cortex-A53 MPCore Processor Technical Reference Manual and ARM Architecture Reference Manual — ARMv8, for ARMv8-A architecture profile.


Reading from the Performance Monitors Cycle Count Register (PMCCNTR)

On x86 machines, we can directly access a Time Stamp Counter (TSC) register from userspace with the following code:

static inline unsigned long long rdtsc_get(void) {
    unsigned long high, low;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    return ( (unsigned long long) low ) |
           ( ( (unsigned long long) high ) << 32 );
}

ARM also provides a cycle counter, the Performance Monitors Cycle Count Register (PMCCNTR) on the PMU, which is architecturally mapped to the AArch64 System register PMCCNTR_EL0. PMCCNTR is a \(64\)-bit register that can also be accessed as a \(32\)-bit value; \(32\)-bit accesses read and write bits [31:0] and do not modify bits [63:32]. However, unlike the TSC register on x86, PMCCNTR is not accessible from userspace by default, so we have to write a kernel module to read its content. First, inside the directory that holds our out-of-tree kernel modules (/project/scratch01/compile/x.xingjian/modules), I created a C program named ccnt_get_kernel.c and a Makefile that contains the line:

obj-m := ccnt_get_kernel.o

The ccnt_get_kernel.c program should measure the elapsed cycles between two consecutive reads of PMCCNTR for \(100\) samples, and report the minimum, maximum, mean, and standard deviation of those elapsed cycles.

Enable Cycle Count

The first step is to set the Enable bit ([0]) of PMCR, which is the global counter enable bit. The second step is to enable the Cycle Count Register PMCCNTR by setting its enable bit ([31]) in PMCNTENSET (the Performance Monitors Count Enable Set register). Note that PMCCNTR increments either on every processor clock cycle or on every 64th processor clock cycle, depending on the D bit ([3]) of PMCR.

Writing 1 to the C bit ([2]) of PMCR sets the cycle count field to 0. If PMCCNTR is accessed as a \(32\)-bit register, it can be read using MRC <syntax> and written using MCR <syntax>. The syntax is as shown below:

<syntax>                  opc1  opc2  CRn   coproc  CRm
p15, 0, <Rt>, c9, c13, 0  000   000   1001  1111    1101
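
To make the bit fields in the table above concrete, here is a small sketch that packs them into a \(32\)-bit A32 instruction word. The helper name encode_cp_access is hypothetical (it does not come from any ARM or Linux header), and the field positions are my reading of the A32 MRC/MCR encoding. It is pure integer arithmetic, so it can be tried on any host machine:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: pack an A32 MRC/MCR instruction word.
 * Assumed layout:
 *   cond[31:28] 1110[27:24] opc1[23:21] L[20] CRn[19:16]
 *   Rt[15:12] coproc[11:8] opc2[7:5] 1[4] CRm[3:0]
 * L = 1 selects MRC (read), L = 0 selects MCR (write). */
static uint32_t encode_cp_access(bool is_read, unsigned coproc, unsigned opc1,
                                 unsigned rt, unsigned crn, unsigned crm,
                                 unsigned opc2)
{
    uint32_t insn = 0xEu << 28;                   /* condition AL (always) */

    insn |= 0xEu << 24;                           /* fixed bits 1110 */
    insn |= (uint32_t)(opc1 & 0x7u) << 21;
    insn |= (uint32_t)(is_read ? 1u : 0u) << 20;
    insn |= (uint32_t)(crn & 0xFu) << 16;
    insn |= (uint32_t)(rt & 0xFu) << 12;
    insn |= (uint32_t)(coproc & 0xFu) << 8;
    insn |= (uint32_t)(opc2 & 0x7u) << 5;
    insn |= 1u << 4;                              /* fixed bit */
    insn |= (uint32_t)(crm & 0xFu);
    return insn;
}
```

Under these assumptions, encode_cp_access(true, 15, 0, 0, 9, 13, 0) yields 0xEE190F1D, the word for MRC p15, 0, r0, c9, c13, 0; the CRn, coproc, and CRm nibbles 9, F, and D are the 1001, 1111, and 1101 fields from the table.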

If PMCCNTR is accessed as a \(64\)-bit register, it can be read using MRRC <syntax> and written using MCRR <syntax>. The syntax is as shown below:

<syntax>                 opc1  coproc  CRm
p15, 0, <Rt>, <Rt2>, c9  0000  1111    1001

For the second step, PMCNTENSET is read using MRC <syntax> and written using MCR <syntax>, with the syntax shown below:

<syntax>                  opc1  opc2  CRn   coproc  CRm
p15, 0, <Rt>, c9, c12, 1  000   001   1001  1111    1100

Here I include an implementation modified from Dr. Marion Sudvarg’s driver code:

// PMCR: Performance Monitor Control Register
#define PMCR_ENABLE_COUNTERS (1 << 0)
#define PMCR_EVENT_COUNTER_RESET (1 << 1)
#define PMCR_CYCLE_COUNTER_RESET (1 << 2)
#define PMCR_CYCLE_COUNT_64 (1 << 3)
#define PMCR_EXPORT_ENABLE (1 << 4)
#define PMCR_CYCLE_COUNTER_DISABLE (1 << 5)
#define PMCR_CYCLE_COUNTER_64_BIT (1 << 6)

// CNTENS: Count Enable Set Register
#define CNTENS_CTR0 (1U << 0)
#define CNTENS_CTR1 (1U << 1)
#define CNTENS_CTR2 (1U << 2)
#define CNTENS_CTR3 (1U << 3)
#define CNTENS_CYCLE_CTR (1U << 31)

// Utility Functions
static inline void pmcr_set(unsigned x);
static inline void cntens_set(unsigned x);
static inline unsigned long pmcr_get(void);
static inline unsigned long cntens_get(void);
static inline void pmcr_enable_pmccntr(bool cycles_64_bit, bool count_every_64);
static inline void cntens_enable_pmccntr(void);
static inline void pmccntr_enable_once(bool cycles_64_bit, bool count_every_64);

static inline void pmcr_set(unsigned x)
{
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r" (x));
}

static inline unsigned long pmcr_get(void)
{
    unsigned long x = 0;
    asm volatile ("MRC p15, 0, %0, c9, c12, 0\t\n" : "=r" (x));
    return x;
}

static inline void cntens_set(unsigned x)
{
    asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r" (x));
}

static inline unsigned long cntens_get(void)
{
    unsigned long x = 0;
    asm volatile ("MRC p15, 0, %0, c9, c12, 1\t\n" : "=r" (x));
    return x;
}

static inline void pmcr_enable_pmccntr(bool cycles_64_bit, bool count_every_64)
{
    unsigned long flags = pmcr_get();
    flags = flags | PMCR_ENABLE_COUNTERS | PMCR_CYCLE_COUNTER_RESET;
    if (cycles_64_bit)
        flags |= PMCR_CYCLE_COUNTER_64_BIT;
    else
        flags = flags & ~PMCR_CYCLE_COUNTER_64_BIT;
    if (count_every_64)
        flags |= PMCR_CYCLE_COUNT_64;
    else
        flags = flags & ~PMCR_CYCLE_COUNT_64;
    pmcr_set(flags);
}

static inline void cntens_enable_pmccntr(void)
{
    /* PMCNTENSET is write-one-to-set: bits written as 0 leave the other
       counters unchanged, so no read-modify-write is needed here. */
    cntens_set(CNTENS_CYCLE_CTR);
}

static inline void pmccntr_enable_once(bool cycles_64_bit, bool count_every_64)
{
    unsigned long flags;
    flags = pmcr_get();
    if ( !(flags & PMCR_ENABLE_COUNTERS) )
        pmcr_enable_pmccntr(cycles_64_bit, count_every_64);
    flags = cntens_get();
    if ( !(flags & CNTENS_CYCLE_CTR) )
        cntens_enable_pmccntr();
}

Retrieve Cycle Count

Based on the ARM assembly syntax above, we can use the following function, which takes no arguments and returns the counter as a \(64\)-bit integer, to retrieve the CPU cycle count:

static inline unsigned long long pmccntr_get(void) {
    register unsigned long pmcr = pmcr_get();
    if(pmcr & PMCR_CYCLE_COUNTER_64_BIT) {
        unsigned long low, high;
        asm volatile ("MRRC p15, 0, %0, %1, c9" : "=r" (low), "=r" (high));
        return ( (unsigned long long) low ) |
               ( ( (unsigned long long) high ) << 32 );
    } else {
        unsigned long cycle_count;
        asm volatile ("MRC p15, 0, %0, c9, c13, 0" : "=r" (cycle_count));
        return (unsigned long long) cycle_count;
    }
}
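
When the counter is read through its \(32\)-bit view, it wraps around after \(2^{32}\) ticks. As a sketch that is not part of the course's driver code, unsigned subtraction still gives the correct difference across a single wrap, and a tick count can be scaled to clock cycles when the divider bit (PMCR_CYCLE_COUNT_64 above) is set. Both helper names below are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two reads of the 32-bit view.
 * Unsigned arithmetic is modulo 2^32, so a single wrap-around between the
 * two reads is handled correctly. */
static uint32_t ccnt_delta32(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: convert counter ticks to clock cycles.
 * Assumes that with the divider enabled one tick stands for 64 cycles. */
static uint64_t ccnt_ticks_to_cycles(uint64_t ticks, bool count_every_64)
{
    return count_every_64 ? ticks * 64u : ticks;
}
```

For example, ccnt_delta32(0xFFFFFFF0, 0x10) evaluates to 0x20, even though the second read is numerically smaller than the first.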

Math in the Kernel

For calculating the mean and standard deviation, we will likely need to use \(64\)-bit division and a square-root function. The cross-compiler we use in the course may not be able to generate the correct operations for \(64\)-bit division on the target \(32\)-bit platform. In his famous Linux Kernel Development, Robert Love wrote:

When a userspace process uses floating-point instructions, the kernel manages the transition from integer to floating point mode. What the kernel has to do when using floating-point instructions varies by architecture, but the kernel normally catches a trap and then initiates the transition from integer to floating point mode.

Unlike userspace, the kernel does not have the luxury of seamless support for floating point because it cannot easily trap itself. Using a floating point inside the kernel requires manually saving and restoring the floating point registers, among other possible chores. The short answer is: Don’t do it! Except in the rare cases, no floating-point operations are in the kernel.

For \(64\)-bit division, we can use the div_u64() function declared in <linux/math64.h>:

static inline u64 div_u64(u64 dividend, u32 divisor)
{
    u32 remainder;
    return div_u64_rem(dividend, divisor, &remainder);
}

For calculating a power of any given \(64\)-bit number, we can use the int_pow() function; for an integer approximation of the square root, we can use the int_sqrt() or int_sqrt64() function. All of them are declared in <linux/kernel.h>:

u64 int_pow(u64 base, unsigned int exp);
unsigned long int_sqrt(unsigned long);

#if BITS_PER_LONG < 64
u32 int_sqrt64(u64 x);
#else
static inline u32 int_sqrt64(u64 x)
{
    return (u32)int_sqrt(x);
}
#endif
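
To see how these pieces fit together without floating point, here is a userspace sketch of integer-only mean and standard deviation. The helpers isqrt_u64 and int_stats are hypothetical stand-ins written for illustration; in the kernel, the marked lines would use div_u64(), int_pow(), and int_sqrt64() instead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's int_sqrt64(): floor(sqrt(x)). */
static uint32_t isqrt_u64(uint64_t x)
{
    uint64_t result = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= result + bit) {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)result;
}

/* Integer mean and population standard deviation of n samples (n > 0). */
static void int_stats(const uint64_t *samples, size_t n,
                      uint64_t *mean_out, uint32_t *sd_out)
{
    uint64_t sum = 0, ssq = 0, mean, dev;
    size_t i;

    for (i = 0; i < n; i++)
        sum += samples[i];
    mean = sum / n;                     /* kernel: div_u64(sum, n) */

    for (i = 0; i < n; i++) {
        /* absolute deviation avoids unsigned wrap-around below the mean */
        dev = samples[i] > mean ? samples[i] - mean : mean - samples[i];
        ssq += dev * dev;               /* kernel: int_pow(dev, 2) */
    }
    *mean_out = mean;
    *sd_out = isqrt_u64(ssq / n);       /* kernel: int_sqrt64(div_u64(ssq, n)) */
}
```

With the samples 2, 4, 4, 4, 5, 5, 7, 9, this gives a mean of 5 and a standard deviation of 2. Both results are truncated to integers, which is why the kernel log can only report whole cycles.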

Finally, my kernel module code looks like this:

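/* ccnt_get_kernel.c */
#include <linux/kernel.h>   /* printk(), int_pow(), int_sqrt64() */
#include <linux/math64.h>   /* div_u64() */
#include <linux/module.h>   /* module_init(), module_exit(), MODULE_* */
#include <linux/smp.h>      /* on_each_cpu() */
#include <linux/types.h>    /* bool, uint64_t */
/* The PMU driver code shown above must also be included here. */
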
static bool cycles_64_bit = 1;
static bool count_every_64 = 0;

static uint64_t elapse[100];

static void enable_ccnt(void *info)
{
    pmccntr_enable_once(cycles_64_bit, count_every_64);
}

static void ccnt_get(void)
{
    uint64_t t1 = 0, t2 = 0, min = 32768, max = 0, sum = 0, mean = 0;
    /* 64-bit accumulators: unsigned long is only 32 bits wide on this target */
    uint64_t dev = 0, ssq = 0;
    u32 sd = 0;
    unsigned int i;
    for (i = 0; i < 100; i++) {
        t1 = pmccntr_get();
        t2 = pmccntr_get();
        if (t2 > t1) {
            elapse[i] = t2 - t1;
            if (elapse[i] < min) {
                min = elapse[i];
            }
            if (elapse[i] > max) {
                max = elapse[i];
            }
        } else {
            /* skip if overflow: retry this sample */
            i--;
        }
    }

    for (i = 0; i < 100; i++) {
        sum += elapse[i];
    }
    mean = div_u64(sum, (uint32_t)100);
    for (i = 0; i < 100; i++) {
        /* absolute deviation: elapse[i] - mean would wrap when elapse[i] < mean */
        dev = (elapse[i] > mean) ? elapse[i] - mean : mean - elapse[i];
        ssq += int_pow(dev, 2);
    }
    sd = int_sqrt64(div_u64(ssq, (uint32_t)100));
    printk(KERN_INFO "Minimum and Maximum of the elapsed cycles: %llu, %llu\n", min, max);
    printk(KERN_INFO "Average of the elapsed cycles: %llu\n", mean);
    printk(KERN_INFO "Standard deviation of the elapsed cycles: %u\n", sd);
}

static int ccnt_get_init(void) {
    /* wait = 1: return only after every CPU has enabled its counter */
    on_each_cpu(enable_ccnt, NULL, 1);
    printk(KERN_INFO "Cycle Counter Enabled\n");
    ccnt_get();
    return 0;
}

static void ccnt_get_exit(void) {

}

module_init(ccnt_get_init);
module_exit(ccnt_get_exit);

MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Xingjian Xuanyuan");
MODULE_DESCRIPTION("Studio 3");

To build the module, first set LINUX_SOURCE to the path of the kernel source tree:

LINUX_SOURCE=/project/scratch01/compile/x.xingjian/linux_source/linux

Then compile the module via:

make -C $LINUX_SOURCE ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- M=$PWD modules

which, if successful, should produce a kernel module file named ccnt_get_kernel.ko. Transfer this file to the Raspberry Pi using sftp. Use the insmod utility to load this module into the kernel:

sudo insmod ccnt_get_kernel.ko

To confirm the module was successfully loaded, issue the command:

lsmod

to see a listing of all currently loaded kernel modules. To unload the module, issue the command:

sudo rmmod ccnt_get_kernel

Finally, we can view the kernel log:

$ dmesg | tail
[  569.358298] enable_ccnt_522: loading out-of-tree module taints kernel.
[  569.358779] Cycle Counter Enabled
[  569.358807] Minimum and Maximum of the elapsed cycles: 10, 27
[  569.358818] Average of the elapsed cycles: 10
[  569.358830] Standard deviation of the elapsed cycles: 1

Accessing the PMU from Userspace

The ARM PMU has a register, the Performance Monitor User Enable Register (PMUSERENR), that, when set, allows direct access to the PMU from userspace. We need the following function to configure it:

static inline void pmccntr_enable_once_user(bool cycles_64_bit, bool count_every_64) {
    pmccntr_enable_once(cycles_64_bit, count_every_64);
    asm volatile ("MCR p15, 0, %0, c9, c14, 0" :: "r" (1));
}

If we call this function inside enable_ccnt() instead of pmccntr_enable_once(), the module should still load correctly.

Moreover, the following userspace program calls the pmccntr_get() function directly:

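/* ccnt_get_user.c */
#include <math.h>       /* pow(), sqrt() */
#include <stdbool.h>    /* bool */
#include <stdio.h>      /* printf() */
/* The PMU driver code shown above must also be included here. */
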
static bool cycles_64_bit = 1;
static bool count_every_64 = 0;

static unsigned long long elapse[100];

static void enable_ccnt(void) {
    pmccntr_enable_once(cycles_64_bit, count_every_64);
}

static void ccnt_get(void)
{
    unsigned long long t1 = 0, t2 = 0, min = 32768, max = 0, sum = 0;
    double mean = 0.0, sd = 0.0, ssq = 0.0;
    unsigned int i;
    for (i = 0; i < 100; i++) {
        t1 = pmccntr_get();
        t2 = pmccntr_get();
        if (t2 > t1) {
            elapse[i] = t2 - t1;
            if (elapse[i] < min) {
                min = elapse[i];
            }
            if (elapse[i] > max) {
                max = elapse[i];
            }
        } else {
            /* skip if overflow */
            i--;
        }
    }

    for (i = 0; i < 100; i++) {
        sum += elapse[i];
    }
    mean = sum / (double)100;
    for (i = 0; i < 100; i++) {
        ssq += pow((elapse[i] - mean), 2);
    }
    sd = sqrt(ssq / (double)100);
    printf("Minimum and Maximum of the elapsed cycles: %llu, %llu\n", min, max);
    printf("Average of the elapsed cycles: %.4lf\n", mean);
    printf("Standard deviation of the elapsed cycles: %.4lf\n", sd);
}

int main(void)
{
    enable_ccnt();
    ccnt_get();

    return 0;
}

Make sure to include the driver file and <stdbool.h>. To compile this program, issue the command:

gcc ccnt_get_user.c -o ccnt_get_user -lm

Here are the output results:

$ ./ccnt_get_user
Minimum and Maximum of the elapsed cycles: 48, 159
Average of the elapsed cycles: 49.1800
Standard deviation of the elapsed cycles: 11.0593

Compared with the kernel module, each pair of reads costs more cycles from userspace (a mean of about 49 versus 10), and the standard deviation is significantly larger (about 11 versus 1).

Using the perf_event_open() Syscall

The Linux kernel provides indirect access to hardware counters, including the ARM PMU, via the perf_event_open() system call with system call number defined in /arch/arm64/include/asm/unistd32.h:

#define __NR_perf_event_open 364
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)

Because glibc provides no wrapper for this function, it must be called directly with syscall(). System calls are very dependent on architecture, both in how they are invoked and in which calls are available. These are some of the files essential to the perf_event_open() syscall on the ARM architecture:

Measured performance events come in two flavors: counting and sampling. A counting event is one that is used for counting the aggregate number of events that occur. In general, counting event results are gathered with a read() call. A sampling event periodically writes measurements to a buffer that can then be accessed via mmap().

/**
 * sys_perf_event_open - open a performance event, associate it to a task/cpu
 *
 * @attr_uptr:  event_id type attributes for monitoring/sampling
 * @pid:        target pid
 * @cpu:        target cpu
 * @group_fd:   group leader event fd
 */
SYSCALL_DEFINE5(perf_event_open,
                struct perf_event_attr __user *, attr_uptr,
                pid_t, pid, int, cpu, int, group_fd, unsigned long, flags) { ... }

When the pid argument is zero and the cpu argument is minus one (pid == 0 && cpu == -1), the syscall measures the calling process/thread on any CPU. A single event on its own is created with group_fd == -1 and is considered to be a group with only one member.
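
As a minimal sketch of those argument choices (not the program used for the measurements in this post), the two hypothetical helpers below fill in a perf_event_attr for counting CPU cycles and open it for the calling thread on any CPU. Unlike the program further down, this sketch starts the event disabled; that is my own choice for illustration:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>   /* struct perf_event_attr, PERF_* constants */
#include <string.h>
#include <sys/syscall.h>        /* SYS_perf_event_open */
#include <unistd.h>             /* syscall() */

/* Hypothetical helper: attributes for counting user-mode CPU cycles.
 * The event starts disabled, so the caller is expected to enable it
 * later with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0). */
static void init_cycle_counting_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->disabled = 1;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/* Hypothetical helper: open the counter for the calling thread.
 * pid == 0 and cpu == -1: measure the caller on any CPU.
 * group_fd == -1: the event forms a group of its own.
 * Returns a file descriptor, or -1 with errno set on failure. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    init_cycle_counting_attr(&attr);
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

A caller would typically reset and enable the event with ioctl(), run the code of interest, and then read() a single \(64\)-bit count from the descriptor. The open can fail (for example when the system restricts access to performance events), so the return value should always be checked.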

In this exercise, we are asked to create a program directly on the Raspberry Pi that uses this syscall to get a file descriptor which, when read, supplies the current value of the PMCCNTR register. This Arm® blog post provides some helpful instructions. Use PERF_COUNT_HW_CPU_CYCLES as the config field of the perf_event_attr argument to the syscall. Then, as in the previous exercises, take \(100\) samples of the elapsed cycles between two consecutive reads, and have the program print the minimum, maximum, mean, and standard deviation of these samples.

My code looks like this:

/* perf_event.c */
#define _GNU_SOURCE
#include <linux/perf_event.h>    /* Definition of PERF_* constants */
#include <linux/hw_breakpoint.h> /* Definition of HW_* constants */
#include <sys/syscall.h>         /* Definition of SYS_* constants */
#include <sys/ioctl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <math.h>
#include <assert.h>
#include <asm/unistd.h>

unsigned long long elapse[100];

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;
    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
    return ret;
}

int main(void)
{
    struct perf_event_attr pe;
    int ccnt_fd, rc;
    unsigned long long c1 = 0, c2 = 0, min = 32768, max = 0, sum = 0;
    double mean = 0.0, sd = 0.0, ssq = 0.0;
    unsigned int i;

    memset(&pe, 0, sizeof(struct perf_event_attr));

    pe.disabled = 0;                            // start enabled (0 means "not disabled")
    pe.type = PERF_TYPE_HARDWARE;               // "generalized" hardware events
    pe.exclude_hv = 1;                          // exclude hypervisor
    pe.size = sizeof(struct perf_event_attr);   // allow the kernel to see the struct size at the time of compilation
    pe.enable_on_exec = 1;                      // auto enable on exec
    pe.inherit = 1;                             // count children
    pe.exclude_kernel = 1;                      // excludes events that happen in kernel space
    pe.config = PERF_COUNT_HW_CPU_CYCLES;       // total cycles

    ccnt_fd = perf_event_open(&pe, getpid(), -1, -1, 0);
    if (ccnt_fd < 0) {
        perror("perf_event_open for CPU cycles failed");
        exit(EXIT_FAILURE);
    }

    // ioctl(ccnt_fd, PERF_EVENT_IOC_RESET, 0);
    // ioctl(ccnt_fd, PERF_EVENT_IOC_ENABLE, 0);
    printf("Measuring CPU cycle count for two subsequent reads:\n");

    // ioctl(ccnt_fd, PERF_EVENT_IOC_DISABLE, 0);
    for (i = 0; i < 100; i++) {
        asm volatile ("nop;"); // pseudo-barrier
        rc = read(ccnt_fd, &c1, sizeof(c1)); assert(rc == (int)sizeof(c1)); // read() returns -1 on error
        rc = read(ccnt_fd, &c2, sizeof(c2)); assert(rc == (int)sizeof(c2));
        asm volatile ("nop;"); // pseudo-barrier
        if (c2 > c1) {
            elapse[i] = c2 - c1;
            if (elapse[i] < min) {
                min = elapse[i];
            }
            if (elapse[i] > max) {
                max = elapse[i];
            }
        } else {
            /* skip if overflow */
            i--;
        }
    }

    for (i = 0; i < 100; i++) {
        sum += elapse[i];
    }
    mean = sum / (double)100;
    for (i = 0; i < 100; i++) {
        ssq += pow((elapse[i] - mean), 2);
    }
    sd = sqrt(ssq / (double)100);
    printf("Minimum and Maximum of the elapsed cycles: %llu, %llu\n", min, max);
    printf("Average of the elapsed cycles: %.4lf\n", mean);
    printf("Standard deviation of the elapsed cycles: %.4lf\n", sd);

    close(ccnt_fd);

    return 0;
}

To compile this program, issue the command:

gcc perf_event.c -o perf_event -lm

Here are the output results:

$ ./perf_event
Measuring CPU cycle count for two subsequent reads:
Minimum and Maximum of the elapsed cycles: 69, 113
Average of the elapsed cycles: 72.4100
Standard deviation of the elapsed cycles: 7.7603