Question

perf on arm cortex-a7 produces no callstacks

I've written a small C++ test program that I want to profile using perf on an ARM system. Running and profiling the program on my x86 WSL produces the expected perf results. However, when I profile the program on the ARM system, the perf report contains no callstacks and shows different methods compared to the x86 output. I will show my program, the perf output on x86, and the perf output on ARM.

My program has a main function that continuously loops a short and a long method, which both call run.

#include <map>

int run(int loop, std::map<int,int> m)
{
    int x = 0;
    for(int i = 0; i < loop; i++)
    {
       x += i + i * x;
       m.insert({i,x});
    }
    return x;
}

int short_method(std::map<int,int> m)
{
    return run(100, m);
}

int long_method(std::map<int,int> m)
{
    return run(10000, m);
}

int main()
{
    while(true)
    {
        std::map<int,int> m;
        short_method(m);
        long_method(m);
    }
    return 0;
}

First look at the expected behavior under x86 WSL. Compilation and profiling:

g++ -g -O0 main.cpp

perf record -g -F 1000 -p$(pgrep -d, a.out) sleep 5
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.637 MB perf.data (5002 samples) ]

perf report -g

Output:

+   97.84%     0.00%  a.out    [unknown]            [.] 0x41fd89415541f689
+   97.84%     0.00%  a.out    libc-2.28.so         [.] __libc_start_main
-   97.84%     0.02%  a.out    a.out                [.] main
   - 97.82% main
      - 97.14% long_method
         + 90.31% run
         + 6.62% std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::~map
      - 0.64% short_method
         + 0.54% run
+   97.14%     0.00%  a.out    a.out                [.] long_method
+   90.85%     0.44%  a.out    a.out                [.] run
+   89.59%     0.30%  a.out    a.out                [.] std::map<int, int, std::less<int>, std::allocator<std::pair<int const, int> > >::in

The output is usable to me, as the main method is clearly on top and I can open up the callstack to identify the methods called inside main.

Now compilation and profiling on the arm system:

arm-__-linux-gnueabi-g++  -mthumb -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a7   -O0  -Wformat -Wformat-security -Werror=format-security --sysroot=/opt/sdk/sysroots/cortexa7t2hf-neon-vfpv4-__-linux-gnueabi -g -fno-omit-frame-pointer main.cpp -o a_arm.out
perf record -g -F 1000 -p 405 sleep 5
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.346 MB perf.data (5005 samples) ]

perf report -g

Output:

-   21.35%    21.12%  a_arm.out  a_arm.out            [.] std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_get_insert_unique_pos a
     std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_get_insert_unique_pos                                                      a
-   12.46%    12.40%  a_arm.out  a_arm.out            [.] std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_S_key                   a
     std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_S_key                                                                        a
-   11.15%    11.07%  a_arm.out  a_arm.out            [.] __gnu_cxx::__aligned_membuf<std::pair<int const, int> >::_M_ptr                                                                                                                   a
     __gnu_cxx::__aligned_membuf<std::pair<int const, int> >::_M_ptr                                                                                                                                                                        a
-    7.31%     7.31%  a_arm.out  a_arm.out            [.] std::_Rb_tree_node<std::pair<int const, int> >::_M_valptr                                                                                                                         a
     std::_Rb_tree_node<std::pair<int const, int> >::_M_valptr                                                                                                                                                                              a
-    4.86%     4.84%  a_arm.out  a_arm.out            [.] __gnu_cxx::__aligned_membuf<std::pair<int const, int> >::_M_addr                                                                                                                  a
     __gnu_cxx::__aligned_membuf<std::pair<int const, int> >::_M_addr                                                                                                                                                                       a
-    3.05%     3.03%  a_arm.out  a_arm.out            [.] std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >, std::less<int>, std::allocator<std::pair<int const, int> > >::_M_insert_<std::pair<int a
     std::_Rb_tree<int, std::pair<int const, int>, std::_Select1st<std::pair<int const, int> >,

Different calls are at the top of the execution time, and it's not clearly visible that most of the time is spent in main. When I open any of the topmost functions, I do not see a proper callstack telling me who calls the function or what functions it calls (depending on the callee/caller option).

I wonder why I don't have a proper callstack for my arm program, any ideas?

Edit: This 12-year-old question seems to describe the same issue: "How to get call graph profiling working with gcc compiled code and ARM Cortex A8 target?"

Update (04.07.24): the same problem appears here: https://lore.kernel.org/all/483117050be94fe89841a5cb74b66150@de.bosch.com/T/

I also tried my example program on a Raspberry Pi with Linux kernel 6.6.31 and perf version 6.6.31. Same issue: flat callstacks. The issue seems to appear primarily on ARM architectures.

Similar, from 2016


Solution

Add the -marm, -fno-omit-frame-pointer and -mapcs-frame flags when compiling to generate stack frames that perf can walk.

According to this LLVM bug:

On ARM this requires APCS frames (-fno-omit-frame-pointer -mapcs-frame) to ensure that frame pointers are stored in predictable locations.

Adding -mapcs-frame solved it for me: I can now see call stacks with the --call-graph fp option. The dwarf option still does not work. In addition, Thumb mode must be disabled, so I removed the -mthumb flag and replaced it with -marm to force the ARM instruction set. See the Google issue "Perf callgraph symbolization on thumb", opened in 2020 and closed in 2023 as "won't fix". Apparently this is a 32-bit ARM issue and is no longer present on arm64.
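For reference, here is a rough sketch of how the adjusted build and profiling commands might look, based on the command line from the question. The toolchain name is left elided exactly as in the question, the PID 405 is simply the one used there, and the --sysroot and -Wformat options are left out for brevity; you would keep your own. The script only prints the commands, so it can be inspected without a cross toolchain or perf installed.

```shell
#!/bin/sh
# Sketch only. CXX is a placeholder for the (elided) cross compiler name
# from the question; PID is the process ID of the running a_arm.out.
CXX="arm-__-linux-gnueabi-g++"
PID=405

# Flags from the question with -mthumb replaced by -marm and
# -mapcs-frame added next to -fno-omit-frame-pointer.
CXXFLAGS="-marm -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a7 -O0 -g -fno-omit-frame-pointer -mapcs-frame"

# Print the commands instead of running them; copy them to the
# build host / target (or drop the echo) to actually execute them.
echo "$CXX $CXXFLAGS main.cpp -o a_arm.out"
echo "perf record --call-graph fp -F 1000 -p $PID sleep 5"
echo "perf report -g"
```

With frame-pointer unwinding requested explicitly via --call-graph fp, the report should then show main at the top with long_method and short_method nested below it, as in the x86 output.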

2024-07-16
Max