Question

Understanding the flow of the kernel upon receiving a SIGSEGV for null-dereference

I'm trying to figure out the sequence of events inside the Linux kernel (x86_64, v6.9) when a program executes either of these two statements:

// Null-dereference + writing to page zero
*(char *)0 = 0;
// Null-dereference + only reading from page zero
char c = *(char *)0;

I traced it with ftrace, and this is what I got:

handle_mm_fault <-- do_user_addr_fault
sanitize_fault_flags <-- handle_mm_fault
arch_vma_access_permitted <-- handle_mm_fault
bad_area_nosemaphore <-- do_user_addr_fault
__bad_area_nosemaphore <-- do_user_addr_fault
force_sig_fault <-- __bad_area_nosemaphore

So from my understanding, we cause a page fault, and somehow arch_vma_access_permitted() OR sanitize_fault_flags() decides to return VM_FAULT_SIGSEGV and __bad_area_nosemaphore() uses that to send a SIGSEGV to the process with force_sig_fault(). My question is, what is the permission of page zero? Does it even get mapped in the first place? If it doesn't, then I think vma_is_foreign() should cover this situation and cause the segmentation fault. I also found something interesting in load_elf_binary() for emulating the ABI behavior of previous Linux versions:

if (current->personality & MMAP_PAGE_ZERO) {
        /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
           and some applications "depend" upon this behavior.
           Since we do not have the power to recompile these, we
           emulate the SVr4 behavior. Sigh. */
        error = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC,
                        MAP_FIXED | MAP_PRIVATE, 0);
}

The most confusing part of it is the PROT_EXEC. Why do we need to store instructions inside of page zero? And if it has PROT_READ as well, then while current->personality has MMAP_PAGE_ZERO, reading from page zero should not cause a segmentation fault, right? Couldn't find the SVr4 spec so I'm not sure about the details. I'm also not certain when this personality applies to a task but we can conclude that in some scenarios page zero GETS mapped (of course we can remap it by using mmap() if mmap_min_addr is zero, but I'm talking about the default behavior right now, not remapping). I can't find any other vm_mmap() or do_mmap() that is mapping page zero.

1 Jan 1970

Solution


what is the permission of page zero? Does it even get mapped in the first place?

It doesn't. Apart from the quirky piece of code you show, and other exceptions such as userspace explicitly requesting the mapping when vm.mmap_min_addr = 0, no page is normally mapped at virtual address zero. Even if it were, it would behave like any other page: the fact that the virtual address is zero doesn't make it special.
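One way to check this from user space is to look at the process's own mappings. The following is a minimal sketch, assuming a Linux system with /proc mounted; lowest_mapping_start() is a hypothetical helper name, not a kernel or libc API. It reports the lowest start address listed in /proc/self/maps, which should be well above zero for an ordinary process.

```c
#include <stdio.h>

/*
 * Hypothetical helper: scan /proc/self/maps and store the lowest mapping
 * start address of the calling process in *start.
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
int lowest_mapping_start(unsigned long *start)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[8192];            /* large enough for one maps line with a long path */
    unsigned long lo, hi, lowest = 0;
    int found = 0;

    if (!f)
        return -1;

    /* Each line begins with "start-end" in hexadecimal. */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%lx-%lx", &lo, &hi) != 2)
            continue;
        if (!found || lo < lowest) {
            lowest = lo;
            found = 1;
        }
    }
    fclose(f);

    if (!found)
        return -1;
    *start = lowest;
    return 0;
}
```

If the stored value is greater than zero, no vma covers address 0x0, so any access to it takes the "no vma" path in the page fault handler.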

When the fault occurs for the program you show, it will end up exactly inside this branch of do_user_addr_fault(), because the process does not have a mapping (vma) for 0x0:

    ...
lock_mmap:

retry:
    vma = lock_mm_and_find_vma(mm, address, regs);
    if (unlikely(!vma)) {
        bad_area_nosemaphore(regs, error_code, address); // <== HERE
        return;
    }
    ...

Not sure why you are mentioning vma_is_foreign(): it cannot be reached if there is no vma to begin with. bad_area_nosemaphore() is responsible for SIGSEGV delivery. It calls __bad_area_nosemaphore() with si_code SEGV_MAPERR, which in turn calls force_sig_fault(), and from there another series of functions delivers the signal.


The most confusing part of it is the PROT_EXEC. Why do we need to store instructions inside of page zero?

Nothing needs to be stored there. Old software expects read-implies-exec behavior because that's how the hardware used to work: without an NX (no-execute) bit in the page tables, any readable page is also executable. This is still the case for older x86 32-bit CPUs, and the kernel emulates it for some 32-bit software running on x86_64. Take a look at this comment for example.

And if it has PROT_READ as well, then while current->personality has MMAP_PAGE_ZERO, reading from page zero should not cause a segmentation fault, right?

Right.

I'm also not certain when this personality applies to a task

It mainly depends on the underlying architecture. Arch-specific code chooses which personality bits to toggle on new execs. The SET_PERSONALITY and SET_PERSONALITY2 macros are defined by each arch and you can see the different definitions here and here.
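To see which case applies to a running task, the personality can be queried from user space. This is a minimal sketch, assuming glibc's <sys/personality.h>; page_zero_personality_enabled() is a hypothetical helper name. Passing 0xffffffff to personality(2) returns the current persona without changing it.

```c
#include <sys/personality.h>

/*
 * Hypothetical helper: return 1 if the calling task has MMAP_PAGE_ZERO set,
 * 0 if it does not, and -1 if the personality could not be queried.
 */
int page_zero_personality_enabled(void)
{
    /* 0xffffffff queries the current persona without modifying it. */
    int persona = personality(0xffffffff);

    if (persona == -1)
        return -1;
    return (persona & MMAP_PAGE_ZERO) != 0;
}
```

For a regular x86_64 process started from a shell this is expected to return 0, since the default personality (PER_LINUX) does not include MMAP_PAGE_ZERO.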

2024-07-14
Marco Bonelli