eBPF Implementation in Kernel

Unprivileged eBPF

The kernel lets an unprivileged user load eBPF programs when /proc/sys/kernel/unprivileged_bpf_disabled is 0.

The check is implemented in __sys_bpf in linux/kernel/bpf/syscall.c:

static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
{
	union bpf_attr attr;
	bool capable;
	int err;

	capable = bpf_capable() || !sysctl_unprivileged_bpf_disabled;
	...
}
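The effect of that expression can be sketched in user space. This is a minimal model, not kernel code: bpf_load_permitted and read_unprivileged_bpf_disabled are hypothetical helper names, and the has_bpf_capability flag merely stands in for the result of bpf_capable().

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Models only the expression shown above: a caller with the BPF
 * capability always passes, anyone else passes only while the
 * sysctl value is 0. (Hypothetical helper, not a kernel function.)
 */
static bool bpf_load_permitted(bool has_bpf_capability,
			       int unprivileged_bpf_disabled)
{
	return has_bpf_capability || !unprivileged_bpf_disabled;
}

/*
 * Reads an integer sysctl from a procfs path.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_unprivileged_bpf_disabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling read_unprivileged_bpf_disabled("/proc/sys/kernel/unprivileged_bpf_disabled") and passing the result to bpf_load_permitted shows that any non-zero sysctl value denies an unprivileged caller, since the expression only tests for zero.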

Where is eBPF byte code allocated?

eBPF byte code is allocated in the vmalloc region. The call path is bpf_prog_load -> bpf_prog_alloc -> bpf_prog_alloc_no_stats -> __vmalloc.
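From user space, bpf_prog_load is reached through the bpf(2) system call with the BPF_PROG_LOAD command. The following sketch loads a two-instruction program so that this path can be exercised. The helper name load_trivial_prog and the choice of BPF_PROG_TYPE_SOCKET_FILTER are assumptions made for illustration, and the call may fail (for example with EPERM) depending on the sysctl above and the environment.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for syscall() under strict -std modes */
#endif
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Smallest useful program: r0 = 0; exit. */
static const struct bpf_insn trivial_prog[] = {
	{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
	{ .code = BPF_JMP | BPF_EXIT },
};

/*
 * Hypothetical helper: asks the kernel to load trivial_prog.
 * Returns a program file descriptor, or -1 with errno set.
 */
static int load_trivial_prog(void)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;	/* assumed type */
	attr.insns = (uint64_t)(uintptr_t)trivial_prog;
	attr.insn_cnt = sizeof(trivial_prog) / sizeof(trivial_prog[0]);
	attr.license = (uint64_t)(uintptr_t)"GPL";

	return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

On success the returned descriptor should be closed with close(); the kernel-side allocation described above happens during this one system call.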

bpf_prog_load also calls bpf_check to verify the program. Does the kernel make the program read-only before verification?

RO Hardening for Byte Code

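After bpf_check returns, bpf_prog_load calls bpf_prog_select_runtime, which calls bpf_prog_lock_ro to mark the program's data structure and byte code read-only. The program is therefore still writable during verification and is locked only afterwards.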
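The same ordering (fill the buffer while it is writable, then lock it) can be illustrated with a user-space analogy. This is only an analogy under stated assumptions: it uses mmap and mprotect rather than the kernel's page-attribute helpers, and alloc_then_lock_ro is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copies len bytes into a fresh page-aligned mapping, then removes
 * write permission. Returns the mapping, or NULL on failure.
 */
static void *alloc_then_lock_ro(const void *src, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t size;
	void *buf;

	if (page <= 0 || len == 0)
		return NULL;

	/* Round the length up to a whole number of pages. */
	size = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Populate phase: the buffer is still writable here. */
	memcpy(buf, src, len);

	/* Lock phase: from now on a store through buf faults. */
	if (mprotect(buf, size, PROT_READ) != 0) {
		munmap(buf, size);
		return NULL;
	}
	return buf;
}
```

Reads from the returned buffer keep working, while any later write raises SIGSEGV, which is the user-space counterpart of the protection the kernel applies to the program after verification.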

Where is JITed code allocated?

On x86 and RISC-V, the JITed code is placed in the module region. For RISC-V, this is shown in Documentation/riscv/vm-layout.rst:

RISC-V Linux Kernel SV39
===================================================================================================
   Start addr    |   Offset   |     End addr     |  Size   | VM area description
===================================================================================================
                 |            |                  |         |
0000000000000000 |    0       | 0000003fffffffff |  256 GB | user virtual memory, different per mm
_________________|____________|__________________|_________|_______________________________________
                                                           |
___________________________________________________________|_______________________________________
                 |            |                  |         |
ffffffc6fee00000 | -228    GB | ffffffc6feffffff |    2 MB | fixmap
ffffffc6ff000000 | -228    GB | ffffffc6ffffffff |   16 MB | PCI io
ffffffc700000000 | -228    GB | ffffffc7ffffffff |    4 GB | vmemmap
ffffffc800000000 | -224    GB | ffffffd7ffffffff |   64 GB | vmalloc/ioremap space
ffffffd800000000 | -160    GB | fffffff6ffffffff |  124 GB | direct mapping of all physical memory
fffffff700000000 |  -36    GB | fffffffeffffffff |   32 GB | kasan
_________________|____________|__________________|_________|________________________________________
                                                           |
___________________________________________________________|________________________________________
                 |            |                  |         |
ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | modules, BPF
ffffffff80000000 |   -2    GB | ffffffffffffffff |    2 GB | kernel
_________________|____________|__________________|_________|________________________________________

The RISC-V kernel also defines BPF_JIT_REGION_START.

For the actual JITed code memory allocation, bpf_jit_alloc_exec calls __vmalloc_node_range and passes BPF_JIT_REGION_START as the starting address.
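To check where a given kernel address falls, the Sv39 table above can be turned into a small lookup. This is a sketch: the ranges are copied from the table as quoted here, and the names vm_region, sv39_regions and sv39_region_name are hypothetical. It does not use the kernel's BPF_JIT_REGION_START definition.

```c
#include <stddef.h>
#include <stdint.h>

struct vm_region {
	uint64_t start;
	uint64_t end;		/* inclusive */
	const char *name;
};

/* Ranges transcribed from the RISC-V Sv39 layout table above. */
static const struct vm_region sv39_regions[] = {
	{ 0x0000000000000000ULL, 0x0000003fffffffffULL, "user" },
	{ 0xffffffc6fee00000ULL, 0xffffffc6feffffffULL, "fixmap" },
	{ 0xffffffc6ff000000ULL, 0xffffffc6ffffffffULL, "PCI io" },
	{ 0xffffffc700000000ULL, 0xffffffc7ffffffffULL, "vmemmap" },
	{ 0xffffffc800000000ULL, 0xffffffd7ffffffffULL, "vmalloc/ioremap" },
	{ 0xffffffd800000000ULL, 0xfffffff6ffffffffULL, "direct mapping" },
	{ 0xfffffff700000000ULL, 0xfffffffeffffffffULL, "kasan" },
	{ 0xffffffff00000000ULL, 0xffffffff7fffffffULL, "modules, BPF" },
	{ 0xffffffff80000000ULL, 0xffffffffffffffffULL, "kernel" },
};

/* Returns the table entry containing addr, or "unmapped" for a hole. */
static const char *sv39_region_name(uint64_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(sv39_regions) / sizeof(sv39_regions[0]); i++) {
		if (addr >= sv39_regions[i].start && addr <= sv39_regions[i].end)
			return sv39_regions[i].name;
	}
	return "unmapped";
}
```

An address of a JITed program on RISC-V would be expected to resolve to "modules, BPF" rather than "vmalloc/ioremap", which is the distinction this section draws against ARM64.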

On ARM64, the call path is bpf_jit_binary_alloc -> bpf_jit_alloc_exec -> [vmalloc], so the JITed code is in the vmalloc memory region.

AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::

  Start			End			Size		Use
  -----------------------------------------------------------------------
  0000000000000000	0000ffffffffffff	 256TB		user
  ffff000000000000	ffff7fffffffffff	 128TB		kernel logical memory map
 [ffff600000000000	ffff7fffffffffff]	  32TB		[kasan shadow region]
  ffff800000000000	ffff800007ffffff	 128MB		modules
  ffff800008000000	fffffbffefffffff	 124TB		vmalloc
  fffffbfff0000000	fffffbfffdffffff	 224MB		fixed mappings (top down)
  fffffbfffe000000	fffffbfffe7fffff	   8MB		[guard region]
  fffffbfffe800000	fffffbffff7fffff	  16MB		PCI I/O space
  fffffbffff800000	fffffbffffffffff	   8MB		[guard region]
  fffffc0000000000	fffffdffffffffff	   2TB		vmemmap
  fffffe0000000000	ffffffffffffffff	   2TB		[guard region]
  -----------------------------------------------------------------------

RO Hardening for JIT Code

On ARM64, the BPF JIT compiler makes the JITed native code ROX, that is, read-only and executable (PXN clear). The call path is bpf_int_jit_compile -> bpf_jit_binary_lock_ro -> set_memory_rox -> change_memory_common.

change_memory_common clears the PXN bit, which allows the native code to execute in kernel mode.
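The permission change can be modelled as a set mask and a clear mask applied to a page table entry. The sketch below is a simplified user-space model, not the kernel implementation: the bit positions (read-only at bit 7, write at bit 51, PXN at bit 53) are assumptions taken from the arm64 page table definitions, and apply_masks and model_set_memory_rox are hypothetical names.

```c
#include <stdint.h>

/* Assumed arm64 descriptor bits (see the lead-in). */
#define PTE_RDONLY	(UINT64_C(1) << 7)	/* read-only */
#define PTE_WRITE	(UINT64_C(1) << 51)	/* writable */
#define PTE_PXN		(UINT64_C(1) << 53)	/* privileged execute-never */

/* Clears the bits in clear_mask, then sets the bits in set_mask. */
static uint64_t apply_masks(uint64_t pte, uint64_t set_mask,
			    uint64_t clear_mask)
{
	pte &= ~clear_mask;
	pte |= set_mask;
	return pte;
}

/*
 * Models the combined read-only + executable transition:
 * set read-only, drop write permission, and clear PXN so the
 * page becomes executable in kernel mode.
 */
static uint64_t model_set_memory_rox(uint64_t pte)
{
	return apply_masks(pte, PTE_RDONLY, PTE_WRITE | PTE_PXN);
}
```

Starting from an entry that is writable and has PXN set, model_set_memory_rox yields an entry with the read-only bit set and both the write and PXN bits clear, while all unrelated bits are left untouched.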