Linux Kernel - Syscalls

I am writing this series of articles to explain, and learn along the way, how the Linux kernel works (some parts of it at least). I find it difficult to find a place to start learning how the kernel works, so I will be taking apart different functionalities in order to gain insight into how they work :D

I am by no means an expert, so feel free to reach out to correct anything incorrect or incomplete!

I will be focusing on version 5.14.14, the latest at the time of writing, and on the x86_64 architecture.

Syscalls

When learning low-level programming, the first time I encountered the kernel was when I learned about the syscall instruction. This instruction is a black box for userland programs: you set up some registers properly, issue the instruction, and magically, the state of your program has been modified according to what the syscall does. But how does it work behind the scenes ?

In this post, we will see how syscalls travel from userland to the kernel and back. On the x86_64 architecture, there are two ways to issue syscalls: using the syscall instruction, or using the compatibility int 0x80 interrupt, which exists for 32-bit program compatibility. There are differences between the two; for now I will focus on the former, and might add an article for the latter one day :)

The instruction

The first step is to see what the instruction actually does, and for this, we will have to look inside the Intel manual.

Let’s take it step by step :

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX)

SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR

What we can learn from this is that there is a Model Specific Register (MSR) named IA32_LSTAR, that should contain the address of our system-call handler. We will see what value it contains just after.

We also learn that the address at which we should return once the syscall is done is saved in RCX, which means that the previous value of this register is lost ! (This also explains the value that you can find in the RCX register after a syscall, if you’ve ever wondered.) The same goes for RFLAGS, which overwrites the R11 register.
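To observe this from userland, here is a minimal sketch that issues the instruction by hand and captures RAX, RCX and R11 right after it. It assumes GCC or Clang on x86_64 Linux; the names syscall_trace, raw_syscall0 and raw_getpid are mine and only exist for illustration.

```c
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#if !defined(__x86_64__) || !defined(__linux__)
#error "This sketch only makes sense on x86_64 Linux"
#endif

/* Registers of interest right after a SYSCALL instruction returns. */
struct syscall_trace {
	long     ret;  /* RAX: return value of the system call            */
	uint64_t rcx;  /* RCX: overwritten by SYSCALL with the return RIP  */
	uint64_t r11;  /* R11: overwritten by SYSCALL with RFLAGS          */
	uint64_t next; /* address of the label placed right after SYSCALL  */
};

/* Issue a system call that takes no arguments (getpid, for instance). */
struct syscall_trace raw_syscall0(long nr)
{
	struct syscall_trace t;
	register uint64_t r11 __asm__("r11");

	__asm__ volatile(
		"syscall\n"
		"1:\n\t"
		"leaq 1b(%%rip), %[next]"
		: "=a"(t.ret), "=c"(t.rcx), "=r"(r11), [next] "=r"(t.next)
		: "a"(nr)
		: "memory");

	t.r11 = r11;
	return t;
}

/* Example use: getpid() without going through the libc wrapper. */
long raw_getpid(void)
{
	return raw_syscall0(SYS_getpid).ret;
}
```

On a regular x86_64 Linux machine, I would expect the rcx field to be equal to next (the address right after the instruction) and r11 to look like a flags value, which matches the manual's description.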

So, where is this LSTAR register set ? This is done during the booting process, in arch/x86/kernel/cpu/common.c, the relevant line is :

wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

The wrmsrl is a function that ends up using the WRMSR instruction to write to a model specific register.
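WRMSR takes its operands in fixed registers: the MSR index in ECX, and the 64-bit value split between EDX (high half) and EAX (low half). The sketch below only prepares those operands in plain C, since the instruction itself is privileged and cannot run in userland. The struct and function names are hypothetical, and the address used in the test is a made-up kernel-looking value.

```c
#include <stdint.h>

#define MSR_LSTAR 0xc0000082u /* index of the IA32_LSTAR MSR */

/* The three register operands WRMSR expects. */
struct wrmsr_operands {
	uint32_t ecx; /* MSR index                 */
	uint32_t eax; /* low 32 bits of the value  */
	uint32_t edx; /* high 32 bits of the value */
};

/* Split a 64-bit value the way a wrmsrl()-like helper has to. */
struct wrmsr_operands prepare_wrmsr(uint32_t msr, uint64_t value)
{
	struct wrmsr_operands ops = {
		.ecx = msr,
		.eax = (uint32_t)value,
		.edx = (uint32_t)(value >> 32),
	};
	return ops;
}
```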

The entrypoint

Once the syscall instruction is issued, we end up at the entry_SYSCALL_64 symbol. This symbol is defined in the arch/x86/entry/entry_64.S file.

This is an assembly file, let’s take it step by step:

swapgs
/* tss.sp2 is scratch space. */
movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

Getting a proper stack pointer

The first thing we want to do is save the user-supplied registers, which are to be preserved. The simplest idea would be to push them on a kernel stack; however, at this point RSP still holds the userland stack pointer, and we have no easy way to get the kernel one. This is where swapgs comes into play : it exchanges the userland gs base register with the kernel one, which, on Linux, is a way to access per-CPU storage. Per-CPU variables are declared in a section named .data..percpu, and each CPU gets its own copy of it. More information about it can be found here.

Inside this .data..percpu area, we have the TSS, or Task State Segment. More information can be found on Wikipedia, but basically, it was designed to store the state of a task for hardware task switching; on x86_64, it mostly holds stack pointers. Notably, we have sp0, sp1 and sp2, meant to hold the stack pointer for privilege levels (rings) 0, 1 and 2. Since Linux does not use ring 2, the hardware never reads sp2, so TSS_sp2 can be used as scratch space for saving RSP.

First, the code saves the userland RSP into PER_CPU_VAR(cpu_tss_rw + TSS_sp2). PER_CPU_VAR is a macro that makes the access relative to GS, so we end up storing it into [GS:cpu_tss_rw + TSS_sp2].

Skipping the next line for now, the following instruction loads the kernel stack pointer, which is stored in the per-CPU storage, into RSP. We now have a proper stack pointer ! The value comes from cpu_current_top_of_stack, which is a macro that accesses the sp1 field of the TSS, where the kernel stack pointer is stored.

SWITCH_TO_KERNEL_CR3 is a macro that sets the CR3 register, which holds the physical address of the top-level page table, called the Page Global Directory (PGD) on Linux, used for virtual address translation. So why would we need to change it ? I believe this is linked to the Kernel page-table isolation (KPTI) protection, which separates the page tables for userland and kernelland, to mitigate the Meltdown vulnerability.
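To make the stack switch more concrete, here is a small userland model of these first instructions (minus the CR3 switch). Everything in it is a simplification I made up for illustration: percpu_area stands in for the per-CPU storage with only the two slots we care about, and cpu_model holds the few registers involved.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified model of the per-CPU area. */
struct percpu_area {
	uint64_t sp1; /* stands in for cpu_current_top_of_stack    */
	uint64_t sp2; /* stands in for the cpu_tss_rw + TSS_sp2 slot */
};

/* The registers that matter during the first entry instructions. */
struct cpu_model {
	uint64_t rsp;
	uint64_t gs_base;        /* base currently used for GS accesses */
	uint64_t kernel_gs_base; /* the other base, parked in an MSR    */
};

/* SWAPGS: exchange the active GS base with the parked one. */
void swapgs(struct cpu_model *cpu)
{
	uint64_t tmp = cpu->gs_base;
	cpu->gs_base = cpu->kernel_gs_base;
	cpu->kernel_gs_base = tmp;
}

/* Model of the start of entry_SYSCALL_64, without the CR3 switch. */
void enter_kernel(struct cpu_model *cpu)
{
	swapgs(cpu);

	/* After SWAPGS, GS-relative accesses reach the per-CPU area. */
	struct percpu_area *pcpu = (struct percpu_area *)(uintptr_t)cpu->gs_base;

	pcpu->sp2 = cpu->rsp; /* movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)     */
	cpu->rsp = pcpu->sp1; /* movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp */
}
```

The point of the model is the ordering: until swapgs has run, the kernel has no way to reach its own data, so the userland RSP can only be parked once GS points at the per-CPU area.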

Saving registers

Now that the kernel has a stack pointer, we can save the userland context : this comes just after.

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)

	/* Construct struct pt_regs on stack */
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

It creates a struct pt_regs on the newly set up stack, in order to save all the registers and restore them later on. The userland RSP is pushed from the per-CPU slot (accessed through GS) where it was stored earlier. This struct is basically all the registers in a specific order. PUSH_AND_CLEAR_REGS is a macro that pushes all the remaining registers that the syscall has to preserve, and clears them.

Also note that both the flags and IP registers are pushed from R11 and RCX respectively, which makes sense according to what the SYSCALL instruction does.

In the struct pt_regs, there are two fields for the ax register: orig_ax saves the syscall number, and ax holds the return value. For now, the return value in the struct is set to -ENOSYS, the value returned when the syscall number does not exist.
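Since the stack grows downwards, the pushes happen in the reverse order of the struct fields. The following sketch replays the pushes on a small array and copies the result into a struct laid out like the x86_64 struct pt_regs. The names pt_regs_model, kstack and build_frame are mine, and registers that do not matter here are simply pushed as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field order of the x86_64 struct pt_regs, lowest address first. */
struct pt_regs_model {
	uint64_t r15, r14, r13, r12, bp, bx;
	uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
	uint64_t orig_ax;               /* syscall number             */
	uint64_t ip, cs, flags, sp, ss; /* pushed first by the entry  */
};

#define NREGS (sizeof(struct pt_regs_model) / sizeof(uint64_t))
static_assert(sizeof(struct pt_regs_model) == 21 * 8, "21 slots of 8 bytes");

/* A tiny downward-growing stack, just big enough for one frame. */
struct kstack { uint64_t slot[NREGS]; size_t top; };

static void push(struct kstack *s, uint64_t v) { s->slot[--s->top] = v; }

/* Replay the pushes of entry_SYSCALL_64 for a syscall number nr. */
struct pt_regs_model build_frame(uint64_t nr, uint64_t user_sp,
				 uint64_t rcx, uint64_t r11, uint64_t rdi)
{
	struct kstack s = { .top = NREGS };
	struct pt_regs_model regs;

	push(&s, 0x2b);    /* ss: __USER_DS                */
	push(&s, user_sp); /* sp: read back from TSS_sp2   */
	push(&s, r11);     /* flags: SYSCALL put them here */
	push(&s, 0x33);    /* cs: __USER_CS                */
	push(&s, rcx);     /* ip: SYSCALL put it here      */
	push(&s, nr);      /* orig_ax                      */

	/* PUSH_AND_CLEAR_REGS rax=$-ENOSYS */
	push(&s, rdi);                  /* di            */
	push(&s, 0);                    /* si            */
	push(&s, 0);                    /* dx            */
	push(&s, rcx);                  /* cx            */
	push(&s, (uint64_t)-38);        /* ax = -ENOSYS  */
	push(&s, 0);                    /* r8            */
	push(&s, 0);                    /* r9            */
	push(&s, 0);                    /* r10           */
	push(&s, r11);                  /* r11           */
	for (int i = 0; i < 6; i++)     /* bx, bp, r12 to r15 */
		push(&s, 0);

	assert(s.top == 0); /* the frame is exactly one pt_regs */
	memcpy(&regs, s.slot, sizeof(regs));
	return regs;
}
```

Once the last push is done, RSP points at the r15 slot, which is also the address of the whole struct: this is why the entry code can hand RSP over as a struct pt_regs pointer.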

Next, we have the following lines

	movq	%rsp, %rdi
	movslq	%eax, %rsi
	call	do_syscall_64		

We are setting up the arguments for a call to do_syscall_64, the first argument being a pointer to our new struct pt_regs, the second being the number of the syscall we are trying to call.
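A detail worth noting is the movslq: only the low 32 bits of RAX are used, and they are sign-extended to 64 bits. A plain C equivalent could look like this (the function name is hypothetical, and the casts rely on the wrapping conversion that GCC and Clang implement):

```c
#include <stdint.h>

/* movslq %eax, %rsi: keep the low 32 bits of RAX and sign-extend them. */
int64_t syscall_nr_from_rax(uint64_t rax)
{
	return (int64_t)(int32_t)(uint32_t)rax;
}
```

So garbage in the upper half of RAX does not change which syscall is called, and 0xffffffff in the lower half is seen as -1.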

The do_syscall_64 is a C function, defined in arch/x86/entry/common.c.

__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
	add_random_kstack_offset();
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();
	syscall_exit_to_user_mode(regs);
}

The first function called, add_random_kstack_offset, is a hardening feature in the same spirit as KASLR : its goal is to add a random offset to the kernel stack address, and this for each syscall. It uses GCC’s __builtin_alloca to move the stack pointer by a random amount, and inline assembly to keep the compiler from optimizing the allocation away.

Next, syscall_enter_from_user_mode is called, which is essentially a wrapper for __syscall_enter_from_user_work, defined in kernel/entry/common.c.

From what I understand, this function is related to syscall tracing and filtering, for instance when using ptrace or seccomp, and simply returns the syscall number when no such work is pending, which we will assume here. Similarly, instrumentation_begin is called; it marks the start of a region where instrumentation (tracing, probes) is allowed inside this noinstr function, so we will ignore it.

Afterwards, do_syscall_x64 and do_syscall_x32 are called, which is where our syscall is dispatched to the correct handler ! Before diving into how they work, I’ll first mention why we might call an x32 syscall.

The reason for that is that even if we issue a SYSCALL instruction in a 64-bit process, we can still access the syscalls of the x32 ABI (an ABI that runs in 64-bit mode but with 32-bit pointers), exposed for compatibility, provided the kernel is built with support for it ! If you didn’t know, by issuing a syscall with RAX set to 0x40000000 + x, you can call these syscalls (x being the number of the specific syscall you are interested in) !

This is also the reason you’ll almost always see in seccomp rules a line at the beginning that looks like # A >= 0x40000000 ? dead : next, to ensure we can’t escape the filter using this x32 ABI.

Taking a glimpse at the beginning of do_syscall_x32, we can see this :

static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
{
	/*
	 * Adjust the starting offset of the table, and convert numbers
	 * < __X32_SYSCALL_BIT to very high and thus out of range
	 * numbers for comparisons.
	 */
	unsigned int xnr = nr - __X32_SYSCALL_BIT;

where __X32_SYSCALL_BIT is defined using

#define __X32_SYSCALL_BIT	0x40000000

Afterwards, xnr is used as the number of the syscall in the list of x32 ones.
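The subtraction trick from the kernel comment can be checked with a few lines of C. This is a sketch with names of my own (X32_SYSCALL_BIT, x32_table_index); unlike the kernel, it performs the subtraction on unsigned values so that the wraparound is well defined in standard C.

```c
#define X32_SYSCALL_BIT 0x40000000u /* same value as __X32_SYSCALL_BIT */

/*
 * Numbers that carry the x32 bit map to a small table index; anything
 * below the bit wraps around to a huge value that fails a range check.
 */
unsigned int x32_table_index(int nr)
{
	return (unsigned int)nr - X32_SYSCALL_BIT;
}
```

For instance, 0x40000000 + 5 gives index 5, while a regular syscall number such as 1 becomes 0xC0000001, far beyond the size of any table.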

But let’s focus on do_syscall_x64 for now :)

The dispatcher

This function, do_syscall_x64, is defined in arch/x86/entry/common.c

Its goal is simple : it has to choose the correct handler, if it exists, and call it, passing the struct pt_regs as an argument.

static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
	/*
	 * Convert negative numbers to very high and thus out of range
	 * numbers for comparisons.
	 */
	unsigned int unr = nr;

	if (likely(unr < NR_syscalls)) {
		unr = array_index_nospec(unr, NR_syscalls);
		regs->ax = sys_call_table[unr](regs);
		return true;
	}
	return false;
}

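First, we check whether the syscall number is in range, comparing it to NR_syscalls, a macro whose value is the number of entries in the table (one more than the highest syscall number); in this version, it is 448. Since the number was converted to an unsigned int, a negative number becomes a very large value and fails the check as well.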

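array_index_nospec is a macro used to mitigate speculative execution attacks (Spectre v1): it clamps the index so that even a mispredicted branch cannot read outside the table. Architecturally, it returns our syscall number unchanged.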

Then, the call to our handler is made ! For this, the array sys_call_table is indexed by our syscall number, and the handler is called with our struct pt_regs as an argument. The return value of the handler is stored in our registers structure, and we are ready to start the return to userland !
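Here is a userland model of this dispatcher, with a two-entry table of toy handlers. The handlers, the table and the dispatch function are hypothetical; index_mask uses the same arithmetic as the kernel's generic C version of the mask (x86 has its own assembly variant), and it relies on the arithmetic right shift of negative values that GCC and Clang implement.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef long (*handler_t)(long arg);

/* Toy handlers standing in for real syscall implementations. */
long handler_double(long arg) { return 2 * arg; }
long handler_negate(long arg) { return -arg; }

const handler_t table[] = { handler_double, handler_negate };
#define NR_HANDLERS (sizeof(table) / sizeof(table[0]))

/* All ones when index < size, zero otherwise, computed without a branch. */
unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index))
			       >> (sizeof(long) * CHAR_BIT - 1));
}

/* Same shape as do_syscall_x64: range check, clamp, indirect call. */
bool dispatch(int nr, long arg, long *result)
{
	unsigned int unr = nr; /* negative numbers become huge */

	assert(result != NULL);
	if (unr < NR_HANDLERS) {
		unr &= index_mask(unr, NR_HANDLERS);
		*result = table[unr](arg);
		return true;
	}
	return false;
}
```

The mask never changes an index that passed the range check; it only matters when the CPU speculates past the comparison with an out-of-range value, in which case the index is forced to 0.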

But before doing that, we will take a look at how the sys_call_table is constructed. It is located in arch/x86/entry/syscall_64.c

asmlinkage const sys_call_ptr_t sys_call_table[] = {
#include <asm/syscalls_64.h>
};

So actually, the content of the table is not really defined here. But if we look for the file syscalls_64.h in the source code, we can’t find it ! The reason is that this header file is generated at build time, using the scripts/syscalltbl.sh script.

This script is used to parse the file arch/x86/entry/syscalls/syscall_64.tbl, which contains the list of our syscalls.

Looking at the start, we can see the usual suspects.

0	common	read			sys_read
1	common	write			sys_write
2	common	open			sys_open
...
13	64	rt_sigaction		sys_rt_sigaction
...

512	x32	rt_sigaction		compat_sys_rt_sigaction
513	x32	rt_sigreturn		compat_sys_x32_rt_sigreturn
514	x32	ioctl			compat_sys_ioctl
...

The second column specifies which ABI it is part of. The syscalls_64.h header is built using the syscalls classified as either common or 64.

The list also references the symbols of the handlers: for instance, for the read syscall, the handler in the array that we’ll end up calling is sys_read (through a wrapper carrying the __x64_ prefix, like the __x64_sys_ni_syscall we saw earlier).

For the do_syscall_x32 function, the process is exactly the same, after subtracting __X32_SYSCALL_BIT. The only difference is that we are using another array, x32_sys_call_table, which is built from asm/syscalls_x32.h, which is also generated using scripts/syscalltbl.sh, which again parses syscall_64.tbl (yes, this is not a typo), and extracts the syscalls classified as either common or x32. The x32-specific ones are often prefixed with compat_.
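The idea of building a table by including a list of entries can be reproduced with the classic X-macro pattern. In this sketch, the SYSCALL_LIST macro stands in for the generated header (a real header cannot be generated inside a single self-contained file), and the stub handlers and their return values are made up.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*sys_call_ptr_t)(void);

/* Stand-in for the generated header: one entry per syscall. */
#define SYSCALL_LIST(X) \
	X(0, sys_read)  \
	X(1, sys_write) \
	X(2, sys_open)

/* First expansion: declare every handler. */
#define DECLARE_HANDLER(nr, sym) long sym(void);
SYSCALL_LIST(DECLARE_HANDLER)
#undef DECLARE_HANDLER

/* Second expansion: emit the table entries, in order. */
#define TABLE_ENTRY(nr, sym) sym,
const sys_call_ptr_t model_sys_call_table[] = {
	SYSCALL_LIST(TABLE_ENTRY)
};
#undef TABLE_ENTRY

#define MODEL_NR_SYSCALLS \
	(sizeof(model_sys_call_table) / sizeof(model_sys_call_table[0]))
static_assert(MODEL_NR_SYSCALLS == 3, "one table slot per list entry");

/* Hypothetical stub handlers; the real ones take a struct pt_regs pointer. */
long sys_read(void)  { return 100; }
long sys_write(void) { return 101; }
long sys_open(void)  { return 102; }
```

The same list is expanded twice with two different macros, which is why a single .tbl file is enough to produce both the declarations and the table, and why another ABI filter on the same file yields a second table.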

Back to userland

Once the handler returns, we are back in do_syscall_x64: the ax register stored on the stack is set to the return value of our handler, and the function returns true.

Next, we go back to do_syscall_64, and since do_syscall_x64 just returned true, we call instrumentation_end, the counterpart of the aforementioned instrumentation_begin, and go to syscall_exit_to_user_mode.

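Finally, once that is done, the code returns to entry_SYSCALL_64, and there, there are several paths to return to userland. As far as I understand, this is because the saved registers may have been modified since we entered (by a tracer, for instance), and the fast path can only restore a state that looks like what SYSCALL produced. Let’s see the relevant assembly.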

	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11

	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
	jne	swapgs_restore_regs_and_return_to_usermode

Remembering that our saved registers are stored on the kernel stack, we load the saved values of RCX and RIP (the latter into R11; here RCX and RIP are macros that contain the offsets of these registers in a struct pt_regs).

Since we arrived in entry_SYSCALL_64 using a SYSCALL instruction, RCX contained the saved value of RIP, so unless something modified the saved registers in between, they are equal here.

Next, we have :

#ifdef CONFIG_X86_5LEVEL
	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
#else
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif

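This is a sanity check to ensure that the return address in RCX is canonical, which means that all its bits above the highest implemented virtual address bit (bit 47 with 4-level paging, bit 56 with 5-level paging) are copies of that bit. To do so, the code shifts RCX left and then arithmetically right by 16 or 7 bits, which replaces the top 16 or 7 bits with copies of bit 47 or 56, and compares the result with the original value (kept in R11). If they differ, the address is not canonical; otherwise it is.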

(__VIRTUAL_MASK_SHIFT is a macro whose value depends on whether the kernel uses 4- or 5-level page tables)

	cmpq	%rcx, %r11
	jne	swapgs_restore_regs_and_return_to_usermode

Why two possible shift values ? This depends on whether the kernel uses a 4- or 5-level page table; more information on the virtual memory layout can be found in Documentation/x86/x86_64/mm.rst
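The shl/sar pair is easy to reproduce in C. The sketch below uses function names of my own and relies on the arithmetic right shift of signed values, which GCC and Clang implement; va_bits is 48 for 4-level paging and 57 for 5-level paging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* shl then sar: overwrite the top bits with copies of bit (va_bits - 1). */
uint64_t sign_extend_va(uint64_t addr, unsigned int va_bits)
{
	assert(va_bits > 0 && va_bits <= 64);
	unsigned int shift = 64 - va_bits;

	return (uint64_t)((int64_t)(addr << shift) >> shift);
}

/* An address is canonical when sign extension leaves it unchanged. */
bool is_canonical(uint64_t addr, unsigned int va_bits)
{
	return sign_extend_va(addr, va_bits) == addr;
}
```

With 48 bits, 0x00007fffffffffff (top of userland) is canonical and 0x0000800000000000 is not, while the latter becomes canonical with 57 bits.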

Similarly, we have 4 more sanity checks :

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	swapgs_restore_regs_and_return_to_usermode

	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	swapgs_restore_regs_and_return_to_usermode

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	swapgs_restore_regs_and_return_to_usermode

Just like we ensured RCX had a proper value, we check that the saved registers (or flags) are in a state that SYSRET can restore; if they are not, we go to swapgs_restore_regs_and_return_to_usermode. The reason these checks are performed is to use that path only when necessary, since it is much slower than using SYSRET like we are going to.
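Put together, all the checks amount to a single predicate on the saved frame. Here is how I would model it in C, assuming 4-level paging for the canonical check; the saved_frame struct and the can_use_sysret name are hypothetical, and the constants are the x86_64 values of __USER_CS, __USER_DS and the two flag bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define USER_CS   0x33u       /* __USER_CS on x86_64 */
#define USER_DS   0x2bu       /* __USER_DS on x86_64 */
#define EFLAGS_TF 0x00000100u /* trap flag           */
#define EFLAGS_RF 0x00010000u /* resume flag         */

/* The saved registers that the checks look at. */
struct saved_frame {
	uint64_t r11, cx, ip, cs, flags, ss;
};

/* True when the fast SYSRET path can restore this frame. */
bool can_use_sysret(const struct saved_frame *f)
{
	/* Sign-extend from bit 47 (4-level paging assumed in this model). */
	uint64_t canonical_cx = (uint64_t)((int64_t)(f->cx << 16) >> 16);

	if (f->cx != f->ip)                    /* SYSRET loads RIP from RCX    */
		return false;
	if (canonical_cx != f->ip)             /* return address is canonical  */
		return false;
	if (f->cs != USER_CS)                  /* CS must match SYSRET         */
		return false;
	if (f->r11 != f->flags)                /* SYSRET loads RFLAGS from R11 */
		return false;
	if (f->r11 & (EFLAGS_RF | EFLAGS_TF))  /* RF and TF need the slow path */
		return false;
	if (f->ss != USER_DS)                  /* SS must match SYSRET         */
		return false;
	return true;
}
```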

Assuming we are in a normal situation, all those checks should succeed.

If so, we proceed to restore all remaining registers that are saved on the stack

syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	POP_REGS pop_rdi=0 skip_r11rcx=1

	/*
	 * Now all regs are restored except RSP and RDI.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
	UNWIND_HINT_EMPTY

	pushq	RSP-RDI(%rdi)	/* RSP */
	pushq	(%rdi)		/* RDI */

First, we restore all remaining registers using the POP_REGS macro, which undoes what PUSH_AND_CLEAR_REGS did when entering, except for RDI and the registers we have already restored (RCX and R11).

Next, we save RSP into RDI; at the time of doing so, it points to the saved RDI register.

Afterwards, we switch to the trampoline stack: RSP is loaded from the sp0 field of the TSS, using the per-CPU mechanism we saw earlier. As far as I understand, this small stack remains accessible after the switch to the userland page tables, which is not the case for the regular kernel stack when KPTI is enabled.

Remember that RDI points to the saved RDI, so the saved userland RSP is found at offset RSP - RDI from it (both being offsets in struct pt_regs). The two values are pushed onto the trampoline stack, so that they can be popped once the page tables have been switched.

At last, we have this code :

	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

	popq	%rdi
	popq	%rsp
	swapgs
	sysretq

STACKLEAK_ERASE_NOCLOBBER is a macro that erases the used part of the kernel stack to avoid leaking stale data, and we restore the userland page table using SWITCH_TO_USER_CR3_STACK.

Finally, we restore the userland RDI and RSP registers, change the gs base register again with SWAPGS, and issue a SYSRETQ which does the opposite of SYSCALL !

According to the documentation from the manual, the SYSRET does the following :

SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11.

After that, we are back to our userland program ! This also explains why, after a SYSCALL, we will find that the RCX register contains the address of the instruction that follows the SYSCALL (and R11 contains the RFLAGS).

Conclusion

That was a nice learning experience to see what happens behind the SYSCALL instruction, which is meant to be a black box from the userland perspective. However, there are other ways to issue “system calls”, notably using the int 0x80 instruction (even in 64-bit processes), and the path that the kernel takes is rather different, since it is an interrupt ! I might add another article another day !