Linux Kernel - Syscalls
I am writing this series of articles to explain, and learn along the way, how the Linux kernel works (some parts of it at least). I find it difficult to find a place to start learning how the kernel works, so I will be taking apart different functionalities to gain insight into how they work :D
I am by no means an expert, so feel free to reach out to correct anything incorrect or incomplete!
I will be focusing on version 5.14.14, the latest at the time of writing, and on the x86_64 architecture.
Syscalls
Learning low-level programming, the first time I encountered the kernel was when I learned about the syscall instruction. This instruction is a black box for userland programs: you set up some registers properly, issue the instruction, and magically, the state of your program has been modified according to what the syscall does. But how does it work behind the scenes?
In this post, we will see how syscalls are performed, from userland and back. On the x86_64 architecture, there are two ways to issue syscalls: the syscall instruction, and the legacy int 0x80, which exists for compatibility with 32-bit programs. There are differences between the two; for now I will focus on the former, and might add an article for the latter one day :)
The instruction
The first step is to see what the instruction actually does, and for this, we will have to look inside the Intel manual.
Let’s take it step by step :
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX)
SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR
What we can learn from this is that there is a Model Specific Register (MSR) named IA32_LSTAR, which should contain the address of our system-call handler. We will see what value it contains just after.
We also learn that the address at which we should return once the syscall is done is saved in RCX, which means that the previous value of this register is lost! (This also explains the value that you can find in the RCX register after a syscall, if you’ve ever wondered.) The same goes for RFLAGS, which overwrites the R11 register.
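To see this from userland, here is a minimal sketch, assuming x86_64 Linux and GCC or Clang inline assembly. The helper name raw_syscall0 is mine, not something from the kernel or libc. It issues a zero-argument syscall with the raw instruction and hands back whatever the CPU left in RCX and R11.

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid, only used to cross-check the result */

/*
 * Hypothetical helper: issue a zero-argument syscall with the raw SYSCALL
 * instruction and report the values found in RCX and R11 afterwards.
 */
static long raw_syscall0(long nr, unsigned long *rcx_out, unsigned long *r11_out)
{
    long ret;
    unsigned long rcx;
    register unsigned long r11 __asm__("r11");

    assert(rcx_out && r11_out);

    /* RAX carries the syscall number in and the return value out. */
    __asm__ volatile("syscall"
                     : "=a"(ret), "=c"(rcx), "=r"(r11)
                     : "0"(nr)
                     : "memory");

    *rcx_out = rcx;   /* address of the instruction following SYSCALL */
    *r11_out = r11;   /* RFLAGS as saved by SYSCALL */
    return ret;
}
```

Calling raw_syscall0(SYS_getpid, &rcx, &r11) should return the same value as getpid(), with rcx pointing just after the syscall instruction inside the helper.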
So, where is this LSTAR register set? This is done during the boot process, in arch/x86/kernel/cpu/common.c; the relevant line is:
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
wrmsrl is a function that ends up using the WRMSR instruction to write to a model specific register.
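WRMSR takes the MSR index in ECX and the 64-bit value split across EDX:EAX, so a helper like wrmsrl has to cut its argument in two halves. The sketch below only illustrates that split in plain C, since WRMSR itself is privileged and cannot run in userland. The struct and function names are mine, and the handler address used in the usage note is a made-up value.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_LSTAR 0xc0000082u  /* architectural index of IA32_LSTAR */

/* Register contents WRMSR expects (hypothetical helper type). */
struct wrmsr_operands {
    uint32_t ecx;  /* MSR index */
    uint32_t eax;  /* low 32 bits of the value */
    uint32_t edx;  /* high 32 bits of the value */
};

static struct wrmsr_operands wrmsr_operands_for(uint32_t msr, uint64_t value)
{
    struct wrmsr_operands ops = {
        .ecx = msr,
        .eax = (uint32_t)value,
        .edx = (uint32_t)(value >> 32),
    };

    /* Recombining both halves must give back the original value. */
    assert((((uint64_t)ops.edx << 32) | ops.eax) == value);
    return ops;
}
```

For instance, wrmsr_operands_for(MSR_LSTAR, 0xffffffff81e00000) gives eax = 0x81e00000 and edx = 0xffffffff.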
The entrypoint
Once the syscall instruction is issued, we end up at the entry_SYSCALL_64 symbol. This symbol is defined in the arch/x86/entry/entry_64.S file.
This is an assembly file; let’s take it step by step:
swapgs
/* tss.sp2 is scratch space. */
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
Getting a proper stack pointer
The first thing we want to do is save the user-supplied registers, which are to be preserved. The simplest idea would be to push them on a kernel stack; however, at this point RSP still holds the userland stack pointer, and we have no easy way to get a kernel one.
This is where swapgs comes into play: it exchanges the userland gs base register with the kernel one, which, on Linux, is a way to access per-CPU storage. Each CPU has its own copy of a section named .data..percpu, which is used to store per-CPU data. More information about it can be found here.
Inside this .data..percpu, we have the TSS, or Task State Segment. More information can be found on Wikipedia, but basically, it was historically used to store the state of the CPU registers, and in 64-bit mode it mainly holds stack pointers. Notably, we have sp0, sp1 and sp2. These are meant to store a stack pointer for each privilege level (ring). Since ring 2 is not used by Linux, TSS_sp2 is free, so it can be used as scratch space for saving RSP.
First, the code saves the userland RSP into PER_CPU_VAR(cpu_tss_rw + TSS_sp2). PER_CPU_VAR is a macro that prefixes its argument with the GS segment, so we end up storing it into [GS:cpu_tss_rw + TSS_sp2].
Skipping the next line for now, the following instruction loads the kernel stack pointer, which is stored in the per-CPU storage, into RSP. We now have a proper stack pointer! The value comes from cpu_current_top_of_stack, which is a macro that accesses TSS_sp1, where the kernel stack pointer is stored.
SWITCH_TO_KERNEL_CR3 is a macro that sets the CR3 register, which holds the address of the Page Global Directory (PGD), the top-level page table used for virtual address translation. So why would we need to change it? I believe this is linked to the Kernel page-table isolation (KPTI) protection, which separates the page tables for userland and kernelland, to mitigate the Meltdown vulnerability.
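As far as I can tell from arch/x86/entry/calling.h, with KPTI the kernel and user PGDs sit in two adjacent 4 KiB pages, so switching between them boils down to flipping a couple of bits in CR3. The sketch below models only that bit manipulation in plain C. The macro values mirror my reading of the kernel source, the function names are mine, and the real macros also deal with PCID flushing details that are left out here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT            12
#define PTI_USER_PGTABLE_MASK (1ull << PAGE_SHIFT)  /* selects the user half of the PGD pair */
#define PTI_USER_PCID_MASK    (1ull << 11)          /* marks the user PCID */
#define PTI_USER_MASKS        (PTI_USER_PGTABLE_MASK | PTI_USER_PCID_MASK)

/* Sketch of what SWITCH_TO_KERNEL_CR3 computes: clear both user bits. */
static uint64_t to_kernel_cr3(uint64_t cr3)
{
    return cr3 & ~PTI_USER_MASKS;
}

/* Sketch of the opposite direction, used when going back to userland. */
static uint64_t to_user_cr3(uint64_t cr3)
{
    /* A kernel CR3 value is expected to have both user bits clear. */
    assert((cr3 & PTI_USER_MASKS) == 0);
    return cr3 | PTI_USER_MASKS;
}
```

With a made-up kernel CR3 of 0x12340000, the user value would be 0x12341800, and to_kernel_cr3 brings it back to 0x12340000.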
Saving registers
Now that the kernel has a stack pointer, we can save the userland context: this comes just after.
SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
pushq %rax /* pt_regs->orig_ax */
PUSH_AND_CLEAR_REGS rax=$-ENOSYS
It creates a struct pt_regs on the newly set up stack, in order to save all the registers and restore them later on. The userland RSP is pushed from the per-CPU location (accessed through GS) where it was stored earlier.
This struct is basically all the registers in a specific order. PUSH_AND_CLEAR_REGS is a macro that pushes all the remaining registers that the syscall has to preserve, and clears them.
Also note that the flags and IP registers are pushed from R11 and RCX respectively, which makes sense according to what the SYSCALL instruction does.
In the struct pt_regs, there are two fields for the ax register: orig_ax is used to save the syscall number, and ax is used for the return value. For now, the return value in the struct is set to -ENOSYS, the value returned when the syscall number does not exist.
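To make the layout concrete, here is a sketch of the structure as I understand it on x86_64, written as a standalone C struct. The name pt_regs_sketch is mine; the real definition lives in the kernel headers and should be treated as the reference. Since the stack grows downwards, the last register pushed ends up at the lowest address, which is why the fields appear in reverse push order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the x86_64 register frame built on syscall entry. */
struct pt_regs_sketch {
    /* Pushed by PUSH_AND_CLEAR_REGS (r15 is pushed last). */
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8;
    uint64_t ax;       /* return value slot, preset to -ENOSYS */
    uint64_t cx, dx, si, di;
    /* Pushed explicitly by entry_SYSCALL_64. */
    uint64_t orig_ax;  /* syscall number */
    uint64_t ip;       /* from RCX */
    uint64_t cs;
    uint64_t flags;    /* from R11 */
    uint64_t sp;       /* userland RSP saved in TSS_sp2 */
    uint64_t ss;
};

/* 21 slots of 8 bytes each. */
static_assert(sizeof(struct pt_regs_sketch) == 21 * 8, "unexpected frame size");
```

With this layout, offsetof(struct pt_regs_sketch, orig_ax) is 120 and the frame is 168 bytes long.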
Next, we have the following lines
movq %rsp, %rdi
movslq %eax, %rsi
call do_syscall_64
We are setting up the arguments for a call to do_syscall_64, the first argument being a pointer to our new struct pt_regs, the second being the number of the syscall we are trying to call.
do_syscall_64 is a C function, defined in arch/x86/entry/common.c.
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();
if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
/* Invalid system call, but still a system call. */
regs->ax = __x64_sys_ni_syscall(regs);
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
The first function called, add_random_kstack_offset, is a hardening feature in the same spirit as KASLR: its goal is to add a random offset to the stack address, and this for each syscall. It uses GCC’s __builtin_alloca to shift the stack pointer by a random offset, and inline assembly to keep the compiler from optimizing the allocation away.
Next, syscall_enter_from_user_mode is called, which is simply a wrapper for __syscall_enter_from_user_work, defined in kernel/entry/common.c. From what I understand, this function is related to syscall tracing and filtering, for instance when using ptrace or seccomp, and simply returns the syscall number when nothing of the sort is taking place, which we will assume here.
Similarly, instrumentation_begin is called. It is an annotation marking the part of this noinstr function where instrumentation (such as tracing) is allowed, so we will ignore it.
Afterwards, do_syscall_x64 and do_syscall_x32 are called, which is where our syscall is dispatched to the correct handler! Before diving into how they work, I’ll first mention why we might call an x32 syscall.
The reason is that even if we issue a SYSCALL instruction in a 64-bit process, we can still reach the syscalls of the x32 ABI (a 64-bit ABI that uses 32-bit pointers), exposed for compatibility when the kernel is built with support for it! If you didn’t know, by issuing a syscall with RAX set to 0x40000000 + x, you can call these syscalls (x being the number of the specific syscall you are interested in)!
This is also the reason you’ll almost always see, at the beginning of seccomp rules, a line that looks like # A >= 0x40000000 ? dead : next, to ensure we can’t escape the filter using this x32 ABI.
Taking a glimpse at the beginning of do_syscall_x32, we can see this:
static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
{
/*
* Adjust the starting offset of the table, and convert numbers
* < __X32_SYSCALL_BIT to very high and thus out of range
* numbers for comparisons.
*/
unsigned int xnr = nr - __X32_SYSCALL_BIT;
where __X32_SYSCALL_BIT is defined using
#define __X32_SYSCALL_BIT 0x40000000
Afterwards, xnr is used as the index of the syscall in the table of x32 ones.
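The trick in that comment relies on unsigned wrap-around: subtracting the bit from a number that does not have it set yields a huge unsigned value, which then fails the range check. Here is a small userland sketch of the idea. The function name and the table size of 548 are assumptions of mine for illustration, not values taken from the kernel source quoted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define X32_SYSCALL_BIT        0x40000000u
#define X32_NR_SYSCALLS_SKETCH 548u  /* hypothetical table size, for illustration only */

/*
 * Returns true and stores the table index if nr designates an x32 syscall,
 * false otherwise (including plain 64-bit numbers and negative values).
 */
static bool x32_table_index(int nr, unsigned int *index_out)
{
    /* Numbers below the bit wrap around to very large unsigned values. */
    unsigned int xnr = (unsigned int)nr - X32_SYSCALL_BIT;

    assert(index_out != NULL);
    if (xnr < X32_NR_SYSCALLS_SKETCH) {
        *index_out = xnr;
        return true;
    }
    return false;
}
```

So 0x40000000 + 1 maps to index 1, while a plain 1 (a regular 64-bit syscall number) or -1 is rejected.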
But let’s focus on do_syscall_x64 for now :)
The dispatcher
This function, do_syscall_x64, is defined in arch/x86/entry/common.c. Its goal is simple: it has to choose the correct handler, if it exists, and call it, passing the struct pt_regs as an argument.
static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
/*
* Convert negative numbers to very high and thus out of range
* numbers for comparisons.
*/
unsigned int unr = nr;
if (likely(unr < NR_syscalls)) {
unr = array_index_nospec(unr, NR_syscalls);
regs->ax = sys_call_table[unr](regs);
return true;
}
return false;
}
First, we check that the syscall number is in range, by comparing it to NR_syscalls, a macro whose value is one more than the highest syscall number; in this version, it is 448.
array_index_nospec is a helper used to mitigate speculative execution attacks (Spectre v1): it clamps the index so that it stays within bounds even under speculation. For an in-range value it returns our syscall number unchanged, so we can safely assume that here.
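For the curious, the clamping can be done without any branch, by building a mask that is all ones when the index is in range and zero otherwise. The sketch below follows what I remember of the generic fallback in include/linux/nospec.h; x86 has its own assembly version, and the real helper also hides the index from the optimizer, which this userland copy does not attempt. It relies on arithmetic right shift of negative values, which GCC and Clang provide.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(long) * CHAR_BIT)

/* All ones if index < size, zero otherwise, computed without branching. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    /*
     * If index < size, neither index nor (size - 1 - index) has its top
     * bit set, so the complement is negative and the shift yields -1.
     * Otherwise one of them has the top bit set and the result is 0.
     */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_LONG_SKETCH - 1));
}

/* Hypothetical wrapper: in-range indexes pass through, others become 0. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
    assert(size > 0 && size <= (unsigned long)LONG_MAX);
    return index & index_mask_nospec(index, size);
}
```

With a size of 448, index 447 is returned unchanged while 448 is forced to 0.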
Then, the call to our handler is made! For this, the array sys_call_table is indexed by our syscall number, and the handler is called with our struct pt_regs as an argument. The return value of the handler is stored in our registers structure, and we are ready to start the return to userland!
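The same pattern can be reproduced in a few lines of userland C: a table of function pointers, an unsigned comparison that rejects negative numbers for free, and -ENOSYS as the fallback. Everything below (the two-entry table, the handlers, the reduced register struct) is made up for illustration, and the fallback is simplified compared to the real do_syscall_64, which treats nr == -1 specially.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for struct pt_regs: return value and first argument. */
struct regs_sketch { long ax; long di; };

typedef long (*handler_t)(const struct regs_sketch *);

static long demo_double(const struct regs_sketch *regs) { return regs->di * 2; }
static long demo_negate(const struct regs_sketch *regs) { return -regs->di; }

static const handler_t demo_table[] = { demo_double, demo_negate };
#define DEMO_NR_SYSCALLS (sizeof(demo_table) / sizeof(demo_table[0]))

/* Mirrors the shape of do_syscall_x64: true if a handler was called. */
static bool demo_dispatch(struct regs_sketch *regs, int nr)
{
    unsigned int unr = (unsigned int)nr;  /* negative numbers become huge */

    if (unr < DEMO_NR_SYSCALLS) {
        regs->ax = demo_table[unr](regs);
        return true;
    }
    return false;
}

/* Simplified caller: unknown numbers yield -ENOSYS. */
static void demo_do_syscall(struct regs_sketch *regs, int nr)
{
    assert(regs != NULL);
    if (!demo_dispatch(regs, nr))
        regs->ax = -ENOSYS;
}
```

With di set to 21, number 0 stores 42 in ax, number 1 stores -21, and any other number stores -ENOSYS.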
But before doing that, we will take a look at how sys_call_table is constructed. It is located in arch/x86/entry/syscall_64.c:
asmlinkage const sys_call_ptr_t sys_call_table[] = {
#include <asm/syscalls_64.h>
};
So actually, the content of the table is not really defined here. And if we look for the file syscalls_64.h in the source tree, we can’t find it! The reason is that this header file is generated at build time, using the scripts/syscalltbl.sh script.
This script parses the file arch/x86/entry/syscalls/syscall_64.tbl, which contains the list of our syscalls.
Looking at the start, we can see the usual suspects.
0 common read sys_read
1 common write sys_write
2 common open sys_open
...
13 64 rt_sigaction sys_rt_sigaction
...
512 x32 rt_sigaction compat_sys_rt_sigaction
513 x32 rt_sigreturn compat_sys_x32_rt_sigreturn
514 x32 ioctl compat_sys_ioctl
...
The second column specifies which ABI the syscall is part of. The syscalls_64.h header is built using the syscalls classified as either common or 64.
The list also references the symbols of the handlers; for instance, for the read syscall, the handler in the array that we’ll end up calling is derived from sys_read (on x86_64, the table actually points to a wrapper named __x64_sys_read, which receives the struct pt_regs and unpacks the arguments from it).
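Including a generated list inside an array initializer is a variant of the classic "X-macro" pattern: the same list is expanded several times with different macro definitions. Here is a self-contained sketch of that pattern. The list, the stub handlers and all demo_ names are mine, and the exact macros used by the kernel differ, so take it as an illustration of the technique rather than a copy of the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generated header: one entry per table line. */
#define DEMO_SYSCALLS(X) \
    X(0, read)           \
    X(1, write)          \
    X(2, open)

typedef long (*demo_handler_t)(void);

/* First expansion: define one stub handler per entry (it returns its number). */
#define DEFINE_STUB(nr, name) static long demo_sys_##name(void) { return nr; }
DEMO_SYSCALLS(DEFINE_STUB)

/* Second expansion: turn the same list into table entries. */
#define TABLE_ENTRY(nr, name) [nr] = demo_sys_##name,
static const demo_handler_t demo_sys_call_table[] = {
    DEMO_SYSCALLS(TABLE_ENTRY)
};

#define DEMO_NR_SYSCALLS \
    (sizeof(demo_sys_call_table) / sizeof(demo_sys_call_table[0]))

/* Call the stub registered for a given number. */
static long demo_call(size_t nr)
{
    assert(nr < DEMO_NR_SYSCALLS);
    return demo_sys_call_table[nr]();
}
```

Adding a line to DEMO_SYSCALLS is enough to get both a new stub and a new table slot, which is the whole point of generating the list from a single .tbl file.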
For the do_syscall_x32 function, the process is exactly the same, after subtracting __X32_SYSCALL_BIT. The only difference is that we are using another array, x32_sys_call_table, which is built from asm/syscalls_x32.h. This header is also generated using scripts/syscalltbl.sh, which again parses syscall_64.tbl (yes, this is not a typo), and extracts the syscalls classified as either common or x32. The x32-specific handlers are often prefixed with compat_.
Back to userland
Once the handler returns, we go back to do_syscall_x64, where the ax field of the registers saved on the stack is set to the return value of our handler, and the function returns true.
Next, we go back to do_syscall_64, and since we just returned true, we call instrumentation_end, the counterpart of the aforementioned instrumentation_begin, and go to syscall_exit_to_user_mode.
Finally, once that is done, the code returns to entry_SYSCALL_64, where there are several paths to return to userland. This is because there are multiple ways to get to this point in entry_SYSCALL_64. Let’s see the relevant assembly.
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* SYSRET requires RCX == RIP */
jne swapgs_restore_regs_and_return_to_usermode
Remembering that our saved registers are stored on the kernel stack, we load the saved values of RCX and RIP (the latter into R11). Here, RCX and RIP are macros that contain the offsets of these registers in a struct pt_regs.
Since we arrived in entry_SYSCALL_64 using a SYSCALL instruction, RCX contained the saved value of RIP, so unless something modified the saved registers in between, they are equal here.
Next, we have:
#ifdef CONFIG_X86_5LEVEL
ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
#else
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
#endif
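This is a sanity check to ensure that the return address in RCX is canonical, which means that its upper bits are all copies of the highest implemented address bit (bit 47 or bit 56). To do so, the code shifts RCX left and then arithmetically right, which replaces either the 16 or 7 top bits of RCX with copies of that bit, and compares the result with the original value kept in R11: if they differ, the address is not canonical, otherwise it is. This matters because, on Intel CPUs, SYSRET with a non-canonical RCX raises a fault while still in kernel mode.
(__VIRTUAL_MASK_SHIFT is a macro whose value depends on whether the kernel uses 4-level or 5-level page tables.)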
cmpq %rcx, %r11
jne swapgs_restore_regs_and_return_to_usermode
Why two possible values? This depends on whether the kernel uses 4-level or 5-level page tables; more information on the virtual memory layout can be found in Documentation/x86/x86_64/mm.rst.
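The shl/sar pair is easier to read in C. Here is a minimal sketch of the same check, assuming a compiler where right-shifting a negative signed value is an arithmetic shift (GCC and Clang both behave this way). The function name is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when sign-extending it from its highest
 * implemented bit (bit 47 for 48-bit, bit 56 for 57-bit addresses)
 * gives back the same value.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
    assert(vaddr_bits == 48 || vaddr_bits == 57);

    unsigned int shift = 64 - vaddr_bits;
    /* Same idea as "shl $shift; sar $shift" in the assembly. */
    uint64_t sign_extended = (uint64_t)((int64_t)(addr << shift) >> shift);

    return sign_extended == addr;
}
```

With 48-bit addresses, 0x00007fffffffffff and 0xffff800000000000 are canonical, while 0x0000800000000000 is not.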
Similarly, we have 4 more sanity checks:
cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne swapgs_restore_regs_and_return_to_usermode
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz swapgs_restore_regs_and_return_to_usermode
/* nothing to check for RSP */
cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne swapgs_restore_regs_and_return_to_usermode
Just like we ensured RCX had a proper value, we check that the other saved values (CS, RFLAGS and SS) are ones that SYSRET is able to restore; if they are not, we go to swapgs_restore_regs_and_return_to_usermode, which returns with IRET instead. That path is much slower than using SYSRET, so the point of these checks is to take it only when necessary.
Assuming we are in a normal situation, all those checks should succeed.
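Put together, the whole series of checks can be summarized as a single predicate. The sketch below is my own C restatement of the assembly, on a reduced register struct and assuming 4-level page tables (48-bit addresses). The selector and flag values are the usual x86_64 ones as far as I know, and the function names are mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define USER_CS   0x33u        /* 64-bit user code segment selector */
#define USER_DS   0x2bu        /* user data/stack segment selector */
#define EFLAGS_TF 0x00000100u  /* trap flag */
#define EFLAGS_RF 0x00010000u  /* resume flag */

/* Only the saved values that the checks look at. */
struct saved_regs { uint64_t r11, cx, ip, cs, flags, ss; };

static bool canonical48(uint64_t addr)
{
    return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
}

/* True if the fast SYSRET path can restore this state faithfully. */
static bool can_return_with_sysret(const struct saved_regs *r)
{
    assert(r != NULL);

    if (r->cx != r->ip)                        return false; /* RCX == RIP */
    if (!canonical48(r->ip))                   return false; /* canonical RIP */
    if (r->cs != USER_CS)                      return false; /* CS must match */
    if (r->r11 != r->flags)                    return false; /* R11 == RFLAGS */
    if (r->flags & (EFLAGS_RF | EFLAGS_TF))    return false; /* no RF or TF */
    if (r->ss != USER_DS)                      return false; /* SS must match */
    return true;
}
```

A frame left untouched since entry passes every test; one where a debugger set the trap flag, for example, would take the slower path.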
If so, we proceed to restore all remaining registers that are saved on the stack
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
POP_REGS pop_rdi=0 skip_r11rcx=1
/*
* Now all regs are restored except RSP and RDI.
* Save old stack pointer and switch to trampoline stack.
*/
movq %rsp, %rdi
movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
UNWIND_HINT_EMPTY
pushq RSP-RDI(%rdi) /* RSP */
pushq (%rdi) /* RDI */
First, we restore all remaining registers using the POP_REGS macro, which undoes what PUSH_AND_CLEAR_REGS did when entering, except for RDI and the registers we have already restored.
Next, we save RSP into RDI; at the time of doing so, it points to the saved RDI register.
Afterwards, we switch to the trampoline stack: RSP is loaded from TSS_sp0, using the per-CPU mechanism we saw earlier. As far as I understand, this small stack stays mapped in the userland page tables, which is needed because we are about to switch CR3 while still having two values left to pop.
Remember that RDI points to the saved RDI. Here, RSP and RDI are again offsets into struct pt_regs, so RSP-RDI(%rdi) is the location of the saved userland RSP. We copy both the userland RSP and RDI by pushing them onto the trampoline stack.
At last, we have this code:
STACKLEAK_ERASE_NOCLOBBER
SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
popq %rdi
popq %rsp
swapgs
sysretq
STACKLEAK_ERASE_NOCLOBBER is a macro that erases the used part of the kernel stack to avoid leaking stale data, and we restore the userland page table using SWITCH_TO_USER_CR3_STACK.
Finally, we restore the userland RDI and RSP registers, swap the gs base register again with SWAPGS, and issue a SYSRETQ, which does the opposite of SYSCALL!
According to the manual, SYSRET does the following:
SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11.
After that, we are back in our userland program! This also explains why, after a SYSCALL, we will find that the RCX register contains the address of the instruction that follows the SYSCALL (and R11 contains RFLAGS).
Conclusion
That was a nice learning experience to see what happens behind the SYSCALL instruction, which is meant to be a black box from the userland perspective. However, there are other ways to issue “system calls”, notably using the int 0x80 instruction (even in 64-bit processes), and the path that the kernel takes is rather different, since it is an interrupt! I might add another article another day!