OSDev.org https://forum.osdev.org/

The cost of a system call https://forum.osdev.org/viewtopic.php?f=1&t=25267
Page 1 of 3
Author: gerryg400 [ Fri May 04, 2012 7:42 pm ]
Post subject: The cost of a system call
I've been doing some testing this morning and thought someone might be interested in the results. My machine is a Core2 Quad at 2.83 GHz. I have a 'null' system call that I used to do the tests in long mode. It does the following:

i) begins in the C lib, passes 3 parameters to the kernel
ii) enters the kernel via syscall or int $0x20
iii) kernel saves all 16 GP regs
iv) uses swapgs to retrieve kernel stack, core id etc.
v) switches to kernel stack
vi) moves into kernel C code through my kernel system call mechanism and increments a counter
vii) reverses the above and returns to user mode

The scheduler is not called. Average times are as follows:

Code:
int $0x20/iretq  - 587 ns
syscall/sysret   - 449 ns
int $0x20/sysret - 506 ns
syscall/iretq    - 530 ns

What this means is that, per call:

Code:
syscall instead of int $0x20 saves 57 ns
sysret instead of iretq saves 81 ns
total saving 138 ns

The other thing this shows is that you can mix and match between the two mechanisms. You can use sysret to return from hardware interrupts to ring 3. I also tested using syscall from ring 1, and that works fine as long as you use iretq to return.
Author: bluemoon [ Sat May 05, 2012 4:38 am ]
Post subject: Re: The cost of a system call
Here is my result, tested with QEMU. The OS has only launched a single process (kthreads are idle priority so will not be switched, but the PIC timer will still fire, which should affect very little). The kernel and application are compiled with -O2. Note that my syscall will only preserve registers according to the AMD ABI, not all 16 like yours.

Code:
PID[1]: Hello from user application
Call    Start : 4378D42C  End : 4381FCF7  Elapsed : 928CB (600267 cycles)    Average : 3C (60 cycles)
Syscall Start : 4381FDF4  End : 43CFDD63  Elapsed : 4DDF6F (5103471 cycles)  Average : 1FE (510 cycles)

Related code:

Code:
// test call
__asm volatile ("rdtsc; rdtsc\n" : "=a"(cstart_lo), "=d"(cstart_hi));
for ( int i=0; i<10000; i++ ) { call_null(); }
__asm volatile ("rdtsc\n" : "=a"(cend_lo), "=d"(cend_hi));

// test syscall
__asm volatile ("rdtsc; rdtsc\n" : "=a"(start_lo), "=d"(start_hi));
for ( int i=0; i<10000; i++ ) { syscall_null(); }
__asm volatile ("rdtsc\n" : "=a"(end_lo), "=d"(end_hi));

userland interface:

Code:
call_null:
    ret

syscall_null:
    xor eax, eax
    syscall
    ret

syscall in kernel:

Code:
; Max 5 parameters: rdi rsi rdx r9 r8
_syscall_stub:
    cmp rsp, APPADDR_PROCESS_STACK
    jae .fault
    push rcx                ; ring3 rip
    push r11                ; rflags
    mov r11, qword syscall_table
    mov rcx, r9             ; 4th parameter
    cmp eax, 12
    jbe .1
    mov edi, eax
    xor eax, eax
.1:
    call qword [r11+rax*8]
    pop r11
    pop rcx
    db 0x48                 ; REX.W prefix, turns the following sysret into sysretq
    sysret

The null call is a C function with just return 0;
Author: rdos [ Sat May 05, 2012 5:29 am ]
Post subject: Re: The cost of a system call
gerryg400 wrote:
    i) begins in the C lib, passes 3 parameters to the kernel
    ii) enters the kernel via syscall or int $0x20
    iii) kernel saves all 16 GP regs
    iv) uses swapgs to retrieve kernel stack, core id etc.
    v) switches to kernel stack
    vi) moves into kernel C code through my kernel system call mechanism and increments a counter
    vii) reverses the above and returns to user mode

    The scheduler is not called. Average times are as follows:

    Code:
    int $0x20/iretq  - 587 ns
    syscall/sysret   - 449 ns
    int $0x20/sysret - 506 ns
    syscall/iretq    - 530 ns

This is a lot more than what I measured on this processor in 32-bit mode with call gates in RDOS. Results from the other thread:

Code:
near:     51.6 million calls per second
gate:     13.4 million calls per second
sysenter: 10.5 million calls per second

The call gate version takes about 75 ns. I haven't measured the sysenter/sysexit version yet (edit: just below 100 ns, and thus slower). And this is the full overhead, as there is nothing else involved in calling kernel functions (other than loading the appropriate registers for the call in case the function has parameters). On the kernel side, no registers are saved unless they are used.
Author: rdos [ Sat May 05, 2012 5:35 am ]
Post subject: Re: The cost of a system call
bluemoon wrote:
    Here is my result, test with QEMU.

It's not reliable to use QEMU for this kind of performance test; you must do it on real hardware. Additionally, you should not use idealised code. Rather, compile it and validate syscalls like you would in a production release of your OS/application.
Author: bluemoon [ Sat May 05, 2012 5:40 am ]
Post subject: Re: The cost of a system call
I just started porting my OS to 64-bit last week, and it only finally worked yesterday, so all I can do is run it in QEMU for now. Sure, someday I'll try it on real hardware. The numbers are for reference only; don't take them too seriously.
Author: bluemoon [ Sat May 05, 2012 11:06 am ]
Post subject: Re: The cost of a system call
Oops, that was a mistake. It should be:

Code:
__asm volatile ("xor eax, eax; cpuid; rdtsc" : "=a"(cstart_lo), "=d"(cstart_hi) :: "ebx", "ecx");

The idea is to make sure rdtsc is not executed out of order. The results then become:

Code:
PID[1]: Hello from user application
Call    Start : 45409DEE  End : 48B61737  Elapsed : 3757949 (58030409 cycles)    Average : 3A (58 cycles)
Syscall Start : 48B618EA  End : 61EB4055  Elapsed : 1935276B (422913899 cycles)  Average : 1A6 (422 cycles)
PID[1]: Hello again! counter=0
PID[1]: Hello again! counter=1
PID[1]: Hello again! counter=2
PID[1]: Hello again! counter=3
KMAIN : Clean zombie process: FFFFFFFF:8012A500
Author: turdus [ Sat May 05, 2012 11:43 am ]
Post subject: Re: The cost of a system call
gerryg400 wrote:
    iii) kernel saves all 16 GP regs

Good testing, but I think your results were influenced by this. One of the advantages of using syscall is that there is no need to save all the registers; you only have to save rcx and r11. That gives a considerable performance boost.

Here's how I do it. KMEM_userspace points to the current TCB, which happens to be the TSS as well. This is the prologue:

Code:
cli
if INTSYSCALL
    clsavectx
    sub qword [MEM_userspace+24h], KERNELSTACKSIZE
else
    mov qword [MEM_userspace+tcb.userrip], rcx
    pushf
    pop qword [MEM_userspace+tcb.userflags]
    //restore previous r11 from local variables stack (pushed on caller side)
    mov r11, qword [r15]
end if
//bound check
cmp qword [MEM_userspace+24h], MEM_userspace+tcb.acl_end
jb @f
sti
@@:

And this is the epilogue:

Code:
if INTSYSCALL
    clloadctx
    iretq
else
    mov r11, qword [MEM_userspace+tcb.userflags]
    mov rcx, qword [MEM_userspace+tcb.userrip]
    //force interrupt enable
    bts r11, 9
    sysretq
end if

Maybe yield is interesting too:

Code:
if INTSYSCALL
    //no clloadctx, we want registers changed
    add rsp, 16*8
    add qword [MEM_userspace+24h], KERNELSTACKSIZE
    clsavectx
    //switch page tables and refresh cr3
    call sys.arch.thread.thswitch
    clloadctx
else
    int SCHEDTMR_INT
end if

Hope it was useful.
Author: gerryg400 [ Sat May 05, 2012 2:46 pm ]
Post subject: Re: The cost of a system call
I understand the comments, but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) than the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.

As turdus points out, there is another saving with syscall/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well?
Author: Brendan [ Sun May 06, 2012 9:16 am ]
Post subject: Re: The cost of a system call
Hi,

gerryg400 wrote:
    I understand the comments but the purpose of the test was to compare "syscall/sysret" to "int/iretq". Most documentation tells us that the former pair is "4 times quicker" (or something similar) to the latter. I've always felt that this is a useless way of comparing the instructions unless you know how much the rest of the system call costs.

I've always thought that, because syscall/sysret doesn't do some things that are likely to be necessary (like switching ESP to a kernel stack), it isn't directly comparable to software interrupts or call gates (or SYSENTER), because a kernel typically needs to add more instructions to a syscall handler that wouldn't have been necessary for software interrupts or call gates.

For an extreme example: because ESP/RSP isn't switched and the CPU doesn't push anything on the stack while at CPL=3, user space code could do "mov rsp, SOMEWHERE_IN_KERNEL_SPACE" and then "SYSCALL" and trick the kernel into trashing itself or modifying kernel data. To guard against that, the kernel has to save RSP somewhere and load RSP with a "known good" value before anything is pushed on the stack (either by the SYSCALL handler itself or by the CPU if an NMI or machine check exception occurs).

Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fail for nesting). For the worst case, you'd need to deal with malicious user space code that does something like this:

Code:
mov eax,0
mov ds,eax
mov es,eax
mov fs,eax
mov gs,eax
mov esp,SOMEWHERE_IN_KERNEL_SPACE
syscall

gerryg400 wrote:
    As turdus points out there is another saving with sys call/sysret (i.e. not really needing to save the GP regs on a syscall). But surely this is true for the int/iretq situation as well ?

Yes. For a fair comparison that isn't overly affected by OS design, you'd want to compare:
I'd also suggest that the caller's code size be taken into account. SYSCALL and "INT n" both cost 2 bytes. For SYSENTER, the caller needs to store "return EIP" and "return ESP" somewhere (likely EDX and ECX), so even though SYSENTER is only 2 bytes itself, it's probably going to cost 6 or more bytes. For 32-bit code, call gates are going to cost a minimum of 6 bytes (using a 16-bit address size override prefix to avoid the full 32-bit offset that's ignored anyway).

I'd expect that SYSENTER would end up being the winner for performance (for frequently executed pieces of code), and SYSCALL and software interrupts would tie for code size (for infrequently used code).

Cheers,

Brendan
Author: bluemoon [ Sun May 06, 2012 10:00 am ]
Post subject: Re: The cost of a system call
Brendan wrote:
    For worst case, you'd need to deal with malicious user space code that does something like this:

    Code:
    mov eax,0
    mov ds,eax
    mov es,eax
    mov fs,eax
    mov gs,eax
    mov esp,SOMEWHERE_IN_KERNEL_SPACE
    syscall

If I understand correctly, in long mode (which is required by the syscall instruction) ds and es are practically ignored. I do the above in my code and it affects nothing. I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application's legal address range, and that there is enough room, and return failure for the syscall or abort the process.

The syscall handler can reuse the application's user stack just fine, while making sure not to leave sensitive data there - at that point you may still switch stacks.
Author: Brendan [ Sun May 06, 2012 11:25 am ]
Post subject: Re: The cost of a system call
Hi,

bluemoon wrote:
    If I understand correctly, in long mode (hence required by syscall instruction) ds, es are practically ignored. I do the above in my code and it affect nothing.

The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE").

bluemoon wrote:
    I still need to check cmp rsp, APPADDR_PROCESS_STACK, where APPADDR_PROCESS_STACK is the application legal address range, and have enough room, and return fail for the syscall or abort the process. syscall handler can reuse the application's user stack just fine, while make sure for not leaving sensitive data there - at that time you may still switch stack.

If the syscall handler reuses the application's user stack, be very careful with your page fault handler. If the syscall handler's RSP (inherited from user space) ends up pointing to a "not present" page (either because that's where the caller left it, or because the kernel pushed enough on the stack to cross from a present page into a not-present page), then the CPU won't try to switch to a different stack when trying to start the page fault exception handler (no privilege level transition) and will generate a double fault. To avoid that you'd probably need to use IST for the page fault handler (and ensure that page faults never nest), or use IST for the double fault handler.

Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

Code:
mov rsp,0x00000008
syscall

Cheers,

Brendan
Author: bluemoon [ Sun May 06, 2012 11:31 am ]
Post subject: Re: The cost of a system call
Brendan wrote:
    The example was for a 32-bit protected mode kernel (otherwise it would've been "mov rsp,SOMEWHERE_IN_KERNEL_SPACE").

According to the Intel manual, syscall in 32-bit or compatibility mode triggers #UD.

Brendan wrote:
    Also, "cmp rsp, APPADDR_PROCESS_STACK" isn't enough. Consider:

    Code:
    mov rsp,0x00000008
    syscall

That's why I said "is the application legal address range, and have enough room".

edit: I did an experiment with this:

Code:
syscall_null:
    xor eax, eax
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    mov rbx, rsp
    mov rsp, 0x00000008
    syscall
    mov rsp, rbx
    ret

And this is caught by #PF within the syscall handler, which gives me a chance to terminate the abnormal process.

Code:
INT0E : #PF Page Fault Exception. RIP:FFFFFFFF:80104AE9 CODE:2 ADDR:00000000:00000000
      : PML4[0] PDPT[0] PD[0] PT[0]
#PF   : Access to unallocated memory. CODE: 2
      : ADDR: 00000000:00000000 PTE[0]: 00000000:00000000

By the way, you are correct on the #PF issue, which I overlooked.
Author: rdos [ Sun May 06, 2012 11:59 am ]
Post subject: Re: The cost of a system call
Brendan wrote:
    Note: To be honest, I'm not even sure if it's possible to use SYSCALL in a "guaranteed 100.0000% safe" way (as you can't prevent NMI or machine check before the SYSCALL handler switches to a safe stack, and task switching and IST fails for nesting).

I don't remember the parameters for SYSCALL, but at least for SYSENTER it is possible to make 100% certain that an application cannot modify kernel data or malfunction because of an invalid kernel stack. I do it like this:

1. The kernel ESP MSR is loaded with the current thread's stack offset from the TSS (by taking base + size of SS0) whenever a new thread is scheduled. This takes care of the nesting issue, as ESP is not loaded manually in the kernel.

2. When using the stack reference from the application, address it with the ds or es register, and let ds and es for applications only cover the application address space. This will make the sysenter entry-point code protection fault if a stack reference to kernel space is provided.

In long mode this doesn't work (limits are not used), and so the pointer needs to be checked in software.
Author: Brynet-Inc [ Sun May 06, 2012 2:01 pm ]
Post subject: Re: The cost of a system call
bluemoon wrote:
    according to intel manual, syscall in 32-bit or compatibility mode trigger #UD.

SYSCALL/SYSRET are from AMD, which does support them in 32-bit mode. Intel only supports them in 64-bit mode.
Author: Cognition [ Sun May 06, 2012 2:25 pm ]
Post subject: Re: The cost of a system call
Generally, if you were to use SYSCALL in long mode you'd simply swapgs and load in a known good pointer.

Code:
user_enter_syscall64:
    swapgs
    mov rax, [gs:KSTACK_OFFSET]
    mov [gs:USTACK_OFFSET], rsp
    mov rsp, rax
    ...
    mov rsp, [gs:USTACK_OFFSET]
    swapgs
    sysret

This is making the assumption that you can at least clobber RAX initially, as you'll probably return some value in it later. You could also do similar things for protected mode.

Code:
user_enter_syscall32:
    mov ax, PROC_SPECIFIC_DATA_SEG
    mov gs, ax
    mov eax, [gs:KSTACK_OFFSET]
    mov [ss:eax+4], esp
    mov esp, eax
    ...
    pop gs
    pop esp
    sysret

Here the user space GS value is assumed to be determinable from some other structure (thread info, for example), which should work out since it's usually used for thread-specific data anyway.

To Brendan's point about NMIs, it'd be a mess if you aren't using a task gate/IST for them. AFAIK Linux deals with NMI nesting by using task gates or ISTs and doing some extensive checking in software to determine if an NMI nested within the NMI handler code itself.