Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing SYSRETs #13

Open
shlomopongratz opened this issue Aug 23, 2017 · 15 comments
Open

Missing SYSRETs #13

shlomopongratz opened this issue Aug 23, 2017 · 15 comments

Comments

@shlomopongratz
Copy link

Hi,

I was counting the number of SYSCALLs and SYSRETs and I see that that from time to time SYSRET is missing and I assume that the depth of the stack increases.
Is it possible that the kernel module sometimes fails to capture a SYSRET or that it is ignored from some reason?

@Wenzel
Copy link
Member

Wenzel commented Aug 23, 2017

hi @shlomopongartz ,

how did you make the counts ?

Yes, if an exit syscall is missing, the corresponding stack will continue to grow at the next syscall entering in the kernel.
https://github.com/KVM-VMI/nitro/blob/master/nitro/backend.py#L49

Is it possible that the kernel module sometimes fails to capture a SYSRET or that it is ignored from some reason?

I don't see why such thing would happen ?
the kernel is configured to report every possible event, and wait for the userland to confirm that the event has been processed before continuing the execution.

Let's investigate this issue, i want to make sure that the kernel code is behaving as it should be.

@shlomopongratz
Copy link
Author

Just added a counters in the if statement "if event.direction == SyscallDirection.exit:" and in the corresponding "else:" statement. and print the balance.

@shlomopongratz
Copy link
Author

shlomopongratz commented Aug 23, 2017

Hi,
From code reading of arch/x86/kvm/emulate.c::{em_syscall and em_sysret} I learn that in em_syscall event is marked present at the beginning of the routine regardless if the rutine returns before the end where as in em_sysret it is done at the end and the routine may return before the event could be marked present.
So ether we are missing systret or we have extra syscall.
S.P.

@Wenzel
Copy link
Member

Wenzel commented Aug 23, 2017

I monitored the enter and exit direction of the syscalls in backend.py:process_event, and here are my results after some time:

{
    "exit": 132515,
    "enter": 132629
}

So we have more exits than enter, i think you are right @shlomopongartz !

If we take a look at emulate.c:em_sysret:

static int em_sysret(struct x86_emulate_ctxt *ctxt)
{
	const struct x86_emulate_ops *ops = ctxt->ops;
	struct desc_struct cs, ss;
	u64 msr_data, rcx;
	u16 cs_sel, ss_sel;
	u64 efer = 0;
	struct kvm_vcpu *vcpu = container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt);
	/* syscall is not available in real mode */
	if (ctxt->mode == X86EMUL_MODE_REAL ||
	    ctxt->mode == X86EMUL_MODE_VM86)
		return emulate_ud(ctxt);
	if (!(em_syscall_is_enabled(ctxt)))
		return emulate_ud(ctxt);
	
	if(ctxt->ops->cpl(ctxt) != 0){
		return emulate_gp(ctxt,0);
	}
	
	//check if RCX is in canonical form
	rcx = reg_read(ctxt, VCPU_REGS_RCX);
	if(( (rcx & 0xFFFF800000000000) != 0xFFFF800000000000) &&
	  ( (rcx & 0x00007FFFFFFFFFFF) != rcx)){
		return emulate_gp(ctxt,0);
	}
	ops->get_msr(ctxt, MSR_EFER, &efer);
	setup_syscalls_segments(ctxt, &cs, &ss);
	if (!(efer & EFER_SCE) && !nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL))
		return emulate_ud(ctxt);
	ops->get_msr(ctxt, MSR_STAR, &msr_data);
	msr_data >>= 48;
	
	//setup code segment, atleast what is left to do.  
	//setup_syscalls_segments does most of the work for us
	if (ctxt->mode == X86EMUL_MODE_PROT64){ //if longmode
	  cs_sel = (u16)((msr_data + 0x10) | 0x3);
	  cs.l = 1;
	  cs.d = 0;
	}
	else{
	  cs_sel = (u16)(msr_data | 0x3);
	  cs.l = 0;
	  cs.d = 1;
	}
	cs.dpl = 0x3;
	
	//setup stack segment, atleast what is left to do.  
	//setup_syscalls_segments does most of the work for us
	ss_sel = (u16)((msr_data + 0x8) | 0x3);
	ss.dpl = 0x3;
	
	ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
	ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
	
	ctxt->eflags = (reg_read(ctxt, VCPU_REGS_R11) & 0x3c7fd7) | 0x2;
	ctxt->_eip = reg_read(ctxt, VCPU_REGS_RCX);
	
	if(nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL)){
		vcpu->nitro.event.present = true;
		vcpu->nitro.event.type = SYSCALL;
		vcpu->nitro.event.direction = EXIT;
		kvm_arch_vcpu_ioctl_get_regs(vcpu, &(vcpu->nitro.event.regs));
		kvm_arch_vcpu_ioctl_get_sregs(vcpu, &(vcpu->nitro.event.sregs));
	}
	return X86EMUL_CONTINUE;
}

The emulation has 5 occasions to fail and return before setting the event as present.

compared to em_syscall:

static int em_syscall(struct x86_emulate_ctxt *ctxt)
{
	const struct x86_emulate_ops *ops = ctxt->ops;
	struct desc_struct cs, ss;
	u64 msr_data;
	u16 cs_sel, ss_sel;
	u64 efer = 0;
	struct kvm_vcpu *vcpu = container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt);
	if(nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL)){
		vcpu->nitro.event.present = true;
		vcpu->nitro.event.type = SYSCALL;
		vcpu->nitro.event.direction = ENTER;
		kvm_arch_vcpu_ioctl_get_regs(vcpu, &(vcpu->nitro.event.regs));
		kvm_arch_vcpu_ioctl_get_sregs(vcpu, &(vcpu->nitro.event.sregs));
	}
	/* syscall is not available in real mode */
	if (ctxt->mode == X86EMUL_MODE_REAL ||
	    ctxt->mode == X86EMUL_MODE_VM86)
		return emulate_ud(ctxt);
	if (!(em_syscall_is_enabled(ctxt)))
		return emulate_ud(ctxt);
	ops->get_msr(ctxt, MSR_EFER, &efer);
	setup_syscalls_segments(ctxt, &cs, &ss);
	if (!(efer & EFER_SCE) && !nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL))
		return emulate_ud(ctxt);
	
	ops->get_msr(ctxt, MSR_STAR, &msr_data);
	msr_data >>= 32;
	cs_sel = (u16)(msr_data & 0xfffc);
	ss_sel = (u16)(msr_data + 8);
	if (efer & EFER_LMA) {
		cs.d = 0;
		cs.l = 1;
	}
	ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
	ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
	*reg_write(ctxt, VCPU_REGS_RCX) = ctxt->_eip;
	if (efer & EFER_LMA) {
#ifdef CONFIG_X86_64
		*reg_write(ctxt, VCPU_REGS_R11) = ctxt->eflags;
		ops->get_msr(ctxt,
			     ctxt->mode == X86EMUL_MODE_PROT64 ?
			     MSR_LSTAR : MSR_CSTAR, &msr_data);
		ctxt->_eip = msr_data;
		ops->get_msr(ctxt, MSR_SYSCALL_MASK, &msr_data);
		ctxt->eflags &= ~msr_data;
		ctxt->eflags |= X86_EFLAGS_FIXED;
#endif
	} else {
		/* legacy mode */
		ops->get_msr(ctxt, MSR_STAR, &msr_data);
		ctxt->_eip = (u32)msr_data;
		ctxt->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
	}
	
	return X86EMUL_CONTINUE;
}

I think the fix here is that we should move the code which configures the event from the beginning of the function to the end of em_syscall (and em_sysenter too).

I will try this solution and post the results !

@Wenzel
Copy link
Member

Wenzel commented Aug 23, 2017

Actually it's the reverse, we should move the code that configures the event in em_sysret and em_sysexit at the beginning of the function, so that we are not missing any syscall.

@shlomopongratz
Copy link
Author

But if you put the code at the beginning of the sysret than this is before set_segment is called.
Maybe you should just mark the present at the end of the syscall (but take the registers at the beginning, before they are changed).
So we will record only syscall with successful emulation.

@Wenzel
Copy link
Member

Wenzel commented Aug 23, 2017

So, if we want to monitor all syscall attempt (good or bad), we have to place our event at the beginning of the function, and if we want only the successful ones, place it at the end like you said.

I implemented your solution on a new branch.
here is the commit: KVM-VMI/kvm@d5fe8f5

The results are better, but we are still missing some exits !

{
    "exit": 64267,
    "enter": 64314
}

@shlomopongratz
Copy link
Author

So it seems that for some successful syscalls there are failed sysrets.
I just wonder if we can trust the values stored in the registers if we record at the beginning of sysret.
Maybe we should record all events and add a completion flag that wll be set only at the end so we will know if we can trust the values stored in the registers.

@shlomopongratz
Copy link
Author

The trust should also go for the syscall.

@Wenzel
Copy link
Member

Wenzel commented Aug 23, 2017

If i place the event at the beginning :

{
    "enter": 64289,
    "exit": 64198
}

Almost the same results

@shlomopongratz
Copy link
Author

This is odd, I wonder if the number of times em_syscall and em_sysret are the same.
The OS must return to user before issuing a new system call so maybe sometimes the emulation is not called. So maybe we are bailing out in x86_decode_insn or before that.

@shlomopongratz
Copy link
Author

shlomopongratz commented Aug 24, 2017

Hi,
I added printk to em_syscall and em_sysret and found out the the number of times em_syscall is called is greater then the number of times em_sysret is called.
This imbalance can cause two issues, one is syscall stack overflow and the other is wrong matching between calls and returns, assume one wants to process OUT syscall parameters only on return and he/she may procces the wrong parameters.

I think that maybe some of the syscalls are not really syscalls, I used the value in the GS register which holds the pointer to the Thread Information Block in 64 bit (FS in 32 bit guests) and tried to read it with vmi_read_addr_va and sometimes it failed! which I think may imply that it wasn't a real syscall.

S.P.

@shlomopongratz
Copy link
Author

Hi,

Just found out that all missing SYSRET belong to calls to NtContinue!
That is I see SYSCALL with syscall number that matches NtContinue without a return.
Can someone explain this to me.

S.P.

@Wenzel
Copy link
Member

Wenzel commented Oct 9, 2017

Hi,

I think i found why.

After a bit of googling, i saw this comment:
http://www.abfl.org.in/Aug-08/interrupts-return-quarry-iretq/

However, NtContinue uses iretq to return to usermode, not sysret.

We have to change our logic, and not systematically associate a syscall with a sysret.

@shlomopongratz
Copy link
Author

Hi,
Funny but the link took me to the web page of "GME mining equipment production base,"
I think there is another issue regarding the association of syscall and sysret. Currently Nitio uses stack to find the syscall number for the sysret. This doesn't take scheduling into account, assume one thread calls some blocking API e.g. NtWaitForSingleObject than the scheduler may let another thread to run which also calls a blocking API then the first thread resume execution and the first syscall returns.
What we need is to map from the combination of the process and the thread to the system call number MAP{CR3x{fs|gs}} -> syscall.

Best regards

S.P.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants