Commit Graph

20924 Commits

Author SHA1 Message Date
Nadav Amit
3dc4bc4f6b KVM: x86: JMP/CALL using call- or task-gate causes exception
The KVM emulator does not emulate JMP and CALL that target a call gate or a
task gate.  This patch does not try to implement these scenario as they are
presumably rare; yet it returns X86EMUL_UNHANDLEABLE error in such cases
instead of generating an exception.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:05 +01:00
Nadav Amit
16bebefe29 KVM: x86: fnstcw and fnstsw may cause spurious exception
Since the operand size of fnstcw and fnstsw is updated during the execution,
the emulation may cause spurious exceptions as it reads the memory beforehand.

Marking these instructions as Mov (since the previous value is ignored) and
DstMem16 to simplify the setting of operand size.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:05 +01:00
Nadav Amit
3313bc4ee8 KVM: x86: pop sreg accesses only 2 bytes
Although pop sreg updates RSP according to the operand size, only 2 bytes are
read.  The current behavior may result in incorrect #GP or #PF exceptions.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:04 +01:00
Paolo Bonzini
fa4a2c080e KVM: x86: mmu: replace assertions with MMU_WARN_ON, a conditional WARN_ON
This makes the direction of the conditions consistent with code that
is already using WARN_ON.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:04 +01:00
Paolo Bonzini
4c1a50de92 KVM: x86: mmu: remove ASSERT(vcpu)
Because ASSERT is just a printk, these would oops right away.
The assertion thus hardly adds anything.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:03 +01:00
Paolo Bonzini
ad896af0b5 KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and kvm_init_shadow_ept_mmu
The initialization function in mmu.c can always use walk_mmu, which
is known to be vcpu->arch.mmu.  Only init_kvm_nested_mmu is used to
initialize vcpu->arch.nested_mmu.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:02 +01:00
Paolo Bonzini
e0c6db3e22 KVM: x86: mmu: do not use return to tail-call functions that return void
This is, pedantically, not valid C.  It also looks weird.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:02 +01:00
Marcelo Tosatti
6c19b7538f KVM: x86: add tracepoint to wait_lapic_expire
Add tracepoint to wait_lapic_expire.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remind reader if early or late. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:01 +01:00
Marcelo Tosatti
d0659d946b KVM: x86: add option to advance tscdeadline hrtimer expiration
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.

This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).

Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.

Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:47:30 +01:00
Marcelo Tosatti
7c6a98dfa1 KVM: x86: add method to test PIR bitmap vector
kvm_x86_ops->test_posted_interrupt() returns true/false depending
whether 'vector' is set.

Next patch makes use of this interface.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:18 +01:00
Tiejun Chen
b4eef9b36d kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv
In most cases calling hwapic_isr_update(), we always check if
kvm_apic_vid_enabled() == 1, but actually,
kvm_apic_vid_enabled()
    -> kvm_x86_ops->vm_has_apicv()
        -> vmx_vm_has_apicv() or '0' in svm case
            -> return enable_apicv && irqchip_in_kernel(kvm)

So its a little cost to recall vmx_vm_has_apicv() inside
hwapic_isr_update(), here just NULL out hwapic_isr_update() in
case of !enable_apicv inside hardware_setup() then make all
related stuffs follow this. Note we don't check this under that
condition of irqchip_in_kernel() since we should make sure
definitely any caller don't work  without in-kernel irqchip.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:17 +01:00
Nicholas Krause
5ff22e7ebf KVM: x86: Remove FIXMEs in emulate.c for the function,task_switch_32
Remove FIXME comments about needing fault addresses to be returned.  These
are propaagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:17 +01:00
Eugene Korenevsky
19d5f10b3a KVM: nVMX: consult PFEC_MASK and PFEC_MATCH when generating #PF VM-exit
When generating #PF VM-exit, check equality:
(PFEC & PFEC_MASK) == PFEC_MATCH
If there is equality, the 14 bit of exception bitmap is used to take decision
about generating #PF VM-exit. If there is inequality, inverted 14 bit is used.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:16 +01:00
Eugene Korenevsky
e9ac033e6b KVM: nVMX: Improve nested msr switch checking
This patch improve checks required by Intel Software Developer Manual.
 - SMM MSRs are not allowed.
 - microcode MSRs are not allowed.
 - check x2apic MSRs only when LAPIC is in x2apic mode.
 - MSR switch areas must be aligned to 16 bytes.
 - address of first and last byte in MSR switch areas should not set any bits
   beyond the processor's physical-address width.

Also it adds warning messages on failures during MSR switch. These messages
are useful for people who debug their VMMs in nVMX.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:15 +01:00
Wincy Van
ff651cb613 KVM: nVMX: Add nested msr load/restore algorithm
Several hypervisors need MSR auto load/restore feature.
We read MSRs from VM-entry MSR load area which specified by L1,
and load them via kvm_set_msr in the nested entry.
When nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1`s MSR store area. After this, we read MSRs from VM-exit
MSR load area, and load them via kvm_set_msr.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:14 +01:00
Linus Torvalds
716c13a817 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes a build problem with sha-mb with old toolchains and an
  implementation bug in the ctr(aes)/by8 branch of aesni-intel that's
  enabled when AVX is available"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: sha-mb - Add avx2_supported check.
  crypto: aesni - fix "by8" variant for 128 bit keys
2015-01-08 11:33:51 -08:00
Vitaly Kuznetsov
7be0772d19 x86/xen: avoid freeing static 'name' when kasprintf() fails
In case kasprintf() fails in xen_setup_timer() we assign name to the
static string "<timer kasprintf failed>". We, however, don't check
that fact before issuing kfree() in xen_teardown_timer(), kernel is
supposed to crash with 'kernel BUG at mm/slub.c:3341!'

Solve the issue by making name a fixed length string inside struct
xen_clock_event_device. 16 bytes should be enough.

Suggested-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-08 13:55:25 +00:00
David Vrabel
a97dae1a2e x86/xen: add extra memory for remapped frames during setup
If the non-RAM regions in the e820 memory map are larger than the size
of the initial balloon, a BUG was triggered as the frames are remaped
beyond the limit of the linear p2m.  The frames are remapped into the
initial balloon area (xen_extra_mem) but not enough of this is
available.

Ensure enough extra memory regions are added for these remapped
frames.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
2015-01-08 13:52:37 +00:00
David Vrabel
bc7142cf79 x86/xen: don't count how many PFNs are identity mapped
This accounting is just used to print a diagnostic message that isn't
very useful.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
2015-01-08 13:52:36 +00:00
Boris Ostrovsky
701a261ad6 x86/xen: Free bootmem in free_p2m_page() during early boot
With recent changes in p2m we now have legitimate cases when
p2m memory needs to be freed during early boot (i.e. before
slab is initialized).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-01-08 13:52:36 +00:00
Andy Shevchenko
d58cf5ff65 spi: dw-pci: describe Intel MID controllers better
There are more that one SPI controller on the Intel MID boards. This patch
describes the status and IDs of them. From now on we also have to care about
bus number that must be unique per host.

According to the specification the SPI1 has 5 bits for chip selects and SPI2
only 2 bits. The patch makes it depend to PCI ID.

The first controller (SPI1) is DMA capable, meanwhile SPI2 can share same
channels (via software switch) such functionality is not in the scope of this
patch. Thus, attempt to init DMA for SPI2 will always fail for now.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
2015-01-07 18:01:06 +00:00
Luck, Tony
d4812e169d x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
We now switch to the kernel stack when a machine check interrupts
during user mode.  This means that we can perform recovery actions
in the tail of do_machine_check()

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-07 07:47:42 -08:00
Rafael J. Wysocki
794c3a0a93 Merge branches 'acpi-pm', 'acpi-processor' and 'acpi-video'
* acpi-pm:
  ACPI / PM: Fix PM initialization for devices that are not present

* acpi-processor:
  ACPI / processor: Rename acpi_(un)map_lsapic() to acpi_(un)map_cpu()
  ACPI / processor: Convert apic_id to phys_id to make it arch agnostic

* acpi-video:
  ACPI / video: Add disable_native_backlight quirk for Dell XPS15 L521X
2015-01-06 23:35:43 +01:00
Pranith Kumar
83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
Hanjun Guo
d02dc27db0 ACPI / processor: Rename acpi_(un)map_lsapic() to acpi_(un)map_cpu()
acpi_map_lsapic() will allocate a logical CPU number and map it to
physical CPU id (such as APIC id) for the hot-added CPU, it will also
do some mapping for NUMA node id and etc, acpi_unmap_lsapic() will
do the reverse.

We can see that the name of the function is a little bit confusing and
arch (IA64) dependent so rename them as acpi_(un)map_cpu() to make arch
agnostic and explicit.

Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-01-05 23:34:26 +01:00
Vinson Lee
0b8c960cf6 crypto: sha-mb - Add avx2_supported check.
This patch fixes this allyesconfig target build error with older
binutils.

  LD      arch/x86/crypto/built-in.o
ld: arch/x86/crypto/sha-mb/built-in.o: No such file: No such file or directory

Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Vinson Lee <vlee@twitter.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-01-05 21:35:02 +11:00
Mathias Krause
0b1e95b2fa crypto: aesni - fix "by8" variant for 128 bit keys
The "by8" counter mode optimization is broken for 128 bit keys with
input data longer than 128 bytes. It uses the wrong key material for
en- and decryption.

The key registers xkey0, xkey4, xkey8 and xkey12 need to be preserved
in case we're handling more than 128 bytes of input data -- they won't
get reloaded after the initial load. They must therefore be (a) loaded
on the first iteration and (b) be preserved for the latter ones. The
implementation for 128 bit keys does not comply with (a) nor (b).

Fix this by bringing the implementation back to its original source
and correctly load the key registers and preserve their values by
*not* re-using the registers for other purposes.

Kudos to James for reporting the issue and providing a test case
showing the discrepancies.

Reported-by: James Yonan <james@openvpn.net>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.18
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-01-05 21:35:02 +11:00
Daniel Borkmann
b485342bd7 x86, um: actually mark system call tables readonly
Commit a074335a37 ("x86, um: Mark system call tables readonly") was
supposed to mark the sys_call_table in UML as RO by adding the const,
but it doesn't have the desired effect as it's nevertheless being placed
into the data section since __cacheline_aligned enforces sys_call_table
being placed into .data..cacheline_aligned instead. We need to use
the ____cacheline_aligned version instead to fix this issue.

Before:

$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
                 U sys_writev
0000000000000000 D sys_call_table
0000000000000000 D syscall_table_size

After:

$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
                 U sys_writev
0000000000000000 R sys_call_table
0000000000000000 D syscall_table_size

Fixes: a074335a37 ("x86, um: Mark system call tables readonly")
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2015-01-04 14:21:25 +01:00
Andy Lutomirski
bced35b65a x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic
In some IST handlers, if the interrupt came from user mode,
we can safely enable preemption.  Add helpers to do it safely.

This is intended to be used my the memory failure code in
do_machine_check.

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-02 10:22:46 -08:00
Andy Lutomirski
83653c16da x86: Clean up current_stack_pointer
There's no good reason for it to be a macro, and x86_64 will want to
use it, so it should be in a header.

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-02 10:22:46 -08:00
Andy Lutomirski
9592747538 x86, traps: Track entry into and exit from IST context
We currently pretend that IST context is like standard exception
context, but this is incorrect.  IST entries from userspace are like
standard exceptions except that they use per-cpu stacks, so they are
atomic.  IST entries from kernel space are like NMIs from RCU's
perspective -- they are not quiescent states even if they
interrupted the kernel during a quiescent state.

Add and use ist_enter and ist_exit to track IST context.  Even
though x86_32 has no IST stacks, we track these interrupts the same
way.

This fixes two issues:

 - Scheduling from an IST interrupt handler will now warn.  It would
   previously appear to work as long as we got lucky and nothing
   overwrote the stack frame.  (I don't know of any bugs in this
   that would trigger the warning, but it's good to be on the safe
   side.)

 - RCU handling in IST context was dangerous.  As far as I know,
   only machine checks were likely to trigger this, but it's good to
   be on the safe side.

Note that the machine check handlers appears to have been missing
any context tracking at all before this patch.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-02 10:22:46 -08:00
Andy Lutomirski
48e08d0fb2 x86, entry: Switch stacks on a paranoid entry from userspace
This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack.  Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, suprisingly, simpler and shorter than the current code.  It
removes the IMO rather frightening paranoid_userspace path, and it
make sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This will also enable us to create non-atomic sections within
machine checks from userspace, which will simplify memory failure
handling.  It will also allow the upcoming fsgsbase code to be
simplified, because it doesn't need to worry about usergs when
scheduling in paranoid_exit, as that code no longer exists.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2015-01-02 10:22:45 -08:00
Pavel Machek
801a559114 x86: init_mem_mapping(): use capital BIOS in comment
Use capital BIOS in comment. Its cleaner, and allows diference
between BIOS and BIOs.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-02 12:07:06 +01:00
Ingo Molnar
2aba73a614 Merge tag 'pr-20141223-x86-vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/urgent
Pull VDSO fix from Andy Lutomirski:

 "This is hopefully the last vdso fix for 3.19.  It should be very
  safe (it just adds a volatile).

  I don't think it fixes an actual bug (the __getcpu calls in the
  pvclock code may not have been needed in the first place), but
  discussion on that point is ongoing.

  It also fixes a big performance issue in 3.18 and earlier in which
  the lsl instructions in vclock_gettime got hoisted so far up the
  function that they happened even when the function they were in was
  never called.  n 3.19, the performance issue seems to be gone due to
  the whims of my compiler and some interaction with a branch that's
  now gone.

  I'll hopefully have a much bigger overhaul of the pvclock code
  for 3.20, but it needs careful review."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-01 22:21:22 +01:00
Paolo Bonzini
a629df7ead kvm: x86: drop severity of "generation wraparound" message
Since most virtual machines raise this message once, it is a bit annoying.
Make it KERN_DEBUG severity.

Cc: stable@vger.kernel.org
Fixes: 7a2e8aaf0f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-27 21:52:28 +01:00
Tiejun Chen
baa035227b kvm: x86: vmx: reorder some msr writing
The commit 34a1cd60d1, "x86: vmx: move some vmx setting from
vmx_init() to hardware_setup()", tried to refactor some codes
specific to vmx hardware setting into hardware_setup(), but some
msr writing should depend on our previous setting condition like
enable_apicv, enable_ept and so on.

Reported-by: Jamie Heilman <jamie@audible.transient.net>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-27 21:52:10 +01:00
Andy Lutomirski
1ddf0b1b11 x86, vdso: Use asm volatile in __getcpu
In Linux 3.18 and below, GCC hoists the lsl instructions in the
pvclock code all the way to the beginning of __vdso_clock_gettime,
slowing the non-paravirt case significantly.  For unknown reasons,
presumably related to the removal of a branch, the performance issue
is gone as of

e76b027e64 x86,vdso: Use LSL unconditionally for vgetcpu

but I don't trust GCC enough to expect the problem to stay fixed.

There should be no correctness issue, because the __getcpu calls in
__vdso_vlock_gettime were never necessary in the first place.

Note to stable maintainers: In 3.18 and below, depending on
configuration, gcc 4.9.2 generates code like this:

     9c3:       44 0f 03 e8             lsl    %ax,%r13d
     9c7:       45 89 eb                mov    %r13d,%r11d
     9ca:       0f 03 d8                lsl    %ax,%ebx

This patch won't apply as is to any released kernel, but I'll send a
trivial backported version if needed.

Fixes: 51c19b4f59 x86: vdso: pvclock gettime support
Cc: stable@vger.kernel.org # 3.8+
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2014-12-23 13:05:30 -08:00
Bjørn Mork
280dbc5723 x86/build: Clean auto-generated processor feature files
Commit 9def39be4e ("x86: Support compiling out human-friendly
processor feature names") made two source file targets
conditional. Such conditional targets will not be cleaned
automatically by make mrproper.

Fix by adding explicit clean-files targets for the two files.

Fixes: 9def39be4e ("x86: Support compiling out human-friendly processor feature names")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Cc: Josh Triplett <josh@joshtriplett.org>
Link: http://lkml.kernel.org/r/1419335863-10608-1-git-send-email-bjorn@mork.no
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 15:37:06 +01:00
Sylvain BERTRAND
ea174f4c4f x86: Fix mkcapflags.sh bash-ism
Chocked while compiling linux with dash shell instead of bash
shell. See:

   http://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html

Signed-off-by: Sylvain BERTRAND <sylvain.bertrand@gmail.com>
Link: http://lkml.kernel.org/r/20141223123912.GA1386@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 15:34:57 +01:00
Jan Beulich
132978b94e x86: Fix step size adjustment during initial memory mapping
The old scheme can lead to failure in certain cases - the
problem is that after bumping step_size the next (non-final)
iteration is only guaranteed to make available a memory block
the size of what step_size was before. E.g. for a memory block
[0,3004600000) we'd have:

 iter	start		end		step		amount
 1	3004400000	30045fffff	 2M		  2M
 2	3004000000	30043fffff	64M		  4M
 3	3000000000	3003ffffff	 2G		 64M
 4	2000000000	2fffffffff	64G		 64G

Yet to map 64G with 4k pages (as happens e.g. under PV Xen) we
need slightly over 128M, but the first three iterations made
only about 70M available.

The condition (new_mapped_ram_size > mapped_ram_size) for
bumping step_size is just not suitable. Instead we want to bump
it when we know we have enough memory available to cover a block
of the new step_size. And rather than making that condition more
complicated than needed, simply adjust step_size by the largest
possible factor we know we can cover at that point - which is
shifting it left by one less than the difference between page
table level shifts. (Interestingly the original STEP_SIZE_SHIFT
definition had a comment hinting at that having been the
intention, just that it should have been PUD_SHIFT-PMD_SHIFT-1
instead of (PUD_SHIFT-PMD_SHIFT)/2, and of course for non-PAE
32-bit we can't really use these two constants as they're equal
there.)

Furthermore the comment in get_new_step_size() didn't get
updated when the bottom-down mapping logic got added. Yet while
an overflow (flushing step_size to zero) of the shift doesn't
matter for the top-down method, it does for bottom-up because
round_up(x, 0) = 0, and an upper range boundary of zero can't
really work well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/54945C1E020000780005114E@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 11:39:34 +01:00
Boris Ostrovsky
8b8cd8a367 x86/xen: Remove unnecessary BUG_ON(preemptible()) in xen_setup_timer()
There is no reason for having it and, with commit 250a1ac685 ("x86,
smpboot: Remove pointless preempt_disable() in
native_smp_prepare_cpus()"), it prevents HVM guests from booting.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-12-23 10:31:55 +00:00
Rickard Strandqvist
2b261f9f7b x86/platform: Remove unused function from apb_timer.c
Remove the function is_apbt_capable() that is not used anywhere.

This was partially found by using a static code analysis program
called 'cppcheck'.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1419166698-2470-1-git-send-email-rickard_strandqvist@spectrumdigital.se
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 10:43:35 +01:00
Borislav Petkov
6ca7a8a150 x86/fpu: Use a symbolic name for asm operand
Fix up the else-case in fpu_fxsave() which seems like it has
been overlooked. Correct comment style in restore_fpu_checking()
while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1419170543-11393-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 10:41:12 +01:00
Borislav Petkov
83737691e5 x86, mce: Fix sparse errors
Make stuff used in mce.c only, static.

Signed-off-by: Borislav Petkov <bp@suse.de>
2014-12-22 21:04:31 +01:00
Andy Lutomirski
6c80f87ed4 x86, mce: Improve timeout error messages
There are four different possible types of timeouts.  Distinguish
them in the logs to help debug them.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/0fa6d2653a54a01c48b43a3583caf950ea99606e.1419178397.git.luto@amacapital.net
Signed-off-by: Borislav Petkov <bp@suse.de>
2014-12-22 17:47:45 +01:00
Li Bin
b5bfc51707 livepatch: move x86 specific ftrace handler code to arch/x86
The execution flow redirection related implemention in the livepatch
ftrace handler is depended on the specific architecture. This patch
introduces klp_arch_set_pc(like kgdb_arch_set_pc) interface to change
the pt_regs.

Signed-off-by: Li Bin <huawei.libin@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:49 +01:00
Seth Jennings
b700e7f03d livepatch: kernel: add support for live patching
This commit introduces code for the live patching core.  It implements
an ftrace-based mechanism and kernel interface for doing live patching
of kernel and kernel module functions.

It represents the greatest common functionality set between kpatch and
kgraft and can accept patches built using either method.

This first version does not implement any consistency mechanism that
ensures that old and new code do not run together.  In practice, ~90% of
CVEs are safe to apply in this way, since they simply add a conditional
check.  However, any function change that can not execute safely with
the old version of the function can _not_ be safely applied in this
version.

[ jkosina@suse.cz: due to the number of contributions that got folded into
  this original patch from Seth Jennings, add SUSE's copyright as well, as
  discussed via e-mail ]

Signed-off-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-12-22 15:40:49 +01:00
Ingo Molnar
fbe1bf1406 Merge tag 'pr-20141220-x86-vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux into x86/urgent
Pull a VDSO fix from Andy Lutomirski:

  "One vdso fix for a longstanding ASLR bug that's been in the news lately.

   The vdso base address has always been randomized, and I don't think there's
   anything particularly wrong with the range over which it's randomized,
   but the implementation seems to have been buggy since the very beginning.

   This fixes the implementation to remove a large bias that caused a small
   fraction of possible vdso load addresess to be vastly more likely than
   the rest of the possible addresses."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-21 11:16:49 +01:00
Andy Lutomirski
394f56fe48 x86_64, vdso: Fix the vdso address randomization algorithm
The theory behind vdso randomization is that it's mapped at a random
offset above the top of the stack.  To avoid wasting a page of
memory for an extra page table, the vdso isn't supposed to extend
past the lowest PMD into which it can fit.  Other than that, the
address should be a uniformly distributed address that meets all of
the alignment requirements.

The current algorithm is buggy: the vdso has about a 50% probability
of being at the very end of a PMD.  The current algorithm also has a
decent chance of failing outright due to incorrect handling of the
case where the top of the stack is near the top of its PMD.

This fixes the implementation.  The paxtest estimate of vdso
"randomisation" improves from 11 bits to 18 bits.  (Disclaimer: I
don't know what the paxtest code is actually calculating.)

It's worth noting that this algorithm is inherently biased: the vdso
is more likely to end up near the end of its PMD than near the
beginning.  Ideally we would either nix the PMD sharing requirement
or jointly randomize the vdso and the stack to reduce the bias.

In the mean time, this is a considerable improvement with basically
no risk of compatibility issues, since the allowed outputs of the
algorithm are unchanged.

As an easy test, doing this:

for i in `seq 10000`
  do grep -P vdso /proc/self/maps |cut -d- -f1
done |sort |uniq -d

used to produce lots of output (1445 lines on my most recent run).
A tiny subset looks like this:

7fffdfffe000
7fffe01fe000
7fffe05fe000
7fffe07fe000
7fffe09fe000
7fffe0bfe000
7fffe0dfe000

Note the suspicious fe000 endings.  With the fix, I get a much more
palatable 76 repeated addresses.

Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2014-12-20 16:56:57 -08:00
Linus Torvalds
60815cf2e0 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
 "kernel: Provide READ_ONCE and ASSIGN_ONCE

  As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
  ACCESS_ONCE might fail with specific compilers for non-scalar
  accesses.

  Here is a set of patches to tackle that problem.

  The first patch introduce READ_ONCE and ASSIGN_ONCE.  If the data
  structure is larger than the machine word size memcpy is used and a
  warning is emitted.  The next patches fix up several in-tree users of
  ACCESS_ONCE on non-scalar types.

  This does not yet contain a patch that forces ACCESS_ONCE to work only
  on scalar types.  This is targetted for the next merge window as Linux
  next already contains new offenders regarding ACCESS_ONCE vs.
  non-scalar types"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
  s390/kvm: REPLACE barrier fixup with READ_ONCE
  arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
  arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
  mips/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
  mm: replace ACCESS_ONCE with READ_ONCE or barriers
  kernel: Provide READ_ONCE and ASSIGN_ONCE
2014-12-20 16:48:59 -08:00