summaryrefslogtreecommitdiff
path: root/Documentation/virtual
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virtual')
-rw-r--r--Documentation/virtual/00-INDEX11
-rw-r--r--Documentation/virtual/kvm/api.txt2000
-rw-r--r--Documentation/virtual/kvm/cpuid.txt45
-rw-r--r--Documentation/virtual/kvm/locking.txt25
-rw-r--r--Documentation/virtual/kvm/mmu.txt366
-rw-r--r--Documentation/virtual/kvm/msr.txt221
-rw-r--r--Documentation/virtual/kvm/nested-vmx.txt251
-rw-r--r--Documentation/virtual/kvm/ppc-pv.txt178
-rw-r--r--Documentation/virtual/kvm/review-checklist.txt38
-rw-r--r--Documentation/virtual/kvm/timekeeping.txt612
-rw-r--r--Documentation/virtual/uml/UserModeLinux-HOWTO.txt4589
-rw-r--r--Documentation/virtual/virtio-spec.txt2200
12 files changed, 10536 insertions, 0 deletions
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX
new file mode 100644
index 00000000..924bd462
--- /dev/null
+++ b/Documentation/virtual/00-INDEX
@@ -0,0 +1,11 @@
+Virtualization support in the Linux kernel.
+
+00-INDEX
+ - this file.
+kvm/
+ - Kernel Virtual Machine. See also http://linux-kvm.org
+uml/
+ - User Mode Linux, builds/runs Linux kernel as a userspace program.
+virtio.txt
+ - Text version of draft virtio spec.
+ See http://ozlabs.org/~rusty/virtio-spec
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
new file mode 100644
index 00000000..6386f8c0
--- /dev/null
+++ b/Documentation/virtual/kvm/api.txt
@@ -0,0 +1,2000 @@
+The Definitive KVM (Kernel-based Virtual Machine) API Documentation
+===================================================================
+
+1. General description
+
+The kvm API is a set of ioctls that are issued to control various aspects
+of a virtual machine. The ioctls belong to three classes
+
+ - System ioctls: These query and set global attributes which affect the
+ whole kvm subsystem. In addition a system ioctl is used to create
+ virtual machines
+
+ - VM ioctls: These query and set attributes that affect an entire virtual
+ machine, for example memory layout. In addition a VM ioctl is used to
+ create virtual cpus (vcpus).
+
+ Only run VM ioctls from the same process (address space) that was used
+ to create the VM.
+
+ - vcpu ioctls: These query and set attributes that control the operation
+ of a single virtual cpu.
+
+ Only run vcpu ioctls from the same thread that was used to create the
+ vcpu.
+
+2. File descriptors
+
+The kvm API is centered around file descriptors. An initial
+open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
+can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
+handle will create a VM file descriptor which can be used to issue VM
+ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
+and return a file descriptor pointing to it. Finally, ioctls on a vcpu
+fd can be used to control the vcpu, including the important task of
+actually running guest code.
+
+In general file descriptors can be migrated among processes by means
+of fork() and the SCM_RIGHTS facility of unix domain socket. These
+kinds of tricks are explicitly not supported by kvm. While they will
+not cause harm to the host, their actual behavior is not guaranteed by
+the API. The only supported use is one virtual machine per process,
+and one vcpu per thread.
+
+3. Extensions
+
+As of Linux 2.6.22, the KVM ABI has been stabilized: no backward
+incompatible change are allowed. However, there is an extension
+facility that allows backward-compatible extensions to the API to be
+queried and used.
+
+The extension mechanism is not based on on the Linux version number.
+Instead, kvm defines extension identifiers and a facility to query
+whether a particular extension identifier is available. If it is, a
+set of ioctls is available for application use.
+
+4. API description
+
+This section describes ioctls that can be used to control kvm guests.
+For each ioctl, the following information is provided along with a
+description:
+
+ Capability: which KVM extension provides this ioctl. Can be 'basic',
+ which means that is will be provided by any kernel that supports
+ API version 12 (see section 4.1), or a KVM_CAP_xyz constant, which
+ means availability needs to be checked with KVM_CHECK_EXTENSION
+ (see section 4.4).
+
+ Architectures: which instruction set architectures provide this ioctl.
+ x86 includes both i386 and x86_64.
+
+ Type: system, vm, or vcpu.
+
+ Parameters: what parameters are accepted by the ioctl.
+
+ Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
+ are not detailed, but errors with specific meanings are.
+
+4.1 KVM_GET_API_VERSION
+
+Capability: basic
+Architectures: all
+Type: system ioctl
+Parameters: none
+Returns: the constant KVM_API_VERSION (=12)
+
+This identifies the API version as the stable kvm API. It is not
+expected that this number will change. However, Linux 2.6.20 and
+2.6.21 report earlier versions; these are not documented and not
+supported. Applications should refuse to run if KVM_GET_API_VERSION
+returns a value other than 12. If this check passes, all ioctls
+described as 'basic' will be available.
+
+4.2 KVM_CREATE_VM
+
+Capability: basic
+Architectures: all
+Type: system ioctl
+Parameters: machine type identifier (KVM_VM_*)
+Returns: a VM fd that can be used to control the new virtual machine.
+
+The new VM has no virtual cpus and no memory. An mmap() of a VM fd
+will access the virtual machine's physical address space; offset zero
+corresponds to guest physical address zero. Use of mmap() on a VM fd
+is discouraged if userspace memory allocation (KVM_CAP_USER_MEMORY) is
+available.
+You most certainly want to use 0 as machine type.
+
+In order to create user controlled virtual machines on S390, check
+KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
+privileged user (CAP_SYS_ADMIN).
+
+4.3 KVM_GET_MSR_INDEX_LIST
+
+Capability: basic
+Architectures: x86
+Type: system
+Parameters: struct kvm_msr_list (in/out)
+Returns: 0 on success; -1 on error
+Errors:
+ E2BIG: the msr index list is to be to fit in the array specified by
+ the user.
+
+struct kvm_msr_list {
+ __u32 nmsrs; /* number of msrs in entries */
+ __u32 indices[0];
+};
+
+This ioctl returns the guest msrs that are supported. The list varies
+by kvm version and host processor, but does not change otherwise. The
+user fills in the size of the indices array in nmsrs, and in return
+kvm adjusts nmsrs to reflect the actual number of msrs and fills in
+the indices array with their numbers.
+
+Note: if kvm indicates supports MCE (KVM_CAP_MCE), then the MCE bank MSRs are
+not returned in the MSR list, as different vcpus can have a different number
+of banks, as set via the KVM_X86_SETUP_MCE ioctl.
+
+4.4 KVM_CHECK_EXTENSION
+
+Capability: basic
+Architectures: all
+Type: system ioctl
+Parameters: extension identifier (KVM_CAP_*)
+Returns: 0 if unsupported; 1 (or some other positive integer) if supported
+
+The API allows the application to query about extensions to the core
+kvm API. Userspace passes an extension identifier (an integer) and
+receives an integer that describes the extension availability.
+Generally 0 means no and 1 means yes, but some extensions may report
+additional information in the integer return value.
+
+4.5 KVM_GET_VCPU_MMAP_SIZE
+
+Capability: basic
+Architectures: all
+Type: system ioctl
+Parameters: none
+Returns: size of vcpu mmap area, in bytes
+
+The KVM_RUN ioctl (cf.) communicates with userspace via a shared
+memory region. This ioctl returns the size of that region. See the
+KVM_RUN documentation for details.
+
+4.6 KVM_SET_MEMORY_REGION
+
+Capability: basic
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_memory_region (in)
+Returns: 0 on success, -1 on error
+
+This ioctl is obsolete and has been removed.
+
+4.7 KVM_CREATE_VCPU
+
+Capability: basic
+Architectures: all
+Type: vm ioctl
+Parameters: vcpu id (apic id on x86)
+Returns: vcpu fd on success, -1 on error
+
+This API adds a vcpu to a virtual machine. The vcpu id is a small integer
+in the range [0, max_vcpus).
+
+The recommended max_vcpus value can be retrieved using the KVM_CAP_NR_VCPUS of
+the KVM_CHECK_EXTENSION ioctl() at run-time.
+The maximum possible value for max_vcpus can be retrieved using the
+KVM_CAP_MAX_VCPUS of the KVM_CHECK_EXTENSION ioctl() at run-time.
+
+If the KVM_CAP_NR_VCPUS does not exist, you should assume that max_vcpus is 4
+cpus max.
+If the KVM_CAP_MAX_VCPUS does not exist, you should assume that max_vcpus is
+same as the value returned from KVM_CAP_NR_VCPUS.
+
+On powerpc using book3s_hv mode, the vcpus are mapped onto virtual
+threads in one or more virtual CPU cores. (This is because the
+hardware requires all the hardware threads in a CPU core to be in the
+same partition.) The KVM_CAP_PPC_SMT capability indicates the number
+of vcpus per virtual core (vcore). The vcore id is obtained by
+dividing the vcpu id by the number of vcpus per vcore. The vcpus in a
+given vcore will always be in the same physical core as each other
+(though that might be a different physical core from time to time).
+Userspace can control the threading (SMT) mode of the guest by its
+allocation of vcpu ids. For example, if userspace wants
+single-threaded guest vcpus, it should make all vcpu ids be a multiple
+of the number of vcpus per vcore.
+
+On powerpc using book3s_hv mode, the vcpus are mapped onto virtual
+threads in one or more virtual CPU cores. (This is because the
+hardware requires all the hardware threads in a CPU core to be in the
+same partition.) The KVM_CAP_PPC_SMT capability indicates the number
+of vcpus per virtual core (vcore). The vcore id is obtained by
+dividing the vcpu id by the number of vcpus per vcore. The vcpus in a
+given vcore will always be in the same physical core as each other
+(though that might be a different physical core from time to time).
+Userspace can control the threading (SMT) mode of the guest by its
+allocation of vcpu ids. For example, if userspace wants
+single-threaded guest vcpus, it should make all vcpu ids be a multiple
+of the number of vcpus per vcore.
+
+For virtual cpus that have been created with S390 user controlled virtual
+machines, the resulting vcpu fd can be memory mapped at page offset
+KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
+cpu's hardware control block.
+
+4.8 KVM_GET_DIRTY_LOG (vm ioctl)
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_dirty_log (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_GET_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;
+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 padding;
+ };
+};
+
+Given a memory slot, return a bitmap containing any pages dirtied
+since the last call to this ioctl. Bit 0 is the first page in the
+memory slot. Ensure the entire structure is cleared to avoid padding
+issues.
+
+4.9 KVM_SET_MEMORY_ALIAS
+
+Capability: basic
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_memory_alias (in)
+Returns: 0 (success), -1 (error)
+
+This ioctl is obsolete and has been removed.
+
+4.10 KVM_RUN
+
+Capability: basic
+Architectures: all
+Type: vcpu ioctl
+Parameters: none
+Returns: 0 on success, -1 on error
+Errors:
+ EINTR: an unmasked signal is pending
+
+This ioctl is used to run a guest virtual cpu. While there are no
+explicit parameters, there is an implicit parameter block that can be
+obtained by mmap()ing the vcpu fd at offset 0, with the size given by
+KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct
+kvm_run' (see below).
+
+4.11 KVM_GET_REGS
+
+Capability: basic
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_regs (out)
+Returns: 0 on success, -1 on error
+
+Reads the general purpose registers from the vcpu.
+
+/* x86 */
+struct kvm_regs {
+ /* out (KVM_GET_REGS) / in (KVM_SET_REGS) */
+ __u64 rax, rbx, rcx, rdx;
+ __u64 rsi, rdi, rsp, rbp;
+ __u64 r8, r9, r10, r11;
+ __u64 r12, r13, r14, r15;
+ __u64 rip, rflags;
+};
+
+4.12 KVM_SET_REGS
+
+Capability: basic
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_regs (in)
+Returns: 0 on success, -1 on error
+
+Writes the general purpose registers into the vcpu.
+
+See KVM_GET_REGS for the data structure.
+
+4.13 KVM_GET_SREGS
+
+Capability: basic
+Architectures: x86, ppc
+Type: vcpu ioctl
+Parameters: struct kvm_sregs (out)
+Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+
+/* x86 */
+struct kvm_sregs {
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+ __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
+};
+
+/* ppc -- see arch/powerpc/include/asm/kvm.h */
+
+interrupt_bitmap is a bitmap of pending external interrupts. At most
+one bit may be set. This interrupt has been acknowledged by the APIC
+but not yet injected into the cpu core.
+
+4.14 KVM_SET_SREGS
+
+Capability: basic
+Architectures: x86, ppc
+Type: vcpu ioctl
+Parameters: struct kvm_sregs (in)
+Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu. See KVM_GET_SREGS for the
+data structures.
+
+4.15 KVM_TRANSLATE
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_translation (in/out)
+Returns: 0 on success, -1 on error
+
+Translates a virtual address according to the vcpu's current address
+translation mode.
+
+struct kvm_translation {
+ /* in */
+ __u64 linear_address;
+
+ /* out */
+ __u64 physical_address;
+ __u8 valid;
+ __u8 writeable;
+ __u8 usermode;
+ __u8 pad[5];
+};
+
+4.16 KVM_INTERRUPT
+
+Capability: basic
+Architectures: x86, ppc
+Type: vcpu ioctl
+Parameters: struct kvm_interrupt (in)
+Returns: 0 on success, -1 on error
+
+Queues a hardware interrupt vector to be injected. This is only
+useful if in-kernel local APIC or equivalent is not used.
+
+/* for KVM_INTERRUPT */
+struct kvm_interrupt {
+ /* in */
+ __u32 irq;
+};
+
+X86:
+
+Note 'irq' is an interrupt vector, not an interrupt pin or line.
+
+PPC:
+
+Queues an external interrupt to be injected. This ioctl is overleaded
+with 3 different irq values:
+
+a) KVM_INTERRUPT_SET
+
+ This injects an edge type external interrupt into the guest once it's ready
+ to receive interrupts. When injected, the interrupt is done.
+
+b) KVM_INTERRUPT_UNSET
+
+ This unsets any pending interrupt.
+
+ Only available with KVM_CAP_PPC_UNSET_IRQ.
+
+c) KVM_INTERRUPT_SET_LEVEL
+
+ This injects a level type external interrupt into the guest context. The
+ interrupt stays pending until a specific ioctl with KVM_INTERRUPT_UNSET
+ is triggered.
+
+ Only available with KVM_CAP_PPC_IRQ_LEVEL.
+
+Note that any value for 'irq' other than the ones stated above is invalid
+and incurs unexpected behavior.
+
+4.17 KVM_DEBUG_GUEST
+
+Capability: basic
+Architectures: none
+Type: vcpu ioctl
+Parameters: none)
+Returns: -1 on error
+
+Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead.
+
+4.18 KVM_GET_MSRS
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_msrs (in/out)
+Returns: 0 on success, -1 on error
+
+Reads model-specific registers from the vcpu. Supported msr indices can
+be obtained using KVM_GET_MSR_INDEX_LIST.
+
+struct kvm_msrs {
+ __u32 nmsrs; /* number of msrs in entries */
+ __u32 pad;
+
+ struct kvm_msr_entry entries[0];
+};
+
+struct kvm_msr_entry {
+ __u32 index;
+ __u32 reserved;
+ __u64 data;
+};
+
+Application code should set the 'nmsrs' member (which indicates the
+size of the entries array) and the 'index' member of each array entry.
+kvm will fill in the 'data' member.
+
+4.19 KVM_SET_MSRS
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_msrs (in)
+Returns: 0 on success, -1 on error
+
+Writes model-specific registers to the vcpu. See KVM_GET_MSRS for the
+data structures.
+
+Application code should set the 'nmsrs' member (which indicates the
+size of the entries array), and the 'index' and 'data' members of each
+array entry.
+
+4.20 KVM_SET_CPUID
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_cpuid (in)
+Returns: 0 on success, -1 on error
+
+Defines the vcpu responses to the cpuid instruction. Applications
+should use the KVM_SET_CPUID2 ioctl if available.
+
+
+struct kvm_cpuid_entry {
+ __u32 function;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ __u32 padding;
+};
+
+/* for KVM_SET_CPUID */
+struct kvm_cpuid {
+ __u32 nent;
+ __u32 padding;
+ struct kvm_cpuid_entry entries[0];
+};
+
+4.21 KVM_SET_SIGNAL_MASK
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_signal_mask (in)
+Returns: 0 on success, -1 on error
+
+Defines which signals are blocked during execution of KVM_RUN. This
+signal mask temporarily overrides the threads signal mask. Any
+unblocked signal received (except SIGKILL and SIGSTOP, which retain
+their traditional behaviour) will cause KVM_RUN to return with -EINTR.
+
+Note the signal will only be delivered if not blocked by the original
+signal mask.
+
+/* for KVM_SET_SIGNAL_MASK */
+struct kvm_signal_mask {
+ __u32 len;
+ __u8 sigset[0];
+};
+
+4.22 KVM_GET_FPU
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_fpu (out)
+Returns: 0 on success, -1 on error
+
+Reads the floating point state from the vcpu.
+
+/* for KVM_GET_FPU and KVM_SET_FPU */
+struct kvm_fpu {
+ __u8 fpr[8][16];
+ __u16 fcw;
+ __u16 fsw;
+ __u8 ftwx; /* in fxsave format */
+ __u8 pad1;
+ __u16 last_opcode;
+ __u64 last_ip;
+ __u64 last_dp;
+ __u8 xmm[16][16];
+ __u32 mxcsr;
+ __u32 pad2;
+};
+
+4.23 KVM_SET_FPU
+
+Capability: basic
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_fpu (in)
+Returns: 0 on success, -1 on error
+
+Writes the floating point state to the vcpu.
+
+/* for KVM_GET_FPU and KVM_SET_FPU */
+struct kvm_fpu {
+ __u8 fpr[8][16];
+ __u16 fcw;
+ __u16 fsw;
+ __u8 ftwx; /* in fxsave format */
+ __u8 pad1;
+ __u16 last_opcode;
+ __u64 last_ip;
+ __u64 last_dp;
+ __u8 xmm[16][16];
+ __u32 mxcsr;
+ __u32 pad2;
+};
+
+4.24 KVM_CREATE_IRQCHIP
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86, ia64
+Type: vm ioctl
+Parameters: none
+Returns: 0 on success, -1 on error
+
+Creates an interrupt controller model in the kernel. On x86, creates a virtual
+ioapic, a virtual PIC (two PICs, nested), and sets up future vcpus to have a
+local APIC. IRQ routing for GSIs 0-15 is set to both PIC and IOAPIC; GSI 16-23
+only go to the IOAPIC. On ia64, a IOSAPIC is created.
+
+4.25 KVM_IRQ_LINE
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86, ia64
+Type: vm ioctl
+Parameters: struct kvm_irq_level
+Returns: 0 on success, -1 on error
+
+Sets the level of a GSI input to the interrupt controller model in the kernel.
+Requires that an interrupt controller model has been previously created with
+KVM_CREATE_IRQCHIP. Note that edge-triggered interrupts require the level
+to be set to 1 and then back to 0.
+
+struct kvm_irq_level {
+ union {
+ __u32 irq; /* GSI */
+ __s32 status; /* not used for KVM_IRQ_LEVEL */
+ };
+ __u32 level; /* 0 or 1 */
+};
+
+4.26 KVM_GET_IRQCHIP
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86, ia64
+Type: vm ioctl
+Parameters: struct kvm_irqchip (in/out)
+Returns: 0 on success, -1 on error
+
+Reads the state of a kernel interrupt controller created with
+KVM_CREATE_IRQCHIP into a buffer provided by the caller.
+
+struct kvm_irqchip {
+ __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
+ __u32 pad;
+ union {
+ char dummy[512]; /* reserving space */
+ struct kvm_pic_state pic;
+ struct kvm_ioapic_state ioapic;
+ } chip;
+};
+
+4.27 KVM_SET_IRQCHIP
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86, ia64
+Type: vm ioctl
+Parameters: struct kvm_irqchip (in)
+Returns: 0 on success, -1 on error
+
+Sets the state of a kernel interrupt controller created with
+KVM_CREATE_IRQCHIP from a buffer provided by the caller.
+
+struct kvm_irqchip {
+ __u32 chip_id; /* 0 = PIC1, 1 = PIC2, 2 = IOAPIC */
+ __u32 pad;
+ union {
+ char dummy[512]; /* reserving space */
+ struct kvm_pic_state pic;
+ struct kvm_ioapic_state ioapic;
+ } chip;
+};
+
+4.28 KVM_XEN_HVM_CONFIG
+
+Capability: KVM_CAP_XEN_HVM
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_xen_hvm_config (in)
+Returns: 0 on success, -1 on error
+
+Sets the MSR that the Xen HVM guest uses to initialize its hypercall
+page, and provides the starting address and size of the hypercall
+blobs in userspace. When the guest writes the MSR, kvm copies one
+page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
+memory.
+
+struct kvm_xen_hvm_config {
+ __u32 flags;
+ __u32 msr;
+ __u64 blob_addr_32;
+ __u64 blob_addr_64;
+ __u8 blob_size_32;
+ __u8 blob_size_64;
+ __u8 pad2[30];
+};
+
+4.29 KVM_GET_CLOCK
+
+Capability: KVM_CAP_ADJUST_CLOCK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_clock_data (out)
+Returns: 0 on success, -1 on error
+
+Gets the current timestamp of kvmclock as seen by the current guest. In
+conjunction with KVM_SET_CLOCK, it is used to ensure monotonicity on scenarios
+such as migration.
+
+struct kvm_clock_data {
+ __u64 clock; /* kvmclock current value */
+ __u32 flags;
+ __u32 pad[9];
+};
+
+4.30 KVM_SET_CLOCK
+
+Capability: KVM_CAP_ADJUST_CLOCK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_clock_data (in)
+Returns: 0 on success, -1 on error
+
+Sets the current timestamp of kvmclock to the value specified in its parameter.
+In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
+such as migration.
+
+struct kvm_clock_data {
+ __u64 clock; /* kvmclock current value */
+ __u32 flags;
+ __u32 pad[9];
+};
+
+4.31 KVM_GET_VCPU_EVENTS
+
+Capability: KVM_CAP_VCPU_EVENTS
+Extended by: KVM_CAP_INTR_SHADOW
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_vcpu_event (out)
+Returns: 0 on success, -1 on error
+
+Gets currently pending exceptions, interrupts, and NMIs as well as related
+states of the vcpu.
+
+struct kvm_vcpu_events {
+ struct {
+ __u8 injected;
+ __u8 nr;
+ __u8 has_error_code;
+ __u8 pad;
+ __u32 error_code;
+ } exception;
+ struct {
+ __u8 injected;
+ __u8 nr;
+ __u8 soft;
+ __u8 shadow;
+ } interrupt;
+ struct {
+ __u8 injected;
+ __u8 pending;
+ __u8 masked;
+ __u8 pad;
+ } nmi;
+ __u32 sipi_vector;
+ __u32 flags;
+};
+
+KVM_VCPUEVENT_VALID_SHADOW may be set in the flags field to signal that
+interrupt.shadow contains a valid state. Otherwise, this field is undefined.
+
+4.32 KVM_SET_VCPU_EVENTS
+
+Capability: KVM_CAP_VCPU_EVENTS
+Extended by: KVM_CAP_INTR_SHADOW
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_vcpu_event (in)
+Returns: 0 on success, -1 on error
+
+Set pending exceptions, interrupts, and NMIs as well as related states of the
+vcpu.
+
+See KVM_GET_VCPU_EVENTS for the data structure.
+
+Fields that may be modified asynchronously by running VCPUs can be excluded
+from the update. These fields are nmi.pending and sipi_vector. Keep the
+corresponding bits in the flags field cleared to suppress overwriting the
+current in-kernel state. The bits are:
+
+KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
+KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector
+
+If KVM_CAP_INTR_SHADOW is available, KVM_VCPUEVENT_VALID_SHADOW can be set in
+the flags field to signal that interrupt.shadow contains a valid state and
+shall be written into the VCPU.
+
+4.33 KVM_GET_DEBUGREGS
+
+Capability: KVM_CAP_DEBUGREGS
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_debugregs (out)
+Returns: 0 on success, -1 on error
+
+Reads debug registers from the vcpu.
+
+struct kvm_debugregs {
+ __u64 db[4];
+ __u64 dr6;
+ __u64 dr7;
+ __u64 flags;
+ __u64 reserved[9];
+};
+
+4.34 KVM_SET_DEBUGREGS
+
+Capability: KVM_CAP_DEBUGREGS
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_debugregs (in)
+Returns: 0 on success, -1 on error
+
+Writes debug registers into the vcpu.
+
+See KVM_GET_DEBUGREGS for the data structure. The flags field is unused
+yet and must be cleared on entry.
+
+4.35 KVM_SET_USER_MEMORY_REGION
+
+Capability: KVM_CAP_USER_MEM
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_userspace_memory_region (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_userspace_memory_region {
+ __u32 slot;
+ __u32 flags;
+ __u64 guest_phys_addr;
+ __u64 memory_size; /* bytes */
+ __u64 userspace_addr; /* start of the userspace allocated memory */
+};
+
+/* for kvm_memory_region::flags */
+#define KVM_MEM_LOG_DIRTY_PAGES 1UL
+
+This ioctl allows the user to create or modify a guest physical memory
+slot. When changing an existing slot, it may be moved in the guest
+physical memory space, or its flags may be modified. It may not be
+resized. Slots may not overlap in guest physical address space.
+
+Memory for the region is taken starting at the address denoted by the
+field userspace_addr, which must point at user addressable memory for
+the entire memory slot size. Any object may back this memory, including
+anonymous memory, ordinary files, and hugetlbfs.
+
+It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
+be identical. This allows large pages in the guest to be backed by large
+pages in the host.
+
+The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
+instructs kvm to keep track of writes to memory within the slot. See
+the KVM_GET_DIRTY_LOG ioctl.
+
+When the KVM_CAP_SYNC_MMU capability, changes in the backing of the memory
+region are automatically reflected into the guest. For example, an mmap()
+that affects the region will be made visible immediately. Another example
+is madvise(MADV_DROP).
+
+It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
+The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
+allocation and is deprecated.
+
+4.36 KVM_SET_TSS_ADDR
+
+Capability: KVM_CAP_SET_TSS_ADDR
+Architectures: x86
+Type: vm ioctl
+Parameters: unsigned long tss_address (in)
+Returns: 0 on success, -1 on error
+
+This ioctl defines the physical address of a three-page region in the guest
+physical address space. The region must be within the first 4GB of the
+guest physical address space and must not conflict with any memory slot
+or any mmio address. The guest may malfunction if it accesses this memory
+region.
+
+This ioctl is required on Intel-based hosts. This is needed on Intel hardware
+because of a quirk in the virtualization implementation (see the internals
+documentation when it pops into existence).
+
+4.37 KVM_ENABLE_CAP
+
+Capability: KVM_CAP_ENABLE_CAP
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_enable_cap (in)
+Returns: 0 on success; -1 on error
+
++Not all extensions are enabled by default. Using this ioctl the application
+can enable an extension, making it available to the guest.
+
+On systems that do not support this ioctl, it always fails. On systems that
+do support it, it only works for extensions that are supported for enablement.
+
+To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
+be used.
+
+struct kvm_enable_cap {
+ /* in */
+ __u32 cap;
+
+The capability that is supposed to get enabled.
+
+ __u32 flags;
+
+A bitfield indicating future enhancements. Has to be 0 for now.
+
+ __u64 args[4];
+
+Arguments for enabling a feature. If a feature needs initial values to
+function properly, this is the place to put them.
+
+ __u8 pad[64];
+};
+
+4.38 KVM_GET_MP_STATE
+
+Capability: KVM_CAP_MP_STATE
+Architectures: x86, ia64
+Type: vcpu ioctl
+Parameters: struct kvm_mp_state (out)
+Returns: 0 on success; -1 on error
+
+struct kvm_mp_state {
+ __u32 mp_state;
+};
+
+Returns the vcpu's current "multiprocessing state" (though also valid on
+uniprocessor guests).
+
+Possible values are:
+
+ - KVM_MP_STATE_RUNNABLE: the vcpu is currently running
+ - KVM_MP_STATE_UNINITIALIZED: the vcpu is an application processor (AP)
+ which has not yet received an INIT signal
+ - KVM_MP_STATE_INIT_RECEIVED: the vcpu has received an INIT signal, and is
+ now ready for a SIPI
+ - KVM_MP_STATE_HALTED: the vcpu has executed a HLT instruction and
+ is waiting for an interrupt
+ - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector
+ accessible via KVM_GET_VCPU_EVENTS)
+
+This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
+irqchip, the multiprocessing state must be maintained by userspace.
+
+4.39 KVM_SET_MP_STATE
+
+Capability: KVM_CAP_MP_STATE
+Architectures: x86, ia64
+Type: vcpu ioctl
+Parameters: struct kvm_mp_state (in)
+Returns: 0 on success; -1 on error
+
+Sets the vcpu's current "multiprocessing state"; see KVM_GET_MP_STATE for
+arguments.
+
+This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
+irqchip, the multiprocessing state must be maintained by userspace.
+
+4.40 KVM_SET_IDENTITY_MAP_ADDR
+
+Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
+Architectures: x86
+Type: vm ioctl
+Parameters: unsigned long identity (in)
+Returns: 0 on success, -1 on error
+
+This ioctl defines the physical address of a one-page region in the guest
+physical address space. The region must be within the first 4GB of the
+guest physical address space and must not conflict with any memory slot
+or any mmio address. The guest may malfunction if it accesses this memory
+region.
+
+This ioctl is required on Intel-based hosts. This is needed on Intel hardware
+because of a quirk in the virtualization implementation (see the internals
+documentation when it pops into existence).
+
+4.41 KVM_SET_BOOT_CPU_ID
+
+Capability: KVM_CAP_SET_BOOT_CPU_ID
+Architectures: x86, ia64
+Type: vm ioctl
+Parameters: unsigned long vcpu_id
+Returns: 0 on success, -1 on error
+
+Define which vcpu is the Bootstrap Processor (BSP). Values are the same
+as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
+is vcpu 0.
+
+4.42 KVM_GET_XSAVE
+
+Capability: KVM_CAP_XSAVE
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_xsave (out)
+Returns: 0 on success, -1 on error
+
+struct kvm_xsave {
+ __u32 region[1024];
+};
+
+This ioctl would copy current vcpu's xsave struct to the userspace.
+
+4.43 KVM_SET_XSAVE
+
+Capability: KVM_CAP_XSAVE
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_xsave (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_xsave {
+ __u32 region[1024];
+};
+
+This ioctl would copy userspace's xsave struct to the kernel.
+
+4.44 KVM_GET_XCRS
+
+Capability: KVM_CAP_XCRS
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_xcrs (out)
+Returns: 0 on success, -1 on error
+
+struct kvm_xcr {
+ __u32 xcr;
+ __u32 reserved;
+ __u64 value;
+};
+
+struct kvm_xcrs {
+ __u32 nr_xcrs;
+ __u32 flags;
+ struct kvm_xcr xcrs[KVM_MAX_XCRS];
+ __u64 padding[16];
+};
+
+This ioctl would copy current vcpu's xcrs to the userspace.
+
+4.45 KVM_SET_XCRS
+
+Capability: KVM_CAP_XCRS
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_xcrs (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_xcr {
+ __u32 xcr;
+ __u32 reserved;
+ __u64 value;
+};
+
+struct kvm_xcrs {
+ __u32 nr_xcrs;
+ __u32 flags;
+ struct kvm_xcr xcrs[KVM_MAX_XCRS];
+ __u64 padding[16];
+};
+
+This ioctl would set vcpu's xcr to the value userspace specified.
+
+4.46 KVM_GET_SUPPORTED_CPUID
+
+Capability: KVM_CAP_EXT_CPUID
+Architectures: x86
+Type: system ioctl
+Parameters: struct kvm_cpuid2 (in/out)
+Returns: 0 on success, -1 on error
+
+struct kvm_cpuid2 {
+ __u32 nent;
+ __u32 padding;
+ struct kvm_cpuid_entry2 entries[0];
+};
+
+#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
+#define KVM_CPUID_FLAG_STATEFUL_FUNC 2
+#define KVM_CPUID_FLAG_STATE_READ_NEXT 4
+
+struct kvm_cpuid_entry2 {
+ __u32 function;
+ __u32 index;
+ __u32 flags;
+ __u32 eax;
+ __u32 ebx;
+ __u32 ecx;
+ __u32 edx;
+ __u32 padding[3];
+};
+
+This ioctl returns x86 cpuid features which are supported by both the hardware
+and kvm. Userspace can use the information returned by this ioctl to
+construct cpuid information (for KVM_SET_CPUID2) that is consistent with
+hardware, kernel, and userspace capabilities, and with user requirements (for
+example, the user may wish to constrain cpuid to emulate older hardware,
+or for feature consistency across a cluster).
+
+Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
+with the 'nent' field indicating the number of entries in the variable-size
+array 'entries'. If the number of entries is too low to describe the cpu
+capabilities, an error (E2BIG) is returned. If the number is too high,
+the 'nent' field is adjusted and an error (ENOMEM) is returned. If the
+number is just right, the 'nent' field is adjusted to the number of valid
+entries in the 'entries' array, which is then filled.
+
+The entries returned are the host cpuid as returned by the cpuid instruction,
+with unknown or unsupported features masked out. Some features (for example,
+x2apic), may not be present in the host cpu, but are exposed by kvm if it can
+emulate them efficiently. The fields in each entry are defined as follows:
+
+ function: the eax value used to obtain the entry
+ index: the ecx value used to obtain the entry (for entries that are
+ affected by ecx)
+ flags: an OR of zero or more of the following:
+ KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
+ if the index field is valid
+ KVM_CPUID_FLAG_STATEFUL_FUNC:
+ if cpuid for this function returns different values for successive
+ invocations; there will be several entries with the same function,
+ all with this flag set
+ KVM_CPUID_FLAG_STATE_READ_NEXT:
+ for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
+ the first entry to be read by a cpu
+ eax, ebx, ecx, edx: the values returned by the cpuid instruction for
+ this function/index combination
+
+The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned
+as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC
+support. Instead it is reported via
+
+ ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
+
+if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the
+feature in userspace, then you can enable the feature for KVM_SET_CPUID2.
+
+4.47 KVM_PPC_GET_PVINFO
+
+Capability: KVM_CAP_PPC_GET_PVINFO
+Architectures: ppc
+Type: vm ioctl
+Parameters: struct kvm_ppc_pvinfo (out)
+Returns: 0 on success, !0 on error
+
+struct kvm_ppc_pvinfo {
+ __u32 flags;
+ __u32 hcall[4];
+ __u8 pad[108];
+};
+
+This ioctl fetches PV specific information that need to be passed to the guest
+using the device tree or other means from vm context.
+
+For now the only implemented piece of information distributed here is an array
+of 4 instructions that make up a hypercall.
+
+If any additional field gets added to this structure later on, a bit for that
+additional piece of information will be set in the flags bitmap.
+
+4.48 KVM_ASSIGN_PCI_DEVICE
+
+Capability: KVM_CAP_DEVICE_ASSIGNMENT
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Assigns a host PCI device to the VM.
+
+struct kvm_assigned_pci_dev {
+ __u32 assigned_dev_id;
+ __u32 busnr;
+ __u32 devfn;
+ __u32 flags;
+ __u32 segnr;
+ union {
+ __u32 reserved[11];
+ };
+};
+
+The PCI device is specified by the triple segnr, busnr, and devfn.
+Identification in succeeding service requests is done via assigned_dev_id. The
+following flags are specified:
+
+/* Depends on KVM_CAP_IOMMU */
+#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
+assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
+
+The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
+isolation of the device. Usages not specifying this flag are deprecated.
+
+Only PCI header type 0 devices with PCI BAR resources are supported by
+device assignment. The user requesting this ioctl must have read/write
+access to the PCI sysfs resource files associated with the device.
+
+4.49 KVM_DEASSIGN_PCI_DEVICE
+
+Capability: KVM_CAP_DEVICE_DEASSIGNMENT
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Ends PCI device assignment, releasing all associated resources.
+
+See KVM_CAP_DEVICE_ASSIGNMENT for the data structure. Only assigned_dev_id is
+used in kvm_assigned_pci_dev to identify the device.
+
+4.50 KVM_ASSIGN_DEV_IRQ
+
+Capability: KVM_CAP_ASSIGN_DEV_IRQ
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_irq (in)
+Returns: 0 on success, -1 on error
+
+Assigns an IRQ to a passed-through device.
+
+struct kvm_assigned_irq {
+ __u32 assigned_dev_id;
+ __u32 host_irq; /* ignored (legacy field) */
+ __u32 guest_irq;
+ __u32 flags;
+ union {
+ __u32 reserved[12];
+ };
+};
+
+The following flags are defined:
+
+#define KVM_DEV_IRQ_HOST_INTX (1 << 0)
+#define KVM_DEV_IRQ_HOST_MSI (1 << 1)
+#define KVM_DEV_IRQ_HOST_MSIX (1 << 2)
+
+#define KVM_DEV_IRQ_GUEST_INTX (1 << 8)
+#define KVM_DEV_IRQ_GUEST_MSI (1 << 9)
+#define KVM_DEV_IRQ_GUEST_MSIX (1 << 10)
+
+It is not valid to specify multiple types per host or guest IRQ. However, the
+IRQ type of host and guest can differ or can even be null.
+
+4.51 KVM_DEASSIGN_DEV_IRQ
+
+Capability: KVM_CAP_ASSIGN_DEV_IRQ
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_irq (in)
+Returns: 0 on success, -1 on error
+
+Ends an IRQ assignment to a passed-through device.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id, flags must correspond to the IRQ type specified on
+KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest IRQ is allowed.
+
+4.52 KVM_SET_GSI_ROUTING
+
+Capability: KVM_CAP_IRQ_ROUTING
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_irq_routing (in)
+Returns: 0 on success, -1 on error
+
+Sets the GSI routing table entries, overwriting any previously set entries.
+
+struct kvm_irq_routing {
+ __u32 nr;
+ __u32 flags;
+ struct kvm_irq_routing_entry entries[0];
+};
+
+No flags are specified so far, the corresponding field must be set to zero.
+
+struct kvm_irq_routing_entry {
+ __u32 gsi;
+ __u32 type;
+ __u32 flags;
+ __u32 pad;
+ union {
+ struct kvm_irq_routing_irqchip irqchip;
+ struct kvm_irq_routing_msi msi;
+ __u32 pad[8];
+ } u;
+};
+
+/* gsi routing entry types */
+#define KVM_IRQ_ROUTING_IRQCHIP 1
+#define KVM_IRQ_ROUTING_MSI 2
+
+No flags are specified so far, the corresponding field must be set to zero.
+
+struct kvm_irq_routing_irqchip {
+ __u32 irqchip;
+ __u32 pin;
+};
+
+struct kvm_irq_routing_msi {
+ __u32 address_lo;
+ __u32 address_hi;
+ __u32 data;
+ __u32 pad;
+};
+
+4.53 KVM_ASSIGN_SET_MSIX_NR
+
+Capability: KVM_CAP_DEVICE_MSIX
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_nr (in)
+Returns: 0 on success, -1 on error
+
+Set the number of MSI-X interrupts for an assigned device. The number is
+reset again by terminating the MSI-X assignment of the device via
+KVM_DEASSIGN_DEV_IRQ. Calling this service more than once at any earlier
+point will fail.
+
+struct kvm_assigned_msix_nr {
+ __u32 assigned_dev_id;
+ __u16 entry_nr;
+ __u16 padding;
+};
+
+#define KVM_MAX_MSIX_PER_DEV 256
+
+4.54 KVM_ASSIGN_SET_MSIX_ENTRY
+
+Capability: KVM_CAP_DEVICE_MSIX
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_entry (in)
+Returns: 0 on success, -1 on error
+
+Specifies the routing of an MSI-X assigned device interrupt to a GSI. Setting
+the GSI vector to zero means disabling the interrupt.
+
+struct kvm_assigned_msix_entry {
+ __u32 assigned_dev_id;
+ __u32 gsi;
+ __u16 entry; /* The index of entry in the MSI-X table */
+ __u16 padding[3];
+};
+
+4.54 KVM_SET_TSC_KHZ
+
+Capability: KVM_CAP_TSC_CONTROL
+Architectures: x86
+Type: vcpu ioctl
+Parameters: virtual tsc_khz
+Returns: 0 on success, -1 on error
+
+Specifies the tsc frequency for the virtual machine. The unit of the
+frequency is KHz.
+
+4.55 KVM_GET_TSC_KHZ
+
+Capability: KVM_CAP_GET_TSC_KHZ
+Architectures: x86
+Type: vcpu ioctl
+Parameters: none
+Returns: virtual tsc-khz on success, negative value on error
+
+Returns the tsc frequency of the guest. The unit of the return value is
+KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
+error.
+
+4.56 KVM_GET_LAPIC
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_lapic_state (out)
+Returns: 0 on success, -1 on error
+
+#define KVM_APIC_REG_SIZE 0x400
+struct kvm_lapic_state {
+ char regs[KVM_APIC_REG_SIZE];
+};
+
+Reads the Local APIC registers and copies them into the input argument. The
+data format and layout are the same as documented in the architecture manual.
+
+4.57 KVM_SET_LAPIC
+
+Capability: KVM_CAP_IRQCHIP
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_lapic_state (in)
+Returns: 0 on success, -1 on error
+
+#define KVM_APIC_REG_SIZE 0x400
+struct kvm_lapic_state {
+ char regs[KVM_APIC_REG_SIZE];
+};
+
+Copies the input argument into the the Local APIC registers. The data format
+and layout are the same as documented in the architecture manual.
+
+4.58 KVM_IOEVENTFD
+
+Capability: KVM_CAP_IOEVENTFD
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_ioeventfd (in)
+Returns: 0 on success, !0 on error
+
+This ioctl attaches or detaches an ioeventfd to a legal pio/mmio address
+within the guest. A guest write in the registered address will signal the
+provided event instead of triggering an exit.
+
+struct kvm_ioeventfd {
+ __u64 datamatch;
+ __u64 addr; /* legal pio/mmio address */
+ __u32 len; /* 1, 2, 4, or 8 bytes */
+ __s32 fd;
+ __u32 flags;
+ __u8 pad[36];
+};
+
+The following flags are defined:
+
+#define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch)
+#define KVM_IOEVENTFD_FLAG_PIO (1 << kvm_ioeventfd_flag_nr_pio)
+#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
+
+If datamatch flag is set, the event will be signaled only if the written value
+to the registered address is equal to datamatch in struct kvm_ioeventfd.
+
+4.59 KVM_DIRTY_TLB
+
+Capability: KVM_CAP_SW_TLB
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_dirty_tlb (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_dirty_tlb {
+ __u64 bitmap;
+ __u32 num_dirty;
+};
+
+This must be called whenever userspace has changed an entry in the shared
+TLB, prior to calling KVM_RUN on the associated vcpu.
+
+The "bitmap" field is the userspace address of an array. This array
+consists of a number of bits, equal to the total number of TLB entries as
+determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
+nearest multiple of 64.
+
+Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
+array.
+
+The array is little-endian: the bit 0 is the least significant bit of the
+first byte, bit 8 is the least significant bit of the second byte, etc.
+This avoids any complications with differing word sizes.
+
+The "num_dirty" field is a performance hint for KVM to determine whether it
+should skip processing the bitmap and just invalidate everything. It must
+be set to the number of set bits in the bitmap.
+
+4.60 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Allows userspace to mask PCI INTx interrupts from the assigned device. The
+kernel will not deliver INTx interrupts to the guest between setting and
+clearing of KVM_ASSIGN_SET_INTX_MASK via this interface. This enables use of
+and emulation of PCI 2.3 INTx disable command register behavior.
+
+This may be used for both PCI 2.3 devices supporting INTx disable natively and
+older devices lacking this support. Userspace is responsible for emulating the
+read value of the INTx disable bit in the guest visible PCI command register.
+When modifying the INTx disable state, userspace should precede updating the
+physical device command register by calling this ioctl to inform the kernel of
+the new intended INTx mask state.
+
+Note that the kernel uses the device INTx disable bit to internally manage the
+device interrupt state for PCI 2.3 devices. Reads of this register may
+therefore not match the expected value. Writes should always use the guest
+intended INTx disable value rather than attempting to read-copy-update the
+current physical device state. Races between user and kernel updates to the
+INTx disable bit are handled lazily in the kernel. It's possible the device
+may generate unintended interrupts, but they will not be injected into the
+guest.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
+4.62 KVM_CREATE_SPAPR_TCE
+
+Capability: KVM_CAP_SPAPR_TCE
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce (in)
+Returns: file descriptor for manipulating the created TCE table
+
+This creates a virtual TCE (translation control entry) table, which
+is an IOMMU for PAPR-style virtual I/O. It is used to translate
+logical addresses used in virtual I/O into guest physical addresses,
+and provides a scatter/gather capability for PAPR virtual I/O.
+
+/* for KVM_CAP_SPAPR_TCE */
+struct kvm_create_spapr_tce {
+ __u64 liobn;
+ __u32 window_size;
+};
+
+The liobn field gives the logical IO bus number for which to create a
+TCE table. The window_size field specifies the size of the DMA window
+which this TCE table will translate - the table will contain one 64
+bit TCE entry for every 4kiB of the DMA window.
+
+When the guest issues an H_PUT_TCE hcall on a liobn for which a TCE
+table has been created using this ioctl(), the kernel will handle it
+in real mode, updating the TCE table. H_PUT_TCE calls for other
+liobns will cause a vm exit and must be handled by userspace.
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the created TCE table into userspace. This lets userspace read
+the entries written by kernel-handled H_PUT_TCE calls, and also lets
+userspace update the TCE table directly which is useful in some
+circumstances.
+
+4.63 KVM_ALLOCATE_RMA
+
+Capability: KVM_CAP_PPC_RMA
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_allocate_rma (out)
+Returns: file descriptor for mapping the allocated RMA
+
+This allocates a Real Mode Area (RMA) from the pool allocated at boot
+time by the kernel. An RMA is a physically-contiguous, aligned region
+of memory used on older POWER processors to provide the memory which
+will be accessed by real-mode (MMU off) accesses in a KVM guest.
+POWER processors support a set of sizes for the RMA that usually
+includes 64MB, 128MB, 256MB and some larger powers of two.
+
+/* for KVM_ALLOCATE_RMA */
+struct kvm_allocate_rma {
+ __u64 rma_size;
+};
+
+The return value is a file descriptor which can be passed to mmap(2)
+to map the allocated RMA into userspace. The mapped area can then be
+passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the
+RMA for a virtual machine. The size of the RMA in bytes (which is
+fixed at host kernel boot time) is returned in the rma_size field of
+the argument structure.
+
+The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl
+is supported; 2 if the processor requires all virtual machines to have
+an RMA, or 1 if the processor can use an RMA but doesn't require it,
+because it supports the Virtual RMA (VRMA) facility.
+
+4.64 KVM_NMI
+
+Capability: KVM_CAP_USER_NMI
+Architectures: x86
+Type: vcpu ioctl
+Parameters: none
+Returns: 0 on success, -1 on error
+
+Queues an NMI on the thread's vcpu. Note this is well defined only
+when KVM_CREATE_IRQCHIP has not been called, since this is an interface
+between the virtual cpu core and virtual local APIC. After KVM_CREATE_IRQCHIP
+has been called, this interface is completely emulated within the kernel.
+
+To use this to emulate the LINT1 input with KVM_CREATE_IRQCHIP, use the
+following algorithm:
+
+ - pause the vpcu
+ - read the local APIC's state (KVM_GET_LAPIC)
+ - check whether changing LINT1 will queue an NMI (see the LVT entry for LINT1)
+ - if so, issue KVM_NMI
+ - resume the vcpu
+
+Some guests configure the LINT1 NMI input to cause a panic, aiding in
+debugging.
+
+4.65 KVM_S390_UCAS_MAP
+
+Capability: KVM_CAP_S390_UCONTROL
+Architectures: s390
+Type: vcpu ioctl
+Parameters: struct kvm_s390_ucas_mapping (in)
+Returns: 0 in case of success
+
+The parameter is defined like this:
+ struct kvm_s390_ucas_mapping {
+ __u64 user_addr;
+ __u64 vcpu_addr;
+ __u64 length;
+ };
+
+This ioctl maps the memory at "user_addr" with the length "length" to
+the vcpu's address space starting at "vcpu_addr". All parameters need to
+be alligned by 1 megabyte.
+
+4.66 KVM_S390_UCAS_UNMAP
+
+Capability: KVM_CAP_S390_UCONTROL
+Architectures: s390
+Type: vcpu ioctl
+Parameters: struct kvm_s390_ucas_mapping (in)
+Returns: 0 in case of success
+
+The parameter is defined like this:
+ struct kvm_s390_ucas_mapping {
+ __u64 user_addr;
+ __u64 vcpu_addr;
+ __u64 length;
+ };
+
+This ioctl unmaps the memory in the vcpu's address space starting at
+"vcpu_addr" with the length "length". The field "user_addr" is ignored.
+All parameters need to be alligned by 1 megabyte.
+
+4.67 KVM_S390_VCPU_FAULT
+
+Capability: KVM_CAP_S390_UCONTROL
+Architectures: s390
+Type: vcpu ioctl
+Parameters: vcpu absolute address (in)
+Returns: 0 in case of success
+
+This call creates a page table entry on the virtual cpu's address space
+(for user controlled virtual machines) or the virtual machine's address
+space (for regular virtual machines). This only works for minor faults,
+thus it's recommended to access subject memory page via the user page
+table upfront. This is useful to handle validity intercepts for user
+controlled virtual machines to fault in the virtual cpu's lowcore pages
+prior to calling the KVM_RUN ioctl.
+
+4.68 KVM_SET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in)
+Returns: 0 on success, negative value on failure
+
+struct kvm_one_reg {
+ __u64 id;
+ __u64 addr;
+};
+
+Using this ioctl, a single vcpu register can be set to a specific value
+defined by user space with the passed in struct kvm_one_reg, where id
+refers to the register identifier as described below and addr is a pointer
+to a variable with the respective size. There can be architecture agnostic
+and architecture specific registers. Each have their own range of operation
+and their own constants and width. To keep track of the implemented
+registers, find a list below:
+
+ Arch | Register | Width (bits)
+ | |
+ PPC | KVM_REG_PPC_HIOR | 64
+
+4.69 KVM_GET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in and out)
+Returns: 0 on success, negative value on failure
+
+This ioctl allows to receive the value of a single register implemented
+in a vcpu. The register to read is indicated by the "id" field of the
+kvm_one_reg struct passed in. On success, the register value can be found
+at the memory location pointed to by "addr".
+
+The list of registers accessible using this interface is identical to the
+list in 4.64.
+
+5. The kvm_run structure
+
+Application code obtains a pointer to the kvm_run structure by
+mmap()ing a vcpu fd. From that point, application code can control
+execution by changing fields in kvm_run prior to calling the KVM_RUN
+ioctl, and obtain information about the reason KVM_RUN returned by
+looking up structure members.
+
+struct kvm_run {
+ /* in */
+ __u8 request_interrupt_window;
+
+Request that KVM_RUN return when it becomes possible to inject external
+interrupts into the guest. Useful in conjunction with KVM_INTERRUPT.
+
+ __u8 padding1[7];
+
+ /* out */
+ __u32 exit_reason;
+
+When KVM_RUN has returned successfully (return value 0), this informs
+application code why KVM_RUN has returned. Allowable values for this
+field are detailed below.
+
+ __u8 ready_for_interrupt_injection;
+
+If request_interrupt_window has been specified, this field indicates
+an interrupt can be injected now with KVM_INTERRUPT.
+
+ __u8 if_flag;
+
+The value of the current interrupt flag. Only valid if in-kernel
+local APIC is not used.
+
+ __u8 padding2[2];
+
+ /* in (pre_kvm_run), out (post_kvm_run) */
+ __u64 cr8;
+
+The value of the cr8 register. Only valid if in-kernel local APIC is
+not used. Both input and output.
+
+ __u64 apic_base;
+
+The value of the APIC BASE msr. Only valid if in-kernel local
+APIC is not used. Both input and output.
+
+ union {
+ /* KVM_EXIT_UNKNOWN */
+ struct {
+ __u64 hardware_exit_reason;
+ } hw;
+
+If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
+reasons. Further architecture-specific information is available in
+hardware_exit_reason.
+
+ /* KVM_EXIT_FAIL_ENTRY */
+ struct {
+ __u64 hardware_entry_failure_reason;
+ } fail_entry;
+
+If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
+to unknown reasons. Further architecture-specific information is
+available in hardware_entry_failure_reason.
+
+ /* KVM_EXIT_EXCEPTION */
+ struct {
+ __u32 exception;
+ __u32 error_code;
+ } ex;
+
+Unused.
+
+ /* KVM_EXIT_IO */
+ struct {
+#define KVM_EXIT_IO_IN 0
+#define KVM_EXIT_IO_OUT 1
+ __u8 direction;
+ __u8 size; /* bytes */
+ __u16 port;
+ __u32 count;
+ __u64 data_offset; /* relative to kvm_run start */
+ } io;
+
+If exit_reason is KVM_EXIT_IO, then the vcpu has
+executed a port I/O instruction which could not be satisfied by kvm.
+data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
+where kvm expects application code to place the data for the next
+KVM_RUN invocation (KVM_EXIT_IO_IN). Data format is a packed array.
+
+ struct {
+ struct kvm_debug_exit_arch arch;
+ } debug;
+
+Unused.
+
+ /* KVM_EXIT_MMIO */
+ struct {
+ __u64 phys_addr;
+ __u8 data[8];
+ __u32 len;
+ __u8 is_write;
+ } mmio;
+
+If exit_reason is KVM_EXIT_MMIO, then the vcpu has
+executed a memory-mapped I/O instruction which could not be satisfied
+by kvm. The 'data' member contains the written data if 'is_write' is
+true, and should be filled by application code otherwise.
+
+NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO and KVM_EXIT_OSI, the corresponding
+operations are complete (and guest state is consistent) only after userspace
+has re-entered the kernel with KVM_RUN. The kernel side will first finish
+incomplete operations and then check for pending signals. Userspace
+can re-enter the guest with an unmasked signal pending to complete
+pending operations.
+
+ /* KVM_EXIT_HYPERCALL */
+ struct {
+ __u64 nr;
+ __u64 args[6];
+ __u64 ret;
+ __u32 longmode;
+ __u32 pad;
+ } hypercall;
+
+Unused. This was once used for 'hypercall to userspace'. To implement
+such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
+Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
+
+ /* KVM_EXIT_TPR_ACCESS */
+ struct {
+ __u64 rip;
+ __u32 is_write;
+ __u32 pad;
+ } tpr_access;
+
+To be documented (KVM_TPR_ACCESS_REPORTING).
+
+ /* KVM_EXIT_S390_SIEIC */
+ struct {
+ __u8 icptcode;
+ __u64 mask; /* psw upper half */
+ __u64 addr; /* psw lower half */
+ __u16 ipa;
+ __u32 ipb;
+ } s390_sieic;
+
+s390 specific.
+
+ /* KVM_EXIT_S390_RESET */
+#define KVM_S390_RESET_POR 1
+#define KVM_S390_RESET_CLEAR 2
+#define KVM_S390_RESET_SUBSYSTEM 4
+#define KVM_S390_RESET_CPU_INIT 8
+#define KVM_S390_RESET_IPL 16
+ __u64 s390_reset_flags;
+
+s390 specific.
+
+ /* KVM_EXIT_S390_UCONTROL */
+ struct {
+ __u64 trans_exc_code;
+ __u32 pgm_code;
+ } s390_ucontrol;
+
+s390 specific. A page fault has occurred for a user controlled virtual
+machine (KVM_VM_S390_UNCONTROL) on it's host page table that cannot be
+resolved by the kernel.
+The program code and the translation exception code that were placed
+in the cpu's lowcore are presented here as defined by the z Architecture
+Principles of Operation Book in the Chapter for Dynamic Address Translation
+(DAT)
+
+ /* KVM_EXIT_DCR */
+ struct {
+ __u32 dcrn;
+ __u32 data;
+ __u8 is_write;
+ } dcr;
+
+powerpc specific.
+
+ /* KVM_EXIT_OSI */
+ struct {
+ __u64 gprs[32];
+ } osi;
+
+MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
+hypercalls and exit with this exit struct that contains all the guest gprs.
+
+If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
+Userspace can now handle the hypercall and when it's done modify the gprs as
+necessary. Upon guest entry all guest GPRs will then be replaced by the values
+in this struct.
+
+ /* KVM_EXIT_PAPR_HCALL */
+ struct {
+ __u64 nr;
+ __u64 ret;
+ __u64 args[9];
+ } papr_hcall;
+
+This is used on 64-bit PowerPC when emulating a pSeries partition,
+e.g. with the 'pseries' machine type in qemu. It occurs when the
+guest does a hypercall using the 'sc 1' instruction. The 'nr' field
+contains the hypercall number (from the guest R3), and 'args' contains
+the arguments (from the guest R4 - R12). Userspace should put the
+return code in 'ret' and any extra returned values in args[].
+The possible hypercalls are defined in the Power Architecture Platform
+Requirements (PAPR) document available from www.power.org (free
+developer registration required to access it).
+
+ /* Fix the size of the union. */
+ char padding[256];
+ };
+
+ /*
+ * shared registers between kvm and userspace.
+ * kvm_valid_regs specifies the register classes set by the host
+ * kvm_dirty_regs specified the register classes dirtied by userspace
+ * struct kvm_sync_regs is architecture specific, as well as the
+ * bits for kvm_valid_regs and kvm_dirty_regs
+ */
+ __u64 kvm_valid_regs;
+ __u64 kvm_dirty_regs;
+ union {
+ struct kvm_sync_regs regs;
+ char padding[1024];
+ } s;
+
+If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access
+certain guest registers without having to call SET/GET_*REGS. Thus we can
+avoid some system call overhead if userspace has to handle the exit.
+Userspace can query the validity of the structure by checking
+kvm_valid_regs for specific bits. These bits are architecture specific
+and usually define the validity of a groups of registers. (e.g. one bit
+ for general purpose registers)
+
+};
+
+6. Capabilities that can be enabled
+
+There are certain capabilities that change the behavior of the virtual CPU when
+enabled. To enable them, please see section 4.37. Below you can find a list of
+capabilities and what their effect on the vCPU is when enabling them.
+
+The following information is provided along with the description:
+
+ Architectures: which instruction set architectures provide this ioctl.
+ x86 includes both i386 and x86_64.
+
+ Parameters: what parameters are accepted by the capability.
+
+ Returns: the return value. General error numbers (EBADF, ENOMEM, EINVAL)
+ are not detailed, but errors with specific meanings are.
+
+6.1 KVM_CAP_PPC_OSI
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables interception of OSI hypercalls that otherwise would
+be treated as normal system calls to be injected into the guest. OSI hypercalls
+were invented by Mac-on-Linux to have a standardized communication mechanism
+between the guest and the host.
+
+When this capability is enabled, KVM_EXIT_OSI can occur.
+
+6.2 KVM_CAP_PPC_PAPR
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables interception of PAPR hypercalls. PAPR hypercalls are
+done using the hypercall instruction "sc 1".
+
+It also sets the guest privilege level to "supervisor" mode. Usually the guest
+runs in "hypervisor" privilege mode with a few missing features.
+
+In addition to the above, it changes the semantics of SDR1. In this mode, the
+HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
+HTAB invisible to the guest.
+
+When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
+
+6.3 KVM_CAP_SW_TLB
+
+Architectures: ppc
+Parameters: args[0] is the address of a struct kvm_config_tlb
+Returns: 0 on success; -1 on error
+
+struct kvm_config_tlb {
+ __u64 params;
+ __u64 array;
+ __u32 mmu_type;
+ __u32 array_len;
+};
+
+Configures the virtual CPU's TLB array, establishing a shared memory area
+between userspace and KVM. The "params" and "array" fields are userspace
+addresses of mmu-type-specific data structures. The "array_len" field is an
+safety mechanism, and should be set to the size in bytes of the memory that
+userspace has reserved for the array. It must be at least the size dictated
+by "mmu_type" and "params".
+
+While KVM_RUN is active, the shared region is under control of KVM. Its
+contents are undefined, and any modification by userspace results in
+boundedly undefined behavior.
+
+On return from KVM_RUN, the shared region will reflect the current state of
+the guest's TLB. If userspace makes any changes, it must call KVM_DIRTY_TLB
+to tell KVM which entries have been changed, prior to calling KVM_RUN again
+on this vcpu.
+
+For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+ - The "params" field is of type "struct kvm_book3e_206_tlb_params".
+ - The "array" field points to an array of type "struct
+ kvm_book3e_206_tlb_entry".
+ - The array consists of all entries in the first TLB, followed by all
+ entries in the second TLB.
+ - Within a TLB, entries are ordered first by increasing set number. Within a
+ set, entries are ordered by way (increasing ESEL).
+ - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
+ where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
+ - The tsize field of mas1 shall be set to 4K on TLB0, even though the
+ hardware ignores this value for TLB0.
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
new file mode 100644
index 00000000..88206853
--- /dev/null
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -0,0 +1,45 @@
+KVM CPUID bits
+Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
+=====================================================
+
+A guest running on a kvm host, can check some of its features using
+cpuid. This is not always guaranteed to work, since userspace can
+mask-out some, or even all KVM-related cpuid features before launching
+a guest.
+
+KVM cpuid functions are:
+
+function: KVM_CPUID_SIGNATURE (0x40000000)
+returns : eax = 0,
+ ebx = 0x4b4d564b,
+ ecx = 0x564b4d56,
+ edx = 0x4d.
+Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
+This function queries the presence of KVM cpuid leafs.
+
+
+function: define KVM_CPUID_FEATURES (0x40000001)
+returns : ebx, ecx, edx = 0
+ eax = and OR'ed group of (1 << flag), where each flags is:
+
+
+flag || value || meaning
+=============================================================================
+KVM_FEATURE_CLOCKSOURCE || 0 || kvmclock available at msrs
+ || || 0x11 and 0x12.
+------------------------------------------------------------------------------
+KVM_FEATURE_NOP_IO_DELAY || 1 || not necessary to perform delays
+ || || on PIO operations.
+------------------------------------------------------------------------------
+KVM_FEATURE_MMU_OP || 2 || deprecated.
+------------------------------------------------------------------------------
+KVM_FEATURE_CLOCKSOURCE2 || 3 || kvmclock available at msrs
+ || || 0x4b564d00 and 0x4b564d01
+------------------------------------------------------------------------------
+KVM_FEATURE_ASYNC_PF || 4 || async pf can be enabled by
+ || || writing to msr 0x4b564d02
+------------------------------------------------------------------------------
+KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
+ || || per-cpu warps are expected in
+ || || kvmclock.
+------------------------------------------------------------------------------
diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt
new file mode 100644
index 00000000..3b4cd3bf
--- /dev/null
+++ b/Documentation/virtual/kvm/locking.txt
@@ -0,0 +1,25 @@
+KVM Lock Overview
+=================
+
+1. Acquisition Orders
+---------------------
+
+(to be written)
+
+2. Reference
+------------
+
+Name: kvm_lock
+Type: raw_spinlock
+Arch: any
+Protects: - vm_list
+ - hardware virtualization enable/disable
+Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
+ migration.
+
+Name: kvm_arch::tsc_write_lock
+Type: raw_spinlock
+Arch: x86
+Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
+ - tsc offset in vmcb
+Comment: 'raw' because updating the tsc offsets must not be preempted.
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
new file mode 100644
index 00000000..fa5f1dbc
--- /dev/null
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -0,0 +1,366 @@
+The x86 kvm shadow mmu
+======================
+
+The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
+for presenting a standard x86 mmu to the guest, while translating guest
+physical addresses to host physical addresses.
+
+The mmu code attempts to satisfy the following requirements:
+
+- correctness: the guest should not be able to determine that it is running
+ on an emulated mmu except for timing (we attempt to comply
+ with the specification, not emulate the characteristics of
+ a particular implementation such as tlb size)
+- security: the guest must not be able to touch host memory not assigned
+ to it
+- performance: minimize the performance penalty imposed by the mmu
+- scaling: need to scale to large memory and large vcpu guests
+- hardware: support the full range of x86 virtualization hardware
+- integration: Linux memory management code must be in control of guest memory
+ so that swapping, page migration, page merging, transparent
+ hugepages, and similar features work without change
+- dirty tracking: report writes to guest memory to enable live migration
+ and framebuffer-based displays
+- footprint: keep the amount of pinned kernel memory low (most memory
+ should be shrinkable)
+- reliability: avoid multipage or GFP_ATOMIC allocations
+
+Acronyms
+========
+
+pfn host page frame number
+hpa host physical address
+hva host virtual address
+gfn guest frame number
+gpa guest physical address
+gva guest virtual address
+ngpa nested guest physical address
+ngva nested guest virtual address
+pte page table entry (used also to refer generically to paging structure
+ entries)
+gpte guest pte (referring to gfns)
+spte shadow pte (referring to pfns)
+tdp two dimensional paging (vendor neutral term for NPT and EPT)
+
+Virtual and real hardware supported
+===================================
+
+The mmu supports first-generation mmu hardware, which allows an atomic switch
+of the current paging mode and cr3 during guest entry, as well as
+two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware
+it exposes is the traditional 2/3/4 level x86 mmu, with support for global
+pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support
+exposing NPT capable hardware on NPT capable hosts.
+
+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest. Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+ host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+ guest physical addresses, to host physical addresses (gva->gpa->hpa)
+- when the guest launches a guest of its own, we translate nested guest
+ virtual addresses, to nested guest physical addresses, to guest physical
+ addresses, to host physical addresses (ngva->ngpa->gpa->hpa)
+
+The primary challenge is to encode between 1 and 3 translations into hardware
+that support only 1 (traditional) and 2 (tdp) translations. When the
+number of required translations matches the hardware, the mmu operates in
+direct mode; otherwise it operates in shadow mode (see below).
+
+Memory
+======
+
+Guest memory (gpa) is part of the user address space of the process that is
+using kvm. Userspace defines the translation between guest addresses and user
+addresses (gpa->hva); note that two gpas may alias to the same hva, but not
+vice versa.
+
+These hvas may be backed using any method available to the host: anonymous
+memory, file backed memory, and device memory. Memory might be paged by the
+host at any time.
+
+Events
+======
+
+The mmu is driven by events, some from the guest, some from the host.
+
+Guest generated events:
+- writes to control registers (especially cr3)
+- invlpg/invlpga instruction execution
+- access to missing or protected translations
+
+Host generated events:
+- changes in the gpa->hpa translation (either through gpa->hva changes or
+ through hva->hpa changes)
+- memory pressure (the shrinker)
+
+Shadow pages
+============
+
+The principal data structure is the shadow page, 'struct kvm_mmu_page'. A
+shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A
+shadow page may contain a mix of leaf and nonleaf sptes.
+
+A nonleaf spte allows the hardware mmu to reach the leaf pages and
+is not related to a translation directly. It points to other shadow pages.
+
+A leaf spte corresponds to either one or two translations encoded into
+one paging structure entry. These are always the lowest level of the
+translation stack, with optional higher level translations left to NPT/EPT.
+Leaf ptes point at guest pages.
+
+The following table shows translations encoded by leaf ptes, with higher-level
+translations in parentheses:
+
+ Non-nested guests:
+ nonpaging: gpa->hpa
+ paging: gva->gpa->hpa
+ paging, tdp: (gva->)gpa->hpa
+ Nested guests:
+ non-tdp: ngva->gpa->hpa (*)
+ tdp: (ngva->)ngpa->gpa->hpa
+
+(*) the guest hypervisor will encode the ngva->gpa translation into its page
+ tables if npt is not present
+
+Shadow pages contain the following information:
+ role.level:
+ The level in the shadow paging hierarchy that this shadow page belongs to.
+ 1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
+ role.direct:
+ If set, leaf sptes reachable from this page are for a linear range.
+ Examples include real mode translation, large guest pages backed by small
+ host pages, and gpa->hpa translations when NPT or EPT is active.
+ The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
+ by role.level (2MB for first level, 1GB for second level, 0.5TB for third
+ level, 256TB for fourth level)
+ If clear, this page corresponds to a guest page table denoted by the gfn
+ field.
+ role.quadrant:
+ When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
+ sptes. That means a guest page table contains more ptes than the host,
+ so multiple shadow pages are needed to shadow one guest page.
+ For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
+ first or second 512-gpte block in the guest page table. For second-level
+ page tables, each 32-bit gpte is converted to two 64-bit sptes
+ (since each first-level guest page is shadowed by two first-level
+ shadow pages) so role.quadrant takes values in the range 0..3. Each
+ quadrant maps 1GB virtual address space.
+ role.access:
+ Inherited guest access permissions in the form uwx. Note execute
+ permission is positive, not negative.
+ role.invalid:
+ The page is invalid and should not be used. It is a root page that is
+ currently pinned (by a cpu hardware register pointing to it); once it is
+ unpinned it will be destroyed.
+ role.cr4_pae:
+ Contains the value of cr4.pae for which the page is valid (e.g. whether
+ 32-bit or 64-bit gptes are in use).
+ role.nxe:
+ Contains the value of efer.nxe for which the page is valid.
+ role.cr0_wp:
+ Contains the value of cr0.wp for which the page is valid.
+ role.smep_andnot_wp:
+ Contains the value of cr4.smep && !cr0.wp for which the page is valid
+ (pages for which this is true are different from other pages; see the
+ treatment of cr0.wp=0 below).
+ gfn:
+ Either the guest page table containing the translations shadowed by this
+ page, or the base page frame for linear translations. See role.direct.
+ spt:
+ A pageful of 64-bit sptes containing the translations for this page.
+ Accessed by both kvm and hardware.
+ The page pointed to by spt will have its page->private pointing back
+ at the shadow page structure.
+ sptes in spt point either at guest pages, or at lower-level shadow pages.
+ Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
+ at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
+ The spt array forms a DAG structure with the shadow page as a node, and
+ guest pages as leaves.
+ gfns:
+ An array of 512 guest frame numbers, one for each present pte. Used to
+ perform a reverse map from a pte to a gfn. When role.direct is set, any
+ element of this array can be calculated from the gfn field when used, in
+ this case, the array of gfns is not allocated. See role.direct and gfn.
+ slot_bitmap:
+ A bitmap containing one bit per memory slot. If the page contains a pte
+ mapping a page from memory slot n, then bit n of slot_bitmap will be set
+ (if a page is aliased among several slots, then it is not guaranteed that
+ all slots will be marked).
+ Used during dirty logging to avoid scanning a shadow page if none if its
+ pages need tracking.
+ root_count:
+ A counter keeping track of how many hardware registers (guest cr3 or
+ pdptrs) are now pointing at the page. While this counter is nonzero, the
+ page cannot be destroyed. See role.invalid.
+ multimapped:
+ Whether there exist multiple sptes pointing at this page.
+ parent_pte/parent_ptes:
+ If multimapped is zero, parent_pte points at the single spte that points at
+ this page's spt. Otherwise, parent_ptes points at a data structure
+ with a list of parent_ptes.
+ unsync:
+ If true, then the translations in this page may not match the guest's
+ translation. This is equivalent to the state of the tlb when a pte is
+ changed but before the tlb entry is flushed. Accordingly, unsync ptes
+ are synchronized when the guest executes invlpg or flushes its tlb by
+ other means. Valid for leaf pages.
+ unsync_children:
+ How many sptes in the page point at pages that are unsync (or have
+ unsynchronized children).
+ unsync_child_bitmap:
+ A bitmap indicating which sptes in spt point (directly or indirectly) at
+ pages that may be unsynchronized. Used to quickly locate all unsychronized
+ pages reachable from a given page.
+
+Reverse map
+===========
+
+The mmu maintains a reverse mapping whereby all ptes mapping a page can be
+reached given its gfn. This is used, for example, when swapping out a page.
+
+Synchronized and unsynchronized pages
+=====================================
+
+The guest uses two events to synchronize its tlb and page tables: tlb flushes
+and page invalidations (invlpg).
+
+A tlb flush means that we need to synchronize all sptes reachable from the
+guest's cr3. This is expensive, so we keep all guest page tables write
+protected, and synchronize sptes to gptes when a gpte is written.
+
+A special case is when a guest page table is reachable from the current
+guest cr3. In this case, the guest is obliged to issue an invlpg instruction
+before using the translation. We take advantage of that by removing write
+protection from the guest page, and allowing the guest to modify it freely.
+We synchronize modified gptes when the guest invokes invlpg. This reduces
+the amount of emulation we have to do when the guest modifies multiple gptes,
+or when the a guest page is no longer used as a page table and is used for
+random guest data.
+
+As a side effect we have to resynchronize all reachable unsynchronized shadow
+pages on a tlb flush.
+
+
+Reaction to events
+==================
+
+- guest page fault (or npt page fault, or ept violation)
+
+This is the most complicated event. The cause of a page fault can be:
+
+ - a true guest fault (the guest translation won't allow the access) (*)
+ - access to a missing translation
+ - access to a protected translation
+ - when logging dirty pages, memory is write protected
+ - synchronized shadow pages are write protected (*)
+ - access to untranslatable memory (mmio)
+
+ (*) not applicable in direct mode
+
+Handling a page fault is performed as follows:
+
+ - if needed, walk the guest page tables to determine the guest translation
+ (gva->gpa or ngpa->gpa)
+ - if permissions are insufficient, reflect the fault back to the guest
+ - determine the host page
+ - if this is an mmio request, there is no host page; call the emulator
+ to emulate the instruction instead
+ - walk the shadow page table to find the spte for the translation,
+ instantiating missing intermediate page tables as necessary
+ - try to unsynchronize the page
+ - if successful, we can let the guest continue and modify the gpte
+ - emulate the instruction
+ - if failed, unshadow the page and let the guest continue
+ - update any translations that were modified by the instruction
+
+invlpg handling:
+
+ - walk the shadow page hierarchy and drop affected translations
+ - try to reinstantiate the indicated translation in the hope that the
+ guest will use it in the near future
+
+Guest control register updates:
+
+- mov to cr3
+ - look up new shadow roots
+ - synchronize newly reachable shadow pages
+
+- mov to cr0/cr4/efer
+ - set up mmu context for new paging mode
+ - look up new shadow roots
+ - synchronize newly reachable shadow pages
+
+Host translation updates:
+
+ - mmu notifier called with updated hva
+ - look up affected sptes through reverse map
+ - drop (or update) translations
+
+Emulating cr0.wp
+================
+
+If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
+works for the guest kernel, not guest guest userspace. When the guest
+cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
+we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
+semantics require allowing any guest kernel access plus user read access).
+
+We handle this by mapping the permissions to two possible sptes, depending
+on fault type:
+
+- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
+ disallows user access)
+- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
+ write access)
+
+(user write faults generate a #PF)
+
+In the first case there is an additional complication if CR4.SMEP is
+enabled: since we've turned the page into a kernel page, the kernel may now
+execute it. We handle this by also setting spte.nx. If we get a user
+fetch or read fault, we'll change spte.u=1 and spte.nx=gpte.nx back.
+
+To prevent an spte that was converted into a kernel page with cr0.wp=0
+from being written by the kernel after cr0.wp has changed to 1, we make
+the value of cr0.wp part of the page role. This means that an spte created
+with one value of cr0.wp cannot be used when cr0.wp has a different value -
+it will simply be missed by the shadow page lookup code. A similar issue
+exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
+changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep
+is also made a part of the page role.
+
+Large pages
+===========
+
+The mmu supports all combinations of large and small guest and host pages.
+Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
+two separate 2M pages, on both guest and host, since the mmu always uses PAE
+paging.
+
+To instantiate a large spte, four constraints must be satisfied:
+
+- the spte must point to a large host page
+- the guest pte must be a large pte of at least equivalent size (if tdp is
+ enabled, there is no guest pte and this condition is satisfied)
+- if the spte will be writeable, the large page frame may not overlap any
+ write-protected pages
+- the guest page must be wholly contained by a single memory slot
+
+To check the last two conditions, the mmu maintains a ->write_count set of
+arrays for each memory slot and large page size. Every write protected page
+causes its write_count to be incremented, thus preventing instantiation of
+a large spte. The frames at the end of an unaligned memory slot have
+artificially inflated ->write_counts so they can never be instantiated.
+
+Further reading
+===============
+
+- NPT presentation from KVM Forum 2008
+ http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf
+
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
new file mode 100644
index 00000000..50317809
--- /dev/null
+++ b/Documentation/virtual/kvm/msr.txt
@@ -0,0 +1,221 @@
+KVM-specific MSRs.
+Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
+=====================================================
+
+KVM makes use of some custom MSRs to service some requests.
+
+Custom MSRs have a range reserved for them, that goes from
+0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
+but they are deprecated and their use is discouraged.
+
+Custom MSR list
+--------
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
+
+ data: 4-byte alignment physical address of a memory area which must be
+ in guest RAM. This memory is expected to hold a copy of the following
+ structure:
+
+ struct pvclock_wall_clock {
+ u32 version;
+ u32 sec;
+ u32 nsec;
+ } __attribute__((__packed__));
+
+ whose data will be filled in by the hypervisor. The hypervisor is only
+ guaranteed to update this data at the moment of MSR write.
+ Users that want to reliably query this information more than once have
+ to write more than once to this MSR. Fields have the following meanings:
+
+ version: guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ sec: number of seconds for wallclock.
+
+ nsec: number of nanoseconds for wallclock.
+
+ Note that although MSRs are per-CPU entities, the effect of this
+ particular MSR is global.
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
+
+ data: 4-byte aligned physical address of a memory area which must be in
+ guest RAM, plus an enable bit in bit 0. This memory is expected to hold
+ a copy of the following structure:
+
+ struct pvclock_vcpu_time_info {
+ u32 version;
+ u32 pad0;
+ u64 tsc_timestamp;
+ u64 system_time;
+ u32 tsc_to_system_mul;
+ s8 tsc_shift;
+ u8 flags;
+ u8 pad[2];
+ } __attribute__((__packed__)); /* 32 bytes */
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it.
+
+ Fields have the following meanings:
+
+ version: guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ tsc_timestamp: the tsc value at the current VCPU at the time
+ of the update of this structure. Guests can subtract this value
+ from current tsc to derive a notion of elapsed time since the
+ structure update.
+
+ system_time: a host notion of monotonic time, including sleep
+ time at the time this structure was last updated. Unit is
+ nanoseconds.
+
+ tsc_to_system_mul: a function of the tsc frequency. One has
+ to multiply any tsc-related quantity by this value to get
+ a value in nanoseconds, besides dividing by 2^tsc_shift
+
+ tsc_shift: cycle to nanosecond divider, as a power of two, to
+ allow for shift rights. One has to shift right any tsc-related
+ quantity by this value to get a value in nanoseconds, besides
+ multiplying by tsc_to_system_mul.
+
+ With this information, guests can derive per-CPU time by
+ doing:
+
+ time = (current_tsc - tsc_timestamp)
+ time = (time * tsc_to_system_mul) >> tsc_shift
+ time = time + system_time
+
+ flags: bits in this field indicate extended capabilities
+ coordinated between the guest and the hypervisor. Availability
+ of specific flags has to be checked in 0x40000001 cpuid leaf.
+ Current flags are:
+
+ flag bit | cpuid bit | meaning
+ -------------------------------------------------------------
+ | | time measures taken across
+ 0 | 24 | multiple cpus are guaranteed to
+ | | be monotonic
+ -------------------------------------------------------------
+
+ Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid
+ leaf prior to usage.
+
+
+MSR_KVM_WALL_CLOCK: 0x11
+
+ data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+MSR_KVM_SYSTEM_TIME: 0x12
+
+ data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
+
+ This MSR falls outside the reserved KVM range and may be removed in the
+ future. Its usage is deprecated.
+
+ Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid
+ leaf prior to usage.
+
+ The suggested algorithm for detecting kvmclock presence is then:
+
+ if (!kvm_para_available()) /* refer to cpuid.txt */
+ return NON_PRESENT;
+
+ flags = cpuid_eax(0x40000001);
+ if (flags & 3) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+ return PRESENT;
+ } else if (flags & 0) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+ return PRESENT;
+ } else
+ return NON_PRESENT;
+
+MSR_KVM_ASYNC_PF_EN: 0x4b564d02
+ data: Bits 63-6 hold 64-byte aligned physical address of a
+ 64 byte memory area which must be in guest RAM and must be
+ zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
+ when asynchronous page faults are enabled on the vcpu 0 when
+ disabled. Bit 2 is 1 if asynchronous page faults can be injected
+ when vcpu is in cpl == 0.
+
+ First 4 byte of 64 byte memory location will be written to by
+ the hypervisor at the time of asynchronous page fault (APF)
+ injection to indicate type of asynchronous page fault. Value
+ of 1 means that the page referred to by the page fault is not
+ present. Value 2 means that the page is now available. Disabling
+ interrupt inhibits APFs. Guest must not enable interrupt
+ before the reason is read, or it may be overwritten by another
+ APF. Since APF uses the same exception vector as regular page
+ fault guest must reset the reason to 0 before it does
+ something that can generate normal page fault. If during page
+ fault APF reason is 0 it means that this is regular page
+ fault.
+
+ During delivery of type 1 APF cr2 contains a token that will
+ be used to notify a guest when missing page becomes
+ available. When page becomes available type 2 APF is sent with
+ cr2 set to the token associated with the page. There is special
+ kind of token 0xffffffff which tells vcpu that it should wake
+ up all processes waiting for APFs and no individual type 2 APFs
+ will be sent.
+
+ If APF is disabled while there are outstanding APFs, they will
+ not be delivered.
+
+ Currently type 2 APF will be always delivered on the same vcpu as
+ type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+ data: 64-byte alignment physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure:
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[12];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version: a sequence counter. In other words, guest has to check
+ this field before and after grabbing time information and make
+ sure they are both equal and even. An odd version indicates an
+ in-progress update.
+
+ flags: At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal: the amount of time in which this vCPU did not run, in
+ nanoseconds. Time during which the vcpu is idle, will not be
+ reported as steal time.
diff --git a/Documentation/virtual/kvm/nested-vmx.txt b/Documentation/virtual/kvm/nested-vmx.txt
new file mode 100644
index 00000000..8ed937de
--- /dev/null
+++ b/Documentation/virtual/kvm/nested-vmx.txt
@@ -0,0 +1,251 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+ http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guests.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+ -cpu host (emulated CPU has all features of the real CPU)
+
+ -cpu qemu64,+vmx (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for the a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
+
+The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
+also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS
+which L0 builds to actually run L2 - how this is done is explained in the
+aforementioned paper.
+
+For convenience, we repeat the content of struct vmcs12 here. If the internals
+of this structure changes, this can break live migration across KVM versions.
+VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
+struct shadow_vmcs is ever changed.
+
+ typedef u64 natural_width;
+ struct __packed vmcs12 {
+ /* According to the Intel spec, a VMCS region must start with
+ * these two user-visible fields */
+ u32 revision_id;
+ u32 abort;
+
+ u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+ u32 padding[7]; /* room for future expansion */
+
+ u64 io_bitmap_a;
+ u64 io_bitmap_b;
+ u64 msr_bitmap;
+ u64 vm_exit_msr_store_addr;
+ u64 vm_exit_msr_load_addr;
+ u64 vm_entry_msr_load_addr;
+ u64 tsc_offset;
+ u64 virtual_apic_page_addr;
+ u64 apic_access_addr;
+ u64 ept_pointer;
+ u64 guest_physical_address;
+ u64 vmcs_link_pointer;
+ u64 guest_ia32_debugctl;
+ u64 guest_ia32_pat;
+ u64 guest_ia32_efer;
+ u64 guest_pdptr0;
+ u64 guest_pdptr1;
+ u64 guest_pdptr2;
+ u64 guest_pdptr3;
+ u64 host_ia32_pat;
+ u64 host_ia32_efer;
+ u64 padding64[8]; /* room for future expansion */
+ natural_width cr0_guest_host_mask;
+ natural_width cr4_guest_host_mask;
+ natural_width cr0_read_shadow;
+ natural_width cr4_read_shadow;
+ natural_width cr3_target_value0;
+ natural_width cr3_target_value1;
+ natural_width cr3_target_value2;
+ natural_width cr3_target_value3;
+ natural_width exit_qualification;
+ natural_width guest_linear_address;
+ natural_width guest_cr0;
+ natural_width guest_cr3;
+ natural_width guest_cr4;
+ natural_width guest_es_base;
+ natural_width guest_cs_base;
+ natural_width guest_ss_base;
+ natural_width guest_ds_base;
+ natural_width guest_fs_base;
+ natural_width guest_gs_base;
+ natural_width guest_ldtr_base;
+ natural_width guest_tr_base;
+ natural_width guest_gdtr_base;
+ natural_width guest_idtr_base;
+ natural_width guest_dr7;
+ natural_width guest_rsp;
+ natural_width guest_rip;
+ natural_width guest_rflags;
+ natural_width guest_pending_dbg_exceptions;
+ natural_width guest_sysenter_esp;
+ natural_width guest_sysenter_eip;
+ natural_width host_cr0;
+ natural_width host_cr3;
+ natural_width host_cr4;
+ natural_width host_fs_base;
+ natural_width host_gs_base;
+ natural_width host_tr_base;
+ natural_width host_gdtr_base;
+ natural_width host_idtr_base;
+ natural_width host_ia32_sysenter_esp;
+ natural_width host_ia32_sysenter_eip;
+ natural_width host_rsp;
+ natural_width host_rip;
+ natural_width paddingl[8]; /* room for future expansion */
+ u32 pin_based_vm_exec_control;
+ u32 cpu_based_vm_exec_control;
+ u32 exception_bitmap;
+ u32 page_fault_error_code_mask;
+ u32 page_fault_error_code_match;
+ u32 cr3_target_count;
+ u32 vm_exit_controls;
+ u32 vm_exit_msr_store_count;
+ u32 vm_exit_msr_load_count;
+ u32 vm_entry_controls;
+ u32 vm_entry_msr_load_count;
+ u32 vm_entry_intr_info_field;
+ u32 vm_entry_exception_error_code;
+ u32 vm_entry_instruction_len;
+ u32 tpr_threshold;
+ u32 secondary_vm_exec_control;
+ u32 vm_instruction_error;
+ u32 vm_exit_reason;
+ u32 vm_exit_intr_info;
+ u32 vm_exit_intr_error_code;
+ u32 idt_vectoring_info_field;
+ u32 idt_vectoring_error_code;
+ u32 vm_exit_instruction_len;
+ u32 vmx_instruction_info;
+ u32 guest_es_limit;
+ u32 guest_cs_limit;
+ u32 guest_ss_limit;
+ u32 guest_ds_limit;
+ u32 guest_fs_limit;
+ u32 guest_gs_limit;
+ u32 guest_ldtr_limit;
+ u32 guest_tr_limit;
+ u32 guest_gdtr_limit;
+ u32 guest_idtr_limit;
+ u32 guest_es_ar_bytes;
+ u32 guest_cs_ar_bytes;
+ u32 guest_ss_ar_bytes;
+ u32 guest_ds_ar_bytes;
+ u32 guest_fs_ar_bytes;
+ u32 guest_gs_ar_bytes;
+ u32 guest_ldtr_ar_bytes;
+ u32 guest_tr_ar_bytes;
+ u32 guest_interruptibility_info;
+ u32 guest_activity_state;
+ u32 guest_sysenter_cs;
+ u32 host_ia32_sysenter_cs;
+ u32 padding32[8]; /* room for future expansion */
+ u16 virtual_processor_id;
+ u16 guest_es_selector;
+ u16 guest_cs_selector;
+ u16 guest_ss_selector;
+ u16 guest_ds_selector;
+ u16 guest_fs_selector;
+ u16 guest_gs_selector;
+ u16 guest_ldtr_selector;
+ u16 guest_tr_selector;
+ u16 host_es_selector;
+ u16 host_cs_selector;
+ u16 host_ss_selector;
+ u16 host_ds_selector;
+ u16 host_fs_selector;
+ u16 host_gs_selector;
+ u16 host_tr_selector;
+ };
+
+
+Authors
+-------
+
+These patches were written by:
+ Abel Gordon, abelg <at> il.ibm.com
+ Nadav Har'El, nyh <at> il.ibm.com
+ Orit Wasserman, oritw <at> il.ibm.com
+ Ben-Ami Yassor, benami <at> il.ibm.com
+ Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+ Anthony Liguori, aliguori <at> us.ibm.com
+ Mike Day, mdday <at> us.ibm.com
+ Michael Factor, factor <at> il.ibm.com
+ Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+ Avi Kivity, avi <at> redhat.com
+ Gleb Natapov, gleb <at> redhat.com
+ Marcelo Tosatti, mtosatti <at> redhat.com
+ Kevin Tian, kevin.tian <at> intel.com
+ and others.
diff --git a/Documentation/virtual/kvm/ppc-pv.txt b/Documentation/virtual/kvm/ppc-pv.txt
new file mode 100644
index 00000000..6e7c3705
--- /dev/null
+++ b/Documentation/virtual/kvm/ppc-pv.txt
@@ -0,0 +1,178 @@
+The PPC KVM paravirtual interface
+=================================
+
+The basic execution principle by which KVM on PowerPC works is to run all kernel
+space code in PR=1 which is user space. This way we trap all privileged
+instructions and can emulate them accordingly.
+
+Unfortunately that is also the downfall. There are quite some privileged
+instructions that needlessly return us to the hypervisor even though they
+could be handled differently.
+
+This is what the PPC PV interface helps with. It takes privileged instructions
+and transforms them into unprivileged ones with some help from the hypervisor.
+This cuts down virtualization costs by about 50% on some of my benchmarks.
+
+The code for that interface can be found in arch/powerpc/kernel/kvm*
+
+Querying for existence
+======================
+
+To find out if we're running on KVM or not, we leverage the device tree. When
+Linux is running on KVM, a node /hypervisor exists. That node contains a
+compatible property with the value "linux,kvm".
+
+Once you determined you're running under a PV capable KVM, you can now use
+hypercalls as described below.
+
+KVM hypercalls
+==============
+
+Inside the device tree's /hypervisor node there's a property called
+'hypercall-instructions'. This property contains at most 4 opcodes that make
+up the hypercall. To call a hypercall, just call these instructions.
+
+The parameters are as follows:
+
+ Register IN OUT
+
+ r0 - volatile
+ r3 1st parameter Return code
+ r4 2nd parameter 1st output value
+ r5 3rd parameter 2nd output value
+ r6 4th parameter 3rd output value
+ r7 5th parameter 4th output value
+ r8 6th parameter 5th output value
+ r9 7th parameter 6th output value
+ r10 8th parameter 7th output value
+ r11 hypercall number 8th output value
+ r12 - volatile
+
+Hypercall definitions are shared in generic code, so the same hypercall numbers
+apply for x86 and powerpc alike with the exception that each KVM hypercall
+also needs to be ORed with the KVM vendor code which is (42 << 16).
+
+Return codes can be as follows:
+
+ Code Meaning
+
+ 0 Success
+ 12 Hypercall not implemented
+ <0 Error
+
+The magic page
+==============
+
+To enable communication between the hypervisor and guest there is a new shared
+page that contains parts of supervisor visible register state. The guest can
+map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
+
+With this hypercall issued the guest always gets the magic page mapped at the
+desired location. The first parameter indicates the effective address when the
+MMU is enabled. The second parameter indicates the address in real mode, if
+applicable to the target. For now, we always map the page to -4096. This way we
+can access it using absolute load and store functions. The following
+instruction reads the first field of the magic page:
+
+ ld rX, -4096(0)
+
+The interface is designed to be extensible should there be need later to add
+additional registers to the magic page. If you add fields to the magic page,
+also define a new hypercall feature to indicate that the host can give you more
+registers. Only if the host supports the additional features, make use of them.
+
+The magic page layout is described by struct kvm_vcpu_arch_shared
+in arch/powerpc/include/asm/kvm_para.h.
+
+Magic page features
+===================
+
+When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
+a second return value is passed to the guest. This second return value contains
+a bitmap of available features inside the magic page.
+
+The following enhancements to the magic page are currently available:
+
+ KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page
+
+For enhanced features in the magic page, please check for the existence of the
+feature before using them!
+
+MSR bits
+========
+
+The MSR contains bits that require hypervisor intervention and bits that do
+not require direct hypervisor intervention because they only get interpreted
+when entering the guest or don't have any impact on the hypervisor's behavior.
+
+The following bits are safe to be set inside the guest:
+
+ MSR_EE
+ MSR_RI
+ MSR_CR
+ MSR_ME
+
+If any other bit changes in the MSR, please still use mtmsr(d).
+
+Patched instructions
+====================
+
+The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions
+respectively on 32 bit systems with an added offset of 4 to accommodate for big
+endianness.
+
+The following is a list of mapping the Linux kernel performs when running as
+guest. Implementing any of those mappings is optional, as the instruction traps
+also act on the shared page. So calling privileged instructions still works as
+before.
+
+From To
+==== ==
+
+mfmsr rX ld rX, magic_page->msr
+mfsprg rX, 0 ld rX, magic_page->sprg0
+mfsprg rX, 1 ld rX, magic_page->sprg1
+mfsprg rX, 2 ld rX, magic_page->sprg2
+mfsprg rX, 3 ld rX, magic_page->sprg3
+mfsrr0 rX ld rX, magic_page->srr0
+mfsrr1 rX ld rX, magic_page->srr1
+mfdar rX ld rX, magic_page->dar
+mfdsisr rX lwz rX, magic_page->dsisr
+
+mtmsr rX std rX, magic_page->msr
+mtsprg 0, rX std rX, magic_page->sprg0
+mtsprg 1, rX std rX, magic_page->sprg1
+mtsprg 2, rX std rX, magic_page->sprg2
+mtsprg 3, rX std rX, magic_page->sprg3
+mtsrr0 rX std rX, magic_page->srr0
+mtsrr1 rX std rX, magic_page->srr1
+mtdar rX std rX, magic_page->dar
+mtdsisr rX stw rX, magic_page->dsisr
+
+tlbsync nop
+
+mtmsrd rX, 0 b <special mtmsr section>
+mtmsr rX b <special mtmsr section>
+
+mtmsrd rX, 1 b <special mtmsrd section>
+
+[Book3S only]
+mtsrin rX, rY b <special mtsrin section>
+
+[BookE only]
+wrteei [0|1] b <special wrteei section>
+
+
+Some instructions require more logic to determine what's going on than a load
+or store instruction can deliver. To enable patching of those, we keep some
+RAM around where we can live translate instructions to. What happens is the
+following:
+
+ 1) copy emulation code to memory
+ 2) patch that code to fit the emulated instruction
+ 3) patch that code to return to the original pc + 4
+ 4) patch the original instruction to branch to the new code
+
+That way we can inject an arbitrary amount of code as replacement for a single
+instruction. This allows us to check for pending interrupts when setting EE=1
+for example.
diff --git a/Documentation/virtual/kvm/review-checklist.txt b/Documentation/virtual/kvm/review-checklist.txt
new file mode 100644
index 00000000..a850986e
--- /dev/null
+++ b/Documentation/virtual/kvm/review-checklist.txt
@@ -0,0 +1,38 @@
+Review checklist for kvm patches
+================================
+
+1. The patch must follow Documentation/CodingStyle and
+ Documentation/SubmittingPatches.
+
+2. Patches should be against kvm.git master branch.
+
+3. If the patch introduces or modifies a new userspace API:
+ - the API must be documented in Documentation/virtual/kvm/api.txt
+ - the API must be discoverable using KVM_CHECK_EXTENSION
+
+4. New state must include support for save/restore.
+
+5. New features must default to off (userspace should explicitly request them).
+ Performance improvements can and should default to on.
+
+6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2
+
+7. Emulator changes should be accompanied by unit tests for qemu-kvm.git
+ kvm/test directory.
+
+8. Changes should be vendor neutral when possible. Changes to common code
+ are better than duplicating changes to vendor code.
+
+9. Similarly, prefer changes to arch independent code than to arch dependent
+ code.
+
+10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
+ (all variables and sizes naturally aligned on 64-bit; use specific types
+ only - u64 rather than ulong).
+
+11. New guest visible features must either be documented in a hardware manual
+ or be accompanied by documentation.
+
+12. Features must be robust against reset and kexec - for example, shared
+ host/guest memory must be unshared to prevent the host from writing to
+ guest memory that the guest has not reserved for this purpose.
diff --git a/Documentation/virtual/kvm/timekeeping.txt b/Documentation/virtual/kvm/timekeeping.txt
new file mode 100644
index 00000000..df894637
--- /dev/null
+++ b/Documentation/virtual/kvm/timekeeping.txt
@@ -0,0 +1,612 @@
+
+ Timekeeping Virtualization for X86-Based Architectures
+
+ Zachary Amsden <zamsden@redhat.com>
+ Copyright (c) 2010, Red Hat. All rights reserved.
+
+1) Overview
+2) Timing Devices
+3) TSC Hardware
+4) Virtualization Problems
+
+=========================================================================
+
+1) Overview
+
+One of the most complicated parts of the X86 platform, and specifically,
+the virtualization of this platform is the plethora of timing devices available
+and the complexity of emulating those devices. In addition, virtualization of
+time introduces a new set of challenges because it introduces a multiplexed
+division of time beyond the control of the guest CPU.
+
+First, we will describe the various timekeeping hardware available, then
+present some of the problems which arise and solutions available, giving
+specific recommendations for certain classes of KVM guests.
+
+The purpose of this document is to collect data and information relevant to
+timekeeping which may be difficult to find elsewhere, specifically,
+information relevant to KVM and hardware-based virtualization.
+
+=========================================================================
+
+2) Timing Devices
+
+First we discuss the basic hardware devices available. TSC and the related
+KVM clock are special enough to warrant a full exposition and are described in
+the following section.
+
+2.1) i8254 - PIT
+
+One of the first timer devices available is the programmable interrupt timer,
+or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
+channels which can be programmed to deliver periodic or one-shot interrupts.
+These three channels can be configured in different modes and have individual
+counters. Channel 1 and 2 were not available for general use in the original
+IBM PC, and historically were connected to control RAM refresh and the PC
+speaker. Now the PIT is typically integrated as part of an emulated chipset
+and a separate physical PIT is not used.
+
+The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
+using single or multiple byte access to the I/O ports. There are 6 modes
+available, but not all modes are available to all timers, as only timer 2
+has a connected gate input, required for modes 1 and 5. The gate line is
+controlled by port 61h, bit 0, as illustrated in the following diagram.
+
+ -------------- ----------------
+| | | |
+| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
+| Clock | | | |
+ -------------- | +->| GATE TIMER 0 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
+ | | | (aka /dev/null)
+ | +->| GATE TIMER 1 |
+ | ----------------
+ |
+ | ----------------
+ | | |
+ |------>| CLOCK OUT | ---------> Port 61h, bit 5
+ | | |
+Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____
+ ---------------- _| )--|LPF|---Speaker
+ / *---- \___/
+Port 61h, bit 1 -----------------------------------/
+
+The timer modes are now described.
+
+Mode 0: Single Timeout. This is a one-shot software timeout that counts down
+ when the gate is high (always true for timers 0 and 1). When the count
+ reaches zero, the output goes high.
+
+Mode 1: Triggered One-shot. The output is initially set high. When the gate
+ line is set high, a countdown is initiated (which does not stop if the gate is
+ lowered), during which the output is set low. When the count reaches zero,
+ the output goes high.
+
+Mode 2: Rate Generator. The output is initially set high. When the countdown
+ reaches 1, the output goes low for one count and then returns high. The value
+ is reloaded and the countdown automatically resumes. If the gate line goes
+ low, the count is halted. If the output is low when the gate is lowered, the
+ output automatically goes high (this only affects timer 2).
+
+Mode 3: Square Wave. This generates a high / low square wave. The count
+ determines the length of the pulse, which alternates between high and low
+ when zero is reached. The count only proceeds when gate is high and is
+ automatically reloaded on reaching zero. The count is decremented twice at
+ each clock to generate a full high / low cycle at the full periodic rate.
+ If the count is even, the clock remains high for N/2 counts and low for N/2
+ counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
+ for (N-1)/2 counts. Only even values are latched by the counter, so odd
+ values are not observed when reading. This is the intended mode for timer 2,
+ which generates sine-like tones by low-pass filtering the square wave output.
+
+Mode 4: Software Strobe. After programming this mode and loading the counter,
+ the output remains high until the counter reaches zero. Then the output
+ goes low for 1 clock cycle and returns high. The counter is not reloaded.
+ Counting only occurs when gate is high.
+
+Mode 5: Hardware Strobe. After programming and loading the counter, the
+ output remains high. When the gate is raised, a countdown is initiated
+ (which does not stop if the gate is lowered). When the counter reaches zero,
+ the output goes low for 1 clock cycle and then returns high. The counter is
+ not reloaded.
+
+In addition to normal binary counting, the PIT supports BCD counting. The
+command port, 0x43 is used to set the counter and mode for each of the three
+timers.
+
+PIT commands, issued to port 0x43, using the following bit encoding:
+
+Bit 7-4: Command (See table below)
+Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
+Bit 0 : Binary (0) / BCD (1)
+
+Command table:
+
+0000 - Latch Timer 0 count for port 0x40
+ sample and hold the count to be read in port 0x40;
+ additional commands ignored until counter is read;
+ mode bits ignored.
+
+0001 - Set Timer 0 LSB mode for port 0x40
+ set timer to read LSB only and force MSB to zero;
+ mode bits set timer mode
+
+0010 - Set Timer 0 MSB mode for port 0x40
+ set timer to read MSB only and force LSB to zero;
+ mode bits set timer mode
+
+0011 - Set Timer 0 16-bit mode for port 0x40
+ set timer to read / write LSB first, then MSB;
+ mode bits set timer mode
+
+0100 - Latch Timer 1 count for port 0x41 - as described above
+0101 - Set Timer 1 LSB mode for port 0x41 - as described above
+0110 - Set Timer 1 MSB mode for port 0x41 - as described above
+0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
+
+1000 - Latch Timer 2 count for port 0x42 - as described above
+1001 - Set Timer 2 LSB mode for port 0x42 - as described above
+1010 - Set Timer 2 MSB mode for port 0x42 - as described above
+1011 - Set Timer 2 16-bit mode for port 0x42 as described above
+
+1101 - General counter latch
+ Latch combination of counters into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+ Bit 0 = Unused
+
+1110 - Latch timer status
+ Latch combination of counter mode into corresponding ports
+ Bit 3 = Counter 2
+ Bit 2 = Counter 1
+ Bit 1 = Counter 0
+
+ The output of ports 0x40-0x42 following this command will be:
+
+ Bit 7 = Output pin
+ Bit 6 = Count loaded (0 if timer has expired)
+ Bit 5-4 = Read / Write mode
+ 01 = MSB only
+ 10 = LSB only
+ 11 = LSB / MSB (16-bit)
+ Bit 3-1 = Mode
+ Bit 0 = Binary (0) / BCD mode (1)
+
+2.2) RTC
+
+The second device which was available in the original PC was the MC146818 real
+time clock. The original device is now obsolete, and usually emulated by the
+system chipset, sometimes by an HPET and some frankenstein IRQ routing.
+
+The RTC is accessed through CMOS variables, which uses an index register to
+control which bytes are read. Since there is only one index register, read
+of the CMOS and read of the RTC require lock protection (in addition, it is
+dangerous to allow userspace utilities such as hwclock to have direct RTC
+access, as they could corrupt kernel reads and writes of CMOS memory).
+
+The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
+can function as a periodic timer, an additional once a day alarm, and can issue
+interrupts after an update of the CMOS registers by the MC146818 is complete.
+The type of interrupt is signalled in the RTC status registers.
+
+The RTC will update the current time fields by battery power even while the
+system is off. The current time fields should not be read while an update is
+in progress, as indicated in the status register.
+
+The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
+programmed to a 32kHz divider if the RTC is to count seconds.
+
+This is the RAM map originally used for the RTC/CMOS:
+
+Location Size Description
+------------------------------------------
+00h byte Current second (BCD)
+01h byte Seconds alarm (BCD)
+02h byte Current minute (BCD)
+03h byte Minutes alarm (BCD)
+04h byte Current hour (BCD)
+05h byte Hours alarm (BCD)
+06h byte Current day of week (BCD)
+07h byte Current day of month (BCD)
+08h byte Current month (BCD)
+09h byte Current year (BCD)
+0Ah byte Register A
+ bit 7 = Update in progress
+ bit 6-4 = Divider for clock
+ 000 = 4.194 MHz
+ 001 = 1.049 MHz
+ 010 = 32 kHz
+ 10X = test modes
+ 110 = reset / disable
+ 111 = reset / disable
+ bit 3-0 = Rate selection for periodic interrupt
+ 000 = periodic timer disabled
+ 001 = 3.90625 uS
+ 010 = 7.8125 uS
+ 011 = .122070 mS
+ 100 = .244141 mS
+ ...
+ 1101 = 125 mS
+ 1110 = 250 mS
+ 1111 = 500 mS
+0Bh byte Register B
+ bit 7 = Run (0) / Halt (1)
+ bit 6 = Periodic interrupt enable
+ bit 5 = Alarm interrupt enable
+ bit 4 = Update-ended interrupt enable
+ bit 3 = Square wave interrupt enable
+ bit 2 = BCD calendar (0) / Binary (1)
+ bit 1 = 12-hour mode (0) / 24-hour mode (1)
+ bit 0 = 0 (DST off) / 1 (DST enabled)
+OCh byte Register C (read only)
+ bit 7 = interrupt request flag (IRQF)
+ bit 6 = periodic interrupt flag (PF)
+ bit 5 = alarm interrupt flag (AF)
+ bit 4 = update interrupt flag (UF)
+ bit 3-0 = reserved
+ODh byte Register D (read only)
+ bit 7 = RTC has power
+ bit 6-0 = reserved
+32h byte Current century BCD (*)
+ (*) location vendor specific and now determined from ACPI global tables
+
+2.3) APIC
+
+On Pentium and later processors, an on-board timer is available to each CPU
+as part of the Advanced Programmable Interrupt Controller. The APIC is
+accessed through memory-mapped registers and provides interrupt service to each
+CPU, used for IPIs and local timer interrupts.
+
+Although in theory the APIC is a safe and stable source for local interrupts,
+in practice, many bugs and glitches have occurred due to the special nature of
+the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
+the use of the APIC and that workarounds may be required. In addition, some of
+these workarounds pose unique constraints for virtualization - requiring either
+extra overhead incurred from extra reads of memory-mapped I/O or additional
+functionality that may be more computationally expensive to implement.
+
+Since the APIC is documented quite well in the Intel and AMD manuals, we will
+avoid repetition of the detail here. It should be pointed out that the APIC
+timer is programmed through the LVT (local vector timer) register, is capable
+of one-shot or periodic operation, and is based on the bus clock divided down
+by the programmable divider register.
+
+2.4) HPET
+
+HPET is quite complex, and was originally intended to replace the PIT / RTC
+support of the X86 PC. It remains to be seen whether that will be the case, as
+the de facto standard of PC hardware is to emulate these older devices. Some
+systems designated as legacy free may support only the HPET as a hardware timer
+device.
+
+The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
+but allowing implementation freedom to support many more. It also imposes no
+fixed rate on the timer frequency, but does impose some extremal values on
+frequency, error and slew.
+
+In general, the HPET is recommended as a high precision (compared to PIT /RTC)
+time source which is independent of local variation (as there is only one HPET
+in any given system). The HPET is also memory-mapped, and its presence is
+indicated through ACPI tables by the BIOS.
+
+Detailed specification of the HPET is beyond the current scope of this
+document, as it is also very well documented elsewhere.
+
+2.5) Offboard Timers
+
+Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
+timing chips built into the cards which may have registers which are accessible
+to kernel or user drivers. To the author's knowledge, using these to generate
+a clocksource for a Linux or other kernel has not yet been attempted and is in
+general frowned upon as not playing by the agreed rules of the game. Such a
+timer device would require additional support to be virtualized properly and is
+not considered important at this time as no known operating system does this.
+
+=========================================================================
+
+3) TSC Hardware
+
+The TSC or time stamp counter is relatively simple in theory; it counts
+instruction cycles issued by the processor, which can be used as a measure of
+time. In practice, due to a number of problems, it is the most complicated
+timekeeping device to use.
+
+The TSC is represented internally as a 64-bit MSR which can be read with the
+RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
+limitations made it possible to write the TSC, but generally on old hardware it
+was only possible to write the low 32-bits of the 64-bit counter, and the upper
+32-bits of the counter were cleared. Now, however, on Intel processors family
+0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
+has been lifted and all 64-bits are writable. On AMD systems, the ability to
+write the TSC MSR is not an architectural guarantee.
+
+The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
+means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
+
+Some vendors have implemented an additional instruction, RDTSCP, which returns
+atomically not just the TSC, but an indicator which corresponds to the
+processor number. This can be used to index into an array of TSC variables to
+determine offset information in SMP systems where TSCs are not synchronized.
+The presence of this instruction must be determined by consulting CPUID feature
+bits.
+
+Both VMX and SVM provide extension fields in the virtualization hardware which
+allows the guest visible TSC to be offset by a constant. Newer implementations
+promise to allow the TSC to additionally be scaled, but this hardware is not
+yet widely available.
+
+3.1) TSC synchronization
+
+The TSC is a CPU-local clock in most implementations. This means, on SMP
+platforms, the TSCs of different CPUs may start at different times depending
+on when the CPUs are powered on. Generally, CPUs on the same die will share
+the same clock, however, this is not always the case.
+
+The BIOS may attempt to resynchronize the TSCs during the poweron process and
+the operating system or other system software may attempt to do this as well.
+Several hardware limitations make the problem worse - if it is not possible to
+write the full 64-bits of the TSC, it may be impossible to match the TSC in
+newly arriving CPUs to that of the rest of the system, resulting in
+unsynchronized TSCs. This may be done by BIOS or system software, but in
+practice, getting a perfectly synchronized TSC will not be possible unless all
+values are read from the same clock, which generally only is possible on single
+socket systems or those with special hardware support.
+
+3.2) TSC and CPU hotplug
+
+As touched on already, CPUs which arrive later than the boot time of the system
+may not have a TSC value that is synchronized with the rest of the system.
+Either system software, BIOS, or SMM code may actually try to establish the TSC
+to a value matching the rest of the system, but a perfect match is usually not
+a guarantee. This can have the effect of bringing a system from a state where
+TSC is synchronized back to a state where TSC synchronization flaws, however
+small, may be exposed to the OS and any virtualization environment.
+
+3.3) TSC and multi-socket / NUMA
+
+Multi-socket systems, especially large multi-socket systems are likely to have
+individual clocksources rather than a single, universally distributed clock.
+Since these clocks are driven by different crystals, they will not have
+perfectly matched frequency, and temperature and electrical variations will
+cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
+exact clock and bus design, the drift may or may not be fixed in absolute
+error, and may accumulate over time.
+
+In addition, very large systems may deliberately slew the clocks of individual
+cores. This technique, known as spread-spectrum clocking, reduces EMI at the
+clock frequency and harmonics of it, which may be required to pass FCC
+standards for telecommunications and computer equipment.
+
+It is recommended not to trust the TSCs to remain synchronized on NUMA or
+multiple socket systems for these reasons.
+
+3.4) TSC and C-states
+
+C-states, or idling states of the processor, especially C1E and deeper sleep
+states may be problematic for TSC as well. The TSC may stop advancing in such
+a state, resulting in a TSC which is behind that of other CPUs when execution
+is resumed. Such CPUs must be detected and flagged by the operating system
+based on CPU and chipset identifications.
+
+The TSC in such a case may be corrected by catching it up to a known external
+clocksource.
+
+3.5) TSC frequency change / P-states
+
+To make things slightly more interesting, some CPUs may change frequency. They
+may or may not run the TSC at the same rate, and because the frequency change
+may be staggered or slewed, at some points in time, the TSC rate may not be
+known other than falling within a range of values. In this case, the TSC will
+not be a stable time source, and must be calibrated against a known, stable,
+external clock to be a usable source of time.
+
+Whether the TSC runs at a constant rate or scales with the P-state is model
+dependent and must be determined by inspecting CPUID, chipset or vendor
+specific MSR fields.
+
+In addition, some vendors have known bugs where the P-state is actually
+compensated for properly during normal operation, but when the processor is
+inactive, the P-state may be raised temporarily to service cache misses from
+other processors. In such cases, the TSC on halted CPUs could advance faster
+than that of non-halted processors. AMD Turion processors are known to have
+this problem.
+
+3.6) TSC and STPCLK / T-states
+
+External signals given to the processor may also have the effect of stopping
+the TSC. This is typically done for thermal emergency power control to prevent
+an overheating condition, and typically, there is no way to detect that this
+condition has happened.
+
+3.7) TSC virtualization - VMX
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS. Special instructions must be used to read and
+write the VMCS field.
+
+3.8) TSC virtualization - SVM
+
+SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner. In
+addition, SVM allows passing through the host TSC plus an additional offset
+field specified in the SVM control block.
+
+3.9) TSC feature bits in Linux
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture. Even
+if so, the TSCs in multi-sockets or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+X86_FEATURE_TSC : The TSC is available in hardware
+X86_FEATURE_RDTSCP : The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
+
+4) Virtualization Problems
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise. The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines. Thus the virtual
+operating system does not run with 100% usage of the CPU, despite the fact that
+it may very well make that assumption. It may expect it to remain true to very
+exacting bounds when interrupt sources are disabled, but in reality only its
+virtual interrupt sources are disabled, and the machine may still be preempted
+at any time. This causes problems as the passage of real time, the injection
+of machine interrupts and the associated clock sources are no longer completely
+synchronized with real time.
+
+This same problem can occur on native harware to a degree, as SMM mode may
+steal cycles from the naturally on X86 systems when SMM mode is used by the
+BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
+cause similar problems to virtualization makes it a good justification for
+solving many of these problems on bare metal.
+
+4.1) Interrupt clocking
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts. These interrupts may come from the PIT
+or the RTC, but the problem is the same: the host virtualization engine may not
+be able to deliver the proper number of interrupts per second, and so guest
+time may fall behind. This is especially problematic if a high interrupt rate
+is selected, such as 1000 HZ, which is unfortunately the default for many Linux
+guests.
+
+There are three approaches to solving this problem; first, it may be possible
+to simply ignore it. Guests which have a separate time source for tracking
+'wall clock' or 'real time' may not need any adjustment of their interrupts to
+maintain proper time. If this is not sufficient, it may be necessary to inject
+additional interrupts into the guest in order to increase the effective
+interrupt rate. This approach leads to complications in extreme conditions,
+where host load or guest lag is too much to compensate for, and thus another
+solution to the problem has risen: the guest may need to become aware of lost
+ticks and compensate for them internally. Although promising in theory, the
+implementation of this policy in Linux has been extremely error prone, and a
+number of buggy variants of lost tick compensation are distributed across
+commonly used Linux systems.
+
+Windows uses periodic RTC clocking as a means of keeping time internally, and
+thus requires interrupt slewing to keep proper time. It does use a low enough
+rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
+practice.
+
+4.2) TSC sampling and serialization
+
+As the highest precision time source available, the cycle counter of the CPU
+has aroused much interest from developers. As explained above, this timer has
+many problems unique to its nature as a local, potentially unstable and
+potentially unsynchronized source. One issue which is not unique to the TSC,
+but is highlighted because of its very precise nature is sampling delay. By
+definition, the counter, once read is already old. However, it is also
+possible for the counter to be read ahead of the actual use of the result.
+This is a consequence of the superscalar execution of the instruction stream,
+which may execute instructions out of order. Such execution is called
+non-serialized. Forcing serialized execution is necessary for precise
+measurement with the TSC, and requires a serializing instruction, such as CPUID
+or an MSR read.
+
+Since CPUID may actually be virtualized by a trap and emulate mechanism, this
+serialization can pose a performance issue for hardware virtualization. An
+accurate time stamp counter reading may therefore not always be available, and
+it may be necessary for an implementation to guard against "backwards" reads of
+the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
+system.
+
+4.3) Timespec aliasing
+
+Additionally, this lack of serialization from the TSC poses another challenge
+when using results of the TSC when measured against another time source. As
+the TSC is much higher precision, many possible values of the TSC may be read
+while another clock is still expressing the same value.
+
+That is, you may read (T,T+10) while external clock C maintains the same value.
+Due to non-serialized reads, you may actually end up with a range which
+fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
+calibrated against an external value may have a range of valid values.
+Re-calibrating this computation may actually cause time, as computed after the
+calibration, to go backwards, compared with time computed before the
+calibration.
+
+This problem is particularly pronounced with an internal time source in Linux,
+the kernel time, which is expressed in the theoretically high resolution
+timespec - but which advances in much larger granularity intervals, sometimes
+at the rate of jiffies, and possibly in catchup modes, at a much larger step.
+
+This aliasing requires care in the computation and recalibration of kvmclock
+and any other values derived from TSC computation (such as TSC virtualization
+itself).
+
+4.4) Migration
+
+Migration of a virtual machine raises problems for timekeeping in two ways.
+First, the migration itself may take time, during which interrupts cannot be
+delivered, and after which, the guest time may need to be caught up. NTP may
+be able to help to some degree here, as the clock correction required is
+typically small enough to fall in the NTP-correctable window.
+
+An additional concern is that timers based off the TSC (or HPET, if the raw bus
+clock is exposed) may now be running at different rates, requiring compensation
+in some way in the hypervisor by virtualizing these timers. In addition,
+migrating to a faster machine may preclude the use of a passthrough TSC, as a
+faster clock cannot be made visible to a guest without the potential of time
+advancing faster than usual. A slower clock is less of a problem, as it can
+always be caught up to the original rate. KVM clock avoids these problems by
+simply storing multipliers and offsets against the TSC for the guest to convert
+back into nanosecond resolution values.
+
+4.5) Scheduling
+
+Since scheduling may be based on precise timing and firing of interrupts, the
+scheduling algorithms of an operating system may be adversely affected by
+virtualization. In theory, the effect is random and should be universally
+distributed, but in contrived as well as real scenarios (guest device access,
+causes of virtualization exits, possible context switch), this may not always
+be the case. The effect of this has not been well studied.
+
+In an attempt to work around this, several implementations have provided a
+paravirtualized scheduler clock, which reveals the true amount of CPU time for
+which a virtual machine has been running.
+
+4.6) Watchdogs
+
+Watchdog timers, such as the lock detector in Linux may fire accidentally when
+running under hardware virtualization due to timer interrupts being delayed or
+misinterpretation of the passage of real time. Usually, these warnings are
+spurious and can be ignored, but in some circumstances it may be necessary to
+disable such detection.
+
+4.7) Delays and precision timing
+
+Precise timing and delays may not be possible in a virtualized system. This
+can happen if the system is controlling physical hardware, or issues delays to
+compensate for slower I/O to and from devices. The first issue is not solvable
+in general for a virtualized system; hardware control software can't be
+adequately virtualized without a full real-time operating system, which would
+require an RT aware virtualization platform.
+
+The second issue may cause performance problems, but this is unlikely to be a
+significant issue. In many cases these delays may be eliminated through
+configuration or paravirtualization.
+
+4.8) Covert channels and leaks
+
+In addition to the above problems, time information will inevitably leak to the
+guest about the host in anything but a perfect implementation of virtualized
+time. This may allow the guest to infer the presence of a hypervisor (as in a
+red-pill type detection), and it may allow information to leak between guests
+by using CPU utilization itself as a signalling channel. Preventing such
+problems would require completely isolated virtual time which may not track
+real time any longer. This may be useful in certain security or QA contexts,
+but in general isn't recommended for real-world deployment scenarios.
diff --git a/Documentation/virtual/uml/UserModeLinux-HOWTO.txt b/Documentation/virtual/uml/UserModeLinux-HOWTO.txt
new file mode 100644
index 00000000..77dfecf4
--- /dev/null
+++ b/Documentation/virtual/uml/UserModeLinux-HOWTO.txt
@@ -0,0 +1,4589 @@
+ User Mode Linux HOWTO
+ User Mode Linux Core Team
+ Mon Nov 18 14:16:16 EST 2002
+
+ This document describes the use and abuse of Jeff Dike's User Mode
+ Linux: a port of the Linux kernel as a normal Intel Linux process.
+ ______________________________________________________________________
+
+ Table of Contents
+
+ 1. Introduction
+
+ 1.1 How is User Mode Linux Different?
+ 1.2 Why Would I Want User Mode Linux?
+
+ 2. Compiling the kernel and modules
+
+ 2.1 Compiling the kernel
+ 2.2 Compiling and installing kernel modules
+ 2.3 Compiling and installing uml_utilities
+
+ 3. Running UML and logging in
+
+ 3.1 Running UML
+ 3.2 Logging in
+ 3.3 Examples
+
+ 4. UML on 2G/2G hosts
+
+ 4.1 Introduction
+ 4.2 The problem
+ 4.3 The solution
+
+ 5. Setting up serial lines and consoles
+
+ 5.1 Specifying the device
+ 5.2 Specifying the channel
+ 5.3 Examples
+
+ 6. Setting up the network
+
+ 6.1 General setup
+ 6.2 Userspace daemons
+ 6.3 Specifying ethernet addresses
+ 6.4 UML interface setup
+ 6.5 Multicast
+ 6.6 TUN/TAP with the uml_net helper
+ 6.7 TUN/TAP with a preconfigured tap device
+ 6.8 Ethertap
+ 6.9 The switch daemon
+ 6.10 Slip
+ 6.11 Slirp
+ 6.12 pcap
+ 6.13 Setting up the host yourself
+
+ 7. Sharing Filesystems between Virtual Machines
+
+ 7.1 A warning
+ 7.2 Using layered block devices
+ 7.3 Note!
+ 7.4 Another warning
+ 7.5 uml_moo : Merging a COW file with its backing file
+
+ 8. Creating filesystems
+
+ 8.1 Create the filesystem file
+ 8.2 Assign the file to a UML device
+ 8.3 Creating and mounting the filesystem
+
+ 9. Host file access
+
+ 9.1 Using hostfs
+ 9.2 hostfs as the root filesystem
+ 9.3 Building hostfs
+
+ 10. The Management Console
+ 10.1 version
+ 10.2 halt and reboot
+ 10.3 config
+ 10.4 remove
+ 10.5 sysrq
+ 10.6 help
+ 10.7 cad
+ 10.8 stop
+ 10.9 go
+
+ 11. Kernel debugging
+
+ 11.1 Starting the kernel under gdb
+ 11.2 Examining sleeping processes
+ 11.3 Running ddd on UML
+ 11.4 Debugging modules
+ 11.5 Attaching gdb to the kernel
+ 11.6 Using alternate debuggers
+
+ 12. Kernel debugging examples
+
+ 12.1 The case of the hung fsck
+ 12.2 Episode 2: The case of the hung fsck
+
+ 13. What to do when UML doesn't work
+
+ 13.1 Strange compilation errors when you build from source
+ 13.2 (obsolete)
+ 13.3 A variety of panics and hangs with /tmp on a reiserfs filesystem
+ 13.4 The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid'
+ 13.5 UML doesn't work when /tmp is an NFS filesystem
+ 13.6 UML hangs on boot when compiled with gprof support
+ 13.7 syslogd dies with a SIGTERM on startup
+ 13.8 TUN/TAP networking doesn't work on a 2.4 host
+ 13.9 You can network to the host but not to other machines on the net
+ 13.10 I have no root and I want to scream
+ 13.11 UML build conflict between ptrace.h and ucontext.h
+ 13.12 The UML BogoMips is exactly half the host's BogoMips
+ 13.13 When you run UML, it immediately segfaults
+ 13.14 xterms appear, then immediately disappear
+ 13.15 Any other panic, hang, or strange behavior
+
+ 14. Diagnosing Problems
+
+ 14.1 Case 1 : Normal kernel panics
+ 14.2 Case 2 : Tracing thread panics
+ 14.3 Case 3 : Tracing thread panics caused by other threads
+ 14.4 Case 4 : Hangs
+
+ 15. Thanks
+
+ 15.1 Code and Documentation
+ 15.2 Flushing out bugs
+ 15.3 Buglets and clean-ups
+ 15.4 Case Studies
+ 15.5 Other contributions
+
+
+ ______________________________________________________________________
+
+ 1. Introduction
+
+ Welcome to User Mode Linux. It's going to be fun.
+
+
+
+ 1.1. How is User Mode Linux Different?
+
+ Normally, the Linux Kernel talks straight to your hardware (video
+ card, keyboard, hard drives, etc), and any programs which run ask the
+ kernel to operate the hardware, like so:
+
+
+
+ +-----------+-----------+----+
+ | Process 1 | Process 2 | ...|
+ +-----------+-----------+----+
+ | Linux Kernel |
+ +----------------------------+
+ | Hardware |
+ +----------------------------+
+
+
+
+
+ The User Mode Linux Kernel is different; instead of talking to the
+ hardware, it talks to a `real' Linux kernel (called the `host kernel'
+ from now on), like any other program. Programs can then run inside
+ User-Mode Linux as if they were running under a normal kernel, like
+ so:
+
+
+
+ +----------------+
+ | Process 2 | ...|
+ +-----------+----------------+
+ | Process 1 | User-Mode Linux|
+ +----------------------------+
+ | Linux Kernel |
+ +----------------------------+
+ | Hardware |
+ +----------------------------+
+
+
+
+
+
+ 1.2. Why Would I Want User Mode Linux?
+
+
+ 1. If User Mode Linux crashes, your host kernel is still fine.
+
+ 2. You can run a usermode kernel as a non-root user.
+
+ 3. You can debug the User Mode Linux like any normal process.
+
+ 4. You can run gprof (profiling) and gcov (coverage testing).
+
+ 5. You can play with your kernel without breaking things.
+
+ 6. You can use it as a sandbox for testing new apps.
+
+ 7. You can try new development kernels safely.
+
+ 8. You can run different distributions simultaneously.
+
+ 9. It's extremely fun.
+
+
+
+
+
+ 2. Compiling the kernel and modules
+
+
+
+
+ 2.1. Compiling the kernel
+
+
+ Compiling the user mode kernel is just like compiling any other
+ kernel. Let's go through the steps, using 2.4.0-prerelease (current
+ as of this writing) as an example:
+
+
+ 1. Download the latest UML patch from
+
+ the download page <http://user-mode-linux.sourceforge.net/
+
+ In this example, the file is uml-patch-2.4.0-prerelease.bz2.
+
+
+ 2. Download the matching kernel from your favourite kernel mirror,
+ such as:
+
+ ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2
+ <ftp://ftp.ca.kernel.org/pub/kernel/v2.4/linux-2.4.0-prerelease.tar.bz2>
+ .
+
+
+ 3. Make a directory and unpack the kernel into it.
+
+
+
+ host%
+ mkdir ~/uml
+
+
+
+
+
+
+ host%
+ cd ~/uml
+
+
+
+
+
+
+ host%
+ tar -xzvf linux-2.4.0-prerelease.tar.bz2
+
+
+
+
+
+
+ 4. Apply the patch using
+
+
+
+ host%
+ cd ~/uml/linux
+
+
+
+ host%
+ bzcat uml-patch-2.4.0-prerelease.bz2 | patch -p1
+
+
+
+
+
+
+ 5. Run your favorite config; `make xconfig ARCH=um' is the most
+ convenient. `make config ARCH=um' and 'make menuconfig ARCH=um'
+ will work as well. The defaults will give you a useful kernel. If
+ you want to change something, go ahead, it probably won't hurt
+ anything.
+
+
+ Note: If the host is configured with a 2G/2G address space split
+ rather than the usual 3G/1G split, then the packaged UML binaries
+ will not run. They will immediately segfault. See ``UML on 2G/2G
+ hosts'' for the scoop on running UML on your system.
+
+
+
+ 6. Finish with `make linux ARCH=um': the result is a file called
+ `linux' in the top directory of your source tree.
+
+ Make sure that you don't build this kernel in /usr/src/linux. On some
+ distributions, /usr/include/asm is a link into this pool. The user-
+ mode build changes the other end of that link, and things that include
+ <asm/anything.h> stop compiling.
+
+ The sources are also available from cvs at the project's cvs page,
+ which has directions on getting the sources. You can also browse the
+ CVS pool from there.
+
+ If you get the CVS sources, you will have to check them out into an
+ empty directory. You will then have to copy each file into the
+ corresponding directory in the appropriate kernel pool.
+
+ If you don't have the latest kernel pool, you can get the
+ corresponding user-mode sources with
+
+
+ host% cvs co -r v_2_3_x linux
+
+
+
+
+ where 'x' is the version in your pool. Note that you will not get the
+ bug fixes and enhancements that have gone into subsequent releases.
+
+
+ 2.2. Compiling and installing kernel modules
+
+ UML modules are built in the same way as the native kernel (with the
+ exception of the 'ARCH=um' that you always need for UML):
+
+
+ host% make modules ARCH=um
+
+
+
+
+ Any modules that you want to load into this kernel need to be built in
+ the user-mode pool. Modules from the native kernel won't work.
+
+ You can install them by using ftp or something to copy them into the
+ virtual machine and dropping them into /lib/modules/`uname -r`.
+
+ You can also get the kernel build process to install them as follows:
+
+ 1. with the kernel not booted, mount the root filesystem in the top
+ level of the kernel pool:
+
+
+ host% mount root_fs mnt -o loop
+
+
+
+
+
+
+ 2. run
+
+
+ host%
+ make modules_install INSTALL_MOD_PATH=`pwd`/mnt ARCH=um
+
+
+
+
+
+
+ 3. unmount the filesystem
+
+
+ host% umount mnt
+
+
+
+
+
+
+ 4. boot the kernel on it
+
+
+ When the system is booted, you can use insmod as usual to get the
+ modules into the kernel. A number of things have been loaded into UML
+ as modules, especially filesystems and network protocols and filters,
+ so most symbols which need to be exported probably already are.
+ However, if you do find symbols that need exporting, let us
+ <http://user-mode-linux.sourceforge.net/> know, and
+ they'll be "taken care of".
+
+
+
+ 2.3. Compiling and installing uml_utilities
+
+ Many features of the UML kernel require a user-space helper program,
+ so a uml_utilities package is distributed separately from the kernel
+ patch which provides these helpers. Included within this is:
+
+ o port-helper - Used by consoles which connect to xterms or ports
+
+ o tunctl - Configuration tool to create and delete tap devices
+
+ o uml_net - Setuid binary for automatic tap device configuration
+
+ o uml_switch - User-space virtual switch required for daemon
+ transport
+
+ The uml_utilities tree is compiled with:
+
+
+ host#
+ make && make install
+
+
+
+
+ Note that UML kernel patches may require a specific version of the
+ uml_utilities distribution. If you don't keep up with the mailing
+ lists, ensure that you have the latest release of uml_utilities if you
+ are experiencing problems with your UML kernel, particularly when
+ dealing with consoles or command-line switches to the helper programs
+
+
+
+
+
+
+
+
+ 3. Running UML and logging in
+
+
+
+ 3.1. Running UML
+
+ It runs on 2.2.15 or later, and all 2.4 kernels.
+
+
+ Booting UML is straightforward. Simply run 'linux': it will try to
+ mount the file `root_fs' in the current directory. You do not need to
+ run it as root. If your root filesystem is not named `root_fs', then
+ you need to put a `ubd0=root_fs_whatever' switch on the linux command
+ line.
+
+
+ You will need a filesystem to boot UML from. There are a number
+ available for download from here <http://user-mode-
+ linux.sourceforge.net/> . There are also several tools
+ <http://user-mode-linux.sourceforge.net/> which can be
+ used to generate UML-compatible filesystem images from media.
+ The kernel will boot up and present you with a login prompt.
+
+
+ Note: If the host is configured with a 2G/2G address space split
+ rather than the usual 3G/1G split, then the packaged UML binaries will
+ not run. They will immediately segfault. See ``UML on 2G/2G hosts''
+ for the scoop on running UML on your system.
+
+
+
+ 3.2. Logging in
+
+
+
+ The prepackaged filesystems have a root account with password 'root'
+ and a user account with password 'user'. The login banner will
+ generally tell you how to log in. So, you log in and you will find
+ yourself inside a little virtual machine. Our filesystems have a
+ variety of commands and utilities installed (and it is fairly easy to
+ add more), so you will have a lot of tools with which to poke around
+ the system.
+
+ There are a couple of other ways to log in:
+
+ o On a virtual console
+
+
+
+ Each virtual console that is configured (i.e. the device exists in
+ /dev and /etc/inittab runs a getty on it) will come up in its own
+ xterm. If you get tired of the xterms, read ``Setting up serial
+ lines and consoles'' to see how to attach the consoles to
+ something else, like host ptys.
+
+
+
+ o Over the serial line
+
+
+ In the boot output, find a line that looks like:
+
+
+
+ serial line 0 assigned pty /dev/ptyp1
+
+
+
+
+ Attach your favorite terminal program to the corresponding tty. I.e.
+ for minicom, the command would be
+
+
+ host% minicom -o -p /dev/ttyp1
+
+
+
+
+
+
+ o Over the net
+
+
+ If the network is running, then you can telnet to the virtual
+ machine and log in to it. See ``Setting up the network'' to learn
+ about setting up a virtual network.
+
+ When you're done using it, run halt, and the kernel will bring itself
+ down and the process will exit.
+
+
+ 3.3. Examples
+
+ Here are some examples of UML in action:
+
+ o A login session <http://user-mode-linux.sourceforge.net/login.html>
+
+ o A virtual network <http://user-mode-linux.sourceforge.net/net.html>
+
+
+
+
+
+
+
+ 4. UML on 2G/2G hosts
+
+
+
+
+ 4.1. Introduction
+
+
+ Most Linux machines are configured so that the kernel occupies the
+ upper 1G (0xc0000000 - 0xffffffff) of the 4G address space and
+ processes use the lower 3G (0x00000000 - 0xbfffffff). However, some
+ machine are configured with a 2G/2G split, with the kernel occupying
+ the upper 2G (0x80000000 - 0xffffffff) and processes using the lower
+ 2G (0x00000000 - 0x7fffffff).
+
+
+
+
+ 4.2. The problem
+
+
+ The prebuilt UML binaries on this site will not run on 2G/2G hosts
+ because UML occupies the upper .5G of the 3G process address space
+ (0xa0000000 - 0xbfffffff). Obviously, on 2G/2G hosts, this is right
+ in the middle of the kernel address space, so UML won't even load - it
+ will immediately segfault.
+
+
+
+
+ 4.3. The solution
+
+
+ The fix for this is to rebuild UML from source after enabling
+ CONFIG_HOST_2G_2G (under 'General Setup'). This will cause UML to
+ load itself in the top .5G of that smaller process address space,
+ where it will run fine. See ``Compiling the kernel and modules'' if
+ you need help building UML from source.
+
+
+
+
+
+
+
+
+
+
+ 5. Setting up serial lines and consoles
+
+
+ It is possible to attach UML serial lines and consoles to many types
+ of host I/O channels by specifying them on the command line.
+
+
+ You can attach them to host ptys, ttys, file descriptors, and ports.
+ This allows you to do things like
+
+ o have a UML console appear on an unused host console,
+
+ o hook two virtual machines together by having one attach to a pty
+ and having the other attach to the corresponding tty
+
+ o make a virtual machine accessible from the net by attaching a
+ console to a port on the host.
+
+
+ The general format of the command line option is device=channel.
+
+
+
+ 5.1. Specifying the device
+
+ Devices are specified with "con" or "ssl" (console or serial line,
+ respectively), optionally with a device number if you are talking
+ about a specific device.
+
+
+ Using just "con" or "ssl" describes all of the consoles or serial
+ lines. If you want to talk about console #3 or serial line #10, they
+ would be "con3" and "ssl10", respectively.
+
+
+ A specific device name will override a less general "con=" or "ssl=".
+ So, for example, you can assign a pty to each of the serial lines
+ except for the first two like this:
+
+
+ ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1
+
+
+
+
+ The specificity of the device name is all that matters; order on the
+ command line is irrelevant.
+
+
+
+ 5.2. Specifying the channel
+
+ There are a number of different types of channels to attach a UML
+ device to, each with a different way of specifying exactly what to
+ attach to.
+
+ o pseudo-terminals - device=pty pts terminals - device=pts
+
+
+ This will cause UML to allocate a free host pseudo-terminal for the
+ device. The terminal that it got will be announced in the boot
+ log. You access it by attaching a terminal program to the
+ corresponding tty:
+
+ o screen /dev/pts/n
+
+ o screen /dev/ttyxx
+
+ o minicom -o -p /dev/ttyxx - minicom seems not able to handle pts
+ devices
+
+ o kermit - start it up, 'open' the device, then 'connect'
+
+
+
+
+
+ o terminals - device=tty:tty device file
+
+
+ This will make UML attach the device to the specified tty (i.e
+
+
+ con1=tty:/dev/tty3
+
+
+
+
+ will attach UML's console 1 to the host's /dev/tty3). If the tty that
+ you specify is the slave end of a tty/pty pair, something else must
+ have already opened the corresponding pty in order for this to work.
+
+
+
+
+
+ o xterms - device=xterm
+
+
+ UML will run an xterm and the device will be attached to it.
+
+
+
+
+
+ o Port - device=port:port number
+
+
+ This will attach the UML devices to the specified host port.
+ Attaching console 1 to the host's port 9000 would be done like
+ this:
+
+
+ con1=port:9000
+
+
+
+
+ Attaching all the serial lines to that port would be done similarly:
+
+
+ ssl=port:9000
+
+
+
+
+ You access these devices by telnetting to that port. Each active tel-
+ net session gets a different device. If there are more telnets to a
+ port than UML devices attached to it, then the extra telnet sessions
+ will block until an existing telnet detaches, or until another device
+ becomes active (i.e. by being activated in /etc/inittab).
+
+ This channel has the advantage that you can both attach multiple UML
+ devices to it and know how to access them without reading the UML boot
+ log. It is also unique in allowing access to a UML from remote
+ machines without requiring that the UML be networked. This could be
+ useful in allowing public access to UMLs because they would be
+ accessible from the net, but wouldn't need any kind of network
+ filtering or access control because they would have no network access.
+
+
+ If you attach the main console to a portal, then the UML boot will
+ appear to hang. In reality, it's waiting for a telnet to connect, at
+ which point the boot will proceed.
+
+
+
+
+
+ o already-existing file descriptors - device=file descriptor
+
+
+ If you set up a file descriptor on the UML command line, you can
+ attach a UML device to it. This is most commonly used to put the
+ main console back on stdin and stdout after assigning all the other
+ consoles to something else:
+
+
+ con0=fd:0,fd:1 con=pts
+
+
+
+
+
+
+
+
+ o Nothing - device=null
+
+
+ This allows the device to be opened, in contrast to 'none', but
+ reads will block, and writes will succeed and the data will be
+ thrown out.
+
+
+
+
+
+ o None - device=none
+
+
+ This causes the device to disappear.
+
+
+
+ You can also specify different input and output channels for a device
+ by putting a comma between them:
+
+
+ ssl3=tty:/dev/tty2,xterm
+
+
+
+
+ will cause serial line 3 to accept input on the host's /dev/tty2 and
+ display output on an xterm. That's a silly example - the most common
+ use of this syntax is to reattach the main console to stdin and stdout
+ as shown above.
+
+
+ If you decide to move the main console away from stdin/stdout, the
+ initial boot output will appear in the terminal that you're running
+ UML in. However, once the console driver has been officially
+ initialized, then the boot output will start appearing wherever you
+ specified that console 0 should be. That device will receive all
+ subsequent output.
+
+
+
+ 5.3. Examples
+
+ There are a number of interesting things you can do with this
+ capability.
+
+
+ First, this is how you get rid of those bleeding console xterms by
+ attaching them to host ptys:
+
+
+ con=pty con0=fd:0,fd:1
+
+
+
+
+ This will make a UML console take over an unused host virtual console,
+ so that when you switch to it, you will see the UML login prompt
+ rather than the host login prompt:
+
+
+ con1=tty:/dev/tty6
+
+
+
+
+ You can attach two virtual machines together with what amounts to a
+ serial line as follows:
+
+ Run one UML with a serial line attached to a pty -
+
+
+ ssl1=pty
+
+
+
+
+ Look at the boot log to see what pty it got (this example will assume
+ that it got /dev/ptyp1).
+
+ Boot the other UML with a serial line attached to the corresponding
+ tty -
+
+
+ ssl1=tty:/dev/ttyp1
+
+
+
+
+ Log in, make sure that it has no getty on that serial line, attach a
+ terminal program like minicom to it, and you should see the login
+ prompt of the other virtual machine.
+
+
+ 6. Setting up the network
+
+
+
+ This page describes how to set up the various transports and to
+ provide a UML instance with network access to the host, other machines
+ on the local net, and the rest of the net.
+
+
+ As of 2.4.5, UML networking has been completely redone to make it much
+ easier to set up, fix bugs, and add new features.
+
+
+ There is a new helper, uml_net, which does the host setup that
+ requires root privileges.
+
+
+ There are currently five transport types available for a UML virtual
+ machine to exchange packets with other hosts:
+
+ o ethertap
+
+ o TUN/TAP
+
+ o Multicast
+
+ o a switch daemon
+
+ o slip
+
+ o slirp
+
+ o pcap
+
+ The TUN/TAP, ethertap, slip, and slirp transports allow a UML
+ instance to exchange packets with the host. They may be directed
+ to the host or the host may just act as a router to provide access
+ to other physical or virtual machines.
+
+
+ The pcap transport is a synthetic read-only interface, using the
+ libpcap binary to collect packets from interfaces on the host and
+ filter them. This is useful for building preconfigured traffic
+ monitors or sniffers.
+
+
+ The daemon and multicast transports provide a completely virtual
+ network to other virtual machines. This network is completely
+ disconnected from the physical network unless one of the virtual
+ machines on it is acting as a gateway.
+
+
+ With so many host transports, which one should you use? Here's when
+ you should use each one:
+
+ o ethertap - if you want access to the host networking and it is
+ running 2.2
+
+ o TUN/TAP - if you want access to the host networking and it is
+ running 2.4. Also, the TUN/TAP transport is able to use a
+ preconfigured device, allowing it to avoid using the setuid uml_net
+ helper, which is a security advantage.
+
+ o Multicast - if you want a purely virtual network and you don't want
+ to set up anything but the UML
+
+ o a switch daemon - if you want a purely virtual network and you
+ don't mind running the daemon in order to get somewhat better
+ performance
+
+ o slip - there is no particular reason to run the slip backend unless
+ ethertap and TUN/TAP are just not available for some reason
+
+ o slirp - if you don't have root access on the host to setup
+ networking, or if you don't want to allocate an IP to your UML
+
+ o pcap - not much use for actual network connectivity, but great for
+ monitoring traffic on the host
+
+ Ethertap is available on 2.4 and works fine. TUN/TAP is preferred
+ to it because it has better performance and ethertap is officially
+ considered obsolete in 2.4. Also, the root helper only needs to
+ run occasionally for TUN/TAP, rather than handling every packet, as
+ it does with ethertap. This is a slight security advantage since
+ it provides fewer opportunities for a nasty UML user to somehow
+ exploit the helper's root privileges.
+
+
+ 6.1. General setup
+
+ First, you must have the virtual network enabled in your UML. If are
+ running a prebuilt kernel from this site, everything is already
+ enabled. If you build the kernel yourself, under the "Network device
+ support" menu, enable "Network device support", and then the three
+ transports.
+
+
+ The next step is to provide a network device to the virtual machine.
+ This is done by describing it on the kernel command line.
+
+ The general format is
+
+
+ eth <n> = <transport> , <transport args>
+
+
+
+
+ For example, a virtual ethernet device may be attached to a host
+ ethertap device as follows:
+
+
+ eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
+
+
+
+
+ This sets up eth0 inside the virtual machine to attach itself to the
+ host /dev/tap0, assigns it an ethernet address, and assigns the host
+ tap0 interface an IP address.
+
+
+
+ Note that the IP address you assign to the host end of the tap device
+ must be different than the IP you assign to the eth device inside UML.
+ If you are short on IPs and don't want to consume two per UML, then
+ you can reuse the host's eth IP address for the host ends of the tap
+ devices. Internally, the UMLs must still get unique IPs for their eth
+ devices. You can also give the UMLs non-routable IPs (192.168.x.x or
+ 10.x.x.x) and have the host masquerade them. This will let outgoing
+ connections work, but incoming connections won't without more work,
+ such as port forwarding from the host.
+ Also note that when you configure the host side of an interface, it is
+ only acting as a gateway. It will respond to pings sent to it
+ locally, but is not useful to do that since it's a host interface.
+ You are not talking to the UML when you ping that interface and get a
+ response.
+
+
+ You can also add devices to a UML and remove them at runtime. See the
+ ``The Management Console'' page for details.
+
+
+ The sections below describe this in more detail.
+
+
+ Once you've decided how you're going to set up the devices, you boot
+ UML, log in, configure the UML side of the devices, and set up routes
+ to the outside world. At that point, you will be able to talk to any
+ other machines, physical or virtual, on the net.
+
+
+ If ifconfig inside UML fails and the network refuses to come up, run
+ tell you what went wrong.
+
+
+
+ 6.2. Userspace daemons
+
+ You will likely need the setuid helper, or the switch daemon, or both.
+ They are both installed with the RPM and deb, so if you've installed
+ either, you can skip the rest of this section.
+
+
+ If not, then you need to check them out of CVS, build them, and
+ install them. The helper is uml_net, in CVS /tools/uml_net, and the
+ daemon is uml_switch, in CVS /tools/uml_router. They are both built
+ with a plain 'make'. Both need to be installed in a directory that's
+ in your path - /usr/bin is recommend. On top of that, uml_net needs
+ to be setuid root.
+
+
+
+ 6.3. Specifying ethernet addresses
+
+ Below, you will see that the TUN/TAP, ethertap, and daemon interfaces
+ allow you to specify hardware addresses for the virtual ethernet
+ devices. This is generally not necessary. If you don't have a
+ specific reason to do it, you probably shouldn't. If one is not
+ specified on the command line, the driver will assign one based on the
+ device IP address. It will provide the address fe:fd:nn:nn:nn:nn
+ where nn.nn.nn.nn is the device IP address. This is nearly always
+ sufficient to guarantee a unique hardware address for the device. A
+ couple of exceptions are:
+
+ o Another set of virtual ethernet devices are on the same network and
+ they are assigned hardware addresses using a different scheme which
+ may conflict with the UML IP address-based scheme
+
+ o You aren't going to use the device for IP networking, so you don't
+ assign the device an IP address
+
+ If you let the driver provide the hardware address, you should make
+ sure that the device IP address is known before the interface is
+ brought up. So, inside UML, this will guarantee that:
+
+
+
+ UML#
+ ifconfig eth0 192.168.0.250 up
+
+
+
+
+ If you decide to assign the hardware address yourself, make sure that
+ the first byte of the address is even. Addresses with an odd first
+ byte are broadcast addresses, which you don't want assigned to a
+ device.
+
+
+
+ 6.4. UML interface setup
+
+ Once the network devices have been described on the command line, you
+ should boot UML and log in.
+
+
+ The first thing to do is bring the interface up:
+
+
+ UML# ifconfig ethn ip-address up
+
+
+
+
+ You should be able to ping the host at this point.
+
+
+ To reach the rest of the world, you should set a default route to the
+ host:
+
+
+ UML# route add default gw host ip
+
+
+
+
+ Again, with host ip of 192.168.0.4:
+
+
+ UML# route add default gw 192.168.0.4
+
+
+
+
+ This page used to recommend setting a network route to your local net.
+ This is wrong, because it will cause UML to try to figure out hardware
+ addresses of the local machines by arping on the interface to the
+ host. Since that interface is basically a single strand of ethernet
+ with two nodes on it (UML and the host) and arp requests don't cross
+ networks, they will fail to elicit any responses. So, what you want
+ is for UML to just blindly throw all packets at the host and let it
+ figure out what to do with them, which is what leaving out the network
+ route and adding the default route does.
+
+
+ Note: If you can't communicate with other hosts on your physical
+ ethernet, it's probably because of a network route that's
+ automatically set up. If you run 'route -n' and see a route that
+ looks like this:
+
+
+
+
+ Destination Gateway Genmask Flags Metric Ref Use Iface
+ 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
+
+
+
+
+ with a mask that's not 255.255.255.255, then replace it with a route
+ to your host:
+
+
+ UML#
+ route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0
+
+
+
+
+
+
+ UML#
+ route add -host 192.168.0.4 dev eth0
+
+
+
+
+ This, plus the default route to the host, will allow UML to exchange
+ packets with any machine on your ethernet.
+
+
+
+ 6.5. Multicast
+
+ The simplest way to set up a virtual network between multiple UMLs is
+ to use the mcast transport. This was written by Harald Welte and is
+ present in UML version 2.4.5-5um and later. Your system must have
+ multicast enabled in the kernel and there must be a multicast-capable
+ network device on the host. Normally, this is eth0, but if there is
+ no ethernet card on the host, then you will likely get strange error
+ messages when you bring the device up inside UML.
+
+
+ To use it, run two UMLs with
+
+
+ eth0=mcast
+
+
+
+
+ on their command lines. Log in, configure the ethernet device in each
+ machine with different IP addresses:
+
+
+ UML1# ifconfig eth0 192.168.0.254
+
+
+
+
+
+
+ UML2# ifconfig eth0 192.168.0.253
+
+
+
+
+ and they should be able to talk to each other.
+
+ The full set of command line options for this transport are
+
+
+
+ ethn=mcast,ethernet address,multicast
+ address,multicast port,ttl
+
+
+
+
+ Harald's original README is here <http://user-mode-linux.source-
+ forge.net/> and explains these in detail, as well as
+ some other issues.
+
+ There is also a related point-to-point only "ucast" transport.
+ This is useful when your network does not support multicast, and
+ all network connections are simple point to point links.
+
+ The full set of command line options for this transport are
+
+
+ ethn=ucast,ethernet address,remote address,listen port,remote port
+
+
+
+
+ 6.6. TUN/TAP with the uml_net helper
+
+ TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the
+ host. The TUN/TAP backend has been in UML since 2.4.9-3um.
+
+
+ The easiest way to get up and running is to let the setuid uml_net
+ helper do the host setup for you. This involves insmod-ing the tun.o
+ module if necessary, configuring the device, and setting up IP
+ forwarding, routing, and proxy arp. If you are new to UML networking,
+ do this first. If you're concerned about the security implications of
+ the setuid helper, use it to get up and running, then read the next
+ section to see how to have UML use a preconfigured tap device, which
+ avoids the use of uml_net.
+
+
+ If you specify an IP address for the host side of the device, the
+ uml_net helper will do all necessary setup on the host - the only
+ requirement is that TUN/TAP be available, either built in to the host
+ kernel or as the tun.o module.
+
+ The format of the command line switch to attach a device to a TUN/TAP
+ device is
+
+
+ eth <n> =tuntap,,, <IP address>
+
+
+
+
+ For example, this argument will attach the UML's eth0 to the next
+ available tap device and assign an ethernet address to it based on its
+ IP address
+
+
+ eth0=tuntap,,,192.168.0.254
+
+
+
+
+
+
+ Note that the IP address that must be used for the eth device inside
+ UML is fixed by the routing and proxy arp that is set up on the
+ TUN/TAP device on the host. You can use a different one, but it won't
+ work because reply packets won't reach the UML. This is a feature.
+ It prevents a nasty UML user from doing things like setting the UML IP
+ to the same as the network's nameserver or mail server.
+
+
+ There are a couple potential problems with running the TUN/TAP
+ transport on a 2.4 host kernel
+
+ o TUN/TAP seems not to work on 2.4.3 and earlier. Upgrade the host
+ kernel or use the ethertap transport.
+
+ o With an upgraded kernel, TUN/TAP may fail with
+
+
+ File descriptor in bad state
+
+
+
+
+ This is due to a header mismatch between the upgraded kernel and the
+ kernel that was originally installed on the machine. The fix is to
+ make sure that /usr/src/linux points to the headers for the running
+ kernel.
+
+ These were pointed out by Tim Robinson <timro at trkr dot net> in
+ <http://www.geocrawler.com/> name="this uml-
+ user post"> .
+
+
+
+ 6.7. TUN/TAP with a preconfigured tap device
+
+ If you prefer not to have UML use uml_net (which is somewhat
+ insecure), with UML 2.4.17-11, you can set up a TUN/TAP device
+ beforehand. The setup needs to be done as root, but once that's done,
+ there is no need for root assistance. Setting up the device is done
+ as follows:
+
+ o Create the device with tunctl (available from the UML utilities
+ tarball)
+
+
+
+
+ host# tunctl -u uid
+
+
+
+
+ where uid is the user id or username that UML will be run as. This
+ will tell you what device was created.
+
+ o Configure the device IP (change IP addresses and device name to
+ suit)
+
+
+
+
+ host# ifconfig tap0 192.168.0.254 up
+
+
+
+
+
+ o Set up routing and arping if desired - this is my recipe, there are
+ other ways of doing the same thing
+
+
+ host#
+ bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
+
+ host#
+ route add -host 192.168.0.253 dev tap0
+
+
+
+
+
+
+ host#
+ bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp'
+
+
+
+
+
+
+ host#
+ arp -Ds 192.168.0.253 eth0 pub
+
+
+
+
+ Note that this must be done every time the host boots - this configu-
+ ration is not stored across host reboots. So, it's probably a good
+ idea to stick it in an rc file. An even better idea would be a little
+ utility which reads the information from a config file and sets up
+ devices at boot time.
+
+ o Rather than using up two IPs and ARPing for one of them, you can
+ also provide direct access to your LAN by the UML by using a
+ bridge.
+
+
+ host#
+ brctl addbr br0
+
+
+
+
+
+
+ host#
+ ifconfig eth0 0.0.0.0 promisc up
+
+
+
+
+
+
+ host#
+ ifconfig tap0 0.0.0.0 promisc up
+
+
+
+
+
+
+ host#
+ ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
+
+
+
+
+
+
+
+ host#
+ brctl stp br0 off
+
+
+
+
+
+
+ host#
+ brctl setfd br0 1
+
+
+
+
+
+
+ host#
+ brctl sethello br0 1
+
+
+
+
+
+
+ host#
+ brctl addif br0 eth0
+
+
+
+
+
+
+ host#
+ brctl addif br0 tap0
+
+
+
+
+ Note that 'br0' should be setup using ifconfig with the existing IP
+ address of eth0, as eth0 no longer has its own IP.
+
+ o
+
+
+ Also, the /dev/net/tun device must be writable by the user running
+ UML in order for the UML to use the device that's been configured
+ for it. The simplest thing to do is
+
+
+ host# chmod 666 /dev/net/tun
+
+
+
+
+ Making it world-writable looks bad, but it seems not to be
+ exploitable as a security hole. However, it does allow anyone to cre-
+ ate useless tap devices (useless because they can't configure them),
+ which is a DOS attack. A somewhat more secure alternative would to be
+ to create a group containing all the users who have preconfigured tap
+ devices and chgrp /dev/net/tun to that group with mode 664 or 660.
+
+
+ o Once the device is set up, run UML with 'eth0=tuntap,device name'
+ (i.e. 'eth0=tuntap,tap0') on the command line (or do it with the
+ mconsole config command).
+
+ o Bring the eth device up in UML and you're in business.
+
+ If you don't want that tap device any more, you can make it non-
+ persistent with
+
+
+ host# tunctl -d tap device
+
+
+
+
+ Finally, tunctl has a -b (for brief mode) switch which causes it to
+ output only the name of the tap device it created. This makes it
+ suitable for capture by a script:
+
+
+ host# TAP=`tunctl -u 1000 -b`
+
+
+
+
+
+
+ 6.8. Ethertap
+
+ Ethertap is the general mechanism on 2.2 for userspace processes to
+ exchange packets with the kernel.
+
+
+
+ To use this transport, you need to describe the virtual network device
+ on the UML command line. The general format for this is
+
+
+ eth <n> =ethertap, <device> , <ethernet address> , <tap IP address>
+
+
+
+
+ So, the previous example
+
+
+ eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254
+
+
+
+
+ attaches the UML eth0 device to the host /dev/tap0, assigns it the
+ ethernet address fe:fd:0:0:0:1, and assigns the IP address
+ 192.168.0.254 to the tap device.
+
+
+
+ The tap device is mandatory, but the others are optional. If the
+ ethernet address is omitted, one will be assigned to it.
+
+
+ The presence of the tap IP address will cause the helper to run and do
+ whatever host setup is needed to allow the virtual machine to
+ communicate with the outside world. If you're not sure you know what
+ you're doing, this is the way to go.
+
+
+ If it is absent, then you must configure the tap device and whatever
+ arping and routing you will need on the host. However, even in this
+ case, the uml_net helper still needs to be in your path and it must be
+ setuid root if you're not running UML as root. This is because the
+ tap device doesn't support SIGIO, which UML needs in order to use
+ something as a source of input. So, the helper is used as a
+ convenient asynchronous IO thread.
+
+ If you're using the uml_net helper, you can ignore the following host
+ setup - uml_net will do it for you. You just need to make sure you
+ have ethertap available, either built in to the host kernel or
+ available as a module.
+
+
+ If you want to set things up yourself, you need to make sure that the
+ appropriate /dev entry exists. If it doesn't, become root and create
+ it as follows:
+
+
+ mknod /dev/tap <minor> c 36 <minor> + 16
+
+
+
+
+ For example, this is how to create /dev/tap0:
+
+
+ mknod /dev/tap0 c 36 0 + 16
+
+
+
+
+ You also need to make sure that the host kernel has ethertap support.
+ If ethertap is enabled as a module, you apparently need to insmod
+ ethertap once for each ethertap device you want to enable. So,
+
+
+ host#
+ insmod ethertap
+
+
+
+
+ will give you the tap0 interface. To get the tap1 interface, you need
+ to run
+
+
+ host#
+ insmod ethertap unit=1 -o ethertap1
+
+
+
+
+
+
+
+ 6.9. The switch daemon
+
+ Note: This is the daemon formerly known as uml_router, but which was
+ renamed so the network weenies of the world would stop growling at me.
+
+
+ The switch daemon, uml_switch, provides a mechanism for creating a
+ totally virtual network. By default, it provides no connection to the
+ host network (but see -tap, below).
+
+
+ The first thing you need to do is run the daemon. Running it with no
+ arguments will make it listen on a default pair of unix domain
+ sockets.
+
+
+ If you want it to listen on a different pair of sockets, use
+
+
+ -unix control socket data socket
+
+
+
+
+
+ If you want it to act as a hub rather than a switch, use
+
+
+ -hub
+
+
+
+
+
+ If you want the switch to be connected to host networking (allowing
+ the umls to get access to the outside world through the host), use
+
+
+ -tap tap0
+
+
+
+
+
+ Note that the tap device must be preconfigured (see "TUN/TAP with a
+ preconfigured tap device", above). If you're using a different tap
+ device than tap0, specify that instead of tap0.
+
+
+ uml_switch can be backgrounded as follows
+
+
+ host%
+ uml_switch [ options ] < /dev/null > /dev/null
+
+
+
+
+ The reason it doesn't background by default is that it listens to
+ stdin for EOF. When it sees that, it exits.
+
+
+ The general format of the kernel command line switch is
+
+
+
+ ethn=daemon,ethernet address,socket
+ type,control socket,data socket
+
+
+
+
+ You can leave off everything except the 'daemon'. You only need to
+ specify the ethernet address if the one that will be assigned to it
+ isn't acceptable for some reason. The rest of the arguments describe
+ how to communicate with the daemon. You should only specify them if
+ you told the daemon to use different sockets than the default. So, if
+ you ran the daemon with no arguments, running the UML on the same
+ machine with
+ eth0=daemon
+
+
+
+
+ will cause the eth0 driver to attach itself to the daemon correctly.
+
+
+
+ 6.10. Slip
+
+ Slip is another, less general, mechanism for a process to communicate
+ with the host networking. In contrast to the ethertap interface,
+ which exchanges ethernet frames with the host and can be used to
+ transport any higher-level protocol, it can only be used to transport
+ IP.
+
+
+ The general format of the command line switch is
+
+
+
+ ethn=slip,slip IP
+
+
+
+
+ The slip IP argument is the IP address that will be assigned to the
+ host end of the slip device. If it is specified, the helper will run
+ and will set up the host so that the virtual machine can reach it and
+ the rest of the network.
+
+
+ There are some oddities with this interface that you should be aware
+ of. You should only specify one slip device on a given virtual
+ machine, and its name inside UML will be 'umn', not 'eth0' or whatever
+ you specified on the command line. These problems will be fixed at
+ some point.
+
+
+
+ 6.11. Slirp
+
+ slirp uses an external program, usually /usr/bin/slirp, to provide IP
+ only networking connectivity through the host. This is similar to IP
+ masquerading with a firewall, although the translation is performed in
+ user-space, rather than by the kernel. As slirp does not set up any
+ interfaces on the host, or changes routing, slirp does not require
+ root access or setuid binaries on the host.
+
+
+ The general format of the command line switch for slirp is:
+
+
+
+ ethn=slirp,ethernet address,slirp path
+
+
+
+
+ The ethernet address is optional, as UML will set up the interface
+ with an ethernet address based upon the initial IP address of the
+ interface. The slirp path is generally /usr/bin/slirp, although it
+ will depend on distribution.
+
+
+ The slirp program can have a number of options passed to the command
+ line and we can't add them to the UML command line, as they will be
+ parsed incorrectly. Instead, a wrapper shell script can be written or
+ the options inserted into the /.slirprc file. More information on
+ all of the slirp options can be found in its man pages.
+
+
+ The eth0 interface on UML should be set up with the IP 10.2.0.15,
+ although you can use anything as long as it is not used by a network
+ you will be connecting to. The default route on UML should be set to
+ use
+
+
+ UML#
+ route add default dev eth0
+
+
+
+
+ slirp provides a number of useful IP addresses which can be used by
+ UML, such as 10.0.2.3 which is an alias for the DNS server specified
+ in /etc/resolv.conf on the host or the IP given in the 'dns' option
+ for slirp.
+
+
+ Even with a baudrate setting higher than 115200, the slirp connection
+ is limited to 115200. If you need it to go faster, the slirp binary
+ needs to be compiled with FULL_BOLT defined in config.h.
+
+
+
+ 6.12. pcap
+
+ The pcap transport is attached to a UML ethernet device on the command
+ line or with uml_mconsole with the following syntax:
+
+
+
+ ethn=pcap,host interface,filter
+ expression,option1,option2
+
+
+
+
+ The expression and options are optional.
+
+
+ The interface is whatever network device on the host you want to
+ sniff. The expression is a pcap filter expression, which is also what
+ tcpdump uses, so if you know how to specify tcpdump filters, you will
+ use the same expressions here. The options are up to two of
+ 'promisc', control whether pcap puts the host interface into
+ promiscuous mode. 'optimize' and 'nooptimize' control whether the pcap
+ expression optimizer is used.
+
+
+ Example:
+
+
+
+ eth0=pcap,eth0,tcp
+
+ eth1=pcap,eth0,!tcp
+
+
+
+ will cause the UML eth0 to emit all tcp packets on the host eth0 and
+ the UML eth1 to emit all non-tcp packets on the host eth0.
+
+
+
+ 6.13. Setting up the host yourself
+
+ If you don't specify an address for the host side of the ethertap or
+ slip device, UML won't do any setup on the host. So this is what is
+ needed to get things working (the examples use a host-side IP of
+ 192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
+ own network):
+
+ o The device needs to be configured with its IP address. Tap devices
+ are also configured with an mtu of 1484. Slip devices are
+ configured with a point-to-point address pointing at the UML ip
+ address.
+
+
+ host# ifconfig tap0 arp mtu 1484 192.168.0.251 up
+
+
+
+
+
+
+ host#
+ ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
+
+
+
+
+
+ o If a tap device is being set up, a route is set to the UML IP.
+
+
+ UML# route add -host 192.168.0.250 gw 192.168.0.251
+
+
+
+
+
+ o To allow other hosts on your network to see the virtual machine,
+ proxy arp is set up for it.
+
+
+ host# arp -Ds 192.168.0.250 eth0 pub
+
+
+
+
+
+ o Finally, the host is set up to route packets.
+
+
+ host# echo 1 > /proc/sys/net/ipv4/ip_forward
+
+
+
+
+
+
+
+
+
+
+ 7. Sharing Filesystems between Virtual Machines
+
+
+
+
+ 7.1. A warning
+
+ Don't attempt to share filesystems simply by booting two UMLs from the
+ same file. That's the same thing as booting two physical machines
+ from a shared disk. It will result in filesystem corruption.
+
+
+
+ 7.2. Using layered block devices
+
+ The way to share a filesystem between two virtual machines is to use
+ the copy-on-write (COW) layering capability of the ubd block driver.
+ As of 2.4.6-2um, the driver supports layering a read-write private
+ device over a read-only shared device. A machine's writes are stored
+ in the private device, while reads come from either device - the
+ private one if the requested block is valid in it, the shared one if
+ not. Using this scheme, the majority of data which is unchanged is
+ shared between an arbitrary number of virtual machines, each of which
+ has a much smaller file containing the changes that it has made. With
+ a large number of UMLs booting from a large root filesystem, this
+ leads to a huge disk space saving. It will also help performance,
+ since the host will be able to cache the shared data using a much
+ smaller amount of memory, so UML disk requests will be served from the
+ host's memory rather than its disks.
+
+
+
+
+ To add a copy-on-write layer to an existing block device file, simply
+ add the name of the COW file to the appropriate ubd switch:
+
+
+ ubd0=root_fs_cow,root_fs_debian_22
+
+
+
+
+ where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is
+ the existing shared filesystem. The COW file need not exist. If it
+ doesn't, the driver will create and initialize it. Once the COW file
+ has been initialized, it can be used on its own on the command line:
+
+
+ ubd0=root_fs_cow
+
+
+
+
+ The name of the backing file is stored in the COW file header, so it
+ would be redundant to continue specifying it on the command line.
+
+
+
+ 7.3. Note!
+
+ When checking the size of the COW file in order to see the gobs of
+ space that you're saving, make sure you use 'ls -ls' to see the actual
+ disk consumption rather than the length of the file. The COW file is
+ sparse, so the length will be very different from the disk usage.
+ Here is a 'ls -l' of a COW file and backing file from one boot and
+ shutdown:
+ host% ls -l cow.debian debian2.2
+ -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian
+ -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2
+
+
+
+
+ Doesn't look like much saved space, does it? Well, here's 'ls -ls':
+
+
+ host% ls -ls cow.debian debian2.2
+ 880 -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian
+ 525832 -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2
+
+
+
+
+ Now, you can see that the COW file has less than a meg of disk, rather
+ than 492 meg.
+
+
+
+ 7.4. Another warning
+
+ Once a filesystem is being used as a readonly backing file for a COW
+ file, do not boot directly from it or modify it in any way. Doing so
+ will invalidate any COW files that are using it. The mtime and size
+ of the backing file are stored in the COW file header at its creation,
+ and they must continue to match. If they don't, the driver will
+ refuse to use the COW file.
+
+
+
+
+ If you attempt to evade this restriction by changing either the
+ backing file or the COW header by hand, you will get a corrupted
+ filesystem.
+
+
+
+
+ Among other things, this means that upgrading the distribution in a
+ backing file and expecting that all of the COW files using it will see
+ the upgrade will not work.
+
+
+
+
+ 7.5. uml_moo : Merging a COW file with its backing file
+
+ Depending on how you use UML and COW devices, it may be advisable to
+ merge the changes in the COW file into the backing file every once in
+ a while.
+
+
+
+
+ The utility that does this is uml_moo. Its usage is
+
+
+ host% uml_moo COW file new backing file
+
+
+
+
+ There's no need to specify the backing file since that information is
+ already in the COW file header. If you're paranoid, boot the new
+ merged file, and if you're happy with it, move it over the old backing
+ file.
+
+
+
+
+ uml_moo creates a new backing file by default as a safety measure. It
+ also has a destructive merge option which will merge the COW file
+ directly into its current backing file. This is really only usable
+ when the backing file only has one COW file associated with it. If
+ there are multiple COWs associated with a backing file, a -d merge of
+ one of them will invalidate all of the others. However, it is
+ convenient if you're short of disk space, and it should also be
+ noticeably faster than a non-destructive merge.
+
+
+
+
+ uml_moo is installed with the UML deb and RPM. If you didn't install
+ UML from one of those packages, you can also get it from the UML
+ utilities <http://user-mode-linux.sourceforge.net/
+ utilities> tar file in tools/moo.
+
+
+
+
+
+
+
+
+ 8. Creating filesystems
+
+
+ You may want to create and mount new UML filesystems, either because
+ your root filesystem isn't large enough or because you want to use a
+ filesystem other than ext2.
+
+
+ This was written on the occasion of reiserfs being included in the
+ 2.4.1 kernel pool, and therefore the 2.4.1 UML, so the examples will
+ talk about reiserfs. This information is generic, and the examples
+ should be easy to translate to the filesystem of your choice.
+
+
+ 8.1. Create the filesystem file
+
+ dd is your friend. All you need to do is tell dd to create an empty
+ file of the appropriate size. I usually make it sparse to save time
+ and to avoid allocating disk space until it's actually used. For
+ example, the following command will create a sparse 100 meg file full
+ of zeroes.
+
+
+ host%
+ dd if=/dev/zero of=new_filesystem seek=100 count=1 bs=1M
+
+
+
+
+
+
+ 8.2. Assign the file to a UML device
+
+ Add an argument like the following to the UML command line:
+
+ ubd4=new_filesystem
+
+
+
+
+ making sure that you use an unassigned ubd device number.
+
+
+
+ 8.3. Creating and mounting the filesystem
+
+ Make sure that the filesystem is available, either by being built into
+ the kernel, or available as a module, then boot up UML and log in. If
+ the root filesystem doesn't have the filesystem utilities (mkfs, fsck,
+ etc), then get them into UML by way of the net or hostfs.
+
+
+ Make the new filesystem on the device assigned to the new file:
+
+
+ host# mkreiserfs /dev/ubd/4
+
+
+ <----------- MKREISERFSv2 ----------->
+
+ ReiserFS version 3.6.25
+ Block size 4096 bytes
+ Block count 25856
+ Used blocks 8212
+ Journal - 8192 blocks (18-8209), journal header is in block 8210
+ Bitmaps: 17
+ Root block 8211
+ Hash function "r5"
+ ATTENTION: ALL DATA WILL BE LOST ON '/dev/ubd/4'! (y/n)y
+ journal size 8192 (from 18)
+ Initializing journal - 0%....20%....40%....60%....80%....100%
+ Syncing..done.
+
+
+
+
+ Now, mount it:
+
+
+ UML#
+ mount /dev/ubd/4 /mnt
+
+
+
+
+ and you're in business.
+
+
+
+
+
+
+
+
+
+ 9. Host file access
+
+
+ If you want to access files on the host machine from inside UML, you
+ can treat it as a separate machine and either nfs mount directories
+ from the host or copy files into the virtual machine with scp or rcp.
+ However, since UML is running on the host, it can access those
+ files just like any other process and make them available inside the
+ virtual machine without needing to use the network.
+
+
+ This is now possible with the hostfs virtual filesystem. With it, you
+ can mount a host directory into the UML filesystem and access the
+ files contained in it just as you would on the host.
+
+
+ 9.1. Using hostfs
+
+ To begin with, make sure that hostfs is available inside the virtual
+ machine with
+
+
+ UML# cat /proc/filesystems
+
+
+
+ . hostfs should be listed. If it's not, either rebuild the kernel
+ with hostfs configured into it or make sure that hostfs is built as a
+ module and available inside the virtual machine, and insmod it.
+
+
+ Now all you need to do is run mount:
+
+
+ UML# mount none /mnt/host -t hostfs
+
+
+
+
+ will mount the host's / on the virtual machine's /mnt/host.
+
+
+ If you don't want to mount the host root directory, then you can
+ specify a subdirectory to mount with the -o switch to mount:
+
+
+ UML# mount none /mnt/home -t hostfs -o /home
+
+
+
+
+ will mount the hosts's /home on the virtual machine's /mnt/home.
+
+
+
+ 9.2. hostfs as the root filesystem
+
+ It's possible to boot from a directory hierarchy on the host using
+ hostfs rather than using the standard filesystem in a file.
+
+ To start, you need that hierarchy. The easiest way is to loop mount
+ an existing root_fs file:
+
+
+ host# mount root_fs uml_root_dir -o loop
+
+
+
+
+ You need to change the filesystem type of / in etc/fstab to be
+ 'hostfs', so that line looks like this:
+
+ /dev/ubd/0 / hostfs defaults 1 1
+
+
+
+
+ Then you need to chown to yourself all the files in that directory
+ that are owned by root. This worked for me:
+
+
+ host# find . -uid 0 -exec chown jdike {} \;
+
+
+
+
+ Next, make sure that your UML kernel has hostfs compiled in, not as a
+ module. Then run UML with the boot device pointing at that directory:
+
+
+ ubd0=/path/to/uml/root/directory
+
+
+
+
+ UML should then boot as it does normally.
+
+
+ 9.3. Building hostfs
+
+ If you need to build hostfs because it's not in your kernel, you have
+ two choices:
+
+
+
+ o Compiling hostfs into the kernel:
+
+
+ Reconfigure the kernel and set the 'Host filesystem' option under
+
+
+ o Compiling hostfs as a module:
+
+
+ Reconfigure the kernel and set the 'Host filesystem' option under
+ be in arch/um/fs/hostfs/hostfs.o. Install that in
+ /lib/modules/`uname -r`/fs in the virtual machine, boot it up, and
+
+
+ UML# insmod hostfs
+
+
+
+
+
+
+
+
+
+
+
+
+ 10. The Management Console
+
+
+
+ The UML management console is a low-level interface to the kernel,
+ somewhat like the i386 SysRq interface. Since there is a full-blown
+ operating system under UML, there is much greater flexibility possible
+ than with the SysRq mechanism.
+
+
+ There are a number of things you can do with the mconsole interface:
+
+ o get the kernel version
+
+ o add and remove devices
+
+ o halt or reboot the machine
+
+ o Send SysRq commands
+
+ o Pause and resume the UML
+
+
+ You need the mconsole client (uml_mconsole) which is present in CVS
+ (/tools/mconsole) in 2.4.5-9um and later, and will be in the RPM in
+ 2.4.6.
+
+
+ You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
+ When you boot UML, you'll see a line like:
+
+
+ mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
+
+
+
+
+ If you specify a unique machine id one the UML command line, i.e.
+
+
+ umid=debian
+
+
+
+
+ you'll see this
+
+
+ mconsole initialized on /home/jdike/.uml/debian/mconsole
+
+
+
+
+ That file is the socket that uml_mconsole will use to communicate with
+ UML. Run it with either the umid or the full path as its argument:
+
+
+ host% uml_mconsole debian
+
+
+
+
+ or
+
+
+ host% uml_mconsole /home/jdike/.uml/debian/mconsole
+
+
+
+
+ You'll get a prompt, at which you can run one of these commands:
+
+ o version
+
+ o halt
+
+ o reboot
+
+ o config
+
+ o remove
+
+ o sysrq
+
+ o help
+
+ o cad
+
+ o stop
+
+ o go
+
+
+ 10.1. version
+
+ This takes no arguments. It prints the UML version.
+
+
+ (mconsole) version
+ OK Linux usermode 2.4.5-9um #1 Wed Jun 20 22:47:08 EDT 2001 i686
+
+
+
+
+ There are a couple actual uses for this. It's a simple no-op which
+ can be used to check that a UML is running. It's also a way of
+ sending an interrupt to the UML. This is sometimes useful on SMP
+ hosts, where there's a bug which causes signals to UML to be lost,
+ often causing it to appear to hang. Sending such a UML the mconsole
+ version command is a good way to 'wake it up' before networking has
+ been enabled, as it does not do anything to the function of the UML.
+
+
+
+ 10.2. halt and reboot
+
+ These take no arguments. They shut the machine down immediately, with
+ no syncing of disks and no clean shutdown of userspace. So, they are
+ pretty close to crashing the machine.
+
+
+ (mconsole) halt
+ OK
+
+
+
+
+
+
+ 10.3. config
+
+ "config" adds a new device to the virtual machine. Currently the ubd
+ and network drivers support this. It takes one argument, which is the
+ device to add, with the same syntax as the kernel command line.
+
+
+
+
+ (mconsole)
+ config ubd3=/home/jdike/incoming/roots/root_fs_debian22
+
+ OK
+ (mconsole) config eth1=mcast
+ OK
+
+
+
+
+
+
+ 10.4. remove
+
+ "remove" deletes a device from the system. Its argument is just the
+ name of the device to be removed. The device must be idle in whatever
+ sense the driver considers necessary. In the case of the ubd driver,
+ the removed block device must not be mounted, swapped on, or otherwise
+ open, and in the case of the network driver, the device must be down.
+
+
+ (mconsole) remove ubd3
+ OK
+ (mconsole) remove eth1
+ OK
+
+
+
+
+
+
+ 10.5. sysrq
+
+ This takes one argument, which is a single letter. It calls the
+ generic kernel's SysRq driver, which does whatever is called for by
+ that argument. See the SysRq documentation in Documentation/sysrq.txt
+ in your favorite kernel tree to see what letters are valid and what
+ they do.
+
+
+
+ 10.6. help
+
+ "help" returns a string listing the valid commands and what each one
+ does.
+
+
+
+ 10.7. cad
+
+ This invokes the Ctl-Alt-Del action on init. What exactly this ends
+ up doing is up to /etc/inittab. Normally, it reboots the machine.
+ With UML, this is usually not desired, so if a halt would be better,
+ then find the section of inittab that looks like this
+
+
+ # What to do when CTRL-ALT-DEL is pressed.
+ ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now
+
+
+
+
+ and change the command to halt.
+
+
+
+ 10.8. stop
+
+ This puts the UML in a loop reading mconsole requests until a 'go'
+ mconsole command is received. This is very useful for making backups
+ of UML filesystems, as the UML can be stopped, then synced via 'sysrq
+ s', so that everything is written to the filesystem. You can then copy
+ the filesystem and then send the UML 'go' via mconsole.
+
+
+ Note that a UML running with more than one CPU will have problems
+ after you send the 'stop' command, as only one CPU will be held in a
+ mconsole loop and all others will continue as normal. This is a bug,
+ and will be fixed.
+
+
+
+ 10.9. go
+
+ This resumes a UML after being paused by a 'stop' command. Note that
+ when the UML has resumed, TCP connections may have timed out and if
+ the UML is paused for a long period of time, crond might go a little
+ crazy, running all the jobs it didn't do earlier.
+
+
+
+
+
+
+
+
+ 11. Kernel debugging
+
+
+ Note: The interface that makes debugging, as described here, possible
+ is present in 2.4.0-test6 kernels and later.
+
+
+ Since the user-mode kernel runs as a normal Linux process, it is
+ possible to debug it with gdb almost like any other process. It is
+ slightly different because the kernel's threads are already being
+ ptraced for system call interception, so gdb can't ptrace them.
+ However, a mechanism has been added to work around that problem.
+
+
+ In order to debug the kernel, you need build it from source. See
+ ``Compiling the kernel and modules'' for information on doing that.
+ Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
+ the config. These will compile the kernel with -g, and enable the
+ ptrace proxy so that gdb works with UML, respectively.
+
+
+
+
+ 11.1. Starting the kernel under gdb
+
+ You can have the kernel running under the control of gdb from the
+ beginning by putting 'debug' on the command line. You will get an
+ xterm with gdb running inside it. The kernel will send some commands
+ to gdb which will leave it stopped at the beginning of start_kernel.
+ At this point, you can get things going with 'next', 'step', or
+ 'cont'.
+
+
+ There is a transcript of a debugging session here <debug-
+ session.html> , with breakpoints being set in the scheduler and in an
+ interrupt handler.
+ 11.2. Examining sleeping processes
+
+ Not every bug is evident in the currently running process. Sometimes,
+ processes hang in the kernel when they shouldn't because they've
+ deadlocked on a semaphore or something similar. In this case, when
+ you ^C gdb and get a backtrace, you will see the idle thread, which
+ isn't very relevant.
+
+
+ What you want is the stack of whatever process is sleeping when it
+ shouldn't be. You need to figure out which process that is, which is
+ generally fairly easy. Then you need to get its host process id,
+ which you can do either by looking at ps on the host or at
+ task.thread.extern_pid in gdb.
+
+
+ Now what you do is this:
+
+ o detach from the current thread
+
+
+ (UML gdb) det
+
+
+
+
+
+ o attach to the thread you are interested in
+
+
+ (UML gdb) att <host pid>
+
+
+
+
+
+ o look at its stack and anything else of interest
+
+
+ (UML gdb) bt
+
+
+
+
+ Note that you can't do anything at this point that requires that a
+ process execute, e.g. calling a function
+
+ o when you're done looking at that process, reattach to the current
+ thread and continue it
+
+
+ (UML gdb)
+ att 1
+
+
+
+
+
+
+ (UML gdb)
+ c
+
+
+
+
+ Here, specifying any pid which is not the process id of a UML thread
+ will cause gdb to reattach to the current thread. I commonly use 1,
+ but any other invalid pid would work.
+
+
+
+ 11.3. Running ddd on UML
+
+ ddd works on UML, but requires a special kludge. The process goes
+ like this:
+
+ o Start ddd
+
+
+ host% ddd linux
+
+
+
+
+
+ o With ps, get the pid of the gdb that ddd started. You can ask the
+ gdb to tell you, but for some reason that confuses things and
+ causes a hang.
+
+ o run UML with 'debug=parent gdb-pid=<pid>' added to the command line
+ - it will just sit there after you hit return
+
+ o type 'att 1' to the ddd gdb and you will see something like
+
+
+ 0xa013dc51 in __kill ()
+
+
+ (gdb)
+
+
+
+
+
+ o At this point, type 'c', UML will boot up, and you can use ddd just
+ as you do on any other process.
+
+
+
+ 11.4. Debugging modules
+
+ gdb has support for debugging code which is dynamically loaded into
+ the process. This support is what is needed to debug kernel modules
+ under UML.
+
+
+ Using that support is somewhat complicated. You have to tell gdb what
+ object file you just loaded into UML and where in memory it is. Then,
+ it can read the symbol table, and figure out where all the symbols are
+ from the load address that you provided. It gets more interesting
+ when you load the module again (i.e. after an rmmod). You have to
+ tell gdb to forget about all its symbols, including the main UML ones
+ for some reason, then load then all back in again.
+
+
+ There's an easy way and a hard way to do this. The easy way is to use
+ the umlgdb expect script written by Chandan Kudige. It basically
+ automates the process for you.
+
+
+ First, you must tell it where your modules are. There is a list in
+ the script that looks like this:
+ set MODULE_PATHS {
+ "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
+ "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
+ "minix" "/usr/src/uml/linux-2.4.18/fs/minix/minix.o"
+ }
+
+
+
+
+ You change that to list the names and paths of the modules that you
+ are going to debug. Then you run it from the toplevel directory of
+ your UML pool and it basically tells you what to do:
+
+
+
+
+ ******** GDB pid is 21903 ********
+ Start UML as: ./linux <kernel switches> debug gdb-pid=21903
+
+
+
+ GNU gdb 5.0rh-5 Red Hat Linux 7.1
+ Copyright 2001 Free Software Foundation, Inc.
+ GDB is free software, covered by the GNU General Public License, and you are
+ welcome to change it and/or distribute copies of it under certain conditions.
+ Type "show copying" to see the conditions.
+ There is absolutely no warranty for GDB. Type "show warranty" for details.
+ This GDB was configured as "i386-redhat-linux"...
+ (gdb) b sys_init_module
+ Breakpoint 1 at 0xa0011923: file module.c, line 349.
+ (gdb) att 1
+
+
+
+
+ After you run UML and it sits there doing nothing, you hit return at
+ the 'att 1' and continue it:
+
+
+ Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1
+ 0xa00f4221 in __kill ()
+ (UML gdb) c
+ Continuing.
+
+
+
+
+ At this point, you debug normally. When you insmod something, the
+ expect magic will kick in and you'll see something like:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ *** Module hostfs loaded ***
+ Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs",
+ mod_user=0x8070e00) at module.c:349
+ 349 char *name, *n_name, *name_tmp = NULL;
+ (UML gdb) finish
+ Run till exit from #0 sys_init_module (name_user=0x805abb0 "hostfs",
+ mod_user=0x8070e00) at module.c:349
+ 0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411
+ 411 else res = EXECUTE_SYSCALL(syscall, regs);
+ Value returned is $1 = 0
+ (UML gdb)
+ p/x (int)module_list + module_list->size_of_struct
+
+ $2 = 0xa9021054
+ (UML gdb) symbol-file ./linux
+ Load new symbol table from "./linux"? (y or n) y
+ Reading symbols from ./linux...
+ done.
+ (UML gdb)
+ add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054
+
+ add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at
+ .text_addr = 0xa9021054
+ (y or n) y
+
+ Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o...
+ done.
+ (UML gdb) p *module_list
+ $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs",
+ size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1,
+ nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0,
+ init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>,
+ ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0,
+ persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0,
+ kallsyms_end = 0x0,
+ archdata_start = 0x1b855 <Address 0x1b855 out of bounds>,
+ archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>,
+ kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>}
+ >> Finished loading symbols for hostfs ...
+
+
+
+
+ That's the easy way. It's highly recommended. The hard way is
+ described below in case you're interested in what's going on.
+
+
+ Boot the kernel under the debugger and load the module with insmod or
+ modprobe. With gdb, do:
+
+
+ (UML gdb) p module_list
+
+
+
+
+ This is a list of modules that have been loaded into the kernel, with
+ the most recently loaded module first. Normally, the module you want
+ is at module_list. If it's not, walk down the next links, looking at
+ the name fields until find the module you want to debug. Take the
+ address of that structure, and add module.size_of_struct (which in
+ 2.4.10 kernels is 96 (0x60)) to it. Gdb can make this hard addition
+ for you :-):
+
+
+
+ (UML gdb)
+ printf "%#x\n", (int)module_list module_list->size_of_struct
+
+
+
+
+ The offset from the module start occasionally changes (before 2.4.0,
+ it was module.size_of_struct + 4), so it's a good idea to check the
+ init and cleanup addresses once in a while, as describe below. Now
+ do:
+
+
+ (UML gdb)
+ add-symbol-file /path/to/module/on/host that_address
+
+
+
+
+ Tell gdb you really want to do it, and you're in business.
+
+
+ If there's any doubt that you got the offset right, like breakpoints
+ appear not to work, or they're appearing in the wrong place, you can
+ check it by looking at the module structure. The init and cleanup
+ fields should look like:
+
+
+ init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs>
+
+
+
+
+ with no offsets on the symbol names. If the names are right, but they
+ are offset, then the offset tells you how much you need to add to the
+ address you gave to add-symbol-file.
+
+
+ When you want to load in a new version of the module, you need to get
+ gdb to forget about the old one. The only way I've found to do that
+ is to tell gdb to forget about all symbols that it knows about:
+
+
+ (UML gdb) symbol-file
+
+
+
+
+ Then reload the symbols from the kernel binary:
+
+
+ (UML gdb) symbol-file /path/to/kernel
+
+
+
+
+ and repeat the process above. You'll also need to re-enable break-
+ points. They were disabled when you dumped all the symbols because
+ gdb couldn't figure out where they should go.
+
+
+
+ 11.5. Attaching gdb to the kernel
+
+ If you don't have the kernel running under gdb, you can attach gdb to
+ it later by sending the tracing thread a SIGUSR1. The first line of
+ the console output identifies its pid:
+ tracing thread pid = 20093
+
+
+
+
+ When you send it the signal:
+
+
+ host% kill -USR1 20093
+
+
+
+
+ you will get an xterm with gdb running in it.
+
+
+ If you have the mconsole compiled into UML, then the mconsole client
+ can be used to start gdb:
+
+
+ (mconsole) (mconsole) config gdb=xterm
+
+
+
+
+ will fire up an xterm with gdb running in it.
+
+
+
+ 11.6. Using alternate debuggers
+
+ UML has support for attaching to an already running debugger rather
+ than starting gdb itself. This is present in CVS as of 17 Apr 2001.
+ I sent it to Alan for inclusion in the ac tree, and it will be in my
+ 2.4.4 release.
+
+
+ This is useful when gdb is a subprocess of some UI, such as emacs or
+ ddd. It can also be used to run debuggers other than gdb on UML.
+ Below is an example of using strace as an alternate debugger.
+
+
+ To do this, you need to get the pid of the debugger and pass it in
+ with the
+
+
+ If you are using gdb under some UI, then tell it to 'att 1', and
+ you'll find yourself attached to UML.
+
+
+ If you are using something other than gdb as your debugger, then
+ you'll need to get it to do the equivalent of 'att 1' if it doesn't do
+ it automatically.
+
+
+ An example of an alternate debugger is strace. You can strace the
+ actual kernel as follows:
+
+ o Run the following in a shell
+
+
+ host%
+ sh -c 'echo pid=$$; echo -n hit return; read x; exec strace -p 1 -o strace.out'
+
+
+
+ o Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out
+ by the previous command
+
+ o Hit return in the shell, and UML will start running, and strace
+ output will start accumulating in the output file.
+
+ Note that this is different from running
+
+
+ host% strace ./linux
+
+
+
+
+ That will strace only the main UML thread, the tracing thread, which
+ doesn't do any of the actual kernel work. It just oversees the vir-
+ tual machine. In contrast, using strace as described above will show
+ you the low-level activity of the virtual machine.
+
+
+
+
+
+ 12. Kernel debugging examples
+
+ 12.1. The case of the hung fsck
+
+ When booting up the kernel, fsck failed, and dropped me into a shell
+ to fix things up. I ran fsck -y, which hung:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Setting hostname uml [ OK ]
+ Checking root filesystem
+ /dev/fhd0 was not cleanly unmounted, check forced.
+ Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
+
+ /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
+ (i.e., without -a or -p options)
+ [ FAILED ]
+
+ *** An error occurred during the file system check.
+ *** Dropping you to a shell; the system will reboot
+ *** when you leave the shell.
+ Give root password for maintenance
+ (or type Control-D for normal startup):
+
+ [root@uml /root]# fsck -y /dev/fhd0
+ fsck -y /dev/fhd0
+ Parallelizing fsck version 1.14 (9-Jan-1999)
+ e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
+ /dev/fhd0 contains a file system with errors, check forced.
+ Pass 1: Checking inodes, blocks, and sizes
+ Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes
+
+ Inode 19780, i_blocks is 1548, should be 540. Fix? yes
+
+ Pass 2: Checking directory structure
+ Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
+
+ Directory inode 11858, block 0, offset 0: directory corrupted
+ Salvage? yes
+
+ Missing '.' in directory inode 11858.
+ Fix? yes
+
+ Missing '..' in directory inode 11858.
+ Fix? yes
+
+
+
+
+
+ The standard drill in this sort of situation is to fire up gdb on the
+ signal thread, which, in this case, was pid 1935. In another window,
+ I run gdb and attach pid 1935.
+
+
+
+
+ ~/linux/2.3.26/um 1016: gdb linux
+ GNU gdb 4.17.0.11 with Linux support
+ Copyright 1998 Free Software Foundation, Inc.
+ GDB is free software, covered by the GNU General Public License, and you are
+ welcome to change it and/or distribute copies of it under certain conditions.
+ Type "show copying" to see the conditions.
+ There is absolutely no warranty for GDB. Type "show warranty" for details.
+ This GDB was configured as "i386-redhat-linux"...
+
+ (gdb) att 1935
+ Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1935
+ 0x100756d9 in __wait4 ()
+
+
+
+
+
+
+ Let's see what's currently running:
+
+
+
+ (gdb) p current_task.pid
+ $1 = 0
+
+
+
+
+
+ It's the idle thread, which means that fsck went to sleep for some
+ reason and never woke up.
+
+
+ Let's guess that the last process in the process list is fsck:
+
+
+
+ (gdb) p current_task.prev_task.comm
+ $13 = "fsck.ext2\000\000\000\000\000\000"
+
+
+
+
+
+ It is, so let's see what it thinks it's up to:
+
+
+
+ (gdb) p current_task.prev_task.thread
+ $14 = {extern_pid = 1980, tracing = 0, want_tracing = 0, forking = 0,
+ kernel_stack_page = 0, signal_stack = 1342627840, syscall = {id = 4, args = {
+ 3, 134973440, 1024, 0, 1024}, have_result = 0, result = 50590720},
+ request = {op = 2, u = {exec = {ip = 1350467584, sp = 2952789424}, fork = {
+ regs = {1350467584, 2952789424, 0 <repeats 15 times>}, sigstack = 0,
+ pid = 0}, switch_to = 0x507e8000, thread = {proc = 0x507e8000,
+ arg = 0xaffffdb0, flags = 0, new_pid = 0}, input_request = {
+ op = 1350467584, fd = -1342177872, proc = 0, pid = 0}}}}
+
+
+
+
+
+ The interesting things here are the fact that its .thread.syscall.id
+ is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or
+ the defines in include/asm-um/arch/unistd.h), and that it never
+ returned. Also, its .request.op is OP_SWITCH (see
+ arch/um/include/user_util.h). These mean that it went into a write,
+ and, for some reason, called schedule().
+
+
+ The fact that it never returned from write means that its stack should
+ be fairly interesting. Its pid is 1980 (.thread.extern_pid). That
+ process is being ptraced by the signal thread, so it must be detached
+ before gdb can attach it:
+
+
+
+
+
+
+
+
+
+
+ (gdb) call detach(1980)
+
+ Program received signal SIGSEGV, Segmentation fault.
+ <function called from gdb>
+ The program being debugged stopped while in a function called from GDB.
+ When the function (detach) is done executing, GDB will silently
+ stop (instead of continuing to evaluate the expression containing
+ the function call).
+ (gdb) call detach(1980)
+ $15 = 0
+
+
+
+
+
+ The first detach segfaults for some reason, and the second one
+ succeeds.
+
+
+ Now I detach from the signal thread, attach to the fsck thread, and
+ look at its stack:
+
+
+ (gdb) det
+ Detaching from program: /home/dike/linux/2.3.26/um/linux Pid 1935
+ (gdb) att 1980
+ Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1980
+ 0x10070451 in __kill ()
+ (gdb) bt
+ #0 0x10070451 in __kill ()
+ #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30
+ #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
+ at process_kern.c:156
+ #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
+ at process_kern.c:161
+ #4 0x10001d12 in schedule () at sched.c:777
+ #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
+ #6 0x1006aa10 in __down_failed () at semaphore.c:157
+ #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
+ #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+ #9 <signal handler called>
+ #10 0x10155404 in errno ()
+ #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
+ #12 0x1006c5d8 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+ #13 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+ #14 <signal handler called>
+ #15 0xc0fd in ?? ()
+ #16 0x10016647 in sys_write (fd=3,
+ buf=0x80b8800 <Address 0x80b8800 out of bounds>, count=1024)
+ at read_write.c:159
+ #17 0x1006d5b3 in execute_syscall (syscall=4, args=0x5006ef08)
+ at syscall_kern.c:254
+ #18 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
+ #19 <signal handler called>
+ #20 0x400dc8b0 in ?? ()
+
+
+
+
+
+ The interesting things here are :
+
+ o There are two segfaults on this stack (frames 9 and 14)
+
+ o The first faulting address (frame 11) is 0x50000800
+
+ (gdb) p (void *)1342179328
+ $16 = (void *) 0x50000800
+
+
+
+
+
+ The initial faulting address is interesting because it is on the idle
+ thread's stack. I had been seeing the idle thread segfault for no
+ apparent reason, and the cause looked like stack corruption. In hopes
+ of catching the culprit in the act, I had turned off all protections
+ to that stack while the idle thread wasn't running. This apparently
+ tripped that trap.
+
+
+ However, the more immediate problem is that second segfault and I'm
+ going to concentrate on that. First, I want to see where the fault
+ happened, so I have to go look at the sigcontent struct in frame 8:
+
+
+
+ (gdb) up
+ #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30
+ 30 kill(pid, SIGUSR1);
+ (gdb)
+ #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000)
+ at process_kern.c:156
+ 156 usr1_pid(getpid());
+ (gdb)
+ #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000)
+ at process_kern.c:161
+ 161 _switch_to(prev, next);
+ (gdb)
+ #4 0x10001d12 in schedule () at sched.c:777
+ 777 switch_to(prev, next, prev);
+ (gdb)
+ #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71
+ 71 schedule();
+ (gdb)
+ #6 0x1006aa10 in __down_failed () at semaphore.c:157
+ 157 }
+ (gdb)
+ #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174
+ 174 segv(sc->cr2, sc->err & 2);
+ (gdb)
+ #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182
+ 182 segv_handler(sc);
+ (gdb) p *sc
+ Cannot access memory at address 0x0.
+
+
+
+
+ That's not very useful, so I'll try a more manual method:
+
+
+ (gdb) p *((struct sigcontext *) (&sig + 1))
+ $19 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
+ __dsh = 0, edi = 1342179328, esi = 1350378548, ebp = 1342630440,
+ esp = 1342630420, ebx = 1348150624, edx = 1280, ecx = 0, eax = 0,
+ trapno = 14, err = 4, eip = 268480945, cs = 35, __csh = 0, eflags = 66118,
+ esp_at_signal = 1342630420, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
+ cr2 = 1280}
+
+
+
+ The ip is in handle_mm_fault:
+
+
+ (gdb) p (void *)268480945
+ $20 = (void *) 0x1000b1b1
+ (gdb) i sym $20
+ handle_mm_fault + 57 in section .text
+
+
+
+
+
+ Specifically, it's in pte_alloc:
+
+
+ (gdb) i line *$20
+ Line 124 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b1b1 <handle_mm_fault+57>
+ and ends at 0x1000b1b7 <handle_mm_fault+63>.
+
+
+
+
+
+ To find where in handle_mm_fault this is, I'll jump forward in the
+ code until I see an address in that procedure:
+
+
+
+ (gdb) i line *0x1000b1c0
+ Line 126 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b1b7 <handle_mm_fault+63>
+ and ends at 0x1000b1c3 <handle_mm_fault+75>.
+ (gdb) i line *0x1000b1d0
+ Line 131 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b1d0 <handle_mm_fault+88>
+ and ends at 0x1000b1da <handle_mm_fault+98>.
+ (gdb) i line *0x1000b1e0
+ Line 61 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b1da <handle_mm_fault+98>
+ and ends at 0x1000b1e1 <handle_mm_fault+105>.
+ (gdb) i line *0x1000b1f0
+ Line 134 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b1f0 <handle_mm_fault+120>
+ and ends at 0x1000b200 <handle_mm_fault+136>.
+ (gdb) i line *0x1000b200
+ Line 135 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b200 <handle_mm_fault+136>
+ and ends at 0x1000b208 <handle_mm_fault+144>.
+ (gdb) i line *0x1000b210
+ Line 139 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h"
+ starts at address 0x1000b210 <handle_mm_fault+152>
+ and ends at 0x1000b219 <handle_mm_fault+161>.
+ (gdb) i line *0x1000b220
+ Line 1168 of "memory.c" starts at address 0x1000b21e <handle_mm_fault+166>
+ and ends at 0x1000b222 <handle_mm_fault+170>.
+
+
+
+
+
+ Something is apparently wrong with the page tables or vma_structs, so
+ lets go back to frame 11 and have a look at them:
+
+
+
+ #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50
+ 50 handle_mm_fault(current, vma, address, is_write);
+ (gdb) call pgd_offset_proc(vma->vm_mm, address)
+ $22 = (pgd_t *) 0x80a548c
+
+
+
+
+
+ That's pretty bogus. Page tables aren't supposed to be in process
+ text or data areas. Let's see what's in the vma:
+
+
+ (gdb) p *vma
+ $23 = {vm_mm = 0x507d2434, vm_start = 0, vm_end = 134512640,
+ vm_next = 0x80a4f8c, vm_page_prot = {pgprot = 0}, vm_flags = 31200,
+ vm_avl_height = 2058, vm_avl_left = 0x80a8c94, vm_avl_right = 0x80d1000,
+ vm_next_share = 0xaffffdb0, vm_pprev_share = 0xaffffe63,
+ vm_ops = 0xaffffe7a, vm_pgoff = 2952789626, vm_file = 0xafffffec,
+ vm_private_data = 0x62}
+ (gdb) p *vma.vm_mm
+ $24 = {mmap = 0x507d2434, mmap_avl = 0x0, mmap_cache = 0x8048000,
+ pgd = 0x80a4f8c, mm_users = {counter = 0}, mm_count = {counter = 134904288},
+ map_count = 134909076, mmap_sem = {count = {counter = 135073792},
+ sleepers = -1342177872, wait = {lock = <optimized out or zero length>,
+ task_list = {next = 0xaffffe63, prev = 0xaffffe7a},
+ __magic = -1342177670, __creator = -1342177300}, __magic = 98},
+ page_table_lock = {}, context = 138, start_code = 0, end_code = 0,
+ start_data = 0, end_data = 0, start_brk = 0, brk = 0, start_stack = 0,
+ arg_start = 0, arg_end = 0, env_start = 0, env_end = 0, rss = 1350381536,
+ total_vm = 0, locked_vm = 0, def_flags = 0, cpu_vm_mask = 0, swap_cnt = 0,
+ swap_address = 0, segments = 0x0}
+
+
+
+
+
+ This also pretty bogus. With all of the 0x80xxxxx and 0xaffffxxx
+ addresses, this is looking like a stack was plonked down on top of
+ these structures. Maybe it's a stack overflow from the next page:
+
+
+
+ (gdb) p vma
+ $25 = (struct vm_area_struct *) 0x507d2434
+
+
+
+
+
+ That's towards the lower quarter of the page, so that would have to
+ have been pretty heavy stack overflow:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (gdb) x/100x $25
+ 0x507d2434: 0x507d2434 0x00000000 0x08048000 0x080a4f8c
+ 0x507d2444: 0x00000000 0x080a79e0 0x080a8c94 0x080d1000
+ 0x507d2454: 0xaffffdb0 0xaffffe63 0xaffffe7a 0xaffffe7a
+ 0x507d2464: 0xafffffec 0x00000062 0x0000008a 0x00000000
+ 0x507d2474: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2484: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2494: 0x00000000 0x00000000 0x507d2fe0 0x00000000
+ 0x507d24a4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d24b4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d24c4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d24d4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d24e4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d24f4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2504: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2514: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2524: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2534: 0x00000000 0x00000000 0x507d25dc 0x00000000
+ 0x507d2544: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2554: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2564: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2574: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2584: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d2594: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d25a4: 0x00000000 0x00000000 0x00000000 0x00000000
+ 0x507d25b4: 0x00000000 0x00000000 0x00000000 0x00000000
+
+
+
+
+
+ It's not stack overflow. The only "stack-like" piece of this data is
+ the vma_struct itself.
+
+
+ At this point, I don't see any avenues to pursue, so I just have to
+ admit that I have no idea what's going on. What I will do, though, is
+ stick a trap on the segfault handler which will stop if it sees any
+ writes to the idle thread's stack. That was the thing that happened
+ first, and it may be that if I can catch it immediately, what's going
+ on will be somewhat clearer.
+
+
+ 12.2. Episode 2: The case of the hung fsck
+
+ After setting a trap in the SEGV handler for accesses to the signal
+ thread's stack, I reran the kernel.
+
+
+ fsck hung again, this time by hitting the trap:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Setting hostname uml [ OK ]
+ Checking root filesystem
+ /dev/fhd0 contains a file system with errors, check forced.
+ Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780.
+
+ /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
+ (i.e., without -a or -p options)
+ [ FAILED ]
+
+ *** An error occurred during the file system check.
+ *** Dropping you to a shell; the system will reboot
+ *** when you leave the shell.
+ Give root password for maintenance
+ (or type Control-D for normal startup):
+
+ [root@uml /root]# fsck -y /dev/fhd0
+ fsck -y /dev/fhd0
+ Parallelizing fsck version 1.14 (9-Jan-1999)
+ e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09
+ /dev/fhd0 contains a file system with errors, check forced.
+ Pass 1: Checking inodes, blocks, and sizes
+ Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes
+
+ Pass 2: Checking directory structure
+ Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes
+
+ Directory inode 11858, block 0, offset 0: directory corrupted
+ Salvage? yes
+
+ Missing '.' in directory inode 11858.
+ Fix? yes
+
+ Missing '..' in directory inode 11858.
+ Fix? yes
+
+ Untested (4127) [100fe44c]: trap_kern.c line 31
+
+
+
+
+
+ I need to get the signal thread to detach from pid 4127 so that I can
+ attach to it with gdb. This is done by sending it a SIGUSR1, which is
+ caught by the signal thread, which detaches the process:
+
+
+ kill -USR1 4127
+
+
+
+
+
+ Now I can run gdb on it:
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ~/linux/2.3.26/um 1034: gdb linux
+ GNU gdb 4.17.0.11 with Linux support
+ Copyright 1998 Free Software Foundation, Inc.
+ GDB is free software, covered by the GNU General Public License, and you are
+ welcome to change it and/or distribute copies of it under certain conditions.
+ Type "show copying" to see the conditions.
+ There is absolutely no warranty for GDB. Type "show warranty" for details.
+ This GDB was configured as "i386-redhat-linux"...
+ (gdb) att 4127
+ Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127
+ 0x10075891 in __libc_nanosleep ()
+
+
+
+
+
+ The backtrace shows that it was in a write and that the fault address
+ (address in frame 3) is 0x50000800, which is right in the middle of
+ the signal thread's stack page:
+
+
+ (gdb) bt
+ #0 0x10075891 in __libc_nanosleep ()
+ #1 0x1007584d in __sleep (seconds=1000000)
+ at ../sysdeps/unix/sysv/linux/sleep.c:78
+ #2 0x1006ce9a in stop () at user_util.c:191
+ #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
+ #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+ #5 0x1006c63c in kern_segv_handler (sig=11) at trap_user.c:182
+ #6 <signal handler called>
+ #7 0xc0fd in ?? ()
+ #8 0x10016647 in sys_write (fd=3, buf=0x80b8800 "R.", count=1024)
+ at read_write.c:159
+ #9 0x1006d603 in execute_syscall (syscall=4, args=0x5006ef08)
+ at syscall_kern.c:254
+ #10 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35
+ #11 <signal handler called>
+ #12 0x400dc8b0 in ?? ()
+ #13 <signal handler called>
+ #14 0x400dc8b0 in ?? ()
+ #15 0x80545fd in ?? ()
+ #16 0x804daae in ?? ()
+ #17 0x8054334 in ?? ()
+ #18 0x804d23e in ?? ()
+ #19 0x8049632 in ?? ()
+ #20 0x80491d2 in ?? ()
+ #21 0x80596b5 in ?? ()
+ (gdb) p (void *)1342179328
+ $3 = (void *) 0x50000800
+
+
+
+
+
+ Going up the stack to the segv_handler frame and looking at where in
+ the code the access happened shows that it happened near line 110 of
+ block_dev.c:
+
+
+
+
+
+
+
+
+
+ (gdb) up
+ #1 0x1007584d in __sleep (seconds=1000000)
+ at ../sysdeps/unix/sysv/linux/sleep.c:78
+ ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory.
+ (gdb)
+ #2 0x1006ce9a in stop () at user_util.c:191
+ 191 while(1) sleep(1000000);
+ (gdb)
+ #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31
+ 31 KERN_UNTESTED();
+ (gdb)
+ #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174
+ 174 segv(sc->cr2, sc->err & 2);
+ (gdb) p *sc
+ $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43,
+ __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484,
+ esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14,
+ err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070,
+ esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0,
+ cr2 = 1342179328}
+ (gdb) p (void *)268550834
+ $2 = (void *) 0x1001c2b2
+ (gdb) i sym $2
+ block_write + 1090 in section .text
+ (gdb) i line *$2
+ Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h"
+ starts at address 0x1001c2a1 <block_write+1073>
+ and ends at 0x1001c2bf <block_write+1103>.
+ (gdb) i line *0x1001c2c0
+ Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103>
+ and ends at 0x1001c2e3 <block_write+1139>.
+
+
+
+
+
+ Looking at the source shows that the fault happened during a call to
+ copy_to_user to copy the data into the kernel:
+
+
+ 107 count -= chars;
+ 108 copy_from_user(p,buf,chars);
+ 109 p += chars;
+ 110 buf += chars;
+
+
+
+
+
+ p is the pointer which must contain 0x50000800, since buf contains
+ 0x80b8800 (frame 8 above). It is defined as:
+
+
+ p = offset + bh->b_data;
+
+
+
+
+
+ I need to figure out what bh is, and it just so happens that bh is
+ passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a
+ few lines later, so I do a little disassembly:
+
+
+
+
+ (gdb) disas 0x1001c2bf 0x1001c2e0
+ Dump of assembler code from 0x1001c2bf to 0x1001c2d0:
+ 0x1001c2bf <block_write+1103>: addl %eax,0xc(%ebp)
+ 0x1001c2c2 <block_write+1106>: movl 0xfffffdd4(%ebp),%edx
+ 0x1001c2c8 <block_write+1112>: btsl $0x0,0x18(%edx)
+ 0x1001c2cd <block_write+1117>: btsl $0x1,0x18(%edx)
+ 0x1001c2d2 <block_write+1122>: sbbl %ecx,%ecx
+ 0x1001c2d4 <block_write+1124>: testl %ecx,%ecx
+ 0x1001c2d6 <block_write+1126>: jne 0x1001c2e3 <block_write+1139>
+ 0x1001c2d8 <block_write+1128>: pushl $0x0
+ 0x1001c2da <block_write+1130>: pushl %edx
+ 0x1001c2db <block_write+1131>: call 0x1001819c <__mark_buffer_dirty>
+ End of assembler dump.
+
+
+
+
+
+ At that point, bh is in %edx (address 0x1001c2da), which is calculated
+ at 0x1001c2c2 as %ebp + 0xfffffdd4, so I figure exactly what that is,
+ taking %ebp from the sigcontext_struct above:
+
+
+ (gdb) p (void *)1342631484
+ $5 = (void *) 0x5006ee3c
+ (gdb) p 0x5006ee3c+0xfffffdd4
+ $6 = 1342630928
+ (gdb) p (void *)$6
+ $7 = (void *) 0x5006ec10
+ (gdb) p *((void **)$7)
+ $8 = (void *) 0x50100200
+
+
+
+
+
+ Now, I look at the structure to see what's in it, and particularly,
+ what its b_data field contains:
+
+
+ (gdb) p *((struct buffer_head *)0x50100200)
+ $13 = {b_next = 0x50289380, b_blocknr = 49405, b_size = 1024, b_list = 0,
+ b_dev = 15872, b_count = {counter = 1}, b_rdev = 15872, b_state = 24,
+ b_flushtime = 0, b_next_free = 0x501001a0, b_prev_free = 0x50100260,
+ b_this_page = 0x501001a0, b_reqnext = 0x0, b_pprev = 0x507fcf58,
+ b_data = 0x50000800 "", b_page = 0x50004000,
+ b_end_io = 0x10017f60 <end_buffer_io_sync>, b_dev_id = 0x0,
+ b_rsector = 98810, b_wait = {lock = <optimized out or zero length>,
+ task_list = {next = 0x50100248, prev = 0x50100248}, __magic = 1343226448,
+ __creator = 0}, b_kiobuf = 0x0}
+
+
+
+
+
+ The b_data field is indeed 0x50000800, so the question becomes how
+ that happened. The rest of the structure looks fine, so this probably
+ is not a case of data corruption. It happened on purpose somehow.
+
+
+ The b_page field is a pointer to the page_struct representing the
+ 0x50000000 page. Looking at it shows the kernel's idea of the state
+ of that page:
+
+
+
+ (gdb) p *$13.b_page
+ $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0,
+ index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = {
+ next = 0x50008460, prev = 0x50019350}, wait = {
+ lock = <optimized out or zero length>, task_list = {next = 0x50004024,
+ prev = 0x50004024}, __magic = 1342193708, __creator = 0},
+ pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280,
+ zone = 0x100c5160}
+
+
+
+
+
+ Some sanity-checking: the virtual field shows the "virtual" address of
+ this page, which in this kernel is the same as its "physical" address,
+ and the page_struct itself should be mem_map[0], since it represents
+ the first page of memory:
+
+
+
+ (gdb) p (void *)1342177280
+ $18 = (void *) 0x50000000
+ (gdb) p mem_map
+ $19 = (mem_map_t *) 0x50004000
+
+
+
+
+
+ These check out fine.
+
+
+ Now to check out the page_struct itself. In particular, the flags
+ field shows whether the page is considered free or not:
+
+
+ (gdb) p (void *)132
+ $21 = (void *) 0x84
+
+
+
+
+
+ The "reserved" bit is the high bit, which is definitely not set, so
+ the kernel considers the signal stack page to be free and available to
+ be used.
+
+
+ At this point, I jump to conclusions and start looking at my early
+ boot code, because that's where that page is supposed to be reserved.
+
+
+ In my setup_arch procedure, I have the following code which looks just
+ fine:
+
+
+
+ bootmap_size = init_bootmem(start_pfn, end_pfn - start_pfn);
+ free_bootmem(__pa(low_physmem) + bootmap_size, high_physmem - low_physmem);
+
+
+
+
+
+ Two stack pages have already been allocated, and low_physmem points to
+ the third page, which is the beginning of free memory.
+ The init_bootmem call declares the entire memory to the boot memory
+ manager, which marks it all reserved. The free_bootmem call frees up
+ all of it, except for the first two pages. This looks correct to me.
+
+
+ So, I decide to see init_bootmem run and make sure that it is marking
+ those first two pages as reserved. I never get that far.
+
+
+ Stepping into init_bootmem, and looking at bootmem_map before looking
+ at what it contains shows the following:
+
+
+
+ (gdb) p bootmem_map
+ $3 = (void *) 0x50000000
+
+
+
+
+
+ Aha! The light dawns. That first page is doing double duty as a
+ stack and as the boot memory map. The last thing that the boot memory
+ manager does is to free the pages used by its memory map, so this page
+ is getting freed even its marked as reserved.
+
+
+ The fix was to initialize the boot memory manager before allocating
+ those two stack pages, and then allocate them through the boot memory
+ manager. After doing this, and fixing a couple of subsequent buglets,
+ the stack corruption problem disappeared.
+
+
+
+
+
+ 13. What to do when UML doesn't work
+
+
+
+
+ 13.1. Strange compilation errors when you build from source
+
+ As of test11, it is necessary to have "ARCH=um" in the environment or
+ on the make command line for all steps in building UML, including
+ clean, distclean, or mrproper, config, menuconfig, or xconfig, dep,
+ and linux. If you forget for any of them, the i386 build seems to
+ contaminate the UML build. If this happens, start from scratch with
+
+
+ host%
+ make mrproper ARCH=um
+
+
+
+
+ and repeat the build process with ARCH=um on all the steps.
+
+
+ See ``Compiling the kernel and modules'' for more details.
+
+
+ Another cause of strange compilation errors is building UML in
+ /usr/src/linux. If you do this, the first thing you need to do is
+ clean up the mess you made. The /usr/src/linux/asm link will now
+ point to /usr/src/linux/asm-um. Make it point back to
+ /usr/src/linux/asm-i386. Then, move your UML pool someplace else and
+ build it there. Also see below, where a more specific set of symptoms
+ is described.
+
+
+
+ 13.3. A variety of panics and hangs with /tmp on a reiserfs filesys-
+ tem
+
+ I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27.
+ Panics preceded by
+
+
+ Detaching pid nnnn
+
+
+
+ are diagnostic of this problem. This is a reiserfs bug which causes a
+ thread to occasionally read stale data from a mmapped page shared with
+ another thread. The fix is to upgrade the filesystem or to have /tmp
+ be an ext2 filesystem.
+
+
+
+ 13.4. The compile fails with errors about conflicting types for
+ 'open', 'dup', and 'waitpid'
+
+ This happens when you build in /usr/src/linux. The UML build makes
+ the include/asm link point to include/asm-um. /usr/include/asm points
+ to /usr/src/linux/include/asm, so when that link gets moved, files
+ which need to include the asm-i386 versions of headers get the
+ incompatible asm-um versions. The fix is to move the include/asm link
+ back to include/asm-i386 and to do UML builds someplace else.
+
+
+
+ 13.5. UML doesn't work when /tmp is an NFS filesystem
+
+ This seems to be a similar situation with the ReiserFS problem above.
+ Some versions of NFS seems not to handle mmap correctly, which UML
+ depends on. The workaround is have /tmp be a non-NFS directory.
+
+
+ 13.6. UML hangs on boot when compiled with gprof support
+
+ If you build UML with gprof support and, early in the boot, it does
+ this
+
+
+ kernel BUG at page_alloc.c:100!
+
+
+
+
+ you have a buggy gcc. You can work around the problem by removing
+ UM_FASTCALL from CFLAGS in arch/um/Makefile-i386. This will open up
+ another bug, but that one is fairly hard to reproduce.
+
+
+
+ 13.7. syslogd dies with a SIGTERM on startup
+
+ The exact boot error depends on the distribution that you're booting,
+ but Debian produces this:
+
+
+ /etc/rc2.d/S10sysklogd: line 49: 93 Terminated
+ start-stop-daemon --start --quiet --exec /sbin/syslogd -- $SYSLOGD
+
+
+
+
+ This is a syslogd bug. There's a race between a parent process
+ installing a signal handler and its child sending the signal. See
+ this uml-devel post <http://www.geocrawler.com/lists/3/Source-
+ Forge/709/0/6612801> for the details.
+
+
+
+ 13.8. TUN/TAP networking doesn't work on a 2.4 host
+
+ There are a couple of problems which were
+ <http://www.geocrawler.com/lists/3/SourceForge/597/0/> name="pointed
+ out"> by Tim Robinson <timro at trkr dot net>
+
+ o It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier.
+ The fix is to upgrade to something more recent and then read the
+ next item.
+
+ o If you see
+
+
+ File descriptor in bad state
+
+
+
+ when you bring up the device inside UML, you have a header mismatch
+ between the original kernel and the upgraded one. Make /usr/src/linux
+ point at the new headers. This will only be a problem if you build
+ uml_net yourself.
+
+
+
+ 13.9. You can network to the host but not to other machines on the
+ net
+
+ If you can connect to the host, and the host can connect to UML, but
+ you cannot connect to any other machines, then you may need to enable
+ IP Masquerading on the host. Usually this is only experienced when
+ using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML
+ networking, rather than the public address space that your host is
+ connected to. UML does not enable IP Masquerading, so you will need
+ to create a static rule to enable it:
+
+
+ host%
+ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
+
+
+
+
+ Replace eth0 with the interface that you use to talk to the rest of
+ the world.
+
+
+ Documentation on IP Masquerading, and SNAT, can be found at
+ www.netfilter.org <http://www.netfilter.org> .
+
+
+ If you can reach the local net, but not the outside Internet, then
+ that is usually a routing problem. The UML needs a default route:
+
+
+ UML#
+ route add default gw gateway IP
+
+
+
+
+ The gateway IP can be any machine on the local net that knows how to
+ reach the outside world. Usually, this is the host or the local net-
+ work's gateway.
+
+
+ Occasionally, we hear from someone who can reach some machines, but
+ not others on the same net, or who can reach some ports on other
+ machines, but not others. These are usually caused by strange
+ firewalling somewhere between the UML and the other box. You track
+ this down by running tcpdump on every interface the packets travel
+ over and see where they disappear. When you find a machine that takes
+ the packets in, but does not send them onward, that's the culprit.
+
+
+
+ 13.10. I have no root and I want to scream
+
+ Thanks to Birgit Wahlich for telling me about this strange one. It
+ turns out that there's a limit of six environment variables on the
+ kernel command line. When that limit is reached or exceeded, argument
+ processing stops, which means that the 'root=' argument that UML
+ usually adds is not seen. So, the filesystem has no idea what the
+ root device is, so it panics.
+
+
+ The fix is to put less stuff on the command line. Glomming all your
+ setup variables into one is probably the best way to go.
+
+
+
+ 13.11. UML build conflict between ptrace.h and ucontext.h
+
+ On some older systems, /usr/include/asm/ptrace.h and
+ /usr/include/sys/ucontext.h define the same names. So, when they're
+ included together, the defines from one completely mess up the parsing
+ of the other, producing errors like:
+ /usr/include/sys/ucontext.h:47: parse error before
+ `10'
+
+
+
+
+ plus a pile of warnings.
+
+
+ This is a libc botch, which has since been fixed, and I don't see any
+ way around it besides upgrading.
+
+
+
+ 13.12. The UML BogoMips is exactly half the host's BogoMips
+
+ On i386 kernels, there are two ways of running the loop that is used
+ to calculate the BogoMips rating, using the TSC if it's there or using
+ a one-instruction loop. The TSC produces twice the BogoMips as the
+ loop. UML uses the loop, since it has nothing resembling a TSC, and
+ will get almost exactly the same BogoMips as a host using the loop.
+ However, on a host with a TSC, its BogoMips will be double the loop
+ BogoMips, and therefore double the UML BogoMips.
+
+
+
+ 13.13. When you run UML, it immediately segfaults
+
+ If the host is configured with the 2G/2G address space split, that's
+ why. See ``UML on 2G/2G hosts'' for the details on getting UML to
+ run on your host.
+
+
+
+ 13.14. xterms appear, then immediately disappear
+
+ If you're running an up to date kernel with an old release of
+ uml_utilities, the port-helper program will not work properly, so
+ xterms will exit straight after they appear. The solution is to
+ upgrade to the latest release of uml_utilities. Usually this problem
+ occurs when you have installed a packaged release of UML then compiled
+ your own development kernel without upgrading the uml_utilities from
+ the source distribution.
+
+
+
+ 13.15. Any other panic, hang, or strange behavior
+
+ If you're seeing truly strange behavior, such as hangs or panics that
+ happen in random places, or you try running the debugger to see what's
+ happening and it acts strangely, then it could be a problem in the
+ host kernel. If you're not running a stock Linus or -ac kernel, then
+ try that. An early version of the preemption patch and a 2.4.10 SuSE
+ kernel have caused very strange problems in UML.
+
+
+ Otherwise, let me know about it. Send a message to one of the UML
+ mailing lists - either the developer list - user-mode-linux-devel at
+ lists dot sourceforge dot net (subscription info) or the user list -
+ user-mode-linux-user at lists dot sourceforge do net (subscription
+ info), whichever you prefer. Don't assume that everyone knows about
+ it and that a fix is imminent.
+
+
+ If you want to be super-helpful, read ``Diagnosing Problems'' and
+ follow the instructions contained therein.
+ 14. Diagnosing Problems
+
+
+ If you get UML to crash, hang, or otherwise misbehave, you should
+ report this on one of the project mailing lists, either the developer
+ list - user-mode-linux-devel at lists dot sourceforge dot net
+ (subscription info) or the user list - user-mode-linux-user at lists
+ dot sourceforge dot net (subscription info). When you do, it is
+ likely that I will want more information. So, it would be helpful to
+ read the stuff below, do whatever is applicable in your case, and
+ report the results to the list.
+
+
+ For any diagnosis, you're going to need to build a debugging kernel.
+ The binaries from this site aren't debuggable. If you haven't done
+ this before, read about ``Compiling the kernel and modules'' and
+ ``Kernel debugging'' UML first.
+
+
+ 14.1. Case 1 : Normal kernel panics
+
+ The most common case is for a normal thread to panic. To debug this,
+ you will need to run it under the debugger (add 'debug' to the command
+ line). An xterm will start up with gdb running inside it. Continue
+ it when it stops in start_kernel and make it crash. Now ^C gdb and
+
+
+ If the panic was a "Kernel mode fault", then there will be a segv
+ frame on the stack and I'm going to want some more information. The
+ stack might look something like this:
+
+
+ (UML gdb) backtrace
+ #0 0x1009bf76 in __sigprocmask (how=1, set=0x5f347940, oset=0x0)
+ at ../sysdeps/unix/sysv/linux/sigprocmask.c:49
+ #1 0x10091411 in change_sig (signal=10, on=1) at process.c:218
+ #2 0x10094785 in timer_handler (sig=26) at time_kern.c:32
+ #3 0x1009bf38 in __restore ()
+ at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
+ #4 0x1009534c in segv (address=8, ip=268849158, is_write=2, is_user=0)
+ at trap_kern.c:66
+ #5 0x10095c04 in segv_handler (sig=11) at trap_user.c:285
+ #6 0x1009bf38 in __restore ()
+
+
+
+
+ I'm going to want to see the symbol and line information for the value
+ of ip in the segv frame. In this case, you would do the following:
+
+
+ (UML gdb) i sym 268849158
+
+
+
+
+ and
+
+
+ (UML gdb) i line *268849158
+
+
+
+
+ The reason for this is the __restore frame right above the segv_han-
+ dler frame is hiding the frame that actually segfaulted. So, I have
+ to get that information from the faulting ip.
+
+
+ 14.2. Case 2 : Tracing thread panics
+
+ The less common and more painful case is when the tracing thread
+ panics. In this case, the kernel debugger will be useless because it
+ needs a healthy tracing thread in order to work. The first thing to
+ do is get a backtrace from the tracing thread. This is done by
+ figuring out what its pid is, firing up gdb, and attaching it to that
+ pid. You can figure out the tracing thread pid by looking at the
+ first line of the console output, which will look like this:
+
+
+ tracing thread pid = 15851
+
+
+
+
+ or by running ps on the host and finding the line that looks like
+ this:
+
+
+ jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)]
+
+
+
+
+ If the panic was 'segfault in signals', then follow the instructions
+ above for collecting information about the location of the seg fault.
+
+
+ If the tracing thread flaked out all by itself, then send that
+ backtrace in and wait for our crack debugging team to fix the problem.
+
+
+ 14.3. Case 3 : Tracing thread panics caused by other threads
+
+ However, there are cases where the misbehavior of another thread
+ caused the problem. The most common panic of this type is:
+
+
+ wait_for_stop failed to wait for <pid> to stop with <signal number>
+
+
+
+
+ In this case, you'll need to get a backtrace from the process men-
+ tioned in the panic, which is complicated by the fact that the kernel
+ debugger is defunct and without some fancy footwork, another gdb can't
+ attach to it. So, this is how the fancy footwork goes:
+
+ In a shell:
+
+
+ host% kill -STOP pid
+
+
+
+
+ Run gdb on the tracing thread as described in case 2 and do:
+
+
+ (host gdb) call detach(pid)
+
+
+ If you get a segfault, do it again. It always works the second time.
+
+ Detach from the tracing thread and attach to that other thread:
+
+
+ (host gdb) detach
+
+
+
+
+
+
+ (host gdb) attach pid
+
+
+
+
+ If gdb hangs when attaching to that process, go back to a shell and
+ do:
+
+
+ host%
+ kill -CONT pid
+
+
+
+
+ And then get the backtrace:
+
+
+ (host gdb) backtrace
+
+
+
+
+
+ 14.4. Case 4 : Hangs
+
+ Hangs seem to be fairly rare, but they sometimes happen. When a hang
+ happens, we need a backtrace from the offending process. Run the
+ kernel debugger as described in case 1 and get a backtrace. If the
+ current process is not the idle thread, then send in the backtrace.
+ You can tell that it's the idle thread if the stack looks like this:
+
+
+ #0 0x100b1401 in __libc_nanosleep ()
+ #1 0x100a2885 in idle_sleep (secs=10) at time.c:122
+ #2 0x100a546f in do_idle () at process_kern.c:445
+ #3 0x100a5508 in cpu_idle () at process_kern.c:471
+ #4 0x100ec18f in start_kernel () at init/main.c:592
+ #5 0x100a3e10 in start_kernel_proc (unused=0x0) at um_arch.c:71
+ #6 0x100a383f in signal_tramp (arg=0x100a3dd8) at trap_user.c:50
+
+
+
+
+ If this is the case, then some other process is at fault, and went to
+ sleep when it shouldn't have. Run ps on the host and figure out which
+ process should not have gone to sleep and stayed asleep. Then attach
+ to it with gdb and get a backtrace as described in case 3.
+
+
+
+
+
+
+ 15. Thanks
+
+
+ A number of people have helped this project in various ways, and this
+ page gives recognition where recognition is due.
+
+
+ If you're listed here and you would prefer a real link on your name,
+ or no link at all, instead of the despammed email address pseudo-link,
+ let me know.
+
+
+ If you're not listed here and you think maybe you should be, please
+ let me know that as well. I try to get everyone, but sometimes my
+ bookkeeping lapses and I forget about contributions.
+
+
+ 15.1. Code and Documentation
+
+ Rusty Russell <rusty at linuxcare.com.au> -
+
+ o wrote the HOWTO <http://user-mode-
+ linux.sourceforge.net/UserModeLinux-HOWTO.html>
+
+ o prodded me into making this project official and putting it on
+ SourceForge
+
+ o came up with the way cool UML logo <http://user-mode-
+ linux.sourceforge.net/uml-small.png>
+
+ o redid the config process
+
+
+ Peter Moulder <reiter at netspace.net.au> - Fixed my config and build
+ processes, and added some useful code to the block driver
+
+
+ Bill Stearns <wstearns at pobox.com> -
+
+ o HOWTO updates
+
+ o lots of bug reports
+
+ o lots of testing
+
+ o dedicated a box (uml.ists.dartmouth.edu) to support UML development
+
+ o wrote the mkrootfs script, which allows bootable filesystems of
+ RPM-based distributions to be cranked out
+
+ o cranked out a large number of filesystems with said script
+
+
+ Jim Leu <jleu at mindspring.com> - Wrote the virtual ethernet driver
+ and associated usermode tools
+
+ Lars Brinkhoff <http://lars.nocrew.org/> - Contributed the ptrace
+ proxy from his own project <http://a386.nocrew.org/> to allow easier
+ kernel debugging
+
+
+ Andrea Arcangeli <andrea at suse.de> - Redid some of the early boot
+ code so that it would work on machines with Large File Support
+
+
+ Chris Emerson <http://www.chiark.greenend.org.uk/~cemerson/> - Did
+ the first UML port to Linux/ppc
+
+
+ Harald Welte <laforge at gnumonks.org> - Wrote the multicast
+ transport for the network driver
+
+
+ Jorgen Cederlof - Added special file support to hostfs
+
+
+ Greg Lonnon <glonnon at ridgerun dot com> - Changed the ubd driver
+ to allow it to layer a COW file on a shared read-only filesystem and
+ wrote the iomem emulation support
+
+
+ Henrik Nordstrom <http://hem.passagen.se/hno/> - Provided a variety
+ of patches, fixes, and clues
+
+
+ Lennert Buytenhek - Contributed various patches, a rewrite of the
+ network driver, the first implementation of the mconsole driver, and
+ did the bulk of the work needed to get SMP working again.
+
+
+ Yon Uriarte - Fixed the TUN/TAP network backend while I slept.
+
+
+ Adam Heath - Made a bunch of nice cleanups to the initialization code,
+ plus various other small patches.
+
+
+ Matt Zimmerman - Matt volunteered to be the UML Debian maintainer and
+ is doing a real nice job of it. He also noticed and fixed a number of
+ actually and potentially exploitable security holes in uml_net. Plus
+ the occasional patch. I like patches.
+
+
+ James McMechan - James seems to have taken over maintenance of the ubd
+ driver and is doing a nice job of it.
+
+
+ Chandan Kudige - wrote the umlgdb script which automates the reloading
+ of module symbols.
+
+
+ Steve Schmidtke - wrote the UML slirp transport and hostaudio drivers,
+ enabling UML processes to access audio devices on the host. He also
+ submitted patches for the slip transport and lots of other things.
+
+
+ David Coulson <http://davidcoulson.net> -
+
+ o Set up the usermodelinux.org <http://usermodelinux.org> site,
+ which is a great way of keeping the UML user community on top of
+ UML goings-on.
+
+ o Site documentation and updates
+
+ o Nifty little UML management daemon UMLd
+ <http://uml.openconsultancy.com/umld/>
+
+ o Lots of testing and bug reports
+
+
+
+
+ 15.2. Flushing out bugs
+
+
+
+ o Yuri Pudgorodsky
+
+ o Gerald Britton
+
+ o Ian Wehrman
+
+ o Gord Lamb
+
+ o Eugene Koontz
+
+ o John H. Hartman
+
+ o Anders Karlsson
+
+ o Daniel Phillips
+
+ o John Fremlin
+
+ o Rainer Burgstaller
+
+ o James Stevenson
+
+ o Matt Clay
+
+ o Cliff Jefferies
+
+ o Geoff Hoff
+
+ o Lennert Buytenhek
+
+ o Al Viro
+
+ o Frank Klingenhoefer
+
+ o Livio Baldini Soares
+
+ o Jon Burgess
+
+ o Petru Paler
+
+ o Paul
+
+ o Chris Reahard
+
+ o Sverker Nilsson
+
+ o Gong Su
+
+ o johan verrept
+
+ o Bjorn Eriksson
+
+ o Lorenzo Allegrucci
+
+ o Muli Ben-Yehuda
+
+ o David Mansfield
+
+ o Howard Goff
+
+ o Mike Anderson
+
+ o John Byrne
+
+ o Sapan J. Batia
+
+ o Iris Huang
+
+ o Jan Hudec
+
+ o Voluspa
+
+
+
+
+ 15.3. Buglets and clean-ups
+
+
+
+ o Dave Zarzycki
+
+ o Adam Lazur
+
+ o Boria Feigin
+
+ o Brian J. Murrell
+
+ o JS
+
+ o Roman Zippel
+
+ o Wil Cooley
+
+ o Ayelet Shemesh
+
+ o Will Dyson
+
+ o Sverker Nilsson
+
+ o dvorak
+
+ o v.naga srinivas
+
+ o Shlomi Fish
+
+ o Roger Binns
+
+ o johan verrept
+
+ o MrChuoi
+
+ o Peter Cleve
+
+ o Vincent Guffens
+
+ o Nathan Scott
+
+ o Patrick Caulfield
+
+ o jbearce
+
+ o Catalin Marinas
+
+ o Shane Spencer
+
+ o Zou Min
+
+
+ o Ryan Boder
+
+ o Lorenzo Colitti
+
+ o Gwendal Grignou
+
+ o Andre' Breiler
+
+ o Tsutomu Yasuda
+
+
+
+ 15.4. Case Studies
+
+
+ o Jon Wright
+
+ o William McEwan
+
+ o Michael Richardson
+
+
+
+ 15.5. Other contributions
+
+
+ Bill Carr <Bill.Carr at compaq.com> made the Red Hat mkrootfs script
+ work with RH 6.2.
+
+ Michael Jennings <mikejen at hevanet.com> sent in some material which
+ is now gracing the top of the index page <http://user-mode-
+ linux.sourceforge.net/> of this site.
+
+ SGI <http://www.sgi.com> (and more specifically Ralf Baechle <ralf at
+ uni-koblenz.de> ) gave me an account on oss.sgi.com
+ <http://www.oss.sgi.com> . The bandwidth there made it possible to
+ produce most of the filesystems available on the project download
+ page.
+
+ Laurent Bonnaud <Laurent.Bonnaud at inpg.fr> took the old grotty
+ Debian filesystem that I've been distributing and updated it to 2.2.
+ It is now available by itself here.
+
+ Rik van Riel gave me some ftp space on ftp.nl.linux.org so I can make
+ releases even when Sourceforge is broken.
+
+ Rodrigo de Castro looked at my broken pte code and told me what was
+ wrong with it, letting me fix a long-standing (several weeks) and
+ serious set of bugs.
+
+ Chris Reahard built a specialized root filesystem for running a DNS
+ server jailed inside UML. It's available from the download
+ <http://user-mode-linux.sourceforge.net/dl-sf.html> page in the Jail
+ Filesystems section.
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt
new file mode 100644
index 00000000..da094737
--- /dev/null
+++ b/Documentation/virtual/virtio-spec.txt
@@ -0,0 +1,2200 @@
+[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
+Virtio PCI Card Specification
+v0.9.1 DRAFT
+-
+
+Rusty Russell <rusty@rustcorp.com.au>IBM Corporation (Editor)
+
+2011 August 1.
+
+Purpose and Description
+
+This document describes the specifications of the “virtio” family
+of PCI[LaTeX Command: nomenclature] devices. These are devices
+are found in virtual environments[LaTeX Command: nomenclature],
+yet by design they are not all that different from physical PCI
+devices, and this document treats them as such. This allows the
+guest to use standard PCI drivers and discovery mechanisms.
+
+The purpose of virtio and this specification is that virtual
+environments and guests should have a straightforward, efficient,
+standard and extensible mechanism for virtual devices, rather
+than boutique per-environment or per-OS mechanisms.
+
+ Straightforward: Virtio PCI devices use normal PCI mechanisms
+ of interrupts and DMA which should be familiar to any device
+ driver author. There is no exotic page-flipping or COW
+ mechanism: it's just a PCI device.[footnote:
+This lack of page-sharing implies that the implementation of the
+device (e.g. the hypervisor or host) needs full access to the
+guest memory. Communication with untrusted parties (i.e.
+inter-guest communication) requires copying.
+]
+
+ Efficient: Virtio PCI devices consist of rings of descriptors
+ for input and output, which are neatly separated to avoid cache
+ effects from both guest and device writing to the same cache
+ lines.
+
+ Standard: Virtio PCI makes no assumptions about the environment
+ in which it operates, beyond supporting PCI. In fact the virtio
+ devices specified in the appendices do not require PCI at all:
+ they have been implemented on non-PCI buses.[footnote:
+The Linux implementation further separates the PCI virtio code
+from the specific virtio drivers: these drivers are shared with
+the non-PCI implementations (currently lguest and S/390).
+]
+
+ Extensible: Virtio PCI devices contain feature bits which are
+ acknowledged by the guest operating system during device setup.
+ This allows forwards and backwards compatibility: the device
+ offers all the features it knows about, and the driver
+ acknowledges those it understands and wishes to use.
+
+ Virtqueues
+
+The mechanism for bulk data transport on virtio PCI devices is
+pretentiously called a virtqueue. Each device can have zero or
+more virtqueues: for example, the network device has one for
+transmit and one for receive.
+
+Each virtqueue occupies two or more physically-contiguous pages
+(defined, for the purposes of this specification, as 4096 bytes),
+and consists of three parts:
+
+
++-------------------+-----------------------------------+-----------+
+| Descriptor Table | Available Ring (padding) | Used Ring |
++-------------------+-----------------------------------+-----------+
+
+
+When the driver wants to send buffers to the device, it puts them
+in one or more slots in the descriptor table, and writes the
+descriptor indices into the available ring. It then notifies the
+device. When the device has finished with the buffers, it writes
+the descriptors into the used ring, and sends an interrupt.
+
+Specification
+
+ PCI Discovery
+
+Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
+through 0x103F inclusive is a virtio device[footnote:
+The actual value within this range is ignored
+]. The device must also have a Revision ID of 0 to match this
+specification.
+
+The Subsystem Device ID indicates which virtio device is
+supported by the device. The Subsystem Vendor ID should reflect
+the PCI Vendor ID of the environment (it's currently only used
+for informational purposes by the guest).
+
+
++----------------------+--------------------+---------------+
+| Subsystem Device ID | Virtio Device | Specification |
++----------------------+--------------------+---------------+
++----------------------+--------------------+---------------+
+| 1 | network card | Appendix C |
++----------------------+--------------------+---------------+
+| 2 | block device | Appendix D |
++----------------------+--------------------+---------------+
+| 3 | console | Appendix E |
++----------------------+--------------------+---------------+
+| 4 | entropy source | Appendix F |
++----------------------+--------------------+---------------+
+| 5 | memory ballooning | Appendix G |
++----------------------+--------------------+---------------+
+| 6 | ioMemory | - |
++----------------------+--------------------+---------------+
+| 9 | 9P transport | - |
++----------------------+--------------------+---------------+
+
+
+ Device Configuration
+
+To configure the device, we use the first I/O region of the PCI
+device. This contains a virtio header followed by a
+device-specific region.
+
+There may be different widths of accesses to the I/O region; the “
+natural” access method for each field in the virtio header must
+be used (i.e. 32-bit accesses for 32-bit fields, etc), but the
+device-specific region can be accessed using any width accesses,
+and should obtain the same results.
+
+Note that this is possible because while the virtio header is PCI
+(i.e. little) endian, the device-specific region is encoded in
+the native endian of the guest (where such distinction is
+applicable).
+
+ Device Initialization Sequence
+
+We start with an overview of device initialization, then expand
+on the details of the device and how each step is preformed.
+
+ Reset the device. This is not required on initial start up.
+
+ The ACKNOWLEDGE status bit is set: we have noticed the device.
+
+ The DRIVER status bit is set: we know how to drive the device.
+
+ Device-specific setup, including reading the Device Feature
+ Bits, discovery of virtqueues for the device, optional MSI-X
+ setup, and reading and possibly writing the virtio
+ configuration space.
+
+ The subset of Device Feature Bits understood by the driver is
+ written to the device.
+
+ The DRIVER_OK status bit is set.
+
+ The device can now be used (ie. buffers added to the
+ virtqueues)[footnote:
+Historically, drivers have used the device before steps 5 and 6.
+This is only allowed if the driver does not use any features
+which would alter this early use of the device.
+]
+
+If any of these steps go irrecoverably wrong, the guest should
+set the FAILED status bit to indicate that it has given up on the
+device (it can reset the device later to restart if desired).
+
+We now cover the fields required for general setup in detail.
+
+ Virtio Header
+
+The virtio header looks as follows:
+
+
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR |
+| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+
+
+If MSI-X is enabled for the device, two additional fields
+immediately follow this header:
+
+
++------------++----------------+--------+
+| Bits || 16 | 16 |
+ +----------------+--------+
++------------++----------------+--------+
+| Read/Write || R+W | R+W |
++------------++----------------+--------+
+| Purpose || Configuration | Queue |
+| (MSI-X) || Vector | Vector |
++------------++----------------+--------+
+
+
+Finally, if feature bits (VIRTIO_F_FEATURES_HI) this is
+immediately followed by two additional fields:
+
+
++------------++----------------------+----------------------
+| Bits || 32 | 32
++------------++----------------------+----------------------
+| Read/Write || R | R+W
++------------++----------------------+----------------------
+| Purpose || Device | Guest
+| || Features bits 32:63 | Features bits 32:63
++------------++----------------------+----------------------
+
+
+Immediately following these general headers, there may be
+device-specific headers:
+
+
++------------++--------------------+
+| Bits || Device Specific |
+ +--------------------+
++------------++--------------------+
+| Read/Write || Device Specific |
++------------++--------------------+
+| Purpose || Device Specific... |
+| || |
++------------++--------------------+
+
+
+ Device Status
+
+The Device Status field is updated by the guest to indicate its
+progress. This provides a simple low-level diagnostic: it's most
+useful to imagine them hooked up to traffic lights on the console
+indicating the status of each device.
+
+The device can be reset by writing a 0 to this field, otherwise
+at least one bit should be set:
+
+ ACKNOWLEDGE (1) Indicates that the guest OS has found the
+ device and recognized it as a valid virtio device.
+
+ DRIVER (2) Indicates that the guest OS knows how to drive the
+ device. Under Linux, drivers can be loadable modules so there
+ may be a significant (or infinite) delay before setting this
+ bit.
+
+ DRIVER_OK (3) Indicates that the driver is set up and ready to
+ drive the device.
+
+ FAILED (8) Indicates that something went wrong in the guest,
+ and it has given up on the device. This could be an internal
+ error, or the driver didn't like the device for some reason, or
+ even a fatal error during device operation. The device must be
+ reset before attempting to re-initialize.
+
+ Feature Bits
+
+The least significant 31 bits of the first configuration field
+indicates the features that the device supports (the high bit is
+reserved, and will be used to indicate the presence of future
+feature bits elsewhere). If more than 31 feature bits are
+supported, the device indicates so by setting feature bit 31 (see
+[cha:Reserved-Feature-Bits]). The bits are allocated as follows:
+
+ 0 to 23 Feature bits for the specific device type
+
+ 24 to 40 Feature bits reserved for extensions to the queue and
+ feature negotiation mechanisms
+
+ 41 to 63 Feature bits reserved for future extensions
+
+For example, feature bit 0 for a network device (i.e. Subsystem
+Device ID 1) indicates that the device supports checksumming of
+packets.
+
+The feature bits are negotiated: the device lists all the
+features it understands in the Device Features field, and the
+guest writes the subset that it understands into the Guest
+Features field. The only way to renegotiate is to reset the
+device.
+
+In particular, new fields in the device configuration header are
+indicated by offering a feature bit, so the guest can check
+before accessing that part of the configuration space.
+
+This allows for forwards and backwards compatibility: if the
+device is enhanced with a new feature bit, older guests will not
+write that feature bit back to the Guest Features field and it
+can go into backwards compatibility mode. Similarly, if a guest
+is enhanced with a feature that the device doesn't support, it
+will not see that feature bit in the Device Features field and
+can go into backwards compatibility mode (or, for poor
+implementations, set the FAILED Device Status bit).
+
+Access to feature bits 32 to 63 is enabled by Guest by setting
+feature bit 31. If this bit is unset, Device must assume that all
+feature bits > 31 are unset.
+
+ Configuration/Queue Vectors
+
+When MSI-X capability is present and enabled in the device
+(through standard PCI configuration space) 4 bytes at byte offset
+20 are used to map configuration change and queue interrupts to
+MSI-X vectors. In this case, the ISR Status field is unused, and
+device specific configuration starts at byte offset 24 in virtio
+header structure. When MSI-X capability is not enabled, device
+specific configuration starts at byte offset 20 in virtio header.
+
+Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
+Configuration/Queue Vector registers, maps interrupts triggered
+by the configuration change/selected queue events respectively to
+the corresponding MSI-X vector. To disable interrupts for a
+specific event type, unmap it by writing a special NO_VECTOR
+value:
+
+/* Vector value used to disable MSI for queue */
+
+#define VIRTIO_MSI_NO_VECTOR 0xffff
+
+Reading these registers returns vector mapped to a given event,
+or NO_VECTOR if unmapped. All queue and configuration change
+events are unmapped by default.
+
+Note that mapping an event to vector might require allocating
+internal device resources, and might fail. Devices report such
+failures by returning the NO_VECTOR value when the relevant
+Vector field is read. After mapping an event to vector, the
+driver must verify success by reading the Vector field value: on
+success, the previously written value is returned, and on
+failure, NO_VECTOR is returned. If a mapping failure is detected,
+the driver can retry mapping with fewervectors, or disable MSI-X.
+
+ Virtqueue Configuration
+
+As a device can have zero or more virtqueues for bulk data
+transport (for example, the network driver has two), the driver
+needs to configure them as part of the device-specific
+configuration.
+
+This is done as follows, for each virtqueue a device has:
+
+ Write the virtqueue index (first queue is 0) to the Queue
+ Select field.
+
+ Read the virtqueue size from the Queue Size field, which is
+ always a power of 2. This controls how big the virtqueue is
+ (see below). If this field is 0, the virtqueue does not exist.
+
+ Allocate and zero virtqueue in contiguous physical memory, on a
+ 4096 byte alignment. Write the physical address, divided by
+ 4096 to the Queue Address field.[footnote:
+The 4096 is based on the x86 page size, but it's also large
+enough to ensure that the separate parts of the virtqueue are on
+separate cache lines.
+]
+
+ Optionally, if MSI-X capability is present and enabled on the
+ device, select a vector to use to request interrupts triggered
+ by virtqueue events. Write the MSI-X Table entry number
+ corresponding to this vector in Queue Vector field. Read the
+ Queue Vector field: on success, previously written value is
+ returned; on failure, NO_VECTOR value is returned.
+
+The Queue Size field controls the total number of bytes required
+for the virtqueue according to the following formula:
+
+#define ALIGN(x) (((x) + 4095) & ~4095)
+
+static inline unsigned vring_size(unsigned int qsz)
+
+{
+
+ return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2
++ qsz))
+
+ + ALIGN(sizeof(struct vring_used_elem)*qsz);
+
+}
+
+This currently wastes some space with padding, but also allows
+future extensions. The virtqueue layout structure looks like this
+(qsz is the Queue Size field, which is a variable, so this code
+won't compile):
+
+struct vring {
+
+ /* The actual descriptors (16 bytes each) */
+
+ struct vring_desc desc[qsz];
+
+
+
+ /* A ring of available descriptor heads with free-running
+index. */
+
+ struct vring_avail avail;
+
+
+
+ // Padding to the next 4096 boundary.
+
+ char pad[];
+
+
+
+ // A ring of used descriptor heads with free-running index.
+
+ struct vring_used used;
+
+};
+
+ A Note on Virtqueue Endianness
+
+Note that the endian of these fields and everything else in the
+virtqueue is the native endian of the guest, not little-endian as
+PCI normally is. This makes for simpler guest code, and it is
+assumed that the host already has to be deeply aware of the guest
+endian so such an “endian-aware” device is not a significant
+issue.
+
+ Descriptor Table
+
+The descriptor table refers to the buffers the guest is using for
+the device. The addresses are physical addresses, and the buffers
+can be chained via the next field. Each descriptor describes a
+buffer which is read-only or write-only, but a chain of
+descriptors can contain both read-only and write-only buffers.
+
+No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc {
+
+ /* Address (guest-physical). */
+
+ u64 addr;
+
+ /* Length. */
+
+ u32 len;
+
+/* This marks a buffer as continuing via the next field. */
+
+#define VRING_DESC_F_NEXT 1
+
+/* This marks a buffer as write-only (otherwise read-only). */
+
+#define VRING_DESC_F_WRITE 2
+
+/* This means the buffer contains a list of buffer descriptors.
+*/
+
+#define VRING_DESC_F_INDIRECT 4
+
+ /* The flags as indicated above. */
+
+ u16 flags;
+
+ /* Next field if flags & NEXT */
+
+ u16 next;
+
+};
+
+The number of descriptors in the table is specified by the Queue
+Size field for this virtqueue.
+
+ <sub:Indirect-Descriptors>Indirect Descriptors
+
+Some devices benefit by concurrently dispatching a large number
+of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
+used to allow this (see [cha:Reserved-Feature-Bits]). To increase
+ring capacity it is possible to store a table of indirect
+descriptors anywhere in memory, and insert a descriptor in main
+virtqueue (with flags&INDIRECT on) that refers to memory buffer
+containing this indirect descriptor table; fields addr and len
+refer to the indirect table address and length in bytes,
+respectively. The indirect table layout structure looks like this
+(len is the length of the descriptor that refers to this table,
+which is a variable, so this code won't compile):
+
+struct indirect_descriptor_table {
+
+ /* The actual descriptors (16 bytes each) */
+
+ struct vring_desc desc[len / 16];
+
+};
+
+The first indirect descriptor is located at start of the indirect
+descriptor table (index 0), additional indirect descriptors are
+chained by next field. An indirect descriptor without next field
+(with flags&NEXT off) signals the end of the indirect descriptor
+table, and transfers control back to the main virtqueue. An
+indirect descriptor can not refer to another indirect descriptor
+table (flags&INDIRECT must be off). A single indirect descriptor
+table can include both read-only and write-only descriptors;
+write-only flag (flags&WRITE) in the descriptor that refers to it
+is ignored.
+
+ Available Ring
+
+The available ring refers to what descriptors we are offering the
+device: it refers to the head of a descriptor chain. The “flags”
+field is currently 0 or 1: 1 indicating that we do not need an
+interrupt when the device consumes a descriptor from the
+available ring. Alternatively, the guest can ask the device to
+delay interrupts until an entry with an index specified by the “
+used_event” field is written in the used ring (equivalently,
+until the idx field in the used ring will reach the value
+used_event + 1). The method employed by the device is controlled
+by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits]
+). This interrupt suppression is merely an optimization; it may
+not suppress interrupts entirely.
+
+The “idx” field indicates where we would put the next descriptor
+entry (modulo the ring size). This starts at 0, and increases.
+
+struct vring_avail {
+
+#define VRING_AVAIL_F_NO_INTERRUPT 1
+
+ u16 flags;
+
+ u16 idx;
+
+ u16 ring[qsz]; /* qsz is the Queue Size field read from device
+*/
+
+ u16 used_event;
+
+};
+
+ Used Ring
+
+The used ring is where the device returns buffers once it is done
+with them. The flags field can be used by the device to hint that
+no notification is necessary when the guest adds to the available
+ring. Alternatively, the “avail_event” field can be used by the
+device to hint that no notification is necessary until an entry
+with an index specified by the “avail_event” is written in the
+available ring (equivalently, until the idx field in the
+available ring will reach the value avail_event + 1). The method
+employed by the device is controlled by the guest through the
+VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits]
+). [footnote:
+These fields are kept here because this is the only part of the
+virtqueue written by the device
+].
+
+Each entry in the ring is a pair: the head entry of the
+descriptor chain describing the buffer (this matches an entry
+placed in the available ring by the guest earlier), and the total
+of bytes written into the buffer. The latter is extremely useful
+for guests using untrusted buffers: if you do not know exactly
+how much has been written by the device, you usually have to zero
+the buffer to ensure no data leakage occurs.
+
+/* u32 is used here for ids for padding reasons. */
+
+struct vring_used_elem {
+
+ /* Index of start of used descriptor chain. */
+
+ u32 id;
+
+ /* Total length of the descriptor chain which was used
+(written to) */
+
+ u32 len;
+
+};
+
+
+
+struct vring_used {
+
+#define VRING_USED_F_NO_NOTIFY 1
+
+ u16 flags;
+
+ u16 idx;
+
+ struct vring_used_elem ring[qsz];
+
+ u16 avail_event;
+
+};
+
+ Helpers for Managing Virtqueues
+
+The Linux Kernel Source code contains the definitions above and
+helper routines in a more usable form, in
+include/linux/virtio_ring.h. This was explicitly licensed by IBM
+and Red Hat under the (3-clause) BSD license so that it can be
+freely used by all other projects, and is reproduced (with slight
+variation to remove Linux assumptions) in Appendix A.
+
+ Device Operation
+
+There are two parts to device operation: supplying new buffers to
+the device, and processing used buffers from the device. As an
+example, the virtio network device has two virtqueues: the
+transmit virtqueue and the receive virtqueue. The driver adds
+outgoing (read-only) packets to the transmit virtqueue, and then
+frees them after they are used. Similarly, incoming (write-only)
+buffers are added to the receive virtqueue, and processed after
+they are used.
+
+ Supplying Buffers to The Device
+
+Actual transfer of buffers from the guest OS to the device
+operates as follows:
+
+ Place the buffer(s) into free descriptor(s).
+
+ If there are no free descriptors, the guest may choose to
+ notify the device even if notifications are suppressed (to
+ reduce latency).[footnote:
+The Linux drivers do this only for read-only buffers: for
+write-only buffers, it is assumed that the driver is merely
+trying to keep the receive buffer ring full, and no notification
+of this expected condition is necessary.
+]
+
+ Place the id of the buffer in the next ring entry of the
+ available ring.
+
+ The steps (1) and (2) may be performed repeatedly if batching
+ is possible.
+
+ A memory barrier should be executed to ensure the device sees
+ the updated descriptor table and available ring before the next
+ step.
+
+ The available “idx” field should be increased by the number of
+ entries added to the available ring.
+
+ A memory barrier should be executed to ensure that we update
+ the idx field before checking for notification suppression.
+
+ If notifications are not suppressed, the device should be
+ notified of the new buffers.
+
+Note that the above code does not take precautions against the
+available ring buffer wrapping around: this is not possible since
+the ring buffer is the same size as the descriptor table, so step
+(1) will prevent such a condition.
+
+In addition, the maximum queue size is 32768 (it must be a power
+of 2 which fits in 16 bits), so the 16-bit “idx” value can always
+distinguish between a full and empty buffer.
+
+Here is a description of each stage in more detail.
+
+ Placing Buffers Into The Descriptor Table
+
+A buffer consists of zero or more read-only physically-contiguous
+elements followed by zero or more physically-contiguous
+write-only elements (it must have at least one element). This
+algorithm maps it into the descriptor table:
+
+ for each buffer element, b:
+
+ Get the next free descriptor table entry, d
+
+ Set d.addr to the physical address of the start of b
+
+ Set d.len to the length of b.
+
+ If b is write-only, set d.flags to VRING_DESC_F_WRITE,
+ otherwise 0.
+
+ If there is a buffer element after this:
+
+ Set d.next to the index of the next free descriptor element.
+
+ Set the VRING_DESC_F_NEXT bit in d.flags.
+
+In practice, the d.next fields are usually used to chain free
+descriptors, and a separate count kept to check there are enough
+free descriptors before beginning the mappings.
+
+ Updating The Available Ring
+
+The head of the buffer we mapped is the first d in the algorithm
+above. A naive implementation would do the following:
+
+avail->ring[avail->idx % qsz] = head;
+
+However, in general we can add many descriptors before we update
+the “idx” field (at which point they become visible to the
+device), so we keep a counter of how many we've added:
+
+avail->ring[(avail->idx + added++) % qsz] = head;
+
+ Updating The Index Field
+
+Once the idx field of the virtqueue is updated, the device will
+be able to access the descriptor entries we've created and the
+memory they refer to. This is why a memory barrier is generally
+used before the idx update, to ensure it sees the most up-to-date
+copy.
+
+The idx field always increments, and we let it wrap naturally at
+65536:
+
+avail->idx += added;
+
+ <sub:Notifying-The-Device>Notifying The Device
+
+Device notification occurs by writing the 16-bit virtqueue index
+of this virtqueue to the Queue Notify field of the virtio header
+in the first I/O region of the PCI device. This can be expensive,
+however, so the device can suppress such notifications if it
+doesn't need them. We have to be careful to expose the new idx
+value before checking the suppression flag: it's OK to notify
+gratuitously, but not to omit a required notification. So again,
+we use a memory barrier here before reading the flags or the
+avail_event field.
+
+If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if
+the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to
+the PCI configuration space.
+
+If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the
+avail_event field in the available ring structure. If the
+available index crossed_the avail_event field value since the
+last notification, we go ahead and write to the PCI configuration
+space. The avail_event field wraps naturally at 65536 as well:
+
+(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
+
+ <sub:Receiving-Used-Buffers>Receiving Used Buffers From The
+ Device
+
+Once the device has used a buffer (read from or written to it, or
+parts of both, depending on the nature of the virtqueue and the
+device), it sends an interrupt, following an algorithm very
+similar to the algorithm used for the driver to send the device a
+buffer:
+
+ Write the head descriptor number to the next field in the used
+ ring.
+
+ Update the used ring idx.
+
+ Determine whether an interrupt is necessary:
+
+ If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: check
+ if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail-
+ >flags
+
+ If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check
+ whether the used index crossed the used_event field value
+ since the last update. The used_event field wraps naturally
+ at 65536 as well:(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)
+
+ If an interrupt is necessary:
+
+ If MSI-X capability is disabled:
+
+ Set the lower bit of the ISR Status field for the device.
+
+ Send the appropriate PCI interrupt for the device.
+
+ If MSI-X capability is enabled:
+
+ Request the appropriate MSI-X interrupt message for the
+ device, Queue Vector field sets the MSI-X Table entry
+ number.
+
+ If Queue Vector field value is NO_VECTOR, no interrupt
+ message is requested for this event.
+
+The guest interrupt handler should:
+
+ If MSI-X capability is disabled: read the ISR Status field,
+ which will reset it to zero. If the lower bit is zero, the
+ interrupt was not for this device. Otherwise, the guest driver
+ should look through the used rings of each virtqueue for the
+ device, to see if any progress has been made by the device
+ which requires servicing.
+
+ If MSI-X capability is enabled: look through the used rings of
+ each virtqueue mapped to the specific MSI-X vector for the
+ device, to see if any progress has been made by the device
+ which requires servicing.
+
+For each ring, guest should then disable interrupts by writing
+VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required.
+It can then process used ring entries finally enabling interrupts
+by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the
+EVENT_IDX field in the available structure, Guest should then
+execute a memory barrier, and then recheck the ring empty
+condition. This is necessary to handle the case where, after the
+last check and before enabling interrupts, an interrupt has been
+suppressed by the device:
+
+vring_disable_interrupts(vq);
+
+for (;;) {
+
+ if (vq->last_seen_used != vring->used.idx) {
+
+ vring_enable_interrupts(vq);
+
+ mb();
+
+ if (vq->last_seen_used != vring->used.idx)
+
+ break;
+
+ }
+
+ struct vring_used_elem *e =
+vring.used->ring[vq->last_seen_used%vsz];
+
+ process_buffer(e);
+
+ vq->last_seen_used++;
+
+}
+
+ Dealing With Configuration Changes
+
+Some virtio PCI devices can change the device configuration
+state, as reflected in the virtio header in the PCI configuration
+space. In this case:
+
+ If MSI-X capability is disabled: an interrupt is delivered and
+ the second highest bit is set in the ISR Status field to
+ indicate that the driver should re-examine the configuration
+ space.Note that a single interrupt can indicate both that one
+ or more virtqueue has been used and that the configuration
+ space has changed: even if the config bit is set, virtqueues
+ must be scanned.
+
+ If MSI-X capability is enabled: an interrupt message is
+ requested. The Configuration Vector field sets the MSI-X Table
+ entry number to use. If Configuration Vector field value is
+ NO_VECTOR, no interrupt message is requested for this event.
+
+Creating New Device Types
+
+Various considerations are necessary when creating a new device
+type:
+
+ How Many Virtqueues?
+
+It is possible that a very simple device will operate entirely
+through its configuration space, but most will need at least one
+virtqueue in which it will place requests. A device with both
+input and output (eg. console and network devices described here)
+need two queues: one which the driver fills with buffers to
+receive input, and one which the driver places buffers to
+transmit output.
+
+ What Configuration Space Layout?
+
+Configuration space is generally used for rarely-changing or
+initialization-time parameters. But it is a limited resource, so
+it might be better to use a virtqueue to update configuration
+information (the network device does this for filtering,
+otherwise the table in the config space could potentially be very
+large).
+
+Note that this space is generally the guest's native endian,
+rather than PCI's little-endian.
+
+ What Device Number?
+
+Currently device numbers are assigned quite freely: a simple
+request mail to the author of this document or the Linux
+virtualization mailing list[footnote:
+
+https://lists.linux-foundation.org/mailman/listinfo/virtualization
+] will be sufficient to secure a unique one.
+
+Meanwhile for experimental drivers, use 65535 and work backwards.
+
+ How many MSI-X vectors?
+
+Using the optional MSI-X capability devices can speed up
+interrupt processing by removing the need to read ISR Status
+register by guest driver (which might be an expensive operation),
+reducing interrupt sharing between devices and queues within the
+device, and handling interrupts from multiple CPUs. However, some
+systems impose a limit (which might be as low as 256) on the
+total number of MSI-X vectors that can be allocated to all
+devices. Devices and/or device drivers should take this into
+account, limiting the number of vectors used unless the device is
+expected to cause a high volume of interrupts. Devices can
+control the number of vectors used by limiting the MSI-X Table
+Size or not presenting MSI-X capability in PCI configuration
+space. Drivers can control this by mapping events to as small
+number of vectors as possible, or disabling MSI-X capability
+altogether.
+
+ Message Framing
+
+The descriptors used for a buffer should not effect the semantics
+of the message, except for the total length of the buffer. For
+example, a network buffer consists of a 10 byte header followed
+by the network packet. Whether this is presented in the ring
+descriptor chain as (say) a 10 byte buffer and a 1514 byte
+buffer, or a single 1524 byte buffer, or even three buffers,
+should have no effect.
+
+In particular, no implementation should use the descriptor
+boundaries to determine the size of any header in a request.[footnote:
+The current qemu device implementations mistakenly insist that
+the first descriptor cover the header in these cases exactly, so
+a cautious driver should arrange it so.
+]
+
+ Device Improvements
+
+Any change to configuration space, or new virtqueues, or
+behavioural changes, should be indicated by negotiation of a new
+feature bit. This establishes clarity[footnote:
+Even if it does mean documenting design or implementation
+mistakes!
+] and avoids future expansion problems.
+
+Clusters of functionality which are always implemented together
+can use a single bit, but if one feature makes sense without the
+others they should not be gratuitously grouped together to
+conserve feature bits. We can always extend the spec when the
+first person needs more than 24 feature bits for their device.
+
+[LaTeX Command: printnomenclature]
+
+Appendix A: virtio_ring.h
+
+#ifndef VIRTIO_RING_H
+
+#define VIRTIO_RING_H
+
+/* An interface for efficient virtio implementation.
+
+ *
+
+ * This header is BSD licensed so anyone can use the definitions
+
+ * to implement compatible drivers/servers.
+
+ *
+
+ * Copyright 2007, 2009, IBM Corporation
+
+ * Copyright 2011, Red Hat, Inc
+
+ * All rights reserved.
+
+ *
+
+ * Redistribution and use in source and binary forms, with or
+without
+
+ * modification, are permitted provided that the following
+conditions
+
+ * are met:
+
+ * 1. Redistributions of source code must retain the above
+copyright
+
+ * notice, this list of conditions and the following
+disclaimer.
+
+ * 2. Redistributions in binary form must reproduce the above
+copyright
+
+ * notice, this list of conditions and the following
+disclaimer in the
+
+ * documentation and/or other materials provided with the
+distribution.
+
+ * 3. Neither the name of IBM nor the names of its contributors
+
+ * may be used to endorse or promote products derived from
+this software
+
+ * without specific prior written permission.
+
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+CONTRIBUTORS ``AS IS'' AND
+
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+TO, THE
+
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE
+
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE
+LIABLE
+
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL
+
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS
+
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION)
+
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT
+
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+IN ANY WAY
+
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF
+
+ * SUCH DAMAGE.
+
+ */
+
+
+
+/* This marks a buffer as continuing via the next field. */
+
+#define VRING_DESC_F_NEXT 1
+
+/* This marks a buffer as write-only (otherwise read-only). */
+
+#define VRING_DESC_F_WRITE 2
+
+
+
+/* The Host uses this in used->flags to advise the Guest: don't
+kick me
+
+ * when you add a buffer. It's unreliable, so it's simply an
+
+ * optimization. Guest will still kick if it's out of buffers.
+*/
+
+#define VRING_USED_F_NO_NOTIFY 1
+
+/* The Guest uses this in avail->flags to advise the Host: don't
+
+ * interrupt me when you consume a buffer. It's unreliable, so
+it's
+
+ * simply an optimization. */
+
+#define VRING_AVAIL_F_NO_INTERRUPT 1
+
+
+
+/* Virtio ring descriptors: 16 bytes.
+
+ * These can chain together via "next". */
+
+struct vring_desc {
+
+ /* Address (guest-physical). */
+
+ uint64_t addr;
+
+ /* Length. */
+
+ uint32_t len;
+
+ /* The flags as indicated above. */
+
+ uint16_t flags;
+
+ /* We chain unused descriptors via this, too */
+
+ uint16_t next;
+
+};
+
+
+
+struct vring_avail {
+
+ uint16_t flags;
+
+ uint16_t idx;
+
+ uint16_t ring[];
+
+ uint16_t used_event;
+
+};
+
+
+
+/* u32 is used here for ids for padding reasons. */
+
+struct vring_used_elem {
+
+ /* Index of start of used descriptor chain. */
+
+ uint32_t id;
+
+ /* Total length of the descriptor chain which was written
+to. */
+
+ uint32_t len;
+
+};
+
+
+
+struct vring_used {
+
+ uint16_t flags;
+
+ uint16_t idx;
+
+ struct vring_used_elem ring[];
+
+ uint16_t avail_event;
+
+};
+
+
+
+struct vring {
+
+ unsigned int num;
+
+
+
+ struct vring_desc *desc;
+
+ struct vring_avail *avail;
+
+ struct vring_used *used;
+
+};
+
+
+
+/* The standard layout for the ring is a continuous chunk of
+memory which
+
+ * looks like this. We assume num is a power of 2.
+
+ *
+
+ * struct vring {
+
+ * // The actual descriptors (16 bytes each)
+
+ * struct vring_desc desc[num];
+
+ *
+
+ * // A ring of available descriptor heads with free-running
+index.
+
+ * __u16 avail_flags;
+
+ * __u16 avail_idx;
+
+ * __u16 available[num];
+
+ *
+
+ * // Padding to the next align boundary.
+
+ * char pad[];
+
+ *
+
+ * // A ring of used descriptor heads with free-running
+index.
+
+ * __u16 used_flags;
+
+ * __u16 EVENT_IDX;
+
+ * struct vring_used_elem used[num];
+
+ * };
+
+ * Note: for virtio PCI, align is 4096.
+
+ */
+
+static inline void vring_init(struct vring *vr, unsigned int num,
+void *p,
+
+ unsigned long align)
+
+{
+
+ vr->num = num;
+
+ vr->desc = p;
+
+ vr->avail = p + num*sizeof(struct vring_desc);
+
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
+
+ + align-1)
+
+ & ~(align - 1));
+
+}
+
+
+
+static inline unsigned vring_size(unsigned int num, unsigned long
+align)
+
+{
+
+ return ((sizeof(struct vring_desc)*num +
+sizeof(uint16_t)*(2+num)
+
+ + align - 1) & ~(align - 1))
+
+ + sizeof(uint16_t)*3 + sizeof(struct
+vring_used_elem)*num;
+
+}
+
+
+
+static inline int vring_need_event(uint16_t event_idx, uint16_t
+new_idx, uint16_t old_idx)
+
+{
+
+ return (uint16_t)(new_idx - event_idx - 1) <
+(uint16_t)(new_idx - old_idx);
+
+}
+
+#endif /* VIRTIO_RING_H */
+
+<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits
+
+Currently there are five device-independent feature bits defined:
+
+ VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
+ indicates that the driver wants an interrupt if the device runs
+ out of available descriptors on a virtqueue, even though
+ interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
+ flag or the used_event field. An example of this is the
+ networking driver: it doesn't need to know every time a packet
+ is transmitted, but it does need to free the transmitted
+ packets a finite time after they are transmitted. It can avoid
+ using a timer if the device interrupts it when all the packets
+ are transmitted.
+
+ VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
+ indicates that the driver can use descriptors with the
+ VRING_DESC_F_INDIRECT flag set, as described in [sub:Indirect-Descriptors]
+ .
+
+ VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event
+ and the avail_event fields. If set, it indicates that the
+ device should ignore the flags field in the available ring
+ structure. Instead, the used_event field in this structure is
+ used by guest to suppress device interrupts. Further, the
+ driver should ignore the flags field in the used ring
+ structure. Instead, the avail_event field in this structure is
+ used by the device to suppress notifications. If unset, the
+ driver should ignore the used_event field; the device should
+ ignore the avail_event field; the flags field is used
+
+ VIRTIO_F_BAD_FEATURE(30) This feature should never be
+ negotiated by the guest; doing so is an indication that the
+ guest is faulty[footnote:
+An experimental virtio PCI driver contained in Linux version
+2.6.25 had this problem, and this feature bit can be used to
+detect it.
+]
+
+ VIRTIO_F_FEATURES_HIGH(31) This feature indicates that the
+ device supports feature bits 32:63. If unset, feature bits
+ 32:63 are unset.
+
+Appendix C: Network Device
+
+The virtio network device is a virtual ethernet card, and is the
+most complex of the devices supported so far by virtio. It has
+enhanced rapidly and demonstrates clearly how support for new
+features should be added to an existing device. Empty buffers are
+placed in one virtqueue for receiving packets, and outgoing
+packets are enqueued into another for transmission in that order.
+A third command queue is used to control advanced filtering
+features.
+
+ Configuration
+
+ Subsystem Device ID 1
+
+ Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote:
+Only if VIRTIO_NET_F_CTRL_VQ set
+]
+
+ Feature bits
+
+ VIRTIO_NET_F_CSUM (0) Device handles packets with partial
+ checksum
+
+ VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
+ checksum
+
+ VIRTIO_NET_F_MAC (5) Device has given MAC address.
+
+ VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
+ any GSO type.[footnote:
+It was supposed to indicate segmentation offload support, but
+upon further investigation it became clear that multiple bits
+were required.
+]
+
+ VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.
+
+ VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.
+
+ VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.
+
+ VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.
+
+ VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.
+
+ VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.
+
+ VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.
+
+ VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.
+
+ VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.
+
+ VIRTIO_NET_F_STATUS (16) Configuration status field is
+ available.
+
+ VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.
+
+ VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.
+
+ VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.
+
+ Device configuration layout Two configuration fields are
+ currently defined. The mac address field always exists (though
+ is only valid if VIRTIO_NET_F_MAC is set), and the status field
+ only exists if VIRTIO_NET_F_STATUS is set. Only one bit is
+ currently defined for the status field: VIRTIO_NET_S_LINK_UP. #define VIRTIO_NET_S_LINK_UP 1
+
+
+
+struct virtio_net_config {
+
+ u8 mac[6];
+
+ u16 status;
+
+};
+
+ Device Initialization
+
+ The initialization routine should identify the receive and
+ transmission virtqueues.
+
+ If the VIRTIO_NET_F_MAC feature bit is set, the configuration
+ space “mac” entry indicates the “physical” address of the the
+ network card, otherwise a private MAC address should be
+ assigned. All guests are expected to negotiate this feature if
+ it is set.
+
+ If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
+ the control virtqueue.
+
+ If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
+ status can be read from the bottom bit of the “status” config
+ field. Otherwise, the link should be assumed active.
+
+ The receive virtqueue should be filled with receive buffers.
+ This is described in detail below in “Setting Up Receive
+ Buffers”.
+
+ A driver can indicate that it will generate checksumless
+ packets by negotating the VIRTIO_NET_F_CSUM feature. This “
+ checksum offload” is a common feature on modern network cards.
+
+ If that feature is negotiated, a driver can use TCP or UDP
+ segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4
+ (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and
+ VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should
+ not send TCP packets requiring segmentation offload which have
+ the Explicit Congestion Notification bit set, unless the
+ VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
+This is a common restriction in real, older network cards.
+]
+
+ The converse features are also available: a driver can save the
+ virtual device some work by negotiating these features.[footnote:
+For example, a network packet transported between two guests on
+the same system may not require checksumming at all, nor
+segmentation, if both guests are amenable.
+] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
+ checksummed packets can be received, and if it can do that then
+ the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
+ VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
+ equivalents of the features described above. See “Receiving
+ Packets” below.
+
+ Device Operation
+
+Packets are transmitted by placing them in the transmitq, and
+buffers for incoming packets are placed in the receiveq. In each
+case, the packet itself is preceded by a header:
+
+struct virtio_net_hdr {
+
+#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
+
+ u8 flags;
+
+#define VIRTIO_NET_HDR_GSO_NONE 0
+
+#define VIRTIO_NET_HDR_GSO_TCPV4 1
+
+#define VIRTIO_NET_HDR_GSO_UDP 3
+
+#define VIRTIO_NET_HDR_GSO_TCPV6 4
+
+#define VIRTIO_NET_HDR_GSO_ECN 0x80
+
+ u8 gso_type;
+
+ u16 hdr_len;
+
+ u16 gso_size;
+
+ u16 csum_start;
+
+ u16 csum_offset;
+
+/* Only if VIRTIO_NET_F_MRG_RXBUF: */
+
+ u16 num_buffers
+
+};
+
+The controlq is used to control device features such as
+filtering.
+
+ Packet Transmission
+
+Transmitting a single packet is simple, but varies depending on
+the different features the driver negotiated.
+
+ If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
+ not been fully checksummed, then the virtio_net_hdr's fields
+ are set as follows. Otherwise, the packet must be fully
+ checksummed, and flags is zero.
+
+ flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,
+
+ <ite:csum_start-is-set>csum_start is set to the offset within
+ the packet to begin checksumming, and
+
+ csum_offset indicates how many bytes after the csum_start the
+ new (16 bit ones' complement) checksum should be placed.[footnote:
+For example, consider a partially checksummed TCP (IPv4) packet.
+It will have a 14 byte ethernet header and 20 byte IP header
+followed by the TCP header (with the TCP checksum field 16 bytes
+into that header). csum_start will be 14+20 = 34 (the TCP
+checksum includes the header), and csum_offset will be 16. The
+value in the TCP checksum field will be the sum of the TCP pseudo
+header, so that replacing it by the ones' complement checksum of
+the TCP header and body will give the correct result.
+]
+
+ <enu:If-the-driver>If the driver negotiated
+ VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
+ TCP segmentation or UDP fragmentation, then the “gso_type”
+ field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
+ (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
+ case, packets larger than 1514 bytes can be transmitted: the
+ metadata indicates how to replicate the packet header to cut it
+ into smaller packets. The other gso fields are set:
+
+ hdr_len is a hint to the device as to how much of the header
+ needs to be kept to copy into each packet, usually set to the
+ length of the headers, including the transport header.[footnote:
+Due to various bugs in implementations, this field is not useful
+as a guarantee of the transport header size.
+]
+
+ gso_size is the size of the packet beyond that header (ie.
+ MSS).
+
+ If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
+ VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
+ indicating that the TCP packet has the ECN bit set.[footnote:
+This case is not handled by some older hardware, so is called out
+specifically in the protocol.
+]
+
+ If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
+ the num_buffers field is set to zero.
+
+ The header and packet are added as one output buffer to the
+ transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device]
+ ).[footnote:
+Note that the header will be two bytes longer for the
+VIRTIO_NET_F_MRG_RXBUF case.
+]
+
+ Packet Transmission Interrupt
+
+Often a driver will suppress transmission interrupts using the
+VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers]
+) and check for used packets in the transmit path of following
+packets. However, it will still receive interrupts if the
+VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
+the transmission queue is completely emptied.
+
+The normal behavior in this interrupt handler is to retrieve and
+new descriptors from the used ring and free the corresponding
+headers and packets.
+
+ Setting Up Receive Buffers
+
+It is generally a good idea to keep the receive virtqueue as
+fully populated as possible: if it runs out, network performance
+will suffer.
+
+If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
+VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
+accept packets of up to 65550 bytes long (the maximum size of a
+TCP or UDP packet, plus the 14 byte ethernet header), otherwise
+1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
+buffer in the receive queue needs to be at least this length [footnote:
+Obviously each one can be split across multiple descriptor
+elements.
+].
+
+If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
+least the size of the struct virtio_net_hdr.
+
+ Packet Receive Interrupt
+
+When a packet is copied into a buffer in the receiveq, the
+optimal path is to disable further interrupts for the receiveq
+(see [sub:Receiving-Used-Buffers]) and process packets until no
+more are found, then re-enable them.
+
+Processing packet involves:
+
+ If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
+ then the “num_buffers” field indicates how many descriptors
+ this packet is spread over (including this one). This allows
+ receipt of large packets without having to allocate large
+ buffers. In this case, there will be at least “num_buffers” in
+ the used ring, and they should be chained together to form a
+ single packet. The other buffers will not begin with a struct
+ virtio_net_hdr.
+
+ If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
+ the “num_buffers” field is one, then the entire packet will be
+ contained within this buffer, immediately following the struct
+ virtio_net_hdr.
+
+ If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
+ VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
+ set: if so, the checksum on the packet is incomplete and the “
+ csum_start” and “csum_offset” fields indicate how to calculate
+ it (see [ite:csum_start-is-set]).
+
+ If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
+ negotiated, then the “gso_type” may be something other than
+ VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
+ desired MSS (see [enu:If-the-driver]).Control Virtqueue
+
+The driver uses the control virtqueue (if VIRTIO_NET_F_VTRL_VQ is
+negotiated) to send commands to manipulate various features of
+the device which would not easily map into the configuration
+space.
+
+All commands are of the following form:
+
+struct virtio_net_ctrl {
+
+ u8 class;
+
+ u8 command;
+
+ u8 command-specific-data[];
+
+ u8 ack;
+
+};
+
+
+
+/* ack values */
+
+#define VIRTIO_NET_OK 0
+
+#define VIRTIO_NET_ERR 1
+
+The class, command and command-specific-data are set by the
+driver, and the device sets the ack byte. There is little it can
+do except issue a diagnostic if the ack byte is not
+VIRTIO_NET_OK.
+
+ Packet Receive Filtering
+
+If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
+send control commands for promiscuous mode, multicast receiving,
+and filtering of MAC addresses.
+
+Note that in general, these commands are best-effort: unwanted
+packets may still arrive.
+
+ Setting Promiscuous Mode
+
+#define VIRTIO_NET_CTRL_RX 0
+
+ #define VIRTIO_NET_CTRL_RX_PROMISC 0
+
+ #define VIRTIO_NET_CTRL_RX_ALLMULTI 1
+
+The class VIRTIO_NET_CTRL_RX has two commands:
+VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
+VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
+off. The command-specific-data is one byte containing 0 (off) or
+1 (on).
+
+ Setting MAC Address Filtering
+
+struct virtio_net_ctrl_mac {
+
+ u32 entries;
+
+ u8 macs[entries][ETH_ALEN];
+
+};
+
+
+
+#define VIRTIO_NET_CTRL_MAC 1
+
+ #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0
+
+The device can filter incoming packets by any number of
+destination MAC addresses.[footnote:
+Since there are no guarantees, it can use a hash filter
+orsilently switch to allmulti or promiscuous mode if it is given
+too many addresses.
+] This table is set using the class VIRTIO_NET_CTRL_MAC and the
+command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
+is two variable length tables of 6-byte MAC addresses. The first
+table contains unicast addresses, and the second contains
+multicast addresses.
+
+ VLAN Filtering
+
+If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it
+can control a VLAN filter table in the device.
+
+#define VIRTIO_NET_CTRL_VLAN 2
+
+ #define VIRTIO_NET_CTRL_VLAN_ADD 0
+
+ #define VIRTIO_NET_CTRL_VLAN_DEL 1
+
+Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
+command take a 16-bit VLAN id as the command-specific-data.
+
+Appendix D: Block Device
+
+The virtio block device is a simple virtual block device (ie.
+disk). Read and write requests (and other exotic requests) are
+placed in the queue, and serviced (probably out of order) by the
+device except where noted.
+
+ Configuration
+
+ Subsystem Device ID 2
+
+ Virtqueues 0:requestq.
+
+ Feature bits
+
+ VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.
+
+ VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
+ in “size_max”.
+
+ VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
+ request is in “seg_max”.
+
+ VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in “
+ geometry”.
+
+ VIRTIO_BLK_F_RO (5) Device is read-only.
+
+ VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
+
+ VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
+
+ VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
+
+
+
+ Device configuration layout The capacity of the device
+ (expressed in 512-byte sectors) is always present. The
+ availability of the others all depend on various feature bits
+ as indicated above. struct virtio_blk_config {
+
+ u64 capacity;
+
+ u32 size_max;
+
+ u32 seg_max;
+
+ struct virtio_blk_geometry {
+
+ u16 cylinders;
+
+ u8 heads;
+
+ u8 sectors;
+
+ } geometry;
+
+ u32 blk_size;
+
+
+
+};
+
+ Device Initialization
+
+ The device size should be read from the “capacity”
+ configuration field. No requests should be submitted which goes
+ beyond this limit.
+
+ If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
+ blk_size field can be read to determine the optimal sector size
+ for the driver to use. This does not effect the units used in
+ the protocol (always 512 bytes), but awareness of the correct
+ value can effect performance.
+
+ If the VIRTIO_BLK_F_RO feature is set by the device, any write
+ requests will fail.
+
+
+
+ Device Operation
+
+The driver queues requests to the virtqueue, and they are used by
+the device (not necessarily in order). Each request is of form:
+
+struct virtio_blk_req {
+
+
+
+ u32 type;
+
+ u32 ioprio;
+
+ u64 sector;
+
+ char data[][512];
+
+ u8 status;
+
+};
+
+If the device has VIRTIO_BLK_F_SCSI feature, it can also support
+scsi packet command requests, each of these requests is of form:struct virtio_scsi_pc_req {
+
+ u32 type;
+
+ u32 ioprio;
+
+ u64 sector;
+
+ char cmd[];
+
+ char data[][512];
+
+#define SCSI_SENSE_BUFFERSIZE 96
+
+ u8 sense[SCSI_SENSE_BUFFERSIZE];
+
+ u32 errors;
+
+ u32 data_len;
+
+ u32 sense_len;
+
+ u32 residual;
+
+ u8 status;
+
+};
+
+The type of the request is either a read (VIRTIO_BLK_T_IN), a
+write (VIRTIO_BLK_T_OUT), a scsi packet command
+(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
+the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
+does not distinguish between them
+]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
+the FLUSH and FLUSH_OUT types are equivalent, the device does not
+distinguish between them
+]). If the device has VIRTIO_BLK_F_BARRIER feature the high bit
+(VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
+barrier and that all preceding requests must be complete before
+this one, and all following requests must not be started until
+this is complete. Note that a barrier does not flush caches in
+the underlying backend device in host, and thus does not serve as
+data consistency guarantee. Driver must use FLUSH request to
+flush the host cache.
+
+#define VIRTIO_BLK_T_IN 0
+
+#define VIRTIO_BLK_T_OUT 1
+
+#define VIRTIO_BLK_T_SCSI_CMD 2
+
+#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
+
+#define VIRTIO_BLK_T_FLUSH 4
+
+#define VIRTIO_BLK_T_FLUSH_OUT 5
+
+#define VIRTIO_BLK_T_BARRIER 0x80000000
+
+The ioprio field is a hint about the relative priorities of
+requests to the device: higher numbers indicate more important
+requests.
+
+The sector number indicates the offset (multiplied by 512) where
+the read or write is to occur. This field is unused and set to 0
+for scsi packet commands and for flush commands.
+
+The cmd field is only present for scsi packet command requests,
+and indicates the command to perform. This field must reside in a
+single, separate read-only buffer; command length can be derived
+from the length of this buffer.
+
+Note that these first three (four for scsi packet commands)
+fields are always read-only: the data field is either read-only
+or write-only, depending on the request. The size of the read or
+write can be derived from the total size of the request buffers.
+
+The sense field is only present for scsi packet command requests,
+and indicates the buffer for scsi sense data.
+
+The data_len field is only present for scsi packet command
+requests, this field is deprecated, and should be ignored by the
+driver. Historically, devices copied data length there.
+
+The sense_len field is only present for scsi packet command
+requests and indicates the number of bytes actually written to
+the sense buffer.
+
+The residual field is only present for scsi packet command
+requests and indicates the residual size, calculated as data
+length - number of bytes actually transferred.
+
+The final status byte is written by the device: either
+VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
+error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:#define VIRTIO_BLK_S_OK 0
+
+#define VIRTIO_BLK_S_IOERR 1
+
+#define VIRTIO_BLK_S_UNSUPP 2
+
+Historically, devices assumed that the fields type, ioprio and
+sector reside in a single, separate read-only buffer; the fields
+errors, data_len, sense_len and residual reside in a single,
+separate write-only buffer; the sense field in a separate
+write-only buffer of size 96 bytes, by itself; the fields errors,
+data_len, sense_len and residual in a single write-only buffer;
+and the status field is a separate read-only buffer of size 1
+byte, by itself.
+
+Appendix E: Console Device
+
+The virtio console device is a simple device for data input and
+output. A device may have one or more ports. Each port has a pair
+of input and output virtqueues. Moreover, a device has a pair of
+control IO virtqueues. The control virtqueues are used to
+communicate information between the device and the driver about
+ports being opened and closed on either side of the connection,
+indication from the host about whether a particular port is a
+console port, adding new ports, port hot-plug/unplug, etc., and
+indication from the guest about whether a port or a device was
+successfully added, port open/close, etc.. For data IO, one or
+more empty buffers are placed in the receive queue for incoming
+data and outgoing characters are placed in the transmit queue.
+
+ Configuration
+
+ Subsystem Device ID 3
+
+ Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
+ receiveq[footnote:
+Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
+], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
+ ...
+
+ Feature bits
+
+ VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
+ are valid.
+
+ VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple
+ ports; configuration fields nr_ports and max_nr_ports are
+ valid and control virtqueues will be used.
+
+ Device configuration layout The size of the console is supplied
+ in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
+ is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
+ is set, the maximum number of ports supported by the device can
+ be fetched.struct virtio_console_config {
+
+ u16 cols;
+
+ u16 rows;
+
+
+
+ u32 max_nr_ports;
+
+};
+
+ Device Initialization
+
+ If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
+ can read the console dimensions from the configuration fields.
+
+ If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
+ driver can spawn multiple ports, not all of which may be
+ attached to a console. Some could be generic ports. In this
+ case, the control virtqueues are enabled and according to the
+ max_nr_ports configuration-space value, the appropriate number
+ of virtqueues are created. A control message indicating the
+ driver is ready is sent to the host. The host can then send
+ control messages for adding new ports to the device. After
+ creating and initializing each port, a
+ VIRTIO_CONSOLE_PORT_READY control message is sent to the host
+ for that port so the host can let us know of any additional
+ configuration options set for that port.
+
+ The receiveq for each port is populated with one or more
+ receive buffers.
+
+ Device Operation
+
+ For output, a buffer containing the characters is placed in the
+ port's transmitq.[footnote:
+Because this is high importance and low bandwidth, the current
+Linux implementation polls for the buffer to be used, rather than
+waiting for an interrupt, simplifying the implementation
+significantly. However, for generic serial ports with the
+O_NONBLOCK flag set, the polling limitation is relaxed and the
+consumed buffers are freed upon the next write or poll call or
+when a port is closed or hot-unplugged.
+]
+
+ When a buffer is used in the receiveq (signalled by an
+ interrupt), the contents is the input to the port associated
+ with the virtqueue for which the notification was received.
+
+ If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
+ configuration change interrupt may occur. The updated size can
+ be read from the configuration fields.
+
+ If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
+ feature, active ports are announced by the host using the
+ VIRTIO_CONSOLE_PORT_ADD control message. The same message is
+ used for port hot-plug as well.
+
+ If the host specified a port `name', a sysfs attribute is
+ created with the name filled in, so that udev rules can be
+ written that can create a symlink from the port's name to the
+ char device for port discovery by applications in the guest.
+
+ Changes to ports' state are effected by control messages.
+ Appropriate action is taken on the port indicated in the
+ control message. The layout of the structure of the control
+ buffer and the events associated are:struct virtio_console_control {
+
+ uint32_t id; /* Port number */
+
+ uint16_t event; /* The kind of control event */
+
+ uint16_t value; /* Extra information for the event */
+
+};
+
+
+
+/* Some events for the internal messages (control packets) */
+
+
+
+#define VIRTIO_CONSOLE_DEVICE_READY 0
+
+#define VIRTIO_CONSOLE_PORT_ADD 1
+
+#define VIRTIO_CONSOLE_PORT_REMOVE 2
+
+#define VIRTIO_CONSOLE_PORT_READY 3
+
+#define VIRTIO_CONSOLE_CONSOLE_PORT 4
+
+#define VIRTIO_CONSOLE_RESIZE 5
+
+#define VIRTIO_CONSOLE_PORT_OPEN 6
+
+#define VIRTIO_CONSOLE_PORT_NAME 7
+
+Appendix F: Entropy Device
+
+The virtio entropy device supplies high-quality randomness for
+guest use.
+
+ Configuration
+
+ Subsystem Device ID 4
+
+ Virtqueues 0:requestq.
+
+ Feature bits None currently defined
+
+ Device configuration layout None currently defined.
+
+ Device Initialization
+
+ The virtqueue is initialized
+
+ Device Operation
+
+When the driver requires random bytes, it places the descriptor
+of one or more buffers in the queue. It will be completely filled
+by random data by the device.
+
+Appendix G: Memory Balloon Device
+
+The virtio memory balloon device is a primitive device for
+managing guest memory: the device asks for a certain amount of
+memory, and the guest supplies it (or withdraws it, if the device
+has more than it asks for). This allows the guest to adapt to
+changes in allowance of underlying physical memory. If the
+feature is negotiated, the device can also be used to communicate
+guest memory statistics to the host.
+
+ Configuration
+
+ Subsystem Device ID 5
+
+ Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote:
+Only if VIRTIO_BALLON_F_STATS_VQ set
+]
+
+ Feature bits
+
+ VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
+ pages from the balloon are used.
+
+ VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
+ memory statistics is present.
+
+ Device configuration layout Both fields of this configuration
+ are always available. Note that they are little endian, despite
+ convention that device fields are guest endian:struct virtio_balloon_config {
+
+ u32 num_pages;
+
+ u32 actual;
+
+};
+
+ Device Initialization
+
+ The inflate and deflate virtqueues are identified.
+
+ If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:
+
+ Identify the stats virtqueue.
+
+ Add one empty buffer to the stats virtqueue and notify the
+ host.
+
+Device operation begins immediately.
+
+ Device Operation
+
+ Memory Ballooning The device is driven by the receipt of a
+ configuration change interrupt.
+
+ The “num_pages” configuration field is examined. If this is
+ greater than the “actual” number of pages, memory must be given
+ to the balloon. If it is less than the “actual” number of
+ pages, memory may be taken back from the balloon for general
+ use.
+
+ To supply memory to the balloon (aka. inflate):
+
+ The driver constructs an array of addresses of unused memory
+ pages. These addresses are divided by 4096[footnote:
+This is historical, and independent of the guest page size
+] and the descriptor describing the resulting 32-bit array is
+ added to the inflateq.
+
+ To remove memory from the balloon (aka. deflate):
+
+ The driver constructs an array of addresses of memory pages it
+ has previously given to the balloon, as described above. This
+ descriptor is added to the deflateq.
+
+ If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
+ guest may not use these requested pages until that descriptor
+ in the deflateq has been used by the device.
+
+ Otherwise, the guest may begin to re-use pages previously given
+ to the balloon before the device has acknowledged their
+ withdrawal. [footnote:
+In this case, deflation advice is merely a courtesy
+]
+
+ In either case, once the device has completed the inflation or
+ deflation, the “actual” field of the configuration should be
+ updated to reflect the new number of pages in the balloon.[footnote:
+As updates to configuration space are not atomic, this field
+isn't particularly reliable, but can be used to diagnose buggy
+guests.
+]
+
+ Memory Statistics
+
+The stats virtqueue is atypical because communication is driven
+by the device (not the driver). The channel becomes active at
+driver initialization time when the driver adds an empty buffer
+and notifies the device. A request for memory statistics proceeds
+as follows:
+
+ The device pushes the buffer onto the used ring and sends an
+ interrupt.
+
+ The driver pops the used buffer and discards it.
+
+ The driver collects memory statistics and writes them into a
+ new buffer.
+
+ The driver adds the buffer to the virtqueue and notifies the
+ device.
+
+ The device pops the buffer (retaining it to initiate a
+ subsequent request) and consumes the statistics.
+
+ Memory Statistics Format Each statistic consists of a 16 bit
+ tag and a 64 bit value. Both quantities are represented in the
+ native endian of the guest. All statistics are optional and the
+ driver may choose which ones to supply. To guarantee backwards
+ compatibility, unsupported statistics should be omitted.
+
+ struct virtio_balloon_stat {
+
+#define VIRTIO_BALLOON_S_SWAP_IN 0
+
+#define VIRTIO_BALLOON_S_SWAP_OUT 1
+
+#define VIRTIO_BALLOON_S_MAJFLT 2
+
+#define VIRTIO_BALLOON_S_MINFLT 3
+
+#define VIRTIO_BALLOON_S_MEMFREE 4
+
+#define VIRTIO_BALLOON_S_MEMTOT 5
+
+ u16 tag;
+
+ u64 val;
+
+} __attribute__((packed));
+
+ Tags
+
+ VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
+ swapped in (in bytes).
+
+ VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
+ swapped out to disk (in bytes).
+
+ VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
+ have occurred.
+
+ VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
+ have occurred.
+
+ VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
+ for any purpose (in bytes).
+
+ VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
+ (in bytes).
+