Intro to IOMMU in Linux
An Introduction to IOMMU Infrastructure in the Linux Kernel
Adrian Huang
This paper explains IOMMU technology, giving a high-level overview of the IOMMU and of the
IOMMU infrastructure in the Linux kernel. It then describes the two IOMMU kernel modes, DMA
translation mode and pass-through mode, in detail. The last section illustrates an IOMMU
use case: a PCI pass-through device in a virtualization environment.
This paper is intended for IT specialists who want a high-level understanding of the
difference between IOMMU DMA translation mode and IOMMU pass-through mode. Readers should
know how to configure the Linux kernel and be familiar with virtualization technologies such
as KVM and Xen. The paper is also suitable for software developers who want to learn about
the Linux kernel IOMMU subsystem; they should already have kernel development experience
and know how an MMU works.
At Lenovo® Press, we bring together experts to produce technical publications around topics
of importance to you, providing information and best practices for using Lenovo products and
solutions to solve IT challenges.
See a list of our most recent publications at the Lenovo Press web site:
http://lenovopress.com
Contents
Introduction
IOMMU Subsystem in Linux Kernel
Linux Kernel IOMMU: DMA Translation Mode versus Pass-through Mode
Direct Device Access Use Case in Virtualization Environment
Summary
Acronyms
References
Author
Notices
Trademarks
[Figure labels: Unprivileged domain (DomU): two Guest OS (VM) boxes, each with a Guest driver;
Privileged domain (Dom0): Emulated device; Hardware Platform: Physical device]
From https://developer.ibm.com/tutorials/l-pci-passthrough/
Intel names the hardware-assisted component “Intel Virtualization Technology for Directed
I/O (VT-d)”, whereas AMD titles it “AMD I/O Memory Management Unit (IOMMU) or AMD I/O
Virtualization Technology (AMD-Vi)”.
[Figure labels: Unprivileged domain (DomU): Guest OS (VM) with Guest driver, Guest OS (VM) with
Physical driver (passthrough); Privileged domain (Dom0): Emulated device, Physical device driver;
Hypervisor (VMM); Hardware Platform: two Physical devices]
From https://developer.ibm.com/tutorials/l-pci-passthrough/
[Figure labels: Unprivileged domain (DomU): Guest OS (VM) with Physical driver (passthrough);
Privileged domain (Dom0); IOMMU Hardware; Hardware Platform: Physical device]
The operating system needs to understand the IOMMU hardware information, so the system
firmware provides the IOMMU description by means of an ACPI table. This is also
discussed in this section.
IOMMU Subsystem in Linux Kernel – High-level Overview
Figure 4 gives a high-level overview of the IOMMU subsystem in the Linux kernel.
[Figure 4 labels: DMA Request; IOMMU Subsystem; IOMMU DMA; IOMMU Generic Layer; Device Table;
I/O Page Table; Level table address; Page Table; PTE; Physical Page; Data]
Figure 4 IOMMU Subsystem in Linux Kernel: High-level Overview
When does the IOMMU hardware-specific layer update the I/O page table? There are two cases:
Direct mapping (or identity mapping) defined in the Advanced Configuration and Power
Interface (ACPI) table.
When probing and initializing the IOMMU hardware, the IOMMU hardware-specific layer parses
the direct mapping information stored in the ACPI table and configures the I/O page table
accordingly.
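As a rough illustration of direct (identity) mapping, the following Python sketch models an I/O page table as a flat dictionary and installs mappings in which each I/O virtual address equals its physical address. The class and function names are hypothetical, and the model ignores multi-level tables and access permissions; it is a sketch of the idea, not the kernel's implementation.

```python
PAGE_SIZE = 4096  # 4-Kbyte pages


class ToyIoPageTable:
    """Hypothetical flat model of an I/O page table: IOVA page -> physical page."""

    def __init__(self):
        self.entries = {}

    def map_page(self, iova, phys):
        # Both addresses must be page aligned in this simplified model.
        if iova % PAGE_SIZE or phys % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        self.entries[iova // PAGE_SIZE] = phys // PAGE_SIZE

    def translate(self, iova):
        # Look up the page frame and keep the offset within the page.
        frame = self.entries.get(iova // PAGE_SIZE)
        if frame is None:
            raise LookupError(f"no mapping for IOVA {iova:#x}")
        return frame * PAGE_SIZE + iova % PAGE_SIZE


def identity_map_range(table, start, length):
    """Map [start, start + length) so that each IOVA equals its physical address."""
    first_page = start - start % PAGE_SIZE
    for addr in range(first_page, start + length, PAGE_SIZE):
        table.map_page(addr, addr)
```

After `identity_map_range(table, 0x1000, 0x2000)`, a device access to any address in that range translates to the same address, which is the behavior a firmware-defined direct mapping asks for.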
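[Figure 5 labels: DMA Request: Interaction between device driver and IOMMU; device driver
invokes the pci_map_page() or dma_map_page() API with a physical address and receives a
dma_addr_t; DMA subsystem: DMA Mapping Layer; decision "DMA direct?": Y leads to DMA Direct
Mapping, N leads to DMA Map OPS (iommu_dma_map_page); IOMMU Subsystem]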
Figure 5 DMA Request: Interaction between Device Driver, DMA Subsystem and IOMMU Subsystem
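The decision that Figure 5 depicts, whether a mapping request takes the direct path or goes through the IOMMU map operation, can be sketched as a toy model. Everything below is hypothetical Python written for illustration (the names `ToyIommuDomain` and `toy_dma_map_page` are not kernel APIs), and the IOVA allocator is a simple bump counter rather than the kernel's allocator.

```python
PAGE_SIZE = 4096


class ToyIommuDomain:
    """Hypothetical stand-in for the IOMMU subsystem: hands out IOVAs."""

    def __init__(self, iova_base=0x100000):
        self.next_iova = iova_base
        self.page_table = {}  # IOVA page number -> physical page number

    def map_page(self, phys):
        # Allocate the next free IOVA page and record the translation.
        iova = self.next_iova
        self.next_iova += PAGE_SIZE
        self.page_table[iova // PAGE_SIZE] = phys // PAGE_SIZE
        return iova + phys % PAGE_SIZE


def toy_dma_map_page(phys, domain=None):
    """Return the DMA address a device would use for physical address `phys`.

    domain=None models the "DMA direct" branch: no translation, so the DMA
    address is the physical address. Otherwise the request goes through the
    IOMMU map operation and the device receives an I/O virtual address.
    """
    if domain is None:
        return phys
    return domain.map_page(phys)
```

In the direct case the driver gets back the physical address unchanged; in the IOMMU case it gets an address that only the IOMMU page table can resolve.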
Figure 6 illustrates 4-Kbyte page translation with a 4-level I/O page table, which translates
a GVA to a GPA. The addresses of the GCR3 table and the level-4 table are SPAs, and those
of PM4E, PDPE (Page Directory Pointer Entry), PDE (Page Directory Entry) and PTE (Page
Table Entry) are GPAs. These GPAs therefore need to be translated to SPAs before the page
table data can be fetched from physical memory. Nested address translation (or host
translation) meets this requirement. The Linux IOMMU subsystem constructs the I/O page table
and the GCR3 table so that the IOMMU hardware can handle DMA translation properly.
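To make the walk concrete, here is a minimal Python sketch of a 4-level, 4-Kbyte-page translation in which every address read from a table entry is a GPA that must pass through a host translation before the next table can be read. It assumes the common layout of four 9-bit indices plus a 12-bit page offset (48 bits in total) and 8-byte entries, and it reads the paragraph above as "each entry holds a GPA". The function names and the dictionary-based memory model are hypothetical.

```python
PAGE_SIZE = 4096
ENTRY_SIZE = 8  # assumed bytes per table entry


def split_4level_4k(addr):
    """Split a 48-bit address into four 9-bit table indices and a 12-bit offset."""
    if not 0 <= addr < 1 << 48:
        raise ValueError("expected a 48-bit address")
    offset = addr & 0xFFF
    indices = [(addr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset


def host_translate(host_table, gpa):
    """Toy GPA -> SPA translation using a flat dict of page numbers."""
    return host_table[gpa // PAGE_SIZE] * PAGE_SIZE + gpa % PAGE_SIZE


def walk_gva_to_gpa(memory, host_table, level4_table_spa, gva):
    """Walk a toy 4-level guest table and return the GPA for `gva`.

    `memory` maps the SPA of an entry to the GPA stored in that entry. The
    level-4 table address is already an SPA; every address read from an
    entry is a GPA and goes through host_translate() before the next read.
    """
    indices, offset = split_4level_4k(gva)
    table_spa = level4_table_spa
    for depth, index in enumerate(indices):
        next_gpa = memory[table_spa + index * ENTRY_SIZE]
        if depth == len(indices) - 1:
            return next_gpa + offset  # the last entry holds the data page's GPA
        table_spa = host_translate(host_table, next_gpa)
```

The GPA returned by the walk would itself go through the host translation once more to reach the data in physical memory, which is the GPA-to-SPA step.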
[Figure 6 labels: Linux kernel IOMMU subsystem data structures (pdev, devid, passthrough,
protection_domain, list, dev_list, domain, id, mode, pt_root, gcr3_tbl); GCR3 Table indexed by
PASID; Entry of GCR3 Table holding the level table address; four levels of Page Table with
PM4E, PDPE, PDE and PTE; GPA; SPA; Kbyte Physical Page; Data. Legend: Data Structure in Linux
Kernel IOMMU Subsystem; I/O Page Table; PASID: Process Address Space Identifier; GCR3: Guest
CR3 (Control Register); V: Valid bit]
Figure 6 GVA to GPA Translation: 4K-byte Page Translation with 4-level I/O page table
Figure 7 GPA to SPA translation: 4K-byte Page Translation with 3-level I/O page table
When initializing the IOMMU hardware, the IOMMU driver in the Linux IOMMU subsystem parses
the IVRS from the ACPI tables. If the IVRS does not exist in the system, the IOMMU driver
skips the initialization flow. Otherwise, the driver initializes the IOMMU hardware based on
the IVRS, which includes one or more I/O Virtualization Definition Blocks (IVDBs).
Two types of IVDBs are as follows:
I/O Virtualization Hardware Definition (IVHD): An IVHD describes the capabilities and
configuration of IOMMU hardware as well as system I/O topology associated with each
IOMMU hardware.
I/O Virtualization Memory Definition (IVMD): An IVMD describes the special memory
constraints for specific devices.
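The way a driver might step through the definition blocks can be sketched in Python over a raw byte buffer. The layout constants below are assumptions to verify against the AMD IOMMU specification: a 48-byte IVRS header, a common block prefix of one type byte, one flags byte and a little-endian 16-bit length, and the listed type codes. The function name `walk_ivdbs` is hypothetical.

```python
import struct

IVRS_HEADER_LENGTH = 48          # assumed: ACPI header plus IVRS-specific fields
IVHD_TYPES = {0x10, 0x11, 0x40}  # assumed IVHD type codes
IVMD_TYPES = {0x20, 0x21, 0x22}  # assumed IVMD type codes


def walk_ivdbs(ivrs):
    """Yield (kind, type, offset, length) for each definition block in `ivrs`."""
    offset = IVRS_HEADER_LENGTH
    while offset + 4 <= len(ivrs):
        # Assumed common prefix: u8 type, u8 flags, little-endian u16 length.
        block_type, _flags, length = struct.unpack_from("<BBH", ivrs, offset)
        if length < 4 or offset + length > len(ivrs):
            raise ValueError(f"bad block length {length} at offset {offset}")
        if block_type in IVHD_TYPES:
            kind = "IVHD"
        elif block_type in IVMD_TYPES:
            kind = "IVMD"
        else:
            kind = "unknown"
        yield kind, block_type, offset, length
        offset += length  # the length field covers the whole block
```

Because each block carries its own length, the walker can skip block types it does not understand and still find the next IVHD or IVMD.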
Figure 8 illustrates the AMD IOMMU hardware description known as IVHD. The figure shows two
IVHDs in the system, with the corresponding devices attached to each. The IVHD is described
in detail in the AMD IOMMU specification.
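[Figure 8 labels: two Processes issuing VAs to the MMU, which outputs a PA to Physical Memory;
two IOMMUs (each an IVHD) receiving IOVAs from Devices and outputting PAs to Physical Memory.
Legend: PA: Physical Address; VA: Virtual Address; IOVA: I/O Virtual Address; IVHD: I/O
Virtualization Hardware Definition]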
IOMMU pass-through mode is widely enabled in virtualization environments. This section also
lists the default IOMMU operation mode of Linux OSes and provides a kernel parameter to
change the IOMMU mode.
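As a hedged illustration of how the mode is selected at boot, the Python sketch below inspects a kernel command line for the documented `iommu.passthrough=` parameter and the x86 `iommu=pt` and `iommu=nopt` options. The function name is hypothetical, and letting the last matching token win is a simplification; when nothing is specified, the kernel build configuration decides, so the sketch reports "default".

```python
def requested_iommu_mode(cmdline):
    """Return 'pass-through', 'dma-translation' or 'default' for a command line."""
    mode = "default"  # nothing requested: the kernel build configuration decides
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "iommu.passthrough":
            if value == "1":
                mode = "pass-through"
            elif value == "0":
                mode = "dma-translation"
        elif key == "iommu":
            # The x86 option string is a comma-separated list.
            options = value.split(",")
            if "pt" in options:
                mode = "pass-through"
            elif "nopt" in options:
                mode = "dma-translation"
    return mode
```

On a running system the input would typically be the contents of `/proc/cmdline`; the kernel log remains the authoritative record of the mode actually in effect.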
[Figure labels: Unprivileged domain (DomU): two Guest OS (VM) boxes, each with a Guest driver;
IOMMU DMA Translation Mode: Device Driver A; IOMMU Pass-through Mode: Device Driver B
(passthrough); IOMMU HW: DMA Remapping, Interrupt Remapping; Hardware Platform: Physical
device A, Physical device B]
[Figure labels: Unprivileged domain (DomU): Guest OS (VM) with Physical driver A (passthrough),
Guest OS (VM) with Guest driver; Privileged domain (Dom0), Hypervisor (VMM) or hosting OS:
IOMMU Pass-through Mode, Device Driver A marked X (passthrough); IOMMU HW: DMA Remapping,
Interrupt Remapping; Hardware Platform: Physical device A]
The PCI pass-through model bypasses the hypervisor's intervention so that the guest OS takes
control of the physical device directly. IOMMU pass-through mode bypasses DMA translation in
the hypervisor: the hypervisor does not need to process DMA requests when IOMMU pass-through
mode is enabled in Linux. PCI pass-through and IOMMU pass-through work together to give the
guest OS direct control of the physical device.
What is the difference between IOMMU pass-through mode and disabling the IOMMU option in
BIOS setup?
Disabling the IOMMU option in BIOS setup means that the IOMMU hardware is not exposed to OS
software, because the IOMMU-related data structures are not included in the ACPI table.
OS software therefore cannot interact with the IOMMU hardware. In this case, the DMA
address equals the system physical address in the hypervisor (no DMA translation is
required).
IOMMU pass-through mode and a disabled IOMMU show the same symptom: the DMA address
equals the system physical address. The main difference is that the guest OS can access the
device directly with the aid of IOMMU pass-through mode, whereas it cannot when the IOMMU
option is disabled in BIOS setup.
Thanks to IOMMU pass-through mode and the PCI pass-through model, the guest OS can
directly access the physical device without any software changes. Naturally, the hosting OS
requires a component that mediates between the guest OS and the physical device.
The Virtual Function I/O (VFIO) framework, running on the hypervisor, gives user-space
applications direct device access. QEMU, a user-space application, uses the VFIO framework
to expose direct access to the physical device to the guest OS. Figure 11 illustrates how
the VFIO framework cooperates with the guest OS, the PCI driver and the IOMMU driver.
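VFIO hands devices to user space at the granularity of IOMMU groups, which Linux exposes under `/sys/kernel/iommu_groups`. The short Python sketch below lists each group and the devices in it; the function name is hypothetical, and the `root` parameter exists only so that the function can be pointed at a different directory for testing.

```python
import os


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_id: sorted list of device names} found under `root`."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # no IOMMU groups are exposed on this system
    for group in os.listdir(root):
        devices_dir = os.path.join(root, group, "devices")
        if os.path.isdir(devices_dir):
            groups[group] = sorted(os.listdir(devices_dir))
    return groups
```

Checking which other devices share a group with the device to be assigned is a useful first step, because all devices in one group are isolated from the rest of the system together.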
[Figure 11 labels: Physical driver A; Guest driver; KVM; vfio; PCI driver; iommu; IOMMU HW:
DMA Remapping, Interrupt Remapping; Hardware Platform: Physical device A]
Direct Device Access Use Case in Virtualization Environment
This section shows how to give a guest OS direct access to a device and takes a deep dive
into the resulting IOMMU status change in the hypervisor using the crash utility.
Hardware: Lenovo ThinkSystem™ SR665 with AMD EPYC 7002 “Rome” family of processors
5. Power on the guest OS and check the network device using the lspci command as shown
in Figure 15. The command output shows that the guest OS has direct access to a device
exported by the hypervisor.
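lspci identifies each device by its bus, device and function numbers (BDF). The Python sketch below parses such an address and packs it into the 16-bit PCI requester ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0). The function name `parse_bdf` is hypothetical, and the optional four-digit PCI domain prefix is assumed to follow the usual `dddd:bb:dd.f` form.

```python
import re

_BDF_RE = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_bdf(text):
    """Parse '[domain:]bus:device.function' into integers plus a 16-bit ID."""
    match = _BDF_RE.match(text.strip())
    if not match:
        raise ValueError(f"not a BDF address: {text!r}")
    domain = int(match.group(1) or "0", 16)
    bus = int(match.group(2), 16)
    device = int(match.group(3), 16)
    function = int(match.group(4), 16)
    if device > 0x1F:
        raise ValueError("device number must fit in 5 bits")
    # Pack bus (8 bits), device (5 bits) and function (3 bits) into one ID.
    requester_id = (bus << 8) | (device << 3) | function
    return domain, bus, device, function, requester_id
```

The packed value is a convenient key when matching a device seen in the guest against per-device state kept on the host.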
Figure 16 Status before Booting into a Guest OS with a Direct Device Access
Figure 17 shows the IOMMU information of the KVM hypervisor after booting into a guest OS
with a direct device access.
Figure 17 Status after Booting into a Guest OS with a Direct Device Access
Summary
IOMMU hardware has been widely adopted in virtualized environments to improve system
performance. This paper describes PCI device virtualization models, the IOMMU subsystem in
the Linux kernel, the Linux IOMMU DMA translation and pass-through modes, and how to use
a PCI device directly in a guest OS.
Acronyms
ACPI Advanced Configuration and Power Interface
AMD-Vi AMD I/O Virtualization Technology
BDF Bus Number/Device Number/Function Number
DMA Direct Memory Access
DMAR DMA Remapping Reporting
DTE Device Table Entry
GCR3 Guest Control Register 3
GPA Guest Physical Address
GVA Guest Virtual Address
IOMMU Input/Output Memory Management Unit
IVDB I/O Virtualization Definition Block
IVHD I/O Virtualization Hardware Definition
IVMD I/O Virtualization Memory Definition
IVRS I/O Virtualization Reporting Structure
MMU Memory Management Unit
SPA System Physical Address
VMM Virtual Machine Monitor
VT-d Intel Virtualization Technology for Directed I/O
References
See these web resources for more information:
Linux virtualization and PCI passthrough
https://developer.ibm.com/tutorials/l-pci-passthrough
AMD I/O Virtualization Technology (IOMMU) Specification
https://www.amd.com/system/files/TechDocs/48882_IOMMU_3.05_PUB.pdf
Intel Virtualization Technology for Directed I/O Architecture Specification
https://software.intel.com/content/www/us/en/develop/download/intel-virtualization-technology-for-directed-io-architecture-specification.html
Author
Adrian Huang is a Senior Linux Engineer in the Lenovo Infrastructure Solutions Group based
in Taipei, Taiwan. He has experience with the Linux kernel IOMMU subsystem, block device
layer and memory management. He also contributes patches to the Linux kernel community.
Special thanks to the following people for their contributions and suggestions:
Xiaochun, Lenovo Linux Engineer
Song Shang, Lenovo Linux Engineer
Gary Cudak, Lenovo OS Architect
David Watts, Lenovo Press
Notices
Lenovo may not offer the products, services, or features discussed in this document in all countries. Consult
your local Lenovo representative for information on the products and services currently available in your area.
Any reference to a Lenovo product, program, or service is not intended to state or imply that only that Lenovo
product, program, or service may be used. Any functionally equivalent product, program, or service that does
not infringe any Lenovo intellectual property right may be used instead. However, it is the user's responsibility
to evaluate and verify the operation of any other product, program, or service.
Lenovo may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
LENOVO PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some
jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. Lenovo may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
The products described in this document are not intended for use in implantation or other life support
applications where malfunction may result in injury or death to persons. The information contained in this
document does not affect or change Lenovo product specifications or warranties. Nothing in this document
shall operate as an express or implied license or indemnity under the intellectual property rights of Lenovo or
third parties. All information contained in this document was obtained in specific environments and is
presented as an illustration. The result obtained in other operating environments may vary.
Lenovo may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Any references in this publication to non-Lenovo Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this Lenovo product, and use of those Web sites is at your own risk.
Any performance data contained herein was determined in a controlled environment. Therefore, the result
obtained in other operating environments may vary significantly. Some measurements may have been made
on development-level systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been estimated through
extrapolation. Actual results may vary. Users of this document should verify the applicable data for their
specific environment.
Send us your comments via the Rate & Provide Feedback form found at
http://lenovopress.com/lp1467
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other
countries, or both. These and other Lenovo trademarked terms are marked on their first occurrence in this
information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned
by Lenovo at the time this information was published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of Lenovo trademarks is available from
https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo® Lenovo(logo)® ThinkSystem™
Intel, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the
United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.