In the last year or two, I've spent more time than I like frantically trying to fix Linux laptops where I've broken the boot process one way or another. The biggest obstacle, in each case, has been that I didn't really understand what was going on, so I was stuck doing web searches until I found something that solved my immediate problem. A lot has changed in the x86 and Linux boot process since I last spent much time looking at it, so I'm writing this down after spending some time investigating how it works now.
This is specifically for the installation that I have on my own laptop, which boots 64-bit Fedora 32 from an NVMe SSD via UEFI and Grub 2.
Boot is complicated, even more complicated than I'd realized. This discussion is nowhere near complete.
My SSD's NVMe device is /dev/nvme0
. NVMe devices are divided into
"namespaces". Mine has just one namespace /dev/nvme0n1
. That
appears to be the normal number of namespaces. This namespace is
where all the data is. The nvme list
command shows some basic
stats. You can see that it's a 1 TB drive, for example:
$ sudo nvme list Node SN Model Namespace Usage Format FW Rev ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- -------- /dev/nvme0n1 MI93T003610303E06 PC401 NVMe SK hynix 1TB 1 324.04 GB / 1.02 TB 512 B + 0 B 80007E00
(The nvme
utility isn't installed by default. On Fedora:
sudo dnf install nvme-cli
.)
The 324 GB number above might be the amount of storage in use from the drive's own perspective. I wasn't able to find any documentation explaining that number.
I used Fedora's default partitioning for an encrypted boot drive. This yielded three partitions using a GPT disklabel:
$ sudo fdisk -l /dev/nvme0n1 Disk /dev/nvme0n1: 953.89 GiB, 1024209543168 bytes, 2000409264 sectors Disk model: PC401 NVMe SK hynix 1TB Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: gpt Disk identifier: 225DB613-867D-4004-9C94-4822154BA204 Device Start End Sectors Size Type /dev/nvme0n1p1 2048 411647 409600 200M EFI System /dev/nvme0n1p2 411648 2508799 2097152 1G Linux filesystem /dev/nvme0n1p3 2508800 2000408575 1997899776 952.7G Linux filesystem
I hadn't known much about GPT before. It stands for "GUID partition table", is an advancement over the previous MBR scheme in multiple ways: it supports up to 128 partitions (instead of just 4 physical partitions), it uses a GUID to identify partition types (instead of a byte), it supports very large disks via 64-bit LBA addresses, it has a CRC32 checksum for safety, and it is stored at both the beginning and the end of the disk for redundancy. It preserves some compatibility with the traditional MBR by using an MBR that declares the whole disk to be a single partition of type 0xee.
The partitions above are:
/dev/nvme0n1p1
: This is the EFI system partition. It is formatted as a FAT file system. Ultimately, it ends up mounted at/boot/efi
. It is small (by today's standards!), only 200 MB. It contains the bootloader. Here's another way to see this:$ ls -l /dev/disk/by-partlabel/ total 0 lrwxrwxrwx. 1 root root 15 Jul 10 16:19 'EFI\x20System\x20Partition' -> ../../nvme0n1p1
/dev/nvme0n1p2
: This is an ext4 partition ultimately mounted at/boot
. It is a little bigger than the EFI system partition. It contains kernels and their configurations and modules./dev/nvme0n1p3
: This is a LUKS2 encrypted partition. It takes up the bulk of the disk space.(LUKS stands for Linux Unified Key Setup. It's the Linux standard for disk encryption.)
Even though GPT uses a whole 128-bit GUID to identify the type of each
partition, Linux only uses a single GUID for all types of data
partition. Thus, the second and third partitions above have the same
GUID for their types, rather than different GUIDs for ext4 versus
LUKS2 encryption. We can ask fdisk
to show us the "Type-UUID"s as
well as the partitions' UUIDs:
$ sudo fdisk -l -o +uuid,type-uuid /dev/nvme0n1 ... Device Start End Sectors Size Type UUID Type-UUID /dev/nvme0n1p1 2048 411647 409600 200M EFI 10F28736-7EB9-48A8-BB37-AC2FDBACAC3F C12A7328-F81F-11D2-BA4B-00A0C93EC93B /dev/nvme0n1p2 411648 2508799 2097152 1G Linu 1D24A3B1-3131-4365-A3A7-DADAE8DDF6FD 0FC63DAF-8483-4772-8E79-3D69D8477DE4 /dev/nvme0n1p3 2508800 2000408575 1997899776 952.7G Linu 7A2DC291-4F3C-4DE5-B1C9-DC6351863A89 0FC63DAF-8483-4772-8E79-3D69D8477DE4
(Linux doesn't use GPT partition UUIDs much. They do not appear in
for example, /dev/disk/by-uuid
; rather, the UUIDs embedded in the
filesystems' superblocks appear there. GPT partition UUIDs do appear
in /dev/disk/by-partuuid
.)
In UEFI, boot starts out from a boot manager embedded in the machine's
firmware. Boot managers use a standardized set of variables. The
most important of these for our purposes is BootOrder
. Its value
is a list of numbers, each of which identifies a Boot<XXXX>
entry,
where <XXXX> is the number from BootOrder
.
The efivar
utility manipulates EFI variables. It's not installed
by default. On Fedora, you can install it with sudo dnf install
efivar
. (The /sys/firmware/efi/efivars
directory also provides
an interface to EFI variables. These are binary files. Ignore the
first four bytes of each file to get the raw data.)
We can get a list of EFI variables with efivar -l
. The efivar
program makes us specify variable names along with an obnoxious UUID,
so this is useful for finding out those UUIDs. On my system:
$ sudo efivar -l|grep BootOrder 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder 45cf35f6-0d6e-4d04-856a-0370a5b16f53-DefaultBootOrder $ sudo efivar -p --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c Name: "BootOrder" Attributes: Non-Volatile Boot Service Access Runtime Service Access Value: 00000000 09 00 06 00 02 00 03 00 04 00 05 00 07 00 0a 00 |................|
The value is 16-bit little-endian numbers in hex, so this is 9, 6, 2, 3, 4, 5, 7, 10. We can then dump the first boot order entry:
$ sudo efivar -l|grep Boot0009 8be4df61-93ca-11d2-aa0d-00e098032b8c-Boot0009 $ sudo efivar -p --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-Boot0009 GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c Name: "Boot0009" Attributes: Non-Volatile Boot Service Access Runtime Service Access Value: 00000000 01 00 00 00 62 00 46 00 65 00 64 00 6f 00 72 00 |....b.F.e.d.o.r.| 00000010 61 00 00 00 04 01 2a 00 01 00 00 00 00 08 00 00 |a.....*.........| 00000020 00 00 00 00 00 40 06 00 00 00 00 00 36 87 f2 10 |[email protected]...| 00000030 b9 7e a8 48 bb 37 ac 2f db ac ac 3f 02 02 04 04 |.~.H.7./...?....| 00000040 34 00 5c 00 45 00 46 00 49 00 5c 00 66 00 65 00 |4.\.E.F.I.\.f.e.| 00000050 64 00 6f 00 72 00 61 00 5c 00 73 00 68 00 69 00 |d.o.r.a.\.s.h.i.| 00000060 6d 00 78 00 36 00 34 00 2e 00 65 00 66 00 69 00 |m.x.6.4...e.f.i.| 00000070 00 00 7f ff 04 00 |...... |
The filename embedded in this is in UCS-2LE. In ASCII, it says
\EFI\fedora\shimx64.efi
. This refers to a file in the EFI system
partition.
The efibootmgr
program puts this together in a nicer way:
$ efibootmgr BootCurrent: 0009 Timeout: 0 seconds BootOrder: 0009,0006,0002,0003,0004,0005,0007,000A Boot0000* Windows Boot Manager Boot0001* debian Boot0002* Diskette Drive Boot0003* USB Storage Device Boot0004* CD/DVD/CD-RW Drive Boot0005* Onboard NIC Boot0006* ubuntu Boot0007* Linux Firmware Updater Boot0009* Fedora Boot000A* USB NIC
Since, as mentioned before, this is mounted at /boot/efi
, we can
look at this file and the rest of the directory:
$ sudo ls /boot/efi/EFI/fedora -l total 15604 -rwx------. 1 root root 112 Oct 2 2018 BOOTIA32.CSV -rwx------. 1 root root 110 Oct 2 2018 BOOTX64.CSV drwx------. 2 root root 4096 Jun 5 18:01 fonts drwx------. 2 root root 4096 Jun 5 18:07 fw -rwx------. 1 root root 62832 May 22 04:32 fwupdx64.efi -rwx------. 1 root root 1587528 May 26 09:23 gcdia32.efi -rwx------. 1 root root 2513224 May 26 09:23 gcdx64.efi -rwx------. 1 root root 5471 Sep 8 2019 grub.cfg -rwx------. 1 root root 1024 Jul 10 16:22 grubenv -rwx------. 1 root root 1587528 May 26 09:23 grubia32.efi -rwx------. 1 root root 2513224 May 26 09:23 grubx64.efi -rwx------. 1 root root 927824 Oct 2 2018 mmia32.efi -rwx------. 1 root root 1159560 Oct 2 2018 mmx64.efi -rwx------. 1 root root 1210776 Oct 2 2018 shim.efi -rwx------. 1 root root 975536 Oct 2 2018 shimia32.efi -rwx------. 1 root root 969264 Oct 2 2018 shimia32-fedora.efi -rwx------. 1 root root 1210776 Oct 2 2018 shimx64.efi -rwx------. 1 root root 1204496 Oct 2 2018 shimx64-fedora.efi
shimx64.efi
is an executable program that sits between the UEFI
boot manager and Grub 2. True to its name, it acts as a "shim" for
Secure Boot. We can see that it's an executable:
$ sudo file /boot/efi/EFI/fedora/shimx64.efi /boot/efi/EFI/fedora/shimx64.efi: PE32+ executable (EFI application) x86-64 (stripped to external PDB), for MS Windows
It's not really a Windows program, it's just in the Windows PE
"Portable Executable" format. Intel designed EFI and it's heavily
Windows-flavored. If you run strings
on the shim binary, you see
a lot of OpenSSL references. That is presumably the "Secure" part.
The shim is not part of the Grub bootloader. It is a separate program
shim <https://github.com/rhboot/shim/> that runs before Grub and
loads Grub. The first paragraph from its own README.md
is a good
description:
shim is a trivial EFI application that, when run, attempts to open and execute another application. It will initially attempt to do this via the standard EFI LoadImage() and StartImage() calls. If these fail (because Secure Boot is enabled and the binary is not signed with an appropriate key, for instance) it will then validate the binary against a built-in certificate. If this succeeds and if the binary or signing key are not blacklisted then shim will relocate and execute the binary.
The shim is signed by Microsoft (!) through an onerous process <https://blog.hansenpartnership.com/adventures-in-microsoft-uefi-signing/>.
The shimx64.efi
program runs grubx64.efi
, which is the actual
Grub 2 bootloader. In turn, grubx64.efi
reads and executes
grub.cfg
, which is a script written in a Grub-specific language
that resembles Bourne shell with some special features for booting.
grub.cfg
has lots of insmod
commands. These do not load Linux
kernel modules, they load Grub modules. Grub has a lot of modules:
I see 295 of them in /usr/lib/grub/i386-pc
. The modules are not
consistently documented. A lot of modules exist to implement a
particular command; for example, hexdump.mod
implements the
hexdump
command.
grub.cfg
loads environment variables from grubenv
, which is
mostly a text file but has a special format. Use grub-editenv
to
edit grubenv
if you really need to. See The GRUB environment
block
<https://www.gnu.org/software/grub/manual/grub/html_node/Environment-block.html>,
for more information.
In earlier versions of Grub, grub.cfg
usually contained a list of
menuentry
commands that said how to boot various kernels you might
have. This isn't the case for my Fedora installation. Instead,
grub.cfg
has a single blscfg
command:
# The blscfg command parses the BootLoaderSpec files stored in /boot/loader/entries and # populates the boot menu. Please refer to the Boot Loader Specification documentation # for the files format: https://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/. ⋮ insmod blscfg blscfg
In turn, I have several .conf
files in /boot/loader/entries
:
$ sudo ls /boot/loader/entries/ b4e66474bf8f4ab997720a7bd0d83628-0-rescue.conf b4e66474bf8f4ab997720a7bd0d83628-5.3.12-300.fc31.x86_64.conf b4e66474bf8f4ab997720a7bd0d83628-5.4.13-201.fc31.x86_64.conf b4e66474bf8f4ab997720a7bd0d83628-5.5.17-200.fc31.x86_64.conf b4e66474bf8f4ab997720a7bd0d83628-5.6.15-300.fc32.x86_64.conf
Each of the .conf
files describes one menu entry. My default
entry is:
$ sudo cat /boot/loader/entries/b4e66474bf8f4ab997720a7bd0d83628-5.6.15-300.fc32.x86_64.conf title Fedora (5.6.15-300.fc32.x86_64) 32 (Thirty Two) version 5.6.15-300.fc32.x86_64 linux /vmlinuz-5.6.15-300.fc32.x86_64 initrd /initramfs-5.6.15-300.fc32.x86_64.img options $kernelopts grub_users $grub_users grub_arg --unrestricted grub_class kernel
In my installation, the $kernelopts
variable above comes from
/boot/efi/EFI/fedora/grubenv
:
kernelopts=root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet systemd.unified_cgroup_hierarchy=0
(I added the systemd.unified_cgroup_hierarchy=0
variable assigment
at the end to disable cgroups v2. This is needed to run Docker.)
One may compare these kernel options to the ones in the running
kernel. The only difference in the BOOT_IMAGE
at the beginning
(Grub adds this itself):
$ cat /proc/cmdline BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.6.15-300.fc32.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet systemd.unified_cgroup_hierarchy=0
Finally, I noticed that grub.cfg
itself has a similar assignment
to $default_kernelopts
. I don't know whether or how this is
connected to $kernelopts
that all of the .conf
files use:
set default_kernelopts="root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet "
Grub needs to find the root
file system, the one that contains the
Linux kernel and initrd files, which in my case is the ext4 partition
/dev/nvme0n1p2
. grub.cfg
tells it to find this partition by
searching for it by UUID:
search --no-floppy --fs-uuid --set=root 71baeefe-e3d2-45e1-a0ff-bf651d4a7591
grub.cfg
also tells Grub to find the boot
partition, i.e. the
EFI system partition in FAT format on /dev/nvme0n1p1
, by its
UUID:
search --no-floppy --fs-uuid --set=boot 2AB5-7C0C
It's not clear to me what Grub uses the latter information for, if
anything, since at the time this search
command runs Grub has
already read all the files it really needs.
We've covered everything necessary for Grub to allow the user to select a kernel. Suppose the user selects the Fedora 32 entry detailed above. This will cause the associated commands to run. The important ones are:
linux /vmlinuz-5.6.15-300.fc32.x86_64 initrd /initramfs-5.6.15-300.fc32.x86_64.img options $kernelopts
The linux
command points to the kernel image to load. The path is
relative to the Grub root established above, which on a booted system
is mounted at /boot
:
$ ls -l /boot/vmlinuz-5.6.15-300.fc32.x86_64 -rwxr-xr-x. 1 root root 10795112 May 29 07:42 /boot/vmlinuz-5.6.15-300.fc32.x86_64 $ file /boot/vmlinuz-5.6.15-300.fc32.x86_64 /boot/vmlinuz-5.6.15-300.fc32.x86_64: Linux kernel x86 boot executable bzImage, version 5.6.15-300.fc32.x86_64 ([email protected]) #1 SMP Fri May 29 14:23:59 , RO-rootFS, swap_dev 0xA, Normal VGA
The initrd
command points to an additional file that Grub loads
into memory and makes available to the kernel.
Now Grub turns over control to the kernel. The kernel does basic system setup. We will ignore that, as detailed and exciting as it is.
After the kernel does basic system setup, it finds userspace to run.
To do that, it needs a file system. Initially, nothing is mounted,
except for an empty root directory (a ramfs
file system).
The initial file system can come from a few places. The following sections explore the possibilities.
The kernel can get its initial file system from a few places. There
is always a cpio
archive built into the kernel itself. (cpio
has the same purpose as tar
, but its command-line interface is
very different.) We should look at it. It is kind of hard to get it
out, and some of the recipes I found online did not work. Eventually,
I got the uncompressed kernel image using extract-vmlinux
, then
the start and end address of the archive from System.map
, then the
offset in the file from objdump
, then the contents via dd
:
$ /usr/src/kernels/5.6.15-300.fc32.x86_64/scripts/extract-vmlinux /boot/vmlinuz-5.6.15-300.fc32.x86_64 > vmlinux $ sudo grep initramfs /boot/System.map-5.6.15-300.fc32.x86_64 ffffffff831d2228 D __initramfs_start ffffffff831d2428 D __initramfs_size $ sudo objdump --file-offsets -s --start-address=0xffffffff831d2228 --stop-address=0xffffffff831d2428 vmlinux | head -4 vmlinux: file format elf64-x86-64 Contents of section .init.data: (Starting at file offset: 0x25d2228) $ sudo dd if=vmlinux bs=1 skip=$((0x25d2228)) count=$((0xffffffff831d2428 - 0xffffffff831d2228)) status=none | cpio -vt drwxr-xr-x 2 root root 0 May 29 07:24 dev crw------- 1 root root 5, 1 May 29 07:24 dev/console drwx------ 2 root root 0 May 29 07:24 root 1 block
As you can see above, there's barely anything in the kernel's built-in
cpio
archive. This is clearly not how this system is booting.
Consider the initrd
command that was given to Grub. Its name
stands for "initial RAM disk". This is what this kernel is actually
using to boot.
Earlier versions of Linux (though not the very earliest!) pointed
initrd
to a small disk image that the kernel would mount and use
as the first stage of userland. These days, initrd
points to a
cpio
archive:
$ ls -l /boot/initramfs-5.6.15-300.fc32.x86_64.img -rw-------. 1 root root 35496138 Jun 5 18:05 /boot/initramfs-5.6.15-300.fc32.x86_64.img $ file /boot/initramfs-5.6.15-300.fc32.x86_64.img /boot/initramfs-5.6.15-300.fc32.x86_64.img: regular file, no read permission $ sudo file /boot/initramfs-5.6.15-300.fc32.x86_64.img /boot/initramfs-5.6.15-300.fc32.x86_64.img: ASCII cpio archive (SVR4 with no CRC)
We can look inside the archive:
$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | cpio -tv drwxr-xr-x 3 root root 0 May 29 11:35 . -rw-r--r-- 1 root root 2 May 29 11:35 early_cpio drwxr-xr-x 3 root root 0 May 29 11:35 kernel drwxr-xr-x 3 root root 0 May 29 11:35 kernel/x86 drwxr-xr-x 2 root root 0 May 29 11:35 kernel/x86/microcode -rw-r--r-- 1 root root 99328 May 29 11:35 kernel/x86/microcode/GenuineIntel.bin 196 blocks
This is also clearly not anything that can be booted. This archive is
just for CPU microcode. It's not clear to me exactly how and when the
kernel finds and sets up this microcode. Nothing related to it shows
up in dmesg
.
The above is less than 100 kB in size (cpio
uses 512-byte blocks).
The previous ls -l
shows that the initrd is tens of megabytes in
size, so there must be more following the cpio
archive. Indeed:
$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; file -) /dev/stdin: gzip compressed data, max compression, from Unix $ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | file -) /dev/stdin: ASCII cpio archive (SVR4 with no CRC) $ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | cpio -t) drwxr-xr-x 12 root root 0 May 29 11:35 . lrwxrwxrwx 1 root root 7 May 29 11:35 bin -> usr/bin drwxr-xr-x 2 root root 0 May 29 11:35 dev crw-r--r-- 1 root root 5, 1 May 29 11:35 dev/console crw-r--r-- 1 root root 1, 11 May 29 11:35 dev/kmsg crw-r--r-- 1 root root 1, 3 May 29 11:35 dev/null crw-r--r-- 1 root root 1, 8 May 29 11:35 dev/random crw-r--r-- 1 root root 1, 9 May 29 11:35 dev/urandom drwxr-xr-x 11 root root 0 May 29 11:35 etc -rw-r--r-- 1 root root 92 May 29 11:35 etc/block_uuid.map …2468 files and directories omitted… -rw-r--r-- 1 root root 1730 Jan 29 08:57 usr/share/terminfo/l/linux drwxr-xr-x 2 root root 0 May 29 11:35 usr/share/terminfo/v -rw-r--r-- 1 root root 1190 Jan 29 08:57 usr/share/terminfo/v/vt100 -rw-r--r-- 1 root root 1184 Jan 29 08:57 usr/share/terminfo/v/vt102 -rw-r--r-- 1 root root 1377 Jan 29 08:57 usr/share/terminfo/v/vt220 lrwxrwxrwx 1 root root 20 May 29 11:35 usr/share/unimaps -> /usr/lib/kbd/unimaps drwxr-xr-x 3 root root 0 May 29 11:35 var lrwxrwxrwx 1 root root 11 May 29 11:35 var/lock -> ../run/lock lrwxrwxrwx 1 root root 6 May 29 11:35 var/run -> ../run drwxr-xr-x 2 root root 0 May 29 11:35 var/tmp
Now we're getting somewhere! We can extract the archive into a temporary directory for further examination:
$ mkdir initrd $ cd initrd/ $ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | cpio -i) cpio: dev/console: Cannot mknod: Operation not permitted cpio: dev/kmsg: Cannot mknod: Operation not permitted cpio: dev/null: Cannot mknod: Operation not permitted cpio: dev/random: Cannot mknod: Operation not permitted cpio: dev/urandom: Cannot mknod: Operation not permitted 156121 blocks $ ls bin etc lib proc run shutdown sysroot usr dev init lib64 root sbin sys tmp var
(One may also sudo chroot
into the initrd, which has a shell and
basic libraries and command-line tools. It is not a comfortable
environment in other ways, though; for example, Backspace and Ctrl+U
didn't behave in a sane way for me.)
After the kernel extracts this archive into RAM, it starts /init
,
which is actually systemd
:
$ ls -l init lrwxrwxrwx. 1 bpfaff bpfaff 23 Jul 11 11:37 init -> usr/lib/systemd/systemd $ file usr/lib/systemd/systemd usr/lib/systemd/systemd: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=a427da4e4bd4db8b6aac43245f8da742e1994076, for GNU/Linux 3.2.0, stripped
After I discovered this and started looking through the systemd documentation for information on startup, I came across bootup(7). It is full of great information (although it further references boot(7), which doesn't exist), including a diagram. There is particularly a section titled "Bootup in the Initial RAM Disk (initrd)" which is relevant now. dracut.bootup(7) also has a wonderful diagram illustrating the bootup sequence.
bootup(7) says that the default target is initrd.target
if
/etc/initrd-release
exists, which it does in my case. (Kernel
command-line setting rd.systemd.unit=…
could override the default,
but it is not set.) Even if /etc/initrd-release
did not exist, my
initrd symlinks from /etc/systemd/system/default.target
to
/usr/lib/systemd/system/initrd.target
. The latter file contains,
in part:
ConditionPathExists=/etc/initrd-release Requires=basic.target Wants=initrd-root-fs.target initrd-root-device.target initrd-fs.target initrd-parse-etc.service After=initrd-root-fs.target initrd-root-device.target initrd-fs.target basic.target rescue.service rescue.target
In my initrd, /etc/initrd-release
is a symlink to
/usr/lib/initrd-release
. By either name, it contains a series of
key-value pairs in a format suitable for the Bourne shell. Grepping
around the initrd tree shows that the /usr/bin/dracut-cmdline
shell script sources these pairs (via /usr/lib/initrd-release
).
This will come up later, in Dracut Command Line Parsing.
It would be possible to figure out what's going on at boot by manually
looking through all of the systemd unit files in
/etc/systemd/system
and /usr/lib/systemd/system
. This would
be a slow process because there are a lot of them and it's not easy to
follow all the dependency chains. I didn't want to do that.
Luckily, systemd comes with the systemd-analyze
program to
help out. It has a number of subcommands. The best one for this
purpose seems to be the plot
subcommand. I invoked it like this:
systemd-analyze plot > bootup.svg
The output was bootup.svg. You might want to load this in another tab or another window so you can flip back and forth, because the rest of boot follows along with it. (This diagram is very tall and wide! You will probably have to pan to the right or shrink it down with Ctrl+− to see anything meaningful.)
Multiple programs on my system can view SVG files but I ended up using Firefox the most because it allows cut-and-paste of text whereas the other viewers I tried do not.
The SVG output is a 2-d plot where the x axis is time and the y axis is a series of rows, each of which is a bar that represents a timespan. There are a few gray bars at the top that represent a few important stages of kernel loading: "firmware", "loader", "kernel", "initrd". Time is in seconds, with zero at the transition from "loader" to "kernel". The rest of the rows are red and represent units, spanning from the unit's start time to its exit time. Some units finish quickly, so they only have short bars; some run as long as the system is up, so their bars run past the right end of the x axis.
On my system, "kernel" takes about 2 seconds to initialize and then "initrd" (and thus systemd) takes over. About 300 ms later, systemd starts launching units. The first few units it launches are built-in and I think they happen automatically without considering any dependency chains. All of them appear to remain active as long as the system itself does. These are documented in systemd.special(7):
-.mount
: This represents the root mount point.systemd has a naming convention for mounts (and slices, below), that uses
-
as the root (instead of/
) and as the delimiter between levels, so that a mount/home/foo
would be calledhome-foo.mount
. Details in systemd.unit(5) under "String Escaping for Inclusion in Unit Names".-.slice
: Root of the "slice hierarchy" documented in systemd.slice(7).A slice is a cgroup that systemd manages. Slices may contain slices (recursively), services, and scopes. Slices don't directly contain processes (but services and scopes nested within them do). More information in systemd.slice(5).
Slices have a hierarchy expressed the same way as for mounts (although slices don't correspond to files), so that
foo-bar.slice
is a child underfoo.slice
.systemd.slice(7) says that
-.slice
doesn't usually contain units, that instead it is used to set defaults for the slice tree.init.scope
: Scope unit that contains PID 1.Scopes contain processes that are started by arbitrary processes (as opposed to processes within services, which systemd itself starts).
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ has useful documentation on slices and scopes and especially at https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/#systemdsresourcecontrolconcepts
I discovered the
systemd-cgls
command and used it to list the contents ofinit.scope
. Indeed, it only containedsystemd
itself:$ systemd-cgls -u init.scope Unit init.scope (/init.scope): └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 30
system.slice
: Default slice for service and scope units.When I ran
systemd-cgls -u system.slice
, I saw 44 services running under it ranging fromabrtd.service
towpa_supplicant.service
, with no sub-slices or sub-scopes.
The next row in the systemd-analyze plot
output is for
system-systemd\x2dcryptsetup.slice
. The name illustrates
systemd's escaping convention: \x2d
is hexified ASCII for -
.
This gets used a lot and it makes the names hard to read, so for the
rest of this document I'm going to replace \x2d
by -
without
mentioning it further.
systemd-cryptsetup(8) says that systemd uses
systemd-cryptsetup-generator
to translate /etc/crypttab
into
units. Reading systemd-cryptsetup-generator(8), it mentions that it
also uses some kernel command line options with luks
in their
names, including the one that appears on my kernel command line:
rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88
The /etc/crypttab
in my initrd says:
luks-3acca984-01a3-4c67-b9c6-94433f667b88 /dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88 none discard
Putting the device in both places appears to be "belt-and-suspenders"
redundancy, because rd.luks.uuid
is documented to "activate the
specified device as part of the boot process as if it was listed in
/etc/crypttab
".
Documentation for systemd-cryptsetup-generator
pointed to
systemd.generator(7). This revealed that systemd has a general
concept of "generators", which are programs that run early in boot to
translate file formats foreign to systemd into units. The manpage
gave an example for debugging a generator, so I ran:
$ dir=$(mktemp -d) $ sudo SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/system-generators/systemd-cryptsetup-generator "$dir" "$dir" "$dir"
This generated a tree of directories and symlinks and some unit files.
The most interesting one was
systemd-cryptsetup@luks-3acca984-01a3-4c67-b9c6-94433f667b88.service
,
which contained, among other things:
ExecStart=/usr/lib/systemd/systemd-cryptsetup attach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88' '/dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88' 'none' 'discard' ExecStop=/usr/lib/systemd/systemd-cryptsetup detach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88'
Looking around, the same file was in my running system in
/run/systemd/generator
, so that's presumably the place that
systemd keeps it at runtime.
I noticed that systemd-cgls
didn't list this slice, even though
the plot output showed it as living "forever". I found the --all
option to systemd-cgls
, for listing empty groups. With that, I
found the slice. Presumably, it could have new service units at
runtime if I connected a new encrypted disk, or as part of system
shutdown.
This service doesn't immediately prompt for the password. Bootup (from the initrd) continues.
The next row lists system-systemd-hibernate-resume.slice
.
This is for resume from hibernation. I don't use this feature, so I
didn't look any deeper.
The next three lines in the plot are all for journal sockets:
systemd-journald-audit.socket systemd-journald-dev-log.socket systemd-journald.socket
Each of these has a unit file in the initrd,
e.g. /usr/lib/systemd/system/systemd-journald.socket
. Each of
these unit files causes systemd to listen on a socket listed in the
file (e.g. as ListenStream=/run/systemd/journal/stdout
) and then
pass any accepted connections to systemd-journald
(because all of
them say Service=systemd-journald.service
). See systemd.socket(5)
for details.
The "journal" is what systemd calls the log. I don't know why it uses that name.
The next row is for systemd-vconsole-setup.service
.
systemd-vconsole-setup(8) documents that this reads
/etc/vconsole.conf
. In my initrd that file contains:
KEYMAP="us" FONT="eurlatgr"
The documentation says that kernel command-line options can override this file. Mine don't.
I wish systemd were smarter about console fonts (or maybe it's Dracut or Fedora that's responsible). The one that it chooses on my system is almost unreadable small (on a 4k screen).
Dracut (see dracut(8)) is a program written in Bash that generates the initrd that I'm using. The next few rows in the startup plot relate to Dracut.
The first one is dracut-cmdline.service
. This service is
underdocumented: dracut-cmdline.service(8) just says that it runs
hooks to parse the kernel command line. The service turns out to just
be a shell script /usr/bin/dracut-cmdline
. When run, it sources
/dracut-state.sh
(yes, in the root!), if it exists, then
/usr/lib/initrd-release
(as we saw in Initrd cpio Archive
earlier). Each of these is a set of shell variable assignments.
Then the shell script looks at various parameters that might be present in the kernel command line. The kernel command options I have set that Dracut or its hooks cares about are:
root=/dev/mapper/fedora-root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap
Dracut itself looks at the root
command-line parameter. It also
invokes a string of hook scripts in /usr/lib/dracut/hooks/cmdline
.
These turn out to be important for my case:
30-parse-crypt.sh
looks atrd.luks.uuid
and creates a rule in/etc/udev/rules.d/70-luks.rules
. (I think so; if so, the rule doesn't make it into the booted system.)30-parse-lvm.sh
looks at therd.lvm.lv
variables above.
These hooks create shell scripts in
/lib/dracut/hooks/initqueue/finished
. Each of these exits
successfully only if the appropriate device file exists (i.e. in
/dev/disk
or /dev/fedora
). This enables the Dracut Initqueue
Service, later, to know that initialization is complete.
Finally, it dumps the generated variables back into
/dracut-state.sh
. Other Dracut scripts that run later both source
this and update it.
(I wasn't able to get a sample of /dracut-state.sh
from my own
system. It disappears after boot is complete, as part of the
pivot_root operation, I believe.)
This script is, I believe, where the Dracut kernel command-line options, which are listed in dracut.cmdline(7), get parsed, although most of them actually take effect in a later stage.
The new row in the startup plot shows dracut-pre-udev.service
,
which runs /usr/bin/dracut-pre-udev
, which is a Bash script. I
don't see anything significant that this does on my system.
The next three rows appear to be related. The third row is
sys-kernel-configfs.mount
, which makes sure that configfs
is
mounted at /sys/kernel/config
. This unit depends on
systemd-modules-load.service
; the second row reports starting
this.
The first row is for sys-module-fuse.device
, that is, loading the
fuse
module into the kernel as if with modprobe fuse
. I don't
know why this gets loaded: it is listed in
/usr/lib/modules-load.d/open-vm-tools.conf
, but the order is wrong
for that, and if the systemd-modules-load
were loading it then it
would presumably also load the modules listed in
/usr/lib/modules-load.d/VirtualBox.conf
, which it doesn't.
The next row in the plot is for dracut-initqueue.service
. Until
now, everything that has been started has either started and run
forever, or run to completion deterministically. This new service is
different: it runs in a loop until it detects that initialization is
done. On my system, this takes a long time overall (over 8 seconds in
the run I'm looking at) because it requires prompting me for a
password.
The loop does this:
- Checks whether startup is finished, by running all of the scripts
in
/lib/dracut/hooks/initqueue/finished/*.sh
. If all of them exit successfully, startup is finished. - Does various things to help startup along. It's not clear to me that any of these trigger on my system.
- Eventually times out and offers an emergency shell to the user.
The plymouth-start.service
unit, on the next row, covers up the
kernel initialization message with a graphical splash screen. The
plymouthd
program only does that if the kernel command line
contains rhgb
, which stands for "Red Hat Graphical Boot".
(Plymouth is the successor to an older program named rhgb
.)
The next row is for systemd-ask-password-plymouth.path
. This is a
new kind of unit to me, a "path" unit. systemd.path(7) explains that
each of these units monitors a file or directory and triggers when
something happens. In this case the path unit says:
[Path] DirectoryNotEmpty=/run/systemd/ask-password MakeDirectory=yes
This means that when /run/systemd/ask-password
is nonempty,
systemd triggers the corresponding service unit. There is no Unit
setting, so by default the service unit is
systemd-ask-password-plymouth.service
. There's nothing in the
directory being watched yet, so systemd defers starting the service.
The next 64 (!) rows set up 32 (!) different ttyS<number>
serial
devices, in random order. I don't know why, since the laptop has no
traditional serial ports at all. This doesn't take any real amount of
time, just an annoying number of useless entries in /dev
.
Then the next several rows set up all the devices for the SSD. There
are four of them for the device itself (and its "namespace" n1
):
dev-disk-by-path-pci-0000:3d:00.0-nvme-1.device dev-disk-by-id-nvme-eui.ace42e0090114c8b.device dev-disk-by-id-nvme-PC401_NVMe_SK_hynix_1TB_MI93T003610303E06.device dev-nvme0n1.device sys-devices-pci0000:00-0000:00:1d.0-0000:3d:00.0-nvme-nvme0-nvme0n1.device
Then seven for the GRUB partition:
dev-disk-by-id-nvme-PC401_NVMe_SK_hynix_1TB_MI93T003610303E06-part2.device dev-disk-by-id-nvme-eui.ace42e0090114c8b-part2.device dev-disk-by-partuuid-1d24a3b1-3131-4365-a3a7-dadae8ddf6fd.device dev-disk-by-path-pci-0000:3d:00.0-nvme-1-part2.device dev-disk-by-uuid-71baeefe-e3d2-45e1-a0ff-bf651d4a7591.device dev-nvme0n1p2.device sys-devices-pci0000:00-0000:00:1d.0-0000:3d:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device
and the same pattern carries on for the other two partitions. Partition 1 is last; maybe the order is random.
Amid the disk device setup, just after the setup for the encrypted
nvme0n1p2 device, a row lists
systemd-cryptsetup@luks-3acca984-01a3-4c67-b9c6-94433f667b88.service
.
This is the service unit created by the generator described under
Encrypted Disk Setup earlier. As shown before, this service runs
systemd-cryptsetup
to attach the encrypted volume:
ExecStart=/usr/lib/systemd/systemd-cryptsetup attach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88' '/dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88' 'none' 'discard'
The systemd-cryptsetup
helper is underdocumented, but running it
without arguments gives a little bit of help:
$ /usr/lib/systemd/systemd-cryptsetup systemd-cryptsetup attach VOLUME SOURCEDEVICE [PASSWORD] [OPTIONS] systemd-cryptsetup detach VOLUME Attaches or detaches an encrypted block device. See the [email protected](8) man page for details.
Peeking into the helper's source code, it calls into
ask_password_agent() in systemd's src/shared/ask-password-api.c
.
In turn, this creates a file
/run/systemd/ask-password/ask.<random>
with some information about
the password to be prompted for.
Now, the path unit described in Plymouth Service triggers its
corresponding service unit systemd-ask-password-plymouth.service
,
which is the next row in the plot.
In turn, the service unit gets systemd to ask for the password:
[Service] ExecStart=/usr/bin/systemd-tty-ask-password-agent --watch --plymouth
About 7 seconds later, I've finished typing the password, which allows
the systemd-cryptsetup@…
unit to finish up. The
systemd-ask-password-plymouth.service
unit exits about a second
later, I believe because it is killed by initrd-cleanup.service
(see Switching Roots).
The next row in the plot is for
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0-eDP-1-intel_backlight.device
.
I guess this is just random ordering.
The successful unlocking of the encrypted block device made everything
inside visible to the device manager. The first set of rows shows the
top-level device, which becomes /dev/dm-0
. Note that the system
considers this device's UUID so fantastic that one of the device nodes
contains it twice:
dev-disk-by-id-dm-name-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device dev-disk-by-id-lvm-pv-uuid-yFbXNn-DjBz-J9c1-c7Pj-QIKZ-JoiH-k5wQWR.device dev-mapper-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device dev-disk-by-id-dm-uuid-CRYPT-LUKS2-3acca98401a34c67b9c694433f667b88-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device dev-dm-0.device sys-devices-virtual-block-dm-0.device
The next series of rows makes the root device within the encrypted
block device visible as /dev/dm-1
aka /dev/fedora/root
:
dev-mapper-fedora-root.device dev-disk-by-uuid-45501fb7-be9f-4f11-97ba-b2c2046fd746.device dev-disk-by-id-dm-uuid-LVM-mMVYE2s7gIvVmpHUl0cvxbfn8i8NODO8bRKuGAOpNVTVjPrca5QTzb5F3eXjOglK.device dev-fedora-root.device dev-disk-by-id-dm-name-fedora-root.device dev-dm-1.device sys-devices-virtual-block-dm-1.device
systemd notes in the next row that initrd-root-device.target
is
now satisfied, because the root file system's device is now available.
It isn't mounted yet.
The next series of rows makes the swap device visible as /dev/dm-2
aka /dev/fedora/swap
. It isn't being used for swap yet:
dev-disk-by-id-dm-name-fedora-swap.device dev-mapper-fedora-swap.device dev-disk-by-id-dm-uuid-LVM-mMVYE2s7gIvVmpHUl0cvxbfn8i8NODO8qImL9xOU61DWKST2SpEEw3aF0kARK3oK.device dev-disk-by-uuid-71cdb103-4791-4d43-aa4c-49743c36be6e.device dev-fedora-swap.device dev-dm-2.device sys-devices-virtual-block-dm-2.device
The next row is systemd-fsck-root.service
, which is
underdocumented. It looks like a special systemd service, but it
isn't documented in systemd.special(7). I guess it fscks
/dev/fedora/root
.
Next, sysroot.mount
mounts /dev/fedora/root
on /sysroot
.
Later, this will become the root of the file system. The
sysroot.mount
unit was previously generated automatically by
systemd-fstab-generator
. This generation is hardcoded into the C
code of systemd-fstab-generator
.
The next row is for initrd-parse-etc.service
. This invokes
initrd-fs.target
, the next row, which is followed by
initrd.target
. None of these seem to do anything significant on
my machine.
Now it's time for /sysroot
to become /
and everything not
under /sysroot
to go away.
The next row is dracut-pre-pivot.service
. This runs Dracut's
"pre-pivot" and "cleanup" hooks. These don't seem to do anything
important on my system. (I've ignored the scripts that Dracut has
run, across its various hooks, related to keeping network interface
names persistent across boots. I'm continuing to ignore them here.
Maybe I'll look into them in later revisions of this document.)
The next row is initrd-cleanup.service
, which is defined as:
[Service] Type=oneshot ExecStart=systemctl --no-block isolate initrd-switch-root.target
The isolate
command was new to me. systemctl(1) says that it
starts the argument unit and stops everything else, with an exception
for units marked as IgnoreOnIsolate=yes
. For me, the latter units
(based on grep
) are systemd-journald-dev-log.socket
and
systemd-journald.socket
. I see that
systemd-journald-audit.socket
also continues, so perhaps sockets
are generally excepted. In practice, only a few units terminate right
at this point (possibly some of these are by coincidence):
dracut-cmdline.service dracut-pre-udev.service dracut-initqueue.service systemd-ask-password-plymouth.service initrd-root-device.target initrd.target
Two services are marked as Before=initrd-switch-root.target
.
These are the next two rows:
plymouth-switch-root.service
, which runs/usr/bin/plymouth update-root-fs --new-root-dir=/sysroot
. plymouth(1) says that this tells theplymouth
daemon about the upcoming root file system change. It doesn't say anything about the implications of the root change or the notification, though.initrd-udevadm-cleanup-db.service
, which runsudevadm info --cleanup-db
. udevadm(1) says that this makesudev
clean up its database. It does not explain what this means, but udev(7) gives the hint that thedb_persist
flag on a database event means that it will be kept even after--cleanup-db
. Agrep
shows thatdb_persist
only shows up in/etc/udev/rules.d/11-dm.rules
. This would seem to mean that only device manager nodes would be kept across--cleanup-db
. This is wrong (thettyS<number>
nodes, at least, are also kept), so there must be more going on. I didn't investigate further.
After that, initrd-switch-root.target
triggers. It runs
systemctl --no-block switch-root /sysroot
. This does the actual
switch of root directories inside systemd's at PID 1 process:
- Creates a memory-based file using
memfd_create()
and serializes all the systemd state to it. - Un-set some limits, etc., so that the new systemd knows what the defaults should be.
- Switches root. There's a
pivot_root
system call that it tries to use first, followed by unmounting the old root in its new location. This approach won't necessarily work, so as an alternative it does the system call equivalent ofmount --move /sysroot /
. - Deletes all of the old initrd using the equivalent of
rm -rf old_root
. This mildly scares the crap out of me, but it should only destroy files that were extracted from the initrd cpio image. - Re-execs itself as
/usr/lib/systemd/systemd --switched-root --system --deserialize <fd>
, where <fd> is a file descriptor for the state serialized earlier. - The new systemd starts and initializes itself using the serialized state from <fd>.
The plot shows all of the next 40 units all beginning at the same
time, meaning that none of them had any unsatisfied dependencies when
systemd
restarted (or that their dependencies could be satisfied
instantly).
Since the initrd is no longer mounted, at this point we no longer benefit from looking at what we extracted from it. Now the systemd units come from the runtime file system. We can list the directories that systemd searches for them with:
$ systemctl show -p UnitPath --value | tr ' ' '\n' /etc/systemd/system.control /run/systemd/system.control /run/systemd/transient /run/systemd/generator.early /etc/systemd/system /etc/systemd/system.attached /run/systemd/system /run/systemd/system.attached /run/systemd/generator /usr/local/lib/systemd/system /usr/lib/systemd/system /run/systemd/generator.late
Only a fraction of these directories exist, though:
$ for d in $(systemctl show -p UnitPath --value); do if test -d $d; then echo $d; fi; done /run/systemd/transient /etc/systemd/system /run/systemd/system /run/systemd/generator /usr/lib/systemd/system /run/systemd/generator.late
The first five units are slice units, followed a little bit later by
the target that some of them mark themselves as Before
:
machine.slice system-getty.slice system-modprobe.slice system-systemd-fsck.slice user.slice ⋮ slices.target
Rather than looking around the file system for these units, it's
easier to examine units with systemctl cat
, e.g.:
$ systemctl cat machine.slice # /usr/lib/systemd/system/machine.slice ⋮ [Unit] Description=Virtual Machine and Container Slice Documentation=man:systemd.special(7) Before=slices.target
Some of the above units are mysterious. For example, I can't find
system-modprobe.slice
anywhere. and systemctl cat
doesn't know
about it either, even though systemctl list
shows it as loaded and
active. The same is true for system-getty.slice
.
system-systemd-fsck.slice
doesn't even show up in systemctl
list
.
The next row is for systemd-ask-password-wall.path
. Its
corresponding service unit has:
ExecStart=systemd-tty-ask-password-agent --wall
Frankly, wall(1) seems to me like a dreadful way to prompt for a password. I hope this doesn't really get used much. I've never noticed it in practice.
No other password agent units are running, though, since the plymouth one terminated earlier, so I guess this is better than nothing. Maybe the idea is that needing passwords after running the initrd is unusual, or that if something likely to need them comes up then maybe a better agent will be configured first.
The next row is for proc-sys-fs-binfmt_misc.automount
. This is a
kind of systemd unit that is new to us, documented in
systemd.automount(5). When the mount point is first used, systemd
invokes the corresponding proc-sys-fs-binfmt_misc.mount
unit,
which mounts the binfmt_misc filesystem on
/proc/sys/fs/binfmt_misc
. I don't know why this is automounted
instead of mounted proactively. It's the only automount unit on my
system. When I checked, it was in fact mounted:
$ mount|grep binfmt systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=293) binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)