Modern Linux Boot Process

In the last year or two, I've spent more time than I like frantically trying to fix Linux laptops where I've broken the boot process one way or another. The biggest obstacle, in each case, has been that I didn't really understand what was going on, so I was stuck doing web searches until I found something that solved my immediate problem. A lot has changed in the x86 and Linux boot process since I last spent much time looking at it, so I'm writing this down after spending some time investigating how it works now.

This is specifically for the installation that I have on my own laptop, which boots 64-bit Fedora 32 from an NVMe SSD via UEFI and Grub 2.

Boot is complicated, even more complicated than I'd realized. This discussion is nowhere near complete.

The SSD

My SSD's NVMe device is /dev/nvme0. NVMe devices are divided into "namespaces". Mine has just one namespace /dev/nvme0n1. That appears to be the normal number of namespaces. This namespace is where all the data is. The nvme list command shows some basic stats. You can see that it's a 1 TB drive, for example:

$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     MI93T003610303E06    PC401 NVMe SK hynix 1TB                  1         324.04  GB /   1.02  TB    512   B +  0 B   80007E00

(The nvme utility isn't installed by default. On Fedora: sudo dnf install nvme-cli.)

The 324 GB number above might be the amount of storage in use from the drive's own perspective. I wasn't able to find any documentation explaining that number.

Partitions

I used Fedora's default partitioning for an encrypted boot drive. This yielded three partitions using a GPT disklabel:

$ sudo fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 953.89 GiB, 1024209543168 bytes, 2000409264 sectors
Disk model: PC401 NVMe SK hynix 1TB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 225DB613-867D-4004-9C94-4822154BA204

Device           Start        End    Sectors   Size Type
/dev/nvme0n1p1    2048     411647     409600   200M EFI System
/dev/nvme0n1p2  411648    2508799    2097152     1G Linux filesystem
/dev/nvme0n1p3 2508800 2000408575 1997899776 952.7G Linux filesystem

I hadn't known much about GPT before. It stands for "GUID partition table" and is an advancement over the previous MBR scheme in multiple ways: it supports up to 128 partitions (instead of just 4 primary partitions), it uses a GUID to identify partition types (instead of a byte), it supports very large disks via 64-bit LBA addresses, it has a CRC32 checksum for safety, and it is stored at both the beginning and the end of the disk for redundancy. It preserves some compatibility with the traditional MBR by using an MBR that declares the whole disk to be a single partition of type 0xee.

The partitions above are:

  • /dev/nvme0n1p1: This is the EFI system partition. It is formatted as a FAT file system. Ultimately, it ends up mounted at /boot/efi. It is small (by today's standards!), only 200 MB. It contains the bootloader. Here's another way to see this:

    $ ls -l /dev/disk/by-partlabel/
    total 0
    lrwxrwxrwx. 1 root root 15 Jul 10 16:19 'EFI\x20System\x20Partition' -> ../../nvme0n1p1
    
  • /dev/nvme0n1p2: This is an ext4 partition ultimately mounted at /boot. It is a little bigger than the EFI system partition. It contains kernels and their configurations and modules.

  • /dev/nvme0n1p3: This is a LUKS2 encrypted partition. It takes up the bulk of the disk space.

    (LUKS stands for Linux Unified Key Setup. It's the Linux standard for disk encryption.)

Even though GPT uses a whole 128-bit GUID to identify the type of each partition, my installation uses a single generic "Linux filesystem" GUID for both of its Linux partitions, although more specific type GUIDs exist (fdisk knows a "Linux LUKS" type, for example). Thus, the second and third partitions above have the same GUID for their types, rather than different GUIDs for ext4 versus LUKS2 encryption. We can ask fdisk to show us the "Type-UUID"s as well as the partitions' UUIDs:

$ sudo fdisk -l -o +uuid,type-uuid /dev/nvme0n1
...
Device           Start        End    Sectors   Size Type UUID                                 Type-UUID
/dev/nvme0n1p1    2048     411647     409600   200M EFI  10F28736-7EB9-48A8-BB37-AC2FDBACAC3F C12A7328-F81F-11D2-BA4B-00A0C93EC93B
/dev/nvme0n1p2  411648    2508799    2097152     1G Linu 1D24A3B1-3131-4365-A3A7-DADAE8DDF6FD 0FC63DAF-8483-4772-8E79-3D69D8477DE4
/dev/nvme0n1p3 2508800 2000408575 1997899776 952.7G Linu 7A2DC291-4F3C-4DE5-B1C9-DC6351863A89 0FC63DAF-8483-4772-8E79-3D69D8477DE4

(Linux doesn't use GPT partition UUIDs much. They do not appear in, for example, /dev/disk/by-uuid; rather, the UUIDs embedded in the filesystems' superblocks appear there. GPT partition UUIDs do appear in /dev/disk/by-partuuid.)

Early Boot

In UEFI, boot starts out from a boot manager embedded in the machine's firmware. Boot managers use a standardized set of variables. The most important of these for our purposes is BootOrder. Its value is a list of numbers, each of which identifies a Boot<XXXX> entry, where <XXXX> is the number from BootOrder written as four hexadecimal digits.

The efivar utility manipulates EFI variables. It's not installed by default. On Fedora, you can install it with sudo dnf install efivar. (The /sys/firmware/efi/efivars directory also provides an interface to EFI variables. These are binary files. The first four bytes of each file hold the variable's attributes; skip them to get the raw data.)

We can get a list of EFI variables with efivar -l. The efivar program makes us specify variable names along with an obnoxious UUID, so this is useful for finding out those UUIDs. On my system:

$ sudo efivar -l|grep BootOrder
8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder
45cf35f6-0d6e-4d04-856a-0370a5b16f53-DefaultBootOrder
$ sudo efivar -p --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-BootOrder
GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
Name: "BootOrder"
Attributes:
        Non-Volatile
        Boot Service Access
        Runtime Service Access
Value:
00000000  09 00 06 00 02 00 03 00  04 00 05 00 07 00 0a 00  |................|

The value is a sequence of 16-bit little-endian numbers, shown here in hex, so this is 9, 6, 2, 3, 4, 5, 7, 10. We can then dump the first boot order entry:

$ sudo efivar -l|grep Boot0009
8be4df61-93ca-11d2-aa0d-00e098032b8c-Boot0009
$ sudo efivar -p --name 8be4df61-93ca-11d2-aa0d-00e098032b8c-Boot0009
GUID: 8be4df61-93ca-11d2-aa0d-00e098032b8c
Name: "Boot0009"
Attributes:
        Non-Volatile
        Boot Service Access
        Runtime Service Access
Value:
00000000  01 00 00 00 62 00 46 00  65 00 64 00 6f 00 72 00  |....b.F.e.d.o.r.|
00000010  61 00 00 00 04 01 2a 00  01 00 00 00 00 08 00 00  |a.....*.........|
00000020  00 00 00 00 00 40 06 00  00 00 00 00 36 87 f2 10  |.....@......6...|
00000030  b9 7e a8 48 bb 37 ac 2f  db ac ac 3f 02 02 04 04  |.~.H.7./...?....|
00000040  34 00 5c 00 45 00 46 00  49 00 5c 00 66 00 65 00  |4.\.E.F.I.\.f.e.|
00000050  64 00 6f 00 72 00 61 00  5c 00 73 00 68 00 69 00  |d.o.r.a.\.s.h.i.|
00000060  6d 00 78 00 36 00 34 00  2e 00 65 00 66 00 69 00  |m.x.6.4...e.f.i.|
00000070  00 00 7f ff 04 00                                 |......          |

The filename embedded in this is in UCS-2LE. In ASCII, it says \EFI\fedora\shimx64.efi. This refers to a file in the EFI system partition.
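The sixteen bytes starting at offset 0x2c of the same dump look like a GUID. EFI stores the first three fields of a GUID little-endian and the remaining eight bytes in order, so they can be rearranged by hand. The following is only a minimal sketch; guid_from_bytes is a name made up for illustration, not a standard tool:

```sh
#!/bin/sh
# Rearrange 16 space-separated hex bytes, stored in EFI's mixed-endian
# GUID layout, into the usual textual form. (Hypothetical helper.)
guid_from_bytes() {
    # Split the argument into 16 positional parameters, one per byte.
    set -- $1
    # Fields 1-3 are little-endian (byte-reversed); the rest are in order.
    printf '%s%s%s%s-%s%s-%s%s-%s%s-%s%s%s%s%s%s\n' \
        "$4" "$3" "$2" "$1" \
        "$6" "$5" \
        "$8" "$7" \
        "$9" "${10}" \
        "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" | tr 'a-f' 'A-F'
}

# Bytes copied from offsets 0x2c-0x3b of the Boot0009 dump above.
guid_from_bytes "36 87 f2 10 b9 7e a8 48 bb 37 ac 2f db ac ac 3f"
# prints 10F28736-7EB9-48A8-BB37-AC2FDBACAC3F
```

The result matches the UUID that fdisk reported for /dev/nvme0n1p1 earlier, which suggests that the boot entry locates the EFI system partition by its GPT partition UUID.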

The efibootmgr program puts this together in a nicer way:

$ efibootmgr
BootCurrent: 0009
Timeout: 0 seconds
BootOrder: 0009,0006,0002,0003,0004,0005,0007,000A
Boot0000* Windows Boot Manager
Boot0001* debian
Boot0002* Diskette Drive
Boot0003* USB Storage Device
Boot0004* CD/DVD/CD-RW Drive
Boot0005* Onboard NIC
Boot0006* ubuntu
Boot0007* Linux Firmware Updater
Boot0009* Fedora
Boot000A* USB NIC
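The BootOrder line that efibootmgr prints can be reproduced from the raw bytes that efivar showed earlier. This is a small sketch of the decoding, assuming the value is passed in as space-separated hex bytes; decode_boot_order is an illustrative name, not part of any package:

```sh
#!/bin/sh
# Turn the raw BootOrder bytes into the Boot<XXXX> variable names they
# refer to. (Hypothetical helper; the subshell body keeps variables local.)
decode_boot_order() (
    set -- $1
    out=""
    while [ "$#" -ge 2 ]; do
        # Little-endian: the first byte is the low byte, the second the high byte.
        entry=$(printf 'Boot%04X' "$(( (0x$2 << 8) | 0x$1 ))")
        out="${out:+$out }$entry"
        shift 2
    done
    printf '%s\n' "$out"
)

# Bytes copied from the efivar dump of BootOrder.
decode_boot_order "09 00 06 00 02 00 03 00 04 00 05 00 07 00 0a 00"
# prints Boot0009 Boot0006 Boot0002 Boot0003 Boot0004 Boot0005 Boot0007 Boot000A
```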

Since, as mentioned before, the EFI system partition is mounted at /boot/efi, we can look at shimx64.efi and the rest of its directory:

$ sudo ls /boot/efi/EFI/fedora -l
total 15604
-rwx------. 1 root root     112 Oct  2  2018 BOOTIA32.CSV
-rwx------. 1 root root     110 Oct  2  2018 BOOTX64.CSV
drwx------. 2 root root    4096 Jun  5 18:01 fonts
drwx------. 2 root root    4096 Jun  5 18:07 fw
-rwx------. 1 root root   62832 May 22 04:32 fwupdx64.efi
-rwx------. 1 root root 1587528 May 26 09:23 gcdia32.efi
-rwx------. 1 root root 2513224 May 26 09:23 gcdx64.efi
-rwx------. 1 root root    5471 Sep  8  2019 grub.cfg
-rwx------. 1 root root    1024 Jul 10 16:22 grubenv
-rwx------. 1 root root 1587528 May 26 09:23 grubia32.efi
-rwx------. 1 root root 2513224 May 26 09:23 grubx64.efi
-rwx------. 1 root root  927824 Oct  2  2018 mmia32.efi
-rwx------. 1 root root 1159560 Oct  2  2018 mmx64.efi
-rwx------. 1 root root 1210776 Oct  2  2018 shim.efi
-rwx------. 1 root root  975536 Oct  2  2018 shimia32.efi
-rwx------. 1 root root  969264 Oct  2  2018 shimia32-fedora.efi
-rwx------. 1 root root 1210776 Oct  2  2018 shimx64.efi
-rwx------. 1 root root 1204496 Oct  2  2018 shimx64-fedora.efi

shimx64.efi is an executable program that sits between the UEFI boot manager and Grub 2. True to its name, it acts as a "shim" for Secure Boot. We can see that it's an executable:

$ sudo file /boot/efi/EFI/fedora/shimx64.efi
/boot/efi/EFI/fedora/shimx64.efi: PE32+ executable (EFI application) x86-64 (stripped to external PDB), for MS Windows

It's not really a Windows program; it's just in the Windows PE ("Portable Executable") format. Intel designed EFI and it's heavily Windows-flavored. If you run strings on the shim binary, you see a lot of OpenSSL references. That is presumably the "Secure" part.

The shim is not part of the Grub bootloader. It is a separate program, shim <https://github.com/rhboot/shim/>, that runs before Grub and loads Grub. The first paragraph from its own README.md is a good description:

shim is a trivial EFI application that, when run, attempts to open and execute another application. It will initially attempt to do this via the standard EFI LoadImage() and StartImage() calls. If these fail (because Secure Boot is enabled and the binary is not signed with an appropriate key, for instance) it will then validate the binary against a built-in certificate. If this succeeds and if the binary or signing key are not blacklisted then shim will relocate and execute the binary.

The shim is signed by Microsoft (!) through an onerous process <https://blog.hansenpartnership.com/adventures-in-microsoft-uefi-signing/>.

Grub

The shimx64.efi program runs grubx64.efi, which is the actual Grub 2 bootloader. In turn, grubx64.efi reads and executes grub.cfg, which is a script written in a Grub-specific language that resembles Bourne shell with some special features for booting.

grub.cfg has lots of insmod commands. These do not load Linux kernel modules; they load Grub modules. Grub has a lot of modules: I see 295 of them in /usr/lib/grub/i386-pc. The modules are not consistently documented. A lot of modules exist to implement a particular command; for example, hexdump.mod implements the hexdump command.

grub.cfg loads environment variables from grubenv, which is mostly a text file but has a special format. Use grub-editenv (installed as grub2-editenv on Fedora) to edit grubenv if you really need to. See The GRUB environment block <https://www.gnu.org/software/grub/manual/grub/html_node/Environment-block.html> for more information.

In earlier versions of Grub, grub.cfg usually contained a list of menuentry commands that said how to boot various kernels you might have. This isn't the case for my Fedora installation. Instead, grub.cfg has a single blscfg command:

# The blscfg command parses the BootLoaderSpec files stored in /boot/loader/entries and
# populates the boot menu. Please refer to the Boot Loader Specification documentation
# for the files format: https://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/.
⋮
insmod blscfg
blscfg

In turn, I have several .conf files in /boot/loader/entries:

$ sudo ls /boot/loader/entries/
b4e66474bf8f4ab997720a7bd0d83628-0-rescue.conf
b4e66474bf8f4ab997720a7bd0d83628-5.3.12-300.fc31.x86_64.conf
b4e66474bf8f4ab997720a7bd0d83628-5.4.13-201.fc31.x86_64.conf
b4e66474bf8f4ab997720a7bd0d83628-5.5.17-200.fc31.x86_64.conf
b4e66474bf8f4ab997720a7bd0d83628-5.6.15-300.fc32.x86_64.conf

Each of the .conf files describes one menu entry. My default entry is:

$ sudo cat /boot/loader/entries/b4e66474bf8f4ab997720a7bd0d83628-5.6.15-300.fc32.x86_64.conf
title Fedora (5.6.15-300.fc32.x86_64) 32 (Thirty Two)
version 5.6.15-300.fc32.x86_64
linux /vmlinuz-5.6.15-300.fc32.x86_64
initrd /initramfs-5.6.15-300.fc32.x86_64.img
options $kernelopts
grub_users $grub_users
grub_arg --unrestricted
grub_class kernel

In my installation, the $kernelopts variable above comes from /boot/efi/EFI/fedora/grubenv:

kernelopts=root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet systemd.unified_cgroup_hierarchy=0

(I added the systemd.unified_cgroup_hierarchy=0 variable assignment at the end to disable cgroups v2. This is needed to run Docker.)
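The grubenv format is simple enough to read without special tools: a "# GRUB Environment Block" header line, then name=value lines, then "#" padding up to a fixed size. The sketch below reads one variable from such a file. grubenv_get is a made-up name, and the sample file is built on the spot rather than read from /boot/efi, so it runs without root; for real use, the list command of grub-editenv is the supported way.

```sh
#!/bin/sh
# Print the value of one variable from a grubenv-style file.
# (Hypothetical helper; the subshell body keeps variables local.)
grubenv_get() (
    file=$1 name=$2
    # The trailing '#' padding has no final newline, hence the extra test.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "#"*)      continue ;;                      # header and padding
            "$name="*) printf '%s\n' "${line#*=}"; exit 0 ;;
        esac
    done < "$file"
    exit 1                                              # variable not found
)

# Build a small sample file in the same layout (contents are illustrative).
sample=$(mktemp)
printf '# GRUB Environment Block\nkernelopts=root=/dev/mapper/fedora-root ro rhgb quiet\n##########' > "$sample"
grubenv_get "$sample" kernelopts
# prints root=/dev/mapper/fedora-root ro rhgb quiet
rm -f "$sample"
```

Only the text up to the first "=" is treated as the name, so values that themselves contain "=" (as kernelopts does) come through intact.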

One may compare these kernel options to the ones in the running kernel. The only difference is the BOOT_IMAGE at the beginning (Grub adds this itself):

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.6.15-300.fc32.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet systemd.unified_cgroup_hierarchy=0

Finally, I noticed that grub.cfg itself has a similar assignment to $default_kernelopts. I don't know whether or how this is connected to $kernelopts that all of the .conf files use:

set default_kernelopts="root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/swap rhgb quiet "

Grub Root File Systems

Grub needs to find the root file system, the one that contains the Linux kernel and initrd files, which in my case is the ext4 partition /dev/nvme0n1p2. grub.cfg tells it to find this partition by searching for it by UUID:

search --no-floppy --fs-uuid --set=root 71baeefe-e3d2-45e1-a0ff-bf651d4a7591

grub.cfg also tells Grub to find the boot partition, i.e. the EFI system partition in FAT format on /dev/nvme0n1p1, by its UUID:

search --no-floppy --fs-uuid --set=boot 2AB5-7C0C

It's not clear to me what Grub uses the latter information for, if anything, since at the time this search command runs Grub has already read all the files it really needs.

Kernel Loading

We've covered everything necessary for Grub to allow the user to select a kernel. Suppose the user selects the Fedora 32 entry detailed above. This will cause the associated commands to run. The important ones are:

linux /vmlinuz-5.6.15-300.fc32.x86_64
initrd /initramfs-5.6.15-300.fc32.x86_64.img
options $kernelopts

The linux command points to the kernel image to load. The path is relative to the Grub root established above, which on a booted system is mounted at /boot:

$ ls -l /boot/vmlinuz-5.6.15-300.fc32.x86_64
-rwxr-xr-x. 1 root root 10795112 May 29 07:42 /boot/vmlinuz-5.6.15-300.fc32.x86_64
$ file /boot/vmlinuz-5.6.15-300.fc32.x86_64
/boot/vmlinuz-5.6.15-300.fc32.x86_64: Linux kernel x86 boot executable bzImage, version 5.6.15-300.fc32.x86_64 ([email protected]) #1 SMP Fri May 29 14:23:59 , RO-rootFS, swap_dev 0xA, Normal VGA

The initrd command points to an additional file that Grub loads into memory and makes available to the kernel.

Now Grub turns over control to the kernel. The kernel does basic system setup. We will ignore that, as detailed and exciting as it is.

Initrd

After the kernel does basic system setup, it finds userspace to run. To do that, it needs a file system. Initially, nothing is mounted, except for an empty root directory (a ramfs file system).

The initial file system can come from a few places. The following sections explore the possibilities.

Built-In cpio Archive

There is always a cpio archive built into the kernel itself. (cpio has the same purpose as tar, but its command-line interface is very different.) We should look at it. It is kind of hard to get it out, and some of the recipes I found online did not work. Eventually, I got the uncompressed kernel image using extract-vmlinux, then the start and end addresses of the archive from System.map (the __initramfs_size symbol appears to sit directly after the archive, so its address serves as the end), then the offset in the file from objdump, then the contents via dd:

$ /usr/src/kernels/5.6.15-300.fc32.x86_64/scripts/extract-vmlinux /boot/vmlinuz-5.6.15-300.fc32.x86_64 > vmlinux
$ sudo grep initramfs /boot/System.map-5.6.15-300.fc32.x86_64
ffffffff831d2228 D __initramfs_start
ffffffff831d2428 D __initramfs_size
$ sudo objdump --file-offsets -s --start-address=0xffffffff831d2228 --stop-address=0xffffffff831d2428 vmlinux | head -4

vmlinux:     file format elf64-x86-64

Contents of section .init.data:  (Starting at file offset: 0x25d2228)
$ sudo dd if=vmlinux bs=1 skip=$((0x25d2228)) count=$((0xffffffff831d2428 - 0xffffffff831d2228)) status=none | cpio -vt
drwxr-xr-x   2 root     root            0 May 29 07:24 dev
crw-------   1 root     root       5,   1 May 29 07:24 dev/console
drwx------   2 root     root            0 May 29 07:24 root
1 block

As you can see above, there's barely anything in the kernel's built-in cpio archive. This is clearly not how this system is booting.

Initrd cpio Archive

Consider the initrd command that was given to Grub. Its name stands for "initial RAM disk". This is what this kernel is actually using to boot.

Earlier versions of Linux (though not the very earliest!) pointed initrd to a small disk image that the kernel would mount and use as the first stage of userland. These days, initrd points to a cpio archive:

$ ls -l /boot/initramfs-5.6.15-300.fc32.x86_64.img
-rw-------. 1 root root 35496138 Jun  5 18:05 /boot/initramfs-5.6.15-300.fc32.x86_64.img
$ file /boot/initramfs-5.6.15-300.fc32.x86_64.img
/boot/initramfs-5.6.15-300.fc32.x86_64.img: regular file, no read permission
$ sudo file /boot/initramfs-5.6.15-300.fc32.x86_64.img
/boot/initramfs-5.6.15-300.fc32.x86_64.img: ASCII cpio archive (SVR4 with no CRC)

We can look inside the archive:

$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | cpio -tv
drwxr-xr-x   3 root     root            0 May 29 11:35 .
-rw-r--r--   1 root     root            2 May 29 11:35 early_cpio
drwxr-xr-x   3 root     root            0 May 29 11:35 kernel
drwxr-xr-x   3 root     root            0 May 29 11:35 kernel/x86
drwxr-xr-x   2 root     root            0 May 29 11:35 kernel/x86/microcode
-rw-r--r--   1 root     root        99328 May 29 11:35 kernel/x86/microcode/GenuineIntel.bin
196 blocks

This is also clearly not anything that can be booted. This archive is just for CPU microcode. It's not clear to me exactly how and when the kernel finds and sets up this microcode. Nothing related to it shows up in dmesg.

The above is less than 100 kB in size (cpio uses 512-byte blocks). The previous ls -l shows that the initrd is tens of megabytes in size, so there must be more following the cpio archive. Indeed:

$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; file -)
/dev/stdin: gzip compressed data, max compression, from Unix
$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | file -)
/dev/stdin: ASCII cpio archive (SVR4 with no CRC)
$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | cpio -tv)
drwxr-xr-x  12 root     root            0 May 29 11:35 .
lrwxrwxrwx   1 root     root            7 May 29 11:35 bin -> usr/bin
drwxr-xr-x   2 root     root            0 May 29 11:35 dev
crw-r--r--   1 root     root       5,   1 May 29 11:35 dev/console
crw-r--r--   1 root     root       1,  11 May 29 11:35 dev/kmsg
crw-r--r--   1 root     root       1,   3 May 29 11:35 dev/null
crw-r--r--   1 root     root       1,   8 May 29 11:35 dev/random
crw-r--r--   1 root     root       1,   9 May 29 11:35 dev/urandom
drwxr-xr-x  11 root     root            0 May 29 11:35 etc
-rw-r--r--   1 root     root           92 May 29 11:35 etc/block_uuid.map
…2468 files and directories omitted…
-rw-r--r--   1 root     root         1730 Jan 29 08:57 usr/share/terminfo/l/linux
drwxr-xr-x   2 root     root            0 May 29 11:35 usr/share/terminfo/v
-rw-r--r--   1 root     root         1190 Jan 29 08:57 usr/share/terminfo/v/vt100
-rw-r--r--   1 root     root         1184 Jan 29 08:57 usr/share/terminfo/v/vt102
-rw-r--r--   1 root     root         1377 Jan 29 08:57 usr/share/terminfo/v/vt220
lrwxrwxrwx   1 root     root           20 May 29 11:35 usr/share/unimaps -> /usr/lib/kbd/unimaps
drwxr-xr-x   3 root     root            0 May 29 11:35 var
lrwxrwxrwx   1 root     root           11 May 29 11:35 var/lock -> ../run/lock
lrwxrwxrwx   1 root     root            6 May 29 11:35 var/run -> ../run
drwxr-xr-x   2 root     root            0 May 29 11:35 var/tmp

Now we're getting somewhere! We can extract the archive into a temporary directory for further examination:

$ mkdir initrd
$ cd initrd/
$ sudo cat /boot/initramfs-5.6.15-300.fc32.x86_64.img | (cpio -t >/dev/null 2>&1; zcat | cpio -i)
cpio: dev/console: Cannot mknod: Operation not permitted
cpio: dev/kmsg: Cannot mknod: Operation not permitted
cpio: dev/null: Cannot mknod: Operation not permitted
cpio: dev/random: Cannot mknod: Operation not permitted
cpio: dev/urandom: Cannot mknod: Operation not permitted
156121 blocks
$ ls
bin  etc   lib    proc  run   shutdown  sysroot  usr
dev  init  lib64  root  sbin  sys       tmp      var

(One may also sudo chroot into the initrd, which has a shell and basic libraries and command-line tools. It is not a comfortable environment in other ways, though; for example, Backspace and Ctrl+U didn't behave in a sane way for me.)
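The layering found above (an uncompressed cpio archive followed by gzip data) was identified with file, which works from magic numbers at the start of the data. As a rough sketch of the same idea, the function below classifies a file by its first bytes. magic_type is an illustrative name, and the list of formats is an assumption about what an initrd might plausibly contain, not something taken from the kernel or Dracut:

```sh
#!/bin/sh
# Guess a file's type from its first six bytes. (Hypothetical helper.)
magic_type() {
    # Hex-dump the leading bytes as one unbroken lowercase string.
    hex=$(head -c 6 "$1" | od -An -tx1 | tr -d ' \n')
    case $hex in
        303730373031*) echo "cpio newc (no CRC)" ;;   # ASCII "070701"
        303730373032*) echo "cpio newc (CRC)" ;;      # ASCII "070702"
        1f8b*)         echo "gzip" ;;
        fd377a585a00*) echo "xz" ;;
        28b52ffd*)     echo "zstd" ;;
        *)             echo "unknown" ;;
    esac
}

# Demonstrate on two tiny sample files rather than a real initrd.
tmp=$(mktemp)
printf '070701' > "$tmp";        magic_type "$tmp"   # prints cpio newc (no CRC)
printf '\037\213\010' > "$tmp";  magic_type "$tmp"   # prints gzip
rm -f "$tmp"
```

Run as root against the initrd image, this only reports the first layer; the second layer starts wherever the first cpio archive ends, which is why the pipelines above let cpio consume the first archive before handing the rest to zcat.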

After the kernel extracts this archive into RAM, it starts /init, which is actually systemd:

$ ls -l init
lrwxrwxrwx. 1 bpfaff bpfaff 23 Jul 11 11:37 init -> usr/lib/systemd/systemd
$ file usr/lib/systemd/systemd
usr/lib/systemd/systemd: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=a427da4e4bd4db8b6aac43245f8da742e1994076, for GNU/Linux 3.2.0, stripped

After I discovered this and started looking through the systemd documentation for information on startup, I came across bootup(7). It is full of great information (although it further references boot(7), which doesn't exist), including a diagram. In particular, the section titled "Bootup in the Initial RAM Disk (initrd)" is relevant now. dracut.bootup(7) also has a wonderful diagram illustrating the bootup sequence.

bootup(7) says that the default target is initrd.target if /etc/initrd-release exists, which it does in my case. (Kernel command-line setting rd.systemd.unit=… could override the default, but it is not set.) Even if /etc/initrd-release did not exist, my initrd has a symlink from /etc/systemd/system/default.target to /usr/lib/systemd/system/initrd.target. The latter file contains, in part:

ConditionPathExists=/etc/initrd-release
Requires=basic.target
Wants=initrd-root-fs.target initrd-root-device.target initrd-fs.target initrd-parse-etc.service
After=initrd-root-fs.target initrd-root-device.target initrd-fs.target basic.target rescue.service rescue.target

In my initrd, /etc/initrd-release is a symlink to /usr/lib/initrd-release. By either name, it contains a series of key-value pairs in a format suitable for the Bourne shell. Grepping around the initrd tree shows that the /usr/bin/dracut-cmdline shell script sources these pairs (via /usr/lib/initrd-release). This will come up later, in Dracut Command Line Parsing.

Early systemd Startup

It would be possible to figure out what's going on at boot by manually looking through all of the systemd unit files in /etc/systemd/system and /usr/lib/systemd/system. This would be a slow process because there are a lot of them and it's not easy to follow all the dependency chains. I didn't want to do that.

Luckily, systemd comes with the systemd-analyze program to help out. It has a number of subcommands. The best one for this purpose seems to be the plot subcommand. I invoked it like this:

systemd-analyze plot > bootup.svg

The output was bootup.svg. You might want to load this in another tab or another window so you can flip back and forth, because the rest of boot follows along with it. (This diagram is very tall and wide! You will probably have to pan to the right or shrink it down with Ctrl+− to see anything meaningful.)

Multiple programs on my system can view SVG files but I ended up using Firefox the most because it allows cut-and-paste of text whereas the other viewers I tried do not.

The SVG output is a 2-d plot where the x axis is time and the y axis is a series of rows, each of which is a bar that represents a timespan. There are a few gray bars at the top that represent a few important stages of kernel loading: "firmware", "loader", "kernel", "initrd". Time is in seconds, with zero at the transition from "loader" to "kernel". The rest of the rows are red and represent units, spanning from the unit's start time to its exit time. Some units finish quickly, so they only have short bars; some run as long as the system is up, so their bars run past the right end of the x axis.

On my system, "kernel" takes about 2 seconds to initialize and then "initrd" (and thus systemd) takes over. About 300 ms later, systemd starts launching units. The first few units it launches are built-in and I think they happen automatically without considering any dependency chains. All of them appear to remain active as long as the system itself does. These are documented in systemd.special(7):

  • -.mount: This represents the root mount point.

    systemd has a naming convention for mounts (and slices, below), that uses - as the root (instead of /) and as the delimiter between levels, so that a mount /home/foo would be called home-foo.mount. Details in systemd.unit(5) under "String Escaping for Inclusion in Unit Names".

  • -.slice: Root of the "slice hierarchy" documented in systemd.slice(5).

    A slice is a cgroup that systemd manages. Slices may contain slices (recursively), services, and scopes. Slices don't directly contain processes (but services and scopes nested within them do). More information in systemd.slice(5).

    Slices have a hierarchy expressed the same way as for mounts (although slices don't correspond to files), so that foo-bar.slice is a child under foo.slice.

    systemd.special(7) says that -.slice doesn't usually contain units, that instead it is used to set defaults for the slice tree.

  • init.scope: Scope unit that contains PID 1.

    Scopes contain processes that are started by arbitrary processes (as opposed to processes within services, which systemd itself starts).

    https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ has useful documentation on slices and scopes, especially the section at https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/#systemdsresourcecontrolconcepts

    I discovered the systemd-cgls command and used it to list the contents of init.scope. Indeed, it only contained systemd itself:

    $ systemd-cgls -u init.scope
    Unit init.scope (/init.scope):
    └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 30
    
  • system.slice: Default slice for service and scope units.

    When I ran systemd-cgls -u system.slice, I saw 44 services running under it ranging from abrtd.service to wpa_supplicant.service, with no sub-slices or sub-scopes.

Encrypted Disk Setup

The next row in the systemd-analyze plot output is for system-systemd\x2dcryptsetup.slice. The name illustrates systemd's escaping convention: \x2d is hexified ASCII for -. This gets used a lot and it makes the names hard to read, so for the rest of this document I'm going to replace \x2d by - without mentioning it further.
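systemd ships a systemd-escape tool that applies these rules. To make the convention concrete, here is a minimal sketch that imitates only its path mode, under the assumption that the input is a plain ASCII path; escape_path is a made-up name and this is not systemd's own code:

```sh
#!/bin/sh
# Escape an absolute path the way mount unit names are formed:
# "/" becomes "-", and most other punctuation becomes \xXX.
# (Hypothetical helper; ASCII paths only; subshell keeps variables local.)
escape_path() (
    # Collapse repeated slashes, then trim one leading and one trailing slash.
    path=$(printf '%s' "$1" | tr -s '/')
    path=${path#/}
    path=${path%/}
    # The root directory itself is named "-".
    if [ -z "$path" ]; then
        printf '%s\n' '-'
        exit 0
    fi
    out="" first=1
    while [ -n "$path" ]; do
        rest=${path#?}            # everything after the first character
        c=${path%"$rest"}         # the first character itself
        path=$rest
        case $c in
            /) out="$out-" ;;
            .) if [ "$first" = 1 ]; then out="$out\\x2e"; else out="$out."; fi ;;
            [a-zA-Z0-9:_]) out="$out$c" ;;
            *) out="$out$(printf '\\x%02x' "'$c")" ;;   # e.g. "-" -> \x2d
        esac
        first=0
    done
    printf '%s\n' "$out"
)

escape_path /home/foo            # prints home-foo
escape_path /                    # prints -
escape_path /dev/disk/by-uuid    # prints dev-disk-by\x2duuid
```

Appending .mount to the first result gives home-foo.mount, the example used earlier. The last line shows where \x2d comes from: a literal - has to be escaped so that it cannot be confused with the separator.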

systemd-cryptsetup(8) says that systemd uses systemd-cryptsetup-generator to translate /etc/crypttab into units. systemd-cryptsetup-generator(8) mentions that it also uses some kernel command line options with luks in their names, including the one that appears on my kernel command line:

rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88

The /etc/crypttab in my initrd says:

luks-3acca984-01a3-4c67-b9c6-94433f667b88 /dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88 none discard

Putting the device in both places appears to be "belt-and-suspenders" redundancy, because rd.luks.uuid is documented to "activate the specified device as part of the boot process as if it was listed in /etc/crypttab".

Documentation for systemd-cryptsetup-generator pointed to systemd.generator(7). This revealed that systemd has a general concept of "generators", which are programs that run early in boot to translate file formats foreign to systemd into units. The manpage gave an example for debugging a generator, so I ran:

$ dir=$(mktemp -d)
$ sudo SYSTEMD_LOG_LEVEL=debug /usr/lib/systemd/system-generators/systemd-cryptsetup-generator "$dir" "$dir" "$dir"

This generated a tree of directories and symlinks and some unit files. The most interesting one was systemd-cryptsetup@luks-3acca984-01a3-4c67-b9c6-94433f667b88.service, which contained, among other things:

ExecStart=/usr/lib/systemd/systemd-cryptsetup attach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88' '/dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88' 'none' 'discard'
ExecStop=/usr/lib/systemd/systemd-cryptsetup detach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88'

Looking around, the same file was in my running system in /run/systemd/generator, so that's presumably the place that systemd keeps it at runtime.
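The core of what the generator did with my crypttab line can be imitated in a few lines. This is only a sketch of the mapping visible in the output above, assuming a line with exactly four whitespace-separated fields; crypttab_to_execstart is an illustrative name, and the real generator writes a complete unit file with dependencies and more:

```sh
#!/bin/sh
# Turn one four-field crypttab line (name, device, key file, options)
# into the ExecStart= line seen in the generated unit. (Hypothetical helper.)
crypttab_to_execstart() {
    # Split the line into its four fields as positional parameters.
    set -- $1
    printf "ExecStart=/usr/lib/systemd/systemd-cryptsetup attach '%s' '%s' '%s' '%s'\n" \
        "$1" "$2" "$3" "$4"
}

# The line from the initrd's /etc/crypttab shown above.
crypttab_to_execstart "luks-3acca984-01a3-4c67-b9c6-94433f667b88 /dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88 none discard"
```

The output is the same ExecStart= line that appeared in the generated service file.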

I noticed that systemd-cgls didn't list this slice, even though the plot output showed it as living "forever". I found the --all option to systemd-cgls, for listing empty groups. With that, I found the slice. Presumably, it could have new service units at runtime if I connected a new encrypted disk, or as part of system shutdown.

This service doesn't immediately prompt for the password. Bootup (from the initrd) continues.

Hibernate/Resume

The next row lists system-systemd-hibernate-resume.slice. This is for resume from hibernation. I don't use this feature, so I didn't look any deeper.

Journal Sockets

The next three lines in the plot are all for journal sockets:

systemd-journald-audit.socket
systemd-journald-dev-log.socket
systemd-journald.socket

Each of these has a unit file in the initrd, e.g. /usr/lib/systemd/system/systemd-journald.socket. Each of these unit files causes systemd to listen on a socket listed in the file (e.g. as ListenStream=/run/systemd/journal/stdout) and then pass any accepted connections to systemd-journald (because all of them say Service=systemd-journald.service). See systemd.socket(5) for details.

The "journal" is what systemd calls the log. I don't know why it uses that name.

Virtual Console Setup

The next row is for systemd-vconsole-setup.service. systemd-vconsole-setup(8) documents that this reads /etc/vconsole.conf. In my initrd that file contains:

KEYMAP="us"
FONT="eurlatgr"

The documentation says that kernel command-line options can override this file. Mine don't.

I wish systemd were smarter about console fonts (or maybe it's Dracut or Fedora that's responsible). The one that it chooses on my system is almost unreadably small (on a 4k screen).

Dracut Command Line Parsing

Dracut (see dracut(8)) is a program written in Bash that generates the initrd that I'm using. The next few rows in the startup plot relate to Dracut.

The first one is dracut-cmdline.service. This service is underdocumented: dracut-cmdline.service(8) just says that it runs hooks to parse the kernel command line. The service turns out to just be a shell script /usr/bin/dracut-cmdline. When run, it sources /dracut-state.sh (yes, in the root!), if it exists, then /usr/lib/initrd-release (as we saw in Initrd cpio Archive earlier). Each of these is a set of shell variable assignments.

Then the shell script looks at various parameters that might be present in the kernel command line. The kernel command-line options I have set that Dracut or its hooks care about are:

root=/dev/mapper/fedora-root
rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88
rd.lvm.lv=fedora/root
rd.lvm.lv=fedora/swap

Dracut itself looks at the root command-line parameter. It also invokes a string of hook scripts in /usr/lib/dracut/hooks/cmdline. These turn out to be important for my case:

  • 30-parse-crypt.sh looks at rd.luks.uuid and creates a rule in /etc/udev/rules.d/70-luks.rules. (I think so; if so, the rule doesn't make it into the booted system.)
  • 30-parse-lvm.sh looks at the rd.lvm.lv variables above.

These hooks create shell scripts in /lib/dracut/hooks/initqueue/finished. Each of these exits successfully only if the appropriate device file exists (i.e. in /dev/disk or /dev/fedora). This enables the Dracut Initqueue Service, later, to know that initialization is complete.

Finally, it dumps the generated variables back into /dracut-state.sh. Other Dracut scripts that run later both source this and update it.

(I wasn't able to get a sample of /dracut-state.sh from my own system. It disappears after boot is complete, as part of the pivot_root operation, I believe.)

This script is, I believe, where the Dracut kernel command-line options, which are listed in dracut.cmdline(7), get parsed, although most of them actually take effect in a later stage.
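
To make the parsing step concrete, here is a minimal sketch of pulling repeated key=value options out of a kernel command line. It is not Dracut's own code, which uses its own library helpers. The get_args function name and the sample command-line string are assumptions for illustration, built from the options listed above plus the rhgb flag.

```shell
#!/bin/sh
# Sketch only: extract every value of a given key from a kernel command line.
# On a live system the string would come from /proc/cmdline.
cmdline='root=/dev/mapper/fedora-root rd.luks.uuid=luks-3acca984-01a3-4c67-b9c6-94433f667b88 rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb'

# Hypothetical helper: print one value per line for each "key=value" word.
get_args() {
    key=$1
    for word in $cmdline; do
        case $word in
            "$key"=*) printf '%s\n' "${word#"$key"=}" ;;
        esac
    done
}

get_args root        # prints /dev/mapper/fedora-root
get_args rd.lvm.lv   # prints fedora/root and fedora/swap, one per line
```

Options such as rd.lvm.lv can appear more than once, which is why the helper prints every match rather than stopping at the first.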

Dracut Pre-udev service

The next row in the startup plot shows dracut-pre-udev.service, which runs /usr/bin/dracut-pre-udev, which is a Bash script. I don't see anything significant that this does on my system.

configfs Mounting

The next three rows appear to be related. The third row is sys-kernel-configfs.mount, which makes sure that configfs is mounted at /sys/kernel/config. This unit depends on systemd-modules-load.service; the second row reports starting this.

The first row is for sys-module-fuse.device, that is, loading the fuse module into the kernel as if with modprobe fuse. I don't know why this gets loaded: it is listed in /usr/lib/modules-load.d/open-vm-tools.conf, but the order is wrong for that, and if systemd-modules-load.service were loading it then it would presumably also load the modules listed in /usr/lib/modules-load.d/VirtualBox.conf, which it doesn't.

Dracut Initqueue Service

The next row in the plot is for dracut-initqueue.service. Until now, everything that has been started has either started and run forever, or run to completion deterministically. This new service is different: it runs in a loop until it detects that initialization is done. On my system, this takes a long time overall (over 8 seconds in the run I'm looking at) because it requires prompting me for a password.

The loop does this:

  • Checks whether startup is finished, by running all of the scripts in /lib/dracut/hooks/initqueue/finished/*.sh. If all of them exit successfully, startup is finished.
  • Does various things to help startup along. It's not clear to me that any of these trigger on my system.
  • Eventually times out and offers an emergency shell to the user.
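
The first and last steps of that loop can be sketched as follows. This is an illustration, not Dracut's implementation: the temporary hook directory, the check_finished helper, the fake device file being waited for, and the limit of 5 passes are all assumptions chosen so that the sketch can run anywhere.

```shell
#!/bin/sh
# Sketch only: a "finished" check loop in the style described above.
hookdir=$(mktemp -d)            # stand-in for .../hooks/initqueue/finished
target="$hookdir/fake-device"   # stand-in for a node such as /dev/fedora/root

# A "finished" script succeeds only once the thing it waits for exists.
printf '[ -e "%s" ]\n' "$target" > "$hookdir/devexists.sh"

# Hypothetical helper: succeed only if every *.sh script exits successfully.
check_finished() {
    for script in "$hookdir"/*.sh; do
        [ -e "$script" ] || continue
        sh "$script" || return 1
    done
    return 0
}

tries=0
result=timeout                  # a real loop would offer an emergency shell
while [ "$tries" -lt 5 ]; do
    if check_finished; then
        result=finished
        break
    fi
    tries=$((tries + 1))
    # Simulate the device showing up after a couple of passes;
    # a real loop would wait between passes instead.
    [ "$tries" -eq 2 ] && : > "$target"
done
echo "$result after $tries retries"
rm -rf "$hookdir"
```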

Plymouth Service

The plymouth-start.service unit, on the next row, covers up the kernel initialization messages with a graphical splash screen. The plymouthd program only does that if the kernel command line contains rhgb, which stands for "Red Hat Graphical Boot". (Plymouth is the successor to an older program named rhgb.)

The next row is for systemd-ask-password-plymouth.path. This is a new kind of unit to me, a "path" unit. systemd.path(5) explains that each of these units monitors a file or directory and triggers when something happens. In this case the path unit says:

[Path]
DirectoryNotEmpty=/run/systemd/ask-password
MakeDirectory=yes

This means that when /run/systemd/ask-password is nonempty, systemd triggers the corresponding service unit. There is no Unit setting, so by default the service unit is systemd-ask-password-plymouth.service. There's nothing in the directory being watched yet, so systemd defers starting the service.
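
The DirectoryNotEmpty= condition amounts to a check like the one below. systemd watches the directory with inotify rather than checking by hand, so this sketch only shows the condition itself, using a temporary directory in place of /run/systemd/ask-password. The dir_not_empty helper name and the ask.example file name are made up for the example.

```shell
#!/bin/sh
# Sketch only: the condition behind DirectoryNotEmpty=, checked by hand.
watch_dir=$(mktemp -d)   # stand-in for /run/systemd/ask-password

# Hypothetical helper: succeed if the directory has at least one entry.
dir_not_empty() {
    [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if dir_not_empty "$watch_dir"; then before=trigger; else before=wait; fi

: > "$watch_dir/ask.example"   # a password request appears

if dir_not_empty "$watch_dir"; then after=trigger; else after=wait; fi

echo "before: $before, after: $after"
rm -rf "$watch_dir"
```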

Device Setup

The next 64 (!) rows set up 32 (!) different ttyS<number> serial devices, in random order. I don't know why, since the laptop has no traditional serial ports at all. This doesn't take any real amount of time; it just leaves an annoying number of useless entries in /dev.

Then the next several rows set up all the devices for the SSD. There are five of them for the device itself (and its "namespace" n1):

dev-disk-by-path-pci-0000:3d:00.0-nvme-1.device
dev-disk-by-id-nvme-eui.ace42e0090114c8b.device
dev-disk-by-id-nvme-PC401_NVMe_SK_hynix_1TB_MI93T003610303E06.device
dev-nvme0n1.device
sys-devices-pci0000:00-0000:00:1d.0-0000:3d:00.0-nvme-nvme0-nvme0n1.device

Then seven for the GRUB partition:

dev-disk-by-id-nvme-PC401_NVMe_SK_hynix_1TB_MI93T003610303E06-part2.device
dev-disk-by-id-nvme-eui.ace42e0090114c8b-part2.device
dev-disk-by-partuuid-1d24a3b1-3131-4365-a3a7-dadae8ddf6fd.device
dev-disk-by-path-pci-0000:3d:00.0-nvme-1-part2.device
dev-disk-by-uuid-71baeefe-e3d2-45e1-a0ff-bf651d4a7591.device
dev-nvme0n1p2.device
sys-devices-pci0000:00-0000:00:1d.0-0000:3d:00.0-nvme-nvme0-nvme0n1-nvme0n1p2.device

and the same pattern carries on for the other two partitions. Partition 1 is last; maybe the order is random.

Password Prompting

Amid the disk device setup, just after the setup for the encrypted nvme0n1p3 device, a row lists systemd-cryptsetup@luks-3acca984-01a3-4c67-b9c6-94433f667b88.service. This is the service unit created by the generator described under Encrypted Disk Setup earlier. As shown before, this service runs systemd-cryptsetup to attach the encrypted volume:

ExecStart=/usr/lib/systemd/systemd-cryptsetup attach 'luks-3acca984-01a3-4c67-b9c6-94433f667b88' '/dev/disk/by-uuid/3acca984-01a3-4c67-b9c6-94433f667b88' 'none' 'discard'

The systemd-cryptsetup helper is underdocumented, but running it without arguments gives a little bit of help:

$ /usr/lib/systemd/systemd-cryptsetup
systemd-cryptsetup attach VOLUME SOURCEDEVICE [PASSWORD] [OPTIONS]
systemd-cryptsetup detach VOLUME

Attaches or detaches an encrypted block device.

See the systemd-cryptsetup@.service(8) man page for details.

Peeking into the helper's source code, it calls into ask_password_agent() in systemd's src/shared/ask-password-api.c. In turn, this creates a file /run/systemd/ask-password/ask.<random> with some information about the password to be prompted for.
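
For illustration, a request file of this kind is a small INI-style file. The sketch below writes a made-up one and reads two fields back. The [Ask] section and the PID=, Socket=, and Message= keys follow systemd's documented password agent interface; every value, the file name, and the ask_field helper are hypothetical.

```shell
#!/bin/sh
# Sketch only: a made-up password request file and a reader for it.
ask_dir=$(mktemp -d)              # stand-in for /run/systemd/ask-password
ask_file="$ask_dir/ask.example"   # real files get a random suffix

cat > "$ask_file" <<'EOF'
[Ask]
PID=123
Socket=/run/systemd/ask-password/sck.example
Message=Please enter passphrase for disk example:
EOF

# Hypothetical helper: print the value of one key from the request file.
ask_field() {
    sed -n "s/^$1=//p" "$ask_file"
}

message=$(ask_field Message)
socket=$(ask_field Socket)
echo "prompt:   $message"
echo "reply to: $socket"
rm -rf "$ask_dir"
```

A password agent shows the message to the user and sends the answer to the socket named in the file.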

Now, the path unit described in Plymouth Service triggers its corresponding service unit systemd-ask-password-plymouth.service, which is the next row in the plot.

In turn, the service unit gets systemd to ask for the password:

[Service]
ExecStart=/usr/bin/systemd-tty-ask-password-agent --watch --plymouth

About 7 seconds later, I've finished typing the password, which allows the systemd-cryptsetup@… unit to finish up. The systemd-ask-password-plymouth.service unit exits about a second later, I believe because it is killed by initrd-cleanup.service (see Switching Roots).

The next row in the plot is for sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0-eDP-1-intel_backlight.device. I guess this is just random ordering.

Inner Encrypted Device Setup

The successful unlocking of the encrypted block device made everything inside visible to the device manager. The first set of rows shows the top-level device, which becomes /dev/dm-0. Note that the system considers this device's UUID so fantastic that one of the device nodes contains it twice:

dev-disk-by-id-dm-name-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device
dev-disk-by-id-lvm-pv-uuid-yFbXNn-DjBz-J9c1-c7Pj-QIKZ-JoiH-k5wQWR.device
dev-mapper-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device
dev-disk-by-id-dm-uuid-CRYPT-LUKS2-3acca98401a34c67b9c694433f667b88-luks-3acca984-01a3-4c67-b9c6-94433f667b88.device
dev-dm-0.device
sys-devices-virtual-block-dm-0.device

The next series of rows makes the root device within the encrypted block device visible as /dev/dm-1 aka /dev/fedora/root:

dev-mapper-fedora-root.device
dev-disk-by-uuid-45501fb7-be9f-4f11-97ba-b2c2046fd746.device
dev-disk-by-id-dm-uuid-LVM-mMVYE2s7gIvVmpHUl0cvxbfn8i8NODO8bRKuGAOpNVTVjPrca5QTzb5F3eXjOglK.device
dev-fedora-root.device
dev-disk-by-id-dm-name-fedora-root.device
dev-dm-1.device
sys-devices-virtual-block-dm-1.device

systemd notes in the next row that initrd-root-device.target is now satisfied, because the root file system's device is now available. It isn't mounted yet.

The next series of rows makes the swap device visible as /dev/dm-2 aka /dev/fedora/swap. It isn't being used for swap yet:

dev-disk-by-id-dm-name-fedora-swap.device
dev-mapper-fedora-swap.device
dev-disk-by-id-dm-uuid-LVM-mMVYE2s7gIvVmpHUl0cvxbfn8i8NODO8qImL9xOU61DWKST2SpEEw3aF0kARK3oK.device
dev-disk-by-uuid-71cdb103-4791-4d43-aa4c-49743c36be6e.device
dev-fedora-swap.device
dev-dm-2.device
sys-devices-virtual-block-dm-2.device

Root Filesystem

The next row is systemd-fsck-root.service, which is underdocumented. It looks like a special systemd service, but it isn't documented in systemd.special(7). I guess it fscks /dev/fedora/root.

Next, sysroot.mount mounts /dev/fedora/root on /sysroot. Later, this will become the root of the file system. The sysroot.mount unit was previously generated automatically by systemd-fstab-generator. This generation is hardcoded into the C code of systemd-fstab-generator.

Final initrd Targets

The next row is for initrd-parse-etc.service. This invokes initrd-fs.target, the next row, which is followed by initrd.target. None of these seem to do anything significant on my machine.

Switching Roots

Now it's time for /sysroot to become / and everything not under /sysroot to go away.

The next row is dracut-pre-pivot.service. This runs Dracut's "pre-pivot" and "cleanup" hooks. These don't seem to do anything important on my system. (I've ignored the scripts that Dracut has run, across its various hooks, related to keeping network interface names persistent across boots. I'm continuing to ignore them here. Maybe I'll look into them in later revisions of this document.)

The next row is initrd-cleanup.service, which is defined as:

[Service]
Type=oneshot
ExecStart=systemctl --no-block isolate initrd-switch-root.target

The isolate command was new to me. systemctl(1) says that it starts the argument unit and stops everything else, with an exception for units marked as IgnoreOnIsolate=yes. For me, the latter units (based on grep) are systemd-journald-dev-log.socket and systemd-journald.socket. I see that systemd-journald-audit.socket also continues, so perhaps sockets are generally excepted. In practice, only a few units terminate right at this point (possibly some of these are by coincidence):

dracut-cmdline.service
dracut-pre-udev.service
dracut-initqueue.service
systemd-ask-password-plymouth.service
initrd-root-device.target
initrd.target
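
The grep mentioned above can be sketched like this. The unit files here are made-up, abbreviated stand-ins written to a temporary directory; only the two socket names and the IgnoreOnIsolate=yes setting come from the text, and example.service is hypothetical.

```shell
#!/bin/sh
# Sketch only: find units that opt out of "isolate" with a grep.
unit_dir=$(mktemp -d)   # stand-in for a unit directory in the initrd

# Abbreviated, made-up unit files; only the IgnoreOnIsolate= line matters.
printf '[Unit]\nIgnoreOnIsolate=yes\n' > "$unit_dir/systemd-journald.socket"
printf '[Unit]\nIgnoreOnIsolate=yes\n' > "$unit_dir/systemd-journald-dev-log.socket"
printf '[Unit]\nDescription=Example\n' > "$unit_dir/example.service"

# List the names of the files that contain the setting.
kept=$(cd "$unit_dir" && grep -l '^IgnoreOnIsolate=yes' * | sort)
echo "$kept"
rm -rf "$unit_dir"
```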

Two services are marked as Before=initrd-switch-root.target. These are the next two rows:

  • plymouth-switch-root.service, which runs /usr/bin/plymouth update-root-fs --new-root-dir=/sysroot. plymouth(1) says that this tells the plymouth daemon about the upcoming root file system change. It doesn't say anything about the implications of the root change or the notification, though.
  • initrd-udevadm-cleanup-db.service, which runs udevadm info --cleanup-db. udevadm(8) says that this makes udev clean up its database. It does not explain what this means, but udev(7) gives the hint that the db_persist flag on a database event means that it will be kept even after --cleanup-db. A grep shows that db_persist only shows up in /etc/udev/rules.d/11-dm.rules. This would seem to mean that only device-mapper nodes would be kept across --cleanup-db. This is wrong (the ttyS<number> nodes, at least, are also kept), so there must be more going on. I didn't investigate further.

After that, initrd-switch-root.target triggers. Its service unit, initrd-switch-root.service, runs systemctl --no-block switch-root /sysroot. This does the actual switch of root directories inside systemd's PID 1 process:

  • Creates a memory-based file using memfd_create() and serializes all the systemd state to it.
  • Unsets some limits, etc., so that the new systemd knows what the defaults should be.
  • Switches root. There's a pivot_root system call that it tries to use first, followed by unmounting the old root in its new location. This approach won't necessarily work, so as an alternative it does the system call equivalent of mount --move /sysroot /.
  • Deletes all of the old initrd using the equivalent of rm -rf old_root. This mildly scares the crap out of me, but it should only destroy files that were extracted from the initrd cpio image.
  • Re-execs itself as /usr/lib/systemd/systemd --switched-root --system --deserialize <fd>, where <fd> is a file descriptor for the state serialized earlier.
  • The new systemd starts and initializes itself using the serialized state from <fd>.
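
The serialize, re-exec, deserialize pattern in the first and last two steps can be imitated in a few lines of shell. This is only an analogy: systemd uses memfd_create() and re-executes itself, whereas the sketch uses an unlinked temporary file on descriptor 3 and a child sh process. The state line and the choice of descriptor 3 are made up for the example.

```shell
#!/bin/sh
# Sketch only: hand state to a new process through an inherited descriptor.
state_file=$(mktemp)
printf 'unit=sysroot.mount state=mounted\n' > "$state_file"   # "serialize"

exec 3< "$state_file"   # keep the state open on descriptor 3
rm -f "$state_file"     # no name on disk any more, a bit like a memfd

# The new process inherits descriptor 3 and "deserializes" from it.
restored=$(sh -c 'cat <&3')

exec 3<&-               # close the descriptor again
echo "restored: $restored"
```

The point of passing a descriptor rather than a path is that the state survives even though the file system it was written on is about to disappear.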

Runtime Systemd Initialization

The plot shows the next 40 units all beginning at the same time, meaning that none of them had any unsatisfied dependencies when systemd restarted (or that their dependencies could be satisfied instantly).

Slices

Since the initrd is no longer mounted, at this point we no longer benefit from looking at what we extracted from it. Now the systemd units come from the runtime file system. We can list the directories that systemd searches for them with:

$ systemctl show -p UnitPath --value | tr ' ' '\n'
/etc/systemd/system.control
/run/systemd/system.control
/run/systemd/transient
/run/systemd/generator.early
/etc/systemd/system
/etc/systemd/system.attached
/run/systemd/system
/run/systemd/system.attached
/run/systemd/generator
/usr/local/lib/systemd/system
/usr/lib/systemd/system
/run/systemd/generator.late

Only a fraction of these directories exist, though:

$ for d in $(systemctl show -p UnitPath --value); do if test -d $d; then echo $d; fi; done
/run/systemd/transient
/etc/systemd/system
/run/systemd/system
/run/systemd/generator
/usr/lib/systemd/system
/run/systemd/generator.late

The first five units are slice units, followed a little bit later by the target that some of them mark themselves as Before:

machine.slice
system-getty.slice
system-modprobe.slice
system-systemd-fsck.slice
user.slice
⋮
slices.target

Rather than looking around the file system for these units, it's easier to examine units with systemctl cat, e.g.:

$ systemctl cat machine.slice
# /usr/lib/systemd/system/machine.slice
⋮
[Unit]
Description=Virtual Machine and Container Slice
Documentation=man:systemd.special(7)
Before=slices.target

Some of the above units are mysterious. For example, I can't find system-modprobe.slice anywhere, and systemctl cat doesn't know about it either, even though systemctl list-units shows it as loaded and active. The same is true for system-getty.slice. system-systemd-fsck.slice doesn't even show up in systemctl list-units.

Password Prompting

The next row is for systemd-ask-password-wall.path. Its corresponding service unit has:

ExecStart=systemd-tty-ask-password-agent --wall

Frankly, wall(1) seems to me like a dreadful way to prompt for a password. I hope this doesn't really get used much. I've never noticed it in practice.

No other password agent units are running, though, since the plymouth one terminated earlier, so I guess this is better than nothing. Maybe the idea is that needing passwords after running the initrd is unusual, or that if something likely to need them comes up then maybe a better agent will be configured first.

Automount

The next row is for proc-sys-fs-binfmt_misc.automount. This is a kind of systemd unit that is new to us, documented in systemd.automount(5). When the mount point is first used, systemd invokes the corresponding proc-sys-fs-binfmt_misc.mount unit, which mounts the binfmt_misc filesystem on /proc/sys/fs/binfmt_misc. I don't know why this is automounted instead of mounted proactively. It's the only automount unit on my system. When I checked, it was in fact mounted:

$ mount|grep binfmt
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=293)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)