Linux Kernel Internals--Booting
——
1.1 Building the Linux Kernel Image
This section explains the steps taken during compilation of the Linux kernel and the output produced at each stage. The build process depends on the architecture, so note that we only consider building a Linux/x86 kernel.
When the user types 'make zImage' or 'make bzImage', the resulting bootable kernel image is stored as arch/i386/boot/zImage or arch/i386/boot/bzImage respectively. Here is how the image is built:
C and assembly source files are compiled into ELF relocatable object format (.o) and some of them are grouped logically into archives (.a) using ar(1).
Using ld(1), the above .o and .a are linked into vmlinux, which is a statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
System.map is produced by nm vmlinux; irrelevant or uninteresting symbols are grepped out.
Enter directory arch/i386/boot.
Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__, depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s respectively.
bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or bootsect.s assembled and raw-converted into bootsect for zImage).
Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for bzImage or setup.s for zImage. In the same way as the bootsector code, the difference is marked by -D__BIG_KERNEL__ present for bzImage. The result is then converted into 'raw binary' form called bsetup.
Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format, removing .note and .comment ELF sections.
gzip -9 < $tmppiggy > $tmppiggy.gz
Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
Compile compression routines head.S and misc.c (still in arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.
Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for zImage, don't mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage the compression loader is high-loaded.
Convert bvmlinux to 'raw binary' bvmlinux.out, removing .note and .comment ELF sections.
Go back to arch/i386/boot directory and, using the program tools/build, cat together bbootsect, bsetup and compressed/bvmlinux.out into bzImage (delete extra 'b' above for zImage). This writes important variables like setup_sects and root_dev at the end of the bootsector.
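The final concatenation step can be modelled roughly as follows. This is only a sketch in Python, not the tools/build source: the byte offsets 497 (setup_sects) and 508 (root_dev, minor then major), the four-sector minimum and the zero padding are assumptions based on the 2.4-era bootsector layout, and build_image is a hypothetical name.

```python
SECTOR = 512
SETUP_SECTS_OFFSET = 497   # assumed offset of setup_sects in the bootsector
ROOT_DEV_OFFSET = 508      # assumed offset of root_dev (minor byte, then major)
MIN_SETUP_SECTS = 4        # assumed lower bound on the setup size, in sectors


def build_image(bootsect, setup, kernel, root_major=0, root_minor=0):
    """Concatenate bootsector, padded setup and compressed kernel (hypothetical helper)."""
    if len(bootsect) != SECTOR:
        raise ValueError("boot sector must be exactly 512 bytes")
    # Round the setup code up to whole sectors, never below the minimum.
    setup_sects = max(-(-len(setup) // SECTOR), MIN_SETUP_SECTS)
    padded_setup = setup.ljust(setup_sects * SECTOR, b"\0")
    # Patch the variables that live at the end of the bootsector.
    boot = bytearray(bootsect)
    boot[SETUP_SECTS_OFFSET] = setup_sects
    boot[ROOT_DEV_OFFSET] = root_minor
    boot[ROOT_DEV_OFFSET + 1] = root_major
    return bytes(boot) + padded_setup + kernel


# Made-up inputs, only to exercise the function.
demo_setup = b"\x90" * 700
demo_image = build_image(bytes(SECTOR), demo_setup, b"K" * 10,
                         root_major=3, root_minor=1)
```

Note that, like the real tool as described in this section, the sketch does nothing to enforce an upper bound on the setup size.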
The size of the bootsector is always 512 bytes. The size of the setup must be at least 4 sectors but is limited above by about 12K - the rule is:
0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running bootsector/setup
We will see later where this limitation comes from.
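As a rough illustration of this rule, the sketch below solves the inequality for setup_sects. The amount of stack room is not specified in the text, so the value used in the example (3584 bytes) is purely an assumption, chosen because it reproduces the "about 12K" figure.

```python
SECTOR = 512
LIMIT = 0x4000  # left-hand side of the rule, in bytes


def setup_fits(setup_sects, stack_room):
    """True if 0x4000 >= 512 + setup_sects * 512 + stack_room."""
    return LIMIT >= SECTOR + setup_sects * SECTOR + stack_room


def max_setup_sects(stack_room):
    """Largest setup_sects that still satisfies the rule."""
    return (LIMIT - SECTOR - stack_room) // SECTOR


ASSUMED_STACK_ROOM = 3584  # hypothetical value, not taken from the kernel source
limit_sects = max_setup_sects(ASSUMED_STACK_ROOM)
limit_bytes = limit_sects * SECTOR
```

With no stack room at all the bound would be 31 sectors, so the practical limit depends entirely on how much stack the bootsector and setup code need.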
The upper limit on the bzImage size produced at this step is about 2.5M for booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for booting raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).
Note that while tools/build does validate the size of boot sector, kernel image and lower bound of setup size, it does not check the *upper* bound of said setup size. Therefore it is easy to build a broken kernel by just adding some large ".space" at the end of setup.S.
1.2 Booting: Overview
The boot process details are architecture-specific, so we shall focus our attention on the IBM PC/IA32 architecture. Due to old design and backward compatibility, the PC firmware boots the operating system in an old-fashioned manner. This process can be separated into the following six logical stages:
BIOS selects the boot device.
BIOS loads the bootsector from the boot device.
Bootsector loads setup, decompression routines and compressed kernel image.
The kernel is uncompressed in protected mode.
Low-level initialisation is performed by asm code.
High-level C initialisation.
1.3 Booting: BIOS POST
The power supply starts the clock generator and asserts #POWERGOOD signal on the bus.
CPU #RESET line is asserted (CPU now in real 8086 mode).
%ds=%es=%fs=%gs=%ss=0, %cs=0xFFFF0000, %eip=0x0000FFF0 (ROM BIOS POST code).
All POST checks are performed with interrupts disabled.
IVT (Interrupt Vector Table) initialised at address 0.
The BIOS Bootstrap Loader function is invoked via int 0x19, with %dl containing the boot device 'drive number'. This loads track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).
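The segment:offset notation used for the load address can be checked with a short sketch. The helper name phys is hypothetical, and the last line reads the %cs value given above as a segment base rather than a 16-bit selector, which is an interpretation and not something the text states.

```python
def phys(segment, offset):
    """Real-mode physical address: segment * 16 + offset."""
    return (segment << 4) + offset


# The bootsector load address written as 0x07C0:0000 in the text.
boot_load_address = phys(0x07C0, 0x0000)

# The same byte can be named by many segment:offset pairs.
same_address = phys(0x0000, 0x7C00)

# Assumption: the %cs value above is a base address, so the first
# instruction fetched after reset would be at base + %eip.
reset_vector = 0xFFFF0000 + 0x0000FFF0
```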
1.4 Booting: bootsector and setup
The bootsector used to boot the Linux kernel can be one of:
Linux bootsector (arch/i386/boot/bootsect.S),
LILO (or other bootloader's) bootsector, or
no bootsector (loadlin etc.)
We consider here the Linux bootsector in detail. The first few lines initialise
the convenience macros to be used for segment values:
--------------------------------------------------------------------------------
29 SETUPSECS = 4 /* default nr of setup-sectors */
30 BOOTSEG = 0x07C0 /* original address of boot-sector */
31 INITSEG = DEF_INITSEG /* we move boot here - out of the way */
32 SETUPSEG = DEF_SETUPSEG /* setup starts here */
33 SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
34 SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
--------------------------------------------------------------------------------
(The numbers on the left are the line numbers of the bootsect.S file.) The values of DEF_INITSEG, DEF_SETUPSEG, DEF_SYSSEG and DEF_SYSSIZE are taken from include/asm/boot.h:
--------------------------------------------------------------------------------
/* Don't touch these, unless you really know what you're doing. */
#define DEF_INITSEG 0x9000
#define DEF_SYSSEG 0x1000
#define DEF_SETUPSEG 0x9020
#define DEF_SYSSIZE 0x7F00
--------------------------------------------------------------------------------
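To make these constants more concrete, the sketch below converts them to physical byte addresses and sizes. It only restates the values from boot.h; the derived names (layout, max_system_bytes, system_end) are introduced here for illustration.

```python
DEF_INITSEG = 0x9000
DEF_SYSSEG = 0x1000
DEF_SETUPSEG = 0x9020
DEF_SYSSIZE = 0x7F00   # counted in 16-byte clicks

CLICK = 16             # one segment unit (paragraph/click) is 16 bytes

# Physical start address of each region.
layout = {
    "SYSSEG": DEF_SYSSEG * CLICK,
    "INITSEG": DEF_INITSEG * CLICK,
    "SETUPSEG": DEF_SETUPSEG * CLICK,
}

# Setup starts exactly one sector after the relocated bootsector.
setup_gap = layout["SETUPSEG"] - layout["INITSEG"]

# Default system size in bytes, and where such a system would end.
max_system_bytes = DEF_SYSSIZE * CLICK
system_end = layout["SYSSEG"] + max_system_bytes
```

Under these defaults a system of DEF_SYSSIZE clicks loaded at SYSSEG ends below INITSEG, so it does not overlap the relocated bootsector.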
Now, let us consider the actual code of bootsect.S:
--------------------------------------------------------------------------------
54 movw $BOOTSEG, %ax
55 movw %ax, %ds
56 movw $INITSEG, %ax
57 movw %ax, %es
58 movw $256, %cx
59 subw %si, %si
60 subw %di, %di
61 cld
62 rep
63 movsw
64 ljmp $INITSEG, $go
65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
66 # wouldn't have to worry about this if we checked the top of memory. Also
67 # my BIOS can be configured to put the wini drive tables in high memory
68 # instead of in the vector table. The old stack might have clobbered the
69 # drive table.
70 go: movw $0x4000-12, %di # 0x4000 is an arbitrary value >=
71 # length of bootsect + length of
72 # setup + room for stack;
73 # 12 is disk parm size.
74 movw %ax, %ds # ax and es already contain INITSEG
75 movw %ax, %ss
76 movw %di, %sp # put stack at INITSEG:0x4000-12.
--------------------------------------------------------------------------------
Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000. This is achieved by:
set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
set the number of 16-bit words in %cx (256 words = 512 bytes = 1 sector)
clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
go ahead and copy 512 bytes (rep movsw)
This code intentionally does not use rep movsd (hint: .code16).
Line 64 jumps to label go: in the newly made copy of the bootsector, i.e. in segment 0x9000. This and the following three instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-0xC, i.e. %ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC).
This is where the limit on setup size mentioned earlier comes from (see Building the Linux Kernel Image).
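The relocation and stack setup described above can be modelled on a flat byte array. This is a simplified sketch, not an emulator: rep_movsw is a hypothetical helper that only imitates the effect of 'cld; rep movsw', and the sector contents are made-up test data.

```python
BOOTSEG = 0x07C0
INITSEG = 0x9000


def rep_movsw(mem, ds, si, es, di, cx):
    """Copy cx 16-bit words from ds:si to es:di with DF clear (addresses increment)."""
    while cx:
        src = (ds << 4) + si
        dst = (es << 4) + di
        mem[dst:dst + 2] = mem[src:src + 2]
        si = (si + 2) & 0xFFFF
        di = (di + 2) & 0xFFFF
        cx -= 1
    return si, di, cx


mem = bytearray(1 << 20)                     # 1 MB of real-mode address space
mem[0x7C00:0x7E00] = bytes(range(256)) * 2   # made-up bootsector contents

# Lines 54-63: %ds:%si = BOOTSEG:0, %es:%di = INITSEG:0, %cx = 256 words.
si, di, cx = rep_movsw(mem, BOOTSEG, 0, INITSEG, 0, 256)

# Lines 70-76: stack at INITSEG:0x4000-12.
sp = 0x4000 - 12
stack_top = (INITSEG << 4) + sp
```

In this model %cx ends at zero after the copy, which is what the later `movw %cx, %fs` (line 91) relies on to set %fs to 0.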
Lines 77-103 patch the disk parameter table for the first disk to allow multi-sector reads:
--------------------------------------------------------------------------------
77 # Many BIOS's default disk parameter tables will not recognise
78 # multi-sector reads beyond the maximum sector number specified
79 # in the default diskette parameter tables - this may mean 7
80 # sectors in some cases.
81 #
82 # Since single sector reads are slow and out of the question,
83 # we must take care of this by creating new parameter tables
84 # (for the first disk) in RAM. We will set the maximum sector
85 # count to 36 - the most we will encounter on an ED 2.88.
86 #
87 # High doesn't hurt. Low does.
88 #
89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
90 # and gs is unused.
91 movw %cx, %fs # set fs to 0
92 movw $0x78, %bx # fs:bx is parameter table address
93 pushw %ds
94 ldsw %fs:(%bx), %si # ds:si is source
95 movb $6, %cl # copy 12 bytes
96 pushw %di # di = 0x4000-12.
97 rep # don't need cld -> done on line 66
98 movsw
99 popw %di
100 popw %ds
101 movb $36, 0x4(%di) # patch sector count
102 movw %di, %fs:(%bx)
103 movw %es, %fs:2(%bx)
--------------------------------------------------------------------------------
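The effect of lines 91-103 can be sketched on a flat memory model. The helper name patch_disk_table and the location and contents of the original BIOS table in the example are made up for illustration; only the vector address 0x78, the 12-byte size, the patched byte at offset 4 and the new location INITSEG:0x4000-12 come from the listing.

```python
import struct

INITSEG = 0x9000
VECTOR = 0x78             # fs:bx in the listing, with fs = 0
TABLE_SIZE = 12           # 6 words copied by rep movsw
NEW_OFFSET = 0x4000 - 12  # %di, the same value used for the stack pointer


def patch_disk_table(mem, max_sectors=36):
    """Copy the parameter table into INITSEG, patch it, and repoint the vector."""
    # ldsw %fs:(%bx), %si loads the offset first, then the segment.
    off, seg = struct.unpack_from("<HH", mem, VECTOR)
    src = (seg << 4) + off
    dst = (INITSEG << 4) + NEW_OFFSET
    mem[dst:dst + TABLE_SIZE] = mem[src:src + TABLE_SIZE]
    mem[dst + 4] = max_sectors                                  # movb $36, 0x4(%di)
    struct.pack_into("<HH", mem, VECTOR, NEW_OFFSET, INITSEG)   # lines 102-103
    return dst


# Made-up starting state: a BIOS table at the hypothetical address F000:EFC7.
mem = bytearray(1 << 20)
old_table = bytes(range(0x10, 0x10 + TABLE_SIZE))
mem[0xFEFC7:0xFEFC7 + TABLE_SIZE] = old_table
struct.pack_into("<HH", mem, VECTOR, 0xEFC7, 0xF000)

new_table_addr = patch_disk_table(mem)
```

The original table is left untouched; only the copy in RAM is modified, and the vector at 0x78 is redirected to it.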
The floppy disk controller is reset using BIOS service int 0x13 function 0 (reset FDC) and setup sectors are loaded immediately after the bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using BIOS service int 0x13, function 2 (read sector(s)). This happens during lines 107-124:
--------------------------------------------------------------------------------
107 load_setup:
108 xorb %ah, %ah # reset FDC
109 xorb %dl, %dl
110 int $0x13
111 xorw %dx, %dx # drive 0, head 0
112 movb $0x02, %cl # sector 2, track 0
113 movw $0x0200, %bx # address = 512, in INITSEG
114 movb $0x02, %ah # service 2, "read sector(s)"
115 movb setup_sects, %al # (assume all on head 0, track 0)
116 int $0x13 # read it
117 jnc ok_load_setup # ok - continue
118 pushw %ax # dump error code
119 call print_nl
120 movw %sp, %bp
121 call print_hex
122 popw %ax
123 jmp load_setup
124 ok_load_setup:
--------------------------------------------------------------------------------
If loading failed for some reason (bad floppy or someone pulled the diskette out during the operation), we dump the error code and retry in an endless loop. Unless a retry succeeds, the only way out is to reboot the machine, and retries usually do not succeed (if something is wrong it will only get worse).
If loading setup_sects sectors of setup code succeeded, we jump to label ok_load_setup:.
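The read request issued on lines 111-116 can be summarised with a small model. The register values mirror the listing; the function names, the standard CHS-to-LBA formula and the 2-head, 18-sector floppy geometry used as defaults are assumptions added here for illustration.

```python
INITSEG = 0x9000


def setup_read_request(setup_sects):
    """Register values for the int 0x13 'read sector(s)' call in load_setup."""
    return {
        "ah": 0x02,         # service 2, read sector(s)
        "al": setup_sects,  # number of sectors to read
        "ch": 0,            # track 0
        "cl": 2,            # sector 2 (sector 1 is the bootsector)
        "dh": 0,            # head 0
        "dl": 0,            # drive 0
        "es": INITSEG,
        "bx": 0x0200,       # offset 512 in INITSEG
    }


def chs_to_lba(cyl, head, sector, heads=2, sectors_per_track=18):
    """Standard CHS to linear sector number; the geometry defaults are assumed."""
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


def fits_on_first_track(setup_sects, sectors_per_track=18):
    """The listing assumes bootsector plus setup all sit on head 0, track 0."""
    return 1 + setup_sects <= sectors_per_track


req = setup_read_request(4)
dest = (req["es"] << 4) + req["bx"]
first_lba = chs_to_lba(req["ch"], req["dh"], req["cl"])
```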
Then we proceed to load the compressed kernel image at physical address 0x10000. This is done to preserve the firmware data areas in low memory (0-64K). After the kernel is loaded, we jump to $SETUPSEG:0 (arch/i386/boot/setup.S). Once the firmware data is no longer needed (e.g. no more calls to BIOS), it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000 (physical addresses, of course). This is done by setup.S, which sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c}. This code sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it.
Note that old bootloaders (old versions of LILO) could only load the first 4 sectors of setup, which is why there is code in setup to load the rest of itself if needed. Also, the code in setup has to take care of various combinations of loader type/version vs zImage/bzImage and is therefore highly complex.
Let us examine the kludge in the bootsector code that allows loading a big kernel, also known as "bzImage". The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls BIOS to move data from low to high memory. This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the bootsector can use the lcall instruction to jump to it (inter-segment jump). The reason why it is in setup.S is simply because there is no more space left in bootsect.S (which is strictly not true - there are approximately 4 spare bytes and at least 1 spare byte in bootsect.S but that is not enough, obviously). This routine uses