You are on page 1of 51

Introduction to the Linux Kernel

國立中正大學
資訊工程研究所
羅習五 老師
Kernel mode and kernel space

• The kernel includes a protected memory space


and full access to the hardware.
• This system state and memory space is collect
vely referred to as kernel-space
• A user process executes in
– User-space
– kernel-space
kernel-space

• The kernel typically resides in an elevated syst


em state compared to normal user applicaton
s.
– a protected memory space
– full access to the hardware
• This system state and memory space is collec
tvely referred to as kernel-space
user-space

They see a subset of the machine’s available res


ources and can perform certain system functons
Summary: Executng contexts

The modes a CPU executes in


• In kernel-space, in process context, executng
on behalf of a specific process
• In kernel-space, in interrupt context, not assoc
iated with a process, handling an interrupt
• In user-space, executng user code in a process
The modes a CPU executes in

User space
Process context

Kernel space
(including
bottom half)

CTX
Top halve

ISR
A preemptve kernel

• A preemptve OS (non preemptve kernel) ini


tate a context switch when the OS switch to
user mode.
• A preemptve kernel can also initate a conte
xt switch in kernel mode to satsfy the sched
uling policy.
Different Natures

• No libc support
– There are multple reasons for this, including some
chicken-and-the-egg situatons, but the primary re
ason is speed and size (upcall?).
– Many of the usual libc functons have been impl
emented inside the kernel.
• For example <linux/string.h>
Inline Functons

An inline functon is declared when the keyword


s statc and inline are used as part of the funct
on definiton.
For example:
static inline void wolf(unsigned long tail_size)
– Because they are marked statc , an exported func
ton is not created.
Inline Assembly

unsigned int low, high;


asm volatile("rdtsc" : "=a" (low), "=d" (high))
;
/* low and high now contain the lower and uppe
r 32-bits of the 64-bit tsc */
Branch Annotaton

/* we predict 'success' is nearly a


lways nonzero ... */
if (likely(success)) {
/* ... */
}
No (Easy) Use of Floatng Point

• When a user-space process uses floatng-point


instructons, the kernel manages the transiton
from integer to floatng point mode.
• Unlike user-space, the kernel does not have th
e luxury of seamless support for floatng point
because it cannot easily trap itself.
Small, Fixed-Size Stack

• The user-space stack can dynamically grow


• The kernel stack is small and fixed in size
– Historically, the kernel stack is two pages.
– Recently, the size of kernel stack is one page only!!
!
Process control block
(task_struct, thread_info)

• The task_struct structure is allocated via t


he slab allocator.
• To support “4k task_struct“, thread_inf
o was created that again lives at the bottom of
the stack.
thread_info
task_struct

The process descriptor


(task_struct) conta
ins all the informaton
about a specific proces
s.
task_struct
• State and executon informaton
– such as pending signals, binary format, pid, pointers to parents
and other related processes, priorites, and CPU tme
• Informaton on allocated virtual memory.
• Process credentals such as user and group ID, capabilites.
• Files used: filesystem informaton on all files handled by t
he process must be saved.
• Thread informaton, which records the CPU-specific
• Informaton on interprocess communicaton (IPC)
• Signal handlers used by the process.
task_struct and thread_info
Identfying the current process
• current is calculated by masking out the 13 (de
pends on the stack size) least significant bits of
the stack pointer to obtain the thread_info
structure.
8KB
Process State

The process
becomes runnable
if it receives a
signal.

the task does not


respond to signals
in this state
do_fork

long do_fork(unsigned long clone_flags, unsigned long stack_start,


struct pt_regs *regs, unsigned long stack_size,
int __user *parent_tidptr, int __user *child_tidptr)

• clone_flags: A flag set to specify duplication properties.


• stack_start: The start address of the user mode stack to be
used.
• regs: holding all registers in the order in which they are saved on
the parent’s kernel stack when a system call is executed
• parent_tdptr and child_tdptr: two pointers to addresses in use
rspace that hold the TIDs of the parent and child processes.
Code flow diagram for do_fork
copy_process

/* Perform scheduler related setup.


Assign this task to a CPU. */
flags (CLONE_XXX & copy_XXX)
flags (CLONE_XXX & copy_XXX)
Kernel Threads
• kernel threads do not have an address space (in
fact, their mm pointer is NULL).
• Kernel threads are, however, schedulable and p
reemptable as normal processes.
– Example: pdflush task and the ksoftirqd task.
• A kernel thread can be created only by another
kernel thread.
– int kernel_thread(int (*fn)(void *), void * arg, unsig
ned long flags)
the kernel threads
• run the command ps –ef. You should see some
thing like this…
• Kernel process names are surrounded by squa
re brackets ([]).
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Jan04 ? 00:00:00 /sbin/init
root 2 0 0 Jan04 ? 00:00:00 [kthreadd]
root 3 2 0 Jan04 ? 00:00:02 [ksoftirqd/0]
root 5 2 0 Jan04 ? 00:00:00 [kworker/u:0]
root 6 2 0 Jan04 ? 00:00:00 [migration/0]
root 7 2 0 Jan04 ? 00:00:00 [migration/1]
root 9 2 0 Jan04 ? 00:00:00 [ksoftirqd/1]
root 10 2 0 Jan04 ? 00:00:02 [kworker/0:1]
do_execve
int do_execve(char * filename,
char __user *__user *argv,
char __user *__user *envp,
struct pt_regs * regs)

• filename: the executable file

• argv & envp: The argument vector and environment

• regs: holding all registers in the order in which they are s


aved on the kernel stack when the system call is executed
Code flow diagram for
do_execve
Process terminaton

5. calls __exit_files(), __exit_fs(), exit_namespac


e(), and exit_sighand() to decrement the usag
e count of objects
6. sets the task's exit code, stored in the exit_co
de member of the task_struct
7. calls exit_notfy() to send signals to the task's
parent
8. calls schedule()
Removal of the Process Descriptor

After the parent has obtained informaton on its


terminated child (by invoking wait4), or signified
to the kernel that it does not care, the child's tas
k_struct is deallocated.
UNIX Scheduling Policy
• Scheduling policy determines what runs when
– fast process response tme (low latency)
– maximal system utlizaton (high throughput)
• Processes classificaton:
– I/O-bound processes: spends much of its tme submitting
and waitng on I/O requests
– Processor-bound processes: spend much of their tme ex
ecutng code
• Unix variants tends to favor I/O-bound processes, th
us providing good process response tme
32
Linux scheduler –
Process Priority
• Linux’s priority-based scheduling
– Rank processes based on their worth and need for processor tme.
– processes with a higher priority also receive a longer tmeslice.
– Both the user and the system may set a process's priority to influe
nce the scheduling behavior of the system.
• Dynamic priority-based scheduling
– Begins with an inital base priority
– Then enables the scheduler to increase or decrease the priority dy
namically to fulfill scheduling objectves.
– E.g., a process that is spending more tme waitng on I/O will recei
ve an elevated dynamic priority.

33
Timeslice
• The tmeslice is the numeric value that represen
ts how long a task can run untl it is pre-empted.
– too short => large overhead of switching process
– too long => poor interactve response
• Linux’s CFS scheduler does not directly assign t
meslices to processes.
– CFS assigns processes a proportion of the processor.
– the amount of processor tme that a process receive
s is a functon of the load of the system

34
2.4 scheduler - SMP
busy

run queue
busy

35
2.4 scheduler

• Non-preemptble kernel
– Set p->need_resched if schedule() should be invok
ed at the ‘next opportunity‘ (kernel => user mode)
.
• Round-robin
– task_struct->counter: number of clock tcks left to
run in this scheduling slice, decremented by a tme
r.

36
2.4 scheduler

1. Check if schedule() was invoked from interrupt


handler (due to a bug) and panic if so.
2. Use spin_lock_irq() to lock ‘runqueue_lock’
3. Check if a task is ‘runnable’
– in TASK_RUNNING state
– in TASK_INTERRUPTIBLE state and a signal is pending
4. Examine the ‘goodness’ of each process
5. Context switch
37
2.4 scheduler – ‘goodness’

• ‘goodness’: identfying the best candidate amo


ng all processes in the runqueue list.
– ‘goodness’ = 0: the entty has exhausted its quantu
m.
– 0 < ‘goodness’ < 1000: the entty is a conventonal
process/thread that has not exhausted its quantu
m; a higher value denotes a higher level of goodne
ss.

38
2.4 scheduler – ‘goodness’
(to improve multthreading performance)

if (p->mm == prev->mm)
return p->counter + p->priority + 1;
else
return p->counter + p->priority;

• A small bonus is given to the task p if it share


s the address space with the previous task.

39
2.4 scheduler - SMP

Examine the processor field of the processes an


d gives a consistent bonus (that is PROC_CHANG
E_PENALTY, usually 15) to the process that was l
ast executed on the ‘this_cpu’ CPU.

40
Scheduling policy

tme_quantumnew = bonusI/O + tmestatc

tme_quantumnew = tme_quantumold/2 + tme_q


uantum_table[statc_priority]
&
dynamic_priority ≈ tme_quantumnew

41
2.6 scheduler

run queue

task migraton
(put + pull)

run queue

42
2.6 scheduler –
User Preempton

• User preempton can occur


– When returning to user-space from a system call
– When returning to user-space from an interrupt h
andler

43
2.6 scheduler –
Kernel Preempton
• The Linux kernel is a fully preemptve kernel.
– It is possible to preempt a task at any point, so long as the kernel is in a sta
te in which it is safe to reschedule.
– “safe to reschedule”: kernel does not hold a lock
• The Linux design:
– adding of a preempton counter, preempt_count, to each process's thread
_info
– This count increments once for each lock that is acquired and decrements
once for each lock that is released
• Kernel preempton can also occur explicitly, when a task in the kern
el blocks or explicitly calls schedule().
– no additional logic is required to ensure that the kernel is in a state that is
safe to preempt!

44
 Most tasks have dynamic priorities

runqueue
that are based on their “nice” value
(static priority) plus or minus 5
 Interactivity of a task ≈
1/sleep_time

dynPrio = staticPrio +
bonus
tsk1
bonus = -5 ~ +5
bonus ≈ 1/sleep_time
tsk3

tsk2 tsk3

I/O
bound
expired

active

45
 When all tasks have exhausted

runqueue
their time slices, the two priority
arrays are exchanged!

tsk1

tsk3

tsk2
expired

active

46
2.6 scheduler –
CFS
• Classical schedulers compute tme slices for ea
ch process in the system and allow them to ru
n untl their tme slice/quantum is used up.
– After that, all process need to be recalculated.
• CFS considers only the wait tme of a process
– The task with the most need for CPU tme is sched
uled.

47
2.6 scheduler –
CFS

48
vruntme
• The vruntime variable stores the virtual runtm
e of a process, which is the actual runtme norm
alized by the number of runnable processes.
• The virtual runtme’s units are nanoseconds and
therefore vruntme is decoupled from the tmer
tck.
• The virtual runtme is used to help us approxim
ate the “ideal multtasking processor” that CFS is
modeling.
49
Adding Processes to the Tree
This would occur when a process becomes runn
able (wakes up) or is first created via fork().

se->vruntime += cfs_rq->min_vruntime;
update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);
update_stats_enqueue(cfs_rq, se);
__enqueue_entty(cfs_rq, se);

請參考課本
50
homework (繳交報告)
• 比較 2.4 、 O(1) 、 CFS 的差異,至少必須
包含:
– 如何分配 tme slice
– 如何決定“下一個 task”
– 如何藉由一個比較好的 scheduling algorithm 提
高整個系統的 throughput ( hint : i/o-bound &
cpu-bound )
• 你或許可以添加下列內容
– 找出一個例子,完整的說明這三種 scheduling al
gorithm 的優缺點
51

You might also like