Linux内核分析之调度算法-服务器评测

linux调度算法在2.6.32中采用调度类实现模块式的调度方式。这样，能够很好的加入新的调度算法。

linux调度器是以模块方式提供的，这样做的目的是允许不同类型的进程可以有针对性地选择调度算法。这种模块化结构被称为调度器类，他允许多种不同哦可动态添加的调度算法并存，调度属于自己范畴的进程。每个调度器都有一个优先级，调度代码会按照优先级遍历调度类，拥有一个可执行进程的最高优先级的调度器类胜出，去选择下面要执行的那个程序。

linux上主要有两大类调度算法，CFS(完全公平调度算法）和实时调度算法。宏SCHED_NOMAL主要用于CFS调度，而SCHED_FIFO和SCHED_RR主要用于实时调度。如下面的宏定义：

* Scheduling policies

/*支援Real-Time Task的排程,包括有SCHED_FIFO與SCHED_RR.

/*(也稱為SCHED_OTHER): 主要用以排程

一般目的的Task.*/

#define SCHED_NORMAL 0

#define SCHED_FIFO 1

/*task預設的 Time Slice長度為100 msecs*/

#define SCHED_RR 2

/*主要用以讓Task可以延長執行的時間

(Time Slice),減少被中斷發生Task Context-Switch

的次數.藉此可以提高 Cache的利用率

(每次Context-Switch都會導致Cache-Flush). 比

較適合用在固定週期執行的Batch Jobs任

務主機上,而不適合用在需要使用者互

動的產品 (會由於Task切換的延遲,而

感覺到系統效能不佳或是反應太慢).*/

#define SCHED_BATCH 3

/* SCHED_ISO: reserved but not implemented yet */

/*為系統中的Idle Task排程.*/

#define SCHED_IDLE 5

linux调度算法实现的高层数据结构主要有运行实体、调度类、运行队列，下面我们主要看看这几个数据结构的字段和意义。

运行实体，rq结构体每个cpu有一个，主要存储一些基本的用于调度的信息，包括实时调度的和CFS调度的

/*每个处理器都会配置一个rq*/

struct rq {

/* runqueue lock: */

spinlock_t lock;

* nr_running and cpu_load should be in the same cacheline because

* remote CPUs use both these fields when doing load calculation.

/*用以记录目前处理器rq中执行task的数量*/

unsigned long nr_running;

#define CPU_LOAD_IDX_MAX 5

/*用以表示处理器的负载，在每个处理器的rq中

都会有对应到该处理器的cpu_load参数配置，在每次

处理器触发scheduler tick时，都会呼叫函数

update_cpu_load_active,进行cpu_load的更新。在系统初始化的时候

会呼叫函数sched_init把rq的cpu_load array初始化为0.

了解他的更新方式最好的方式是通过函数update_cpu_load,公式如下澹?

cpu_load[0]会直接等待rq中load.weight的值。

cpu_load[1]=(cpu_load[1]*(2-1)+cpu_load[0])/2

cpu_load[2]=(cpu_load[2]*(4-1)+cpu_load[0])/4

cpu_load[3]=(cpu_load[3]*(8-1)+cpu_load[0])/8

cpu_load[4]=(cpu_load[4]*(16-1)+cpu_load[0]/16

呼叫函数this_cpu_load时，所返回的cpu load值是cpu_load[0]

而在进行cpu blance或migration时，就会呼叫函数

source_load target_load取得对该处理器cpu_load index值，

来进行计算*/

unsigned long cpu_load[CPU_LOAD_IDX_MAX];

#ifdef CONFIG_NO_HZ

unsigned long last_tick_seen;

unsigned char in_nohz_recently;

#endif

/* capture load from *all* tasks on this cpu: */

/*load->weight值，会是目前所执行的schedule entity的

load->weight的总和，也就是说rq的load->weight越高，

也表示所负责的排程单元load->weight总和越高

表示处理器所负荷的执行单元也越重*/

struct load_weight load;

/*在每次scheduler tick中呼叫update_cpu_load时，

这个值就增加一，可以用来反馈目前cpu

load更新的次数*/

unsigned long nr_load_updates;

/*用来累加处理器进行context switch的次数，会在

函数schedule呼叫时进行累加，并可以通过函数

nr_context_switches统计目前所有处理器总共的context switch

次数，或是可以透过查看档案/proc/stat中的ctxt位得知目前

整个系统触发context switch的次数*/

u64 nr_switches;

u64 nr_migrations_in;

/*为cfs fair scheduling class 的rq*/

struct cfs_rq cfs;

/*为real-time scheduling class 的rq*/

struct rt_rq rt;

/*用以支援可以group cfs tasks的机制*/

#ifdef CONFIG_FAIR_GROUP_SCHED

/* list of leaf cfs_rq on this cpu: */

/*在有设置fair group scheduling 的环境下，

会基于原本cfs rq中包含有若干task的group

所成的排程集合，也就是说当有一个group a

就会有自己的cfs rq用来排程自己所属的tasks,

而属于这group a的tasks所使用到的处理器时间

就会以这group a总共所分的的时间为上限。

基于cgroup的fair group scheduling 架构，可以创造出

有阶层性的task组织，根据不同task的功能群组化

在配置给该群主对应的处理器资源，让属于

该群主下的task可以透过rq机制排程。使用属于

该群主下的资源。

这个变数主要是管理CFS RQ list，操作上可以透过函数

list_add_leaf_cfs_rq把一个group cfs rq加入到list中，或透过

函数list_del_leaf_cfs_rq把一个group cfs rq移除，并可以

透过for_each_leaf_cfs_rq把一个rq上得所有leaf cfs_rq走一遍

struct list_head leaf_cfs_rq_list;

#endif

/*用以支援可以group real-time tasks的机制*/

#ifdef CONFIG_RT_GROUP_SCHED

/*类似leaf_cfs_rq_list所扮演的角色，只是这里

是针对属于real-time的task,在实际操作上可以

透过函数list_add_leaf_rt_rq,list_del_leaf_rt_rq或

巨集for_each_leaf_rt_rq*/

struct list_head leaf_rt_rq_list;

#endif

* This is part of a global counter where only the total sum

* over all CPUs matters. A task can increase this counter on

* one CPU and if it got migrated afterwards it may decrease

* it on another CPU. Always updated under the runqueue lock:

/*一般来说，linux kernel 的task状态可以为TASK_RUNNING

TASK_INTERRUPTIBLE(sleep),

TASK_UNINTERRUPTIBLE(Deactivate Task,此时Task会从rq中

移除)或TASK_STOPPED.

透过这个变数会统计目前rq中有多少task属于

TASK_UNINTERRUPTIBLE的状态。当呼叫函数

active_task时，会把nr_uninterruptible值减一，并透过该函数

enqueue_task把对应的task依据所在的scheduling class

放在对应的rq中，并把目前rq中nr_running值加一*/

unsigned long nr_uninterruptible;

/*curr:指向目前处理器正在执行的task;

idle:指向属于idle-task scheduling class 的idle task;

stop:指向目前最高等级属于stop-task scheduling class

的task;*/

struct task_struct *curr, *idle;

/*基于处理器的jiffies值，用以记录下次进行处理器

balancing 的时间点*/

unsigned long next_balance;

/*用以存储context-switch发生时，前一个task的memory management

结构并可用在函数finish_task_switch中，透过函数mmdrop释放前一个

task的记忆体资源*/

struct mm_struct *prev_mm;

/*用以记录目前rq的clock值，基本上该值会等于透过sched_clock_cpu

(cpu_of(rq))的回传值，并会在每次呼叫scheduler_tick时透过

函数update_rq_clock更新目前rq clock值。

在实作部分，函数sched_clock_cpu会透过sched_clock_local或

ched_clock_remote取得对应的sched_clock_data,而处理的sched_clock_data

值，会透过函数sched_clock_tick在每次呼叫scheduler_tick时进行更新；

u64 clock;

/*用以记录目前rq中有多少task处于等待i/o的sleep状态

在实际的使用上，例如当driver接受来自task的调用，但处于等待i/o

回复的阶段时，为了充分利用处理器的执行资源，这时

就可以在driver中呼叫函数io_schedule，此时

就会把目前rq中的nr_iowait加一，并设定目前task的io_wait为1

然后触发scheduling 让其他task有机会可以得到处理器执行时间*/

atomic_t nr_iowait;

#ifdef CONFIG_SMP

/*root domain是基于多核心架构下的机制，

会由rq结构记住目前采用的root domain，其中包括了

目前的cpu mask(包括span,online rt overload), reference count 跟cpupri

当root domain有被rq参考到时，refcount 就加一，反之就减一。而cpu

mask span表示rq可挂上的cpu mask,noline为rq目前已经排程的

cpu mask cpu上执行real-time task.可以参考函数pull_rt_task，当一个rq中属于

real-time的task已经执行完毕，就会透过函数pull_rt_task从该

rq中属于rto_mask cpu mask 可以执行的处理器上，找出是否有一个处理器

有大于一个以上的real-time task，若有就会转到目前这个执行完成

real-time task 的处理器上

而cpupri不同于Task本身有区分140個(0-139)

Task Priority (0-99為RT Priority 而 100-139為Nice值 -20-19).

CPU Priority本身有102個Priority (包括,-1 為Invalid,

0為Idle,1為Normal,2-101對應到Real-Time Priority 0-99).

參考函式convert_prio, Task Priority如果是 140就會對應到

CPU Idle,如果是大於等於100就會對應到CPU Normal,

若是Task Priority介於0-99之間,就會對應到CPU Real-Time Priority 101-2之間.)

在實際的操作上,例如可以透過函式cpupri_find

帶入一個要插入的Real-Time Task,此時就會依據cpupri中

pri_to_cpu選擇一個目前執行Real-Time Task且該Task

的優先級比目前要插入的Task更低的處理器,

並透過CPU Mask(lowest_mask)返回目前可以選擇的處理器Mask.

實作的部份可以參考檔案kernel/sched_cpupri.c.

在初始化的過程中,會透過函式sched_init呼叫函式init_defrootdomain,

對Root Domain與 CPU Priority機制進行初始化.

struct root_domain *rd;

/*Schedule Domain是基於多核心架構下的機制.

每個處理器都會有一個基礎的Scheduling Domain,

Scheduling Domain可以有階層性的架構,透過parent

可以找到上一層的Domain,或是透過child找到

下一層的 Domain (NULL表示結尾.).並可透過span

栏位,表示這個Domain所能涵蓋的處理器範圍.

通常Base Domain會涵蓋系統中所有處理器的個數,

而Child Domain所能涵蓋的處理器個數不超過它的

Parent Domain. 而當在進行Scheduling Domain 中的Task Balance

時,就會以該Domain所能涵蓋的處理器為最大範圍.

同時,每個Schedule Domain都會包括一個或一個以上的

CPU Groups (結構為struct sched_group),並透過next變數把

CPU Groups串連在一起(成為一個單向的Circular linked list),

每個CPU Group都會有變數cpumask來定义這個CPU Group

所涵蓋的處理器範圍.並且CPU Group所包括的處理器

範圍,必需涵蓋在所屬的Schedule Domain處理器範圍中.

當進行Scheduling Domain的Balancing時,會以其下的CPU Groups

為單位,根據cpu_power （會是該Group所涵蓋的處理器

Tasks Loading的總和）來比較不同的CPU Groups的負荷,

以進行Tasks的移動,達到Balancing的目的.

在有支援SMP的架構下,會在函式sched_init中,呼叫open_softirq,

註冊 SCHED_SOFTIRQ Software IRQ与其对应的 Callback函式

run_rebalance_domains. 並會在每次呼叫函式scheduler_tick時,

透過函式trigger_load_balance确认是否目前的jiffies值已經

大於RunQueue下一次要觸發Load Balance的next_balance時間值,

並透過函式raise_softirq觸發SCHED_SOFTIRQ Software IRQ.

在Software IRQ觸發後,就會呼叫函式run_rebalance_domains,

並在函式rebalance_domains中,進行后续處理器上的

Scheduling Domain Load Balance動作.

有關Scheduling Domain進一步的內容,也可以參考

Linux Kernel文件 Documentation/scheduler/sched-domains.txt.

struct sched_domain *sd;

/*這值會等於函式idle_cpu的返回值,如果為1表示

目前CPU RunQueue中執行的為Idle Task. 反之為0,

則表示處理器執行的不是Idle Task (也就是說

處理器正在忙碌中.).*/

unsigned char idle_at_tick;

/* For active balancing */

/*若這值不為0,表示會有在Schedule排程動作

結束前,要呼叫的收尾函式. (实作為inline函式

post_schedule in kernel/sched.c),目前只有Real-Time Scheduling

Class有支援這個機制(會呼叫函式has_pushable_tasks

in kernel/sched_rt.c).*/

int post_schedule;

/*當RunQueue中此值為1,表示這個RunQueue正在進行

Fair Scheduling的Load Balance,此時會呼叫stop_one_cpu_nowait

暫停該RunQueue所屬處理器的排程,並透過函式

active_load_balance_cpu_stop,把Tasks從最忙碌的處理器,

移到Idle的處理器上執行.*/

int active_balance;

/*用以儲存目前進入Idle且負責進行 Load Balance

流程的處理器ID. 呼叫的流程為,在呼叫函式schedule時,

若該處理器RunQueue的nr_running為0 (也就是目前沒有

正在執行的Task),就會呼叫idle_balance,並觸發後續Load

Balance流程.*/

int push_cpu;

/* cpu of this runqueue: */

/*用以儲存目前运作這個RunQueue的處理器ID*/

int cpu;

/*為1表示目前此RunQueue有在對應的處理器掛上

並執行.*/

int online;

/*如果RunQueue中目前有Task正在執行,這個值會

等於目前該RunQueue的Load Weight除以目前RunQueue

中Task數目的均值.

(rq->avg_load_per_task = rq->load.weight / nr_running;).*/

unsigned long avg_load_per_task;

struct task_struct *migration_thread;

struct list_head migration_queue;

/*這個值會由Real-Time Scheduling Class呼叫函式

update_curr_rt,用以統計目前Real-Time Task執行時間的

均值,在這函式中會以目前RunQueue的clock_task

減去目前Task執行的起始時間,取得執行時間的

Delta值. (delta_exec = rq->clock_task – curr->se.exec_start; ).

在透過函式sched_rt_avg_update把這Delta值跟原本

RunQueue中的rt_avg值取平均值. 以運作的週期來看,

這個值可反應目前系統中Real-Time Task平均被

分配到的執行時間值.*/

u64 rt_avg;

/*這個值主要在函式sched_avg_update更新,以笔者手中

的Linux Kernel 2.6.38.6的實作來說,當RunQueue Clock

減去age_stamp大於 0.5秒 (=sched_avg_period),就會把這值

累加0.5秒 (單位都是nanoseconds). 從函式scale_rt_power

的實作來說,age_stamp值離RunQueue Clock越遠,表示total

值越大,available值也越大,而函式scale_rt_power返回的

div_u64計算結果也越大,最終 RunQueue的cpu_power

與Scheduling Domain中的Scheduling Group的cpu_power

值也就越大. (可參考函式update_cpu_power的實作).*/

u64 age_stamp;

/*這值會在觸發Scheduling時,若判斷目前處理器

RunQueue沒有正在運作的Task,就會透過函式

idle_balance更新這值為為目前RunQueue的clock值.

可用以表示這個處理器是何時進入到Idle的

狀態*/

u64 idle_stamp;

/*會在有Task運作且idle_stamp不為0 (表示前一個

狀態是在Idle)時以目前RunQueue的clock減去

idle_stmp所計算出的Delta值為依據,更新這個值

. 可反應目前處理器進入Idle狀態的時間長短*/

u64 avg_idle;

#endif

/* calc_load related fields */

/*用以記錄下一次計算CPU Load的時間,初始值

為目前的jiffies加上五秒與1次的Scheduling Tick的

間隔 (=jiffies + LOAD_FREQ,且LOAD_FREQ=(5*HZ+1))*/

unsigned long calc_load_update;

/*會等於RunQueue中nr_running與nr_uninterruptible的

總和.（可參考函式calc_load_fold_active）.*/

long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK

#ifdef CONFIG_SMP

/*在函式init_rq_hrtick初始化RunQueue High-Resolution

Tick時,此值預設為0.

在函式hrtick_start中,會判斷目前觸發的RunQueue

跟目前處理器所使用的RunQueue是否一致,

若是,就直接呼叫函式hrtimer_restart,反之就會

依據RunQueue中hrtick_csd_pending的值,如果

hrtick_csd_pending為0,就會透過函式

__smp_call_function_single讓RunQueue所在的另一個

處理器執行rq->hrtick_csd.func 所指到的函式

__hrtick_start. 並等待該處理器執行完畢後,

才重新把hrtick_csd_pending設定為1.

也就是說, RunQueue的hrtick_csd_pending是用來作為

SMP架構下,由處理器A觸發處理器B執行

_hrtick_start函式的一個保護機制.而有關在

SMP下如何由一個處理器觸發另一個處理器

執行函式的機制,可以參考kernel/smp.c中

相關smp_call_function_xxxxxxx的實作.s*/

int hrtick_csd_pending;

/*用以儲存hrtick機制中,要跨處理器執行的

函式結構.*/

struct call_single_data hrtick_csd;

#endif

/*為High-Resolution Tick的结构,會透過函式

hrtimer_init初始化.*/

struct hrtimer hrtick_timer;

#endif

#ifdef CONFIG_SCHEDSTATS

/* latency stats */

/*為Scheduling Info.的統計結構,可以參考

include/linux/sched.h中的宣告. 例如在每次觸發

Schedule時,呼叫函式schedule_debug對上一個Task

的lock_depth進行確認(Fork一個新的Process 時,

會把此值預設為-1就是No-Lock,當呼叫

Kernel Lock時, 就會把Current Task的lock_depth加一.),

若lock_depth>=0,就會累加Scheduling Info.的bkl_count值,

用以代表Task Blocking的次數.*/

struct sched_info rq_sched_info;

/*可用以表示RunQueue中的Task所得到CPU執行

時間的累加值.

在發生Task Switch時,會透過sched_info_switch呼叫

sched_info_arrive並以目前RunQueue Clock值更新

Task 的sched_info.last_arrival時��,而在Task所分配時間

結束後,會在函式sched_info_depart中以現在的

RunQueue Clock值減去Task的sched_info.last_arrival

時間值,得到的 Delta作為變數rq_cpu_time的累

加值.*/

unsigned long long rq_cpu_time;

/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

/* sys_sched_yield() stats */

/*用以統計呼叫System Call sys_sched_yield的次數.*/

unsigned int yld_count;

/* schedule() stats */

unsigned int sched_switch;

/*可用以統計觸發Scheduling的次數. 在每次觸發

Scheduling時,會透過函式schedule呼叫schedule_debug,

呼叫schedstat_inc對這變數進行累加.*/

unsigned int sched_count;

/*可用以統計進入到Idle Task的次數. 會在函式

pick_next_task_idle中,呼叫schedstat_inc對這變數進行

累加.*/

unsigned int sched_goidle;

/* try_to_wake_up() stats */

/*用以統計Wake Up Task的次數.*/

unsigned int ttwu_count;

/*用以統計Wake Up 同一個處理器Task的次數.*/

unsigned int ttwu_local;

/* BKL stats */

unsigned int bkl_count;

#endif

};

调度类，sched_class为对模块编程的上层支持，对于每个linux新添加进来的调度算法都需要有自己的调度类实例。

/*CFS排程機制在設計時,考慮到排程機制的

彈性,定義了Scheduler Class的機制,讓排程機制

可以根據設計的需求,延伸不同的排程模

組進來,每個新加入的排程機制都必須要

提供Scheduler Class的實作,結構為 struct sched_class*/

struct sched_class {

/*會指向下一個Scheduling Class,以筆者所採用

的Linux Kernel 2.6.38.6而言,Scheduling Class的順序為

stop_sched_class->rt_sched_class->fair_sched_class->idle_sched_class*/

const struct sched_class *next;

/*當Task屬於Runnable狀態時,就會呼叫這個函式

把Task配置到RunQueue RBTree中,進行排程動作,

並呼叫inc_nr_running將RunQueue中nr_running的值

加一.(nr_running用以代表目前RunQueue有多少

Runnable Task進行排程)*/

void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);

/*當Task不需要執行時,就會呼叫這個函式

把Task從RunQueue RBTree中移除,並呼叫

dec_nr_running將RunQueue中nr_running的值減一.*/

void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);

/*用以暫停目前正在執行中的Task,如果

sysctl_sched_compat_yield有設定,就會找出目前

RBTree中最右邊的Task(也就是vrruntime最多

的Task),讓目前Task的vrruntime值等於最右邊

Task值的vrruntime加一(可參考:

se->vruntime = rightmost->vruntime + 1),如此在下次

排程觸發時就會透過函式put_prev_task把目前

的Task放到RBTree的最右邊,也就等同於暫停

Task,讓該Task下次被執行到的機會最低.*/

void (*yield_task) (struct rq *rq);

/*用以決定一個Task是否可以中斷目前正在

運作的Task,取得執行權.以CFS本身的實作來說

(in sched_fair.c).如果想要取代目前Task的Task本身

的Scheduling Policy為 Batch或是Idle時,會直接返回,

不會用來取代目前正在執行中的Task.反之,

如果目前正在執行中的Task的Scheduling Policy

為Idle,就會直接由所傳入的Task取代目前正

在執行的Task.*/

void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

/*用以在排程觸發時,從RunQueue RBTree中,

取出符合目前Scheduling邏輯的下一個要

被執行的Task.*/

struct task_struct * (*pick_next_task) (struct rq *rq);

/*用以在排程觸發時,把上一個執行完畢的

Task放到目前RunQueue RBTree中對應的位置.*/

void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP

/*通常用在執行一個新的程序,或是WakeUp

一個Task時,會根據目前SMP下每個處理器的

負荷,決定Task是否要切換到另一個處理器

的RunQueue去執行,執行時會返回最後目標

處理器的值.*/

int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,

struct rq *busiest, unsigned long max_load_move,

struct sched_domain *sd, enum cpu_idle_type idle,

int *all_pinned, int *this_best_prio);

int (*move_one_task) (struct rq *this_rq, int this_cpu,

struct rq *busiest, struct sched_domain *sd,

enum cpu_idle_type idle);

void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);

void (*post_schedule) (struct rq *this_rq);

void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);

void (*set_cpus_allowed)(struct task_struct *p,

const struct cpumask *newmask);

void (*rq_online)(struct rq *rq);

void (*rq_offline)(struct rq *rq);

#endif

/*這個函式用以改變Task目前所屬的Scheduling

Class與改變Task Group.*/

void (*set_curr_task) (struct rq *rq);

/*這是Scheduler的 Timer Tick來源,系統中觸發的

Scheduling Tick會呼叫這個函式 (看HZ設定多少,

100就是每秒呼叫這函式100次,1000就是每秒

呼叫這函式1000次),

用以讓排程機制可以決定哪些Task應該要配

執行與哪些Task應該要被移出RunQueue.*/

void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);

void (*task_new) (struct rq *rq, struct task_struct *p);

void (*switched_from) (struct rq *this_rq, struct task_struct *task,

int running);

void (*switched_to) (struct rq *this_rq, struct task_struct *task,

int running);

void (*prio_changed) (struct rq *this_rq, struct task_struct *task,

int oldprio, int running);

unsigned int (*get_rr_interval) (struct task_struct *task);

#ifdef CONFIG_FAIR_GROUP_SCHED

void (*moved_group) (struct task_struct *p);

#endif

};

调度实体，调度实体用于调度时间记账，linux中CFS和实时调度使用不同的调度实体。调度运行队列，对于不用的调度算法同样运用不用的运行队列，对于CFS调度，运用的是红黑树，而对于实时调度为组链表。在后面具体的调度算法介绍中我们会看到他们的运用。

上层调度，linux调度的核心函数为schedule，schedule函数封装了内核调度的框架。细节实现上调用具体的调度类中的函数实现。schedule函数主要流程为：

1，将当前进程从相应的运行队列中删除；

2，计算和更新调度实体和进程的相关调度信息；

3，将当前进重新插入到调度运行队列中，对于CFS调度，根据具体的运行时间进行插入而对于实时调度插入到对应优先级队列的队尾；

4，从运行队列中选择运行的下一个进程；

5，进程调度信息和上下文切换；

当进程上下文切换后（关于进程切换在前面的文章中有介绍），调度就基本上完成了，当前运行的进程就是切换过来的进程了。

/*内核和其他部分用于调用进程调度器的入口，选择

哪个进程可以运行，何时将其投入运行。schedule通常

都需要和一个具体的调度类相关联，也就是说，他

会找到一个最高优先级的调度类，后者需要有自己的

可运行队列，然后问后者谁才是下一个该运行的进程

该函数唯一重要的事情是，他回调用pick_next_task*/

asmlinkage void __sched schedule(void)

{

struct task_struct *prev, *next;

unsigned long *switch_count;

struct rq *rq;

int cpu;

need_resched:

preempt_disable();

cpu = smp_processor_id();

rq = cpu_rq(cpu);/*得到特定cpu的rq*/

rcu_sched_qs(cpu);

prev = rq->curr;/*当前的运行进程*/

switch_count = &prev->nivcsw;/*进程切换计数*/

release_kernel_lock(prev);

need_resched_nonpreemptible:

schedule_debug(prev);

if (sched_feat(HRTICK))

hrtick_clear(rq);

spin_lock_irq(&rq->lock);

update_rq_clock(rq);/*更新rq的clock属性*/

clear_tsk_need_resched(prev);/*清楚prev进程的调度位*/

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

if (unlikely(signal_pending_state(prev->state, prev)))

prev->state = TASK_RUNNING;

else/*从运行队列中删除prev进程,根据调度类的

不同，实现不同*/

deactivate_task(rq, prev, 1);

switch_count = &prev->nvcsw;

}

/*现只对实时进程有用*/

pre_schedule(rq, prev);

if (unlikely(!rq->nr_running))

idle_balance(cpu, rq);

/*将当前进程，也就是被切换出去的进程重新

插入到各自的运行队列中，对于CFS算法插入

到合适的位置上，对于实时调度插入到同一个

优先级队列的链表尾部*/

put_prev_task(rq, prev);

/*从各自的运行队列中选择下一个进程来运行*/

next = pick_next_task(rq);

if (likely(prev != next)) {

/*更新切换出去和进来进程以及对应rq的相关变量*/

sched_info_switch(prev, next);

perf_event_task_sched_out(prev, next, cpu);

rq->nr_switches++;/*切换记录*/

rq->curr = next;

++*switch_count;

/*上下文切换，在进程切换已经介绍*/

context_switch(rq, prev, next); /* unlocks the rq */

* the context switch might have flipped the stack from under

* us, hence refresh the local variables.

cpu = smp_processor_id();

rq = cpu_rq(cpu);

} else

spin_unlock_irq(&rq->lock);

/*对于实时进程有用到*/

post_schedule(rq);

if (unlikely(reacquire_kernel_lock(current) < 0))

goto need_resched_nonpreemptible;

preempt_enable_no_resched();

if (need_resched())

goto need_resched;

}

对于cpu_rq函数

/*通过向上加偏移的方式得到rq，这里可以看出

runqueues为一个rq结构的数组，cpu为数组下标*/

#define cpu_rq(cpu) (&per_cpu(runqueues, (cpu)))

deactivate_task函数实现

* deactivate_task – remove a task from the runqueue.

static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)

{

if (task_contributes_to_load(p))

rq->nr_uninterruptible++;

/*具体操作*/

dequeue_task(rq, p, sleep);

dec_nr_running(rq);/*rq中当前进程的运行数减一*/

}

我们看具体的操作

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)

{

if (sleep) {/*如果sleep不为0，更新se中相关变量*/

if (p->se.last_wakeup) {

update_avg(&p->se.avg_overlap,

p->se.sum_exec_runtime – p->se.last_wakeup);

p->se.last_wakeup = 0;

} else {

update_avg(&p->se.avg_wakeup,

sysctl_sched_wakeup_granularity);

}

}

/*更新进程的sched_info数据结构中相关属性*/

sched_info_dequeued(p);

/*调用具体调度类的函数从他的运行队列中删除*/

p->sched_class->dequeue_task(rq, p, sleep);

p->se.on_rq = 0;

}

可见，调用了具体运行队列的删除函数，我们看最关键的选择下一个进程的方式。

* Pick up the highest-prio task:

/*以优先级为序，从高到低，一次检查每个调度类

并且从高优先级的调度类中，选择最高优先级的进程

static inline struct task_struct *

pick_next_task(struct rq *rq)

{

const struct sched_class *class;

struct task_struct *p;

* Optimization: we know that if all tasks are in

* the fair class we can call that function directly:

if (likely(rq->nr_running == rq->cfs.nr_running)) {

p = fair_sched_class.pick_next_task(rq);

if (likely(p))

return p;

}

class = sched_class_highest;

for ( ; ; ) {/*对每一个调度类*/

p = class->pick_next_task(rq);/*调用该调度类中的函数，找出下一个task*/

if (p)

return p;

* Will never be NULL as the idle class always

* returns a non-NULL p:

/*访问下一个调度类*/

class = class->next;

}

}

可见，对于调度类的选择，同样以优先级进行。

对于进程调度信息的切换最终会调用__sched_info_switch

* Called when tasks are switched involuntarily due, typically, to expiring

* their time slice. (This may also be called when switching to or from

* the idle task.) We are only called when prev != next.

static inline void

__sched_info_switch(struct task_struct *prev, struct task_struct *next)

{

struct rq *rq = task_rq(prev);

* prev now departs the cpu. It’s not interesting to record

* stats about how efficient we were at scheduling the idle

* process, however.

if (prev != rq->idle)/*如果被切换出去的进程不是idle进程*/

sched_info_depart(prev);/*更新prev进程和他对应rq的相关变量*/

if (next != rq->idle)/*如果切换进来的进程不是idle进程*/

sched_info_arrive(next);/*更新next进程和对应队列的相关变量*/

}

* Called when a process ceases being the active-running process, either

* voluntarily or involuntarily. Now we can calculate how long we ran.

* Also, if the process is still in the TASK_RUNNING state, call

* sched_info_queued() to mark that it has now again started waiting on

* the runqueue.

static inline void sched_info_depart(struct task_struct *t)

{

/*计算在进程在rq中运行的时间长度*/

unsigned long long delta = task_rq(t)->clock –

t->sched_info.last_arrival;

/*更新RunQueue中的Task所得到CPU執行

時間的累加值.*/

rq_sched_info_depart(task_rq(t), delta);

/*如果被切换出去进程的状态是运行状态

那么将进程sched_info.last_queued设置为rq的clock

last_queued为最后一次排队等待运行的时间*/

if (t->state == TASK_RUNNING)

sched_info_queued(t);

}

* Called when a task finally hits the cpu. We can now calculate how

* long it was waiting to run. We also note when it began so that we

* can keep stats on how long its timeslice is.

static void sched_info_arrive(struct task_struct *t)

{

unsigned long long now = task_rq(t)->clock, delta = 0;

if (t->sched_info.last_queued)/*如果被切换进来前在运行进程中排队*/

delta = now – t->sched_info.last_queued;/*计算排队等待的时间长度*/

sched_info_reset_dequeued(t);/*因为进程将被切换进来运行，设定last_queued为0*/

t->sched_info.run_delay += delta;/*更新进程在运行队列里面等待的时间*/

t->sched_info.last_arrival = now;/*更新最后一次运行的时间*/

t->sched_info.pcount++;/*cpu上运行的次数加一*/

/*更新rq中rq_sched_info中的对应的变量*/

rq_sched_info_arrive(task_rq(t), delta);

}

对于schedule调度函数框架的分析基本是这样了，对于具体的CFS和实时调度的实现在后面分析。

前面对linux调度算法的框架进行了介绍，在这里对CFS（完全公平调度）算法进行分析。

CFS允许每个进程运行一段时间、循环轮转、选择运行最少的进程作为下一个运行进程，而不再采用分配给每个进程时间片的做法了，CFS在所有可运行进程总数基础上计算出一个进程应该运行多久，而不是依靠nice值来计算时间片。nice值在CFS中被作为进程获得的处理器运行比的权重：越高的nice值（越低的优先级）进程获得更低的处理器使用权重，这是相对默认nice值进程的进程而言的；相反，更低的nice值（越高的优先级）的进程获得更高的处理器使用权重。

数据结构

运行队列

struct cfs_rq {

struct load_weight load;/*运行负载*/

unsigned long nr_running;/*运行进程个数*/

u64 exec_clock;

u64 min_vruntime;/*保存的最小运行时间*/

struct rb_root tasks_timeline;/*运行队列树根*/

struct rb_node *rb_leftmost;/*保存的红黑树最左边的

节点，这个为最小运行时间的节点，当进程

选择下一个来运行时，直接选择这个*/

struct list_head tasks;

struct list_head *balance_iterator;

* ‘curr’ points to currently running entity on this cfs_rq.

* It is set to NULL otherwise (i.e when none are currently running).

struct sched_entity *curr, *next, *last;

unsigned int nr_spread_over;

#ifdef CONFIG_FAIR_GROUP_SCHED

struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */

* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in

* a hierarchy). Non-leaf lrqs hold other higher schedulable entities

* (like users, containers etc.)

* leaf_cfs_rq_list ties together list of leaf cfs_rq’s in a cpu. This

* list is used during load balance.

struct list_head leaf_cfs_rq_list;

struct task_group *tg; /* group that “owns” this runqueue */

#ifdef CONFIG_SMP

* the part of load.weight contributed by tasks

unsigned long task_weight;

* h_load = weight * f(tg)

* Where f(tg) is the recursive weight fraction assigned to

* this group.

unsigned long h_load;

* this cpu’s part of tg->shares

unsigned long shares;

* load.weight at the time we set shares

unsigned long rq_weight;

#endif

#endif

};

运行实体结构为sched_entity，所有的调度器都必须对进程运行时间做记账。CFS不再有时间片的概念，但是他也必须维护每个进程运行的时间记账，因为他需要确保每个进程只在公平分配给他的处理器时间内运行。CFS使用调度器实体结构来最终运行记账

* CFS stats for a schedulable entity (task, task-group etc)

* Current field usage histogram:

* 4 se->block_start

* 4 se->run_node

* 4 se->sleep_start

* 6 se->load.weight

struct sched_entity {

struct load_weight load; /* for load-balancing */

struct rb_node run_node;

struct list_head group_node;

unsigned int on_rq;

u64 exec_start;

u64 sum_exec_runtime;

u64 vruntime;/*存放进程的虚拟运行时间,用于调度器的选择*/

u64 prev_sum_exec_runtime;

u64 last_wakeup;

u64 avg_overlap;

u64 nr_migrations;

u64 start_runtime;

u64 avg_wakeup;

u64 avg_running;

#ifdef CONFIG_SCHEDSTATS

…..

/*需要定义相关宏*/

};

调度器的实体作为一个名为se的成员变量，潜入在进程描述符struct task_struct内。

具体的调度类

* All the scheduling class methods:

static const struct sched_class fair_sched_class = {

.next = &idle_sched_class,/*下一个为idle进程*/

.enqueue_task = enqueue_task_fair,

.dequeue_task = dequeue_task_fair,

.yield_task = yield_task_fair,

.check_preempt_curr = check_preempt_wakeup,

.pick_next_task = pick_next_task_fair,

.put_prev_task = put_prev_task_fair,

#ifdef CONFIG_SMP

.select_task_rq = select_task_rq_fair,

.load_balance = load_balance_fair,

.move_one_task = move_one_task_fair,

#endif

.set_curr_task = set_curr_task_fair,

.task_tick = task_tick_fair,

.task_new = task_new_fair,

.prio_changed = prio_changed_fair,

.switched_to = switched_to_fair,

.get_rr_interval = get_rr_interval_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED

.moved_group = moved_group_fair,

#endif

};

对于从运行队列中删除函数dequeue_task_fair

* The dequeue_task method is called before nr_running is

* decreased. We remove the task from the rbtree and

* update the fair scheduling stats:

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)

{

struct cfs_rq *cfs_rq;

struct sched_entity *se = &p->se;

for_each_sched_entity(se) {/*考虑了组调度*/

cfs_rq = cfs_rq_of(se);/*获得se对应的运行队列*/

dequeue_entity(cfs_rq, se, sleep);

/* Don’t dequeue parent if it has other entities besides us */

if (cfs_rq->load.weight)

break;

sleep = 1;

}

/*更新hrtick*/

hrtick_update(rq);

}

/*删除动作发生在进程阻塞(变为不可运行状态)

或者终止时(结束运行)*/

static void

dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)

{

* Update run-time statistics of the ‘current’.

update_curr(cfs_rq);

update_stats_dequeue(cfs_rq, se);

if (sleep) {

#ifdef CONFIG_SCHEDSTATS

if (entity_is_task(se)) {

struct task_struct *tsk = task_of(se);

if (tsk->state & TASK_INTERRUPTIBLE)

se->sleep_start = rq_of(cfs_rq)->clock;

if (tsk->state & TASK_UNINTERRUPTIBLE)

se->block_start = rq_of(cfs_rq)->clock;

}

#endif

}

clear_buddies(cfs_rq, se);

if (se != cfs_rq->curr)

__dequeue_entity(cfs_rq, se);

account_entity_dequeue(cfs_rq, se);

update_min_vruntime(cfs_rq);

}

/*实现记账功能,由系统定时器周期调用*/

static void update_curr(struct cfs_rq *cfs_rq)

{

struct sched_entity *curr = cfs_rq->curr;

u64 now = rq_of(cfs_rq)->clock;/*now计时器*/

unsigned long delta_exec;

if (unlikely(!curr))

return;

* Get the amount of time the current task was running

* since the last time we changed load (this cannot

* overflow on 32 bits):

/*获得从最后一次修改负载后当前任务所占用的运行总时间*/

/*即计算当前进程的执行时间*/

delta_exec = (unsigned long)(now – curr->exec_start);

if (!delta_exec)/*如果本次没有执行过，不用重新更新了*/

return;

/*根据当前可运行进程总数对运行时间进行加权计算*/

__update_curr(cfs_rq, curr, delta_exec);

curr->exec_start = now;/*将exec_start属性置为now*/

if (entity_is_task(curr)) {/*下面为关于组调度的，暂时不分析了*/

struct task_struct *curtask = task_of(curr);

trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);

cpuacct_charge(curtask, delta_exec);

account_group_exec_runtime(curtask, delta_exec);

}

}

static inline void

__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,

unsigned long delta_exec)

{

unsigned long delta_exec_weighted;

schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

/*总运行时间更新*/

curr->sum_exec_runtime += delta_exec;

/*更新cfs_rq的exec_clock*/

schedstat_add(cfs_rq, exec_clock, delta_exec);

/*用优先级和delta_exec来计算weighted以用于更细vruntime*/

delta_exec_weighted = calc_delta_fair(delta_exec, curr);

/*vruntime可以准确地测量给定进程的运行时间

而且可知道谁应该是下一个被运行的进程*/

/*更新进程的vruntime*/

curr->vruntime += delta_exec_weighted;

update_min_vruntime(cfs_rq);

}

static inline unsigned long

calc_delta_fair(unsigned long delta, struct sched_entity *se)

{

/*NICE_0_LOAD: 优先级0 的weight*/

/* 如果不是优先级0,就要调用calc_delta_mine计算delta的weight值*/

if (unlikely(se->load.weight != NICE_0_LOAD))

delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

return delta;

}

/*在这里不打算详细分析calc_delta_mine (delta_exec,weight,lw),它的执行过程约为delta *= weight / lw.

从这个函数中可以看到,如果进程的优先级为0,那么就是返回delta.

如果不为0,就会调用calc_delta_mine()对delta值进行修正.对上面对calc_delta_mine()的说明来看

,有如下关系: Delta = delta* NICE_0_LOAD/ se->load

Se->load值是怎么来的呢? 可以跟踪sys_nice(),就可以发现se->load

其它就是表示nice对应的load值,nice越低,值越大.

据此,就可以得到一个结论.在执行相同时间的条件下(delta相同),

高优先的进程计算出来的delta值会比低优先级的进程计算出来

的低.应此,高优先的进程就会位于rb_tree的左边,在下次调度的

时候就会优先调度.

static unsigned long

calc_delta_mine(unsigned long delta_exec, unsigned long weight,

struct load_weight *lw)

{

u64 tmp;

if (!lw->inv_weight) {

if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))

lw->inv_weight = 1;

else

lw->inv_weight = 1 + (WMULT_CONST-lw->weight/2)

/ (lw->weight+1);

}

tmp = (u64)delta_exec * weight;

* Check whether we’d overflow the 64-bit multiplication:

if (unlikely(tmp > WMULT_CONST))

tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,

WMULT_SHIFT/2);

else

tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);

return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);

}

static void

account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

/*cfs_rq->load更新*/

update_load_sub(&cfs_rq->load, se->load.weight);

if (!parent_entity(se))

dec_cpu_load(rq_of(cfs_rq), se->load.weight);

if (entity_is_task(se)) {/*组调度相关*/

add_cfs_task_weight(cfs_rq, -se->load.weight);

list_del_init(&se->group_node);

}

/*运行个数减一*/

cfs_rq->nr_running–;

se->on_rq = 0;/*表示不再运行队列中*/

}

删除函数最终将由下面函数从红黑树里面删除

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

if (cfs_rq->rb_leftmost == &se->run_node) {

struct rb_node *next_node;

next_node = rb_next(&se->run_node);

cfs_rq->rb_leftmost = next_node;

}

rb_erase(&se->run_node, &cfs_rq->tasks_timeline);

}

向运行队列中添加项的函数为enqueue_task_fair完成

static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)

{

struct cfs_rq *cfs_rq;

struct sched_entity *se = &p->se;

/*对于主调度，会对一个组中的所有进程进行操作*/

for_each_sched_entity(se) {

if (se->on_rq)

break;

cfs_rq = cfs_rq_of(se);

enqueue_entity(cfs_rq, se, wakeup);

wakeup = 1;

}

hrtick_update(rq);

}

该函数更新相关调度信息后最终会调用下面函数插入运行进程的红黑树

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;

struct rb_node *parent = NULL;

struct sched_entity *entry;

s64 key = entity_key(cfs_rq, se);

int leftmost = 1;

* Find the right place in the rbtree:

while (*link) {

parent = *link;

entry = rb_entry(parent, struct sched_entity, run_node);

* We dont care about collisions. Nodes with

* the same key stay together.

*//*key为被插入进程的vruntime*/

if (key < entity_key(cfs_rq, entry)) {

link = &parent->rb_left;

} else {

link = &parent->rb_right;

leftmost = 0;

}

}

* Maintain a cache of leftmost tree entries (it is frequently

* used):

if (leftmost)

cfs_rq->rb_leftmost = &se->run_node;

rb_link_node(&se->run_node, parent, link);

rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);

}

可见CFS的运行队列布局是放在红黑树里面的，而这颗红黑树的排序方式是按照运行实体的vruntime来的。vruntime的计算方式在上面已经做了分析。

进程选择

CFS调度算法的核心是选择具有最小vruntine的任务。运行队列采用红黑树方式存放，其中节点的键值便是可运行进程的虚拟运行时间。CFS调度器选取待运行的下一个进程，是所有进程中vruntime最小的那个，他对应的便是在树中最左侧的叶子节点。实现选择的函数为pick_next_task_fair

static struct task_struct *pick_next_task_fair(struct rq *rq)

{

struct task_struct *p;

struct cfs_rq *cfs_rq = &rq->cfs;

struct sched_entity *se;

if (unlikely(!cfs_rq->nr_running))

return NULL;

do {/*此循环为了考虑组调度*/

se = pick_next_entity(cfs_rq);

set_next_entity(cfs_rq, se);/*设置为当前运行进程*/

cfs_rq = group_cfs_rq(se);

} while (cfs_rq);

p = task_of(se);

hrtick_start_fair(rq, p);

return p;

}

该函数最终调用__pick_next_entity完成实质工作

/*函数本身并不会遍历数找到最左叶子节点(是

所有进程中vruntime最小的那个),因为该值已经缓存

在rb_leftmost字段中*/

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)

{

/*rb_leftmost为保存的红黑树的最左边的节点*/

struct rb_node *left = cfs_rq->rb_leftmost;

if (!left)

return NULL;

return rb_entry(left, struct sched_entity, run_node);

}

总结：CFS调度运行队列采用红黑树方式组织，红黑树种的key值以vruntime排序。每次选择下一个进程运行时即是选择最左边的一个进程运行。而对于入队和处队都会更新调度队列、调度实体的相关信息。

linux内核中提供了两种实时调度策略：SCHED_FIFO和SCHED_RR，其中RR是带有时间片的FIFO。这两种调度算法实现的都是静态优先级。内核不为实时进程计算动态优先级。这能保证给定优先级别的实时进程总能抢占优先级比他低得进程。linux的实时调度算法提供了一种软实时工作方式。实时优先级范围从0到MAX_RT_PRIO减一。默认情况下，MAX_RT_PRIO为100，所以默认的实时优先级范围是从0到99.SCHED_NORMAL级进程的nice值共享了这个取值空间；他的取值范围是从MAX_RT_PRIO到MAX_RT_PRIO+40.也就是说，在默认情况下，nice值从-20到19直接对应的是从100到139的实时优先级范围。

数据结构

实时调度的优先级队列

实时调度的优先级队列是一组链表，每个优先级对应一个链表。

/*实时调度的优先级队列，实时调度中，运行进程

根据优先级放到对应的队列里面，对于相同的优先级

的进程后面来的进程放到同一优先级队列的队尾

对于FIFO/RR调度，各自的进程需要设置相关的属性

进程运行时，要根据task中的这些属性判断和设置

放弃cpu的时机(运行完或是时间片用完)*/

struct rt_prio_array {

DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */

struct list_head queue[MAX_RT_PRIO];

};

运行队列

运行队列组织实时调度的相关信息

/* Real-Time classes’ related field in a runqueue: */

struct rt_rq {

struct rt_prio_array active;

unsigned long rt_nr_running;

#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED

struct {

int curr; /* highest queued rt task prio */

#ifdef CONFIG_SMP

int next; /* next highest */

#endif

} highest_prio;

#endif

#ifdef CONFIG_SMP

unsigned long rt_nr_migratory;

unsigned long rt_nr_total;

int overloaded;

struct plist_head pushable_tasks;

#endif

int rt_throttled;

u64 rt_time;

u64 rt_runtime;

/* Nests inside the rq lock: */

spinlock_t rt_runtime_lock;

#ifdef CONFIG_RT_GROUP_SCHED

unsigned long rt_nr_boosted;

struct rq *rq;

struct list_head leaf_rt_rq_list;

struct task_group *tg;

struct sched_rt_entity *rt_se;

#endif

};

进程调度实体

struct sched_rt_entity {

struct list_head run_list;

unsigned long timeout;

unsigned int time_slice;

int nr_cpus_allowed;

struct sched_rt_entity *back;

#ifdef CONFIG_RT_GROUP_SCHED

struct sched_rt_entity *parent;

/* rq on which this entity is (to be) queued: */

struct rt_rq *rt_rq;

/* rq “owned” by this entity/group: */

struct rt_rq *my_q;

#endif

};

调度类

static const struct sched_class rt_sched_class = {

.next = &fair_sched_class,

.enqueue_task = enqueue_task_rt,

.dequeue_task = dequeue_task_rt,

.yield_task = yield_task_rt,

.check_preempt_curr = check_preempt_curr_rt,

.pick_next_task = pick_next_task_rt,

.put_prev_task = put_prev_task_rt,

#ifdef CONFIG_SMP

.select_task_rq = select_task_rq_rt,

.load_balance = load_balance_rt,

.move_one_task = move_one_task_rt,

.set_cpus_allowed = set_cpus_allowed_rt,

.rq_online = rq_online_rt,

.rq_offline = rq_offline_rt,

.pre_schedule = pre_schedule_rt,

.post_schedule = post_schedule_rt,

.task_wake_up = task_wake_up_rt,

.switched_from = switched_from_rt,

#endif

.set_curr_task = set_curr_task_rt,

.task_tick = task_tick_rt,

.get_rr_interval = get_rr_interval_rt,

.prio_changed = prio_changed_rt,

.switched_to = switched_to_rt,

};

从运行队列中移除函数dequeue_task_rt

static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)

{

struct sched_rt_entity *rt_se = &p->rt;

/*更新调度信息*/

update_curr_rt(rq);

/*实际工作，将rt_se从运行队列中删除然后

添加到队列尾部*/

dequeue_rt_entity(rt_se);

/*从hash表中删除*/

dequeue_pushable_task(rq, p);

}

* Update the current task’s runtime statistics. Skip current tasks that

* are not in our scheduling class.

static void update_curr_rt(struct rq *rq)

{

struct task_struct *curr = rq->curr;

struct sched_rt_entity *rt_se = &curr->rt;

struct rt_rq *rt_rq = rt_rq_of_se(rt_se);

u64 delta_exec;

if (!task_has_rt_policy(curr))/*判断是否问实时调度进程*/

return;

/*执行时间*/

delta_exec = rq->clock – curr->se.exec_start;

if (unlikely((s64)delta_exec < 0))

delta_exec = 0;

schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));

/*更新当前进程的总的执行时间*/

curr->se.sum_exec_runtime += delta_exec;

account_group_exec_runtime(curr, delta_exec);

/*更新执行的开始时间*/

curr->se.exec_start = rq->clock;

cpuacct_charge(curr, delta_exec);/*主调度相关*/

/*更新调度信息*/

sched_rt_avg_update(rq, delta_exec);

if (!rt_bandwidth_enabled())

return;

for_each_sched_rt_entity(rt_se) {

rt_rq = rt_rq_of_se(rt_se);

if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {

spin_lock(&rt_rq->rt_runtime_lock);

rt_rq->rt_time += delta_exec;

if (sched_rt_runtime_exceeded(rt_rq))

resched_task(curr);

spin_unlock(&rt_rq->rt_runtime_lock);

}

}

}

static void dequeue_rt_entity(struct sched_rt_entity *rt_se)

{

/*从运行队列中删除*/

dequeue_rt_stack(rt_se);

for_each_sched_rt_entity(rt_se) {

struct rt_rq *rt_rq = group_rt_rq(rt_se);

if (rt_rq && rt_rq->rt_nr_running)

__enqueue_rt_entity(rt_se);/*添加到队列尾*/

}

}

* Because the prio of an upper entry depends on the lower

* entries, we must remove entries top – down.

static void dequeue_rt_stack(struct sched_rt_entity *rt_se)

{

struct sched_rt_entity *back = NULL;

for_each_sched_rt_entity(rt_se) {/*遍历整个组调度实体*/

rt_se->back = back;/*可见rt_se的back实体为组调度中前一个调度实体*/

back = rt_se;

}

/*将组中的所有进程从运行队列中移除*/

for (rt_se = back; rt_se; rt_se = rt_se->back) {

if (on_rt_rq(rt_se))

__dequeue_rt_entity(rt_se);

}

}

删除的实际工作为：

static void __dequeue_rt_entity(struct sched_rt_entity *rt_se)

{

struct rt_rq *rt_rq = rt_rq_of_se(rt_se);

struct rt_prio_array *array = &rt_rq->active;

list_del_init(&rt_se->run_list);

if (list_empty(array->queue + rt_se_prio(rt_se)))

__clear_bit(rt_se_prio(rt_se), array->bitmap);

dec_rt_tasks(rt_se, rt_rq);

}

添加进程到运行队列

* Adding/removing a task to/from a priority array:

static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)

{

struct sched_rt_entity *rt_se = &p->rt;

if (wakeup)

rt_se->timeout = 0;

/*实际工作*/

enqueue_rt_entity(rt_se);

if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)

enqueue_pushable_task(rq, p);/*添加到对应的hash表中*/

}

static void enqueue_rt_entity(struct sched_rt_entity *rt_se)

{

dequeue_rt_stack(rt_se);/*从运行队列中删除*/

for_each_sched_rt_entity(rt_se)

__enqueue_rt_entity(rt_se);/*添加到运行队列尾部*/

}

实际的添加工作：

static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)

{

struct rt_rq *rt_rq = rt_rq_of_se(rt_se);

struct rt_prio_array *array = &rt_rq->active;

struct rt_rq *group_rq = group_rt_rq(rt_se);

struct list_head *queue = array->queue + rt_se_prio(rt_se);

* Don’t enqueue the group if its throttled, or when empty.

* The latter is a consequence of the former when a child group

* get throttled and the current group doesn’t have any other

* active members.

if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))

return;

list_add_tail(&rt_se->run_list, queue);

__set_bit(rt_se_prio(rt_se), array->bitmap);

inc_rt_tasks(rt_se, rt_rq);

}

选择下一个进程运行

static struct task_struct *pick_next_task_rt(struct rq *rq)

{

struct task_struct *p = _pick_next_task_rt(rq);/*实际工作*/

/* The running task is never eligible for pushing */

if (p)

dequeue_pushable_task(rq, p);

#ifdef CONFIG_SMP

* We detect this state here so that we can avoid taking the RQ

* lock again later if there is no need to push

rq->post_schedule = has_pushable_tasks(rq);

#endif

return p;

}

static struct task_struct *_pick_next_task_rt(struct rq *rq)

{

struct sched_rt_entity *rt_se;

struct task_struct *p;

struct rt_rq *rt_rq;

rt_rq = &rq->rt;

if (unlikely(!rt_rq->rt_nr_running))

return NULL;

if (rt_rq_throttled(rt_rq))

return NULL;

do {/*遍历组调度中的每个进程*/

rt_se = pick_next_rt_entity(rq, rt_rq);

BUG_ON(!rt_se);

rt_rq = group_rt_rq(rt_se);

} while (rt_rq);

p = rt_task_of(rt_se);

/*更新执行域*/

p->se.exec_start = rq->clock;

return p;

}

static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,

struct rt_rq *rt_rq)

{

struct rt_prio_array *array = &rt_rq->active;

struct sched_rt_entity *next = NULL;

struct list_head *queue;

int idx;

/*找到第一个可用的*/

idx = sched_find_first_bit(array->bitmap);

BUG_ON(idx >= MAX_RT_PRIO);

/*从链表组中找到对应的链表*/

queue = array->queue + idx;

next = list_entry(queue->next, struct sched_rt_entity, run_list);

/*返回找到的运行实体*/

return next;

}

总结：对于实时调度，基于linux内核调度框架下作的工作比较简单，把所有的运行进程根据优先级放到不用的队列里面，采用位图方式进行使用记录。进队列仅仅是删除原来队列里面的本进程，然后将他挂到队列尾部；而对于“移除”操作，也仅仅是从队列里面移除后添加到运行队列尾部。

Linux内核分析之调度算法

相关推荐

分类

听说打赏我的人，都进福布斯排行榜啦！

支付宝扫一扫打赏

微信扫一扫打赏

QQ咨询

回顶部