My View of Mutual Exclusion in the Kernel
http://www. Author: e4gle  Posted: 2003-02-21 12:05:46
by wheelz
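Having read the earlier discussion, I have a few thoughts of my own to put up for debate.
The first thing to clear up is that the choice of a mutual-exclusion mechanism depends not on the size of the critical section but on its nature, and on which pieces of code, that is, which kernel execution paths, contend for it.
Strictly speaking, semaphore and spinlock_XXX are mutual-exclusion mechanisms at different layers: the former is implemented on top of the latter. The relationship is a bit like that between HTTP and TCP: both are protocols, but they sit at different layers.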
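Take the semaphore first. It works at the process level and provides mutual exclusion over a resource among multiple processes. The code runs inside the kernel, but that kernel execution path runs on behalf of a process and contends for the resource in that process's name. If it loses the contention there is a context switch: the process goes to sleep, but the CPU does not stall; it goes on to run other execution paths. Conceptually this has nothing directly to do with whether the machine has one CPU or several. Only the implementation of the semaphore itself needs a spinlock on multiprocessor systems, to keep accesses to the semaphore structure atomic.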
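The sleeping behaviour described above can be illustrated outside the kernel. The following is only a user-space sketch, assuming POSIX threads and POSIX semaphores stand in for processes and the kernel semaphore; the names worker, run_semaphore_demo, NWORKERS and ROUNDS are invented for the illustration and do not come from the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define NWORKERS 4      /* hypothetical number of contending threads */
#define ROUNDS   1000   /* hypothetical number of acquisitions per thread */

static sem_t resource_sem;   /* binary semaphore guarding shared_total */
static long shared_total;

/* Each worker contends for the semaphore on its own behalf. A loser is put
 * to sleep inside sem_wait(), and the CPU is free to run something else. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        sem_wait(&resource_sem);    /* may sleep, unlike a spinlock */
        shared_total++;
        sem_post(&resource_sem);    /* wakes one sleeper, if there is one */
    }
    return NULL;
}

/* Runs all workers to completion and returns the final total. */
long run_semaphore_demo(void)
{
    pthread_t tid[NWORKERS];
    int rc;

    shared_total = 0;
    rc = sem_init(&resource_sem, 0, 1);
    assert(rc == 0);
    for (int i = 0; i < NWORKERS; i++) {
        rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    sem_destroy(&resource_sem);
    return shared_total;
}
```

Calling run_semaphore_demo() should return NWORKERS * ROUNDS, since every increment happens while the semaphore is held.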
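Within the kernel, the more common need is mutual exclusion of data accesses between kernel execution paths. This is the most basic mutual-exclusion problem: keeping data modifications atomic. The semaphore implementation depends on it too. On a single CPU the interference comes mainly from interrupts and bottom halves, so disabling and re-enabling interrupts is enough. On multiple CPUs there is additional interference from the other CPUs, so a spinlock is needed as well. Combining the two gives spinlock_XXX.

Its defining property is that once a CPU enters spinlock_XXX it does nothing else; it busy-waits until it acquires the lock. It follows that a critical section protected by spinlock_XXX must not stall, and certainly must not context switch. It should access the data and leave quickly so that other spinning execution paths can take the spinlock. That is the basic rule of spinlocks. If the current execution path really must context switch, it has to release the spinlock before calling schedule(); otherwise deadlock is likely. Interrupt handlers and bottom halves have no process context and cannot context switch, so they can only spin waiting for the spinlock, and if the holder has been switched out there is no telling when it will come back.

Since the whole purpose of a spinlock is to keep data modifications atomic, there is no reason to linger in a critical section it protects.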
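The rule "touch the data, get out, and only then give up the CPU" can be sketched in user space. This is an analogy, not kernel code: it assumes pthread_spin_lock() stands in for spin_lock() and sched_yield() stands in for schedule(). The names spinner, run_spinlock_demo, NTHREADS and ITERS are invented for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define NTHREADS 4       /* hypothetical number of contending threads */
#define ITERS    10000   /* hypothetical number of updates per thread */

static pthread_spinlock_t counter_lock;
static long counter;

static void *spinner(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_spin_lock(&counter_lock);    /* busy-waits until acquired */
        counter++;                           /* short, never blocks */
        pthread_spin_unlock(&counter_lock);

        /* Giving up the CPU happens only after the lock is released,
         * so no other thread spins on a holder that is not running. */
        if (i % 1000 == 0)
            sched_yield();
    }
    return NULL;
}

/* Runs all spinners to completion and returns the final counter. */
long run_spinlock_demo(void)
{
    pthread_t tid[NTHREADS];
    int rc;

    counter = 0;
    rc = pthread_spin_init(&counter_lock, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);
    for (int i = 0; i < NTHREADS; i++) {
        rc = pthread_create(&tid[i], NULL, spinner, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_spin_destroy(&counter_lock);
    return counter;
}
```

Moving the sched_yield() call inside the locked region would still produce the right count here, but every other thread would burn CPU time spinning while the holder is off the processor, which is the situation the text warns against.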
spinlock_XXX comes in many forms:
spin_lock()/spin_unlock()
spin_lock_irq()/spin_unlock_irq()
spin_lock_irqsave()/spin_unlock_irqrestore()
spin_lock_bh()/spin_unlock_bh()
local_irq_disable()/local_irq_enable()
local_bh_disable()/local_bh_enable()
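So which one should be used when? That depends on which kernel execution path the code is in, and which kernel execution paths it must exclude. The main execution paths in the kernel are:
1 A user process running in kernel mode. There is a process context here; the kernel is mainly executing system calls and the like on behalf of the process.
2 Interrupts, exceptions, traps and so on. Conceptually there is no process context here, and no context switch is possible.
3 bottom_half. Conceptually there is no process context here either.
4 In addition, the same execution path may be running on other CPUs at the same time.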
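With these four factors in mind, working out which of them access the data we need to protect determines which form of spinlock to use:
To exclude only other CPUs, use spin_lock/spin_unlock.
To exclude irqs and other CPUs, use spin_lock_irq/spin_unlock_irq.
To exclude irqs and other CPUs and also preserve the state of EFLAGS, use spin_lock_irqsave/spin_unlock_irqrestore.
To exclude bh and other CPUs, use spin_lock_bh/spin_unlock_bh.
To exclude only irqs, with no need to exclude other CPUs, use local_irq_disable/local_irq_enable.
To exclude only bh, with no need to exclude other CPUs, use local_bh_disable/local_bh_enable.
And so on. It is worth pointing out that the same data may be protected with different forms in different kernel execution paths (see the example below).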
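The selection rules above can be written down as a small decision function. This is only a sketch of the rules as stated in this post, not a kernel API: the enum, the function pick_form and its parameters are invented for the illustration. Two choices are assumptions of the sketch rather than statements from the text: irq exclusion is checked before bh exclusion, and the case where nothing contends returns FORM_NONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the forms listed in the text. */
enum lock_form {
    FORM_NONE,          /* assumption: nothing contends, nothing needed */
    FORM_SPIN,          /* spin_lock/spin_unlock */
    FORM_SPIN_IRQ,      /* spin_lock_irq/spin_unlock_irq */
    FORM_SPIN_IRQSAVE,  /* spin_lock_irqsave/spin_unlock_irqrestore */
    FORM_SPIN_BH,       /* spin_lock_bh/spin_unlock_bh */
    FORM_LOCAL_IRQ,     /* local_irq_disable/local_irq_enable */
    FORM_LOCAL_BH       /* local_bh_disable/local_bh_enable */
};

static const char *const form_name[] = {
    "(nothing needed)",
    "spin_lock/spin_unlock",
    "spin_lock_irq/spin_unlock_irq",
    "spin_lock_irqsave/spin_unlock_irqrestore",
    "spin_lock_bh/spin_unlock_bh",
    "local_irq_disable/local_irq_enable",
    "local_bh_disable/local_bh_enable",
};

/* other_cpus: the data is also touched from other CPUs
 * irq:        the data is also touched from interrupt handlers
 * bh:         the data is also touched from bottom halves
 * save_flags: the interrupt state on entry must be preserved */
enum lock_form pick_form(bool other_cpus, bool irq, bool bh, bool save_flags)
{
    /* Saving the flags only makes sense when irqs are being excluded. */
    assert(!save_flags || irq);

    if (irq) {
        if (!other_cpus)
            return FORM_LOCAL_IRQ;  /* the text lists no flag-saving local
                                       form, so save_flags is ignored here */
        return save_flags ? FORM_SPIN_IRQSAVE : FORM_SPIN_IRQ;
    }
    if (bh)
        return other_cpus ? FORM_SPIN_BH : FORM_LOCAL_BH;
    return other_cpus ? FORM_SPIN : FORM_NONE;
}

/* Convenience wrapper returning the name used in the text. */
const char *pick_form_name(bool other_cpus, bool irq, bool bh, bool save_flags)
{
    return form_name[pick_form(other_cpus, irq, bh, save_flags)];
}
```

For instance, pick_form(true, true, false, true) corresponds to the setup_irq() case discussed in the example that follows, and pick_form(true, false, false, false) to the do_IRQ() case.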
Here is an example. The interrupt code has irq_desc[], an array of structures of type irq_desc_t. Each element describes one irq and holds, among other things, that irq's handler. The irq_desc_t structure contains a spinlock that guarantees mutual exclusion for accesses (modifications).
A given element, irq_desc[irq], is accessed from two kernel execution paths. One is when the irq's handler is installed (setup_irq), which usually happens during module initialization or system initialization. The other is in the interrupt dispatch function (do_IRQ). The code is as follows:
int setup_irq(unsigned int irq, struct irqaction * new)
{
    int shared = 0;
    unsigned long flags;
    struct irqaction *old, **p;
    irq_desc_t *desc = irq_desc + irq;

    /*
     * Some drivers like serial.c use request_irq() heavily,
     * so we have to be careful not to interfere with a
     * running system.
     */
    if (new->flags & SA_SAMPLE_RANDOM) {
        /*
         * This function might sleep, we want to call it first,
         * outside of the atomic block.
         * Yes, this might clear the entropy pool if the wrong
         * driver is attempted to be loaded, without actually
         * installing a new handler, but is this really a problem,
         * only the sysadmin is able to do this.
         */
        rand_initialize_irq(irq);
    }

    /*
     * The following block of code has to be executed atomically
     */
[1] spin_lock_irqsave(&desc->lock,flags);
    p = &desc->action;
    if ((old = *p) != NULL) {
        /* Can't share interrupts unless both agree to */
        if (!(old->flags & new->flags & SA_SHIRQ)) {
[2]         spin_unlock_irqrestore(&desc->lock,flags);
            return -EBUSY;
        }

        /* add new interrupt at end of irq queue */
        do {
            p = &old->next;
            old = *p;
        } while (old);
        shared = 1;
    }

    *p = new;

    if (!shared) {
        desc->depth = 0;
        desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | IRQ_WAITING);
        desc->handler->startup(irq);
    }
[3] spin_unlock_irqrestore(&desc->lock,flags);

    register_irq_proc(irq);
    return 0;
}
asmlinkage unsigned int do_IRQ(struct pt_regs regs)
{
    /*
     * We ack quickly, we don't want the irq controller
     * thinking we're snobs just because some other CPU has
     * disabled global interrupts (we have already done the
     * INT_ACK cycles, it's too late to try to pretend to the
     * controller that we aren't taking the interrupt).
     *
     * 0 return value means that this irq is already being
     * handled by some other CPU. (or is disabled)
     */
    int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
    int cpu = smp_processor_id();
    irq_desc_t *desc = irq_desc + irq;
    struct irqaction * action;
    unsigned int status;

    kstat.irqs[cpu][irq]++;
[4] spin_lock(&desc->lock);
    desc->handler->ack(irq);
    /*
       REPLAY is when Linux resends an IRQ that was dropped earlier
       WAITING is used by probe to mark irqs that are being tested
     */
    status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
    status |= IRQ_PENDING; /* we _want_ to handle it */

    /*
     * If the IRQ is disabled for whatever reason, we cannot
     * use the action we have.
     */
    action = NULL;
    if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
        action = desc->action;
        status &= ~IRQ_PENDING; /* we commit to handling */
        status |= IRQ_INPROGRESS; /* we are handling it */
    }
    desc->status = status;

    /*
     * If there is no IRQ handler or it was disabled, exit early.
       Since we set PENDING, if another processor is handling
       a different instance of this same irq, the other processor
       will take care of it.
     */
    if (!action)
        goto out;

    /*
     * Edge triggered interrupts need to remember
     * pending events.
     * This applies to any hw interrupts that allow a second
     * instance of the same irq to arrive while we are in do_IRQ
     * or in the handler. But the code here only handles the _second_
     * instance of the irq, not the third or fourth. So it is mostly
     * useful for irq hardware that does not mask cleanly in an
     * SMP environment.
     */
    for (;;) {
[5]     spin_unlock(&desc->lock);
        handle_IRQ_event(irq, &regs, action);
[6]     spin_lock(&desc->lock);

        if (!(desc->status & IRQ_PENDING))
            break;
        desc->status &= ~IRQ_PENDING;
    }
    desc->status &= ~IRQ_INPROGRESS;
out:
    /*
     * The ->end() handler has to deal with interrupts which got
     * disabled while the handler was running.
     */
    desc->handler->end(irq);
[7] spin_unlock(&desc->lock);

    if (softirq_pending(cpu))
        do_softirq();
    return 1;
}
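In setup_irq(), another CPU may be running setup_irq() at the same time, or a local irq may arrive while setup_irq() is running, so that do_IRQ() runs and modifies desc->status. To guard against interference from both other CPUs and local irqs, spin_lock_irqsave/spin_unlock_irqrestore() is used, as shown at [1][2][3].
In do_IRQ(), by contrast, the code is already inside an interrupt and interrupts have not yet been re-enabled, so nothing on the local CPU can interrupt it. Another CPU may be running setup_irq(), or may itself be in an interrupt, but from the point of view of the local do_IRQ() the two are the same: both are interference from another CPU. So spin_lock/spin_unlock is all that is needed, as shown at [4][5][6][7]. Note [5] in particular: the spinlock is released first, and only then is the actual handler called, so the descriptor lock is not held while the handler runs.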
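The unlock-call-relock loop at [5] and [6] can be tried out in user space. The following is a simplified sketch, not the kernel's implementation: it assumes a pthread mutex in place of desc->lock, keeps only the IRQ_INPROGRESS and IRQ_PENDING bits, and uses invented names (struct irq_line, dispatch, demo_handler, run_dispatch_demo). The demo handler simulates a second instance of the same irq arriving on another CPU by calling dispatch() again from inside the handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define IRQ_INPROGRESS 0x1u   /* a handler is currently running */
#define IRQ_PENDING    0x2u   /* another instance arrived meanwhile */

struct irq_line {             /* hypothetical stand-in for irq_desc_t */
    pthread_mutex_t lock;
    unsigned int status;
    void (*handler)(struct irq_line *);
    int handled;              /* how many times the handler has run */
};

/* Returns 1 if this call ran the handler, 0 if it only marked PENDING. */
int dispatch(struct irq_line *d)
{
    int run = 0;

    pthread_mutex_lock(&d->lock);
    d->status |= IRQ_PENDING;                 /* we want it handled */
    if (!(d->status & IRQ_INPROGRESS)) {
        d->status &= ~IRQ_PENDING;            /* we commit to handling */
        d->status |= IRQ_INPROGRESS;
        run = 1;
    }
    if (run) {
        for (;;) {
            pthread_mutex_unlock(&d->lock);   /* as at [5]: drop the lock */
            d->handler(d);                    /* handler runs unlocked */
            pthread_mutex_lock(&d->lock);     /* as at [6]: take it again */
            if (!(d->status & IRQ_PENDING))
                break;
            d->status &= ~IRQ_PENDING;        /* replay the missed instance */
        }
        d->status &= ~IRQ_INPROGRESS;
    }
    pthread_mutex_unlock(&d->lock);
    return run;
}

/* Only one handler instance runs at a time (INPROGRESS guarantees it),
 * so 'handled' needs no extra locking in this sketch. */
static void demo_handler(struct irq_line *d)
{
    d->handled++;
    if (d->handled == 1)
        dispatch(d);   /* simulate a second instance arriving mid-handler */
}

/* Returns how many times the handler ran for the simulated sequence. */
int run_dispatch_demo(void)
{
    struct irq_line d = { .status = 0, .handler = demo_handler, .handled = 0 };
    int rc = pthread_mutex_init(&d.lock, NULL);

    assert(rc == 0);
    dispatch(&d);
    pthread_mutex_destroy(&d.lock);
    return d.handled;
}
```

In this sketch the nested dispatch() call only sets IRQ_PENDING and returns, and the outer loop then runs the handler a second time, so run_dispatch_demo() returns 2. Had the lock been held across the handler call, the nested call would have blocked on the lock instead.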
Another example:
static void tasklet_hi_action(struct softirq_action *a)
{
    int cpu = smp_processor_id();
    struct tasklet_struct *list;

[8] local_irq_disable();
    list = tasklet_hi_vec[cpu].list;
    tasklet_hi_vec[cpu].list = NULL;
[9] local_irq_enable();

    while (list) {
        struct tasklet_struct *t = list;

        list = list->next;

        if (tasklet_trylock(t)) {
            if (!atomic_read(&t->count)) {
                if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
                    BUG();
                t->func(t->data);
                tasklet_unlock(t);
                continue;
            }
            tasklet_unlock(t);
        }

[10]    local_irq_disable();
        t->next = tasklet_hi_vec[cpu].list;
        tasklet_hi_vec[cpu].list = t;
        __cpu_raise_softirq(cpu, HI_SOFTIRQ);
[11]    local_irq_enable();
    }
}
Here, modifications to tasklet_hi_vec[cpu] involve no contention between CPUs, because each CPU has its own independent data. Only interference from irqs has to be prevented, so local_irq_disable/local_irq_enable suffices, as shown at [8][9][10][11].
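The "detach the whole list with interrupts off, then process it with interrupts on" pattern at [8] and [9] has a rough user-space analogue. The sketch below assumes that blocking signals with sigprocmask() plays the role of local_irq_disable(), and that a single-threaded program plays the role of one CPU with its own private list. The names struct work, queue_work, detach_all and drain_pending are invented for the illustration. Restoring the saved signal mask is strictly closer to an irqsave/irqrestore pair than to local_irq_enable().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

struct work {                 /* hypothetical stand-in for tasklet_struct */
    struct work *next;
    int value;
};

static struct work *pending_list;   /* private to this "CPU" */

/* Push one item. Signals are blocked so that a signal handler touching
 * the list could never observe it half-updated. */
void queue_work(struct work *w)
{
    sigset_t all, saved;

    assert(w != NULL);
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &saved);     /* roughly local_irq_disable() */
    w->next = pending_list;
    pending_list = w;
    sigprocmask(SIG_SETMASK, &saved, NULL);   /* roughly local_irq_enable() */
}

/* Take the whole list in one short protected step, as at [8] and [9]. */
static struct work *detach_all(void)
{
    sigset_t all, saved;
    struct work *list;

    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &saved);
    list = pending_list;
    pending_list = NULL;
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return list;
}

/* Process the detached items with signals enabled again; returns the sum
 * of their values so the effect is observable. */
int drain_pending(void)
{
    int sum = 0;
    struct work *list = detach_all();

    while (list) {
        struct work *t = list;

        list = list->next;
        sum += t->value;
    }
    return sum;
}
```

The protected region is only two pointer assignments long; the potentially slow per-item processing happens after the mask is restored, which mirrors how tasklet_hi_action() runs t->func() outside the local_irq_disable/local_irq_enable pair.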