1. Concepts
select, poll, and epoll are all event-notification mechanisms: a process waits for events and handles each one when it fires. All three are used for I/O multiplexing.
2. A simple example
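As one minimal sketch of the idea, the helper below uses select (covered in section 3) to wait, with a timeout, until a descriptor becomes readable. The name wait_readable and its return convention are assumptions made for this illustration; they are not part of any library.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to become
 * readable. Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    /* select() cannot handle descriptors at or above FD_SETSIZE */
    assert(fd >= 0 && fd < FD_SETSIZE);

    fd_set rset;
    FD_ZERO(&rset);
    FD_SET(fd, &rset);

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* First argument is the highest descriptor plus one */
    int n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n <= 0)
        return n; /* 0: timed out, -1: error */
    return FD_ISSET(fd, &rset) ? 1 : 0;
}
```

Called on the read end of an empty pipe, the helper times out and returns 0; once a byte has been written to the other end, it returns 1 without waiting.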
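3. The select function
3.1 Function details
int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, struct timeval *timeout);
// Return value: the number of ready descriptors, 0 on timeout, -1 on error
1) The first argument, maxfdp1, specifies how many descriptors to test. Its value is the highest descriptor to be tested plus one (hence the name maxfdp1). The kernel scans descriptors 0, 1, 2, ..., maxfdp1-1, including the ones in between that you are not interested in, but it only tests those whose bit is set in one of the three sets.
2) The three middle arguments, readset, writeset, and exceptset, specify the descriptors that the kernel should test for read, write, and exception conditions. If you are not interested in one of the conditions, pass a null pointer for it. An fd_set holds descriptors; it is a bitmap, implemented as an array of long, and is manipulated with the following four macros:
void FD_ZERO(fd_set *fdset);          // clear the set
void FD_SET(int fd, fd_set *fdset);   // add the given descriptor to the set
void FD_CLR(int fd, fd_set *fdset);   // remove the given descriptor from the set
int FD_ISSET(int fd, fd_set *fdset);  // test whether the given descriptor is in the set (after select returns: whether it is ready)
3) timeout tells the kernel how long to wait for any of the specified descriptors to become ready. Its timeval structure gives the time in seconds and microseconds. On Linux, select may modify this structure to reflect the time that was left.
struct timeval {
    long tv_sec;   // seconds
    long tv_usec;  // microseconds
};
There are three possibilities for this argument:
(a) Wait forever: return only when a descriptor is ready for I/O. To do this, pass a null pointer (NULL).
(b) Wait for a fixed time: return when a descriptor is ready for I/O, but wait no longer than the seconds and microseconds given in the timeval structure that the argument points to.
(c) Do not wait at all: check the descriptors and return immediately. This is called polling. The argument must point to a timeval structure whose timer value is 0.
3.2 Implementation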
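select loops through steps 1 to 7:
1) Use copy_from_user to copy the fd_set (the descriptor set) into the kernel.
2) Register the callback __pollwait. The poll methods called in step 3 invoke it, through poll_wait, to put the current process on a wait queue.
3) Traverse all descriptors and call the poll method of each one (for a socket this is sock_poll, which in turn calls tcp_poll, udp_poll, or datagram_poll as appropriate). The main job of the poll method is to hang the current process on the wait queue of the device behind the fd; when the fd becomes readable or writable, the processes sleeping on that wait queue are woken. The poll method returns a mask describing whether reading or writing is ready, and that mask is used to fill in the fd_set.
4) After the traversal, if any mask shows a readable or writable descriptor, jump to step 7.
5) Otherwise, call schedule_timeout to put the current process to sleep.
6) If an fd becomes readable or writable during the sleep, or the sleep time runs out, the current process is woken, gets the CPU, and jumps back to step 3.
7) Use copy_to_user to copy the fd_set from the kernel back to user space.
Finally, the process checks the fd_set in user space, finds the readable or writable descriptors, and performs I/O on them.
3.3 Disadvantages
1) select can monitor only a small number of file descriptors. On Linux the default limit is 1024, set by the macro FD_SETSIZE.
2) Every call to select copies the whole fd set from user space to the kernel and copies it back on return, which has a cost.
3) Every time the current process is woken, it traverses all fds (that is, it polls them), which is inefficient.
3.4 Example
#include <stdio.h>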
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* readline() and writen() are not standard library functions. They are
 * assumed here to be the helpers from "UNIX Network Programming" and must
 * be provided separately. */
ssize_t readline(int fd, void *vptr, size_t maxlen);
ssize_t writen(int fd, const void *vptr, size_t n);

int max(int a, int b)
{
    return (a >= b ? a : b);
}

void str_cli(FILE *fp, int sockfd)
{
    int maxfdp1;
    fd_set rset;
    char sendline[4096], recvline[4096];

    FD_ZERO(&rset);
    for (;;)
    {
        /* select() overwrites rset, so both bits are set again on every pass */
        FD_SET(fileno(fp), &rset);
        FD_SET(sockfd, &rset);
        maxfdp1 = max(fileno(fp), sockfd) + 1;
        if (select(maxfdp1, &rset, NULL, NULL, NULL) < 0)
        {
            perror("select");
            exit(1);
        }
        if (FD_ISSET(sockfd, &rset)) /* socket is readable */
        {
            if (readline(sockfd, recvline, 4096) == 0)
            {
                printf("str_cli: server terminated prematurely\n");
                exit(1);
            }
            fputs(recvline, stdout);
        }
        if (FD_ISSET(fileno(fp), &rset)) /* input is readable */
        {
            if (fgets(sendline, 4096, fp) == NULL)
                return;
            writen(sockfd, sendline, strlen(sendline));
        }
    }
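}
4. The poll function
4.1 Function details
#include <poll.h>
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
1) poll uses an array of structures, fds, to hold the descriptors to watch. Each element is a pollfd structure:
struct pollfd {
    int fd;         // the file descriptor
    short events;   // the events to check for
    short revents;  // the events reported after the check; nonzero when the state of fd has changed
};
To speed up processing and improve system performance, poll represents all the struct pollfd entries in fds as a linked list of struct poll_list inside the kernel; in other words, the kernel keeps the descriptors in a linked list:
struct poll_list {
    struct poll_list *next;
    int len;
    struct pollfd entries[0];
};
2) Parameters
fds: the descriptors whose state is to be checked. Unlike select, which overwrites its descriptor sets on return so that they must be rebuilt before every call, poll leaves the array intact and only updates the revents field of the entries whose state changed, which is more convenient to work with.
nfds: the number of elements in fds.
timeout: how long to wait, in milliseconds; a negative value waits indefinitely and 0 returns immediately.
3) Return value
The number of elements whose revents is nonzero, 0 on timeout, and -1 on error.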
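A short sketch may make the events/revents split concrete. The helper below is illustrative only: poll_pair is a hypothetical name, and encoding the result as a bitmask is a choice made for this example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: poll two descriptors for readability.
 * Returns a bitmask (bit 0: fd_a readable, bit 1: fd_b readable),
 * or -1 if poll() itself fails. */
int poll_pair(int fd_a, int fd_b, int timeout_ms)
{
    assert(fd_a >= 0 && fd_b >= 0);

    struct pollfd fds[2];
    fds[0].fd = fd_a;
    fds[0].events = POLLIN;   /* what we ask for */
    fds[0].revents = 0;       /* what the kernel reports back */
    fds[1].fd = fd_b;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    if (poll(fds, 2, timeout_ms) < 0)
        return -1;

    int mask = 0;
    if (fds[0].revents & POLLIN)
        mask |= 1;
    if (fds[1].revents & POLLIN)
        mask |= 2;
    return mask;
}
```

Because the request (events) and the answer (revents) live in separate fields, the same array could be passed to poll again without being rebuilt, which is the convenience noted under fds above.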
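4.2 Implementation
poll is implemented in much the same way as select.
4.3 Advantages
1) poll has no fixed maximum. The size of the struct pollfd array fds can be chosen to suit your needs (although performance still drops once the number gets large).
4.4 Disadvantages
The same as disadvantages 2 and 3 of select: the whole descriptor array is copied between user space and the kernel on every call, and every descriptor is traversed on every wakeup.
5. The epoll function
epoll is the Linux improvement on select/poll.
5.1 Function details
epoll is used through three functions: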
int epoll_create(int size);
// size: tells the kernel how many descriptors will be monitored. It must be greater than 0, otherwise the call fails with EINVAL. It is only a hint for the kernel's initial allocation of internal data structures; judging from the source code, size is otherwise unused.
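int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
/* epfd: the return value of epoll_create().
   op: the operation, given by one of three macros: EPOLL_CTL_ADD, EPOLL_CTL_DEL, and EPOLL_CTL_MOD, which add, delete, and modify the events monitored on fd.
   fd: the file descriptor to monitor.
   event: tells the kernel which events to monitor; ET mode is also set in this structure. */
1) It calls copy_from_user to copy the epoll_event structure into kernel space. (Many blog posts claim that epoll uses shared memory. This is wrong: read the source and you will find that no shared-memory API is used anywhere.)
2) It adds the socket fd to be monitored to a red-black tree (entries can also be deleted and modified; on an add, if the fd is already present the call returns immediately, otherwise the fd is inserted into the tree). During insertion it also registers the callback ep_poll_callback for this socket. When the socket becomes ready, this callback runs immediately (rather than the wakeup function default_wake_function, as in select/poll).
3) The callback ep_poll_callback puts the ready fd on the ready list and then wakes the current process.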
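The add/modify/delete behaviour can be observed from user space. The sketch below is only an illustration: ctl_errno is a hypothetical helper name, and the error codes it surfaces (EEXIST for a duplicate add, ENOENT for modifying or deleting an fd that was never added) are the ones documented for epoll_ctl.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: run one epoll_ctl() operation on fd.
 * Returns 0 on success, or the errno value on failure. */
int ctl_errno(int epfd, int op, int fd, unsigned int events)
{
    assert(epfd >= 0 && fd >= 0);

    struct epoll_event ev;
    ev.events = events;   /* e.g. EPOLLIN, optionally OR-ed with EPOLLET */
    ev.data.fd = fd;      /* handed back unchanged by epoll_wait() */

    if (epoll_ctl(epfd, op, fd, &ev) == 0)
        return 0;
    return errno;
}
```

For example, with epfd obtained from epoll_create1(0) and fd the read end of a pipe, a first EPOLL_CTL_ADD returns 0, a second returns EEXIST, and EPOLL_CTL_DEL followed by another EPOLL_CTL_DEL returns 0 and then ENOENT.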
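int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
epoll_wait loops through steps 1 to 6:
1) epoll_wait checks whether the ready list is empty.
2) If it is not empty, jump to step 6.
3) If it is empty, call schedule_timeout to put the current process to sleep.
4) If an fd becomes ready during the sleep, its callback ep_poll_callback runs, puts the ready fd on the ready list, and wakes the current process, which then jumps to step 1.
5) If the sleep time runs out instead, also jump to step 1.
6) Use __put_user to copy the ready fds to user space.
5.2 The two epoll modes
5.2.1 Level-triggered mode (LT)
1) LT is epoll's default mode and works with both blocking and non-blocking sockets.
2) Traditional select/poll behave this way as well.
3) Implementation: when an fd becomes ready, the callback puts it on the ready list. A call to epoll_wait then copies this ready fd to user space and empties the ready list. As a last step, epoll_wait checks the fd again, and if it really has not been handled yet, puts it back on the ready list that was just emptied, so the next epoll_wait returns it again.
5.2.2 Edge-triggered mode (ET)
1) The difference between the two: in LT mode, as long as a socket is readable or writable, every call to epoll_wait returns it. In ET mode, epoll_wait returns a socket only when the fd changes from unreadable to readable or from unwritable to writable (analogous to an edge between low and high signal levels).
2) Because of this difference, the correct way to read and write in ET mode is:
Read: once the fd is readable, keep reading until the buffer is drained.
Write: once the fd is writable, keep writing until the buffer is full.
The reason is that ET reports a change only once: data left unread, or buffer space left unused, produces no further notification. For example: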
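// Read
if (events[i].events & EPOLLIN)
{
    n = 0;
    /* Read until the socket is drained. On a non-blocking socket, read()
     * returns -1 with errno == EAGAIN once no data is left; a return of 0
     * means the peer closed the connection. The loop also stops when buf
     * is full, in which case unread data may remain in the socket. */
    while (n < BUFSIZ - 1 && (nread = read(fd, buf + n, BUFSIZ - 1 - n)) > 0)
    {
        n += nread;
    }
    if (nread == -1 && errno != EAGAIN)
    {
        perror("read error");
    }
    ev.data.fd = fd;
    ev.events = events[i].events | EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
// Write
if (events[i].events & EPOLLOUT)
{
    int nwrite, data_size = strlen(buf);
    n = data_size;
    while (n > 0) // until everything is written (n drops to 0) or the send buffer is full
    {
        nwrite = write(fd, buf + data_size - n, n);
        if (nwrite < n)
        {
            if (nwrite == -1 && errno != EAGAIN)
            {
                perror("write error");
            }
            break;
        }
        n -= nwrite;
    }
    ev.data.fd = fd;
    ev.events = EPOLLIN | EPOLLET;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev); // switch the events monitored on fd back to EPOLLIN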
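}
3) This way of reading and writing is why ET mode only supports non-blocking sockets. With a blocking socket there is a problem: an edge-triggered program normally reads the socket in a loop until the data is drained, and once no data is left a blocking socket blocks in read indefinitely. The process is then no longer blocked in epoll_wait, and the other sockets starve.
4) LT mode returns a readable socket on every call, while ET mode returns it only when the edge condition is met. This reduces repeated epoll system calls, so ET is more efficient than LT, but it demands more of the programmer: every event must be handled carefully, otherwise events are easily lost.
5.3 Advantages
1) epoll can monitor a very large number of descriptors. The upper limit is the maximum number of files that all processes on the system can open; the exact number can be read with cat /proc/sys/fs/file-max (98875 on Ubuntu 14.04).
2) select/poll copy the whole fd set between user space and the kernel on every call, whereas epoll only copies the ready fds when it returns, which reduces the copying cost.
3) select/poll and epoll all alternate between sleeping and waking many times, but while "awake" select/poll must traverse the whole fd set, whereas epoll only has to check whether the ready list is empty, which is far more efficient.
5.4 epoll source code
/*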
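 * Before digging into the implementation of epoll, look at three aspects of the kernel.
 * 1. The wait queue (waitqueue)
 * A short explanation of wait queues:
 * the queue head (wait_queue_head_t) is usually the resource producer,
 * and the queue members (wait_queue_t) are usually the resource consumers.
 * When the resource at the head becomes ready, the callback registered by
 * each member is run in turn to notify it that the resource is ready.
 * That is roughly what a wait queue is.
 * 2. The kernel's poll mechanism
 * An fd that is polled must support the kernel's poll mechanism in its
 * implementation. For example, if the fd is a character device or a socket,
 * it must implement the poll operation in file_operations and own a wait
 * queue head. A process that actively polls the fd must allocate a wait
 * queue member, add it to the fd's wait queue, and specify the callback to
 * run when the resource becomes ready.
 * Take a socket as the example: it must implement a poll operation. The code
 * that starts the polling must call this poll operation itself, and the
 * operation must call poll_wait(), which adds the caller to the socket's
 * wait queue as a member. When the state of the socket changes, it can then
 * notify every interested process in turn through the queue head.
 * This point must be understood clearly; otherwise it is hard to see how
 * epoll learns that the state of an fd has changed.
 * 3. An epollfd is itself an fd, so it can itself be monitored by epoll.
 * You may wonder whether epoll instances can be nested without limit...
 *
 * epoll is essentially built from points 1 and 2 above.
 * So epoll does not introduce any particularly complex or advanced technique
 * into the kernel; it merely recombines existing features to outperform select.
 */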
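Point 3 can be tried from user space. The sketch below is a hedged illustration, not part of the kernel source: watch_with_outer and ready_count are hypothetical helper names. It registers one epoll descriptor inside a second, freshly created one and then asks the outer instance, without blocking, how many events are ready. As for unlimited nesting, the epoll_ctl manual page documents an ELOOP error for circular or overly deep nesting, so the answer appears to be no.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create a new epoll instance that watches inner_epfd
 * for readability. Returns the new (outer) descriptor, or -1 on error. */
int watch_with_outer(int inner_epfd)
{
    assert(inner_epfd >= 0);

    int outer = epoll_create1(0);
    if (outer < 0)
        return -1;

    struct epoll_event ev;
    ev.events = EPOLLIN;       /* an epollfd is "readable" when it has ready events */
    ev.data.fd = inner_epfd;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner_epfd, &ev) < 0) {
        close(outer);
        return -1;
    }
    return outer;
}

/* Hypothetical helper: number of ready events on epfd, without blocking. */
int ready_count(int epfd)
{
    struct epoll_event out[4];
    return epoll_wait(epfd, out, 4, 0);   /* timeout 0: check and return */
}
```

If the inner instance watches the read end of a pipe, the outer instance reports nothing while the pipe is empty and reports one event (the inner epollfd) once a byte is written. Trying to add an epollfd to itself fails, which matches the file == tfile check in epoll_ctl.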
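/*
 * Other relevant kernel background:
 * 1. An fd, as we know, is a file descriptor. In the kernel its counterpart
 * is struct file, which can be seen as the kernel-side file descriptor.
 * 2. spinlock: a lock that must be used with great care, especially with
 * spin_lock_irqsave(): interrupts are disabled, no scheduling takes place,
 * and other CPUs cannot touch the protected resource either. It is a very
 * heavy-handed lock, so it should only guard very lightweight operations.
 * 3. Reference counting is a very important concept in the kernel.
 * Kernel code often contains release/free functions that take almost no
 * locks. That is because these functions are usually called when the
 * object's reference count drops to 0; since no process is using the
 * object any more, no lock is needed.
 * struct file carries a reference count.
 */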
/* --- epoll-related data structures --- */
/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
/* For every epoll handle created, the kernel allocates one eventpoll */
struct eventpoll
{
    /* Protect the access to this structure */
    spinlock_t lock;
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    /* This mutex is held while a monitored fd is added, modified, or removed,
     * and while epoll_wait returns and passes data to user space, so user
     * space can safely run epoll operations from several threads at once;
     * the kernel already provides the protection. */
    struct mutex mtx;
    /* Wait queue used by sys_epoll_wait() */
    /* A caller of epoll_wait() "sleeps" on this wait queue */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    /* Used when the epollfd itself is being polled */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    /* Every epitem that is already ready is on this list */
    struct list_head rdllist;
    /* RB tree root used to store monitored fd structs */
    /* Every monitored epitem lives in this tree */
    struct rb_root rbr;
    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    /* A singly linked list of every struct epitem whose event fired while
     * ready events were being transferred to user space */
    struct epitem *ovflist;
    /* The user that created the eventpoll descriptor */
    /* Holds per-user data, such as the maximum number of fds that may be monitored */
    struct user_struct *user;
};
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 */
/* An epitem represents one monitored fd */
struct epitem
{
    /* RB tree node used to link this structure to the eventpoll RB tree */
    /* rb_node: when epoll_ctl() adds a batch of fds to an epollfd, the kernel
     * allocates a matching batch of epitems and organizes them as an rb_tree
     * whose root is stored in the epollfd, that is, in struct eventpoll.
     * The rb_tree is presumably used here to speed up lookup, insertion, and
     * deletion; all three operations are O(log N) on an rb_tree. */
    struct rb_node rbn;
    /* List header used to link this structure to the eventpoll ready list */
    /* List node: every epitem that is ready is linked into the rdllist of its eventpoll */
    struct list_head rdllink;
    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    /* Explained later, in the code */
    struct epitem *next;
    /* The file descriptor information this item refers to */
    /* The fd and struct file that this epitem corresponds to */
    struct epoll_filefd ffd;
    /* Number of active wait queue attached to poll operations */
    int nwait;
    /* List containing poll wait queues */
    struct list_head pwqlist;
    /* The "container" of this item */
    /* The eventpoll that this epitem belongs to */
    struct eventpoll *ep;
    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;
    /* The structure that describe the interested events and the source fd */
    /* The events this epitem is interested in; passed in from user space by epoll_ctl */
    struct epoll_event event;
};
struct epoll_filefd
{
    struct file *file;
    int fd;
};
/* Wait structure used by the poll hooks */
struct eppoll_entry
{
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;
    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;
    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_t wait;
    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};
/* Wrapper struct used by poll queueing */
struct ep_pqueue
{
    poll_table pt;
    struct epitem *epi;
};
/* Used by the ep_send_events() function as callback private data */
struct ep_send_events_data
{
    int maxevents;
    struct epoll_event __user *events;
};
// SYSCALL_DEFINE1 is a macro that defines a system call taking one argument.
// This is the real epoll_create: it checks that size > 0 and, if so, calls epoll_create1 directly,
// so the size in int epoll_create(int size); really is otherwise unused.
SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL; // invalid argument, #define EINVAL 22 /* Invalid argument */
    return sys_epoll_create1(0);
}
/* epoll_create1 */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error;
    struct eventpoll *ep = NULL; // the main epoll structure
    /* Check the EPOLL_* constant for consistency. */
    /* This line has no runtime effect; it is a compile-time check */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
    /* For epoll, the only valid flag at present is CLOEXEC */
    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;
    /*
     * Create the internal data structure ("struct eventpoll").
     */
    /* Allocate a struct eventpoll; allocation and initialization are not covered here */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    /* This creates an anonymous fd. In short:
     * no real file backs an epollfd, so the kernel has to create a "virtual"
     * file, allocate a real struct file for it, and give it a real fd.
     * Two arguments matter here:
     * eventpoll_fops: fops means file operations. When you operate on this
     * (virtual) file, for example by reading it, the function pointers in
     * fops point to the real implementation, much like virtual functions and
     * subclasses in C++. epoll implements only the poll and release (that is,
     * close) operations; all other file operations are handled by the VFS.
     * ep: the struct eventpoll, which is stored as private data in the
     * private_data pointer of the struct file.
     * Put simply, the goal is to get from the fd to the struct file, and from
     * the struct file to the eventpoll structure.
     * If you know a little about Linux character device driver development,
     * this should be easy to follow.
     * Recommended reading: Linux Device Drivers, 3rd edition.
     */
    error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
                             O_RDWR | (flags & O_CLOEXEC));
    if (error < 0)
        ep_free(ep);
    return error;
}
/*
 * Once the epollfd exists, the next step is to add fds to it.
 * Here is epoll_ctl:
 * epfd  the epollfd
 * op    ADD, MOD, or DEL
 * fd    the descriptor to monitor
 * event the events we are interested in
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
                struct epoll_event __user *, event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    error = -EFAULT;
    /*
     * Error handling, and copying the epoll_event structure from user space
     * into kernel space.
     */
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;
    /* Get the "struct file *" for the eventpoll file */
    /* Fetch the struct file. Since epfd is a real fd, the kernel holds a
     * matching struct file for it. That structure was allocated by
     * anon_inode_getfd() in epoll_create1(). */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto error_return;
    /* Get the "struct file *" for the target file */
    /* The fd to monitor has a struct file too; do not confuse the two */
    tfile = fget(fd);
    if (!tfile)
        goto error_fput;
    /* The target file descriptor must support poll */
    error = -EPERM;
    /* If the monitored file does not support poll, nothing can be done.
     * When does a file not support poll? Regular disk files are one example.
     */
    if (!tfile->f_op || !tfile->f_op->poll)
        goto error_tgt_fput;
    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;
    /* An epoll instance cannot monitor itself */
    if (file == tfile || !is_file_epoll(file))
        goto error_tgt_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Fetch our eventpoll structure, allocated in epoll_create1() */
    ep = file->private_data;
    /* The operations below may modify the data structure, so take the lock */
    mutex_lock(&ep->mtx);
    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    /* The kernel allocates one epitem for every monitored fd, and epoll does
     * not allow the same fd to be added twice, so first check whether the fd
     * is already present.
     * ep_find() is simply a red-black tree lookup, much like a C++ STL map,
     * with O(log n) time complexity.
     */
    epi = ep_find(ep, tfile, fd);
    error = -EINVAL;
    switch (op) {
    /* Adding comes first */
    case EPOLL_CTL_ADD:
        if (!epi) {
            /* The lookup found no valid epitem, so this is a first
             * insertion: accept it.
             * Note that the kernel always watches for POLLERR and POLLHUP.
             */
            epds.events |= POLLERR | POLLHUP;
            /* Red-black tree insertion; see the analysis of ep_insert().
             * Arguably, since an insert happens here, the earlier find
             * could have been skipped. */
            error = ep_insert(ep, &epds, tfile, fd);
        }
        else
            /* Found: this is a duplicate add */
            error = -EEXIST;
        break;
    /* Deletion and modification are both fairly simple */
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        }
        else
            error = -ENOENT;
        break;
    }
    mutex_unlock(&ep->mtx);
error_tgt_fput:
    fput(tfile);
error_fput:
    fput(file);
error_return:
    return error;
}
/*
 * ep_insert() is called from epoll_ctl() and does the work of adding one
 * monitored fd to an epollfd.
 * tfile is the kernel-side struct file of the fd.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event, struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    struct ep_pqueue epq;
    /* Check whether the current user has reached the maximum number of watches */
    if (unlikely(atomic_read(&ep->user->epoll_watches) >=
                 max_user_watches))
        return -ENOSPC;
    /* Allocate an epitem from the slab cache */
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;
    /* Item initialization follow here ... */
    /* Initialize the relevant members */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    /* Store the fd to monitor and its struct file */
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    /* Note that the initial value of this pointer is not NULL */
    epi->next = EP_UNACTIVE_PTR;
    /* Initialize the poll table using the queue callback */
    /* Now we finally reach the heart of poll */
    epq.epi = epi;
    /* Initialize a poll_table.
     * This specifies the callback to run when poll_wait (not epoll_wait!) is
     * called, and which events we care about. ep_ptable_queue_proc() is our
     * callback; initially all events are of interest. */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */
    /* This step is crucial and also hard to follow; it is entirely a
     * consequence of the kernel's poll mechanism.
     * First, f_op->poll() is usually just a wrapper that calls the real poll
     * implementation. For a UDP socket the call chain is: f_op->poll(),
     * sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait(), and finally
     * the ep_ptable_queue_proc() callback specified above (a deep call path).
     * Once this step completes, our epitem is associated with the socket, and
     * state changes on the socket are reported through ep_poll_callback().
     * Finally, this function also checks whether the fd already has events
     * ready, and returns them if so. */
    revents = tfile->f_op->poll(tfile, &epq.pt);
    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;
    /* Add the current item to the list of active epoll hook for this file */
    /* Each file chains together all the epitems that monitor it */
    spin_lock(&tfile->f_lock);
    list_add_tail(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);
    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    /* With everything set up, insert the epitem into its eventpoll */
    ep_rbtree_insert(ep, epi);
    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);
    /* If the file is already "ready" we drop it inside the ready list */
    /* If the monitored fd already has an event pending at this point, handle it */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        /* Add the current epitem to the ready list */
        list_add_tail(&epi->rdllink, &ep->rdllist);
        /* Notify waiting tasks that events are available */
        /* Wake whoever is sleeping in epoll_wait */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        /* Also wake whoever is polling this epollfd itself */
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    atomic_inc(&ep->user->epoll_watches);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 0;
error_unregister:
    ep_unregister_pollwait(ep, epi);
    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);
    kmem_cache_free(epi_cache, epi);
    return error;
}
/*
 * This is the key callback. It is called when the state of a monitored fd changes.
 * The key argument is used as an unsigned long integer and carries the events.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait); // recover the epitem from the wait queue entry
    struct eventpoll *ep = epi->ep;               // the eventpoll that owns this epitem
    spin_lock_irqsave(&ep->lock, flags);
    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;
    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    /* None of the events we are interested in */
    if (key && !((unsigned long)key & epi->event.events))
        goto out_unlock;
    /*
     * If we are transferring events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happens during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    /*
     * This may look puzzling, but what it does is fairly simple:
     * if this callback fires while epoll_wait() is already handing events
     * over to the application, the kernel chains the epitem whose event just
     * fired onto a separate list. The event is neither delivered to the
     * application right away nor dropped; it is returned to the user by the
     * next epoll_wait.
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }
    /* If this file is already in the ready list we exit soon */
    /* Put the current epitem on the ready list */
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    /* Wake up epoll_wait */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    /* If the epollfd itself is being polled, wake every member of that queue */
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 1;
}
/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;
    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;
    /* Verify that the area passed by the user is writeable */
    /* A note on this check:
     * the kernel's policy toward applications is "never trust", so data
     * exchanged between the kernel and an application is mostly copied;
     * referencing it through a pointer is not allowed (and sometimes not
     * possible).
     * epoll_wait() has the kernel return data to user space in memory supplied
     * by the user program, so the kernel verifies that this memory range is
     * valid.
     */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
        error = -EFAULT;
        goto error_return;
    }
    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    /* Fetch the struct file of the epollfd; an epollfd is a file too */
    file = fget(epfd);
    if (!file)
        goto error_return;
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    /* Check that it really is an epollfd */
    if (!is_file_epoll(file))
        goto error_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Fetch the eventpoll structure */
    ep = file->private_data;
    /* Time to fish for events ... */
    /* Sleep and wait for events to arrive */
    error = ep_poll(ep, events, maxevents, timeout);
error_fput:
    fput(file);
error_return:
    return error;
}
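The argument checks at the top of epoll_wait are visible from user space as errno values. The sketch below is an illustration under stated assumptions: try_wait is a hypothetical helper, and returning the negated errno is merely a convention chosen here to mirror the kernel's -EINVAL / -EBADF style.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: call epoll_wait() without blocking.
 * Returns the number of ready events (0 or more), or the negated errno
 * value on failure. */
int try_wait(int epfd, int maxevents)
{
    struct epoll_event out[8];
    assert(maxevents <= 8);   /* never ask for more than out[] can hold */

    int n = epoll_wait(epfd, out, maxevents, 0);   /* timeout 0: no sleep */
    return n < 0 ? -errno : n;
}
```

With a valid, empty epoll instance the helper returns 0. Passing maxevents of 0 trips the maxevents <= 0 check, passing a descriptor that is not an epollfd (a pipe end, say) trips the is_file_epoll() check, and both come back as -EINVAL; a closed descriptor fails at fget() and comes back as -EBADF.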
/* This function is what actually puts the process calling epoll_wait to sleep */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait; // wait queue entry
    /*
     * Calculate the timeout by checking for the "infinite" value (-1)
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    /* Compute the sleep time; milliseconds are converted to jiffies (HZ ticks per second) */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
        MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
retry:
    spin_lock_irqsave(&ep->lock, flags);
    res = 0;
    /* If the ready list is not empty, skip the sleep and go straight to work */
    if (list_empty(&ep->rdllist))
    {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        /* Initialize a wait queue entry and get ready to suspend ourselves.
         * Note that current is a macro that stands for the current process. */
        init_waitqueue_entry(&wait, current);       // initialize the entry; wait represents the current process
        __add_wait_queue_exclusive(&ep->wq, &wait); // hang it on the wait queue of the ep structure
        for (;;)
        {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            /* Mark the current process as sleeping but wakeable by signals.
             * Note that this takes effect later; we are not asleep yet! */
            set_current_state(TASK_INTERRUPTIBLE);
            /* If the ready list now has members, or the sleep time has
             * already run out, do not sleep at all */
            if (!list_empty(&ep->rdllist) || !jtimeout)
                break;
            /* If a signal is pending, get up as well */
            if (signal_pending(current))
            {
                res = -EINTR;
                break;
            }
            /* Nothing to do: unlock and sleep */
            spin_unlock_irqrestore(&ep->lock, flags);
            /* We are woken after jtimeout has elapsed. If
             * ep_poll_callback() is called in the meantime, we are woken
             * directly without waiting out the time.
             * To stress it once more: when ep_poll_callback() is called is
             * decided by the implementation of the monitored fd, such as a
             * socket or a device driver, because they own the wait queue
             * head; epoll and the current process simply wait.
             */
            jtimeout = schedule_timeout(jtimeout); // sleep
            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);
        /* We are awake */
        set_current_state(TASK_RUNNING);
    }
    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    /* If all is well and events have fired, start copying the data to user space */
    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
        goto retry;
    return res;
}
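The difference between LT and ET from section 5.2 can be checked with a small experiment. The sketch below is illustrative only: count_reports is a hypothetical helper that registers fd in a fresh epoll instance, never reads from it, and counts how many of several back-to-back epoll_wait calls report it.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register fd for EPOLLIN (plus mode_flags, e.g. EPOLLET)
 * and call epoll_wait() max_calls times without ever reading from fd.
 * Returns how many of those calls reported fd, or -1 on error. */
int count_reports(int fd, unsigned int mode_flags, int max_calls)
{
    assert(fd >= 0 && max_calls >= 0);

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    struct epoll_event ev;
    ev.events = EPOLLIN | mode_flags;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    int reports = 0;
    for (int i = 0; i < max_calls; i++) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: check and return */
        if (n < 0) {
            close(epfd);
            return -1;
        }
        reports += n;   /* n is 0 or 1 because maxevents is 1 */
    }
    close(epfd);
    return reports;
}
```

With one unread byte sitting in a pipe, three calls in LT mode (mode_flags of 0) report the read end three times, because the fd is put back on the ready list each time. In ET mode (EPOLLET) only the first call reports it, which is why an ET program has to drain the descriptor before waiting again.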