Let's start from the bottom up :) Sometimes English is more concise and more fun than Chinese :)
Once a driver such as lance has received data, it always handles it like this:
skb = dev_alloc_skb (....);
skb->protocol = eth_type_trans(skb, dev);
....
netif_rx (skb);
eth_type_trans lives in net/ethernet/eth.c. What it does is simple enough that you can read it yourself ;).
netif_rx is in net/core/dev.c. Assuming neither CONFIG_CPU_IS_SLOW (I like to think my CPU is not slow :)) nor CONFIG_NET_HW_FLOWCONTROL is defined (few people realize that many NICs support flow control, although without support from the switching equipment it is not much use for QoS), the code looks like this:
void netif_rx(struct sk_buff *skb)
{
    skb->stamp = xtime;

    if (backlog.qlen <= netdev_max_backlog) {
        if (backlog.qlen) {
            if (netdev_dropping == 0) {
                skb_queue_tail(&backlog,skb);
                mark_bh(NET_BH);
                return;
            }
            atomic_inc(&netdev_rx_dropped);
            kfree_skb(skb);
            return;
        }
        netdev_dropping = 0;
        skb_queue_tail(&backlog,skb);
        mark_bh(NET_BH);
        return;
    }
    netdev_dropping = 1;
    atomic_inc(&netdev_rx_dropped);
    kfree_skb(skb);
}
xtime is the current time, a struct timeval; it is the same value that gettimeofday reports. backlog is a doubly linked list of sk_buffs. netdev_dropping starts out as 0; in the code above it is set to 1 once the backlog grows past netdev_max_backlog, and while it is 1 new packets are dropped until the queue has been emptied and the flag is cleared again. skb_queue_tail appends an sk_buff to the backlog queue. mark_bh then sets the bit at offset NET_BH in a global bitmask and returns. With that bit set, the next time the kernel runs its bottom halves (for example at the next schedule), working down from TIMER_BH, it finds NET_BH marked and calls the handler registered at that priority, net_bh. That callback is installed by a call to init_bh in net_dev_init. If you are curious, you can use init_bh to install your own backlog handler.
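The queue/drop decision in netif_rx is easier to follow with the kernel types stripped away. The following is only a user-space sketch of the listing above, not kernel code: `model_state`, `MODEL_MAX_BACKLOG` and the `model_*` functions are hypothetical names, and the queue is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for netdev_max_backlog (value chosen for the demo). */
enum { MODEL_MAX_BACKLOG = 3 };

/* Hypothetical model of the three globals netif_rx touches. */
struct model_state {
    int qlen;        /* stands in for backlog.qlen      */
    int dropping;    /* stands in for netdev_dropping   */
    int rx_dropped;  /* stands in for netdev_rx_dropped */
};

/* Same branch structure as netif_rx: returns 1 if queued, 0 if dropped. */
static int model_netif_rx(struct model_state *s)
{
    if (s->qlen <= MODEL_MAX_BACKLOG) {
        if (s->qlen) {
            if (s->dropping == 0) {
                s->qlen++;          /* skb_queue_tail + mark_bh */
                return 1;
            }
            s->rx_dropped++;        /* still congested: drop */
            return 0;
        }
        s->dropping = 0;            /* queue is empty: accept again */
        s->qlen++;
        return 1;
    }
    s->dropping = 1;                /* queue overflowed: start dropping */
    s->rx_dropped++;
    return 0;
}

/* Stands in for a single skb_dequeue. */
static void model_dequeue_one(struct model_state *s)
{
    if (s->qlen > 0)
        s->qlen--;
}

/* Stands in for net_bh emptying the queue and clearing the flag. */
static void model_net_bh_drain(struct model_state *s)
{
    s->qlen = 0;
    s->dropping = 0;
}
```

The point of the sketch is the hysteresis: after one overflow, packets keep being dropped even when there is room again, and are only accepted once the queue has been emptied.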
Linux uses a rather odd strategy here for handing over control and handling processing priority. A second function, net_bh, processes the packets taken from backlog. It looks like this (assuming the CONFIG_BRIDGE option is not defined):
void net_bh(void)
{
    struct packet_type *ptype;
    struct packet_type *pt_prev;
    unsigned short type;
    unsigned long start_time = jiffies;

    NET_PROFILE_ENTER(net_bh);

    if (qdisc_head.forw != &qdisc_head)
        qdisc_run_queues();

    while (!skb_queue_empty(&backlog))
    {
        struct sk_buff * skb;

        if (jiffies - start_time > 1)
            goto net_bh_break;

        skb = skb_dequeue(&backlog);

#ifdef CONFIG_NET_FASTROUTE
        if (skb->pkt_type == PACKET_FASTROUTE) {
            dev_queue_xmit(skb);
            continue;
        }
#endif

        /* XXX until we figure out every place to modify.. */
        skb->h.raw = skb->nh.raw = skb->data;

        if(skb->mac.raw < skb->head || skb->mac.raw > skb->data){
            printk(KERN_CRIT "%s: wrong mac.raw ptr, proto=%04x\n",
                   skb->dev->name, skb->protocol);
            kfree_skb(skb);
            continue;
        }

        type = skb->protocol;

        pt_prev = NULL;
        for (ptype = ptype_all; ptype!=NULL; ptype=ptype->next)
        {
            if (!ptype->dev || ptype->dev == skb->dev) {
                if(pt_prev)
                {
                    struct sk_buff *skb2=skb_clone(skb, GFP_ATOMIC);
                    if(skb2)
                        pt_prev->func(skb2,skb->dev, pt_prev);
                }
                pt_prev=ptype;
            }
        }

        for (ptype = ptype_base[ntohs(type)&15]; ptype != NULL;
             ptype = ptype->next)
        {
            if (ptype->type == type && (!ptype->dev ||
                ptype->dev==skb->dev))
            {
                if(pt_prev)
                {
                    struct sk_buff *skb2;
                    skb2=skb_clone(skb, GFP_ATOMIC);
                    if(skb2)
                        pt_prev->func(skb2, skb->dev, pt_prev);
                }
                pt_prev=ptype;
            }
        } /* End of protocol list loop */

        if(pt_prev)
            pt_prev->func(skb, skb->dev, pt_prev);
        else {
            kfree_skb(skb);
        }
    } /* End of queue loop */

    if (qdisc_head.forw != &qdisc_head)
        qdisc_run_queues();

    netdev_dropping = 0;
    NET_PROFILE_LEAVE(net_bh);
    return;

net_bh_break:
    mark_bh(NET_BH);
    NET_PROFILE_LEAVE(net_bh);
    return;
}
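This function is actually quite simple. NET_PROFILE_ENTER is of course a macro; it expands to the net_profile_enter function in include/net/profile.h, and NET_PROFILE_LEAVE expands to net_profile_leave in the same file. Have a look if you are interested :) and help clear things up for me.
qdisc_head is a global variable of type Qdisc_head. The name is short for "queueing discipline", the packet scheduler (traffic control) layer that manages the transmit queues, not some kind of "quick discovery". If the list is not empty, qdisc_run_queues is called to run the pending queues. Both the variable and the function live in net/sched/sch_generic.c. I have not dug into the details of these queues, so if you read that file, fill me in, xixi
The rest is fairly simple, so I will not go through it. The points worth noting are:
1. Remember ptype_all and ptype_base? They are the lists that dev_add_pack adds entries to, and in the end each entry is invoked through pt_prev->func(....)
2. The system processes ptype_all first and only then ptype_base
3. Before each sk_buff is dequeued, net_bh checks how long it has been running; once more than 1 jiffy has elapsed (a jiffy is 10 ms on x86, where HZ is 100) it stops and waits for the next invocation
4. skb_clone is a fast copy: it does not copy the packet data, it only duplicates the header structure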
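The pt_prev trick in net_bh (every matching handler except the last gets a clone, the last one gets the original) can be sketched in user space as follows. This is a simplified model under stated assumptions: `model_skb`, `model_ptype`, `model_count_rx` and `model_deliver` are hypothetical names, the per-device match and the `ntohs(type)&15` hash bucket are left out, and a struct copy stands in for skb_clone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct sk_buff. */
struct model_skb {
    unsigned short protocol;
    int is_clone;
};

/* Hypothetical stand-in for struct packet_type. */
struct model_ptype {
    unsigned short type;
    int (*func)(struct model_skb *skb, struct model_ptype *pt);
    int calls;        /* packets this handler has seen   */
    int clones_seen;  /* how many of those were clones   */
    struct model_ptype *next;
};

/* Example handler: just counts what it receives. */
static int model_count_rx(struct model_skb *skb, struct model_ptype *pt)
{
    pt->calls++;
    if (skb->is_clone)
        pt->clones_seen++;
    return 0;
}

/* Hand a clone to the previously matched handler, if there is one. */
static int model_flush_prev(struct model_skb *skb, struct model_ptype *pt_prev)
{
    struct model_skb copy;

    if (pt_prev == NULL)
        return 0;
    copy = *skb;                 /* stands in for skb_clone() */
    copy.is_clone = 1;
    pt_prev->func(&copy, pt_prev);
    return 1;
}

/*
 * Deliver one packet: "all" handlers first, then type-matched handlers.
 * Returns the number of clones made, or -1 if nobody wanted the packet
 * (the kernel calls kfree_skb in that case).
 */
static int model_deliver(struct model_skb *skb,
                         struct model_ptype *all, struct model_ptype *base)
{
    struct model_ptype *pt, *pt_prev = NULL;
    int clones = 0;

    for (pt = all; pt != NULL; pt = pt->next) {
        clones += model_flush_prev(skb, pt_prev);
        pt_prev = pt;
    }
    for (pt = base; pt != NULL; pt = pt->next) {
        if (pt->type == skb->protocol) {
            clones += model_flush_prev(skb, pt_prev);
            pt_prev = pt;
        }
    }
    if (pt_prev == NULL)
        return -1;
    pt_prev->func(skb, pt_prev); /* last handler gets the original */
    return clones;
}
```

Deferring each call by one step means that N interested handlers cost N-1 clones rather than N, and a packet with a single taker is never cloned at all.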
packet functions
Take a look at packet_create in net/packet/af_packet.c. This is the callback registered through packet_proto_init. Assuming CONFIG_SOCK_PACKET is defined, the code tidies up to what is shown below. The function is called when a user creates a link-layer socket:
static int packet_create(struct socket *sock, int protocol)
{
    struct sock *sk;
    int err;

    if (!capable(CAP_NET_RAW))
        return -EPERM;
    if (sock->type != SOCK_DGRAM && sock->type != SOCK_RAW
        && sock->type != SOCK_PACKET
       )
        return -ESOCKTNOSUPPORT;
    // Only socket(AF_PACKET, [SOCK_DGRAM, SOCK_RAW], ...)
    // or socket(AF_INET, SOCK_PACKET, ...) gets past this check

    sock->state = SS_UNCONNECTED;
    MOD_INC_USE_COUNT;

    err = -ENOBUFS;
    sk = sk_alloc(PF_PACKET, GFP_KERNEL, 1);
    if (sk == NULL)
        goto out;

    sk->reuse = 1;
    sock->ops = &packet_ops;
    if (sock->type == SOCK_PACKET)
        sock->ops = &packet_ops_spkt;
    // An old-style SOCK_PACKET socket uses packet_ops_spkt;
    // an AF_PACKET socket uses packet_ops as the socket's
    // set of callbacks
    sock_init_data(sock,sk);

    sk->protinfo.af_packet = kmalloc(sizeof(struct packet_opt),
                                     GFP_KERNEL);
    // protinfo is a union
    if (sk->protinfo.af_packet == NULL)
        goto out_free;
    memset(sk->protinfo.af_packet, 0, sizeof(struct packet_opt));
    sk->zapped=0;
    // For a TCP socket, zapped means the socket has received a RST
    sk->family = PF_PACKET;
    sk->num = protocol;

    sk->protinfo.af_packet->prot_hook.func = packet_rcv;
    if (sock->type == SOCK_PACKET)
        sk->protinfo.af_packet->prot_hook.func = packet_rcv_spkt;
    sk->protinfo.af_packet->prot_hook.data = (void *)sk;

    if (protocol) {
        sk->protinfo.af_packet->prot_hook.type = protocol;
        dev_add_pack(&sk->protinfo.af_packet->prot_hook);
        // Notice that a non-zero protocol gets you a dev_add_pack
        // here too. It cannot do what phrack55-12 does, though:
        // by then your data is already in user address space, and
        // the kernel's copy cannot be modified
        sk->protinfo.af_packet->running = 1;
    }

    sklist_insert_socket(&packet_sklist, sk);
    // This function should be very simple; it is in net/core/sock.c.
    // packet_sklist is used to notify every socket of interface
    // state changes, including UP/DOWN/MULTICAST_LIST_CHANGE.
    // The callback is hooked up with the register_netdevice_notifier
    // mechanism we talked about earlier
    return(0);

out_free:
    sk_free(sk);
out:
    MOD_DEC_USE_COUNT;
    return err;
}
An application can receive link-layer packets only after it has created a packet socket, and in packet_create it is only a non-zero protocol that leads to dev_add_pack, which is what lets your socket receive data. Seen this way, dev_add_pack really is a key function for rewriting data at the low level. So next we will look at how the callback func installed by dev_add_pack is used.
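From user space, the path into packet_create is a single socket() call. The sketch below assumes a Linux system with the usual glibc and kernel headers; `open_link_socket` is a hypothetical helper name, not something from the kernel source. Note that the protocol argument is passed in network byte order, which matches packet_create storing it directly in prot_hook.type.

```c
#include <assert.h>
#include <arpa/inet.h>       /* htons */
#include <linux/if_ether.h>  /* ETH_P_ALL, ETH_P_IP */
#include <sys/socket.h>      /* socket, AF_PACKET, SOCK_RAW */
#include <unistd.h>          /* close */

/*
 * Hypothetical helper: open a link-layer socket for one EtherType
 * (or ETH_P_ALL for every protocol). Returns the file descriptor,
 * or -1 with errno set; without CAP_NET_RAW the kernel refuses the
 * request, as the capable() check in packet_create shows.
 */
static int open_link_socket(unsigned short ethertype)
{
    /* SOCK_RAW keeps the link-level header; SOCK_DGRAM would strip it. */
    return socket(AF_PACKET, SOCK_RAW, htons(ethertype));
}
```

A caller would typically run this as root, check for -1, and then read frames with recvfrom(), closing the descriptor when done.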
packet_rcv
As we already know, when a PACKET socket is created with socket(AF_PACKET, ..), the function that dev_add_pack registers is packet_rcv. Here is that function, from net/packet/af_packet.c:
static int packet_rcv(struct sk_buff *skb, struct device *dev,
                      struct packet_type *pt)
{
    struct sock *sk;
    struct sockaddr_ll *sll = (struct sockaddr_ll*)skb->cb;

    sk = (struct sock *) pt->data;
    // We set data = sk in packet_create, remember?

    if (skb->pkt_type == PACKET_LOOPBACK) {
        kfree_skb(skb);
        return 0;
    }

    skb->dev = dev;

    sll->sll_family = AF_PACKET;
    sll->sll_hatype = dev->type;
    sll->sll_protocol = skb->protocol;
    sll->sll_pkttype = skb->pkt_type;
    sll->sll_ifindex = dev->ifindex;
    sll->sll_halen = 0;

    if (dev->hard_header_parse)
        sll->sll_halen = dev->hard_header_parse(skb, sll->sll_addr);

    if (dev->hard_header) {
        if (sk->type != SOCK_DGRAM)
            skb_push(skb, skb->data - skb->mac.raw);
        else if (skb->pkt_type == PACKET_OUTGOING)
            skb_pull(skb, skb->nh.raw - skb->data);
    }

    if (sock_queue_rcv_skb(sk,skb)<0)
    {
        kfree_skb(skb);
        return 0;
    }
    return(0);
}
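Where does the pkt_type field get set? (For Ethernet frames coming in from the wire, eth_type_trans fills it in.)
A few more functions need explaining here.
skb_pull is in include/linux/skbuff.h: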
extern __inline__ char *__skb_pull(struct sk_buff *skb,
                                   unsigned int len)
{
    skb->len-=len;
    return skb->data+=len;
}

extern __inline__ unsigned char * skb_pull(struct sk_buff *skb,
                                           unsigned int len)
{
    if (len > skb->len)
        return NULL;
    return __skb_pull(skb,len);
}
All it does is strip data from the front of the buffer: the data pointer moves forward by len and the length shrinks to match.
Likewise, skb_push is in include/linux/skbuff.h:
extern __inline__ unsigned char *__skb_push(struct sk_buff *skb,
                                            unsigned int len)
{
    skb->data-=len;
    skb->len+=len;
    return skb->data;
}

extern __inline__ unsigned char *skb_push(struct sk_buff *skb,
                                          unsigned int len)
{
    skb->data-=len;
    skb->len+=len;
    if(skb->data<skb->head)
    {
        __label__ here;
        skb_under_panic(skb, len, &&here);
here:   ;
    }
    return skb->data;
}
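This adjustment makes the data longer, the opposite of skb_pull. skb_push is evidently the stricter of the two: it calls skb_under_panic if data ends up in front of head, whereas skb_pull simply returns NULL when asked to remove more than len bytes.
In packet_rcv above, if the device has an explicit link-level header, the code has to decide whether to adjust the data pointer and length. If sk->type is not SOCK_DGRAM, the program is interested in the whole packet, link-level (ll) addresses included, so the data area is extended to take in the ll header. Otherwise, if the packet is an outgoing one, the data is trimmed so that it starts at the network-layer header; the amount to cut is the difference between nh.raw and data. (nh = network header)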
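The pointer arithmetic behind skb_push and skb_pull can be tried out in user space. This is a minimal sketch assuming a flat byte buffer: `model_buf`, `model_pull` and `model_push` are hypothetical names, and where the kernel's skb_push panics on underflow this model just returns NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the three sk_buff fields involved. */
struct model_buf {
    unsigned char *head;  /* start of the allocated buffer    */
    unsigned char *data;  /* start of the currently valid data */
    unsigned int len;     /* number of valid bytes from data   */
};

/* Like skb_pull: drop len bytes from the front of the data. */
static unsigned char *model_pull(struct model_buf *b, unsigned int len)
{
    if (len > b->len)
        return NULL;
    b->len -= len;
    return b->data += len;
}

/* Like skb_push: expose len more bytes in front of the data. */
static unsigned char *model_push(struct model_buf *b, unsigned int len)
{
    /* Not enough headroom: the kernel would call skb_under_panic here. */
    if ((size_t)(b->data - b->head) < len)
        return NULL;
    b->data -= len;
    b->len += len;
    return b->data;
}
```

With a 14-byte Ethernet header sitting just in front of data, pushing `data - mac` bytes makes the header visible again (the SOCK_RAW case), and pulling the same amount hides it again.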
After all this, the skb is ready to be handed over, so sock_queue_rcv_skb is called to add it to the receive buffer of the corresponding socket.