
The Technical Internals of IO Multiplexing

Preface

From start to finish, the point of multiplexing is to minimize switches between user mode and kernel mode and to make full use of the CPUs of the multi-core era. It is the essential route to zero-blocking IO in high-performance network programming and the antidote to blocked IO; the runtimes underneath PHP and Java both rely heavily on this technique.

Reference books

Linux Device Drivers
Advanced Programming in the UNIX Environment, 3rd Edition
UNIX Network Programming, Volume 1
Linux System Programming
C++ Multithreaded Server Programming (the muduo book)
Linux Kernel Development

Referenced article: a technical post published on April 16 by the WeChat account "Linux 内核之旅":

https://mp.weixin.qq.com/s?__biz=MzI3NzA5MzUxNA==&mid=2664609790&idx=1&sn=1e8db814b07314f11987d05a2d39eff4&chksm=f04d9a1bc73a130dd2a038ea8897318727ef01dae9be6d47dbf3de330bbf1922616eed453bc1&mpshare=1&scene=1&srcid=0416SDDqQp39OdUbCdubBrtC&sharer_sharetime=1621227010848&sharer_shareid=5e515bbf7ee8c90073cb9ee88a70946c&exportkey=AfO8rNOj0ZKMItebFDULqek%3D&pass_ticket=UuoC59lXqHGXOGzMSm5ofeOWMG2yapNgE2cE%2BEH5geoLuDOPDapqVFgJHdu2VF2D&wx_header=0#rd

I. The Stumbling Blocks of High-Performance Programming

About sleeping

1. The key interruptible simple-sleep functions in driver programming

/* initialize a wait queue head */
init_waitqueue_head(wait_queue_head_t *queue);

/* put the caller to sleep on the queue until 'condition' becomes true */
wait_event_interruptible(wq, condition);

/* wake up the processes sleeping interruptibly on the queue */
wake_up_interruptible(wait_queue_head_t *queue);
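
To make the three calls concrete, here is a minimal kernel-module-style sketch; the names demo_wq, data_ready, demo_wait and demo_publish are invented for illustration, not taken from any real driver:

#include <linux/wait.h>
#include <linux/errno.h>

static wait_queue_head_t demo_wq;        /* the wait queue head */
static int data_ready;                   /* the condition we sleep on */

static void demo_init(void)
{
        init_waitqueue_head(&demo_wq);   /* initialize the wait queue */
}

/* consumer side: sleep until data_ready becomes true */
static int demo_wait(void)
{
        if (wait_event_interruptible(demo_wq, data_ready))
                return -ERESTARTSYS;     /* woken by a signal, not by data */
        return 0;
}

/* producer side: publish the data, then wake the interruptible sleepers */
static void demo_publish(void)
{
        data_ready = 1;
        wake_up_interruptible(&demo_wq);
}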

2. The data structures behind wait queues

struct wait_queue_head {
        spinlock_t              lock;
        struct list_head        head;
};
typedef struct wait_queue_head wait_queue_head_t;

struct wait_queue_entry {
        unsigned int            flags;
        void                    *private;
        wait_queue_func_t       func;
        struct list_head        entry;
};

These two data structures matter a great deal:

The relationship between them is easy to work out. The expression (long)&(((type*)0)->field), the classic offsetof trick, gives us the offset of

struct list_head        entry;

within wait_queue_entry. Walking the doubly linked list hands us the address of each entry node, and subtracting that offset recovers the address of the enclosing wait_queue_entry, as the figure and the sketch below show.

(figure: locating wait_queue_entry from its embedded list_head)
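
The same offset trick can be demonstrated in user space; this sketch (with a simplified copy of the structures; names are illustrative) is essentially what the kernel's offsetof and container_of macros do:

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* simplified: the func member is omitted to keep the demo self-contained */
struct wait_queue_entry {
        unsigned int      flags;
        void             *private;
        struct list_head  entry;
};

/* offset of 'field' inside 'type': the (type*)0 trick from the text */
#define my_offsetof(type, field) ((long)&(((type *)0)->field))

/* recover the enclosing structure from a pointer to its embedded member */
#define my_container_of(ptr, type, field) \
        ((type *)((char *)(ptr) - my_offsetof(type, field)))

int main(void)
{
        struct wait_queue_entry wqe;
        struct list_head *node = &wqe.entry;     /* what the linked list gives us */
        struct wait_queue_entry *back =
                my_container_of(node, struct wait_queue_entry, entry);

        printf("offset = %ld, recovered %p, original %p\n",
               my_offsetof(struct wait_queue_entry, entry),
               (void *)back, (void *)&wqe);
        return 0;
}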

3. What does wait_event_interruptible actually do?

The source of wait_event_interruptible:

For the full code, see wait.h among the kernel headers under /usr/src on a Linux system.

#define __wait_event_interruptible(wq_head, condition)                          \
        ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,             \
                      schedule())

The implementation of ___wait_event:

/*
 * The below macro ___wait_event() has an explicit shadow of the __ret
 * variable when used from the wait_event_*() macros.
 *
 * This is so that both can use the ___wait_cond_timeout() construct
 * to wrap the condition.
 *
 * The type inconsistency of the wait_event_*() __ret variable is also
 * on purpose; we use long where we can return timeout values and int
 * otherwise.
 */

#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)           \
({                                                                              \
        __label__ __out;                                                        \
        struct wait_queue_entry __wq_entry;                                     \
        long __ret = ret;       /* explicit shadow */                           \
                                                                                \
        init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);        \
        for (;;) {                                                              \
                long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
                                                                                \
                if (condition)                                                  \
                        break;                                                  \
                                                                                \
                if (___wait_is_interruptible(state) && __int) {                 \
                        __ret = __int;                                          \
                        goto __out;                                             \
                }                                                               \
                                                                                \
                cmd;                                                            \
        }                                                                       \
        finish_wait(&wq_head, &__wq_entry);                                     \
__out:  __ret;                                                                  \
})

A flow chart of the whole process:

(figure: flow of ___wait_event)

4. Key takeaways

1. After waking up, one important step is to re-check whether the condition being waited on has actually become true.

2. The kernel places the sleeping process on a linked list (the wait queue) and then calls schedule() to give up the CPU.

3. The more waiting processes there are, the longer the list grows, and scanning and waking that list becomes more expensive for the kernel.

5. Reflection

We need to ask ourselves: when exactly does IO block? Following that thread, let us move on to network IO.

A second look at blocking and non-blocking IO in the Linux kernel

1. What is a file descriptor?

The kernel accesses files through file descriptors. A file descriptor is a non-negative integer: when an existing file is opened or a new one is created, the kernel returns a descriptor, and subsequent reads and writes name the target file by that descriptor.

2. How can we see that file descriptors really exist?

The simplest demonstration: every call to open() visibly returns the next free small integer, and a process's live descriptors can be listed under /proc/<pid>/fd. A tiny sketch follows.
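
A user-space sketch that makes the descriptors visible (the file path is arbitrary):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* 0, 1 and 2 are already taken by stdin, stdout and stderr,
           so the kernel normally hands out 3 next, then 4, and so on */
        int fd1 = open("/etc/hostname", O_RDONLY);
        int fd2 = open("/etc/hostname", O_RDONLY);

        printf("first open  -> fd %d\n", fd1);
        printf("second open -> fd %d\n", fd2);

        /* the same descriptors also show up under /proc/self/fd */
        close(fd1);
        close(fd2);
        return 0;
}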

3. What does a process's file table look like?

(figure: the per-process open file table)

4. Under what conditions does IO block a process?

1) The process calls read, but the descriptor's buffer holds no data that can be read.

2) The process calls write, but the descriptor has no free buffer space left to write into.

5. How can a descriptor be made non-blocking?

1) The fcntl function (see the sketch after this list)

2) The ioctl function (e.g. FIONBIO)

3) Setting the O_NONBLOCK flag when calling open
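
A minimal sketch of the fcntl route (the other two are analogous): read the current flags, OR in O_NONBLOCK, and write them back.

#include <fcntl.h>

/* switch an already-open descriptor to non-blocking mode */
int set_nonblocking(int fd)
{
        int flags = fcntl(fd, F_GETFL, 0);      /* fetch the current file status flags */

        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}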

6. How would we implement blocking and non-blocking IO in a driver of our own?

Taking read as the example (this is the scullpipe read method from Linux Device Drivers):

ssize_t scull_p_read(struct file *filp, char __user *buf, size_t count,
                loff_t *f_pos)
{
	struct scull_pipe *dev = filp->private_data;

	if (down_interruptible(&dev->sem))
		return -ERESTARTSYS;

	while(dev->rp == dev->wp) {	/* nothing to read */
		up(&dev->sem);		/* release the lock before sleeping */

		if (filp->f_flags & O_NONBLOCK)
		{
			return -EAGAIN;
		}

		/* sleep until the writer adds data; a signal interrupts the wait */
		if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
		{
			return -EINTR;
		}

		if (down_interruptible(&dev->sem))
		{
			return -EINTR;
		}
	}

	/* data is available: copy at most one contiguous chunk of it */
	if (dev->wp > dev->rp)
		count = min(count, (size_t)(dev->wp - dev->rp));
	else
		count = min(count, (size_t)(dev->end - dev->rp));

	if (copy_to_user(buf, dev->rp, count))
	{
		up(&dev->sem);
		return -EFAULT;
	}
	
	dev->rp += count;

	if (dev->rp == dev->end) {
		dev->rp = dev->buffer;
	}

	up(&dev->sem);
	wake_up_interruptible(&dev->outq);	/* wake any writer waiting for space */

	return count;
}

Working through the logic:

(figure: control flow of scull_p_read)

7. Observations

1. The first thing the process does after waking is re-check the condition it slept on: is there now data in the buffer to read?

2. Blocking and non-blocking IO share the same problem: we do not know when the descriptor becomes readable, and likewise we need to know when it becomes writable.

3. In network programming we must keep the number of read and write calls down; calling them only when they can make progress reduces user/kernel switches and improves performance.

4. Note that the kernel's copy_from_user and copy_to_user can also put the process to sleep.

Reflection

To achieve high-performance network programming we need a system call that tells us when a descriptor can be read and when it can be written, so that every read and write we issue makes progress and never blocks.

II. How Linux Network Programming Evolved

Preface

Since the birth of UNIX, the road to high-performance programming on UNIX and Linux has been a long exploration, and the BSD (Berkeley) socket API has come to be used in every field.

The early multi-process era

1. The parent process accepts, the child process does the IO

This relies on fork's page-copying mechanism (early systems copied the address space outright; today Linux copies pages on write). The parent accepts connections without any lock and forks (or pre-forks) child processes in which the read/write IO is performed. A minimal sketch of the fork-per-connection variant follows the figure.

(figure: parent accepts, one child per connection does the IO)
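
A minimal sketch of the fork-per-connection variant; the port number, the echo loop and the trimmed error handling are all illustrative choices, not code from any of the books:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        int listenfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);

        bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(listenfd, 128);
        signal(SIGCHLD, SIG_IGN);                /* let the kernel reap dead children */

        for (;;) {
                int connfd = accept(listenfd, NULL, NULL);   /* parent blocks here */
                if (connfd < 0)
                        continue;
                if (fork() == 0) {               /* child: pages shared copy-on-write */
                        char buf[4096];
                        ssize_t n;

                        close(listenfd);         /* the child only needs its connection */
                        while ((n = read(connfd, buf, sizeof(buf))) > 0)
                                write(connfd, buf, n);       /* blocking echo loop */
                        close(connfd);
                        _exit(0);
                }
                close(connfd);                   /* parent drops its reference */
        }
}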

2. Multiple processes accept: the thundering herd and locking

Here several processes all block in accept on the same listening socket. When a client connects, every one of them is woken up, the classic thundering-herd problem, so accept has to be protected with a lock. Recall from earlier that sleeping processes are placed on a wait-queue list and woken from it in turn.

(figure: multiple processes contending on accept behind a lock)
Limitations:

Having too many child processes hurts in several ways:
1. With many connections, memory becomes the bottleneck; it is hard to keep a server above a few thousand connections.
2. CPU time-slicing across that many processes is inefficient.
3. User/kernel mode switches become very frequent.
4. Serving many connections means buying many machines, so the maintenance cost is high.
5. Constantly creating and destroying processes is expensive.

3. The beginning of multithreaded programming

Threads are lighter than processes, consume fewer system resources, and can communicate through shared memory, which is faster than inter-process IPC.

1) One thread per client socket, responsible for that socket's IO

2) Maintain a thread pool in which each thread calls accept itself

4. The birth and growth of IO multiplexing

Non-blocking sockets combined with multiplexing long ago became the mainstream approach.

1) About select

Why do we need select? Take TCP connect as an example: its round-trip time varies wildly, and the default timeout can reach 75 seconds in the extreme case. We need a system call that tells us when the socket descriptor is ready, and the classic recipe is a non-blocking socket plus select.

With select we register the sets of descriptors whose readability and writability we care about.

(figure: select with its read and write descriptor sets)

See page 351 of UNIX Network Programming, Volume 1, for the classic non-blocking IO example; a condensed sketch of the same pattern follows the figure.

(figure: non-blocking connect driven by select)
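
A condensed sketch of the non-blocking connect pattern (not the book's code; the 10-second timeout is arbitrary):

#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

/* returns 0 once the TCP handshake has finished, -1 on error or timeout */
int connect_nonblock(int sockfd, const struct sockaddr *sa, socklen_t salen)
{
        int flags = fcntl(sockfd, F_GETFL, 0);
        int err = 0;
        socklen_t len = sizeof(err);
        fd_set rset, wset;
        struct timeval tv = { .tv_sec = 10, .tv_usec = 0 };

        fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);

        if (connect(sockfd, sa, salen) == 0)
                goto done;                       /* connected at once (e.g. loopback) */
        if (errno != EINPROGRESS)
                return -1;                       /* a real error */

        FD_ZERO(&rset);
        FD_SET(sockfd, &rset);
        wset = rset;

        /* sleep in select until the socket becomes readable or writable */
        if (select(sockfd + 1, &rset, &wset, NULL, &tv) == 0) {
                errno = ETIMEDOUT;
                return -1;
        }

        /* readiness alone is not enough: fetch the pending socket error, if any */
        if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err) {
                if (err)
                        errno = err;
                return -1;
        }
done:
        fcntl(sockfd, F_SETFL, flags);           /* restore the original flags */
        return 0;
}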

Limitations

1. select tells us only that some socket became ready; by itself the return value does not say whether the event was readability or writability.

2. With many descriptors it does not point out which ones are readable and which are writable; we have to scan every set with FD_ISSET ourselves.

2) About poll

poll can do everything select does for readability and writability, and in addition reports the specific events per descriptor.

(figure: the poll system call)

(figure: how to use poll)
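
Since the figures are not reproduced here, a minimal sketch of the poll loop (watching stdin with a one-second timeout, both arbitrary choices) looks like this:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* watch stdin for readability; a real server would list its sockets here */
        struct pollfd fds[1];

        fds[0].fd = STDIN_FILENO;
        fds[0].events = POLLIN;                  /* the events we care about */

        for (;;) {
                int n = poll(fds, 1, 1000);      /* wait up to 1000 ms */

                if (n == 0) {
                        printf("timeout, nothing ready\n");
                        continue;
                }
                if (n < 0)
                        break;                   /* error or signal */

                /* unlike select, the kernel reports what happened in revents */
                if (fds[0].revents & POLLIN) {
                        char buf[256];
                        ssize_t r = read(fds[0].fd, buf, sizeof(buf));

                        if (r <= 0)
                                break;
                        printf("read %zd bytes\n", r);
                }
        }
        return 0;
}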

Limitations

poll lets us know, for a socket that became ready, whether it is ready for reading or for writing, but it still cannot return only the descriptors that are ready: the complexity remains O(n), and on every call we still loop over the whole set looking for the readable or writable ones.

3) The server, reworked around multiplexing

The parent process uses multiplexing to accept connections and hands them to child processes over an anonymous UNIX-domain socketpair; each child then uses multiplexing to handle the read/write IO. A sketch of how a descriptor is actually passed across the socketpair follows the figure.

(figure: master accepts, workers receive connections over a socketpair)
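
Handing an accepted connection to a worker over the socketpair relies on SCM_RIGHTS ancillary data. A minimal sketch of the sending side (the receiver mirrors it with recvmsg; the helper name send_fd is invented):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* pass the open descriptor 'fd' to the peer on the UNIX-domain socket 'chan' */
int send_fd(int chan, int fd)
{
        struct msghdr msg;
        struct iovec iov;
        struct cmsghdr *cmsg;
        char payload = 'F';                          /* at least one byte must be sent */
        char ctrl[CMSG_SPACE(sizeof(int))];

        memset(&msg, 0, sizeof(msg));
        memset(ctrl, 0, sizeof(ctrl));

        iov.iov_base = &payload;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;                /* "I am passing descriptors" */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));   /* the descriptor travels here */

        return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}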

4) Advanced polling facilities on other systems, in essence driver-level solutions

Solaris has the /dev/poll device driver, FreeBSD has kqueue (macOS offers this system call as well), and Windows provides IO completion ports, which can also be used for IO multiplexing.

Summary

Whether we use select or poll, the same problem remains: when connections are numerous, every call has to check the flags of every single connection to decide whether it is readable or writable, so with a huge connection count the cost is O(n) per call. Could we instead have a system call that returns only the descriptors that are already ready to read or write?

III. Using epoll, and Its Details

1. The epoll system calls in detail

(figures: prototypes of the epoll system calls)
Events:

(figure: epoll event flags)
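
The figures are not reproduced here, but what they show are the standard declarations documented in epoll(7) and its related man pages; listed roughly (not copied from the header), they are:

#include <stdint.h>

typedef union epoll_data {
        void     *ptr;
        int       fd;
        uint32_t  u32;
        uint64_t  u64;
} epoll_data_t;

struct epoll_event {
        uint32_t     events;    /* EPOLLIN, EPOLLOUT, EPOLLERR, EPOLLHUP, EPOLLET, EPOLLONESHOT ... */
        epoll_data_t data;      /* user cookie: an fd, a pointer or an integer */
};

int epoll_create(int size);     /* 'size' is ignored nowadays but must be > 0 */
int epoll_create1(int flags);   /* e.g. EPOLL_CLOEXEC */
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);   /* op: EPOLL_CTL_ADD/MOD/DEL */
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);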

2. Using epoll

(figures: an epoll usage example)
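
A minimal sketch of the usage pattern the figures illustrate, assuming listenfd is an already-bound, listening socket (the helper name event_loop and the echo behaviour are invented for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* a simple level-triggered event loop around an existing listening socket */
void event_loop(int listenfd)
{
        struct epoll_event ev, events[MAX_EVENTS];
        int epfd = epoll_create1(0);

        if (epfd < 0) {
                perror("epoll_create1");
                exit(1);
        }

        ev.events = EPOLLIN;                     /* watch the listening socket for new connections */
        ev.data.fd = listenfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

        for (;;) {
                int n = epoll_wait(epfd, events, MAX_EVENTS, -1);   /* block until something is ready */

                for (int i = 0; i < n; i++) {
                        int fd = events[i].data.fd;

                        if (fd == listenfd) {                       /* new connection */
                                int connfd = accept(listenfd, NULL, NULL);
                                if (connfd < 0)
                                        continue;
                                ev.events = EPOLLIN;
                                ev.data.fd = connfd;
                                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &ev);
                        } else if (events[i].events & EPOLLIN) {    /* data from a client */
                                char buf[4096];
                                ssize_t r = read(fd, buf, sizeof(buf));

                                if (r <= 0)
                                        close(fd);                  /* closing also removes it from epoll */
                                else
                                        write(fd, buf, r);          /* echo it back */
                        }
                }
        }
}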

3. epoll compared with poll and select

(figure: comparison of select, poll and epoll)

4. Level-triggered mode

(figure: level-triggered notification)

5. Edge-triggered mode

(figure: edge-triggered notification)
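
In level-triggered mode epoll keeps reporting a descriptor for as long as the condition holds; in edge-triggered mode it reports only the transition, so the reader has to drain the socket itself. A minimal sketch of the edge-triggered read loop on a non-blocking socket:

#include <errno.h>
#include <unistd.h>

/* drain a non-blocking socket that was registered with EPOLLIN | EPOLLET */
int drain_socket(int fd)
{
        char buf[4096];

        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0)
                        continue;                /* hand the n bytes to the application here */
                if (n == 0)
                        return 0;                /* peer closed the connection */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return 1;                /* fully drained: safe to return to epoll_wait */
                if (errno == EINTR)
                        continue;                /* interrupted by a signal: retry */
                return -1;                       /* real error */
        }
}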

6. The poll method in Linux device drivers

In the kernel, poll, select and epoll are all built on top of the poll method that a file (driver) provides.

When a user-space program calls poll, select or epoll on a file descriptor associated with the driver, this driver method is invoked. The method does two things:
① It calls poll_wait on one or more wait queues that can indicate a change in poll status. If no file descriptor is currently available for IO, the kernel makes the process wait on the wait queues of all the descriptors passed to the system call.
② It returns a bit mask describing the operations that could be performed immediately without blocking.
(figure: how the driver poll method ties into the wait queues)
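
A sketch of what such a method looks like, modeled on the scull pipe example from Linux Device Drivers (the field names follow the read method shown earlier; spacefree is the book's helper that reports free buffer space):

#include <linux/poll.h>

static __poll_t scull_p_poll(struct file *filp, poll_table *wait)
{
        struct scull_pipe *dev = filp->private_data;
        __poll_t mask = 0;

        down(&dev->sem);
        /* step 1: register both wait queues; the kernel will sleep on them
           later if we report that nothing is ready yet */
        poll_wait(filp, &dev->inq,  wait);
        poll_wait(filp, &dev->outq, wait);

        /* step 2: report what could be done right now without blocking */
        if (dev->rp != dev->wp)
                mask |= EPOLLIN | EPOLLRDNORM;   /* data to read */
        if (spacefree(dev))
                mask |= EPOLLOUT | EPOLLWRNORM;  /* room to write */
        up(&dev->sem);

        return mask;
}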

7. Things to watch out for when using epoll

1. When a descriptor is closed, it is removed from the epoll set automatically (strictly, once the last descriptor referring to the open file is closed).

2. With edge triggering, always use non-blocking sockets and drain everything in one go, reading until EAGAIN.

3. When several threads call epoll_wait on the same listening socket, watch out for the Linux thundering-herd effect.

8. epoll's advantage on multi-core machines

The master process uses multiplexing to accept connections and hands them to worker processes over an anonymous socketpair; each worker then runs its own multiplexing loop to handle the read/write IO, so the cores can be kept busy in parallel.

IV. poll and epoll Inside the Linux Kernel

1. poll, demystified

We need to answer two questions:

1) How does the kernel find out that a descriptor has become ready?

2) How does the kernel put the calling process to sleep?

Source

Location:
/usr/src/linux-source-5.4.0/fs/select.c

The code:

SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
                int, timeout_msecs)
{
        struct timespec64 end_time, *to = NULL;
        int ret;

        if (timeout_msecs >= 0) {
                to = &end_time;
                poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
                        NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
        }

        ret = do_sys_poll(ufds, nfds, to);

        if (ret == -ERESTARTNOHAND) {
                struct restart_block *restart_block;

                restart_block = &current->restart_block;
                restart_block->fn = do_restart_poll;
                restart_block->poll.ufds = ufds;
                restart_block->poll.nfds = nfds;

                if (timeout_msecs >= 0) {
                        restart_block->poll.tv_sec = end_time.tv_sec;
                        restart_block->poll.tv_nsec = end_time.tv_nsec;
                        restart_block->poll.has_timeout = 1;
                } else
                        restart_block->poll.has_timeout = 0;

                ret = -ERESTART_RESTARTBLOCK;
        }
        return ret;
}

static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
		struct timespec64 *end_time)
{
	struct poll_wqueues table;
	int err = -EFAULT, fdcount, len;
	/* Allocate small arguments on the stack to save memory and be
	   faster - use long to make sure the buffer is aligned properly
	   on 64 bit archs to avoid unaligned access */
	long stack_pps[POLL_STACK_ALLOC/sizeof(long)];
	struct poll_list *const head = (struct poll_list *)stack_pps;
 	struct poll_list *walk = head;
 	unsigned long todo = nfds;

	if (nfds > rlimit(RLIMIT_NOFILE))
		return -EINVAL;

	len = min_t(unsigned int, nfds, N_STACK_PPS);
	for (;;) {
		walk->next = NULL;
		walk->len = len;
		if (!len)
			break;

		if (copy_from_user(walk->entries, ufds + nfds-todo,
					sizeof(struct pollfd) * walk->len))
			goto out_fds;

		todo -= walk->len;
		if (!todo)
			break;

		len = min(todo, POLLFD_PER_PAGE);
		walk = walk->next = kmalloc(struct_size(walk, entries, len),
					    GFP_KERNEL);
		if (!walk) {
			err = -ENOMEM;
			goto out_fds;
		}
	}

	poll_initwait(&table);
	fdcount = do_poll(head, &table, end_time);
	poll_freewait(&table);

	for (walk = head; walk; walk = walk->next) {
		struct pollfd *fds = walk->entries;
		int j;

		for (j = 0; j < walk->len; j++, ufds++)
			if (__put_user(fds[j].revents, &ufds->revents))
				goto out_fds;
  	}

	err = fdcount;
out_fds:
	walk = head->next;
	while (walk) {
		struct poll_list *pos = walk;
		walk = walk->next;
		kfree(pos);
	}

	return err;
}

Note that do_poll is one big loop: whenever the call wakes up, whether because of a timeout, a ready event, or a signal, it scans the whole descriptor set again, so the time complexity is O(n).

A rough walk-through of the code

(figure: call path of the poll system call)

2. epoll, demystified

1. Thinking about epoll

Maintaining over ten thousand connections is no small undertaking: a single context switch already costs a few microseconds. Think about what epoll has to cope with when one process holds more than ten thousand connections.

2. The core structure: eventpoll

(figure: the eventpoll structure)
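
The figure is not reproduced here, but the fields that matter can be read off ep_alloc() below. An abridged struct eventpoll (Linux 5.4; fields reordered and several omitted) looks roughly like this:

/* abridged: only the fields discussed in this article */
struct eventpoll {
        struct mutex mtx;              /* serializes epoll_ctl and event delivery */
        wait_queue_head_t wq;          /* processes sleeping in epoll_wait() */
        wait_queue_head_t poll_wait;   /* used when this epoll fd is itself polled */
        struct list_head rdllist;      /* descriptors that are already ready */
        rwlock_t lock;                 /* protects rdllist and ovflist */
        struct rb_root_cached rbr;     /* red-black tree of all monitored fds (epitems) */
        struct epitem *ovflist;        /* holds events that arrive while copying to user space */
        struct user_struct *user;      /* owner, for the max_user_watches limit */
        struct file *file;             /* the anonymous "[eventpoll]" file */
};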

3. What does epoll_create do?

Source

Location: /usr/src/linux-source-5.4.0/fs/eventpoll.c

SYSCALL_DEFINE1(epoll_create, int, size)
{
        if (size <= 0)
                return -EINVAL;

        return do_epoll_create(0);
}

/*
 * Open an eventpoll file descriptor.
 */
static int do_epoll_create(int flags)
{
	int error, fd;
	struct eventpoll *ep = NULL;
	struct file *file;

	/* Check the EPOLL_* constant for consistency.  */
	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

	if (flags & ~EPOLL_CLOEXEC)
		return -EINVAL;
	/*
	 * Create the internal data structure ("struct eventpoll").
	 */
	error = ep_alloc(&ep);
	if (error < 0)
		return error;
	/*
	 * Creates all the items needed to setup an eventpoll file. That is,
	 * a file structure and a free file descriptor.
	 */
	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
	if (fd < 0) {
		error = fd;
		goto out_free_ep;
	}
	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
				 O_RDWR | (flags & O_CLOEXEC));
	if (IS_ERR(file)) {
		error = PTR_ERR(file);
		goto out_free_fd;
	}
	ep->file = file;
	fd_install(fd, file);
	return fd;

out_free_fd:
	put_unused_fd(fd);
out_free_ep:
	ep_free(ep);
	return error;
}

static int ep_alloc(struct eventpoll **pep)
{
	int error;
	struct user_struct *user;
	struct eventpoll *ep;

	user = get_current_user();
	error = -ENOMEM;
	ep = kzalloc(sizeof(*ep), GFP_KERNEL);
	if (unlikely(!ep))
		goto free_uid;

	mutex_init(&ep->mtx);
	rwlock_init(&ep->lock);
	init_waitqueue_head(&ep->wq);
	init_waitqueue_head(&ep->poll_wait);
	INIT_LIST_HEAD(&ep->rdllist);
	ep->rbr = RB_ROOT_CACHED;
	ep->ovflist = EP_UNACTIVE_PTR;
	ep->user = user;

	*pep = ep;

	return 0;

free_uid:
	free_uid(user);
	return error;
}


Walking through the flow

wq: the wait-queue list. When data becomes ready in the softirq path, wq is how the kernel finds the user processes blocked on this epoll object.

rbr: a red-black tree. To support efficient lookup, insertion and deletion over an enormous number of connections, eventpoll uses this tree internally to manage every socket the user process has added.

rdllist: the list of ready descriptors. When a connection becomes ready, the kernel puts it on rdllist, so the application only needs to check this list to find the ready connections instead of traversing the whole tree.

4. epoll_ctl is the heart of epoll

Source

/*
 * The following function implements the controller interface for
 * the eventpoll file that enables the insertion/removal/change of
 * file descriptors inside the interest set.
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
		struct epoll_event __user *, event)
{
	int error;
	int full_check = 0;
	struct fd f, tf;
	struct eventpoll *ep;
	struct epitem *epi;
	struct epoll_event epds;
	struct eventpoll *tep = NULL;

	error = -EFAULT;
	if (ep_op_has_event(op) &&
	    copy_from_user(&epds, event, sizeof(struct epoll_event)))
		goto error_return;

	error = -EBADF;
	f = fdget(epfd);
	if (!f.file)
		goto error_return;

	/* Get the "struct file *" for the target file */
	tf = fdget(fd);
	if (!tf.file)
		goto error_fput;

	/* The target file descriptor must support poll */
	error = -EPERM;
	if (!file_can_poll(tf.file))
		goto error_tgt_fput;

	/* Check if EPOLLWAKEUP is allowed */
	if (ep_op_has_event(op))
		ep_take_care_of_epollwakeup(&epds);

	/*
	 * We have to check that the file structure underneath the file descriptor
	 * the user passed to us _is_ an eventpoll file. And also we do not permit
	 * adding an epoll file descriptor inside itself.
	 */
	error = -EINVAL;
	if (f.file == tf.file || !is_file_epoll(f.file))
		goto error_tgt_fput;

	/*
	 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
	 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
	 * Also, we do not currently supported nested exclusive wakeups.
	 */
	if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) {
		if (op == EPOLL_CTL_MOD)
			goto error_tgt_fput;
		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
			goto error_tgt_fput;
	}

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = f.file->private_data;

	/*
	 * When we insert an epoll file descriptor, inside another epoll file
	 * descriptor, there is the change of creating closed loops, which are
	 * better be handled here, than in more critical paths. While we are
	 * checking for loops we also determine the list of files reachable
	 * and hang them on the tfile_check_list, so we can check that we
	 * haven't created too many possible wakeup paths.
	 *
	 * We do not need to take the global 'epumutex' on EPOLL_CTL_ADD when
	 * the epoll file descriptor is attaching directly to a wakeup source,
	 * unless the epoll file descriptor is nested. The purpose of taking the
	 * 'epmutex' on add is to prevent complex toplogies such as loops and
	 * deep wakeup paths from forming in parallel through multiple
	 * EPOLL_CTL_ADD operations.
	 */
	mutex_lock_nested(&ep->mtx, 0);
	if (op == EPOLL_CTL_ADD) {
		if (!list_empty(&f.file->f_ep_links) ||
				ep->gen == loop_check_gen ||
						is_file_epoll(tf.file)) {
			full_check = 1;
			mutex_unlock(&ep->mtx);
			mutex_lock(&epmutex);
			if (is_file_epoll(tf.file)) {
				error = -ELOOP;
				if (ep_loop_check(ep, tf.file) != 0)
					goto error_tgt_fput;
			} else {
				get_file(tf.file);
				list_add(&tf.file->f_tfile_llink,
							&tfile_check_list);
			}
			mutex_lock_nested(&ep->mtx, 0);
			if (is_file_epoll(tf.file)) {
				tep = tf.file->private_data;
				mutex_lock_nested(&tep->mtx, 1);
			}
		}
	}

	/*
	 * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
	 * above, we can be sure to be able to use the item looked up by
	 * ep_find() till we release the mutex.
	 */
	epi = ep_find(ep, tf.file, fd);

	error = -EINVAL;
	switch (op) {
	case EPOLL_CTL_ADD:
		if (!epi) {
			epds.events |= EPOLLERR | EPOLLHUP;
			error = ep_insert(ep, &epds, tf.file, fd, full_check);
		} else
			error = -EEXIST;
		break;
	case EPOLL_CTL_DEL:
		if (epi)
			error = ep_remove(ep, epi);
		else
			error = -ENOENT;
		break;
	case EPOLL_CTL_MOD:
		if (epi) {
			if (!(epi->event.events & EPOLLEXCLUSIVE)) {
				epds.events |= EPOLLERR | EPOLLHUP;
				error = ep_modify(ep, epi, &epds);
			}
		} else
			error = -ENOENT;
		break;
	}
	if (tep != NULL)
		mutex_unlock(&tep->mtx);
	mutex_unlock(&ep->mtx);

error_tgt_fput:
	if (full_check) {
		clear_tfile_check_list();
		loop_check_gen++;
		mutex_unlock(&epmutex);
	}

	fdput(tf);
error_fput:
	fdput(f);
error_return:

	return error;
}
/*
 * Must be called with "mtx" held.
 */
static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
		     struct file *tfile, int fd, int full_check)
{
	int error, pwake = 0;
	__poll_t revents;
	long user_watches;
	struct epitem *epi;
	struct ep_pqueue epq;

	lockdep_assert_irqs_enabled();

	user_watches = atomic_long_read(&ep->user->epoll_watches);
	if (unlikely(user_watches >= max_user_watches))
		return -ENOSPC;
	if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
		return -ENOMEM;

	/* Item initialization follow here ... */
	INIT_LIST_HEAD(&epi->rdllink);
	INIT_LIST_HEAD(&epi->fllink);
	INIT_LIST_HEAD(&epi->pwqlist);
	epi->ep = ep;
	ep_set_ffd(&epi->ffd, tfile, fd);
	epi->event = *event;
	epi->nwait = 0;
	epi->next = EP_UNACTIVE_PTR;
	if (epi->event.events & EPOLLWAKEUP) {
		error = ep_create_wakeup_source(epi);
		if (error)
			goto error_create_wakeup_source;
	} else {
		RCU_INIT_POINTER(epi->ws, NULL);
	}

	/* Add the current item to the list of active epoll hook for this file */
	spin_lock(&tfile->f_lock);
	list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
	spin_unlock(&tfile->f_lock);

	/*
	 * Add the current item to the RB tree. All RB tree operations are
	 * protected by "mtx", and ep_insert() is called with "mtx" held.
	 */
	ep_rbtree_insert(ep, epi);

	/* now check if we've created too many backpaths */
	error = -EINVAL;
	if (full_check && reverse_path_check())
		goto error_remove_epi;

	/* Initialize the poll table using the queue callback */
	epq.epi = epi;
	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

	/*
	 * Attach the item to the poll hooks and get current event bits.
	 * We can safely use the file* here because its usage count has
	 * been increased by the caller of this function. Note that after
	 * this operation completes, the poll callback can start hitting
	 * the new item.
	 */
	revents = ep_item_poll(epi, &epq.pt, 1);

	/*
	 * We have to check if something went wrong during the poll wait queue
	 * install process. Namely an allocation for a wait queue failed due
	 * high memory pressure.
	 */
	error = -ENOMEM;
	if (epi->nwait < 0)
		goto error_unregister;

	/* We have to drop the new item inside our item list to keep track of it */
	write_lock_irq(&ep->lock);

	/* record NAPI ID of new item if present */
	ep_set_busy_poll_napi_id(epi);

	/* If the file is already "ready" we drop it inside the ready list */
	if (revents && !ep_is_linked(epi)) {
		list_add_tail(&epi->rdllink, &ep->rdllist);
		ep_pm_stay_awake(epi);

		/* Notify waiting tasks that events are available */
		if (waitqueue_active(&ep->wq))
			wake_up(&ep->wq);
		if (waitqueue_active(&ep->poll_wait))
			pwake++;
	}

	write_unlock_irq(&ep->lock);

	atomic_long_inc(&ep->user->epoll_watches);

	/* We have to call this outside the lock */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);

	return 0;

error_unregister:
	ep_unregister_pollwait(ep, epi);
error_remove_epi:
	spin_lock(&tfile->f_lock);
	list_del_rcu(&epi->fllink);
	spin_unlock(&tfile->f_lock);

	rb_erase_cached(&epi->rbn, &ep->rbr);

	/*
	 * We need to do this because an event could have been arrived on some
	 * allocated wait queue. Note that we don't care about the ep->ovflist
	 * list, since that is used/cleaned only inside a section bound by "mtx".
	 * And ep_insert() is called with "mtx" held.
	 */
	write_lock_irq(&ep->lock);
	if (ep_is_linked(epi))
		list_del_init(&epi->rdllink);
	write_unlock_irq(&ep->lock);

	wakeup_source_unregister(ep_wakeup_source(epi));

error_create_wakeup_source:
	kmem_cache_free(epi_cache, epi);

	return error;
}

/*
 * This is the callback that is passed to the wait queue wakeup
 * mechanism. It is called by the stored file descriptors when they
 * have events to report.
 *
 * This callback takes a read lock in order not to content with concurrent
 * events from another file descriptors, thus all modifications to ->rdllist
 * or ->ovflist are lockless.  Read lock is paired with the write lock from
 * ep_scan_ready_list(), which stops all list modifications and guarantees
 * that lists state is seen correctly.
 *
 * Another thing worth to mention is that ep_poll_callback() can be called
 * concurrently for the same @epi from different CPUs if poll table was inited
 * with several wait queues entries.  Plural wakeup from different CPUs of a
 * single wait queue is serialized by wq.lock, but the case when multiple wait
 * queues are used should be detected accordingly.  This is detected using
 * cmpxchg() operation.
 */
static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;
	__poll_t pollflags = key_to_poll(key);
	unsigned long flags;
	int ewake = 0;

	read_lock_irqsave(&ep->lock, flags);

	ep_set_busy_poll_napi_id(epi);

	/*
	 * If the event mask does not contain any poll(2) event, we consider the
	 * descriptor to be disabled. This condition is likely the effect of the
	 * EPOLLONESHOT bit that disables the descriptor when an event is received,
	 * until the next EPOLL_CTL_MOD will be issued.
	 */
	if (!(epi->event.events & ~EP_PRIVATE_BITS))
		goto out_unlock;

	/*
	 * Check the events coming with the callback. At this stage, not
	 * every device reports the events in the "key" parameter of the
	 * callback. We need to be able to handle both cases here, hence the
	 * test for "key" != NULL before the event match test.
	 */
	if (pollflags && !(pollflags & epi->event.events))
		goto out_unlock;

	/*
	 * If we are transferring events to userspace, we can hold no locks
	 * (because we're accessing user memory, and because of linux f_op->poll()
	 * semantics). All the events that happen during that period of time are
	 * chained in ep->ovflist and requeued later on.
	 */
	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
		if (chain_epi_lockless(epi))
			ep_pm_stay_awake_rcu(epi);
	} else if (!ep_is_linked(epi)) {
		/* In the usual case, add event to ready list. */
		if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
			ep_pm_stay_awake_rcu(epi);
	}

	/*
	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
	 * wait list.
	 */
	if (waitqueue_active(&ep->wq)) {
		if ((epi->event.events & EPOLLEXCLUSIVE) &&
					!(pollflags & POLLFREE)) {
			switch (pollflags & EPOLLINOUT_BITS) {
			case EPOLLIN:
				if (epi->event.events & EPOLLIN)
					ewake = 1;
				break;
			case EPOLLOUT:
				if (epi->event.events & EPOLLOUT)
					ewake = 1;
				break;
			case 0:
				ewake = 1;
				break;
			}
		}
		wake_up(&ep->wq);
	}
	if (waitqueue_active(&ep->poll_wait))
		pwake++;

out_unlock:
	read_unlock_irqrestore(&ep->lock, flags);

	/* We have to call this outside the lock */
	if (pwake)
		ep_poll_safewake(&ep->poll_wait);

	if (!(epi->event.events & EPOLLEXCLUSIVE))
		ewake = 1;

	if (pollflags & POLLFREE) {
		/*
		 * If we race with ep_remove_wait_queue() it can miss
		 * ->whead = NULL and do another remove_wait_queue() after
		 * us, so we can't use __remove_wait_queue().
		 */
		list_del_init(&wait->entry);
		/*
		 * ->whead != NULL protects us from the race with ep_free()
		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
		 * held by the caller. Once we nullify it, nothing protects
		 * ep/epi or even wait.
		 */
		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
	}

	return ewake;
}

Walking through the flow

1. Allocate a red-black tree node object, an epitem.

2. Add a wait entry to the socket's wait queue with ep_poll_callback as its callback, so that when data becomes ready the process sitting in epoll_wait can be notified.

3. Insert the epitem into the epoll object's red-black tree.

(figure: the data structures wired together by epoll_ctl)

5. What does epoll_wait do?

What epoll_wait does is not complicated. When called, it checks whether the eventpoll->rdllist has anything on it: if so, it returns those events; if not, it creates a wait-queue entry, adds it to the eventpoll's wait queue, and blocks. The level-triggered versus edge-triggered distinction is also handled on this path.

Source analysis

/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
static int do_epoll_wait(int epfd, struct epoll_event __user *events,
			 int maxevents, int timeout)
{
	int error;
	struct fd f;
	struct eventpoll *ep;

	/* The maximum number of event must be greater than zero */
	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
		return -EINVAL;

	/* Verify that the area passed by the user is writeable */
	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
		return -EFAULT;

	/* Get the "struct file *" for the eventpoll file */
	f = fdget(epfd);
	if (!f.file)
		return -EBADF;

	/*
	 * We have to check that the file structure underneath the fd
	 * the user passed to us _is_ an eventpoll file.
	 */
	error = -EINVAL;
	if (!is_file_epoll(f.file))
		goto error_fput;

	/*
	 * At this point it is safe to assume that the "private_data" contains
	 * our own data structure.
	 */
	ep = f.file->private_data;

	/* Time to fish for events ... */
	error = ep_poll(ep, events, maxevents, timeout);

error_fput:
	fdput(f);
	return error;
}
/**
 * ep_poll - Retrieves ready events, and delivers them to the caller supplied
 *           event buffer.
 *
 * @ep: Pointer to the eventpoll context.
 * @events: Pointer to the userspace buffer where the ready events should be
 *          stored.
 * @maxevents: Size (in terms of number of events) of the caller event buffer.
 * @timeout: Maximum timeout for the ready events fetch operation, in
 *           milliseconds. If the @timeout is zero, the function will not block,
 *           while if the @timeout is less than zero, the function will block
 *           until at least one event has been retrieved (or an error
 *           occurred).
 *
 * Returns: Returns the number of ready events which have been fetched, or an
 *          error code, in case of error.
 */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
		   int maxevents, long timeout)
{
	int res = 0, eavail, timed_out = 0;
	u64 slack = 0;
	wait_queue_entry_t wait;
	ktime_t expires, *to = NULL;

	lockdep_assert_irqs_enabled();

	if (timeout > 0) {
		struct timespec64 end_time = ep_set_mstimeout(timeout);

		slack = select_estimate_accuracy(&end_time);
		to = &expires;
		*to = timespec64_to_ktime(end_time);
	} else if (timeout == 0) {
		/*
		 * Avoid the unnecessary trip to the wait queue loop, if the
		 * caller specified a non blocking operation. We still need
		 * lock because we could race and not see an epi being added
		 * to the ready list while in irq callback. Thus incorrectly
		 * returning 0 back to userspace.
		 */
		timed_out = 1;

		write_lock_irq(&ep->lock);
		eavail = ep_events_available(ep);
		write_unlock_irq(&ep->lock);

		goto send_events;
	}

fetch_events:

	if (!ep_events_available(ep))
		ep_busy_loop(ep, timed_out);

	eavail = ep_events_available(ep);
	if (eavail)
		goto send_events;

	/*
	 * Busy poll timed out.  Drop NAPI ID for now, we can add
	 * it back in when we have moved a socket with a valid NAPI
	 * ID onto the ready list.
	 */
	ep_reset_busy_poll_napi_id(ep);

	do {
		/*
		 * Internally init_wait() uses autoremove_wake_function(),
		 * thus wait entry is removed from the wait queue on each
		 * wakeup. Why it is important? In case of several waiters
		 * each new wakeup will hit the next waiter, giving it the
		 * chance to harvest new event. Otherwise wakeup can be
		 * lost. This is also good performance-wise, because on
		 * normal wakeup path no need to call __remove_wait_queue()
		 * explicitly, thus ep->lock is not taken, which halts the
		 * event delivery.
		 */
		init_wait(&wait);
		write_lock_irq(&ep->lock);
		__add_wait_queue_exclusive(&ep->wq, &wait);
		write_unlock_irq(&ep->lock);

		/*
		 * We don't want to sleep if the ep_poll_callback() sends us
		 * a wakeup in between. That's why we set the task state
		 * to TASK_INTERRUPTIBLE before doing the checks.
		 */
		set_current_state(TASK_INTERRUPTIBLE);
		/*
		 * Always short-circuit for fatal signals to allow
		 * threads to make a timely exit without the chance of
		 * finding more events available and fetching
		 * repeatedly.
		 */
		if (fatal_signal_pending(current)) {
			res = -EINTR;
			break;
		}

		eavail = ep_events_available(ep);
		if (eavail)
			break;
		if (signal_pending(current)) {
			res = -EINTR;
			break;
		}

		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS)) {
			timed_out = 1;
			break;
		}

		/* We were woken up, thus go and try to harvest some events */
		eavail = 1;

	} while (0);

	__set_current_state(TASK_RUNNING);

	if (!list_empty_careful(&wait.entry)) {
		write_lock_irq(&ep->lock);
		__remove_wait_queue(&ep->wq, &wait);
		write_unlock_irq(&ep->lock);
	}

send_events:
	/*
	 * Try to transfer events to user space. In case we get 0 events and
	 * there's still timeout left over, we go trying again in search of
	 * more luck.
	 */
	if (!res && eavail &&
	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
		goto fetch_events;

	return res;
}
/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
				 poll_table *pt)
{
	struct epitem *epi = ep_item_from_epqueue(pt);
	struct eppoll_entry *pwq;

	if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
		init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
		pwq->whead = whead;
		pwq->base = epi;
		if (epi->event.events & EPOLLEXCLUSIVE)
			add_wait_queue_exclusive(whead, &pwq->wait);
		else
			add_wait_queue(whead, &pwq->wait);
		list_add_tail(&pwq->llink, &epi->pwqlist);
		epi->nwait++;
	} else {
		/* We have to signal that an error occurred */
		epi->nwait = -1;
	}
}


Walking through the flow

(figure: flow of epoll_wait)

When does the wake-up happen?

For a TCP socket the Linux kernel maintains both a receive (ready) queue and a wait queue. When data arrives, it walks the socket's wait queue and invokes each entry's callback, here ep_poll_callback, which adds the descriptor's epitem to the eventpoll's rdllist and then wakes, one by one, the processes blocked in epoll_wait.

(figure: the wake-up path from the TCP softirq into epoll)
Finding the ready callback and performing the wake-up

Again, for TCP the kernel keeps both a ready queue and a wait queue per socket.

(figure: a TCP socket's ready queue and wait queue)

V. The Wide Adoption of Multiplexing

1. Coroutine runtimes are built on multiplexing underneath.

2. C++ networking frameworks such as Boost.Asio, libevent and muduo are all built on epoll.

3. Virtually every language exposes a multiplexing API.
