A Performance Optimization Practice for Reducing Process IO Latency, Based on the Block-Layer bfq Scheduler (Part 2)

Published: December 31, 2023

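Building on the previous article, "A Performance Optimization Practice for Reducing Process IO Latency, Based on the Block-Layer bfq Scheduler", this article summarizes the problems encountered while implementing that IO optimization: IO hangs, duplicate IO dispatch, and kernel crashes.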

1: Duplicate IO dispatch triggered a crash

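After the first version of the code was written, running an fio test while cat read a file triggered a kernel crash with high probability. The crash scene: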

  • PID: 11602  TASK: ffff95f3092ddf00  CPU: 3   COMMAND: "cat"
  •  #0 [ffffa67081ceb390] machine_kexec at ffffffff8525bf3e
  •  #1 [ffffa67081ceb3e8] __crash_kexec at ffffffff8536072d
  •  #2 [ffffa67081ceb4b0] panic at ffffffff852b5dc7
  •  #3 [ffffa67081ceb530] __warn.cold.12 at ffffffff852b5fee
  •  #4 [ffffa67081ceb538] blk_mq_start_request at ffffffff856075d0
  •  #5 [ffffa67081ceb560] blk_mq_start_request at ffffffff856075d0
  •  #6 [ffffa67081ceb590] do_error_trap at ffffffff8521f9de
  •  #7 [ffffa67081ceb5d0] do_invalid_op at ffffffff8521fe36
  •  #8 [ffffa67081ceb5f0] invalid_op at ffffffff85c00d84
  •     [exception RIP: blk_mq_start_request+496]
  •     RIP: ffffffff856075d0  RSP: ffffa67081ceb6a0  RFLAGS: 00010202
  •     RAX: 0000000000000001  RBX: ffff95f28fc57810  RCX: 0000000000000018
  •     RDX: 00000000004b1dc2  RSI: ffff95f28fc57810  RDI: ffff95f297722758
  •     RBP: ffff95f38f868000   R8: ffffa67081ceb7e8   R9: 0000000000000000
  •     R10: 0000000000000000  R11: 0000000000000011  R12: ffff95f296143000
  •     R13: ffff95f2987fe000  R14: ffff95f2987fe050  R15: ffffa67081ceb788
  •     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  •  #9 [ffffa67081ceb6b8] scsi_queue_rq at ffffffff857d1a51
  • #10 [ffffa67081ceb708] blk_mq_dispatch_rq_list at ffffffff85609f4c
  • #11 [ffffa67081ceb7d8] blk_mq_do_dispatch_sched at ffffffff8560f4ba
  • #12 [ffffa67081ceb830] __blk_mq_sched_dispatch_requests at ffffffff8560ff99
  • #13 [ffffa67081ceb890] blk_mq_sched_dispatch_requests at ffffffff85610020
  • #14 [ffffa67081ceb8a0] __blk_mq_run_hw_queue at ffffffff856076a1
  • #15 [ffffa67081ceb8b8] __blk_mq_delay_run_hw_queue at ffffffff85607f61
  • #16 [ffffa67081ceb8e0] blk_mq_sched_insert_requests at ffffffff85610351
  • #17 [ffffa67081ceb918] blk_mq_flush_plug_list at ffffffff8560b4d6
  • #18 [ffffa67081ceb998] blk_flush_plug_list at ffffffff855ffbe7
  • #19 [ffffa67081ceb9e8] blk_mq_make_request at ffffffff8560ad38
  • #20 [ffffa67081ceba78] generic_make_request at ffffffff855fe85f
  • #21 [ffffa67081cebad0] submit_bio at ffffffff855feadc
  • #22 [ffffa67081cebb10] ext4_mpage_readpages at ffffffffc08eead1 [ext4]
  • #23 [ffffa67081cebbf8] read_pages at ffffffff8543743b
  • #24 [ffffa67081cebc70] __do_page_cache_readahead at ffffffff85437721
  • ………………….

The source location that triggers the crash:

    void blk_mq_start_request(struct request *rq)
    {
        struct request_queue *q = rq->q;
        blk_mq_sched_started_request(rq);
        trace_block_rq_issue(q, rq);
        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
            rq->io_start_time_ns = ktime_get_ns();
            rq_aux(rq)->stats_sectors = blk_rq_sectors(rq);
            rq->rq_flags |= RQF_STATS;
            rq_qos_issue(q, rq);
        }
        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); // the crash happens here
        blk_add_timer(rq);
        // mark rq->state as MQ_RQ_IN_FLIGHT: the IO request has been dispatched to the disk driver
        WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
    }
    static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
    {
        return READ_ONCE(rq->state);
    }

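The crash sequence: while dispatching an rq to the disk driver, blk_mq_start_request() finds that rq->state is not MQ_RQ_IDLE, so WARN_ON_ONCE fires and, because panic_on_warn is set, the kernel panics. By experience, the RDI register at the crash site should hold the rq pointer passed to blk_mq_start_request(), so look at the fields of that rq: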

  • crash> request ffff95f297722758
  •   __data_len = 0, // data_len is wrong
  •   tag = -275282040, // tag is wrong
  •   __sector = 18446638524612970376, // the sector address is obviously wrong
  •   bio = 0x0, // this bio is wrong
  •   biotail = 0x0,
  •   rq_disk = 0x0, // rq_disk cannot be NULL
  •   state = MQ_RQ_IDLE,

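At this point I suspected that RDI (0xffff95f297722758) was not the rq pointer passed to blk_mq_start_request(), because the printed request fields make no sense. RDI is only guaranteed to hold the first argument at function entry; by the time of the exception at blk_mq_start_request+496 it may already have been reused. When the data does not add up, another approach is needed.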

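Because this case is fairly easy to reproduce, it is most likely related to the code I added to __bfq_dispatch_request(). So I added some debug output to blk_mq_start_request() and __bfq_dispatch_request(), namely the printk calls below: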

    void blk_mq_start_request(struct request *rq)
    {
        struct request_queue *q = rq->q;
        blk_mq_sched_started_request(rq);
        trace_block_rq_issue(q, rq);
        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
            rq->io_start_time_ns = ktime_get_ns();
            rq_aux(rq)->stats_sectors = blk_rq_sectors(rq);
            rq->rq_flags |= RQF_STATS;
            rq_qos_issue(q, rq);
        }
        /* added debug print */
        printk("%s %s %d rq:0x%llx rq->rq_disk:0x%llx \n",__func__,current->comm,current->pid,(u64)rq,(u64)rq->rq_disk);
        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
        blk_add_timer(rq);
        // mark rq->state as MQ_RQ_IN_FLIGHT: the IO request has been dispatched to the disk driver
        WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
    }
    static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
    {
        ..................
        if(bfqd->bfq_high_io_prio_mode)
        {
            // During the 5 s in which bfq_high_io_prio_mode is non-zero, if a non-high-prio IO arrives and the number of IOs in the driver queue exceeds the limit, do not dispatch it; park it temporarily on bfq_high_prio_tmp_list
            if((bfqd->rq_in_driver >= 16) && (bfqd->bfq_high_prio_tmp_list_rq_count < 100)){
                // rq has been removed from its original list; move it to the tail of bfq_high_prio_tmp_list. Dispatch takes rqs from the list head, so it is first come, first dispatched
                list_add_tail(&rq->queuelist,&bfqd->bfq_high_prio_tmp_list);
                bfqd->bfq_high_prio_tmp_list_rq_count ++;
                p_process_io_info_tmp->block_io_count ++;
                /* added debug print */
                printk("%s %s %d rq:0x%llx bfqq:0x%llx pid:%d bfqq->dispatched:%d bfq_high_prio_tmp_list_rq_count:%d rq_in_driver:%d !!!!!!!!!!!!\n",__func__,current->comm,current->pid,(u64)rq,(u64)bfqq,bfqq->pid,bfqq->dispatched,bfqd->bfq_high_prio_tmp_list_rq_count,bfqd->rq_in_driver);
                goto exit1;
            }
        }
        ..................
    }

On the next crash the kernel printed "blk_mq_start_request cat 15092 rq:0xffff8eff2401d990 rq->rq_disk:0xffff8efe1b1b4000". Its members:

  • crash> request 0xffff8eff2401d990
  • struct request {
  •   __data_len = 1048576,
  •   tag = 86,
  •   __sector = 3468288,
  •   bio = 0xffff8efd875e8300,
  •   biotail = 0xffff8efd875e8300,
  •   rq_disk = 0xffff8efe1b1b4000,
  •   state = MQ_RQ_IN_FLIGHT,

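This time the rq pointer is correct; taking the argument of blk_mq_start_request() from RDI earlier was wrong. rq->state is MQ_RQ_IN_FLIGHT, which means the rq had already been dispatched to the disk driver and, before its transfer completed, was dispatched to the driver again. That is clearly a duplicate dispatch. The kernel log just before the crash confirms it: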

  • //rq:0xffff8eff2401d990 is inserted into the bfq_high_prio_tmp_list list here
  • [  132.559190] __bfq_dispatch_request cat 15092 rq:0xffff8eff2401d990 bfqq:0xffff8efe1ba0b200 pid:15092 bfqq->dispatched:17 bfq_high_prio_tmp_list_rq_count:1 rq_in_driver:16 !!!!!!!!!!!!1
  • //rq:0xffff8eff2401d990 is dispatched
  • [  132.559244] blk_mq_start_request cat 15092 rq:0xffff8eff2401d990 rq->rq_disk:0xffff8efe1b1b4000
  • //rq:0xffff8eff2401d990 is dispatched again
  • [  132.561350] blk_mq_start_request cat 15092 rq:0xffff8eff2401d990 rq->rq_disk:0xffff8efe1b1b4000
  • [  132.561398] WARNING: CPU: 1 PID: 15092 at block/blk-mq.c:696 blk_mq_start_request+0x128/0x263
  • [  132.561401] Kernel panic - not syncing: panic_on_warn set ...
  • [  132.561409] CPU: 1 PID: 15092 Comm: cat Kdump: loaded Tainted: G            E    ---------r-  - 4.18.0 #2
  • [  132.561412] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
  • [  132.561414] Call Trace:
  • [  132.561431]  dump_stack+0x5c/0x80
  • [  132.561437]  panic+0xe7/0x2a9
  • [  132.561443]  ? blk_mq_start_request+0x128/0x263
  • [  132.561447]  __warn.cold.12+0x31/0x33
  • [  132.561450]  ? blk_mq_start_request+0x128/0x263
  • [  132.561454]  ? blk_mq_start_request+0x128/0x263
  • [  132.561457]  report_bug+0xb1/0xd0

Clearly rq:0xffff8eff2401d990 was dispatched twice in a row. Time to find what is wrong in the code I added:

    static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
    {
        ...............
        rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);
        if (rq) {
            if(bfqd->queue->high_io_prio_enable)
            {
                if(rq->rq_flags & RQF_HIGH_PRIO){ // high-priority IO
                    // First high-prio IO seen: set bfq_high_io_prio_mode to 1 and start a 5 s timer; when the timer fires, bfq_high_io_prio_mode is cleared to 0
                    if(bfqd->bfq_high_io_prio_mode == 0){
                        bfqd->bfq_high_io_prio_mode = 1;
                        hrtimer_start(&bfqd->bfq_high_prio_timer, ms_to_ktime(5000),HRTIMER_MODE_REL);
                    }
                    p_process_io_info_tmp->high_prio_io_count ++;
                    p_process_io_info_tmp->dispatch_io_count++;
                }
                else // non-high-priority IO
                {
                    p_process_io_info_tmp->high_not_prio_io_count ++;
                    if(bfqd->bfq_high_io_prio_mode)
                    {
                        // During the 5 s in which bfq_high_io_prio_mode is non-zero, if a non-high-prio IO arrives and the number of IOs in the driver queue exceeds the limit, do not dispatch it; park it temporarily on bfq_high_prio_tmp_list
                        if((bfqd->rq_in_driver >= 16) && (bfqd->bfq_high_prio_tmp_list_rq_count < 100)){
                            // rq has been removed from its original list; move it to the tail of bfq_high_prio_tmp_list. Dispatch takes rqs from the list head, so it is first come, first dispatched
                            list_add_tail(&rq->queuelist,&bfqd->bfq_high_prio_tmp_list);
                            bfqd->bfq_high_prio_tmp_list_rq_count ++;
                            p_process_io_info_tmp->block_io_count ++;
                            // The bug is here: once rq has been added to bfq_high_prio_tmp_list it must not be dispatched in this call, yet goto exit1 makes the function return rq, so it is dispatched anyway. The correct approach is to set rq = NULL first
                            goto exit1;
                        }
                    }
                }
            }
            /* If bfq_high_prio_tmp_list still holds rqs to dispatch, skip the rq_in_driver++ here; it is done at the exit label below. This check matters after process_high_io_prio has been set to 1 and then back to 0 (echo 0 >/sys/block/sdb/process_high_io_prio). Without it, rq_in_driver would be incremented here and again in the if below, leaking rq_in_driver */
            if((rq->rq_flags & RQF_HIGH_PRIO) || list_empty(&bfqd->bfq_high_prio_tmp_list)){
    inc_in_driver_start_rq:
                bfqd->rq_in_driver++;
    start_rq:
                rq->rq_flags |= RQF_STARTED;
            }
        }
    exit:
        // 1: for a high-priority IO the if is false and this part is skipped. 2: for a non-high-priority IO, append rq to the tail of bfq_high_prio_tmp_list and pick an rq from the list head to dispatch. 3: if rq is NULL, also pick an rq from bfq_high_prio_tmp_list to dispatch
        if(!direct_dispatch && ((rq && !(rq->rq_flags & RQF_HIGH_PRIO)) || !rq)){
            if(!list_empty(&bfqd->bfq_high_prio_tmp_list)){
                if(rq){
                    list_add_tail(&rq->queuelist,&bfqd->bfq_high_prio_tmp_list);
                    bfqd->bfq_high_prio_tmp_list_rq_count ++;
                    if(p_process_io_info_tmp)
                        p_process_io_info_tmp->block_io_count2++;
                }
                rq = list_first_entry(&bfqd->bfq_high_prio_tmp_list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                bfqd->bfq_high_prio_tmp_list_rq_count --;
                bfqd->rq_in_driver++;
                rq->rq_flags |= RQF_STARTED;
            }
        }
    exit1:
        return rq;
    }

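The problem is at the goto exit1 inside the bfq_high_io_prio_mode branch. Once the rq has been added to bfq_high_prio_tmp_list it must not be dispatched in this call, yet goto exit1 makes the function return that rq, so it is dispatched. When the same rq is later taken off bfq_high_prio_tmp_list, it is dispatched a second time. The fix is simple: set rq = NULL before goto exit1, which prevents the first dispatch.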
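The failure mode can be reproduced in a small user-space model. This is only a sketch under stated assumptions: `hold_list`, `park_request()`, `start_request()` and `simulate()` are hypothetical stand-ins for bfq_high_prio_tmp_list, the parking branch, and the state check in blk_mq_start_request(); none of it is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MQ_RQ_IDLE / MQ_RQ_IN_FLIGHT. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT };

struct request { enum rq_state state; };

/* Tiny FIFO standing in for bfq_high_prio_tmp_list. */
#define HOLD_MAX 8
struct hold_list {
    struct request *slot[HOLD_MAX];
    size_t head, tail;
};

static void hold_push(struct hold_list *l, struct request *rq)
{
    l->slot[l->tail++ % HOLD_MAX] = rq;
}

static struct request *hold_pop(struct hold_list *l)
{
    return l->head == l->tail ? NULL : l->slot[l->head++ % HOLD_MAX];
}

/* Mirrors the state check: returns false where the kernel would WARN. */
static bool start_request(struct request *rq)
{
    if (rq->state != RQ_IDLE)
        return false;
    rq->state = RQ_IN_FLIGHT;
    return true;
}

/* Parks rq. The buggy variant still returns rq to the caller;
 * the fixed variant returns NULL (the "rq = NULL" before goto exit1). */
static struct request *park_request(struct hold_list *l, struct request *rq,
                                    bool fixed)
{
    hold_push(l, rq);
    return fixed ? NULL : rq;
}

/* Returns how many times start_request() rejected a request. */
static int simulate(bool fixed)
{
    struct hold_list list = {0};
    struct request rq = { RQ_IDLE };
    int rejected = 0;

    struct request *out = park_request(&list, &rq, fixed);
    if (out && !start_request(out))
        rejected++;

    out = hold_pop(&list);          /* later dispatch from the parked list */
    if (out && !start_request(out))
        rejected++;
    return rejected;
}
```

In this model `simulate(false)` hits the rejected path once, which corresponds to the second blk_mq_start_request() call in the log, while `simulate(true)` starts the request exactly once.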

2: Hangs while dispatching IO

2.1 Permanent hang because bfq_has_work() returned false

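With the previous problem solved, a new one appeared. An fio stress test hung, and kill -9 on the fio processes did not help. The system had many processes in D (uninterruptible sleep) state. The crash tool shows them: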

  • crash> ps -m | grep UN
  • [0 00:06:40.712] [UN]  PID: 2767   TASK: ffff8cb3ff450000  CPU: 0   COMMAND: "fio"
  • [0 00:06:40.718] [UN]  PID: 2780   TASK: ffff8cb3c9d5c740  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.719] [UN]  PID: 2773   TASK: ffff8cb3c9d317c0  CPU: 2   COMMAND: "fio"
  • [0 00:06:40.727] [UN]  PID: 2769   TASK: ffff8cb3c9d0df00  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.731] [UN]  PID: 2778   TASK: ffff8cb3c9d5df00  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.735] [UN]  PID: 2772   TASK: ffff8cb3c9d08000  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.738] [UN]  PID: 2775   TASK: ffff8cb3c9d32f80  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.742] [UN]  PID: 2770   TASK: ffff8cb3c9d0af80  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.744] [UN]  PID: 2768   TASK: ffff8cb3c9d097c0  CPU: 2   COMMAND: "fio"
  • [0 00:06:40.757] [UN]  PID: 2777   TASK: ffff8cb3c9d30000  CPU: 2   COMMAND: "fio"
  • [0 00:06:40.768] [UN]  PID: 2782   TASK: ffff8cb3c9d597c0  CPU: 3   COMMAND: "fio"
  • [0 00:06:40.769] [UN]  PID: 2764   TASK: ffff8cb3ff454740  CPU: 0   COMMAND: "fio"

The backtraces:

  • crash> bt 2764
  • PID: 2764   TASK: ffff8cb3ff454740  CPU: 0   COMMAND: "fio"
  •  #0 [ffffb2348279bb70] __schedule at ffffffffa84c8826
  •  #1 [ffffb2348279bc08] schedule at ffffffffa84c8cb8
  •  #2 [ffffb2348279bc18] rwsem_down_write_slowpath at ffffffffa7d105ed
  •  #3 [ffffb2348279bc90] bfq_has_work at ffffffffc08054d2 [bfq]
  •  #4 [ffffb2348279bca0] _cond_resched at ffffffffa84c8d95
  •  #5 [ffffb2348279bcd8] ext4_file_write_iter at ffffffffc08c29bb [ext4]
  •  #6 [ffffb2348279bd38] aio_write at ffffffffa7f31206
  •  #7 [ffffb2348279be40] io_submit_one at ffffffffa7f31581
  •  #8 [ffffb2348279beb8] __x64_sys_io_submit at ffffffffa7f31b82
  •  #9 [ffffb2348279bf38] do_syscall_64 at ffffffffa7c0419b
  • #10 [ffffb2348279bf50] entry_SYSCALL_64_after_hwframe at ffffffffa86000ad
  • crash> bt 2780
  • PID: 2780   TASK: ffff8cb3c9d5c740  CPU: 3   COMMAND: "fio"
  •  #0 [ffffb23482983b30] __schedule at ffffffffa84c8826
  •  #1 [ffffb23482983bc8] schedule at ffffffffa84c8cb8
  •  #2 [ffffb23482983bd8] rwsem_down_read_slowpath at ffffffffa84cbd05
  •  #3 [ffffb23482983c88] ext4_direct_IO at ffffffffc08d6e5d [ext4]
  •  #4 [ffffb23482983cf0] generic_file_read_iter at ffffffffa7e2da7f
  •  #5 [ffffb23482983d38] aio_read at ffffffffa7f313a5
  •  #6 [ffffb23482983e40] io_submit_one at ffffffffa7f3165b
  •  #7 [ffffb23482983eb8] __x64_sys_io_submit at ffffffffa7f31b82
  •  #8 [ffffb23482983f38] do_syscall_64 at ffffffffa7c0419b
  •  #9 [ffffb23482983f50] entry_SYSCALL_64_after_hwframe at ffffffffa86000ad
  • crash> bt 2776
  • PID: 2776   TASK: ffff8cb3c9d34740  CPU: 2   COMMAND: "fio"
  •  #0 [ffffb23482953958] __schedule at ffffffffa84c8826
  •  #1 [ffffb234829539f0] schedule at ffffffffa84c8cb8
  •  #2 [ffffb23482953a00] io_schedule at ffffffffa84c90d2
  •  #3 [ffffb23482953a10] bit_wait_io at ffffffffa84c94dd
  •  #4 [ffffb23482953a20] __wait_on_bit_lock at ffffffffa84c934d
  •  #5 [ffffb23482953a58] out_of_line_wait_on_bit_lock at ffffffffa84c9421
  •  #6 [ffffb23482953aa8] do_get_write_access at ffffffffc083ae68 [jbd2]
  •  #7 [ffffb23482953b08] jbd2_journal_get_write_access at ffffffffc083b10c [jbd2]
  •  #8 [ffffb23482953b28] __ext4_journal_get_write_access at ffffffffc08b63f6 [ext4]
  •  #9 [ffffb23482953b58] ext4_reserve_inode_write at ffffffffc08d35a6 [ext4]
  • #10 [ffffb23482953b80] ext4_mark_inode_dirty at ffffffffc08d37d1 [ext4]
  • #11 [ffffb23482953bf0] ext4_dirty_inode at ffffffffc08d8a15 [ext4]
  • #12 [ffffb23482953c08] __mark_inode_dirty at ffffffffa7f0aa6a
  • #13 [ffffb23482953c40] generic_update_time at ffffffffa7ef76e6
  • #14 [ffffb23482953c50] file_update_time at ffffffffa7ef7b01
  • #15 [ffffb23482953c98] __generic_file_write_iter at ffffffffa7e2dd38
  • #16 [ffffb23482953cd8] ext4_file_write_iter at ffffffffc08c2761 [ext4]
  • #17 [ffffb23482953d38] aio_write at ffffffffa7f31206
  • #18 [ffffb23482953e40] io_submit_one at ffffffffa7f31581
  • #19 [ffffb23482953eb8] __x64_sys_io_submit at ffffffffa7f31b82
  • #20 [ffffb23482953f38] do_syscall_64 at ffffffffa7c0419b
  • #21 [ffffb23482953f50] entry_SYSCALL_64_after_hwframe at ffffffffa86000ad

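The backtraces of several fio processes contain bfq_has_work, yet that function takes no locks. Strange. Could the root cause of the hang be related to bfq_has_work? Its source: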

    // If this returns 0, blk_mq_do_dispatch_sched() cannot go on dispatching IO
    static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
    {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

        // list_empty_careful(&bfqd->dispatch) returning false means the list holds rqs to dispatch: return 1
        return !list_empty_careful(&bfqd->dispatch) ||
        // bfq_tot_busy_queues(bfqd) > 0 means there are still active bfqqs whose rqs can be dispatched: return 1
            bfq_tot_busy_queues(bfqd) > 0;
    }

bfq_has_work() is normally called from the blk-mq dispatch function blk_mq_do_dispatch_sched(). Its source:

    static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
    {
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);
        int ret = 0;
        do {
            struct request *rq;
            // calls bfq_has_work
            if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
                break;
            if (!list_empty_careful(&hctx->dispatch)) {
                ret = -EAGAIN;
                break;
            }
            if (!blk_mq_get_dispatch_budget(hctx))
                break;
            // calls the bfq scheduler function bfq_dispatch_request
            rq = e->type->ops.dispatch_request(hctx);
            if (!rq) {
                blk_mq_put_dispatch_budget(hctx);
                blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
                break;
            }
            list_add(&rq->queuelist, &rq_list);
        /* Take the reqs on rq_list and dispatch them to the disk driver. If dispatch fails because the driver queue or the NVMe hardware is busy, the req is added to hctx->dispatch for later dispatch; on such a failure the call returns false and the while loop exits */
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
        return ret;
    }

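Originally, bfq_has_work returning 0 meant that bfq had no more IO to dispatch, so blk_mq_do_dispatch_sched() stopped dispatching. But my change to bfq_dispatch_request, the function bfq uses to dispatch IO, added a list, bfq_high_prio_tmp_list, that holds normal-priority rqs. When bfq is idle, bfq_tot_busy_queues(bfqd) returns 0, yet bfq_high_prio_tmp_list may still hold rqs that need to be dispatched. The rqs that fio parked on bfq_high_prio_tmp_list are never dispatched, so the fio processes are stuck: they cannot issue new rqs until the old ones have been dispatched and completed. In short, deciding whether bfq still has undispatched rqs now also requires checking bfq_high_prio_tmp_list. So I added that check to bfq_has_work() (the bfq_high_prio_tmp_list test below):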

    static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
    {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
        return !list_empty_careful(&bfqd->dispatch) ||
               /* added: rqs parked on bfq_high_prio_tmp_list also count as pending work */
               !list_empty(&bfqd->bfq_high_prio_tmp_list) ||
                bfq_tot_busy_queues(bfqd) > 0;
    }
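Why the extra test matters can be shown with a toy model of the dispatch loop. This is a sketch under simplifying assumptions: `sched_model`, `has_work_original()`, `has_work_patched()` and `drain()` are hypothetical names, each queue is reduced to a counter, and the order in which the real scheduler serves its lists is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Counters standing in for bfqd->dispatch, the busy bfqqs and
 * bfq_high_prio_tmp_list. */
struct sched_model {
    int dispatch_len;
    int busy_queues;
    int held_len;
};

/* Shape of the original predicate: parked requests are invisible. */
static bool has_work_original(const struct sched_model *s)
{
    return s->dispatch_len > 0 || s->busy_queues > 0;
}

/* Shape of the patched predicate: parked requests count as work. */
static bool has_work_patched(const struct sched_model *s)
{
    return s->dispatch_len > 0 || s->held_len > 0 || s->busy_queues > 0;
}

/* Loop in the style of blk_mq_do_dispatch_sched(): stop as soon as the
 * predicate reports no work. Returns how many requests remain parked. */
static int drain(struct sched_model *s, bool patched)
{
    for (;;) {
        bool work = patched ? has_work_patched(s) : has_work_original(s);
        if (!work)
            break;
        if (s->dispatch_len > 0)
            s->dispatch_len--;
        else if (s->busy_queues > 0)
            s->busy_queues--;
        else
            s->held_len--;
    }
    return s->held_len;
}
```

Starting from an idle scheduler with three parked requests, the original predicate leaves all three stranded, while the patched one drains them.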

OK, that problem was solved, but yet another one appeared.

2.2 Hang caused by a bfqq->dispatched leak

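The symptom is again that the fio or cat process issuing IO hangs, with many processes in D state. Dump their stacks with ps -eLlf | grep fio |awk '{print $6}' | while read line;do echo "*********";cat /proc/$line/stack;done. They fall mainly into the following two kinds: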

  • *********
  • [<0>] rwsem_down_write_slowpath+0x32d/0x4e0
  • [<0>] ext4_file_write_iter+0x3cb/0x3e0 [ext4]
  • [<0>] aio_write+0xf6/0x1c0
  • [<0>] io_submit_one+0x131/0x3c0
  • [<0>] __x64_sys_io_submit+0xa2/0x180
  • [<0>] do_syscall_64+0x5b/0x1a0
  • [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
  • *********
  • [<0>] blk_mq_get_tag+0x119/0x270
  • [<0>] __blk_mq_alloc_request+0xb1/0x100
  • [<0>] blk_mq_make_request+0x14e/0x5d0
  • [<0>] generic_make_request+0xcf/0x310
  • [<0>] submit_bio+0x3c/0x160
  • [<0>] do_blockdev_direct_IO+0x21e6/0x2e60
  • [<0>] ext4_direct_IO+0x247/0x730 [ext4]
  • [<0>] generic_file_direct_write+0x93/0x160
  • [<0>] __generic_file_write_iter+0xb7/0x1c0
  • [<0>] ext4_file_write_iter+0x171/0x3e0 [ext4]
  • [<0>] aio_write+0xf6/0x1c0
  • [<0>] io_submit_one+0x131/0x3c0
  • [<0>] __x64_sys_io_submit+0xa2/0x180
  • [<0>] do_syscall_64+0x5b/0x1a0
  • [<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

The likely root cause is that some process failed to allocate a tag in __blk_mq_alloc_request->blk_mq_get_tag.

Add the following debug printk at the end of __bfq_dispatch_request(), the function that dispatches IO:

    static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
    {
    exit:
        ..............
        printk("5:%s %s %d  dispatch rq:0x%llx bfq_high_io_prio_count:%d rq_in_driver:%d\n",__func__,current->comm,current->pid,(u64)rq,bfqd->bfq_high_io_prio_count,bfqd->rq_in_driver);
        return rq;
    }

During the hang the following message floods the log:

5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0

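This is a kernel thread driven by blk-mq that is dispatching rqs frantically, yet the rq it dispatches is always NULL. Normally dispatching should stop at this point!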

Look at the call path by which thread 497 dispatches IO: why does it keep dispatching? Running stap --all-modules -ve 'probe module("bfq").function("bfq_dispatch_request") {printf("%s %d\n",execname(),tid()) print_backtrace()}' floods the screen with:

  • kworker/3:1H 497
  •  0xffffffffc06b3950 : bfq_dispatch_request+0x0/0x9f0 [bfq]
  •  0xffffffffa480f385 : blk_mq_do_dispatch_sched+0xc5/0x160 [kernel]
  •  0xffffffffa480feb9 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffffa480ff40 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffffa48076a1 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
  •  0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
  •  0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel] (inexact)

Why does the kworker/3:1H thread keep running __blk_mq_run_hw_queue and end up dispatching rqs so frantically? Continue debugging with stap --all-modules -ve 'probe kernel.function("blk_mq_do_dispatch_sched").return {if(tid()== 497) {printf("%s %d\n",execname(),tid()) print_backtrace()}}', which floods the screen with:

  • kworker/3:1H 497
  • Returning from:  0xffffffffa480f2c0 : blk_mq_do_dispatch_sched+0x0/0x160 [kernel]
  • Returning to  :  0xffffffffa480feb9 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffffa480ff40 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffffa48076a1 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
  •  0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
  •  0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel] (inexact)

From the source, this is a kernel worker started by blk-mq, and its entry point is blk_mq_run_work_fn(). Continue with stap --all-modules -ve 'probe kernel.function("blk_mq_run_work_fn") {if(tid()== 497) {printf("%s %d\n",execname(),tid()) print_backtrace()}}', which floods the screen with:

  • kworker/3:1H 497
  •  0xffffffffa4807720 : blk_mq_run_work_fn+0x0/0x20 [kernel]
  •  0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
  •  0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
  •  0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel] (inexact)

This output confirms the idea. The most likely source of the work item is __blk_mq_delay_run_hw_queue, which schedules __blk_mq_run_hw_queue. Verify with stap --all-modules -ve 'probe kernel.function("__blk_mq_delay_run_hw_queue") {{printf("%s %d\n",execname(),tid()) print_backtrace()}}', which floods the screen with:

  • kworker/3:1H 497
  •  0xffffffffa4807e20 : __blk_mq_delay_run_hw_queue+0x0/0x160 [kernel]
  •  0xffffffffa4807fd8 : blk_mq_delay_run_hw_queues+0x38/0x50 [kernel]
  •  0xffffffffa480f412 : blk_mq_do_dispatch_sched+0x152/0x160 [kernel]
  •  0xffffffffa480feb9 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffffa480ff40 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffffa48076a1 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
  •  0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
  •  0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel]
  •  0xffffffffa4e00255 : ret_from_fork+0x35/0x40 [kernel] (inexact)

Putting these traces together, it is fairly certain that blk_mq_do_dispatch_sched(), because the rq it gets is NULL, keeps executing blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY)->blk_mq_delay_run_hw_queue->__blk_mq_delay_run_hw_queue->kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,msecs_to_jiffies(msecs)), which re-arms the blk-mq asynchronous dispatch worker, that is, kworker/3:1H 497. The logic itself looks fine, so why is the asynchronous dispatch worker kworker/3:1H 497 triggered so often? Look at the dispatch code in blk_mq_do_dispatch_sched():

    static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
    {
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);
        int ret = 0;
        do {
            struct request *rq;
            if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) // bfq_has_work
                break;
            if (!list_empty_careful(&hctx->dispatch)) {
                ret = -EAGAIN;
                break;
            }
            if (!blk_mq_get_dispatch_budget(hctx))
                break;
            rq = e->type->ops.dispatch_request(hctx); // calls the bfq scheduler function bfq_dispatch_request
            if (!rq) {
                // if bfq_dispatch_request returned a NULL rq, blk_mq_delay_run_hw_queues() re-arms the blk-mq asynchronous dispatch worker
                blk_mq_put_dispatch_budget(hctx);
                blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
                break;
            }
            list_add(&rq->queuelist, &rq_list);
        /* Take the reqs on rq_list and dispatch them to the disk driver. If dispatch fails because the driver queue or the NVMe hardware is busy, the rq is added to hctx->dispatch for later dispatch; on such a failure the call returns false and the while loop exits */
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
        return ret;
    }
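The interaction between the two checks in this loop can be sketched in user space. The model below is an illustration under assumptions, not kernel code: `sched_model`, `dispatch_pass()` and `count_reruns()` are hypothetical, the budget and hctx->dispatch checks are left out, and every hand-off to the driver is assumed to succeed.

```c
#include <assert.h>
#include <stdbool.h>

struct sched_model {
    int busy_queues;   /* what the has_work() callback looks at */
    int dispatchable;  /* requests the scheduler is willing to hand out */
    bool stuck;        /* true: busy_queues never drops although nothing is handed out */
};

/* One pass shaped like blk_mq_do_dispatch_sched().
 * Returns true if the pass ended by asking for a delayed re-run. */
static bool dispatch_pass(struct sched_model *m)
{
    while (m->busy_queues > 0) {        /* has_work() says yes */
        if (m->dispatchable == 0)
            return true;                /* NULL rq: re-arm the worker */
        m->dispatchable--;              /* rq handed to the driver */
        if (m->dispatchable == 0 && !m->stuck)
            m->busy_queues = 0;         /* healthy case: queues go idle */
    }
    return false;                       /* no work: stop quietly */
}

/* Counts how many consecutive passes re-arm the worker, up to max_passes. */
static int count_reruns(struct sched_model *m, int max_passes)
{
    int reruns = 0;
    for (int i = 0; i < max_passes; i++) {
        if (!dispatch_pass(m))
            break;
        reruns++;
    }
    return reruns;
}
```

A healthy model drains and stops without a single re-run. A model in which has_work stays true while the dispatch callback keeps returning NULL re-arms on every pass, which is the shape of the flood seen for kworker/3:1H 497.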

Trace bfq_has_work() with stap --all-modules -ve 'probe module("bfq").function("bfq_has_work").return {{printf("%s %d %d bfqd:0x%x\n",execname(),tid(),$return,$hctx->queue->elevator->elevator_data)}}', which floods the screen with:

  • kworker/3:1H 497 1 bfqd:0xffffa0657f07e800
  • kworker/3:1H 497 1 bfqd:0xffffa0657f07e800

With no real ideas left, dump the members of bfq's core data structures, bfqq or bfqd, and see whether anything looks abnormal. Start crash:

  • crash> bfq_data 0xffffa0657f07e800
  • struct bfq_data {
  •   queue = 0xffffa0659740eda8,
  •   dispatch = {
  •     next = 0xffffa0657f07e808,
  •     prev = 0xffffa0657f07e808
  •   },
  • ...........
  •   bfq_high_prio_tmp_list = {
  •     next = 0xffffa0657f07ec28,
  •     prev = 0xffffa0657f07ec28
  •   },

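Both lists that hold pending IO are empty (next and prev point back to the list head itself), so bfq_has_work can only be returning 1 because bfq_tot_busy_queues returns a non-zero value. A test confirms it: stap --all-modules -ve 'probe module("bfq").function("bfq_tot_busy_queues").return {{printf("%s %d %d\n",execname(),tid(),$return)}}' floods the screen with: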

  • kworker/3:1H 497 21
  • kworker/3:1H 497 21
  • kworker/3:1H 497 21
  • kworker/3:1H 497 21
  • kworker/3:1H 497 21
  • kworker/3:1H 497 21

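At this point I suspected that many IOs were not being dispatched properly. I added a check in the kernel for rqs that had been added to the bfq queues but had still not completed after 30 s. It printed: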

  • [10168.410008] rq:0xffffa0659b96e110 long time do not dispatch
  • [10168.410008] rq:0xffffa0659b95f790 long time do not dispatch
  • [10168.410008] rq:0xffffa0659b950010 long time do not dispatch
  • [10168.410009] rq:0xffffa0659b968350 long time do not dispatch
  • [10168.410009] rq:0xffffa065974b8350 long time do not dispatch
  • [10168.410009] rq:0xffffa0659b958e90 long time do not dispatch
  • [10168.411852] 5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0
  • [10168.415764] 5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0
  • [10168.419817] 5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0
  • [10168.423622] 5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0
  • [10168.427652] 5:__bfq_dispatch_request kworker/3:1H 497  dispatch rq:0x0 bfq_high_io_prio_count:0 rq_in_driver:0
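The article does not show the code behind the "long time do not dispatch" messages. One plausible shape for such a check, given purely as a user-space sketch, is to record a timestamp when a request is queued and to scan periodically for pending requests older than 30 s. `tracked_rq`, `rq_is_stale()` and `count_stale()` are hypothetical names and do not come from the author's patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 30 s threshold, in nanoseconds. */
#define STALE_AFTER_NS (30ULL * 1000000000ULL)

/* Hypothetical per-request bookkeeping. */
struct tracked_rq {
    uint64_t insert_ns;  /* time the request was queued in the scheduler */
    bool completed;      /* set when the transfer finishes */
};

/* A request is stale if it is still pending more than 30 s after queueing. */
static bool rq_is_stale(const struct tracked_rq *rq, uint64_t now_ns)
{
    if (rq->completed || now_ns < rq->insert_ns)
        return false;
    return now_ns - rq->insert_ns > STALE_AFTER_NS;
}

/* Scans the tracked requests and counts the stale ones. */
static size_t count_stale(const struct tracked_rq *rqs, size_t n,
                          uint64_t now_ns)
{
    size_t stale = 0;
    for (size_t i = 0; i < n; i++)
        if (rq_is_stale(&rqs[i], now_ns))
            stale++;
    return stale;
}
```

In a kernel patch the scan would more likely run from a timer or work item and print each stale rq pointer, as the log above does; the sketch only captures the age test.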

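That raises another big question, so focus on why the rq dispatched by __bfq_dispatch_request is always 0 (NULL). I suspected bfq_select_queue, which it calls: __bfq_dispatch_request first runs bfq_select_queue to choose a bfqq and then picks an rq from that bfqq to dispatch. Could the bfqq chosen by bfq_select_queue be the problem? When there are many suspicious points, seize the core question and pursue it relentlessly!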

Debugging with stap --all-modules -ve 'probe module("bfq").function("bfq_select_queue").return {{printf("%s %d %d\n",execname(),tid(),$return)}}' prints:

  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0
  • kworker/3:1H 497 0

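Indeed, the bfqq returned by bfq_select_queue is the problem: it is NULL. So use bfqd->in_service_queue to see which bfqq is currently in service. From the earlier debugging, the bfqd pointer is 0xffffa0657f07e800.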

  • crash> bfq_data 0xffffa0657f07e800 | grep in_service_queue
  •   in_service_queue = 0xffffa06597e1c000,
  • crash> bfq_queue 0xffffa06597e1c000 | grep pid
  •   pid = 1272,
  • crash> bt 1272
  • PID: 1272   TASK: ffffa065a692df00  CPU: 0   COMMAND: "jbd2/sdb-8"
  •  #0 [ffffbcc1c21efa48] __schedule at ffffffffa4cc8826
  •  #1 [ffffbcc1c21efae0] schedule at ffffffffa4cc8cb8
  •  #2 [ffffbcc1c21efaf0] io_schedule at ffffffffa4cc90d2
  •  #3 [ffffbcc1c21efb00] blk_mq_get_tag at ffffffffa480dca9
  •  #4 [ffffbcc1c21efb78] __blk_mq_alloc_request at ffffffffa4807ba1
  •  #5 [ffffbcc1c21efb98] blk_mq_make_request at ffffffffa480ab5e
  •  #6 [ffffbcc1c21efc28] generic_make_request at ffffffffa47fe85f
  •  #7 [ffffbcc1c21efc80] submit_bio at ffffffffa47feadc
  •  #8 [ffffbcc1c21efcc0] submit_bh_wbc at ffffffffa471673a
  •  #9 [ffffbcc1c21efcf8] jbd2_journal_commit_transaction at ffffffffc06e28a4 [jbd2]
  • #10 [ffffbcc1c21efea0] kjournald2 at ffffffffc06e792d [jbd2]
  • #11 [ffffbcc1c21eff10] kthread at ffffffffa44d9502
  • #12 [ffffbcc1c21eff50] ret_from_fork at ffffffffa4e00255

The process that owns the bfqq currently in service is itself blocked! Next, what in bfq_select_queue is suspicious? Its source:

    static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
    {
        ................
        if (bfq_bfqq_wait_request(bfqq) ||
            (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
            ..........
            // if the process has an async bfqq, pick that async bfqq
            if (async_bfqq &&
                icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
                bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
                bfq_bfqq_budget_left(async_bfqq))
                bfqq = bfqq->bic->bfqq[0];
            else if (bfq_bfqq_has_waker(bfqq) &&
                   bfq_bfqq_busy(bfqq->waker_bfqq) &&
                   bfqq->next_rq &&
                   bfq_serv_to_charge(bfqq->waker_bfqq->next_rq,
                              bfqq->waker_bfqq) <=
                   bfq_bfqq_budget_left(bfqq->waker_bfqq)
                )
                // pick bfqq->waker_bfqq
                bfqq = bfqq->waker_bfqq;
            // the process bound to the bfqd->in_service_queue bfqq does not tend to insert many IO requests into bfqq->sort_list in quick succession while the queue is idle
            else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
                 // the bfqd->in_service_queue bfqq is not weight-raised
                 (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
                 // the process bound to the bfqd->in_service_queue bfqq does not quickly insert new IO requests while it is dispatching
                  !bfq_bfqq_has_short_ttime(bfqq)))
                /* If this if is true, the bfqd->in_service_queue bfqq is a candidate for being preempted by an injected bfqq. bfq_choose_bfqq_for_injection() walks the bfqqs on the st->active tree and, if one satisfies bfqd->rq_in_driver < limit, returns it to preempt bfqd->in_service_queue */
                bfqq = bfq_choose_bfqq_for_injection(bfqd);
            else
                bfqq = NULL;
            goto keep_queue;
        }
    expire:
        // expire the bfqq
        bfq_bfqq_expire(bfqd, bfqq, false, reason);
    new_queue:
        bfqq = bfq_set_in_service_queue(bfqd);
        if (bfqq) {
            // a bfqq was found: take the check_queue branch
            goto check_queue;
        }
    keep_queue:
        return bfqq;
    }

Use stap --all-modules -ve 'probe module("bfq").function("bfq_bfqq_expire") {{printf("%s %d 0x%x\n",execname(),tid(),$bfqq)}}' to check whether bfq_bfqq_expire() runs: nothing is printed. Then use stap --all-modules -ve 'probe module("bfq").function("idling_boosts_thr_without_issues").return {{printf("%s %d 0x%x\n",execname(),tid(),$return)}}' to check whether idling_boosts_thr_without_issues is called. It floods the screen with:

  • kworker/3:1H 497 0x0
  • kworker/3:1H 497 0x0
  • kworker/3:1H 497 0x0
  • kworker/3:1H 497 0x0
  • kworker/3:1H 497 0x0

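So execution reaches the if (!idling_boosts_thr_without_issues(bfqd, bfqq)…) test. My guess is that this if is false and the else branch sets bfqq = NULL, after which goto keep_queue returns bfqq = NULL, so bfq_select_queue() returns NULL every time. To verify, start the crash tool. From above, the current bfqq pointer is 0xffffa06597e1c000: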

  • crash> bfq_queue 0xffffa06597e1c000 | grep wr_coeff
  •   wr_coeff = 30,
  • crash> bfq_data 0xffffa0657f07e800 | grep wr_busy_queues
  •   wr_busy_queues = 1,
  • crash> bfq_queue 0xffffa06597e1c000 -x | grep flags
  •   flags = 0xf2,
  • crash> bfq_queue 0xffffa06597e1c000 | grep dispatched
  •   dispatched = 2, // dispatched count of this bfqq

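BFQQF_has_short_ttime is bit 5, and bit 5 of the flags of bfqq 0xffffa06597e1c000 is 1 (0xf2 is 1111 0010 in binary). With wr_coeff = 30 and wr_busy_queues = 1, none of the three terms inside the parentheses is true, so if(!idling_boosts_thr_without_issues(bfqd, bfqq) &&(bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||!bfq_bfqq_has_short_ttime(bfqq))) is false and the else branch is taken: bfqq = NULL. That is why bfq_select_queue() returns a NULL bfqq. Strange. Why does this happen?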
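The evaluation above can be replayed in user space. The following is only a minimal sketch: `struct bfqq_snapshot`, `test_flag()` and `injection_branch_taken()` are hypothetical names made up for illustration, and only the two bit positions quoted in this article (bit 2 and bit 5) are decoded.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions as quoted in the text; the other bits of flags are not decoded. */
enum { BIT_WAIT_REQUEST = 2, BIT_HAS_SHORT_TTIME = 5 };

/* Hypothetical snapshot of the values read with crash and stap. */
struct bfqq_snapshot {
    unsigned long flags;          /* bfqq->flags */
    unsigned int wr_coeff;        /* bfqq->wr_coeff */
    unsigned int wr_busy_queues;  /* bfqd->wr_busy_queues */
    bool idling_boosts_thr;       /* value returned by idling_boosts_thr_without_issues() */
};

/* Return true if the given bit is set in flags. */
static bool test_flag(unsigned long flags, int bit)
{
    assert(bit >= 0 && bit < (int)(sizeof(flags) * 8));
    return ((flags >> bit) & 1UL) != 0;
}

/* Same boolean expression as the "else if" that guards
 * bfq_choose_bfqq_for_injection() in the listing above. */
static bool injection_branch_taken(const struct bfqq_snapshot *s)
{
    return !s->idling_boosts_thr &&
           (s->wr_coeff == 1 || s->wr_busy_queues > 1 ||
            !test_flag(s->flags, BIT_HAS_SHORT_TTIME));
}
```

Filling the snapshot with flags = 0xf2, wr_coeff = 30, wr_busy_queues = 1 and a stap return value of 0 makes `injection_branch_taken()` return false, which corresponds to the else branch and bfqq = NULL.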

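Could the code I added to bfq have disturbed the bfq algorithm? For the code above to run at all, the outer if (bfq_bfqq_wait_request(bfqq) || (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) has to be true first. BFQQF_wait_request is bit 2, but bit 2 of the bfqq's flags is 0. bfqq->dispatched is 2, so it must be (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq)) that makes the condition true.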

To verify that bfq_better_to_idle() returns true, run stap --all-modules -ve 'probe module("bfq").function("bfq_better_to_idle").return {printf("%s %d bfqq:0x%x 0x%x\n",execname(),tid(),$bfqq,$return)}'. The output floods the screen:

  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1
  • kworker/3:1H 497 bfqq:0xffffa06597e1c000 0x1

So if dispatched of bfqq 0xffffa06597e1c000 were 0, the if would not hold. But in fact bfqq->dispatched stays at 2!
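As a quick check of that reasoning, the outer condition can be written as a pure function of the three observed values. This is a sketch only; `idle_branch_entered()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Same boolean expression as the outer "if" in bfq_select_queue():
 * bfq_bfqq_wait_request(bfqq) ||
 * (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq)) */
static bool idle_branch_entered(bool wait_request, int dispatched,
                                bool better_to_idle)
{
    assert(dispatched >= 0);
    return wait_request || (dispatched != 0 && better_to_idle);
}
```

With the observed values (wait_request bit clear, dispatched = 2, bfq_better_to_idle() returning 1) the function returns true; with dispatched = 0 it returns false, so a dispatched count that never reaches 0 keeps the scheduler in this branch.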

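So the root cause is that dispatched of bfqq 0xffffa06597e1c000 stays at 2 and never drops to 0. Strange. Did my code leak the dispatched count of this bfqq so that it is always greater than 0? A careful look at the code I added to __bfq_dispatch_request() did reveal the problem. The offending part is the if(list_empty(&bfqd->bfq_high_prio_tmp_list)) check that guards rq->rq_flags |= RQF_STARTED: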

static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
    ....................
    rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);

    if (rq) {
        if (bfqd->queue->high_io_prio_enable) {
            if (rq->rq_flags & RQF_HIGH_PRIO) { // high-priority IO
                if (bfqd->bfq_high_io_prio_mode == 0) {
                    bfqd->bfq_high_io_prio_mode = 1;
                    hrtimer_start(&bfqd->bfq_high_prio_timer, ms_to_ktime(5000), HRTIMER_MODE_REL);
                }
            } else { // non-high-priority IO
                if (bfqd->bfq_high_io_prio_mode) {
                    // Within the 5 s window in which bfq_high_io_prio_mode is non-zero, a
                    // non-high-prio IO that arrives while the number of IOs in the driver
                    // queue has reached the limit is not dispatched; it is parked
                    // temporarily on the bfq_high_prio_tmp_list list.
                    if (bfqd->rq_in_driver >= HIGH_PRIO_IO_LIMIT) {
                        list_add_tail(&rq->queuelist, &bfqd->bfq_high_prio_tmp_list);
                        bfqq->dispatched--;
                        bfqd->bfq_high_io_prio_count++;
                        return NULL;
                    }
                }
            }
        }
        // Problematic condition: a high-priority rq skips this block whenever the
        // list is not empty, so RQF_STARTED is never set for it.
        if (list_empty(&bfqd->bfq_high_prio_tmp_list)) {
inc_in_driver_start_rq:
            bfqd->rq_in_driver++;
start_rq:
            rq->rq_flags |= RQF_STARTED;
        }
    }
exit:
    // 1: for a high-priority IO this if is false and the block is skipped.
    // 2: for a non-high-priority IO, append rq to the tail of bfq_high_prio_tmp_list
    //    and dispatch the rq at the head of the list instead.
    // 3: if rq is NULL, also pick an rq from bfq_high_prio_tmp_list to dispatch.
    if (((rq && !(rq->rq_flags & RQF_HIGH_PRIO)) || !rq)) {
        if (!list_empty(&bfqd->bfq_high_prio_tmp_list)) {
            if (rq) {
                list_add_tail(&rq->queuelist, &bfqd->bfq_high_prio_tmp_list);
                bfqq->dispatched--;
                bfqd->bfq_high_io_prio_count++;
            }
            rq = list_first_entry(&bfqd->bfq_high_prio_tmp_list, struct request, queuelist);
            list_del_init(&rq->queuelist);
            bfqd->bfq_high_io_prio_count--;
            bfqq = RQ_BFQQ(rq);
            if (bfqq)
                bfqq->dispatched++;
            bfqd->rq_in_driver++;
            rq->rq_flags |= RQF_STARTED;
        }
    }
    return rq;
}

If the rq has the RQF_HIGH_PRIO attribute, dispatch first goes through __bfq_dispatch_request->bfq_dispatch_rq_from_bfqq(), which performs the usual bfqq->dispatched++. Back in __bfq_dispatch_request(), if bfq_high_prio_tmp_list is not empty, if(list_empty(&bfqd->bfq_high_prio_tmp_list)) is false, so rq->rq_flags |= RQF_STARTED is not executed. The later if(((rq && !(rq->rq_flags & RQF_HIGH_PRIO)) || !rq)) is also false for a high-priority rq, so rq->rq_flags |= RQF_STARTED is missed a second time.

When the rq completes, bfq_finish_requeue_request() is executed:

static void bfq_finish_requeue_request(struct request *rq)
{
    // get the bfqq from the completed IO request rq
    struct bfq_queue *bfqq = RQ_BFQQ(rq);
    struct bfq_data *bfqd;

    // excerpt: the NULL check on bfqq and the bfqd = bfqq->bfqd assignment are omitted here
    if (likely(rq->rq_flags & RQF_STARTED)) {
        unsigned long flags;

        spin_lock_irqsave(&bfqd->lock, flags);
        if (rq == bfqd->waited_rq)
            bfq_update_inject_limit(bfqd, bfqq);
        // the key function run on IO completion
        bfq_completed_request(bfqq, bfqd);
        bfq_finish_requeue_request_body(bfqq);
        spin_unlock_irqrestore(&bfqd->lock, flags);
    }
}

static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
{
    u64 now_ns;
    u32 delta_us;

    bfq_update_hw_tag(bfqd);
    // number of IO requests dispatched to the driver but not yet completed
    bfqd->rq_in_driver--;
    // number of this bfqq's IO requests not yet completed; 0 means all of them
    // have completed, similar to bfqd->rq_in_driver
    bfqq->dispatched--;
    // excerpt: the rest of the function is omitted
}

Because the rq has no RQF_STARTED flag, bfqq->dispatched-- is never executed, and bfqq->dispatched leaks. The fix is simple: when the rq has the RQF_HIGH_PRIO attribute and bfq_high_prio_tmp_list is not empty, rq->rq_flags |= RQF_STARTED must be executed as well. Changing if(list_empty(&bfqd->bfq_high_prio_tmp_list)) to if((rq->rq_flags & RQF_HIGH_PRIO) || list_empty(&bfqd->bfq_high_prio_tmp_list)) is enough!
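The leak and the fix can be reproduced with a small user-space model. This is a sketch under simplifying assumptions, not the kernel code: `struct model_rq`, `struct model_bfqq` and the three functions are hypothetical, the model only follows high-priority rqs (for which the block after the exit label is skipped), and the state of the temporary list is passed in as a flag.

```c
#include <assert.h>
#include <stdbool.h>

struct model_rq   { bool high_prio; bool started; };  /* started ~ RQF_STARTED */
struct model_bfqq { int dispatched; };                /* ~ bfqq->dispatched */

/* Dispatch path: the increment done by bfq_dispatch_rq_from_bfqq(), followed by
 * the decision whether the rq gets marked as started. */
static void model_dispatch(struct model_bfqq *q, struct model_rq *rq,
                           bool tmp_list_empty, bool use_fix)
{
    bool mark;

    q->dispatched++;
    if (use_fix)
        mark = rq->high_prio || tmp_list_empty;  /* fixed condition */
    else
        mark = tmp_list_empty;                   /* original condition */
    if (mark)
        rq->started = true;
}

/* Completion path: the counter is only decremented for started rqs. */
static void model_complete(struct model_bfqq *q, struct model_rq *rq)
{
    if (rq->started) {
        q->dispatched--;
        assert(q->dispatched >= 0);
    }
}

/* Dispatch and complete n high-priority rqs; return the leftover count. */
static int leaked_after(int n, bool tmp_list_empty, bool use_fix)
{
    struct model_bfqq q = { 0 };
    int i;

    for (i = 0; i < n; i++) {
        struct model_rq rq = { true, false };

        model_dispatch(&q, &rq, tmp_list_empty, use_fix);
        model_complete(&q, &rq);
    }
    return q.dispatched;
}
```

In this model, two high-priority rqs dispatched while the temporary list is not empty leave the counter at 2 with the original condition and at 0 with the fixed one, matching the dispatched = 2 value seen in crash.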

One overlooked detail in the logic led to this whole complicated investigation. Unbelievable!

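Finally, this gave me a better understanding of the kworker/0:1H kernel thread that blk-mq uses to dispatch rqs. In blk_mq_do_dispatch_sched(), because many rqs are already parked on bfq_high_prio_tmp_list, if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) is false. So rq = e->type->ops.dispatch_request(hctx), that is __bfq_dispatch_request(), is executed.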

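If a process is executing __bfq_dispatch_request() and the function returns NULL because the rq lacks the RQF_HIGH_PRIO attribute (that is, rq = e->type->ops.dispatch_request(hctx) is NULL), blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY) is called and dispatch is deferred to the kworker/0:1H kernel thread. After the BLK_MQ_BUDGET_DELAY interval (a few milliseconds) blk_mq_do_dispatch_sched() runs again, and the cycle repeats until every rq on bfq_high_prio_tmp_list has been dispatched. Once the list is empty, the kworker/0:1H thread runs blk_mq_do_dispatch_sched() one last time, bfq_has_work() returns false, if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) is true, and rq dispatching finally stops.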

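In effect I am relying on the delayed-dispatch behaviour of blk-mq's blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY): even when no process is running __blk_mq_sched_dispatch_requests->blk_mq_do_dispatch_sched->blk_mq_dispatch_rq_list to dispatch rqs, the kworker/0:1H kernel thread dispatches all remaining rqs after a delay. So I do not need to worry that rqs parked on bfq_high_prio_tmp_list will be left without any process to dispatch them!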
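The control flow described above can be imitated with a toy loop. Everything here is a loudly simplified assumption rather than blk-mq or bfq code: `MODEL_LIMIT` stands in for HIGH_PRIO_IO_LIMIT, `model_has_work()` assumes that parked rqs count as pending work, only non-high-priority rqs are modelled, and all in-flight IO is assumed to complete during each delay.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_LIMIT 2   /* hypothetical stand-in for HIGH_PRIO_IO_LIMIT */

struct model_sched {
    int pending;       /* non-high-prio rqs still queued on the bfqq */
    int parked;        /* rqs parked on the temporary list */
    int in_driver;     /* rqs handed to the driver, not yet completed */
    int delayed_runs;  /* number of delayed re-runs that were scheduled */
};

/* Assumption: parked rqs keep has_work() true. */
static bool model_has_work(const struct model_sched *s)
{
    return s->pending > 0 || s->parked > 0;
}

/* Returns true when an rq goes to the driver, false for the NULL case. */
static bool model_dispatch_request(struct model_sched *s)
{
    if (s->pending > 0) {
        s->pending--;
        if (s->in_driver >= MODEL_LIMIT) {
            s->parked++;          /* park the rq and return NULL */
            return false;
        }
        /* If the list is not empty the new rq is appended and the head is
         * dispatched instead, so parked is unchanged either way. */
        s->in_driver++;
        return true;
    }
    if (s->parked > 0) {          /* nothing new: drain the list head */
        s->parked--;
        s->in_driver++;
        return true;
    }
    return false;
}

/* Loop shaped like the dispatch loop: a NULL result schedules a delayed re-run. */
static void model_run_until_idle(struct model_sched *s)
{
    assert(s->pending >= 0 && s->parked >= 0 && s->in_driver >= 0);
    while (model_has_work(s)) {
        if (!model_dispatch_request(s)) {
            s->delayed_runs++;    /* ~ blk_mq_delay_run_hw_queues() */
            s->in_driver = 0;     /* assume in-flight IO completes meanwhile */
        }
    }
}
```

Starting from five pending rqs and an idle driver, the model parks one rq, schedules one delayed re-run, and ends with nothing pending and nothing parked, which is the property the paragraph above depends on.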

3: A closer look at bfqq->dispatched and parking rqs on bfq_high_prio_tmp_list

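The code I added to bfq executes bfqq->dispatched-- and bfqq->dispatched++ in several places, on top of the bfqq->dispatched++ that rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq) already performs. Do my extra decrements and increments affect the bfq algorithm? My original idea was: when an rq is added to bfq_high_prio_tmp_list, do bfqq->dispatched--, and when the rq is really dispatched, do bfqq->dispatched++ again. But there is a problem. If the rq stays on bfq_high_prio_tmp_list for a long time and it is the bfqq's last rq, the early bfqq->dispatched-- makes it look as if all of the bfqq's rqs have been dispatched and completed.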

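But they have not; the rq is only parked on bfq_high_prio_tmp_list. If bfqq->dispatched reaches 0, the bfqq will probably expire and be removed from the st->active tree. Then, when the rq parked on bfq_high_prio_tmp_list is finally dispatched, bfqq->dispatched++ is executed, which is a problem because the bfqq may already belong to a new process! By this reasoning my code should not do bfqq->dispatched++ and bfqq->dispatched-- at all. No, that analysis is wrong. bfq_dispatch_rq_from_bfqq(bfqd, bfqq) does bfqq->dispatched++ first and my code then does bfqq->dispatched--, so as far as the counter is concerned the rq has not been dispatched at all and is still held by the bfqq. The bfqq will therefore not expire either!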

But could my bfq code be moved before rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq)? Once that call has run, the rq has been removed from the bfqq's list, and keeping it on bfq_high_prio_tmp_list for a long time may affect the bfq algorithm. Normally an rq returned by bfq_dispatch_rq_from_bfqq(bfqd, bfqq) completes very soon, whereas an rq parked on bfq_high_prio_tmp_list may only complete after some time.

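In addition, the rq chosen by rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq) always comes from bfqq->next_rq, and bfq_dispatch_rq_from_bfqq->bfq_bfqq_served charges the budget consumed by the rq to entity->service of the owning bfqq. Only then do I add the rq to bfq_high_prio_tmp_list. If this happens to use up the bfqq's budget, the bfqq expires. When the rq is later taken off bfq_high_prio_tmp_list, its bfqq has already expired, yet my code does bfqq->dispatched++. The rq is then sent to the driver, and when it completes, bfq_completed_request() does bfqq->dispatched--. That is a problem, because the bfqq has already expired!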

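So here is the question: after rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq) has taken the rq from the bfqq and the rq has been added to bfq_high_prio_tmp_list, should the rq be detached from its bfqq completely? If it is not detached, the bfqq may expire because its budget runs out while the rq is parked on the list. When the rq is taken off the list later, its bfqq has already expired and the original flow can no longer be followed. What then? One option, after the rq has been added to bfq_high_prio_tmp_list: first execute bfqq->dispatched--, which creates the illusion that the rq has completed, because normally bfqq->dispatched-- means exactly that. Then set rq->elv.priv[0] = NULL and rq->elv.priv[1] = NULL so that the rq's bfqq is NULL and the two are detached. Later, when the rq is taken off bfq_high_prio_tmp_list, bfqq->dispatched++ is not executed, because the rq no longer belongs to any bfqq, and the rq is dispatched. When the rq completes, bfq_finish_requeue_request() returns immediately because the rq's bfqq is NULL, and bfq_completed_request() is not called to do bfqq->dispatched--.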

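But this approach has a problem too. Normally, when an rq completes, bfq_finish_requeue_request->bfq_completed_request() updates many bfqq parameters that are closely tied to the bfq algorithm. With my change, as soon as an rq joins bfq_high_prio_tmp_list its bfqq is set to NULL, so bfq_finish_requeue_request->bfq_completed_request() cannot run when the rq completes. The bfqq parameters are not updated, and that is bound to affect the bfq algorithm. I was stuck either way, with no perfect solution.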

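No, after thinking it over, there is a solution after all! After rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq), my added code still puts the rq on bfq_high_prio_tmp_list, but both bfqq->dispatched++ and bfqq->dispatched-- are removed and the rest of the code is left unchanged. After that, the rq's bfqq may expire and move from the st->active tree to the st->idle tree. But can the bfqq be freed completely? No. The process that owns the bfqq still has rqs parked on bfq_high_prio_tmp_list, and the process has to wait for those rqs to be dispatched before it can exit. My earlier statement that the bfqq of an rq on bfq_high_prio_tmp_list might be freed and reused by a new process was wrong. When is a bfqq freed? It is freed in bfq_put_queue(), but only when bfqq->ref is 0. Every rq inserted into the bfqq does bfqq->ref++, so bfqq->ref can only reach 0 after all of the bfqq's rqs have been dispatched. Only then can bfq_forget_entity()->bfq_put_queue() free the bfqq because bfqq->ref is 0. So even if some of a bfqq's rqs have been moved to bfq_high_prio_tmp_list, the remaining rqs on the bfqq have all been dispatched, and the bfqq has expired, the bfqq is not freed. That should be how it works!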
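The reference-count argument can be pictured with a tiny model. It is a sketch, not the bfq implementation: `struct model_bfqq`, `model_get_queue()` and `model_put_queue()` are hypothetical, and the model assumes that each rq holds one reference on its bfqq from insertion until the rq is finished, in addition to one reference held by the process.

```c
#include <assert.h>
#include <stdbool.h>

struct model_bfqq {
    int ref;      /* stand-in for bfqq->ref */
    bool freed;   /* set where the real code would free the bfqq */
};

/* Take one reference (the process, or an rq inserted into the bfqq). */
static void model_get_queue(struct model_bfqq *q)
{
    assert(!q->freed);
    q->ref++;
}

/* Drop one reference; the queue is only freed when the count reaches 0. */
static void model_put_queue(struct model_bfqq *q)
{
    assert(!q->freed && q->ref > 0);
    if (--q->ref == 0)
        q->freed = true;
}
```

In this model, if the process and two rqs each take a reference and then one rq and the process drop theirs, ref is still 1 while the second rq sits on the temporary list, so the queue is not freed until that rq finishes as well.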

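So my conclusion is this: after an rq returned by rq = bfq_dispatch_rq_from_bfqq() (which already did bfqq->dispatched++) is put on bfq_high_prio_tmp_list, bfqq->dispatched-- is no longer executed. The rq then still counts as belonging to the bfqq; it is merely stored somewhere else and handed to the driver later. Consider that even without my code, an rq chosen by rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq) and passed straight to the driver also waits in the driver's queue when the disk-array driver is busy, and cannot reach the disk hardware immediately. There the rq waits in the disk driver's queue; with my bfq code it waits on bfq_high_prio_tmp_list. Both are delayed dispatch, so what is the difference?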
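The difference between the first idea (decrement on parking, increment on unparking) and the final one (leave the counter alone) shows up in a small counter model. This is a sketch with hypothetical names (`enum scheme`, `struct model_bfqq` and the four helpers), not the patch itself.

```c
#include <assert.h>

enum scheme { SCHEME_DEC_ON_PARK, SCHEME_KEEP_COUNT };

struct model_bfqq {
    int dispatched;  /* stand-in for bfqq->dispatched */
    int parked;      /* this bfqq's rqs sitting on the temporary list */
};

/* bfq_dispatch_rq_from_bfqq() always increments the counter. */
static void model_take_from_bfqq(struct model_bfqq *q)
{
    q->dispatched++;
}

/* Park the rq on the temporary list. */
static void model_park(struct model_bfqq *q, enum scheme s)
{
    q->parked++;
    if (s == SCHEME_DEC_ON_PARK)
        q->dispatched--;
}

/* Take the rq off the temporary list and send it to the driver. */
static void model_unpark(struct model_bfqq *q, enum scheme s)
{
    assert(q->parked > 0);
    q->parked--;
    if (s == SCHEME_DEC_ON_PARK)
        q->dispatched++;
}

/* Completion of a started rq decrements the counter. */
static void model_complete(struct model_bfqq *q)
{
    q->dispatched--;
    assert(q->dispatched >= 0);
}
```

With SCHEME_DEC_ON_PARK the counter reads 0 while the last rq is still parked, which is the misleading state discussed earlier; with SCHEME_KEEP_COUNT it stays at 1 until the rq really completes. Both schemes end at 0 after completion.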

Source: https://blog.csdn.net/hu1610552336/article/details/135314705