• 关于 2.6 big kernel lock,大内核锁 - [Linux Kernel binutils ]2005-12-10

    关于 big kernel lock,大内核锁

    关于内核各人自有不同的解读方法,本人常用readelf, objdump等binutils中的工具, 因对有些函数如 lock_kernel 的操作有很多编译参数的控制导致看不清其代码到底是什么, 故将生成的vmlinux反汇编一下,于是一目了然:

    c02a57a0 <lock_kernel>:
    c02a57a0:       ba 00 e0 ff ff          mov    $0xffffe000,%edx
    c02a57a5:       21 e2                   and    %esp,%edx
    c02a57a7:       8b 02                   mov    (%edx),%eax
    c02a57a9:       8b 48 14                mov    0x14(%eax),%ecx
    c02a57ac:       41                      inc    %ecx
    c02a57ad:       75 05                   jne    c02a57b4 <lock_kernel+0x14>
    c02a57af:       ff 42 14                incl   0x14(%edx)
    c02a57b2:       8b 02                   mov    (%edx),%eax
    c02a57b4:       89 48 14                mov    %ecx,0x14(%eax)
    c02a57b7:       c3                      ret
    c02a57b8:       90                      nop
    c02a57b9:       8d b4 26 00 00 00 00    lea    0x0(%esi,1),%esi
    
    c02a57c0 <unlock_kernel>:
    c02a57c0:       ba 00 e0 ff ff          mov    $0xffffe000,%edx
    c02a57c5:       21 e2                   and    %esp,%edx
    c02a57c7:       8b 0a                   mov    (%edx),%ecx
    c02a57c9:       8b 41 14                mov    0x14(%ecx),%eax
    c02a57cc:       85 c0                   test   %eax,%eax
    c02a57ce:       78 19                   js     c02a57e9 <unlock_kernel+0x29>
    c02a57d0:       48                      dec    %eax
    c02a57d1:       85 c0                   test   %eax,%eax
    c02a57d3:       89 41 14                mov    %eax,0x14(%ecx)
    c02a57d6:       79 0b                   jns    c02a57e3 <unlock_kernel+0x23>
    c02a57d8:       ff 4a 14                decl   0x14(%edx)
    c02a57db:       8b 42 08                mov    0x8(%edx),%eax
    c02a57de:       83 e0 08                and    $0x8,%eax
    c02a57e1:       75 01                   jne    c02a57e4 <unlock_kernel+0x24>
    c02a57e3:       c3                      ret
    c02a57e4:       e9 a7 ed ff ff          jmp    c02a4590 <preempt_schedule>
    c02a57e9:       0f 0b                   ud2a
    c02a57eb:       c7 00 00 fb 2b c0       movl   $0xc02bfb00,(%eax)
    c02a57f1:       eb dd                   jmp    c02a57d0 <unlock_kernel+0x10>

    栈指针 esp 中包含的是内核栈上的位置,而常数 $0xffffe000 实际上就是值为 ~(THREAD_SIZE - 1), 因在内核代码运行时所用的栈区就是每个进程的内核栈,大小为THREAD_SIZE,并且是THREAD_SIZE对齐的, 并正好放在一个struct thread_info上,因而将 esp 与 ~(THREAD_SIZE - 1) 的结果就是当前进程的 struct thread_info结构体指针,在结构体中包含struct task_struct指针,于是struct task_struct 结构体指针轻易取得。

    从这两个结构定义中可以看出,

    • struct thread_info 的偏移 0x14 处的是int preempt_count, 代表当前进程在调度时是否可抢占:等于0为可抢占,大于0是不可抢占,小于0则是出BUG了
    • struct task_struct 的偏移 0x14 处的是int lock_depth, 正是大内核锁的实现所定义的,从 kernel_locked() 的宏定义可以看出如果 lock_depth 大于或等于0 则是大内核锁锁上了,小于0则是出BUG了,在 include/linux/init_task.h 中对 init_task 初始化的值 为 -1, 说明其正常的值为-1,大于或等于零则是内核上锁了。

    根据 lock_kernel 反汇编码可以看出所谓大内核锁的上锁其实就是对当前进程的 lock_depth 加1,并判断 加1后如果等于0则为第一次上锁,需对当前 struct thread_info 的 preempt_count 也加1阻止抢占, 大内核锁的特点之一就是可以递归调用,也即可以重复上锁,只要解锁次数与上锁次数相同即可。实现方式就是 lock_depth 值累计上锁次数,但 preempt_count 值并不需累计,所以同一进程的第二次上锁时只对 lock_depth 累计而不对 preempt_count 累计。

    解锁时在 unlock_kernel 中也是可看出:首先检查 lock_depth 小于0,小于0则出BUG了(此处以 ud2a 实现, 关于内核中 BUG 函数以 ud2a 所实现代码的巧妙性可以另一文写出), 说明有人在未上锁时先解锁了,肯定是代码有BUG了。然后对 lock_depth 减1操作并写回,如果减1后不为负数则 不是递归上锁时的最后一次解锁,直接返回。而如果是负数则是最后一次解锁,这时递减preempt_count值,将恢复为可 抢占状态,

    大内核锁的特点就是可递归使用,在 schedule 调度过程中如果发现当前进程对 BKL 上锁则释放它,下一次调度时再 重新获得它,使用这两个函数:

    c02a5780 <__reacquire_kernel_lock>:
    c02a5780:       b8 00 e0 ff ff          mov    $0xffffe000,%eax
    c02a5785:       21 e0                   and    %esp,%eax
    c02a5787:       ff 40 14                incl   0x14(%eax)
    c02a578a:       31 c0                   xor    %eax,%eax
    c02a578c:       c3                      ret
    c02a578d:       8d 76 00                lea    0x0(%esi),%esi

    本次试验编译平台为 CONFIG_PREEMPT=y 而 CONFIG_PREEMPT_BKL not set,CONFIG_SMP not set 这是为桌面编译 2.6 内核的很典型配置(是啊,编译 2.6 内核如果不打开PREEMPT那还有何趣味?)。 在 CONFIG_SMP 关闭情况下,spinlock 优化为无。

    __reacquire_kernel_lock 其实就是很简单的 preempt_disable(),对 preempt_count 直接自增, 并由 eax 返回0

    c02a5790 <__release_kernel_lock>:
    c02a5790:       b8 00 e0 ff ff          mov    $0xffffe000,%eax
    c02a5795:       21 e0                   and    %esp,%eax
    c02a5797:       ff 48 14                decl   0x14(%eax)
    c02a579a:       c3                      ret
    c02a579b:       90                      nop
    c02a579c:       8d 74 26 00             lea    0x0(%esi,1),%esi

    相应 __release_kernel_lock 也就是对 preempt_count 直接自减。返回类型为 void

    在 unlock_kernel 中 ud2a 指令后的 c02a57eb 处其实是两个数:word 类型的 0x007c 和一个指针: 0xc02bfb00: 在这个指针处查到代码如下,看起来汇编码很乱: 仔细对映一下发现它就是字符串:"/usr/src/linux-2.6.14/lib/kernel_lock.c"

    c02bfaff:       00 2f                   add    %ch,(%edi)
    c02bfb01:       75 73                   jne    c02bfb76 <__func__.1+0x10b44>
    c02bfb03:       72 2f                   jb     c02bfb34 <__func__.1+0x10b02>
    c02bfb05:       73 72                   jae    c02bfb79 <__func__.1+0x10b47>
    c02bfb07:       63 2f                   arpl   %bp,(%edi)
    c02bfb09:       6c                      insb   (%dx),%es:(%edi)
    c02bfb0a:       69 6e 75 78 2d 32 2e    imul   $0x2e322d78,0x75(%esi),%ebp
    c02bfb11:       36 2e 31 34 2f          xor    %esi,%cs:%ss:(%edi,%ebp,1)
    c02bfb16:       6c                      insb   (%dx),%es:(%edi)
    c02bfb17:       69 62 2f 6b 65 72 6e    imul   $0x6e72656b,0x2f(%edx),%esp
    c02bfb1e:       65                      gs
    c02bfb1f:       6c                      insb   (%dx),%es:(%edi)
    c02bfb20:       5f                      pop    %edi
    c02bfb21:       6c                      insb   (%dx),%es:(%edi)
    c02bfb22:       6f                      outsl  %ds:(%esi),(%dx)
    c02bfb23:       63 6b 2e                arpl   %bp,0x2e(%ebx)
    c02bfb26:       63 00                   arpl   %ax,(%eax)

    在 unlock_kernel 中同时还调用了一个 preempt_schedule ,即在判断 need_schedule 时, 如果需要重调度则跳转至此,给出汇编码和 C 代码:

    c02a4590 <preempt_schedule>:
    c02a4590:       55                      push   %ebp
    c02a4591:       b8 00 e0 ff ff          mov    $0xffffe000,%eax
    c02a4596:       89 e5                   mov    %esp,%ebp
    c02a4598:       21 e0                   and    %esp,%eax
    c02a459a:       53                      push   %ebx
    c02a459b:       8b 48 14                mov    0x14(%eax),%ecx
    c02a459e:       31 d2                   xor    %edx,%edx
    c02a45a0:       85 c9                   test   %ecx,%ecx
    c02a45a2:       75 0d                   jne    c02a45b1 <preempt_schedule+0x21>
    c02a45a4:       9c                      pushf
    c02a45a5:       58                      pop    %eax
    c02a45a6:       c1 e8 09                shr    $0x9,%eax
    c02a45a9:       83 f0 01                xor    $0x1,%eax
    c02a45ac:       83 e0 01                and    $0x1,%eax
    c02a45af:       74 05                   je     c02a45b6 <preempt_schedule+0x26>
    c02a45b1:       ba 01 00 00 00          mov    $0x1,%edx
    c02a45b6:       85 d2                   test   %edx,%edx
    c02a45b8:       75 22                   jne    c02a45dc <preempt_schedule+0x4c>
    c02a45ba:       bb 00 e0 ff ff          mov    $0xffffe000,%ebx
    c02a45bf:       21 e3                   and    %esp,%ebx
    c02a45c1:       81 43 14 00 00 00 10    addl   $0x10000000,0x14(%ebx)
    c02a45c8:       e8 b3 f9 ff ff          call   c02a3f80 <schedule>
    c02a45cd:       81 6b 14 00 00 00 10    subl   $0x10000000,0x14(%ebx)
    c02a45d4:       8b 43 08                mov    0x8(%ebx),%eax
    c02a45d7:       83 e0 08                and    $0x8,%eax
    c02a45da:       75 de                   jne    c02a45ba <preempt_schedule+0x2a>
    c02a45dc:       5b                      pop    %ebx
    c02a45dd:       5d                      pop    %ebp
    c02a45de:       c3                      ret
    c02a45df:       90                      nop
    
    /*
     * this is is the entry point to schedule() from in-kernel preemption
     * off of preempt_enable.  Kernel preemptions off return from interrupt
     * occur there and call schedule directly.
     */
    asmlinkage void __sched preempt_schedule(void)
    {
            struct thread_info *ti = current_thread_info();
            /*
             * If there is a non-zero preempt_count or interrupts are disabled,
             * we do not want to preempt the current task.  Just return..
             */
            if (unlikely(ti->preempt_count || irqs_disabled()))
                    return;
    
    need_resched:
            add_preempt_count(PREEMPT_ACTIVE);
            schedule();
            sub_preempt_count(PREEMPT_ACTIVE);
    
            /* we could miss a preemption opportunity between schedule and now */
            barrier();
            if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
                    goto need_resched;
    }

    在 schedule() 之前给 preempt_count 加上了 PREEMPT_ACTIVE(值为0x10000000),以改变调度时对 当前进程的某些策略。

    有关代码:

    /* Attach to any functions which should be ignored in wchan output. */
    #define __sched         __attribute__((__section__(".sched.text")))
    
    #define THREAD_SIZE             (8192)
    #define STACK_WARN             (THREAD_SIZE/8)
    #define TIF_NEED_RESCHED        3       /* rescheduling necessary */
    #define _TIF_NEED_RESCHED       (1<<TIF_NEED_RESCHED)
    
    #define fastcall        __attribute__((regparm(3)))
    
    # define add_preempt_count(val) do { preempt_count() += (val); } while (0)
    # define sub_preempt_count(val) do { preempt_count() -= (val); } while (0)
    
    #define preempt_count() (current_thread_info()->preempt_count)
    
    #define inc_preempt_count() add_preempt_count(1)
    #define dec_preempt_count() sub_preempt_count(1)
    
    #define preempt_disable() \
    do { \
            inc_preempt_count(); \
            barrier(); \
    } while (0)
    
    #define preempt_enable_no_resched() \
    do { \
            barrier(); \
            dec_preempt_count(); \
    } while (0)
    
    /* how to get the thread information struct from C */
    static inline struct thread_info *current_thread_info(void)
    {
            struct thread_info *ti;
            __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
            return ti;
    }
    
    #define __lockfunc fastcall __attribute__((section(".spinlock.text")))
    
    #define kernel_locked()         (current->lock_depth >= 0)
    
    #define irqs_disabled()                 \
    ({                                      \
            unsigned long flags;            \
            local_save_flags(flags);        \
            !(flags & (1<<9));              \
    })
    
    struct thread_info {
            struct task_struct      *task;          /* main task structure */
            struct exec_domain      *exec_domain;   /* execution domain */
            unsigned long           flags;          /* low level flags */
            unsigned long           status;         /* thread-synchronous flags */
            __u32                   cpu;            /* current CPU */
            int                     preempt_count;  /* 0 => preemptable, <0      => BUG */
    
    struct task_struct {
            volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
            struct thread_info *thread_info;
            atomic_t usage;
            unsigned long flags;    /* per process flags, defined below */
            unsigned long ptrace;
    
            int lock_depth;         /* BKL lock depth */
    
    /*
     *  INIT_TASK is used to set up the first task table, touch at
     * your own risk!. Base=0, limit=0x1fffff (=2MB)
     */
    #define INIT_TASK(tsk)  \
    {                                                                       \
            .state          = 0,                                            \
            .thread_info    = &init_thread_info,                            \
            .usage          = ATOMIC_INIT(2),                               \
            .flags          = 0,                                            \
            .lock_depth     = -1,                                           \
    
    #define INIT_THREAD_INFO(tsk)                   \
    {                                               \
            .task           = &tsk,                 \
            .exec_domain    = &default_exec_domain, \
            .flags          = 0,                    \
            .cpu            = 0,                    \
            .preempt_count  = 1,                    \
            .addr_limit     = KERNEL_DS,            \
            .restart_block = {                      \
                    .fn = do_no_restart_syscall,    \
            },                                      \
    }
    
    include/linux/spinlock.h
    
    /*
     * Must define these before including other files, inline functions need them
     */
    #define LOCK_SECTION_NAME                       \
            ".text.lock." __stringify(KBUILD_BASENAME)
    
    #define LOCK_SECTION_START(extra)               \
            ".subsection 1\n\t"                     \
            extra                                   \
            ".ifndef " LOCK_SECTION_NAME "\n\t"     \
            LOCK_SECTION_NAME ":\n\t"               \
            ".endif\n"
    
    #define LOCK_SECTION_END                        \
            ".previous\n\t"
    
    #define __lockfunc fastcall __attribute__((section(".spinlock.text")))
    
    include/linux/compiler.h
    
    # define __acquires(x)  __attribute__((context(0,1)))
    # define __releases(x)  __attribute__((context(1,0)))
    
    extern void __lockfunc lock_kernel(void)        __acquires(kernel_lock);
    extern void __lockfunc unlock_kernel(void)      __releases(kernel_lock);
    
    /*
     * Generic compiler-dependent macros required for kernel
     * build go below this comment. Actual compiler/compiler version
     * specific implementations come from the above header files
     */
    
    #define likely(x)       __builtin_expect(!!(x), 1)
    #define unlikely(x)     __builtin_expect(!!(x), 0)
    
    lib/kernel_lock.c
    
    /*
     * Getting the big kernel lock.
     *
     * This cannot happen asynchronously, so we only need to
     * worry about other CPU's.
     */
    void __lockfunc lock_kernel(void)
    {
            int depth = current->lock_depth+1;
            if (likely(!depth))
                    __lock_kernel();
            current->lock_depth = depth;
    }
    
    void __lockfunc unlock_kernel(void)
    {
            BUG_ON(current->lock_depth < 0);
            if (likely(--current->lock_depth < 0))
                    __unlock_kernel();
    }
    
    #define BUG()                           \
     __asm__ __volatile__(  "ud2\n"         \
                            "\t.word %c0\n" \
                            "\t.long %c1\n" \
                             : : "i" (__LINE__), "i" (__FILE__))
    
    #define BUG_ON(condition) do { if (unlikely((condition)!=0)) BUG(); } while(0)
    
    /*
     * Acquire/release the underlying lock from the scheduler.
     *
     * This is called with preemption disabled, and should
     * return an error value if it cannot get the lock and
     * TIF_NEED_RESCHED gets set.
     *
     * If it successfully gets the lock, it should increment
     * the preemption count like any spinlock does.
     *
     * (This works on UP too - _raw_spin_trylock will never
     * return false in that case)
     */
    int __lockfunc __reacquire_kernel_lock(void)
    {
            while (!_raw_spin_trylock(&kernel_flag)) {
                    if (test_thread_flag(TIF_NEED_RESCHED))
                            return -EAGAIN;
                    cpu_relax();
            }
            preempt_disable();
            return 0;
    }
    
    void __lockfunc __release_kernel_lock(void)
    {
            _raw_spin_unlock(&kernel_flag);
            preempt_enable_no_resched();
    }
    
    /*
     * These are the BKL spinlocks - we try to be polite about preemption.
     * If SMP is not on (ie UP preemption), this all goes away because the
     * _raw_spin_trylock() will always succeed.
     */
    # CONFIG_PREEMPT=y
    static inline void __lock_kernel(void)
    {
            preempt_disable();
            if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
                    /*
                     * If preemption was disabled even before this
                     * was called, there's nothing we can be polite
                     * about - just spin.
                     */
                    if (preempt_count() > 1) {
                            _raw_spin_lock(&kernel_flag);
                            return;
                    }
    
                    /*
                     * Otherwise, let's wait for the kernel lock
                     * with preemption enabled..
                     */
                    do {
                            preempt_enable();
                            while (spin_is_locked(&kernel_flag))
                                    cpu_relax();
                            preempt_disable();
                    } while (!_raw_spin_trylock(&kernel_flag));
            }
    }
    
    static inline void __unlock_kernel(void)
    {
            spin_unlock(&kernel_flag);
    }




    Tags: Linux Kernel binutils

    非非 发表于 14:06:54 | 引用 0 | 编辑

评论

  • 这种内核的分析的确很好,我使用IDA这个软件把内核反汇编,看起来比较容易,不过有一些问题我不大明白,请教一下.
    在2.6.23内核中,加入了paravirt.h这部分
    其中有这段代码
    #define _paravirt_alt(insn_string, type, clobber) \
    "771:\n\t" insn_string "\n" "772:\n" \
    ".pushsection .parainstructions,\"a\"\n" \
    " .long 771b\n" \
    " .byte " type "\n" \
    " .byte 772b-771b\n" \
    " .short " clobber "\n" \
    ".popsection\n"
    我的理解是有parainistructions这个节区,并且把传递的变量放在这个节区定义的变量中.但是我用objdump之后的内容不是变量的定义.请给予解答.

    kimming () 发表于 2008-01-26 23:37:08   [回复]
  • 非常感谢你的独到的BKL

    pangdae () 发表于 2006-01-05 15:22:01   [回复]

发表评论