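Understanding Python Execution from the Bottom Up

[Editor's note] In this post you will build a bytecode-level tracing API to explore some of Python's internals, such as how opcodes like YIELD_VALUE and YIELD_FROM are implemented and how list comprehensions, generator expressions, and other interesting Python constructs are compiled.

About the translator: Zhao Bin (趙斌) is an engineer at OneAPM who has long worked with Python/Perl scripting on DevOps and test development. In his spare time he loves reading and enjoys MOOCs.

The translation follows.

I have recently been studying Python's execution model. I am curious about some of Python's internals: how opcodes such as YIELD_VALUE and YIELD_FROM are implemented; how list comprehensions, generator expressions, and other interesting Python features are compiled; and what happens at the bytecode level when an exception is raised. Reading the CPython source certainly helps answer these questions, but I still felt something was missing for understanding bytecode execution and how the stack evolves. GDB is a good option, but I am lazy and wanted to do this with Python code and a reasonably high-level interface.

So my goal was to create a bytecode-level tracing API, similar to what sys.settrace provides but with finer granularity. It was also good exercise for my ability to write C code against the CPython implementation. We need the following pieces (the Python version used in this article is 3.5):

A new opcode in the CPython interpreter
A way to inject the opcode into Python bytecode
Some Python code to handle the opcode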
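For comparison, here is a minimal sketch of what sys.settrace offers out of the box: events at line granularity. The helper collect_line_events and the demo function are names made up for this illustration; only sys.settrace and sys.gettrace come from the standard library.

```python
import sys

def collect_line_events(func, *args):
    """Run func under sys.settrace and record (function name, line number) pairs."""
    events = []

    def tracer(frame, event, arg):
        # The interpreter calls this for 'call', 'line', 'return', ... events.
        if event == "line":
            events.append((frame.f_code.co_name, frame.f_lineno))
        return tracer  # keep tracing inside this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, events

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, events = collect_line_events(demo, 3)
print(result)
print(events[:3])
```

Each recorded event corresponds to a source line, not to a single bytecode instruction, which is the gap the tracer built in this article tries to fill.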

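A new CPython opcode

New opcode: DEBUG_OP

This new opcode, DEBUG_OP, is my first attempt at writing C code for the CPython implementation, so I kept it as simple as possible. The goal is to have a way to call some Python code whenever the opcode is executed, and to retrieve some data about the execution context along the way. The opcode passes that information to our callback as arguments. The useful information available from inside the opcode is the content of the stack and the frame that is executing DEBUG_OP.

So our opcode needs to:

Find the callback function
Build a list containing the content of the stack
Call the callback, passing it the stack list and the current frame as arguments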
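The three steps above can be modelled in plain Python before touching any C. This is only a sketch of the intended behavior, not the opcode itself: debug_op and fake_globals are hypothetical names, and the frame is a placeholder string. The callback name op_target is the one the article settles on later.

```python
def debug_op(frame_globals, stack, frame):
    """Pure-Python model of what DEBUG_OP is meant to do (not the C code)."""
    # Step 1: find the callback by a fixed name in the frame's globals.
    target = frame_globals.get("op_target")
    if target is None:
        return  # no callback registered: behave like a no-op
    # Step 2: build a list with the stack content, top of stack first.
    snapshot = list(reversed(stack))
    # Step 3: call the callback with the stack list and the current frame.
    target(snapshot, frame)

calls = []
fake_globals = {"op_target": lambda stack, frame: calls.append((stack, frame))}
debug_op(fake_globals, ["bottom", "top"], frame="<frame placeholder>")
debug_op({}, ["ignored"], frame=None)  # no op_target defined: nothing happens
print(calls)
```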

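Sounds simple enough, so let's get started! Disclaimer: all the explanations and code below are conclusions reached after a lot of segfault debugging. The first thing to do is give the opcode a name and a value, so we need to add code to Include/opcode.h.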

/** My own comments begin by '**' **/
/** From: Include/opcode.h **/

/* Instruction opcodes for compiled code */

/** We just have to define our opcode with a free value
    0 was the first one I found **/
#define DEBUG_OP                0

#define POP_TOP                 1
#define ROT_TWO                 2
#define ROT_THREE               3

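That part is done. Now we write the code that does the actual work of the opcode.

Implementing DEBUG_OP

Before thinking about how to implement DEBUG_OP, we need to decide what its interface will look like. A new opcode that can call other code is pretty cool, but which code will it call, exactly? And how does the opcode find the callback? I chose the simplest approach: a function name hard-coded in the frame's globals. The question then becomes: how do I look up a fixed C string in a dictionary? To answer it, let's look at the identifiers __enter__ and __exit__, which the Python main loop uses for context management.

We can see both identifiers being used in the SETUP_WITH opcode: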

/** From: Python/ceval.c **/
TARGET(SETUP_WITH) {
    _Py_IDENTIFIER(__exit__);
    _Py_IDENTIFIER(__enter__);
    PyObject *mgr = TOP();
    PyObject *exit = special_lookup(mgr, &PyId___exit__), *enter;
    PyObject *res;
    /** ... **/

Now let's take a look at the definition of the _Py_IDENTIFIER macro:

/** From: Include/object.h **/

/********************* String Literals ****************************************/
/* This structure helps managing static strings. The basic usage goes like this:
   Instead of doing

       r = PyObject_CallMethod(o, "foo", "args", ...);

   do

       _Py_IDENTIFIER(foo);
       ...
       r = _PyObject_CallMethodId(o, &PyId_foo, "args", ...);

   PyId_foo is a static variable, either on block level or file level. On first
   usage, the string "foo" is interned, and the structures are linked. On interpreter
   shutdown, all strings are released (through _PyUnicode_ClearStaticStrings).

   Alternatively, _Py_static_string allows to choose the variable name.
   _PyUnicode_FromId returns a borrowed reference to the interned string.
   _PyObject_{Get,Set,Has}AttrId are __getattr__ versions using _Py_Identifier*.
*/
typedef struct _Py_Identifier {
    struct _Py_Identifier *next;
    const char* string;
    PyObject *object;
} _Py_Identifier;

#define _Py_static_string_init(value) { 0, value, 0 }
#define _Py_static_string(varname, value)  static _Py_Identifier varname = _Py_static_string_init(value)
#define _Py_IDENTIFIER(varname) _Py_static_string(PyId_##varname, #varname)

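Well, the comment explains it clearly. After a bit of searching, we find a function that can look up a fixed string in a dictionary: _PyDict_GetItemId. So the lookup part of our opcode looks like this: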

/** Our callback function will be named op_target **/
PyObject *target = NULL;
_Py_IDENTIFIER(op_target);
target = _PyDict_GetItemId(f->f_globals, &PyId_op_target);
if (target == NULL && _PyErr_OCCURRED()) {
    if (!PyErr_ExceptionMatches(PyExc_KeyError))
        goto error;
    PyErr_Clear();
    DISPATCH();
}

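To make this easier to follow, a few notes on the code:

f is the current frame, and f->f_globals is its globals
If we do not find op_target, we check whether the exception is a KeyError
goto error; is how the main loop raises an exception
PyErr_Clear() suppresses the current exception, and DISPATCH() triggers execution of the next opcode

The next step is to collect the stack information we want.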

/** This code create a list with all the values on the current stack **/
PyObject *value = PyList_New(0);
for (i = 1 ; i <= STACK_LEVEL; i++) {
    tmp = PEEK(i);
    if (tmp == NULL) {
        tmp = Py_None;
    }
    PyList_Append(value, tmp);
}

The last step is to call our callback! We use call_function for this, and we learn how to use it by studying the implementation of the CALL_FUNCTION opcode.

/** From: Python/ceval.c **/
TARGET(CALL_FUNCTION) {
    PyObject **sp, *res;
    /** stack_pointer is a local of the main loop.
        It's the pointer to the stacktop of our frame **/
    sp = stack_pointer;
    res = call_function(&sp, oparg);
    /** call_function handles the args it consumed on the stack for us **/
    stack_pointer = sp;
    PUSH(res);
    /** Standard exception handling **/
    if (res == NULL)
        goto error;
    DISPATCH();
}

With all this information, we can finally put together a draft of the DEBUG_OP opcode:

TARGET(DEBUG_OP) {
    PyObject *value = NULL;
    PyObject *target = NULL;
    PyObject *res = NULL;
    PyObject **sp = NULL;
    PyObject *tmp;
    int i;
    _Py_IDENTIFIER(op_target);

    target = _PyDict_GetItemId(f->f_globals, &PyId_op_target);
    if (target == NULL && _PyErr_OCCURRED()) {
        if (!PyErr_ExceptionMatches(PyExc_KeyError))
            goto error;
        PyErr_Clear();
        DISPATCH();
    }
    value = PyList_New(0);
    Py_INCREF(target);
    for (i = 1 ; i <= STACK_LEVEL; i++) {
        tmp = PEEK(i);
        if (tmp == NULL)
            tmp = Py_None;
        PyList_Append(value, tmp);
    }

    PUSH(target);
    PUSH(value);
    Py_INCREF(f);
    PUSH(f);
    sp = stack_pointer;
    res = call_function(&sp, 2);
    stack_pointer = sp;
    if (res == NULL)
        goto error;
    Py_DECREF(res);
    DISPATCH();
}

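I really do not have much experience writing C code for the CPython implementation, so I may have missed some details. If you have any suggestions, please correct me; I look forward to your feedback.

Compile it, and it works!

Everything looks fine, but when we try to use our DEBUG_OP opcode, it fails. Since 2008, Python has used computed gotos to dispatch opcodes, so we also need to update the goto jump table. We make the following change in Python/opcode_targets.h: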

/** From: Python/opcode_targets.h **/
/** Easy change since DEBUG_OP is the opcode number 0 **/
static void *opcode_targets[256] = {
    //&&_unknown_opcode,
    &&TARGET_DEBUG_OP,
    &&TARGET_POP_TOP,
    /** ... **/

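That's it, we now have a working new opcode. The only problem is that, although it exists, nothing ever calls it. Next, we will inject DEBUG_OP into the bytecode of a function.

Injecting the DEBUG_OP opcode into Python bytecode

There are several ways to inject a new opcode into Python bytecode:

Use the peephole optimizer, which is what Quarkslab did
Tweak the code that generates the bytecode
Modify the function's bytecode directly at runtime (this is what we are going to do)

The C code above is all it takes to create the new opcode. Now let's get back to the starting point: understanding strange, even magical, Python!

What we are going to do:

Get the code object of the function we want to trace
Rewrite the bytecode to inject DEBUG_OP
Put the newly generated code object back in place

A few notes on code objects

If you have never heard of code objects, short introductions and related documentation are available online; search for "code object" with Ctrl+F.

One more thing to note: in the environment used for this article, code objects are immutable: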

Python 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> x = lambda y : 2
>>> x.__code__
<code object <lambda> at 0x7f481fd88390, file "<stdin>", line 1>
>>> x.__code__.co_name
'<lambda>'
>>> x.__code__.co_name = 'truc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readonly attribute
>>> x.__code__.co_consts = ('truc',)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: readonly attribute

But don't worry, we will find a way around this.
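As an aside, on Python 3.8 and later (an assumption that goes beyond the 3.5 setup used in this article), code objects expose a replace() method that returns a modified copy. The sketch below uses it to change co_name and a constant; the function greet and the strings are made up for the example.

```python
import types

def greet():
    return "spam"

old_code = greet.__code__

# CodeType.replace() returns a new code object; the original stays untouched.
new_consts = tuple("eggs" if const == "spam" else const
                   for const in old_code.co_consts)
new_code = old_code.replace(co_name="truc", co_consts=new_consts)

# Build a second function around the patched code object.
patched = types.FunctionType(new_code, greet.__globals__)

print(greet(), patched())            # the original function is unchanged
print(old_code.co_name, new_code.co_name)
```

The read-only attributes still cannot be assigned in place; the copy-and-rebuild idea is the same one the article implements by hand.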

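Tools

To modify the bytecode we need a few tools:

The dis module, used to disassemble and analyze bytecode
dis.Bytecode, a feature added in Python 3.4 that is especially handy for disassembling and analyzing bytecode
A way to modify code objects easily

Disassembling a code object with dis.Bytecode gives us information about the opcodes, their arguments, and their context.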

# Python3.4
>>> import dis
>>> f = lambda x: x + 3
>>> for i in dis.Bytecode(f.__code__): print (i)
...
Instruction(opname='LOAD_FAST', opcode=124, arg=0, argval='x', argrepr='x', offset=0, starts_line=1, is_jump_target=False)
Instruction(opname='LOAD_CONST', opcode=100, arg=1, argval=3, argrepr='3', offset=3, starts_line=None, is_jump_target=False)
Instruction(opname='BINARY_ADD', opcode=23, arg=None, argval=None, argrepr='', offset=6, starts_line=None, is_jump_target=False)
Instruction(opname='RETURN_VALUE', opcode=83, arg=None, argval=None, argrepr='', offset=7, starts_line=None, is_jump_target=False)
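The same iteration can be written as a small script. This is a sketch: add_three is a made-up function, and the exact opcode names and offsets printed depend on the Python version, so none are shown here.

```python
import dis

def add_three(x):
    return x + 3

# Collect (offset, opname, argrepr) for every instruction in the function.
rows = [(instr.offset, instr.opname, instr.argrepr)
        for instr in dis.Bytecode(add_three)]

for offset, opname, argrepr in rows:
    print(f"{offset:>4} {opname:<20} {argrepr}")

# Offsets are what the injection code needs: they mark where each
# instruction starts inside co_code.
offsets = [offset for offset, _, _ in rows]
```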

To be able to modify a code object, I defined a small class that copies a code object, lets us change the values we need, and then generates a new code object.

class MutableCodeObject(object):
    args_name = ("co_argcount", "co_kwonlyargcount", "co_nlocals", "co_stacksize", "co_flags", "co_code",
                  "co_consts", "co_names", "co_varnames", "co_filename", "co_name", "co_firstlineno",
                   "co_lnotab", "co_freevars", "co_cellvars")

    def __init__(self, initial_code):
        self.initial_code = initial_code
        for attr_name in self.args_name:
            attr = getattr(self.initial_code, attr_name)
            if isinstance(attr, tuple):
                attr = list(attr)
            setattr(self, attr_name, attr)

    def get_code(self):
        args = []
        for attr_name in self.args_name:
            attr = getattr(self, attr_name)
            if isinstance(attr, list):
                attr = tuple(attr)
            args.append(attr)
        return self.initial_code.__class__(*args)

This class is convenient to use and solves the code object immutability problem mentioned above.

>>> x = lambda y : 2
>>> m = MutableCodeObject(x.__code__)
>>> m
<new_code.MutableCodeObject object at 0x7f3f0ea546a0>
>>> m.co_consts
[None, 2]
>>> m.co_consts[1] = '3'
>>> m.co_name = 'truc'
>>> m.get_code()
<code object truc at 0x7f3f0ea2bc90, file "<stdin>", line 1>

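Testing our new opcode

We now have all the tools needed to inject DEBUG_OP, so let's check that our implementation works. We inject our opcode into the simplest possible function: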

from new_code import MutableCodeObject

def op_target(*args):
    print("WOOT")
    print("op_target called with args <{0}>".format(args))

def nop():
    pass

new_nop_code = MutableCodeObject(nop.__code__)
new_nop_code.co_code = b"\x00" + new_nop_code.co_code[0:3] + b"\x00" + new_nop_code.co_code[-1:]
new_nop_code.co_stacksize += 3

nop.__code__ = new_nop_code.get_code()

import dis
dis.dis(nop)
nop()


# Don't forget that ./python is our custom Python implementing DEBUG_OP
hakril@computer ~/python/CPython3.5 % ./python proof.py
  8           0 <0>
              1 LOAD_CONST               0 (None)
              4 <0>
              5 RETURN_VALUE
WOOT
op_target called with args <([], <frame object at 0x7fde9eaebdb0>)>
WOOT
op_target called with args <([None], <frame object at 0x7fde9eaebdb0>)>

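It looks like it worked! One line needs an explanation: new_nop_code.co_stacksize += 3

co_stacksize is the stack size the code object needs
The DEBUG_OP opcode pushes three items onto the stack, so we need to reserve room for them

Now we can inject our opcode into every Python function!

Rewriting the bytecode

As the example above shows, rewriting Python bytecode looks easy. To inject our opcode between every pair of opcodes, we need the offset of each opcode and then insert our opcode at those positions (inserting it in the middle of an argument would be very bad). The offsets are easy to get with dis.Bytecode, like this: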

def add_debug_op_everywhere(code_obj):
    # We get every instruction offset in the code object
    offsets = [instr.offset for instr in dis.Bytecode(code_obj)]
    # And insert a DEBUG_OP at every offset
    return insert_op_debug_list(code_obj, offsets)

def insert_op_debug_list(code, offsets):
    # We insert the DEBUG_OP one by one
    for nb, off in enumerate(sorted(offsets)):
        # Need to adjust the offsets by the number of opcodes already inserted before
        # That's why we sort our offsets!
        code = insert_op_debug(code, off + nb)
    return code

# Last problem: what does insert_op_debug look like?
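The offset bookkeeping in insert_op_debug_list (the off + nb correction) can be checked on a plain byte string, leaving jumps aside for now. insert_markers is a hypothetical helper written for this illustration.

```python
def insert_markers(code, offsets, marker=b"\x00"):
    """Insert a one-byte marker before each original offset of a byte string."""
    result = code
    for already_inserted, offset in enumerate(sorted(offsets)):
        # Every earlier insertion shifted the remaining bytes right by one,
        # so the original offset must be corrected by that count.
        position = offset + already_inserted
        result = result[:position] + marker + result[position:]
    return result

patched = insert_markers(b"ABCDEF", [3, 0])
print(patched)
```

Sorting matters: processing the offsets in increasing order is what makes "number of markers already inserted" equal to the shift at each position.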

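Based on the example above, one might think that insert_op_debug just adds a "\x00" at the given offset, but that is a trap! The function in our first DEBUG_OP injection example had no branches. For insert_op_debug to work on any function, we need to handle branching opcodes.

Python has two kinds of branches:

Absolute branches, which look like Instruction_Pointer = argument(instruction)
Relative branches, which look like Instruction_Pointer += argument(instruction)

We want these branches to keep working after we insert our opcode, so we need to modify some instruction arguments. The logic is as follows:

For each relative branch located before the insertion offset
- If the target address is strictly greater than the insertion offset, add 1 to the instruction argument
- If it is equal, there is no need to add 1: the jump then lands on our DEBUG_OP, which executes between the jump and its target
- If it is smaller, inserting our opcode does not change the distance between the jump and its target
For each absolute branch in the code object
- If the target address is strictly greater than the insertion offset, add 1 to the instruction argument
- If it is equal, no change is needed, for the same reason as for relative branches
- If it is smaller, inserting our opcode does not move the jump target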
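The rules above fit in one small function. This is a sketch under the assumptions of the 3.5 bytecode format used in the article (an instruction with an argument is 3 bytes long); adjusted_arg and INSTR_SIZE are names invented here, and the sample offsets are only illustrative.

```python
INSTR_SIZE = 3  # size of an instruction with an argument in the 3.5 bytecode format

def adjusted_arg(kind, instr_offset, arg, insert_offset):
    """Return the new jump argument after inserting one byte at insert_offset."""
    if kind == "relative":
        # Only relative jumps located before the insertion point can be affected.
        if instr_offset >= insert_offset:
            return arg
        target = instr_offset + INSTR_SIZE + arg
    elif kind == "absolute":
        target = arg
    else:
        raise ValueError("kind must be 'relative' or 'absolute'")
    # Strictly greater: the target moved one byte further away.
    # Equal or smaller: leave the argument alone.
    return arg + 1 if target > insert_offset else arg

print(adjusted_arg("relative", 0, 36, 3))    # jump from 0 to 39, insertion at 3
print(adjusted_arg("relative", 0, 36, 39))   # insertion exactly at the target
print(adjusted_arg("absolute", 28, 13, 5))   # absolute jump to 13, insertion at 5
print(adjusted_arg("absolute", 28, 13, 13))  # insertion exactly at the target
```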

Here is the implementation:

import dis
import struct

# Helper
def bytecode_to_string(bytecode):
    if bytecode.arg is not None:
        return struct.pack("<Bh", bytecode.opcode, bytecode.arg)
    return struct.pack("<B", bytecode.opcode)

# Dummy class for bytecode_to_string
class DummyInstr:
    def __init__(self, opcode, arg):
        self.opcode = opcode
        self.arg = arg

def insert_op_debug(code, offset):
    opcode_jump_rel = ['FOR_ITER', 'JUMP_FORWARD', 'SETUP_LOOP', 'SETUP_WITH', 'SETUP_EXCEPT', 'SETUP_FINALLY']
    opcode_jump_abs = ['POP_JUMP_IF_TRUE', 'POP_JUMP_IF_FALSE', 'JUMP_ABSOLUTE']
    res_codestring = b""
    inserted = False
    for instr in dis.Bytecode(code):
        if instr.offset == offset:
            res_codestring += b"\x00"
            inserted = True
        if instr.opname in opcode_jump_rel and not inserted: #relative jump are always forward
            if offset < instr.offset + 3 + instr.arg: # inserted between jump and dest: add 1 to dest (3 for size)
                #If equal: jump on DEBUG_OP to get info before exec instr
                res_codestring += bytecode_to_string(DummyInstr(instr.opcode, instr.arg + 1))
                continue
        if instr.opname in opcode_jump_abs:
            if instr.arg > offset:
                res_codestring += bytecode_to_string(DummyInstr(instr.opcode, instr.arg + 1))
                continue
        res_codestring += bytecode_to_string(instr)
    # replace_bytecode just replaces the original code co_code
    return replace_bytecode(code, res_codestring)

Let's see how it works:

>>> def lol(x):
...     for i in range(10):
...         if x == i:
...             break

>>> dis.dis(lol)
101           0 SETUP_LOOP              36 (to 39)
              3 LOAD_GLOBAL              0 (range)
              6 LOAD_CONST               1 (10)
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 GET_ITER
        >>   13 FOR_ITER                22 (to 38)
             16 STORE_FAST               1 (i)

102          19 LOAD_FAST                0 (x)
             22 LOAD_FAST                1 (i)
             25 COMPARE_OP               2 (==)
             28 POP_JUMP_IF_FALSE       13

103          31 BREAK_LOOP
             32 JUMP_ABSOLUTE           13
             35 JUMP_ABSOLUTE           13
        >>   38 POP_BLOCK
        >>   39 LOAD_CONST               0 (None)
             42 RETURN_VALUE
>>> lol.__code__ = transform_code(lol.__code__, add_debug_op_everywhere, add_stacksize=3)


>>> dis.dis(lol)
101           0 <0>
              1 SETUP_LOOP              50 (to 54)
              4 <0>
              5 LOAD_GLOBAL              0 (range)
              8 <0>
              9 LOAD_CONST               1 (10)
             12 <0>
             13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             16 <0>
             17 GET_ITER
        >>   18 <0>

102          19 FOR_ITER                30 (to 52)
             22 <0>
             23 STORE_FAST               1 (i)
             26 <0>
             27 LOAD_FAST                0 (x)
             30 <0>

103          31 LOAD_FAST                1 (i)
             34 <0>
             35 COMPARE_OP               2 (==)
             38 <0>
             39 POP_JUMP_IF_FALSE       18
             42 <0>
             43 BREAK_LOOP
             44 <0>
             45 JUMP_ABSOLUTE           18
             48 <0>
             49 JUMP_ABSOLUTE           18
        >>   52 <0>
             53 POP_BLOCK
        >>   54 <0>
             55 LOAD_CONST               0 (None)
             58 <0>
             59 RETURN_VALUE

# Setup the simplest handler EVER
>>> def op_target(stack, frame):
...     print (stack)

# GO
>>> lol(2)


[<class 'range'>]
[10, <class 'range'>]
[range(0, 10)]
[<range_iterator object at 0x7f1349afab80>]
[0, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]
[2, <range_iterator object at 0x7f1349afab80>]
[0, 2, <range_iterator object at 0x7f1349afab80>]
[False, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]
[1, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]
[2, <range_iterator object at 0x7f1349afab80>]
[1, 2, <range_iterator object at 0x7f1349afab80>]
[False, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]
[2, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]
[2, <range_iterator object at 0x7f1349afab80>]
[2, 2, <range_iterator object at 0x7f1349afab80>]
[True, <range_iterator object at 0x7f1349afab80>]
[<range_iterator object at 0x7f1349afab80>]

[None]

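Great! We now know how to get the stack content and the frame information for every operation Python performs. The raw output above is not very practical as is, so in this last part we wrap the injection in something nicer.

Adding a Python wrapper

As you can see, all the low-level pieces work. The last thing to do is make op_target more convenient to use (this part is a little lighter on detail, since to me it is not the most interesting part of the project).

First, let's look at what the frame argument gives us:

f_code: the code object being executed by the current frame
f_lasti: the current operation (an index into the code object's bytecode string)

From these we can work out which opcode will be executed right after DEBUG_OP, which is very useful for aggregating and displaying the data.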
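Both attributes can be read from any live frame on a stock interpreter. The sketch below assumes CPython (inspect.currentframe may return None elsewhere); instruction_at and probe are hypothetical helpers, not part of the tracer described here.

```python
import dis
import inspect

def instruction_at(frame):
    """Return the dis.Instruction located at frame.f_lasti, or None."""
    for instr in dis.get_instructions(frame.f_code):
        if instr.offset == frame.f_lasti:
            return instr
    return None

def probe():
    frame = inspect.currentframe()
    # While instruction_at() runs, f_lasti of this frame is expected to
    # point at the instruction performing that very call.
    current = instruction_at(frame)
    return frame.f_code.co_name, frame.f_lasti, current

name, lasti, current = probe()
print(name, lasti, current.opname if current else None)
```

A wrapper that yields the instruction following f_lasti, as the article's FrameAnalyser is described as doing, could be built on the same lookup.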

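We create a class for tracing a function's internals. It:

Changes the function's own co_code
Sets the callback as the target function of op_debug

Once we know the next operation, we can analyze it and modify its arguments. For example, we can add an auto-follow-called-functions feature.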

def op_target(l, f, exc=None):
    if op_target.callback is not None:
        op_target.callback(l, f, exc)

class Trace:
    def __init__(self, func):
        self.func = func

    def call(self, *args, **kwargs):
        self.add_func_to_trace(self.func)
        # Activate Trace callback for the func call
        op_target.callback = self.callback
        try:
            res = self.func(*args, **kwargs)
        except Exception as e:
            res = e
        op_target.callback = None
        return res

    def add_func_to_trace(self, f):
        # Is it code? is it already transformed?
        if not hasattr(f ,"op_debug") and hasattr(f, "__code__"):
            f.__code__ = transform_code(f.__code__, transform=add_everywhere, add_stacksize=ADD_STACK)
            f.__globals__['op_target'] = op_target
            f.op_debug = True

    def do_auto_follow(self, stack, frame):
        # Nothing fancy: FrameAnalyser is just the wrapper that gives the next executed instruction
        next_instr = FrameAnalyser(frame).next_instr
        if "CALL" in next_instr.opname:
            arg = next_instr.arg
            f_index = (arg & 0xff) + (2 * (arg >> 8))
            called_func = stack[f_index]

            # If call target is not traced yet: do it
            if not hasattr(called_func, "op_debug"):
                self.add_func_to_trace(called_func)
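The f_index computation in do_auto_follow relies on how CALL_FUNCTION encodes its argument in the 3.5 bytecode: the low byte is the number of positional arguments and the high byte the number of keyword arguments. A small sketch, with a made-up stack list in the top-first order produced by DEBUG_OP:

```python
def callee_stack_index(arg):
    """Index of the called function in the top-first stack list (3.5 CALL_FUNCTION)."""
    positional = arg & 0xFF   # low byte: number of positional arguments
    keyword = arg >> 8        # high byte: number of keyword arguments
    # Each positional argument takes one stack slot; each keyword argument
    # takes two (name and value). The callable sits just below them.
    return positional + 2 * keyword

stack = ["kw_value", "kw_name", "pos_arg", "<function f>"]  # top of stack first
index = callee_stack_index(0x0101)  # 1 positional + 1 keyword argument
print(index, stack[index])
```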

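Now we implement a subclass of Trace that adds two methods, callback and do_report. callback is called at every operation, and do_report prints the information we collected.

Here is a dummy function tracer implementation: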

class DummyTrace(Trace):
    def __init__(self, func):
        self.func = func
        self.data = collections.OrderedDict()
        self.last_frame = None
        self.known_frame = []
        self.report = []

    def callback(self, stack, frame, exc):
        if frame not in self.known_frame:
            self.known_frame.append(frame)
            self.report.append(" === Entering New Frame {0} ({1}) ===".format(frame.f_code.co_name, id(frame)))
            self.last_frame = frame
        if frame != self.last_frame:
            self.report.append(" === Returning to Frame {0} {1}===".format(frame.f_code.co_name, id(frame)))
            self.last_frame = frame

        self.report.append(str(stack))
        instr = FrameAnalyser(frame).next_instr
        offset = str(instr.offset).rjust(8)
        opname = str(instr.opname).ljust(20)
        arg = str(instr.arg).ljust(10)
        self.report.append("{0}  {1} {2} {3}".format(offset, opname, arg, instr.argval))
        self.do_auto_follow(stack, frame)

    def do_report(self):
        print("\n".join(self.report))

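Here are some examples of the implementation and how it is used. The format is a bit hard to read; I am not good at producing user-friendly reports.

List comprehension tracing example.

Summary

This small project is a good way to learn about Python's low-level side, including the interpreter's main loop, C programming against the Python implementation, and Python bytecode. With this little tool we can watch the bytecode behavior of some interesting Python constructs, such as generators, context managers, and list comprehensions.

The complete code for this small project is available online. Going further, we could also modify the stack of the function we are tracing. I am not sure that would be useful, but it would certainly be fun.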


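Author | Einstellung

Editor | Guo Rui (郭芮)

First, some background. I took part in an early-warning competition on analyzing wind turbine cracking faults. The training data had nearly 50,000 samples and the test data nearly 90,000. The data came from a SCADA acquisition system: each sample holds 75 feature values collected over a 10-minute window, and the label says whether the turbine will fail within a week.

Data overview

Each sample contains roughly 450 records collected within the 10 minutes, with 75 features each, so about 75*450 values. The last three features contain no data at all, so we dropped them from the start and effectively analyzed 72 features.

At first we used seaborn to plot the frequency distributions of normal and faulty turbines, for example for the hub speed feature: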

import seaborn as sns
import pandas as pd

data_file = r"D:\fan_fault\feature1.csv"
pre_process = pd.read_csv(data_file, encoding="gbk")

pre_process = pre_process.fillna(0)
feature1_plot = pre_process["normal(0)"]
feature2_plot2 = pre_process["fault(1)"]

sns.kdeplot(feature1_plot, shade=True)
sns.kdeplot(feature2_plot2, shade=True)

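Most features look like this, with no clear separation between the two classes. That is why we never managed to find a really good model. Later we tried plotting with MATLAB, two plots per feature:

These look somewhat better than the seaborn plots (the last two plots show a different feature from the first). A big problem in our data analysis was that we only looked up the physical meaning of each feature and drew frequency and count distributions to look for good features, then moved straight on to the next step. We forgot to consider outliers and missing values caused by acquisition problems, which meant that much of our later work had to be redone.

Data analysis

Statistically we did not find any features with good separation, so we turned to physics. On our advisor's suggestion, we tried the y-direction vibration value when the hub is rotating and the wind speed is 6.5 m/s as a feature:

It still did not separate the classes well, and trying other wind speeds gave the same result.

We then discussed building new features with thresholds, zero counts, and so on. While considering the zero-count feature, we suddenly noticed that the atmospheric pressure feature contained zeros. Physically, the atmospheric pressure at a wind turbine cannot be 0. Only then did we realize that we had not handled outliers in the data. We deleted more than 80,000 rows that were entirely zero, which left some files empty, meaning that turbine has no data at all. Other turbines only had a few all-zero rows.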
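The all-zero-row cleanup can be sketched in a few lines of plain Python. The records below are invented, and drop_all_zero_rows is a hypothetical helper; the original work was done on the competition's CSV files.

```python
def drop_all_zero_rows(rows):
    """Keep only the rows that contain at least one non-zero value."""
    return [row for row in rows if any(value != 0 for value in row)]

# Hypothetical records: three readings per row.
records = [
    [12.5, 1013.0, 0.4],
    [0.0, 0.0, 0.0],      # acquisition dropout: every value is zero
    [11.9, 0.0, 0.3],     # a single zero is kept at this stage
]
cleaned = drop_all_zero_rows(records)
print(len(records), "->", len(cleaned))
is_empty_sample = len(cleaned) == 0  # a sample left with no rows has no data
```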

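Besides removing missing values, we also cleaned other obviously abnormal data. Because our earlier statistical analysis of the features had been done on the uncleaned data, its reliability was questionable, which caused some unnecessary trouble later. We also did some correlation analysis; most features are highly correlated with each other. Pairing up several dozen features for correlation analysis produces thousands of results, so we could not carry it through, and in the end we stopped considering correlation.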
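To put a number on "thousands of results": 72 features give 2556 unordered pairs. The sketch below counts them and shows one way such pairs could be filtered by Pearson correlation; the toy columns and the 0.9 cutoff are assumptions for illustration, not values from the competition.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

print(math.comb(72, 2))  # number of unordered feature pairs among 72 features

# Hypothetical toy columns standing in for three features.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
strong_pairs = [(p, q) for p, q in combinations(columns, 2)
                if abs(pearson(columns[p], columns[q])) > 0.9]
print(strong_pairs)
```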

Feature engineering

We first computed the mean of each of the first 72 features as a baseline to see how well it worked.

import os
import csv

import numpy as np
import pandas as pd

label_file = r"C:\fan_fault\train\trainX"
train_mean = r"D:\fan_fault\train_mean_new.csv"

with open(train_mean, "a", newline='', encoding="utf-8") as out_file:
    writer = csv.writer(out_file)

    for x in range(1, 48340):
        fan_file = os.path.join(label_file, str(x) + ".csv")
        print("Progress:", x / 48340)  # show how far along we are

        with open(fan_file, encoding='utf-8') as f:
            feature_read = pd.read_csv(f)

        # Loop over the 72 features of the file and compute each mean;
        # a temporarily holds the computed feature means
        a = []
        for i in range(72):
            mean_num = feature_read.iloc[:, i]
            mean_num = np.array(mean_num).mean()  # mean of all values of this feature
            a.append(mean_num)

        writer.writerow(a)

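We also built features such as the cumulative absolute difference, the mean of the differences, and the variance of the differences, and tuned a random forest on them.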
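A minimal sketch of these three difference features for a single time series, using only the standard library. The function name, the dictionary keys, and the choice of population variance are assumptions; the original feature code is not shown in the article.

```python
from statistics import mean, pvariance

def difference_features(series):
    """Summarise the first-order differences of one feature's time series."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return {
        "abs_diff_sum": sum(abs(d) for d in diffs),  # cumulative absolute difference
        "diff_mean": mean(diffs),                    # mean of the differences
        "diff_var": pvariance(diffs),                # population variance of the differences
    }

features = difference_features([1.0, 3.0, 2.0, 5.0])
print(features)
```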

# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Load the data and check for missing values
data = pd.read_csv(r'D:\next\8_19\train_data.csv', encoding="gbk")
label = pd.read_csv(r"D:\next\8_19\train_label.csv")
data.info()
data.notnull().sum(axis=0) / data.shape[0]
train = data.iloc[:, :-1]
label = label.iloc[:, -1]

# Scale the features
scaler = MinMaxScaler()
train = scaler.fit(train).transform(train)

# Single classifier
clf = RandomForestClassifier(random_state=14)
f1 = cross_val_score(clf, train, label, scoring='f1')
print("f1:{0:.1f}%".format(np.mean(f1) * 100))

# Hyperparameter search
parameter_space = {
    'n_estimators': range(10, 200, 10),
    'max_depth': range(1, 10),
    'min_samples_split': range(2, 10),
}
clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space, scoring='f1', n_jobs=6)
grid.fit(train, label)
print("f1:{0:.1f}%".format(grid.best_score_ * 100))
print(grid.best_estimator_)

# Classifier with the tuned parameters
new_clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                                 max_depth=7, max_features='auto', max_leaf_nodes=None,
                                 min_impurity_decrease=0.0, min_impurity_split=None,
                                 min_samples_leaf=1, min_samples_split=7,
                                 min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
                                 oob_score=False, random_state=14, verbose=0, warm_start=False)
f1 = cross_val_score(new_clf, train, label, scoring='f1')  # score the tuned classifier
print("f1:{0:.1f}%".format(np.mean(f1) * 100))

The predictions on the test set were produced as follows:

# Scale the features
scaler = MinMaxScaler()
train = scaler.fit(train).transform(train)
test = scaler.transform(test)  # reuse the scaling fitted on the training set

# Train the classifier
clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                             max_depth=8, max_features='auto', max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=5,
                             min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=5,
                             oob_score=False, random_state=14, verbose=0, warm_start=False)
clf = clf.fit(train, label)
# Predict
pre = clf.predict(test)

# Write the predictions to a file
import csv

label_path = r"D:/fan_fault/label.csv"

with open(label_path, "w", newline='', encoding="utf-8") as f:
    writer = csv.writer(f)

    for x in range(len(pre)):
        writer.writerow(pre[x:x+1])

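Judging from the test results, this did not work very well.

That was when we remembered that we had not cleaned the data, and all the previous work had to be redone from scratch. In the meantime we also researched the 72 features and the causes of blade cracking faults. We found no feature in the literature that correlates well with blade cracking, and our transformations of the features did not separate the classes well either. We later consulted engineers in the industry, who told us that none of these features is strongly correlated with the fault. So we decided to try multiplying the features pairwise to see what that would give.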

# The 2844 cross-feature columns are the squares and pairwise products of the
# features; we then take their mean and variance. The last three features do not
# get cross features among themselves.

import os
import csv

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

label_file = r"F:\User\Xinyuan Huang\train_labels.csv"
fan_folder = r"F:\User\Xinyuan Huang"
read_label = pd.read_csv(label_file)

cross_var = r"F:\User\Xinyuan Huang\CaiJi\Feature_crosses\cross_var.csv"

with open(cross_var, "a", newline='', encoding="utf-8") as out_file:
    writer = csv.writer(out_file)

    # This loop locates the file to open for each row of the label table
    for x in range(len(read_label) - 1):
        # f_id, zero-padded to three digits (for example 2 becomes 002)
        f_id = str(read_label["f_id"].iloc[x]).zfill(3)
        file_name = str(read_label["file_name"].iloc[x])
        label = str(read_label["ret"].iloc[x])

        fan_file = os.path.join(fan_folder, "train", f_id, file_name)
        print("Progress:", x / (len(read_label) - 1))  # show how far along we are

        # Open the corresponding fan_file and read it
        with open(fan_file, encoding='utf-8') as f:
            dataset = pd.read_csv(f)

        poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
        X_ploly = poly.fit_transform(dataset)
        data_ploly = pd.DataFrame(X_ploly, columns=poly.get_feature_names())

        # Skip the 75 original features and the last 6 columns
        new_data = data_ploly.iloc[:, 75:-6]

        # ploly_mean and ploly_var are the per-column mean and variance of the cross features
        ploly_mean = new_data.mean(axis=0)
        ploly_var = new_data.var(axis=0, ddof=0)

        ploly_var = list(ploly_var)
        ploly_var.append(label)

        writer.writerow(ploly_var)

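After the cross multiplication each file has thousands of features, and the generated file is nearly 2 GB. With a server of modest performance, the computation would have taken forever, so we gave up on crossing all features without any selection and feeding the result straight into the algorithms.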
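The figure of 2844 cross-feature columns quoted in the script can be reproduced by counting, assuming all 75 raw columns are expanded to degree 2 and the six terms built only from the last three (empty) features are dropped. The sketch uses itertools rather than scikit-learn, so it only checks the arithmetic.

```python
from itertools import combinations_with_replacement

n_features = 75
# Degree-2 terms of a polynomial expansion: every square and every pairwise product.
degree_two_terms = list(combinations_with_replacement(range(n_features), 2))
print(len(degree_two_terms))

# Terms that only involve the last three (empty) features: 3 squares + 3 products.
last_three = set(range(n_features - 3, n_features))
only_last_three = [pair for pair in degree_two_terms if set(pair) <= last_three]
print(len(only_last_three))

print(len(degree_two_terms) - len(only_last_three))
```

With roughly 450 rows per sample and tens of thousands of samples, a few thousand columns per row is consistent with the file size and run time problems described above.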

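Later, Engineer Yang of 阜特科技 helped us pick out the more important features. On that basis we did some feature crossing and importance ranking, which narrowed the set down to a few hundred features (including crosses, means, variances, and so on, after importance ranking), and we used those to train the model.

Some of the features are discrete, and we handled those separately by discretizing them. For example, the yaw request value takes three values: 1, 2, and 3. After discretization, this one feature becomes three features, and for each sample we compute the frequency with which each of the three values occurs.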
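The per-sample frequency encoding can be sketched with collections.Counter. The readings below are invented, and value_frequencies is a hypothetical helper rather than the code used in the competition.

```python
from collections import Counter

def value_frequencies(values, categories):
    """Fraction of records taking each category value, in the given order."""
    counts = Counter(values)
    total = len(values)
    return [counts[category] / total for category in categories]

# Hypothetical yaw request readings for one sample.
yaw_request = [1, 1, 2, 3, 1, 2, 1, 1]
freqs = value_frequencies(yaw_request, categories=(1, 2, 3))
print(freqs)
```

The three fractions sum to 1 as long as every reading falls in one of the listed categories.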

import os
import csv

import numpy as np
import pandas as pd

label_file = r"E:\8_19\testX_csv"
train_mean = r"E:\8_19\disperse\discrete56.csv"

with open(train_mean, "a", newline='', encoding="utf-8") as out_file:
    writer = csv.writer(out_file)

    for x in range(1, 451):
        fan_file = os.path.join(label_file, str(x) + ".csv")
        # print("Progress:", x / 451)  # uncomment to show progress

        with open(fan_file, encoding='utf-8') as f:
            feature_read = pd.read_csv(f, header=None)

        num1 = 0
        num2 = 0
        num3 = 0

        # Count how often each value occurs in column 55
        for row in range(len(feature_read)):
            if feature_read[55][row] == 0:
                num1 = num1 + 1
            if feature_read[55][row] == 1:
                num2 = num2 + 1
            if feature_read[55][row] == 2:
                num3 = num3 + 1

        # Turn the counts into frequencies
        num1 = num1 / len(feature_read)
        num2 = num2 / len(feature_read)
        num3 = num3 / len(feature_read)

        a = [num1, num2, num3]
        writer.writerow(a)

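Algorithms

The main algorithm we ended up using was XGBoost. We also tried LightGBM along the way, but for lack of computing power we could not try some other approaches (including the SVM and deep learning ideas Engineer Yang suggested). In the end we mostly tuned XGBoost. Tuning all parameters together was beyond our computing power, so we tuned parameters one at a time and in pairs.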

from xgboost import XGBClassifier
import xgboost as xgb

import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import log_loss
from sklearn.preprocessing import MinMaxScaler


# Load the data
data = pd.read_csv(r'D:\next\8_19\train_data.csv', encoding="gbk")
label = pd.read_csv(r"D:\next\8_19\train_label.csv")
test = pd.read_csv(r"D:\next\8_19\test_data.csv", encoding="gbk")
train = data.iloc[:, :-1]
label = label.iloc[:, -1]

# Scale the features
scaler = MinMaxScaler()
train = scaler.fit(train).transform(train)
test = scaler.transform(test)  # reuse the scaling fitted on the training set

X_train = train
y_train = label

# Cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

param_test1 = {
    'max_depth': list(range(3, 10, 1)),
    'min_child_weight': list(range(1, 6, 1))
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=400, max_depth=5,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(X_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Once this parameter is tuned, we take the best result and carry it into the tuning of the next parameter:

param_test1 = {
    'learning_rate': [i / 100.0 for i in range(6, 14, 2)]
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=400, max_depth=6,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=6, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=-1, iid=False, cv=5)
gsearch1.fit(X_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

param_test1 = {
    'subsample': [0.8, 0.9]
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=310, max_depth=6,
                                                min_child_weight=1, gamma=0, subsample=0.9, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=6, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=-1, iid=False, cv=5)
gsearch1.fit(X_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
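The one-parameter-at-a-time procedure amounts to a greedy, coordinate-wise search. The sketch below shows the idea in plain Python with a made-up score function standing in for a cross-validated metric; tune_one_at_a_time and toy_score are hypothetical and do not call XGBoost or scikit-learn.

```python
def tune_one_at_a_time(score, grid, start):
    """Greedy search: tune each parameter in turn, keeping earlier winners fixed."""
    best = dict(start)
    for name, candidates in grid.items():
        # Try every candidate for this parameter with the others held fixed.
        best[name] = max(candidates, key=lambda value: score({**best, name: value}))
    return best

# Hypothetical score standing in for a cross-validated metric.
def toy_score(params):
    return -(params["max_depth"] - 6) ** 2 - (params["learning_rate"] - 0.1) ** 2

grid = {"max_depth": range(3, 10), "learning_rate": [0.06, 0.08, 0.10, 0.12]}
best = tune_one_at_a_time(toy_score, grid, start={"max_depth": 5, "learning_rate": 0.1})
print(best)
```

This costs the sum of the grid sizes instead of their product, at the price of possibly missing combinations where parameters interact.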

Several more parameters can be tuned in the same way. The final predictions are produced as follows:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(test)

params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 1,
    'seed': 27,
    'nthread': 6,
    'learning_rate': 0.1,
    'n_estimators': 292,
    'gamma': 0,
    'scale_pos_weight': 1}

watchlist = [(dtrain, 'train')]

bst = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist)

ypred = bst.predict(dtest)

import csv

test_label = r"D:\next\8_20\test_label_new.csv"
with open(test_label, "a", newline='', encoding="utf-8") as f:
    writer = csv.writer(f)

    # Threshold the predicted probability at 0.5
    for x in range(len(ypred)):
        if ypred[x] < 0.5:
            writer.writerow([0])
        else:
            writer.writerow([1])

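Even tuning parameters one at a time and in pairs was too slow for us, so we also tried Hyperopt. Put simply, the way we used it amounts to rolling dice: within a bounded parameter region we roll at random, and whatever value each parameter comes up with is what we train the model with. At the end it returns the best roll. This approach has serious limitations: the results are random, and it easily settles on a local optimum. Still, as a rough check of whether a constructed feature is any good, it is not a bad method.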
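The dice-rolling idea can be illustrated with a plain random search in standard-library Python. This is a simplification: it does not use Hyperopt, whose TPE algorithm is more elaborate than uniform sampling, and toy_loss is a made-up stand-in for a negated cross-validated score.

```python
import random

def random_search(objective, space, max_evals, seed=0):
    """Sample parameters uniformly at random and keep the lowest-loss draw."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    history = []
    for _ in range(max_evals):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = objective(params)
        history.append(loss)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss, history

# Hypothetical loss standing in for a negated cross-validated F1 score.
def toy_loss(params):
    return (params["max_depth"] - 7) ** 2 + abs(params["subsample"] - 0.8)

space = {"max_depth": list(range(5, 20)), "subsample": [0.7, 0.8, 0.9]}
best_params, best_loss, history = random_search(toy_loss, space, max_evals=50)
print(best_params, best_loss)
```

Changing the seed changes which points are drawn, which is the randomness of results mentioned above.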

# -*- coding: utf-8 -*-
"""
Created on Fri May 18 14:09:06 2018

"""

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
from random import shuffle
from xgboost.sklearn import XGBClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
import pickle
import time
from hyperopt import fmin, tpe, hp,space_eval,rand,Trials,partial,STATUS_OK
import random

data = pd.read_csv(r'D:\next\select_data\new_feature.csv', encoding = "gbk").values
label = pd.read_csv(r'D:\next\select_data\new_label.csv').values
labels = label.reshape((1,-1))
label = labels.tolist()[0]

minmaxscaler = MinMaxScaler()
attrs = minmaxscaler.fit_transform(data)

# Shuffle the row indices before the 70/30 split. The original shuffled the
# `label` list instead, which is never used again, so the split was not random.
index = list(range(0,len(label)))
random.shuffle(index)
trainIndex = index[:int(len(label)*0.7)]
print (len(trainIndex))
testIndex = index[int(len(label)*0.7):]
print (len(testIndex))
attr_train = attrs[trainIndex,:]
print (attr_train.shape)
attr_test = attrs[testIndex,:]
print (attr_test.shape)
label_train = labels[:,trainIndex].tolist()[0]
print (len(label_train))
label_test = labels[:,testIndex].tolist()[0]
print (len(label_test))
print (np.mat(label_train).reshape((-1,1)).shape)


def GBM(argsDict):
    # Map the integer draws from the search space onto real parameter values
    max_depth = argsDict["max_depth"] + 5
    # n_estimators = argsDict['n_estimators'] * 5 + 50
    n_estimators = 627
    learning_rate = argsDict["learning_rate"] * 0.02 + 0.05
    subsample = argsDict["subsample"] * 0.1 + 0.7
    min_child_weight = argsDict["min_child_weight"]+1

    print ("max_depth:" + str(max_depth))
    print ("n_estimator:" + str(n_estimators))
    print ("learning_rate:" + str(learning_rate))
    print ("subsample:" + str(subsample))
    print ("min_child_weight:" + str(min_child_weight))

    global attr_train,label_train

    gbm = xgb.XGBClassifier(nthread=6, # number of threads
                            max_depth=max_depth, # maximum tree depth
                            n_estimators=n_estimators, # number of trees
                            learning_rate=learning_rate, # learning rate
                            subsample=subsample, # fraction of rows sampled per tree
                            min_child_weight=min_child_weight, # minimum sum of instance weight in a child

                            max_delta_step = 50, # caps each leaf's weight update; this is not early stopping
                            objective="binary:logistic")

    metric = cross_val_score(gbm,attr_train,label_train,cv=3, scoring="f1", n_jobs = -1).mean()
    print (metric)
    # fmin minimizes, so return the negated F1 score
    return -metric

space = {"max_depth":hp.randint("max_depth",15),
         "n_estimators":hp.quniform("n_estimators",100,1000,1), #[0,1,2,3,4,5] -> [50,]
         #"learning_rate":hp.quniform("learning_rate",0.01,0.2,0.01), #[0,1,2,3,4,5] -> 0.05,0.06
         #"subsample":hp.quniform("subsample",0.5,1,0.1),#[0,1,2,3] -> [0.7,0.8,0.9,1.0]
         #"min_child_weight":hp.quniform("min_child_weight",1,6,1), #

         #"max_depth":hp.randint("max_depth",15),
         # "n_estimators":hp.randint("n_estimators",10), #[0,1,2,3,4,5] -> [50,]
         "learning_rate":hp.randint("learning_rate",6), #[0,1,2,3,4,5] -> 0.05,0.06
         "subsample":hp.randint("subsample",3),#[0,1,2,3] -> [0.7,0.8,0.9,1.0]
         "min_child_weight":hp.randint("min_child_weight",2)

        }
algo = partial(tpe.suggest,n_startup_jobs=1)
best = fmin(GBM,space,algo=algo,max_evals=50) # max_evals is the maximum number of models to train; the larger it is, the easier it is to find the optimum

print (best)
print (GBM(best))
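To make the dice-rolling idea concrete, here is a minimal pure-Python sketch of random search over a small discrete space. It is only an illustration: it samples uniformly at random rather than using the TPE algorithm the script above selects, and `random_search`, `toy_space` and `toy_objective` are hypothetical names. The toy objective stands in for the cross-validated score, and the candidate values mirror how the script maps integer draws to parameters (for example 0..5 to a learning rate of 0.05..0.15).

```python
import random


def random_search(objective, space, max_evals=50, seed=0):
    """Sample parameter sets at random and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        # One roll of the dice per parameter.
        params = {name: rng.choice(choices) for name, choices in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Hypothetical search space: each parameter has a list of candidate values.
toy_space = {
    "max_depth": list(range(5, 20)),
    "learning_rate": [0.05, 0.07, 0.09, 0.11, 0.13, 0.15],
}


def toy_objective(params):
    """Stand-in for the negated F1 score; smaller is better."""
    return (params["max_depth"] - 8) ** 2 + abs(params["learning_rate"] - 0.09)


best_params, best_loss = random_search(toy_objective, toy_space, max_evals=200)
print(best_params, best_loss)
```

Because every roll is independent, the result depends on the seed and on `max_evals`, which is the randomness the text warns about.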

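Final results

We first split the data into categories. Records with missing values were simply labeled 1 (abnormal); with missing values there is little to do but guess. While analyzing the distribution of the training samples we also found some threshold-type features: once such a feature rises above or falls below a certain value, faulty turbines clearly outnumber normal ones. That would let us assign labels directly from thresholds and leave only the undecided samples to the algorithm. In the end time ran short and the threshold part was never done: after setting aside the records with missing values, everything else was judged by the algorithm.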
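The routing described here (missing values first, then threshold rules, then the model) can be sketched as a small function. This is an assumption-laden illustration, not the team's code: the feature name `gearbox_temp`, the cutoff of 80.0, and the stub model are all invented, and the article says the threshold step was planned but never implemented.

```python
def label_sample(features, model_predict, threshold_rules=()):
    """Return 1 (abnormal) or 0 (normal) for one sample.

    features: dict of feature name -> value, with None marking a missing value.
    model_predict: callable used when no rule decides the sample.
    threshold_rules: iterable of (feature name, predicate) pairs.
    """
    # Rule 1: any missing value means the sample is labeled abnormal.
    if any(value is None for value in features.values()):
        return 1
    # Rule 2: a threshold rule that fires also labels the sample abnormal.
    for name, is_faulty in threshold_rules:
        if is_faulty(features[name]):
            return 1
    # Otherwise, defer to the trained model.
    return model_predict(features)


# Hypothetical rule and stub model, for illustration only.
example_rules = [("gearbox_temp", lambda value: value > 80.0)]


def stub_model(features):
    return 0


labels = [
    label_sample({"gearbox_temp": None, "hub_speed": 10.0}, stub_model, example_rules),
    label_sample({"gearbox_temp": 95.0, "hub_speed": 10.0}, stub_model, example_rules),
    label_sample({"gearbox_temp": 60.0, "hub_speed": 10.0}, stub_model, example_rules),
]
```

With an empty `threshold_rules`, the function reduces to what was actually submitted: missing values get 1 and everything else goes to the model.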

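Engineer Yang told us that the analysis should be restricted to data where the hub rotation speed (“輪轂轉速”) is greater than 3, because anomalies can only be detected while a turbine is running; when it is idle they are very hard to judge. Because time was tight, we did not do this for the training set, and so the test set was not split that way either. Everything went into the algorithm together.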
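The filter Engineer Yang suggested is easy to express. A minimal sketch, assuming each record is a dict and that the hub rotation speed is stored under a key called `hub_speed` (a hypothetical name; the real column name is not given):

```python
def working_rows(rows, speed_key="hub_speed", min_speed=3.0):
    """Keep only the rows recorded while the turbine was running."""
    return [
        row for row in rows
        if row.get(speed_key) is not None and row[speed_key] > min_speed
    ]


# Made-up records: only the third one has a hub speed above 3.
sample_rows = [
    {"id": 1, "hub_speed": 0.0},
    {"id": 2, "hub_speed": 3.0},
    {"id": 3, "hub_speed": 12.5},
    {"id": 4, "hub_speed": None},
]
running = working_rows(sample_rows)
```

The same filter would have to be applied to both the training set and the test set for the model to see consistent data.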

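To sum up: we crossed the features and ranked them by importance, weighed the features Engineer Yang called important against the importance ranking reported by the algorithm, and finally produced a feature file with a little over a hundred features for training (the training samples are those left after data cleaning).

The test set is split into two parts: records with missing values are labeled 1 directly, and the rest are fed to the algorithm for a prediction.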

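Summary

The biggest pitfall was that we did no data cleaning at the start, and once we noticed, we had to redo everything. Later, Engineer Yang told us we should analyze the working turbines and train on those. Doing so would have meant starting over yet again, and by then time was so tight that we had the will but not the means. For a competition, or for data analysis work in general, understanding the data comes first; otherwise you are likely to do a lot of wasted work. Sometimes our own background limits how well we can understand the data and the business scenario, and that is the time to ask domain experts. The analysis goes much better once that groundwork is done.

Second, improve your coding and algorithm skills. You need to understand the algorithms and also be able to turn out good code; only then does efficiency improve. I write code too slowly, which badly limits how fast I can try out ideas, and I do not understand the algorithms well enough to take a real part in the algorithm discussions.

Finally, version control matters a great deal. We implemented new ideas every day, and the files were numerous and messy. We would often send a file and, a while later, be unable to find it or forget whether we had sent it at all.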

Statement: this article was contributed by the WeChat official account 經管人學數據分析; copyright belongs to the contributor.


Author | 徐麟 (Xu Lin)

Editor | 胡巍巍 (Hu Weiwei)

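Preface

When Bilibili (B站) comes up, the first thing many people think of is anime culture (二次元) or 鬼畜 (guichu) remix videos.

Last month I also published an article on Bilibili's guichu videos: 大數據解讀B站火過蔡徐坤的“鬼畜”區巨頭們 (“Big data on the giants of Bilibili's guichu section who are hotter than Cai Xukun”).

In fact, Bilibili is a remarkable site. Its content covers practically everything, its lively bullet-comment (彈幕, danmaku) culture adds a great deal to the viewing experience, and it has gradually become a go-to tool for studying.

Bilibili recently received a strong endorsement from CCTV.com (央視網), which reported that it has become a place where more and more young people study. As the saying goes, “I watch anime on Bilibili while you study there.” Today we crawl the programming videos on Bilibili that rank highest by play count and danmaku count, and take a look at this other side of the site.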

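Data source

Our data comes mainly from the video list, and the details attached to each video, that Bilibili returns when “編程” (“programming”) is typed into its search box.

Bilibili offers five ways of ordering videos, each returning at most the top 1,000 videos. We crawled the 1,000 videos under each of the five orderings, sorted the combined 5,000 entries, and ended up with information on a little over 2,000 programming videos.

We also added some filters so that the programming tutorials we kept are more representative: (a) the video belongs to the technology category, and (b) the video is longer than 60 minutes. Part of the code follows: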

import json

import requests
from bs4 import BeautifulSoup

## Fetch one page of the search results
def get_list(i,j):
    attempts = 0
    success = False
    while attempts < 5 and not success:
        try:
            url = 'https://search.bilibili.com/all?keyword=%E7%BC%96%E7%A8%8B&from_source=banner_search&order={}&duration=4&tids_1=36&page={}'.format(i,j+1)
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win32; x32; rv:54.0) Gecko/20100101 Firefox/54.0',
                      'Connection': 'keep-alive'}
            cookies ='v=3; iuuid=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; webp=true; ci=1%2C%E5%8C%97%E4%BA%AC; __guid=26581345.3954606544145667000.1530879049181.8303; _lxsdk_cuid=1646f808301c8-0a4e19f5421593-5d4e211f-100200-1646f808302c8; _lxsdk=1A6E888B4A4B29B16FBA1299108DBE9CDCB327A9713C232B36E4DB4FF222CF03; monitor_count=1; _lxsdk_s=16472ee89ec-de2-f91-ed0%7C%7C5; __mta=189118996.1530879050545.1530936763555.1530937843742.18'
            cookie = {}
            for line in cookies.split(';'):
                # Parse each "name=value" pair (the original split the whole cookie string here)
                name, value = line.strip().split('=', 1)
                cookie[name] = value
            html = requests.get(url,cookies=cookie, headers=header).content
            bsObj = BeautifulSoup(html.decode('utf-8'),"html.parser")
            # The page embeds its data as JSON in the fourth <script> tag
            script = bsObj.find_all('script')[3].text
            info = json.loads(script.replace('window.__INITIAL_STATE__=','').split(';(function()')[0])['allData']['video']
            return info
        except Exception:
            # Retry on any failure, up to five attempts in total
            attempts = attempts+1
    return []
coding_all = []
# Search orderings to crawl, 50 pages each (renamed from `type`, which shadows the builtin)
order_types = ['click','stow','dm']
for i in order_types:
    for j in range(50):
        this_coding = get_list(i,j)
        coding_all = coding_all+this_coding
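The same video can show up under more than one ordering, so the combined list contains repeats. The article does not show how the crawled entries were reduced to a little over 2,000; one plausible step, sketched here purely as an assumption, is to drop duplicates by video URL. The `arcurl` field name is taken from the grouping code later in the article, while `merge_unique` and the sample records are hypothetical.

```python
def merge_unique(video_lists, key="arcurl"):
    """Concatenate several result lists, keeping the first copy of each video."""
    seen = set()
    merged = []
    for videos in video_lists:
        for video in videos:
            if video[key] in seen:
                continue
            seen.add(video[key])
            merged.append(video)
    return merged


# Made-up results for two orderings that share one video.
by_click = [{"arcurl": "u1", "title": "A"}, {"arcurl": "u2", "title": "B"}]
by_danmu = [{"arcurl": "u2", "title": "B"}, {"arcurl": "u3", "title": "C"}]
unique_videos = merge_unique([by_click, by_danmu])
```

A set of seen keys keeps the merge linear in the number of entries and preserves the order in which videos were first encountered.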

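In the end we obtained the list of video information.

Data analysis

With the data in hand, the first thing we looked at was what these videos are mainly about, drawing a word cloud that summarizes the overall content from the tags attached to each video.

Besides programming languages and technologies, that word cloud contains many generic descriptive words such as “學習” (study) and “教程” (tutorial). We need to filter further and keep only the words related to programming languages and technologies, so that the word cloud is more useful.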
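One way to do this filtering is to keep only tags that appear in a hand-made lookup of languages and technologies, then count them. The sketch below is an assumption about the approach rather than the author's code; `count_tech_tags`, the lookup table and the sample tags are invented for illustration.

```python
from collections import Counter


def count_tech_tags(tag_lists, tag_types):
    """Count only the tags that appear in the tag_types lookup."""
    counts = Counter()
    for tags in tag_lists:
        for tag in tags:
            normalized = tag.strip().lower()
            if normalized in tag_types:
                counts[normalized] += 1
    return counts


# Hypothetical lookup: tag -> kind of term.
example_tag_types = {
    "python": "language",
    "java": "language",
    "machine learning": "technology",
}
# Made-up tag lists for three videos, including generic words to drop.
example_tags = [
    ["Python", "教程", "學習"],
    ["Java", "學習"],
    ["python", "Machine Learning"],
]
tech_counts = count_tech_tags(example_tags, example_tag_types)
```

Lower-casing before the lookup means “Python” and “python” are counted as the same tag, and the resulting counts can be fed to whatever word-cloud tool is in use.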

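The filtered word cloud works much better. It covers most of today's popular programming languages, such as Java and Python, along with technical topics such as data structures and machine learning. Next, let us compare play counts and danmaku counts across programming languages.

For this comparison we counted Linux as a language as well. The picture is essentially a three-way standoff among Python, C, and Java, with Python slightly ahead of the other two, which to some extent reflects the overall trend today. Bilibili's content evidently keeps up with the times and suits young people who want to learn where programming is heading.

Having looked at languages, we turn to the ranking of specific technologies.

Front-end development, artificial intelligence, databases, web crawlers, and other technologies that people care about and that companies hire for heavily all appear in the ranking. Carefully studying the Bilibili videos for the field you plan to work in can bring real improvement. Part of the code follows: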

## Group statistics
## (`coding`, `tag_dict` and `dataframe_explode` come from parts of the author's script that are not shown)
coding_tag = dataframe_explode(coding,'tag')
coding_tag['tag'] = coding_tag['tag'].apply(str.lower)
coding_tag['type'] = coding_tag['tag'].map({tag_dict['tag'][k]:tag_dict['type'][k] for k in range(tag_dict.shape[0])})
coding_tag = coding_tag.groupby(['title','pic','author','arcurl','tag','type'],as_index=False).agg({'play':'max','danmu':'max','favorites':'max','review':'max'})
tag_count = coding_tag.groupby(['tag','type'],as_index=False).agg({'title':['count'],'play':['sum'],'danmu':['sum'],'favorites':['sum']})
tag_count.columns = ['tag','type','num','play','danmu','favorites']
## Draw the chart
## (Bar is presumably pyecharts' bar chart in its older 0.5.x API; the import is not shown in the original)
## '語言' is the "language" type label; sort without inplace to avoid modifying a filtered copy
coding_stat = tag_count[tag_count['type']=='語言'].sort_values('play',ascending=False)
attr = coding_stat['tag'][0:10]
v1 = coding_stat['play'][0:10]
bar = Bar("語言類播放量TOP10")
bar.add("播放數量", attr, v1, is_stack=True, xaxis_rotate=30,xaxis_label_textsize=18,
        xaxis_interval =0,is_splitline_show=False,label_text_size=12,is_label_show=True)
bar.render('語言類播放量TOP10.html')
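The snippet above calls `dataframe_explode`, whose definition is not shown. Presumably it turns one row per video into one row per (video, tag) pair so that plays can be summed per tag. The pure-Python sketch below illustrates that idea under the assumption that tags arrive as a comma-separated string; `explode_rows`, `sum_by` and the sample records are hypothetical and are not the author's helper.

```python
from collections import defaultdict


def explode_rows(rows, column, sep=","):
    """Return one copy of each row per item in its delimited `column`."""
    exploded = []
    for row in rows:
        for item in row[column].split(sep):
            item = item.strip()
            if item:
                exploded.append({**row, column: item})
    return exploded


def sum_by(rows, group_key, value_key):
    """Sum `value_key` over rows that share the same `group_key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Made-up videos: the first one carries two tags.
example_videos = [
    {"title": "A", "tag": "python,爬蟲", "play": 100},
    {"title": "B", "tag": "python", "play": 50},
]
exploded = explode_rows(example_videos, "tag")
play_by_tag = sum_by(exploded, "tag", "play")
```

A video with several tags contributes its full play count to each of them, which matches how the grouped sums in the snippet above behave.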

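Top videos

Having analyzed how video content is distributed overall, we now look at the very best videos. Since danmaku culture is Bilibili's hallmark, we use danmaku count to pick out some excellent videos for you. First, the top 20 across all programming videos.

Below are the ten videos with the highest danmaku counts for each of the three front-runners: Python, Java, and C.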
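Picking the top N videos by danmaku count does not require sorting the whole list. A minimal sketch using the standard library follows; the `danmu` and `title` field names follow the article's code, while `top_by_danmu` and the sample numbers are made up.

```python
import heapq


def top_by_danmu(videos, n=10):
    """Return the n videos with the most danmaku, highest first."""
    return heapq.nlargest(n, videos, key=lambda video: video["danmu"])


# Made-up danmaku counts for four videos.
example_videos = [
    {"title": "A", "danmu": 120},
    {"title": "B", "danmu": 980},
    {"title": "C", "danmu": 45},
    {"title": "D", "danmu": 300},
]
top_two = top_by_danmu(example_videos, n=2)
```

For the per-language lists, the same function can be applied to the subset of videos tagged with each language.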

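Closing remarks

Bilibili's uploaders (“阿婆主”, that is, UP主) provide a great many programming learning resources. While learning from them, please also pay attention to the relevant copyright.

Make sure there are no copyright problems before uploading a video. And if you find infringing content, report it to the video's author promptly so that the infringing video can be taken down.

I also hope everyone will give more support to technical videos and their uploaders. If you like a video, do not be stingy with your coins, so that more technical uploaders are motivated to produce more and better content.

About the author: 徐麟 (Xu Lin), data analyst at an internet company and Columbia University statistics graduate, likes using R and Python to play with unusual data. Personal WeChat official account: 數據森麟 (ID: shujusenlin).

Statement: this article was contributed by the author; copyright belongs to the author.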
