How does the CPU dispatcher work?

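The NumPy dispatcher is based on multi-source compiling: a given source is compiled multiple times with different compiler flags and with different C definitions that affect the code paths. This enables specific instruction sets for each compiled object, depending on the required optimizations, and the process ends by linking the resulting objects together.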

../../_images/opt-infra.png

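This mechanism should support all compilers and does not require any compiler-specific extension, but it adds a few steps to normal compilation, which are explained below.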
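The idea of "one source, several compiled objects, linked together" can be imitated inside a single file with the preprocessor. The following is a minimal sketch under stated assumptions, not NumPy's build logic: the hypothetical macro `DEFINE_SUM_KERNEL` stands in for one source, each expansion stands in for one compiled object, the `LANES` parameter stands in for a C definition that changes the code path, and the lookup table stands in for the link step. All names here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for "one source": the body is written once and
 * stamped out per target. LANES plays the role of a C definition that
 * alters the code path (here, how many partial sums are kept). */
#define DEFINE_SUM_KERNEL(TARGET, LANES)                                  \
    static long sum_##TARGET(const int *src, size_t len)                  \
    {                                                                     \
        long acc[LANES] = {0};                                            \
        size_t i = 0;                                                     \
        for (; i + (size_t)(LANES) <= len; i += (size_t)(LANES)) {        \
            for (size_t j = 0; j < (size_t)(LANES); ++j) {                \
                acc[j] += src[i + j];                                     \
            }                                                             \
        }                                                                 \
        long total = 0;                                                   \
        for (size_t j = 0; j < (size_t)(LANES); ++j) {                    \
            total += acc[j];                                              \
        }                                                                 \
        for (; i < len; ++i) {                                            \
            total += src[i]; /* leftover elements */                      \
        }                                                                 \
        return total;                                                     \
    }

/* Three "compilations" of the same body with different definitions. */
DEFINE_SUM_KERNEL(baseline, 1)
DEFINE_SUM_KERNEL(SSE41, 4)
DEFINE_SUM_KERNEL(AVX512F, 16)

typedef long (*sum_fn)(const int *, size_t);

struct variant {
    const char *target;
    sum_fn fn;
};

/* Stand-in for the link step: every variant ends up in one binary. */
static const struct variant variants[] = {
    {"baseline", sum_baseline},
    {"SSE41", sum_SSE41},
    {"AVX512F", sum_AVX512F},
};

/* Returns the variant built for `target`, or NULL if none was built. */
static sum_fn find_variant(const char *target)
{
    assert(target != NULL);
    for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); ++i) {
        if (strcmp(variants[i].target, target) == 0) {
            return variants[i].fn;
        }
    }
    return NULL;
}
```

Every variant computes the same result; only the internal code path differs, which is the property the real mechanism relies on when it picks one object at runtime.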

1- Configuration

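Before the source files are built, the user configures the required optimizations through the two command arguments explained above:

  • --cpu-baseline: the minimal set of required optimizations.

  • --cpu-dispatch: the dispatched set of additional optimizations.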

2- Discovering the environment

In this step, we check the compiler and the platform architecture, and cache some of the intermediate results to speed up rebuilds.

3- Validating the requested optimizations

The requested optimizations are tested against the compiler to determine which of them the compiler can actually support.
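One common way to test an optimization against a compiler is to build a tiny probe that uses an intrinsic from that instruction set. The sketch below is an assumption about what such a probe could look like for SSE4.1; the source does not show NumPy's actual probes, and the function name `probe_sse41` is hypothetical. In a real probe, a compile failure is the signal; here the result is reported as a return value instead, so the file builds everywhere.

```c
#include <assert.h>

#if defined(__SSE4_1__)
  #include <smmintrin.h> /* SSE4.1 intrinsics */
#endif

/* Returns 1 when the compiler was asked to target SSE4.1 (for example
 * via -msse4.1) and the intrinsic below compiled and ran, 0 otherwise. */
static int probe_sse41(void)
{
#if defined(__SSE4_1__)
    /* _mm_set_epi32 takes lanes from highest to lowest. */
    __m128i a = _mm_set_epi32(4, -3, 2, -1);
    __m128i b = _mm_set_epi32(-4, 3, -2, 1);
    __m128i m = _mm_max_epi32(a, b); /* SSE4.1-only instruction */
    int out[4];
    _mm_storeu_si128((__m128i *)out, m);
    assert(out[0] == 1 && out[3] == 4);
    return 1;
#else
    return 0;
#endif
}
```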

4- Generating the main configuration header

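The generated header _cpu_dispatch.h contains the definitions and header includes of all instruction sets for the required optimizations that were validated in the previous step.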

It also contains extra C definitions that are used to define NumPy's Python-level module attributes __cpu_baseline__ and __cpu_dispatch__.

What is in this header?

The example header below was dynamically generated by gcc on an X86 machine. The compiler supports --cpu-baseline="sse sse2 sse3" and --cpu-dispatch="ssse3 sse41", and the result is as follows.

// The header should be located at numpy/numpy/_core/src/common/_cpu_dispatch.h
/**NOTE
 ** C definitions prefixed with "NPY_HAVE_" represent
 ** the required optimizations.
 **
 ** C definitions prefixed with 'NPY__CPU_TARGET_' are protected and
 ** shouldn't be used by any NumPy C sources.
 */
/******* baseline features *******/
/** SSE **/
#define NPY_HAVE_SSE 1
#include <xmmintrin.h>
/** SSE2 **/
#define NPY_HAVE_SSE2 1
#include <emmintrin.h>
/** SSE3 **/
#define NPY_HAVE_SSE3 1
#include <pmmintrin.h>

/******* dispatch-able features *******/
#ifdef NPY__CPU_TARGET_SSSE3
  /** SSSE3 **/
  #define NPY_HAVE_SSSE3 1
  #include <tmmintrin.h>
#endif
#ifdef NPY__CPU_TARGET_SSE41
  /** SSE41 **/
  #define NPY_HAVE_SSE41 1
  #include <smmintrin.h>
#endif

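Baseline features are the minimal set of required optimizations configured via --cpu-baseline. They have no preprocessor guards and are always on, which means they can be used in any source.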
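Because baseline features are unguarded, ordinary sources can test the NPY_HAVE_ definitions directly. The following is a minimal sketch, assuming SSE2 is part of the baseline. It does not include the real _cpu_dispatch.h; a small stand-in block derives `NPY_HAVE_SSE2` from the compiler's own `__SSE2__` macro so the file compiles on its own, and the function name `sum_f64` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for including _cpu_dispatch.h (assumption for this sketch):
 * the real header defines NPY_HAVE_SSE2 and includes <emmintrin.h>
 * when SSE2 belongs to the baseline. */
#if defined(__SSE2__) && !defined(NPY_HAVE_SSE2)
  #define NPY_HAVE_SSE2 1
  #include <emmintrin.h>
#endif

/* Sums `len` doubles, using SSE2 when the baseline provides it. */
static double sum_f64(const double *src, size_t len)
{
    assert(src != NULL || len == 0);
    size_t i = 0;
    double total = 0.0;
#ifdef NPY_HAVE_SSE2
    /* Two doubles per step; unaligned loads keep the sketch simple. */
    __m128d acc = _mm_setzero_pd();
    for (; i + 2 <= len; i += 2) {
        acc = _mm_add_pd(acc, _mm_loadu_pd(src + i));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    /* Scalar path: the whole array without SSE2, or the tail with it. */
    for (; i < len; ++i) {
        total += src[i];
    }
    return total;
}
```

For example, `sum_f64` applied to the five values 1.0 through 5.0 returns 15.0 on either code path.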

Does this mean NumPy's infrastructure passes the compiler flags of the baseline features to all sources?

Definitely, yes. But the dispatch-able sources are treated differently.

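What if the user specifies certain baseline features during the build, but at runtime the machine does not support even those features? Will the compiled code be called via one of these definitions, or might the compiler itself have auto-generated or vectorized certain pieces of code based on the provided command-line compiler flags?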

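During the loading of the NumPy module, a validation step detects this situation and raises a Python runtime error to inform the user. This prevents the CPU from reaching an illegal instruction, which would otherwise cause a segfault.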
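A load-time check of this kind can be sketched as follows. This is an illustration only, not NumPy's implementation: it assumes a baseline of SSE, SSE2 and SSE3, relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86, and simply assumes success on other compilers and architectures. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 when the running CPU reports every feature of the assumed
 * baseline (SSE, SSE2, SSE3), 0 otherwise. */
static int baseline_is_supported(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") &&
           __builtin_cpu_supports("sse2") &&
           __builtin_cpu_supports("sse3");
#else
    /* No portable check in this sketch; assume the baseline is fine. */
    return 1;
#endif
}

/* Mimics the load-time step: returns 0 on success, or -1 after writing
 * an error message, so no baseline instruction is ever executed on a
 * machine that lacks it. */
static int check_baseline_or_report(FILE *err)
{
    assert(err != NULL);
    if (baseline_is_supported()) {
        return 0;
    }
    fprintf(err, "RuntimeError: this build requires baseline CPU features "
                 "(SSE SSE2 SSE3) that the running machine does not support\n");
    return -1;
}
```

The check has to run before any code compiled with the baseline flags, which is why the source places it at module load.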

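Dispatch-able features are the dispatched set of additional optimizations configured via --cpu-dispatch. They are not activated by default and are always guarded by other C definitions prefixed with NPY__CPU_TARGET_. The NPY__CPU_TARGET_ definitions are only enabled within dispatch-able sources.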
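The effect of the guard can be seen in a small sketch. It reproduces only the guard logic for SSE41; the real header also includes <smmintrin.h>, which is left out here so the file compiles without SSE4.1 flags. The helper names `active_code_path` and `active_path_is` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the guarded section of the main configuration header. */
#ifdef NPY__CPU_TARGET_SSE41
  #define NPY_HAVE_SSE41 1
#endif

/* Reports which code path the preprocessor selected for this build. */
static const char *active_code_path(void)
{
#ifdef NPY_HAVE_SSE41
    return "sse41";
#else
    return "baseline";
#endif
}

/* Convenience helper: does the selected path have the given name? */
static int active_path_is(const char *name)
{
    assert(name != NULL);
    return strcmp(active_code_path(), name) == 0;
}
```

Compiled as an ordinary source, nothing defines NPY__CPU_TARGET_SSE41, so the baseline path is selected; passing -DNPY__CPU_TARGET_SSE41 on the command line imitates what a dispatch-able build of the same file would see.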

5- Dispatch-able sources and configuration statements

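Dispatch-able sources are special C files that can be compiled multiple times with different compiler flags and with different C definitions. These affect the code paths so that specific instruction sets are enabled for each compiled object according to "the configuration statements", which must be declared inside a C comment (/**/) and start with the special mark @targets at the top of each dispatch-able source. If optimization is disabled via the command argument --disable-optimization, dispatch-able sources are treated as normal C sources.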

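What are configuration statements?

Configuration statements are keywords combined together to determine the required optimizations for the dispatch-able source.

Example: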

/*@targets avx2 avx512f vsx2 vsx3 asimd asimdhp */
// C code

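The keywords mainly represent the additional optimizations configured via --cpu-dispatch, but they can also represent other options, such as:

  • Target groups: pre-configured configuration statements used for managing the required optimizations from outside the dispatch-able source.

  • Policies: collections of options used for changing the default behaviors or forcing the compiler to perform certain things.

  • "baseline": a unique keyword that represents the minimal optimizations configured via --cpu-baseline.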

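NumPy's infrastructure handles dispatch-able sources in four steps:

  • (A) Recognition: Just like source templates and F2PY, dispatch-able sources require a special extension, *.dispatch.c, to mark C dispatch-able source files; for C++ it is *.dispatch.cpp or *.dispatch.cxx. NOTE: C++ is not supported yet.

  • (B) Parsing and validating: In this step, the dispatch-able sources filtered by the previous step are parsed one by one, and the configuration statements of each are validated to determine the required optimizations.

  • (C) Wrapping: This is the approach taken by NumPy's infrastructure, which has proved flexible enough to compile a single source multiple times with different C definitions and flags that affect the code paths. It is achieved by creating a temporary C source for each required optimization that relates to the additional optimizations; the temporary source contains the declarations of the C definitions and includes the involved source via the C directive #include. For more clarification, take a look at the following code for AVX512F: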

    /*
     * this definition is used by NumPy utilities as suffixes for the
     * exported symbols
     */
    #define NPY__CPU_TARGET_CURRENT AVX512F
    /*
     * The following definitions enable
     * definitions of the dispatch-able features that are defined within the main
     * configuration header. These are definitions for the implied features.
     */
    #define NPY__CPU_TARGET_SSE
    #define NPY__CPU_TARGET_SSE2
    #define NPY__CPU_TARGET_SSE3
    #define NPY__CPU_TARGET_SSSE3
    #define NPY__CPU_TARGET_SSE41
    #define NPY__CPU_TARGET_POPCNT
    #define NPY__CPU_TARGET_SSE42
    #define NPY__CPU_TARGET_AVX
    #define NPY__CPU_TARGET_F16C
    #define NPY__CPU_TARGET_FMA3
    #define NPY__CPU_TARGET_AVX2
    #define NPY__CPU_TARGET_AVX512F
    // our dispatch-able source
    #include "/the/absolute/path/of/hello.dispatch.c"
    
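  • (D) Dispatch-able configuration header: The infrastructure generates a config header for each dispatch-able source. This header mainly contains two abstract C macros used for identifying the generated objects, so that any C source can use them to dispatch certain symbols from the generated objects at runtime. It is also used for forward declarations.

    The generated header takes the name of the dispatch-able source, with the extension removed and replaced by .h. For example, assume we have a dispatch-able source called hello.dispatch.c that contains the following: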

    // hello.dispatch.c
    /*@targets baseline sse41 avx512f */
    #include <stdio.h>
    #include "numpy/utils.h" // NPY_CAT, NPY_TOSTR
    
    #ifndef NPY__CPU_TARGET_CURRENT
      // Wrapping the dispatch-able source only happens for the additional
      // optimizations, but if the keyword 'baseline' is provided within the
      // configuration statements, the infrastructure adds an extra compilation
      // of the dispatch-able source by passing it as-is to the compiler.
      #define CURRENT_TARGET(X) X
      #define NPY__CPU_TARGET_CURRENT baseline // for printing only
    #else
      // Reaching this point means we're dealing with one of the additional
      // optimizations, so it could be SSE41 or AVX512F.
      #define CURRENT_TARGET(X) NPY_CAT(NPY_CAT(X, _), NPY__CPU_TARGET_CURRENT)
    #endif
    // The macro 'CURRENT_TARGET' adds the current target as a suffix to the
    // exported symbols to avoid duplicate symbols at link time. NumPy already
    // has a similar macro called 'NPY_CPU_DISPATCH_CURFX', located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h
    // NOTE: we tend not to add suffixes to the baseline exported symbols
    void CURRENT_TARGET(simd_whoami)(const char *extra_info)
    {
        printf("I'm " NPY_TOSTR(NPY__CPU_TARGET_CURRENT) ", %s\n", extra_info);
    }
    

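    Now assume you attached hello.dispatch.c to the source tree. The infrastructure should then generate a temporary config header called hello.dispatch.h that can be reached by any source in the source tree, and it should contain the following code: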

    #ifndef NPY__CPU_DISPATCH_EXPAND_
      // To expand the macro calls in this header
      #define NPY__CPU_DISPATCH_EXPAND_(X) X
    #endif
    // Undefining the following macros, due to the possibility of including config headers
    // multiple times within the same source and since each config header represents
    // different required optimizations according to the specified configuration
    // statements in the dispatch-able source that derived from it.
    #undef NPY__CPU_DISPATCH_BASELINE_CALL
    #undef NPY__CPU_DISPATCH_CALL
    // nothing strange here, just a normal preprocessor callback
    // enabled only if 'baseline' specified within the configuration statements
    #define NPY__CPU_DISPATCH_BASELINE_CALL(CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB(__VA_ARGS__))
    // 'NPY__CPU_DISPATCH_CALL' is an abstract macro used for dispatching
    // the required optimizations specified within the configuration statements.
    //
    // @param CHK, expected to be a macro that can detect CPU features at
    // runtime; it takes a CPU feature name without string quotes and
    // returns the test result as a boolean value.
    // NumPy already has a macro called "NPY_CPU_HAVE" that fits this requirement.
    //
    // @param CB, a callback macro expected to be called multiple times depending
    // on the required optimizations. The callback receives the following arguments:
    //  1- The pending calls of @param CHK, filled with the required CPU features
    //     that need to be tested at runtime before executing the call that
    //     belongs to the compiled object.
    //  2- The required optimization name, same as in 'NPY__CPU_TARGET_CURRENT'
    //  3- Extra arguments passed to the macro itself
    //
    // By default the callback calls are sorted from the highest interest down,
    // unless the policy "$keep_sort" is in place within the configuration statements;
    // see "Dive into the CPU dispatcher" for more clarification.
    #define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
      NPY__CPU_DISPATCH_EXPAND_(CB((CHK(SSE)&&CHK(SSE2)&&CHK(SSE3)&&CHK(SSSE3)&&CHK(SSE41)), SSE41, __VA_ARGS__))
    

    An example of using the config header in light of the above:

    // NOTE: The following macros are defined for demonstration purposes only.
    // NumPy already has a collection of macros located at
    // numpy/numpy/_core/src/common/npy_cpu_dispatch.h that covers all dispatching
    // and declaration scenarios.
    
    #include "numpy/npy_cpu_features.h" // NPY_CPU_HAVE
    #include "numpy/utils.h" // NPY_CAT, NPY_EXPAND
    
    // An example for setting a macro that calls all the exported symbols at once
    // after checking if they're supported by the running machine.
    #define DISPATCH_CALL_ALL(FN, ARGS) \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_ALL_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_ALL_CB, FN, ARGS)
    // The preprocessor callbacks.
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_ALL_CB(CHECK, TARGET_NAME, FN, ARGS) \
      if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_ALL_CB(FN, ARGS) \
      FN NPY_EXPAND(ARGS);
    
    // An example for setting a macro that calls the exported symbols of highest
    // interest optimization, after checking if they're supported by the running machine.
    #define DISPATCH_CALL_HIGH(FN, ARGS) \
      if (0) {} \
        NPY__CPU_DISPATCH_CALL(NPY_CPU_HAVE, DISPATCH_CALL_HIGH_CB, FN, ARGS) \
        NPY__CPU_DISPATCH_BASELINE_CALL(DISPATCH_CALL_BASELINE_HIGH_CB, FN, ARGS)
    // The preprocessor callbacks
    // The same suffixes as we define it in the dispatch-able source.
    #define DISPATCH_CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \
      else if (CHECK) { NPY_CAT(NPY_CAT(FN, _), TARGET_NAME) ARGS; }
    #define DISPATCH_CALL_BASELINE_HIGH_CB(FN, ARGS) \
      else { FN NPY_EXPAND(ARGS); }
    
    // NumPy has a macro called 'NPY_CPU_DISPATCH_DECLARE' that can be used
    // for forward declarations of any kind of prototype based on
    // 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
    // However, in this example we just handle it manually.
    void simd_whoami(const char *extra_info);
    void simd_whoami_AVX512F(const char *extra_info);
    void simd_whoami_SSE41(const char *extra_info);
    
    void trigger_me(void)
    {
        // Bring in the auto-generated config header, which contains the config
        // macros 'NPY__CPU_DISPATCH_CALL' and 'NPY__CPU_DISPATCH_BASELINE_CALL'.
        // It is highly recommended to include the config header before executing
        // the dispatching macros, in case another header is in scope.
        #include "hello.dispatch.h"
        DISPATCH_CALL_ALL(simd_whoami, ("all"))
        DISPATCH_CALL_HIGH(simd_whoami, ("the highest interest"))
        // An example of including multiple config headers in the same source
        // #include "hello2.dispatch.h"
        // DISPATCH_CALL_HIGH(another_function, ("the highest interest"))
    }
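The example above depends on headers that only exist inside a NumPy build. To experiment with the same callback pattern in isolation, the following self-contained sketch replaces every NumPy piece with a hypothetical stand-in: the `DEMO_` macros imitate the generated config header (simplified to one check per target), `FAKE_CPU_HAVE` and the `fake_have_` variables imitate runtime feature detection, and the three `simd_whoami` functions record their calls in a log instead of printing. None of these names come from NumPy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_CAT_(A, B) A##B
#define DEMO_CAT(A, B) DEMO_CAT_(A, B)
#define DEMO_EXPAND(X) X

/* Call log, so the chosen target can be inspected afterwards. */
static char call_log[128];

static void log_call(const char *target, const char *extra_info)
{
    size_t used = strlen(call_log);
    assert(used < sizeof(call_log));
    snprintf(call_log + used, sizeof(call_log) - used, "%s:%s;",
             target, extra_info);
}

/* Stand-ins for the objects compiled from the dispatch-able source. */
static void simd_whoami(const char *extra_info)         { log_call("baseline", extra_info); }
static void simd_whoami_SSE41(const char *extra_info)   { log_call("SSE41", extra_info); }
static void simd_whoami_AVX512F(const char *extra_info) { log_call("AVX512F", extra_info); }

/* Fake runtime CPU features; real code would query the CPU instead. */
static int fake_have_SSE41 = 1;
static int fake_have_AVX512F = 0;
#define FAKE_CPU_HAVE(FEATURE) DEMO_CAT(fake_have_, FEATURE)

/* Stand-in for the generated config header, highest interest first. */
#define DEMO_DISPATCH_CALL(CHK, CB, ...) \
    DEMO_EXPAND(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
    DEMO_EXPAND(CB((CHK(SSE41)), SSE41, __VA_ARGS__))
#define DEMO_DISPATCH_BASELINE_CALL(CB, ...) \
    DEMO_EXPAND(CB(__VA_ARGS__))

/* Callbacks that build an if / else-if / else chain. */
#define CALL_HIGH_CB(CHECK, TARGET_NAME, FN, ARGS) \
    else if (CHECK) { DEMO_CAT(DEMO_CAT(FN, _), TARGET_NAME) ARGS; }
#define CALL_BASELINE_HIGH_CB(FN, ARGS) \
    else { FN ARGS; }
#define CALL_HIGH(FN, ARGS) \
    if (0) {} \
    DEMO_DISPATCH_CALL(FAKE_CPU_HAVE, CALL_HIGH_CB, FN, ARGS) \
    DEMO_DISPATCH_BASELINE_CALL(CALL_BASELINE_HIGH_CB, FN, ARGS)

/* Clears the log, dispatches once, and returns the log contents. */
static const char *dispatch_once(void)
{
    call_log[0] = '\0';
    CALL_HIGH(simd_whoami, ("hi"))
    return call_log;
}
```

With the initial fake feature values, `dispatch_once()` selects the SSE41 variant; setting `fake_have_AVX512F` to 1 moves the choice to AVX512F, and clearing both falls back to the baseline symbol, mirroring the order of the callback calls in the generated header.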