NEP 8: A proposal for adding groupby functionality to NumPy

Author: Travis Oliphant
Contact: oliphant@enthought.com
Date: 2010-04-27
Status: Deferred

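Abstract

NumPy provides tools for handling data and performing calculations in much the same way that relational algebra allows. However, the common group-by operation is not easy to express. The reduce methods of NumPy's ufuncs are a natural place to put this groupby behavior. This NEP describes two additional ufunc methods (reduceby and reducein) and two additional functions (segment and edges) that can help add this functionality.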

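Example Use Case

Suppose you have a NumPy structured array containing information about the number of purchases at several stores over multiple days. To be clear, the data-type of the structured array is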

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

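Suppose there is a 1-d NumPy array of this data-type and you would like to compute various statistics (max, min, mean, sum, etc.) on the number of products sold, by product, by month, by store, etc.

Currently, this can be done by using reduce methods on the number field of the array, combined with in-place sorting, unique with return_inverse=True, bincount, and so on. However, for such a common data-analysis need, it would be nice to have a standard and more direct way to get the results.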
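The current workaround described above can be sketched roughly as follows. The sample records are made up purely for illustration, and the variable names (sales, totals, max_per_month) are assumptions rather than anything defined by this NEP. Only existing NumPy calls are used: unique, bincount, argsort, and ufunc.reduceat.

```python
import numpy as np

dt = [('year', 'i2'), ('month', 'i1'), ('day', 'i1'), ('time', float),
      ('store', 'i4'), ('SKU', 'S6'), ('number', 'i4')]

# Made-up sample records, for illustration only.
sales = np.array([
    (2010, 1, 5, 9.5, 1, b'A00001', 3),
    (2010, 1, 6, 10.0, 2, b'B00002', 1),
    (2010, 2, 1, 14.25, 1, b'A00001', 4),
    (2010, 2, 3, 16.0, 1, b'B00002', 2),
    (2010, 2, 9, 11.0, 2, b'A00001', 5),
], dtype=dt)

# Total units sold per SKU: map each record to a group id with
# unique(..., return_inverse=True), then sum per group with bincount.
skus, inverse = np.unique(sales['SKU'], return_inverse=True)
totals = np.bincount(inverse, weights=sales['number'])

# Maximum units in a single record per month: sort by month, find where
# each month's run starts, then reduce each run with reduceat.
order = np.argsort(sales['month'], kind='stable')
months, starts = np.unique(sales['month'][order], return_index=True)
max_per_month = np.maximum.reduceat(sales['number'][order], starts)

print(skus, totals)
print(months, max_per_month)
```

Note that bincount only covers sums (and counts), while other reductions need the sort-plus-reduceat route, which is part of why a single, more direct interface is appealing.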

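Proposed Ufunc Methods

It is proposed to add two new reduce-style methods to ufuncs: reduceby and reducein. The reducein method is intended to be a simpler-to-use version of reduceat, while the reduceby method is intended to provide group-by capability on reductions.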

reducein

<ufunc>.reducein(arr, indices, axis=0, dtype=None, out=None)

Perform a local reduce with slices specified by pairs of indices.

The reduction occurs along the provided axis, using the provided
data-type to calculate intermediate results, storing the result into
the array out (if provided).

The indices array provides the start and end indices for the
reduction.  If the length of the indices array is odd, then the
final index provides the beginning point for the final reduction
and the ending point is the end of arr.

This generalizes, along the given axis, the behavior:

[<ufunc>.reduce(arr[indices[2*i]:indices[2*i+1]])
        for i in range(len(indices)//2)]

This assumes indices is of even length.

Example:
   >>> a = [0,1,2,4,5,6,9,10]
   >>> add.reducein(a,[0,3,2,5,-2])
   [3, 11, 19]

   Notice that sum(a[0:3]) = 3; sum(a[2:5]) = 11; and sum(a[-2:]) = 19
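Since reducein is only proposed here and is not an existing ufunc method, the described behavior can be approximated in pure Python. The following is a minimal sketch for 1-d input only; the helper name reducein_sketch is hypothetical, and the axis, dtype, and out parameters are deliberately left out.

```python
import numpy as np

def reducein_sketch(ufunc, arr, indices):
    """Hypothetical 1-d emulation of the proposed <ufunc>.reducein."""
    arr = np.asarray(arr)
    indices = list(indices)
    if len(indices) % 2:
        # Odd length: the final reduction runs to the end of arr.
        indices.append(arr.shape[0])
    # Pair up (start, stop) indices and reduce each slice separately.
    pairs = zip(indices[0::2], indices[1::2])
    return np.array([ufunc.reduce(arr[start:stop]) for start, stop in pairs])

a = [0, 1, 2, 4, 5, 6, 9, 10]
result = reducein_sketch(np.add, a, [0, 3, 2, 5, -2])
print(result)
```

Unlike reduceat, where each index marks only the start of a segment, every slice here has an explicit start and stop, so slices may overlap or leave gaps.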

reduceby

<ufunc>.reduceby(arr, by, dtype=None, out=None)

Perform a reduction in arr over unique non-negative integers in by.


Let N=arr.ndim and M=by.ndim.  Then, by.shape[:N] == arr.shape.
In addition, let I be an N-length index tuple, then by[I]
contains the location in the output array for the reduction to
be stored.  Notice that if N == M, then by[I] is a non-negative
integer, while if N < M, then by[I] is an array of indices into
the output array.

The reduction is computed on groups specified by unique indices
into the output array. The index is either the single
non-negative integer if N == M or if N < M, the entire
(M-N+1)-length index by[I] considered as a whole.
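To make the N == M case concrete, here is a rough sketch that emulates the described grouping with existing NumPy calls. The helper name reduceby_sketch is hypothetical, and returning a dict keyed by output index is an assumption made to sidestep the question of what an output slot with no contributing elements should hold, which the text above does not specify. The N < M case is not covered.

```python
import numpy as np

def reduceby_sketch(ufunc, arr, by):
    """Hypothetical emulation of the proposed <ufunc>.reduceby (N == M only)."""
    arr = np.asarray(arr)
    by = np.asarray(by)
    if arr.shape != by.shape:
        raise ValueError("this sketch requires by.shape == arr.shape")
    if by.size and by.min() < 0:
        raise ValueError("by must contain non-negative integers")
    # For each distinct output index, reduce the elements mapped to it.
    return {int(g): ufunc.reduce(arr[by == g]) for g in np.unique(by)}

values = np.array([5, 1, 7, 2, 4])
groups = np.array([0, 2, 0, 2, 1])

sums = reduceby_sketch(np.add, values, groups)
maxima = reduceby_sketch(np.maximum, values, groups)
print(sums)
print(maxima)
```

The boolean mask makes one pass over the data per group, which is simple but not efficient; a native implementation would presumably accumulate all groups in a single pass.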

Proposed Functions

segment
edges