Padding-free Dynamic Batching

Previous works such as TensorFlow Fold typically take RNNs, especially TreeRNNs, as examples to illustrate dynamic batching. InsNet can properly execute such models in batch.

But it is the Transformer era now, so in this topic we discuss InsNet’s padding-free dynamic batching mechanism using the forward pass of the Transformer’s self-attention as an example. To simplify the illustration, we do not cover multi-head attention, but the method we discuss generalizes to any operator.

Self-attention’s Forward Pass Example

Consider a mini-batch of matrices \(\{X_i\}_{i=1}^b\) satisfying \(row(X_i) = d\), where \(row(X_i)\) and \(col(X_i)\) return the row and column numbers of \(X_i\), respectively (the \(col(X_i)\) are generally not equal, since we do not use padding to align shapes). We can pass them to self-attention as follows:

\[\begin{split}\begin{align} Q_i = W_Q X_i\tag{1}\\ K_i = W_K X_i\tag{2}\\ V_i = W_V X_i\tag{3}\\ S_i = {K_i}^T Q_i\tag{4}\\ {S_i}'= \frac{1}{\sqrt{d}} S_i\tag{5}\\ A_i = softmax({S_i}')\tag{6}\\ Y_i = V_i A_i\tag{7} \end{align}\end{split}\]
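
For concreteness, suppose \(d = 4\), the projection matrices are square (i.e., \(W_Q, W_K, W_V \in \mathbb{R}^{4 \times 4}\); an assumption made only for this illustration), and the mini-batch contains two sentences of lengths 3 and 5, so that \(X_1 \in \mathbb{R}^{4 \times 3}\) and \(X_2 \in \mathbb{R}^{4 \times 5}\). The intermediate results then have the following shapes, none of which is padded to a common length:

\[\begin{split}\begin{align} &Q_1, K_1, V_1, Y_1 \in \mathbb{R}^{4 \times 3}, \quad S_1, {S_1}', A_1 \in \mathbb{R}^{3 \times 3}\\ &Q_2, K_2, V_2, Y_2 \in \mathbb{R}^{4 \times 5}, \quad S_2, {S_2}', A_2 \in \mathbb{R}^{5 \times 5} \end{align}\end{split}\]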

Executing the same formula in batch generally speeds up computation, especially on GPUs. To this end, following DyNet, InsNet maps every operator to a signature; operators with the same signature are to be computed in batch. In the following, we discuss how InsNet executes the above formulas one by one.
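
Before walking through the formulas, here is a minimal sketch of the batching idea itself. It is not InsNet’s actual API (Op and executeInBatches are hypothetical names used only for illustration): pending operators are bucketed by signature, and each bucket is executed together.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Every pending operator reports a signature; operators sharing a signature
    // are gathered into one group and executed together.
    struct Op {
        virtual ~Op() = default;
        virtual std::string signature() const = 0;
        // Executes a whole group of operators that share this operator's signature.
        virtual void forwardBatch(const std::vector<Op *> &group) = 0;
    };

    void executeInBatches(const std::vector<Op *> &pending_ops) {
        // Bucket operators by signature.
        std::unordered_map<std::string, std::vector<Op *>> groups;
        for (Op *op : pending_ops) {
            groups[op->signature()].push_back(op);
        }
        // Each bucket is executed as one batched computation.
        for (auto &kv : groups) {
            kv.second.front()->forwardBatch(kv.second);
        }
    }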

\(Y = W X\)

Formulas (1) ~ (3) are all linear transformations. Taking Formula (1) as the example, we can first merge \(X_1, X_2, ... , X_b\) into a single matrix \(\bigl[ \begin{smallmatrix}X_1 & X_2 & ... & X_b\end{smallmatrix} \bigr]\), then compute \(\bigl[ \begin{smallmatrix}Q_1 & Q_2 & ... & Q_b\end{smallmatrix} \bigr] = W_Q \bigl[ \begin{smallmatrix}X_1 & X_2 & ... & X_b\end{smallmatrix} \bigr]\), and finally split the matrix \(\bigl[ \begin{smallmatrix}Q_1 & Q_2 & ... & Q_b\end{smallmatrix} \bigr]\) back into \(Q_1, Q_2, ... , Q_b\).

From the above computation process, we can see that linear transformations sharing the same parameter matrix can be executed in batch. Thus we set the signature to "linear-" + addr(W), where addr(W) is the address of the parameter matrix \(W\).
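
The merge, multiply, and split procedure can be sketched as follows. This is a simplified CPU illustration, not InsNet’s implementation; the Matrix struct and batchedLinear are hypothetical, and matrices are assumed to be stored column-major so that column-wise concatenation is a plain memory copy.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Column-major matrix: element (r, c) is stored at data[c * rows + r].
    struct Matrix {
        int rows = 0, cols = 0;
        std::vector<float> data;
        Matrix(int r, int c) : rows(r), cols(c), data(static_cast<std::size_t>(r) * c) {}
    };

    // Naive GEMM: Y = W * X, where W is (out x in) and X is (in x n).
    Matrix matmul(const Matrix &W, const Matrix &X) {
        Matrix Y(W.rows, X.cols);
        for (int c = 0; c < X.cols; ++c) {
            for (int r = 0; r < W.rows; ++r) {
                float sum = 0.0f;
                for (int k = 0; k < W.cols; ++k) {
                    sum += W.data[k * W.rows + r] * X.data[c * X.rows + k];
                }
                Y.data[c * Y.rows + r] = sum;
            }
        }
        return Y;
    }

    // Batched linear transformation sharing one parameter matrix W:
    // concatenate the inputs column-wise, run a single GEMM, and split the
    // result back into per-example outputs.
    std::vector<Matrix> batchedLinear(const Matrix &W, const std::vector<Matrix> &xs) {
        int total_cols = 0;
        for (const Matrix &x : xs) total_cols += x.cols;

        // [X_1 X_2 ... X_b]: every X_i has row(X_i) = col(W), so with
        // column-major storage the concatenation is one contiguous copy.
        Matrix merged(W.cols, total_cols);
        float *dst = merged.data.data();
        for (const Matrix &x : xs) {
            dst = std::copy(x.data.begin(), x.data.end(), dst);
        }

        // [Q_1 Q_2 ... Q_b] = W [X_1 X_2 ... X_b]
        Matrix product = matmul(W, merged);

        // Split back into Q_1, ..., Q_b with the original column counts.
        std::vector<Matrix> ys;
        int col_offset = 0;
        for (const Matrix &x : xs) {
            Matrix y(W.rows, x.cols);
            const float *src = product.data.data() + static_cast<std::size_t>(col_offset) * W.rows;
            std::copy(src, src + y.data.size(), y.data.begin());
            ys.push_back(std::move(y));
            col_offset += x.cols;
        }
        return ys;
    }

Since all operators sharing the signature also share \(W\), the merged step is a single large matrix multiplication, which is where the speedup comes from.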

\(Y = A^T B\)

One way to implement formula (4) is to divide it into two operators, i.e., matrix transposition and matrix multiplication, which would help keep the public API fine-grained and orthogonal. However, considering the additional data transfer that materializing the transposed matrix would cause, and more importantly, the cache friendliness it would lose, we implement it as a single operator.

Then we need to determine under what condition on the input matrices the operators can be executed in batch. Still taking formula (4) as the example, the condition should obviously not be that the shapes of \(\{K_i\}_{i=1}^b\) and of \(\{Q_i\}_{i=1}^b\) are respectively equal, for \(col(K_i)\) and \(col(Q_i)\) differ within a mini-batch; it suffices that the row numbers match, and they all equal \(d\). Thus we set the signature to "transpose-mul-" + to_string(row(A)).
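
The sketch below (again hypothetical names, not InsNet’s API) shows the signature and the semantics of one batched group: every \(K_i\) and \(Q_i\) shares the row number \(d\), while the column numbers \(n_i\) may differ, so each output \(S_i\) is \(n_i \times n_i\). The plain loop only illustrates the semantics; in practice the whole group would be executed as one batched kernel.

    #include <cstddef>
    #include <string>
    #include <vector>

    // The signature of Y = A^T B depends only on row(A), which equals d for every K_i.
    std::string transposeMulSignature(int a_row) {
        return "transpose-mul-" + std::to_string(a_row);
    }

    // One batched group: ks[i] and qs[i] hold K_i and Q_i, stored column-major
    // as flat buffers of size d * n_i, where n_i may differ across examples.
    std::vector<std::vector<float>> batchedTransposeMul(int d,
            const std::vector<std::vector<float>> &ks,
            const std::vector<std::vector<float>> &qs) {
        std::vector<std::vector<float>> ss;
        for (std::size_t i = 0; i < ks.size(); ++i) {
            int n = static_cast<int>(ks[i].size()) / d;  // n_i = col(K_i) = col(Q_i)
            std::vector<float> s(static_cast<std::size_t>(n) * n);
            for (int q = 0; q < n; ++q) {      // column q of Q_i -> column q of S_i
                for (int k = 0; k < n; ++k) {  // column k of K_i -> row k of S_i
                    float sum = 0.0f;
                    for (int r = 0; r < d; ++r) {
                        sum += ks[i][k * d + r] * qs[i][q * d + r];
                    }
                    s[static_cast<std::size_t>(q) * n + k] = sum;
                }
            }
            ss.push_back(std::move(s));
        }
        return ss;
    }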

\(Y = \alpha X\)

Taking formula (5) as the example, since the sizes of \(\{S_i\}_{i=1}^b\) are not equal in the mini-batch, we define the signature as "factor", i.e., InsNet executes this type of operator in batch regardless of the input sizes.

More generally, InsNet executes all single-input element-wise operators \(Y = F(X)\) satisfying \(y_i = f(x_i), 0 \le i < size(X)\), e.g., dropout, tanh, and relu, in batch regardless of their sizes.
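
A minimal sketch of such size-agnostic batching (batchedElementwise is a hypothetical name, not an InsNet function): concatenate all inputs into one flat buffer, apply \(f\) in a single pass, and split the result back according to the original sizes.

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    std::vector<std::vector<float>> batchedElementwise(
            const std::vector<std::vector<float>> &xs,
            const std::function<float(float)> &f) {
        // Concatenate all inputs, no matter how large each of them is.
        std::vector<float> flat;
        for (const auto &x : xs) flat.insert(flat.end(), x.begin(), x.end());

        // One pass over the whole mini-batch (conceptually a single kernel launch).
        for (float &v : flat) v = f(v);

        // Split the result back according to the original sizes.
        std::vector<std::vector<float>> ys;
        std::size_t offset = 0;
        for (const auto &x : xs) {
            ys.emplace_back(flat.begin() + offset, flat.begin() + offset + x.size());
            offset += x.size();
        }
        return ys;
    }

    // Formula (5) is the special case f(x) = x / sqrt(d), e.g.:
    //   auto scaled = batchedElementwise(ss, [&](float x) { return x / std::sqrt(float(d)); });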

\(Y = softmax(X)\)

Similarly, we define softmax’s signature as "softmax", which means InsNet executes all softmax operators in batch.
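
Softmax can be batched across arbitrary shapes because each column is normalized independently (we assume here that softmax is applied column-wise, consistent with formulas (6) and (7)), so columns coming from matrices of different sizes can be processed in the same pass. Below is a sketch of column-wise softmax for a single column-major matrix (softmaxColumns is a made-up name, not InsNet’s API); a batched version would run the same per-column computation over all columns of every \({S_i}'\) in the group.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Column-wise softmax of a (rows x cols) column-major matrix, in place.
    void softmaxColumns(std::vector<float> &m, int rows, int cols) {
        for (int c = 0; c < cols; ++c) {
            float *col = m.data() + static_cast<std::size_t>(c) * rows;
            float max_v = *std::max_element(col, col + rows);  // for numerical stability
            float sum = 0.0f;
            for (int r = 0; r < rows; ++r) {
                col[r] = std::exp(col[r] - max_v);
                sum += col[r];
            }
            for (int r = 0; r < rows; ++r) {
                col[r] /= sum;
            }
        }
    }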

\(Y = A B\)

To properly execute formula (7) in batch, we define the matrix multiplication’s signature as "mul-" + to_string(row(A)). For formula (7), the first operand is \(V_i\), whose row number is the shared hidden size \(d\), so all these multiplications fall into the same batch even though the attention matrices \(A_i\) have different sizes.
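
As a small illustration (mulSignature is a hypothetical name, not InsNet’s API):

    #include <string>

    // For Y = A B, only row(A) enters the signature; in formula (7) the first
    // operand is V_i, whose row number is the shared hidden size d, so every
    // example in the mini-batch maps to the same string regardless of its length.
    std::string mulSignature(int a_row) {
        return "mul-" + std::to_string(a_row);
    }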

Importance of Model Design Bias

One may ask: should we define these general-purpose operators’ signatures just to fit self-attention? More generally, should we exploit model design bias?

Our answer is “Yes”, because the padding-free dynamic batching task is only tractable when exploiting model design bias. For example, recalling how InsNet batches \(Y = W X\), we can see that the efficiency of \(\bigl[ \begin{smallmatrix}Y_1 & Y_2 & ... & Y_b\end{smallmatrix} \bigr] = W \bigl[ \begin{smallmatrix}X_1 & X_2 & ... & X_b\end{smallmatrix} \bigr]\) is guaranteed by the assumption that \(W\) is shared within the mini-batch. Otherwise, why not try \(Y = \bigl[ \begin{smallmatrix}W_1 & W_2 & ... & W_N\end{smallmatrix} \bigr]^T X\) instead?