APIs

The API design follows these principles:

  1. Tensor dimensions are no greater than 2, i.e., they are all column-major matrices.

  2. Shape information is not kept in Node objects or broadcast among them; it is coupled only with operators, while Node objects keep only their sizes. For example, the shapes of matmul's input matrices A and B are inferred from A.size(), B.size() and b_row, while operators such as add do not care about shapes, only sizes. As a result, shape-related methods such as reshape and view are unnecessary (see the sketch after this list).

  3. For each operator, we define a corresponding batching rule, which is designed to be transparent so that users can safely ignore it.

  4. Operators and modules are all functions.

  5. We borrow function names from PyTorch rather than inventing our own, except for operators that PyTorch does not include but InsNet implements for strong reasons.
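
For example, the following sketch illustrates principles 1 and 2: a 6-element Node can serve as a 3x2 matrix in matmul without any reshape (graph denotes an already constructed Graph; the values are illustrative):

    Node *a = insnet::tensor(graph, {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}); // size 6
    Node *b = insnet::tensor(graph, {1, 0, 0, 1});                   // size 4
    // matmul infers that a is 3x2 and b is 2x2 from their sizes and
    // b_row = 2; the result's size is 6.
    Node *c = insnet::matmul(*a, *b, 2);
    // add only checks that the sizes match, so no reshape is needed.
    Node *d = insnet::add({a, c});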

Operators

Node *insnet::add(const std::vector<Node*> &inputs)

Add input tensors.

The add operators that have the same number of inputs will be executed in batch. For example, [0.1, 0.1] + [0.2, 0.2] and [0.3, 0.3, 0.3] + [0.4, 0.4, 0.4] will be executed in batch, but [0.1, 0.1] + [0.2, 0.2] and [0.1, 0.1] + [0.2, 0.2] + [0.3, 0.3] will not. If the latter is your case, use sumPool instead.

Parameters

inputs – The input tensors to be added. Note that their sizes should be equal.

Returns

The sum of the input tensors. Its size is equal to the size of any input tensor.
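
For example, a minimal sketch of both cases (graph denotes an already constructed Graph):

    Node *x = insnet::tensor(graph, {0.1, 0.1});
    Node *y = insnet::tensor(graph, {0.2, 0.2});
    Node *s = insnet::add({x, y});                         // [0.3, 0.3]
    // To sum a variable number of equally sized tensors, concatenate
    // them and use sumPool so that the operators can still batch:
    Node *t = insnet::sumPool(*insnet::cat({x, y, s}), 2); // [0.6, 0.6]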

Node *insnet::affine(Node &input, LayerNormParams &params)

The affine transformation in layer normalization. \([{x_0}{g_0}, {x_1}{g_1}, ..., {x_n}{g_n}] + [b_0, b_1, ..., b_n]\)

Following the LayerNorm section of the PyTorch documentation, we call this an affine transformation, though it is actually a simplified version. For example, supposing params.g() is [0.1, -0.1] and params.b() is [0, 0], affine([1, -1, -1, 1], params) will return [0.1, 0.1, -0.1, -0.1].

The operators with the same parameters will be executed in batch.

Parameters
  • input – The input tensor.

  • params – g and b.

Returns

The affine transformed tensor. Its size is equal to input.size().

std::vector<std::vector<int>> insnet::argmax(const std::vector<Node*> &nodes, int row)

Return indexes of row-wise max values.

For example, argmax({[0.1, 0.2], [0.1, 0.2, 0.3, 0.4]}, 2) returns {{1}, {1, 1}}.

It is not differentiable and will be executed eagerly.

Parameters
  • nodes – The input matrices. Their sizes may vary but should all be divisible by row.

  • row – The row number of nodes.

Returns

The result indexes.

Node *insnet::avgPool(Node &input, int row)

Find the column-wise average pooling.

It is the shortcut of mul(*sumPool(input, row), static_cast<dtype>(row) / input.size()).

Node *insnet::bias(Node &X, BiasParam &b)

Add the bias term to the input tensor. \(X + [b b .. b]\).

The operators with the same bias will be executed in batch.

Parameters
  • X – The input tensor.

  • b – The bias parameters.

Returns

The result tensor. Its size is equal to X.size().

Node *insnet::cat(const std::vector<Node*> &inputs, int col = 1)

Concatenate the input matrices into the result matrix with a specified column number.

For example, cat({[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]}) will return [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4], and cat({[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]}, 2) will return [0.1, 0.2, 0.1, 0.2, 0.3, 0.4, 0.3, 0.4].

The operators whose column number and input matrix sizes are equal one by one will be executed in batch. For example, cat({[0.1, 0.2, 0.3, 0.4], [0.1], [0.1, 0.2]}) and cat({[0, 0, 0, 0], [0], [0, 0]}) will be executed in batch.

The operators whose column number is 1 and whose input tensor sizes are all the same will also be executed in batch. This rule is especially useful when concatenating RNN hidden states. For example, cat({[0.1, 0.2], [0.1, 0], [0.1, 0.2]}) and cat({[0, 0], [0, 0]}) will be executed in batch because their input tensors all have the size of 2, though the two operators have different numbers of input tensors.

Parameters
  • inputs – The input matrices.

  • col – The column number of both the input matrices and the result matrix. The default value is 1.

Returns

The result matrix. Its size is equal to the sum of all input matrix sizes.

Node *insnet::div(Node &dividend, Node &divisor)

The pointwise div operator.

For example, div([0.1, 0.2], [0.1, 0.2]) will return [1, 1].

All div operators will be executed in batch.

Parameters
  • dividend – The dividend tensor. Its size should be equal to divisor.size().

  • divisor – The divisor tensor. Its size should be equal to dividend.size().

Returns

The result tensor. Its size is equal to dividend.size() and divisor.size().

Node *insnet::dropout(Node &input, dtype p)

The dropout function. In particular, if p is no greater than 1e-10, the input tensor will be returned directly.

If the graph is set to the training stage, it drops out each element independently with probability p. Otherwise, it scales all elements by (1 - p).

The operators with the same dropout probability will be executed in batch. For example, dropout([0.1, 0.1], 0.1) and dropout([0.2, 0.2, 0.2], 0.1) will be executed in batch, but dropout([0.1, 0.1], 0.1) and dropout([0.2, 0.2], 0.2) will not.

Parameters
  • input – The input tensor.

  • p – The dropout probability.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::embedding(Graph &graph, const std::vector<std::string> &words, EmbeddingAbs &table, bool freeze = false)

Find embeddings from a parameter matrix with the specified words.

The operators with the same parameter matrix will be executed in batch. For example, given the same table, embedding(graph, {"what", "is", "zero-padding", "?"}, table) and embedding(graph, {"zero-padding", "is", "a", "waste", "of", "memory", "."}, table) will be executed in batch.

Parameters
  • graph – The computation graph.

  • words – The words to find the embeddings.

  • table – The embedding table. If the gradients will be sparse, pass an instance of Embedding<SparseParam>; otherwise, pass an instance of Embedding<Param>.

  • freeze – Whether to freeze the embedding table. The default value is false.

Returns

The result tensor of the concatenated found embeddings. Its size is equal to table.param().row() * words.size().

Node *insnet::embedding(Graph &graph, const std::string &word, EmbeddingAbs &table, bool freeze = false)

Find embeddings from a parameter matrix with the specified words.

It is a shortcut to call embedding(graph, {word}, table, freeze).

Node *insnet::embedding(Graph &graph, const std::vector<int> &ids, BaseParam &param, bool freeze = false)

Find embeddings from a parameter matrix with the specified ids.

For example, assuming we have the parameter matrix param = [[0.1, 0.1], [0.2, 0.2]], embedding(graph, {1, 0}, param) will return [0.2, 0.2, 0.1, 0.1].

The operators with the same parameter matrix will be executed in batch. For example, given the same param, embedding(graph, {0, 122, 3333, 33333, 1}, param) and embedding(graph, {0, 323, 34223, 1}, param) will be executed in batch.

Parameters
  • graph – The computation graph.

  • ids – The column numbers of the parameter matrix to find the embeddings.

  • param – The parameter matrix. If the gradients will be sparse, pass an instance of SparseParam; otherwise, pass an instance of Param.

  • freeze – Whether to freeze the parameter matrix. The default value is false.

Returns

The result tensor of the concatenated found embeddings. Its size is equal to param.row() * ids.size().

Node *insnet::embedding(Graph &graph, int id, BaseParam &param, bool freeze = false)

Find embeddings from a parameter matrix with the specified id.

It is a shortcut to call embedding(graph, {id}, param, freeze).
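
For example, a sketch of embedding a tokenized sentence (table denotes an initialized Embedding<Param> whose row number is emb_dim):

    std::vector<std::string> sentence = {"the", "cat", "sat"};
    Node *word_vectors = insnet::embedding(graph, sentence, table);
    // word_vectors' size is emb_dim * 3; downstream operators treat it
    // as an emb_dim x 3 matrix through their row arguments.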

Node *insnet::exp(Node &input)

The pointwise exp function.

All exp operators will be executed in batch.

Parameters

input – The input tensor.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::expandColumnwisely(Node &input, int col)

Expand the input tensor in the column-wise direction.

For example, expandColumnwisely([0.1, 0.2], 3) will return [0.1, 0.2, 0.1, 0.2, 0.1, 0.2].

The operators whose input tensors' sizes are equal will be executed in batch. For example, expandColumnwisely([0.1, 0.2], 3) and expandColumnwisely([0.3, 0.4], 4) will be executed in batch.

Parameters
  • input – The input tensor.

  • col – The column number to expand with.

Returns

The expanded tensor. Its size is equal to input.size() * col.

Node *insnet::expandRowwisely(Node &input, int row)

Expand the input tensor in the row-wise direction.

For example, expandRowwisely([0.1, 0.2], 3) will return [0.1, 0.1, 0.1, 0.2, 0.2, 0.2].

The operators whose input tensors' sizes are equal will be executed in batch. However, this batching rule does not seem reasonable enough and needs to be revised.

Parameters
  • input – The input tensor.

  • row – The row number to expand with.

Returns

The expanded tensor. Its size is equal to input.size() * row.
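
A typical use is broadcasting per-column statistics back over the rows. For example, the following sketch composes a row-wise softmax from primitives (prefer insnet::softmax in real code; x denotes a matrix Node with the given row number):

    Node *e = insnet::exp(*x);          // pointwise exp
    Node *z = insnet::sum(*e, row);     // per-column sums
    // Broadcast each column's sum over its rows, then divide pointwise.
    Node *y = insnet::div(*e, *insnet::expandRowwisely(*z, row));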

Node *insnet::layerNorm(Node &input, int row)

The row-wise layer normalization with the specified row number.

For example, layerNorm([1.1, 0.9, -1.2, -0.8], 2) will return [1, -1, -1, 1].

The operators with the same row number will be executed in batch. For example, layerNorm([0.1, 0.2, 0.3, 0.4], 2) and layerNorm([0.1, 0.2], 2) will be executed in batch, but layerNorm([0.1, 0.2, 0.3, 0.4], 2) and layerNorm([0.1, 0.2, 0.3, 0.4], 4) will not.

Parameters
  • input – The input tensor.

  • row – The row number.

Returns

The normalized tensor with the mean value of 0 and the standard deviation of 1. Its size is equal to input.size().

Node *insnet::layerNorm(Node &input, LayerNormParams &params)

The row-wise layer normalization with the parameters of the subsequent affine transformation.

For example, supposing params.g() is [0.1, -0.1] and params.b() is [0, 0], layerNorm([1.1, 0.9, -1.2, -0.8], params) will return [0.1, 0.1, -0.1, -0.1].

The operators with the same parameters will be executed in batch.

Parameters
  • input – The input tensor.

  • params – g and b.

Returns

The affine transformed normalized tensor. Its size is equal to input.size().

Node *insnet::linear(Node &X, LinearParams &params)

The linear transformation with bias. \({W^T}{X} + [b b .. b]\).

You can disable the bias term when initializing params.

The operators with the same parameters will be executed in batch. For example, supposing we have params with a 2x2 weight matrix, linear([0.1, 0.2, 0.3, 0.4], params) and linear([0.1, 0.2], params) will be executed in batch.

Parameters
  • X – The input tensor.

  • params – W and b.

Returns

The transformed tensor. Its size is equal to X.size() / params.W.row() * params.W.col().
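
For example, a sketch of one feed-forward layer (params denotes initialized LinearParams; x may pack a mini-batch of column vectors):

    // relu(W^T x + b); if x packs several columns, they are all
    // transformed at once.
    Node *h = insnet::relu(*insnet::linear(*x, params));
    // h's size is x->size() / params.W.row() * params.W.col().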

Node *insnet::linear(Node &X, Param &W)

The linear transformation. \({W^T}{X}\).

This operator is especially useful when you want to share the weight matrix with another component. For example, to tie the input and output embeddings, call this operator like linear(h, emb_table.param()).

The operators with the same weight matrix will be executed in batch.

Parameters
  • X – The input tensor.

  • W – The weight matrix.

Returns

The transformed tensor. Its size is equal to X.size() / W.row() * W.col().

Node *insnet::logSoftmax(Node &input, int row)

The row-wise log softmax operator.

For example, logSoftmax([5, 0, 0, -5], 2) returns [-0.0067, -5.0067, -0.0067, -5.0067].

The operators with the same row will be executed in batch.

Parameters
  • input – The input tensor.

  • row – The row number. Note that the input tensor’s size should be divisible by the row number.

Returns

The result tensor. Its size is equal to input.size().

inline Node *insnet::logSoftmax(Node &input)

The row-wise log softmax operator of the input vector.

This is the shortcut of logSoftmax(input, input.size()).

Node *insnet::matmul(Node &A, Node &B, int b_row, bool transpose_a = false, bool use_lower_triangular_mask = false)

Matrix multiplication. \(A B\) (if transpose_a is false) or \(A^T B\) (if transpose_a is true).

Whether transpose_a is true or false, the shapes of A and B are determined by b_row. InsNet supports transposing A in matmul for cache friendliness.

Whether transpose_a is true or false, the operators with the same row number of A will be executed in batch. For example, \(K^T Q\) and \(VA\) of the Transformer in the same mini-batch will be executed in batch.

Parameters
  • A – Matrix A whose row number is A.size() / b_row and column number is b_row.

  • B – Matrix B whose row number is b_row and column number is B.size() / b_row.

  • b_row – The row number of matrix B.

  • transpose_a – Whether to transpose A before matrix multiplication. The default value is false.

  • use_lower_triangular_mask – Whether to apply a lower triangular mask of -inf to the result matrix. It is typically used in the Transformer decoder's self-attention. Note that it must not be set to true when transpose_a is false or when the result matrix is not square. The default value is false.

Returns

The result matrix. Whether transpose_a is true or false, its size is A.size() / b_row * B.size() / b_row.
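
For example, a sketch of masked attention scores in the style of the \(K^T Q\) example above (k, q and v denote d x len matrices packed as Nodes; d and len are illustrative, and the \(1/\sqrt{d}\) scaling is omitted for brevity):

    // K^T Q with the lower triangular -inf mask; the result is len x len.
    Node *scores = insnet::matmul(*k, *q, d, true, true);
    Node *weights = insnet::softmax(*scores, len);     // row-wise softmax
    Node *context = insnet::matmul(*v, *weights, len); // V A, i.e., d x len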

Node *insnet::max(Node &input, int row)

Find the row-wise max scalars of the input tensor.

The max operators whose returned tensors have the same size will be executed in batch. However, this batching rule does not seem reasonable enough and needs to be revised.

Parameters
  • input – The input tensor.

  • row – The row number for which the row-wise max should be calculated. Note that the input tensor’s size should be divisible by the row number.

Returns

The tensor of maximal values. Its size is equal to input.size() / row.

Node *insnet::maxPool(Node &input, int row)

Find the column-wise max pooling.

For example, maxPool([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 2) returns [0.5, 0.6], and maxPool([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 3) returns [0.4, 0.5, 0.6].

The operators with the same row number will be executed in batch. For example, maxPool([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 2) and maxPool([0.1, 0.2], 2) will be executed in batch. This guarantees that the maxPool operators in the same mini-batch will generally be executed in batch.

Parameters
  • input – The input tensor.

  • row – The row number. Note that the input tensor’s size should be divisible by the row number.

Returns

The result tensor. Its size is row.

Node *insnet::minPool(Node &input, int row)

Find the column-wise min pooling.

It is the shortcut of mul(*maxPool(*mul(input, -1), row), -1).

Node *insnet::mul(Node &input, dtype factor)

It multiplies the input tensor by the factor. For example, mul([0.1, 0.1], 2) will return [0.2, 0.2].

All mul operators will be executed in batch.

Parameters
  • input – The input tensor.

  • factor – The number to multiply with.

Returns

The multiplied tensor. Its size is equal to input.size().

Node *insnet::mul(Node &a, Node &b)

Pointwise multiplication. \([{a_0}{b_0}, {a_1}{b_1}, ..., {a_n}{b_n}]\)

The operators with the same a.size() (and thus the same b.size()) will be executed in batch. For example, in RNNs, the pointwise multiplications of the forget gate and the last hidden state in the same mini-batch or beam search will be executed in batch.

Parameters
  • a – The first input tensor.

  • b – The second input tensor. The two operands are of course interchangeable.

Returns

The result tensor. Its size is equal to both a.size() and b.size().

Node *insnet::param(Graph &graph, BaseParam &param)

Copy the parameters to the Node object.

The param operator should be used only once in a computation graph.

Parameters
  • graph – The computation graph.

  • param – The parameters.

Returns

The result tensor. Its size is equal to param.size().

Node *insnet::relu(Node &input)

The relu activation function.

All relu operators will be executed in batch.

Parameters

input – The input tensor.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::sigmoid(Node &input)

The sigmoid activation function.

All sigmoid operators will be executed in batch.

Parameters

input – The input tensor.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::softmax(Node &input, int row)

The row-wise softmax operator.

For example, softmax([0.1, 0.1, 0.2, 0.2], 2) returns [0.5, 0.5, 0.5, 0.5].

All softmax operators will be executed in batch. This guarantees that all self-attention and cross-attention operations in the same layer will be executed in batch.

Parameters
  • input – The input tensor.

  • row – The row number. Note that the input tensor’s size should be divisible by the row number.

Returns

The result tensor. Its size is equal to input.size().

inline Node *insnet::softmax(Node &input)

The row-wise softmax operator of the input vector.

This is the shortcut of softmax(input, input.size()).

Node *insnet::split(Node &input, int result_row, int row_offset, int input_col = 1)

Return a row-wise slice of a matrix, i.e., rows [row_offset, row_offset + result_row) of each column.

For example, split([0.1, 0.2, 0.3, 0.4], 2, 2) will return [0.3, 0.4] and split([0.1, 0.2, 0.3, 0.4], 1, 1, 2) will return [0.2, 0.4].

All the operators will be executed in batch.

Parameters
  • input – The input tensor.

  • result_row – The result tensor’s row number. It should be no greater than input.size() / input_col.

  • row_offset – The row-wise offset where the split begins.

  • input_col – The column number of the input matrix.

Returns

The result tensor. Its size is result_row * input_col.
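
For example, a sketch of taking the last column (e.g., the final hidden state) of a row x n matrix h (h, row and n are illustrative):

    // Columns are stored contiguously, so with input_col = 1 the last
    // column occupies the last `row` elements of h.
    Node *last = insnet::split(*h, row, (n - 1) * row);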

Node *insnet::sqrt(Node &input)

The pointwise sqrt function.

All sqrt operators will be executed in batch.

Parameters

input – The input tensor.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::sub(Node &a, Node &b)

Subtract one input tensor from another. \([a_0-b_0, a_1-b_1, ..., a_n-b_n]\).

All sub operators will be executed in batch.

Parameters
  • a – Its size should be equal to b.size().

  • b – Its size should be equal to a.size().

Returns

The result tensor. Its size is equal to both a.size() and b.size().

Node *insnet::sum(Node &input, int input_row)

Sum up the input tensor’s elements in the row-wise direction.

For example, sum([0.1, 0.1, 0.1, 0.2, 0.2, 0.2], 3) will return [0.3, 0.6].

If you want to sum up in the column-wise direction, use sumPool instead.

The operators that return tensors of the same size will be executed in batch. However, this batching rule does not seem reasonable enough and needs to be revised.

Parameters
  • input – The input tensor.

  • input_row – The input tensor’s row number.

Returns

The result tensor. Its size is equal to input.size() / input_row.

Node *insnet::sumPool(Node &input, int row)

Find the column-wise sum pooling.

For example, sumPool([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 2) returns [0.9, 1.2], and sumPool([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 3) returns [0.5, 0.7, 0.9].

Similar to maxPool, the operators with the same row number will be executed in batch.

Parameters
  • input – The input tensor.

  • row – The row number. Note that the input tensor’s size should be divisible by the row number.

Returns

The result tensor. Its size is row.

Node *insnet::tanh(Node &input)

The tanh activation function.

All tanh operators will be executed in batch.

Parameters

input – The input tensor.

Returns

The result tensor. Its size is equal to input.size().

Node *insnet::tensor(Graph &graph, const std::vector<dtype> &list)

Initialize a tensor with the computation graph and list.

For example, tensor(graph, {0.1, 0.2}) will return [0.1, 0.2].

The operators passed lists of the same size will be executed in batch. For example, tensor(graph, {0.1, 0.2}) and tensor(graph, {0.3, 0.4}) will be executed in batch.

Parameters
  • graph – The computation graph.

  • list – The list to initialize the tensor.

Returns

The result tensor. Its size is equal to list.size().

Node *insnet::tensor(Graph &graph, int size, dtype value)

Initialize a tensor with the computation graph, specified size and value.

For example, tensor(graph, 2, 0) will return [0, 0].

The operators passed the same size will be executed in batch. For example, tensor(graph, 1024, 0) and tensor(graph, 1024, 0.1) will be executed in batch.

Parameters
  • graph – The computation graph.

  • size – The result tensor’s size.

  • value – The value to initialize the tensor.

Returns

The result tensor. Its size is equal to size.

Modules

Node *insnet::multiheadAttention(Node &Q, Node &K, Node &V, int embed_dim, int num_heads, LinearParams &Wo, dtype dropout, bool use_mask)

The multi-head attention.

The operators inside guarantee that multiheadAttention calls with the same embed_dim, num_heads, Wo, dropout and use_mask will be executed in batch. For example, the multiheadAttention calls in the same layer and mini-batch will commonly be executed in batch.

Parameters
  • Q – The query matrix before being divided into multiple heads. Its size can be different from those of K and V, but should be divisible by embed_dim.

  • K – The key matrix before being divided into multiple heads. Its size should be equal to V's and be divisible by embed_dim.

  • V – The value matrix before being divided into multiple heads. Its size should be equal to K's and be divisible by embed_dim. Note that multiheadAttention assumes Q, K and V have already been linearly transformed before being passed, so they will not be transformed again; as such, multiheadAttention does not accept the weight matrices Wq, Wk and Wv.

  • embed_dim – The row number of Q, K and V. It should be divisible by num_heads.

  • num_heads – The head number.

  • Wo – The weight matrix of the output linear transformation.

  • dropout – The dropout value of the dropout following the output linear transformation.

  • use_mask – Whether to mask future tokens in K, which is typically used in the Transformer decoder’s self-attention. For the moment, user-defined masks are not supported yet.

Returns

The result matrix. Its size is equal to Q.size().
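
For example, a sketch of masked self-attention over a packed embed_dim x len matrix h (wq, wk, wv and wo denote initialized LinearParams; Q, K and V are transformed beforehand, as noted above):

    Node *q = insnet::linear(*h, wq);
    Node *k = insnet::linear(*h, wk);
    Node *v = insnet::linear(*h, wv);
    // Masked multi-head self-attention with a dropout of 0.1.
    Node *y = insnet::multiheadAttention(*q, *k, *v, embed_dim, num_heads,
        wo, 0.1, true);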

Node *insnet::gru(Node &last_state, Node &input, GRUParams &params, dtype dropout)

Return the next GRU hidden state.

The operators inside guarantee that gru with the same params and dropout value will be executed in batch.

Parameters
  • last_state – The last hidden state.

  • input – The input vector.

  • params – The GRU parameters.

  • dropout – The dropout value. Dropout is applied to the returned result vector.

Returns

The next hidden state.

std::vector<Node*> insnet::gru(Node &initial_state, const std::vector<Node*> &inputs, GRUParams &params, dtype dropout)

Return GRU hidden states.

It is implemented using gru(Node &, Node &, GRUParams &, dtype).

Parameters
  • initial_state – The initial hidden state, commonly denoted as h_0.

  • inputs – The input vectors.

  • params – The GRU parameters.

  • dropout – The dropout value.

Returns

The hidden states.
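
For example, a sketch of encoding a sentence and max-pooling its hidden states (inputs denotes the sentence's embedding vectors; params denotes initialized GRUParams):

    Node *h0 = insnet::tensor(graph, hidden_dim, 0);
    std::vector<Node*> states = insnet::gru(*h0, inputs, params, 0.1);
    // cat with col = 1 concatenates the states into a hidden_dim x len
    // matrix, and maxPool reduces it column-wise to size hidden_dim.
    Node *sentence_rep = insnet::maxPool(*insnet::cat(states), hidden_dim);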

LSTMState insnet::lstm(LSTMState &last_state, Node &input, LSTMParams &params, dtype dropout)

Return the next LSTM hidden state, i.e., h_i and c_i.

The operators inside guarantee that lstm with the same params and dropout value will be executed in batch.

Parameters
  • last_state – The last LSTM state containing h_i and c_i.

  • input – The input vector.

  • params – The LSTM parameters.

  • dropout – The dropout value. Dropout is applied to the returned hidden vector.

Returns

The next LSTM state.

std::vector<Node*> insnet::lstm(LSTMState &initial_state, const std::vector<Node*> &inputs, LSTMParams &params, dtype dropout)

Return LSTM hidden states.

It is implemented using lstm(LSTMState &, Node &, LSTMParams &, dtype).

Parameters
  • initial_state – The initial state, commonly denoted as h_0 and c_0.

  • inputs – The input vectors.

  • params – The LSTM parameters.

  • dropout – The dropout value.

Returns

The hidden states.

std::vector<Node*> insnet::transformerDecoder(Node &encoder, Node &input, TransformerDecoderParams &params, dtype dropout_value)

The Transformer decoder. It uses the pre-layernorm version of the Transformer. See On Layer Normalization in the Transformer Architecture.

The operators inside guarantee that transformerDecoder with the same params and dropout will be executed in batch layer by layer.

Parameters
  • encoder – The encoder's output matrix.

  • input – The input matrix. Note that for the current version, the positional encoding is added inside transformerDecoder using sin and cos, which may lack flexibility.

  • params – The Transformer decoder parameters.

  • dropout_value – The dropout value. Dropout is applied after self-attention, cross-attention and the FFN, respectively.

Returns

The list of hidden matrices of each layer.

TransformerDecoderState insnet::transformerDecoder(TransformerDecoderState &state, const std::vector<Node*> &encoder_keys, const std::vector<Node*> &encoder_values, Node &input, TransformerDecoderParams &params, dtype dropout)

Return the next Transformer hidden state, i.e., n layers of key and value matrices (their column number is equal to the decoded length), where n is the layer number.

It exploits the previous state to compute the next, which is useful in beam search.

Parameters
  • state – The last state. In particular, if it is the initial state, it should contain a vector of n nullptrs, where n is the layer number.

  • encoder_keys – The encoder key matrices. Its size is equal to the layer number.

  • encoder_values – The encoder value matrices. Its size is equal to the layer number.

  • input – The decoder input vector. Its size is equal to hidden_dim, i.e., it does not contain previous inputs.

  • params – The Transformer decoder parameters.

  • dropout – The dropout value.

Returns

The next state which appends one column to the last state’s key and value matrices.

std::vector<Node*> insnet::transformerEncoder(Node &input, TransformerEncoderParams &params, dtype dropout)

The Transformer encoder. It uses the pre-layernorm version of the Transformer. See On Layer Normalization in the Transformer Architecture.

The operators inside guarantee that transformerEncoder with the same params and dropout will be executed in batch layer by layer.

Parameters
  • input – The input matrix. Note that for the current version, the positional encoding is added inside transformerEncoder using sin and cos, which may lack flexibility.

  • params – The Transformer encoder parameters.

  • dropout – The dropout value. The dropout is added after self-attention and FFN, respectively.

Returns

The list of hidden matrices of each layer.
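
For example, a sketch of running the encoder and taking the top layer (emb denotes the packed input embedding matrix; enc_params denotes initialized TransformerEncoderParams):

    std::vector<Node*> layers = insnet::transformerEncoder(*emb, enc_params, 0.1);
    Node *top = layers.back(); // the top layer's hidden matrix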

Loss Functions

float insnet::BCELoss(std::vector<Node*> &probs, const std::vector<std::vector<int>*> &answers, dtype factor)

The binary cross entropy loss.

It returns the loss and accumulates gradients to probs.

It will be executed eagerly.

Parameters
  • probs – The probability vectors. Note that for the current version they should all be vectors of the same size; we may change this to support matrices of varying sizes in the future.

  • answers – The answers. Their values should be either 0 or 1.

  • factor – The factor that the loss will be multiplied with.

Returns

The loss.

dtype insnet::KLDivLoss(std::vector<Node*> &probs, const std::vector<std::vector<dtype>*> &answers, dtype factor)

The KL divergence loss.

It returns the loss and accumulates gradients to probs.

It will be executed eagerly.

Parameters
  • probs – The probability vectors. Note that for the current version they should all be vectors of the same size; we may change this to support matrices of varying sizes in the future.

  • answers – The answers.

  • factor – The factor that the loss will be multiplied with.

Returns

The loss.

dtype insnet::NLLLoss(std::vector<Node*> &log_probs, int row, const std::vector<std::vector<int>> &answers, dtype factor)

The negative log likelihood loss.

It returns the loss and accumulates gradients to log_probs.

It will be executed eagerly.

Parameters
  • log_probs – The natural log probability matrices. Their sizes may vary but should all be divisible by row.

  • row – The row number of probability matrices.

  • answers – The answers. The inner vectors' sizes should match the column numbers of the log_probs matrices one by one.

  • factor – The factor that the loss will be multiplied with. Specifically, pass 1.0 if you want sum reduction, or 1.0 / n if you want average reduction, where n is the sum of answer sizes.

Returns

The loss.
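
For example, a sketch of computing an averaged loss over a mini-batch (logits holds one vocab_size x len matrix per sentence; answers holds the gold token ids; n is the sum of answer sizes):

    std::vector<Node*> log_probs;
    for (Node *logit : logits) {
        log_probs.push_back(insnet::logSoftmax(*logit, vocab_size));
    }
    // Pass 1.0 / n as the factor for average reduction.
    dtype loss = insnet::NLLLoss(log_probs, vocab_size, answers, 1.0 / n);
    // NLLLoss runs eagerly and accumulates gradients into log_probs.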