Benchmarks
This page presents benchmarks (Source Code) that measure InsNet's training speed and memory usage on Transformer-based seq2seq models under various model size settings.
Dataset
We use the IWSLT'15 English-Vietnamese neural machine translation dataset, whose sentence length statistics are as follows:

| | Length (mean and standard deviation) |
|---|---|
| source sentence | \(20.3\pm15.0\) |
| target sentence | \(24.8\pm18.8\) |
The source and target vocabulary sizes are 17191 and 7709, respectively.
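Statistics like these can be reproduced from the tokenized training files with a short script; the sketch below assumes whitespace-tokenized files named train.en and train.vi (hypothetical paths) and is only for reference.

```python
from statistics import mean, pstdev

def corpus_stats(path):
    """Compute sentence-length mean/std and vocabulary size for a tokenized file."""
    lengths, vocab = [], set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()      # assumes whitespace-tokenized text
            lengths.append(len(tokens))
            vocab.update(tokens)
    return mean(lengths), pstdev(lengths), len(vocab)

for name, path in [("source", "train.en"), ("target", "train.vi")]:
    m, s, v = corpus_stats(path)
    print(f"{name}: length {m:.1f} +/- {s:.1f}, vocab {v}")
```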
Model
We use two Transformer-based seq2seq models, namely BASE and LARGE, with the following settings:

| | enc layers | dec layers | hidden size | inner size | heads | dropout |
|---|---|---|---|---|---|---|
| BASE | 3 | 3 | 512 | 2048 | 8 | 0.1 |
| LARGE | 6 | 6 | 768 | 3072 | 12 | 0.1 |
For both models, we use the Adam optimizer.
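For concreteness, the BASE setting can be approximated directly with torch.nn.Transformer. The following is only a rough sketch for reference, not the implementation used in the benchmark; the embedding layers, output projection, and learning rate are assumptions, and positional encoding is omitted.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, HIDDEN = 17191, 7709, 512

# Rough PyTorch equivalent of the BASE setting:
# 3 encoder / 3 decoder layers, hidden 512, inner (feed-forward) 2048, 8 heads, dropout 0.1.
model = nn.Transformer(
    d_model=HIDDEN,
    nhead=8,
    num_encoder_layers=3,
    num_decoder_layers=3,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)
src_embed = nn.Embedding(SRC_VOCAB, HIDDEN)   # assumed embedding layers
tgt_embed = nn.Embedding(TGT_VOCAB, HIDDEN)
generator = nn.Linear(HIDDEN, TGT_VOCAB)      # assumed output projection

# Both models are trained with Adam; the learning rate here is an assumption.
params = (
    list(model.parameters())
    + list(src_embed.parameters())
    + list(tgt_embed.parameters())
    + list(generator.parameters())
)
optimizer = torch.optim.Adam(params, lr=1e-4)
```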
Environment
All benchmarks are conducted on a laptop with the following configuration:

| | |
|---|---|
| CPU Frequency | 2.6 GHz |
| GPU | RTX 2060 Laptop |
| GPU Memory | 6 GB |
| OS | Ubuntu 16.04 |
The complete GPU configuration is as follows:
Device 0: "GeForce RTX 2060"
CUDA Driver Version / Runtime Version 11.1 / 10.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 5935 MBytes (6222839808 bytes)
(30) Multiprocessors, ( 64) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
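The key figures above (device name, compute capability, global memory, multiprocessor count) can also be read from Python; the snippet below is merely a convenient cross-check and is not part of the benchmark code.

```python
import torch

# Print the key GPU properties of the benchmark environment.
props = torch.cuda.get_device_properties(0)
print(f"Device 0: {props.name}")
print(f"Compute capability: {props.major}.{props.minor}")
print(f"Global memory: {props.total_memory / 2**20:.0f} MiB")
print(f"Multiprocessors: {props.multi_processor_count}")
```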
PyTorch Implementation
We use OpenNMT as the PyTorch implementation for comparison.
Results
Training Throughput on GPU
BASE:

| batch_size | InsNet | PyTorch |
|---|---|---|
| 500 | 14885 | 12443 |
| 1000 | 16782 | 17822 |
| 1500 | 17401 | 19846 |
| 2000 | 17767 | 20953 |
| 2500 | 17952 | 21555 |
| 3000 | 17994 | 21912 |
| 4500 | 17885 | 22674 |
| 6000 | 17348 | 22474 |
LARGE:

| batch_size | InsNet | PyTorch |
|---|---|---|
| 500 | 4788 | 5901 |
| 1000 | 5235 | 5968 |
| 1500 | OOM | 6970 |
| 2000 | OOM | OOM |
| 2500 | OOM | OOM |
| 3000 | OOM | OOM |
| 4500 | OOM | OOM |
| 6000 | OOM | OOM |
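Assuming the throughput figures are tokens processed per second (the unit is not stated explicitly above), a typical way to measure such numbers is to time the training loop after a warm-up phase. The sketch below illustrates this approach; train_step and batch.num_tokens are placeholders, and this is not the exact measurement code used for these tables.

```python
import time

def measure_throughput(train_step, batches, warmup=10):
    """Measure training throughput as tokens processed per second.

    train_step(batch) is assumed to run one full optimizer step, and
    batch.num_tokens to give the number of tokens in the batch; both
    are placeholders for whatever the real training loop provides.
    """
    tokens = 0
    start = time.perf_counter()
    for i, batch in enumerate(batches):
        if i == warmup:
            start = time.perf_counter()  # restart the clock once warm-up is done
        train_step(batch)
        if i >= warmup:
            tokens += batch.num_tokens
    # For precise GPU timings, a device synchronization before reading the
    # clock would be needed; it is omitted here to keep the sketch generic.
    return tokens / (time.perf_counter() - start)
```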
Memory Usage on GPU (MB)
BASE:

| batch_size | InsNet | PyTorch |
|---|---|---|
| 500 | 1667 | 2397 |
| 1000 | 1907 | 2353 |
| 1500 | 2243 | 2551 |
| 2000 | 2165 | 3345 |
| 2500 | 2163 | 3754 |
| 3000 | 2850 | 3727 |
| 4500 | 3158 | 5284 |
| 6000 | 3255 | 5447 |
LARGE:

| batch_size | InsNet | PyTorch |
|---|---|---|
| 500 | 5169 | 5271 |
| 1000 | 5793 | 5277 |
| 1500 | OOM | 5813 |
| 2000 | OOM | OOM |
| 2500 | OOM | OOM |
| 3000 | OOM | OOM |
| 4500 | OOM | OOM |
| 6000 | OOM | OOM |
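For the PyTorch side, one way to obtain peak GPU memory figures is torch.cuda.max_memory_allocated; the sketch below shows this approach, though the numbers above may instead have been taken from nvidia-smi, which would also include allocator and CUDA context overhead.

```python
import torch

def peak_memory_mb(train_step, batches):
    """Run some training steps and report peak allocated GPU memory in MB.

    train_step is a placeholder for one optimizer step of the real model.
    """
    torch.cuda.reset_peak_memory_stats()
    for batch in batches:
        train_step(batch)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20
```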
The low memory usage can be attributed in part to InsNet's Padding-free Dynamic Batching feature.
We will illustrate InsNet's efficient memory management in detail in a future update.
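To see why padding-free batching saves memory on this dataset, compare how many tensor slots a conventional padded batch allocates with how many real tokens it contains. The sketch below is purely illustrative: it draws synthetic sentence lengths from the target-side statistics reported above, and the batch size of 64 sentences is an arbitrary choice.

```python
import random

random.seed(0)

# Synthetic sentence lengths roughly matching the reported target-side
# statistics (mean 24.8, std 18.8); real IWSLT'15 lengths would differ somewhat.
lengths = [max(1, int(random.gauss(24.8, 18.8))) for _ in range(3000)]

batch_size = 64  # sentences per batch, chosen only for illustration
padded, real = 0, 0
for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    padded += len(batch) * max(batch)  # padded batching allocates max-length slots
    real += sum(batch)                 # padding-free batching stores only real tokens

print(f"padded slots: {padded}, real tokens: {real}, waste: {1 - real / padded:.1%}")
```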