Benchmarking TornadoVM

Benchmarks

Currently the benchmark runner script can execute the following benchmarks:

* saxpy
* addImage
* stencil
* convolvearray
* convolveimage
* blackscholes
* montecarlo
* blurFilter
* euler
* renderTrack
* nbody
* sgemm
* dgemm
* mandelbrot
* dft

For each benchmark, a sequential Java version exists and serves as the reference for timing comparisons. All timing measurements are collected over a number of iterations (e.g., 130). Each benchmark can also be run with array sizes ranging from 256 to 16777216.
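As an illustration of measuring over many iterations, the following Python sketch times a workload repeatedly and summarizes the samples. The statistic names mirror the fields printed by the benchmark runner, but the formulas are assumptions (in particular, the use of the population standard deviation for the coefficient of variation); the runner's own implementation may differ. The `measure` and `summarize` helpers and the workload are hypothetical.

```python
import statistics
import time


def summarize(timings_ns):
    """Summarize per-iteration timings (nanoseconds) into a few statistics."""
    mean = statistics.fmean(timings_ns)
    return {
        "average": mean,
        "median": statistics.median(timings_ns),
        "firstIteration": timings_ns[0],
        "best": min(timings_ns),
        # Coefficient of variation in percent; population stdev is an assumption.
        "cv_percent": 100.0 * statistics.pstdev(timings_ns) / mean,
    }


def measure(fn, iterations=130):
    """Run fn repeatedly and record the wall-clock time of every iteration."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        timings.append(time.perf_counter_ns() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical workload standing in for a benchmark kernel.
    timings = measure(lambda: sum(x * 2.0 for x in range(10_000)), iterations=20)
    print(summarize(timings))
```

Keeping the first iteration as a separate statistic is useful because it typically includes one-off costs such as compilation and warm-up.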

How to run

Go to the directory <tornadovm path>/bin/sdk/bin. The available run options can be listed with the following command:

$ tornado-benchmarks.py --help
usage: tornado-benchmarks.py [-h] [--validate] [--default] [--medium]
                             [--iterations ITERATIONS] [--full]
                             [--skipSequential] [--skipParallel]
                             [--skipDevices SKIP_DEVICES] [--verbose]
                             [--printBenchmarks] [--jmh]

Tool to execute benchmarks in TornadoVM. With no options, it runs all
benchmarks with the default size

optional arguments:
  -h, --help            show this help message and exit
  --validate            Enable result validation
  --default             Run default benchmark configuration
  --medium              Run benchmarks with medium sizes
  --iterations ITERATIONS
                        Set the number of iterations
  --full                Run for all sizes in all devices. Including big data
                        sizes
  --skipSequential      Skip java version
  --skipParallel        Skip parallel version
  --skipDevices SKIP_DEVICES
                        Skip devices. Provide a list of devices (e.g., 0,1)
  --verbose, -V         Enable verbose
  --printBenchmarks     Print the list of available benchmarks
  --jmh                 Run with JMH
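The options above can be combined. When the runner is driven from another script, the command line could be assembled as in the following Python sketch. It uses only flags listed in the help text; the `build_command` and `run` helpers are hypothetical, and the sketch assumes `tornado-benchmarks.py` is on the `PATH`.

```python
import subprocess


def build_command(iterations=None, skip_devices=(), validate=False, medium=False):
    """Assemble a tornado-benchmarks.py command line from the documented flags."""
    cmd = ["tornado-benchmarks.py"]
    if medium:
        cmd.append("--medium")
    if validate:
        cmd.append("--validate")
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if skip_devices:
        # The help text shows a comma-separated device list, e.g. 0,1
        cmd += ["--skipDevices", ",".join(str(d) for d in skip_devices)]
    return cmd


def run(cmd):
    """Run the command and return its standard output as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Print the command instead of running it, since a full run takes a long time.
    print(" ".join(build_command(iterations=50, skip_devices=[0, 1], validate=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting issues.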

Example

Example of running all benchmarks on all devices available in your system with the default data size:

$ tornado-benchmarks.py
Running TornadoVM Benchmarks
[INFO] This process takes between 30-60 minutes
[INFO] TornadoVM options: -Xms24G -Xmx24G -server
bm=saxpy-101-16777216, id=java-reference      , average=7.604811e+06, median=7.521843e+06, firstIteration=1.179550e+07, best=7.355636e+06
bm=saxpy-101-16777216, device=0:0  , average=1.852340e+07, median=1.708197e+07, firstIteration=2.788138e+07, best=1.612269e+07, speedupAvg=0.4106, speedupMedian=0.4403, speedupFirstIteration=0.4231, CV=10.5305%, deviceName=NVIDIA CUDA -- GeForce GTX 1050
bm=saxpy-101-16777216, device=0:1  , average=4.503467e+07, median=4.482944e+07, firstIteration=6.696712e+07, best=4.236860e+07, speedupAvg=0.1689, speedupMedian=0.1678, speedupFirstIteration=0.1761, CV=4.7203%, deviceName=Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=saxpy-101-16777216, device=0:2  , average=2.212386e+07, median=2.129296e+07, firstIteration=3.493844e+07, best=1.975243e+07, speedupAvg=0.3437, speedupMedian=0.3533, speedupFirstIteration=0.3376, CV=7.5316%, deviceName=AMD Accelerated Parallel Processing -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=saxpy-101-16777216, device=0:3  , average=1.835022e+07, median=1.830117e+07, firstIteration=2.965289e+07, best=1.760201e+07, speedupAvg=0.4144, speedupMedian=0.4110, speedupFirstIteration=0.3978, CV=3.2015%, deviceName=Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO
bm=add-image-101-2048-2048, id=java-reference      , average=6.076920e+07, median=5.912435e+07, firstIteration=9.159228e+07, best=5.539140e+07
bm=add-image-101-2048-2048, device=0:0  , average=2.587469e+07, median=2.560709e+07, firstIteration=6.173938e+07, best=2.399116e+07, speedupAvg=2.3486, speedupMedian=2.3089, speedupFirstIteration=1.4835, CV=5.1914%, deviceName=NVIDIA CUDA -- GeForce GTX 1050
bm=add-image-101-2048-2048, device=0:1  , average=3.250553e+07, median=3.089569e+07, firstIteration=8.700214e+07, best=2.691534e+07, speedupAvg=1.8695, speedupMedian=1.9137, speedupFirstIteration=1.0528, CV=11.3154%, deviceName=Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=add-image-101-2048-2048, device=0:2  , average=3.061671e+07, median=3.037699e+07, firstIteration=7.024932e+07, best=2.742994e+07, speedupAvg=1.9848, speedupMedian=1.9464, speedupFirstIteration=1.3038, CV=4.3990%, deviceName=AMD Accelerated Parallel Processing -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=add-image-101-2048-2048, device=0:3  , average=2.564357e+07, median=2.512443e+07, firstIteration=6.052658e+07, best=2.316377e+07, speedupAvg=2.3698, speedupMedian=2.3533, speedupFirstIteration=1.5133, CV=4.9465%, deviceName=Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO
bm=stencil-101-1048576, id=java-reference      , average=1.841053e+05, median=1.885090e+05, firstIteration=4.734246e+06, best=1.636910e+05
bm=stencil-101-1048576, device=0:0  , average=1.862818e+05, median=1.863900e+05, firstIteration=8.547734e+06, best=1.672090e+05, speedupAvg=0.9883, speedupMedian=1.0114, speedupFirstIteration=0.5539, CV=13.9480%, deviceName=NVIDIA CUDA -- GeForce GTX 1050
bm=stencil-101-1048576, device=0:1  , average=1.323170e+05, median=1.272060e+05, firstIteration=7.506147e+06, best=1.057020e+05, speedupAvg=1.3914, speedupMedian=1.4819, speedupFirstIteration=0.6307, CV=12.2388%, deviceName=Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=stencil-101-1048576, device=0:2  , average=1.238349e+05, median=1.095310e+05, firstIteration=4.092201e+06, best=8.586900e+04, speedupAvg=1.4867, speedupMedian=1.7211, speedupFirstIteration=1.1569, CV=47.6368%, deviceName=AMD Accelerated Parallel Processing -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
bm=stencil-101-1048576, device=0:3  , average=2.464191e+05, median=2.296330e+05, firstIteration=4.807327e+06, best=2.218090e+05, speedupAvg=0.7471, speedupMedian=0.8209, speedupFirstIteration=0.9848, CV=12.3793%, deviceName=Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO
bm=convolve-array-100-2048-2048-5, id=java-reference      , average=2.612301e+08, median=2.609304e+08, firstIteration=4.006838e+08, best=2.544892e+08
bm=convolve-array-100-2048-2048-5, device=0:0  , average=8.143104e+06, median=8.214443e+06, firstIteration=1.811648e+07, best=7.609697e+06, speedupAvg=32.0799, speedupMedian=31.7648, speedupFirstIteration=22.1171, CV=4.6348%, deviceName=NVIDIA CUDA -- GeForce GTX 1050
bm=convolve-array-100-2048-2048-5, device=0:1  , average=9.842007e+07, median=9.631152e+07, firstIteration=1.018732e+08, best=9.032237e+07, speedupAvg=2.6542, speedupMedian=2.7092, speedupFirstIteration=3.9332, CV=9.3753%, deviceName=Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
...
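Each result line is a comma-separated list of key=value pairs, which makes the output straightforward to post-process. The Python sketch below parses such lines and recomputes the speedup of each device against the java-reference entry. In the sample output, the speedup fields match the reference time divided by the device time; this relationship is inferred from the numbers shown above, not from the runner's source. The helper names are hypothetical, and the parser assumes that values contain no commas.

```python
def parse_line(line):
    """Split one 'key=value, key=value' result line into a dictionary."""
    record = {}
    for field in line.split(","):
        key, sep, value = field.partition("=")
        if sep:
            record[key.strip()] = value.strip()
    return record


def speedups(lines):
    """Return (benchmark, device, speedup) tuples based on average times."""
    reference = {}
    results = []
    for line in lines:
        rec = parse_line(line)
        if "bm" not in rec:
            continue  # skip banner and [INFO] lines
        if rec.get("id") == "java-reference":
            reference[rec["bm"]] = float(rec["average"])
        elif "device" in rec and rec["bm"] in reference:
            # Assumed definition: reference time divided by device time.
            ratio = reference[rec["bm"]] / float(rec["average"])
            results.append((rec["bm"], rec["device"], ratio))
    return results


if __name__ == "__main__":
    sample = [
        "[INFO] TornadoVM options: -Xms24G -Xmx24G -server",
        "bm=saxpy-101-16777216, id=java-reference      , average=7.604811e+06",
        "bm=saxpy-101-16777216, device=0:0  , average=1.852340e+07",
    ]
    for bm, device, ratio in speedups(sample):
        print(f"{bm} on device {device}: speedup {ratio:.4f}")
```

A speedup below 1 means the device was slower than the Java reference for that benchmark and size.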

Using JMH

The tornado-benchmarks.py script can also run the benchmarks with JMH through the --jmh option:

$ tornado-benchmarks.py --jmh

This runs all benchmarks using JMH. The process takes approximately 3.5 hours.

Additionally, each benchmark has a JMH configuration. Users can execute any benchmark from the list as follows:

$ tornado -m tornado.benchmarks/uk.ac.manchester.tornado.benchmarks.<benchmark>.JMH<BENCHMARK>

This process takes approximately 10 minutes per benchmark.

For example:

$ tornado -m tornado.benchmarks/uk.ac.manchester.tornado.benchmarks.dft.JMHDFT
# JMH version: 1.23
...
Benchmark          Mode  Cnt   Score   Error  Units
JMHDFT.dftJava     avgt    5  19.736 ± 1.589   s/op
JMHDFT.dftTornado  avgt    5   0.155 ± 0.008   s/op
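The JMH result table can be post-processed in a similar way. The Python sketch below extracts the score of each benchmark method from the table and derives the speedup of the TornadoVM version over the Java version. The `parse_jmh_table` helper is hypothetical and assumes the seven-column row layout shown above (name, mode, count, score, ±, error, units).

```python
def parse_jmh_table(text):
    """Map each benchmark method name to its score from a JMH result table."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        # Result rows look like: name mode count score ± error units
        if len(parts) == 7 and parts[4] == "±":
            scores[parts[0]] = float(parts[3])
    return scores


if __name__ == "__main__":
    table = """\
Benchmark          Mode  Cnt   Score   Error  Units
JMHDFT.dftJava     avgt    5  19.736 ± 1.589   s/op
JMHDFT.dftTornado  avgt    5   0.155 ± 0.008   s/op
"""
    scores = parse_jmh_table(table)
    # In avgt mode the score is time per operation, so lower is better.
    speedup = scores["JMHDFT.dftJava"] / scores["JMHDFT.dftTornado"]
    print(f"speedup: {speedup:.1f}x")
```

For the table above this gives a speedup of roughly 127x, without accounting for the reported error margins.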