Hybrid API: Native Library Tasks
The hybrid API lets a TaskGraph mix JIT-compiled Java tasks with calls
into vendor-optimized native libraries (e.g., NVIDIA cuBLAS). Library tasks
share TornadoVM-managed device buffers with regular tasks, so data produced by
a JIT kernel can feed a library call (and vice versa) without extra copies or
special memory management.
TaskGraph taskGraph = new TaskGraph("cuBLAS")
    .transferToDevice(DataTransferMode.EVERY_EXECUTION, matrix, vector)
    .task("preprocess", MyClass::preprocess, matrix)    // JIT-compiled kernel
    .libraryTask("sgemv", CuBlas::cublasSgemv,          // native cuBLAS call
        CuBlasOperation.CUBLAS_OP_T.operation(),
        m, n, alpha, matrix, lda, vector, incx, beta, output, incy)
    .task("postprocess", MyClass::postprocess, output)  // JIT-compiled kernel
    .transferToHost(DataTransferMode.EVERY_EXECUTION, output);
Build TornadoVM with the CUDA backend (make BACKEND=cuda) to run library
tasks. Examples live in the tornado-cublas module:
tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemv
tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemvBeta
tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemm
tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemvWithTornadoVMTasksPOST
Performance
BenchmarkSgemm compares the TornadoVM JIT-generated matrix-multiply kernel
against a cuBLAS SGEMM library task on the same device buffers:
tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.BenchmarkSgemm 2048 50
Reference numbers on an NVIDIA GeForce RTX 4090 (FP32, CUDA 12.6): the cuBLAS library task reaches 24 / 46 / 51 TFLOP/s at sizes 1024 / 2048 / 4096, a 6-10x speedup over the JIT-generated kernel (~4-5 TFLOP/s) with identical results.
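For context, throughput figures of this kind are conventionally derived from the operation count of a dense matrix multiply (2 * N^3 floating-point operations for square N x N operands) divided by the elapsed time. The helper below is a minimal sketch of that arithmetic. The class and method names are hypothetical, and it is an assumption that BenchmarkSgemm counts operations the same way.

```java
/** Hypothetical helper, not part of TornadoVM: converts an SGEMM timing into TFLOP/s. */
final class SgemmThroughput {

    /** FLOP count of C = A * B for square N x N matrices: N^3 multiplies plus N^3 adds. */
    static double flops(int n) {
        return 2.0 * n * n * n;
    }

    /** Throughput in TFLOP/s for a given matrix size and elapsed time in nanoseconds. */
    static double teraFlops(int n, double elapsedNanos) {
        // One FLOP per nanosecond equals one GFLOP/s; divide by 1000 to get TFLOP/s.
        return flops(n) / elapsedNanos / 1_000.0;
    }

    public static void main(String[] args) {
        // Illustrative input only: a 2048 x 2048 SGEMM that takes 0.5 ms (500,000 ns).
        System.out.printf("%.2f TFLOP/s%n", teraFlops(2048, 500_000.0));
    }
}
```

The input in main is an illustrative timing, not a measurement from the benchmark.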
With --enableProfiler console, library tasks report TASK_KERNEL_TIME
(host-timed, bounded by stream markers) together with BACKEND, DEVICE
and METHOD, alongside regular tasks.
How it works
A library task is a SchedulableTask without a sketch (like a pre-built
task): its per-argument Access[] comes from the library binding, and it
flows through the standard data-flow graph and bytecode pipeline. The regular
ALLOC / TRANSFER / LAUNCH bytecodes manage its buffers and
dependencies; at LAUNCH the interpreter resolves each argument to the raw
device pointer of its TornadoVM buffer (past the array header) and dispatches
the call through a library provider instead of launching a kernel.
The native library runs on the same CUDA stream as the backend’s kernels
and transfers (the provider binds its handle with cublasSetStream), so
ordering with surrounding tasks is automatic and no host synchronization is
introduced.
Because library calls share the backend stream, they are also CUDA Graph
compatible: with executionPlan.withCUDAGraph() the library call is
recorded into the captured graph together with the surrounding kernels and
transfers, and replayed with a single cuGraphLaunch on subsequent
executions (see TestCuBlasSgemvWithTasksCudaGraph). Native contexts are
created in the pre-compilation pass, before capture starts, since handle
creation allocates device memory; per-call profiler timing is disabled while
capturing.
Provider SPI
Library bindings are discovered with java.util.ServiceLoader — adding
a new library requires no changes to the core runtime:
1. Implement uk.ac.manchester.tornado.runtime.library.spi.TornadoLibraryProvider
   (see CuBlasLibraryProvider in tornado-cublas): create a native context per
   (device, execution plan), dispatch function calls from a LibraryInvocation
   (resolved device pointers + boxed scalars), and release the context when
   the execution plan closes.
2. Declare it in the module descriptor:
   provides TornadoLibraryProvider with MyProvider;
3. Expose user-facing factory methods that build a LibraryTaskDescriptor
   (library name, function name, parameters, accesses), following
   CuBlas.cublasSgemv.
4. Add a JNI module for the native binding (see tornado-drivers/cublas-jni),
   linked under the cuda-backend Maven profile.
Backends expose their native stream to providers through
TornadoNativeStreamSupport (implemented by the CUDA backend). A provider's
canHandle(device) method rejects devices without native-stream interop.
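The discovery-and-filter pattern described in this section can be sketched in standalone Java. Everything below is hypothetical: DemoLibraryProvider is a simplified stand-in, not the real TornadoLibraryProvider interface, and backends are represented by plain strings rather than device objects. The sketch only illustrates how java.util.ServiceLoader lookup combines with a canHandle-style check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

/** Standalone sketch of ServiceLoader-based discovery; all types here are hypothetical. */
final class ProviderLookup {

    /** Simplified stand-in for a provider interface, NOT the real TornadoLibraryProvider. */
    interface DemoLibraryProvider {
        String libraryName();
        boolean canHandle(String backendName);
    }

    /** Toy provider that supports exactly one backend. */
    static final class FixedProvider implements DemoLibraryProvider {
        private final String library;
        private final String backend;

        FixedProvider(String library, String backend) {
            this.library = library;
            this.backend = backend;
        }

        @Override public String libraryName() { return library; }

        @Override public boolean canHandle(String backendName) { return backend.equals(backendName); }
    }

    /** Collects every implementation registered for the interface with ServiceLoader. */
    static List<DemoLibraryProvider> discover() {
        List<DemoLibraryProvider> found = new ArrayList<>();
        for (DemoLibraryProvider provider : ServiceLoader.load(DemoLibraryProvider.class)) {
            found.add(provider);
        }
        return found;
    }

    /** Picks the first provider for the library that accepts the backend, if any. */
    static Optional<DemoLibraryProvider> select(List<DemoLibraryProvider> providers,
                                                String library, String backend) {
        return providers.stream()
                .filter(p -> p.libraryName().equals(library))
                .filter(p -> p.canHandle(backend))
                .findFirst();
    }

    public static void main(String[] args) {
        // Nothing is registered in this standalone file, so an explicit list is used here;
        // in a modular build, discover() would return the registered providers instead.
        List<DemoLibraryProvider> providers = List.of(new FixedProvider("cublas", "cuda"));
        System.out.println(select(providers, "cublas", "cuda").isPresent());   // true
        System.out.println(select(providers, "cublas", "opencl").isPresent()); // false
    }
}
```

Because implementations are looked up at run time, a new binding only has to be registered as a service; the lookup code itself stays unchanged.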
Scope and roadmap
Currently supported: FP32 cublasSgemv and cublasSgemm on the
CUDA backend. When beta != 0 the output operand is also read by cuBLAS;
the binding marks it READ_WRITE automatically (include it in
transferToDevice if its initial values come from the host). Batch
processing (withBatch) is not supported for library tasks.
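The access rule for the output operand follows from the BLAS definition y = alpha * op(A) * x + beta * y: the previous contents of y matter only when beta is non-zero. A minimal sketch of that decision is shown below. The enum and method are hypothetical stand-ins rather than the binding's actual code, and treating the output as write-only when beta == 0 is an assumption.

```java
/** Sketch of the access rule for the SGEMV/SGEMM output operand; names are hypothetical. */
final class OutputAccessSketch {

    /** Stand-in for an access-mode enum; not the TornadoVM type. */
    enum Access { READ_ONLY, WRITE_ONLY, READ_WRITE }

    static Access outputAccess(float beta) {
        // The old output values are read only when beta is non-zero.
        // WRITE_ONLY for beta == 0 is an assumption of this sketch.
        return beta != 0f ? Access.READ_WRITE : Access.WRITE_ONLY;
    }

    public static void main(String[] args) {
        System.out.println(outputAccess(0.0f)); // WRITE_ONLY
        System.out.println(outputAccess(0.5f)); // READ_WRITE
    }
}
```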
The same SPI accommodates other host-API libraries (cuBLASLt, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuTENSOR, NCCL): each is a module pair implementing the provider interface, with per-(plan, device) contexts for cached descriptors and plans, and a workspace-allocation hook planned for libraries that need scratch buffers. Header-only device libraries (CUB, CUTLASS/CuTe) are a different integration track — they plug into the CUDA-C backend’s NVRTC compilation, not the library-task path.
Note that cuBLAS assumes column-major storage: for row-major TornadoVM arrays pass the transpose operation (SGEMV) or swap operands (SGEMM), as in the example tests.
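The transpose trick can be checked on the CPU. A row-major m x n array, read with column-major indexing, is the n x m transpose of the intended matrix, so requesting the transpose operation with swapped dimensions and lda = n recovers y = A * x. The code below is a plain-Java reference under those assumptions. sgemvColumnMajor is a hypothetical helper that mimics BLAS SGEMV semantics with unit strides; it is not a TornadoVM or cuBLAS API.

```java
import java.util.Arrays;

/** CPU sketch of calling a column-major SGEMV on row-major data; names are hypothetical. */
final class ColumnMajorDemo {

    /**
     * BLAS-style SGEMV on a column-major matrix with `rows` x `cols` elements and
     * leading dimension lda: y = alpha * op(A) * x + beta * y (unit strides only).
     */
    static void sgemvColumnMajor(boolean transpose, int rows, int cols, float alpha,
                                 float[] a, int lda, float[] x, float beta, float[] y) {
        int outLen = transpose ? cols : rows;
        int inLen = transpose ? rows : cols;
        for (int i = 0; i < outLen; i++) {
            float sum = 0f;
            for (int j = 0; j < inLen; j++) {
                // Column-major layout: element (r, c) lives at a[r + c * lda].
                float element = transpose ? a[j + i * lda] : a[i + j * lda];
                sum += element * x[j];
            }
            // With beta == 0 the previous contents of y are ignored.
            y[i] = (beta == 0f) ? alpha * sum : alpha * sum + beta * y[i];
        }
    }

    public static void main(String[] args) {
        // Row-major 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] and vector x = [1, 2, 3].
        float[] rowMajor = {1f, 2f, 3f, 4f, 5f, 6f};
        float[] x = {1f, 2f, 3f};
        float[] y = new float[2];
        // Seen as column-major, the same memory is a 3 x 2 matrix with lda = 3,
        // so the transpose operation yields the row-major product A * x.
        sgemvColumnMajor(true, 3, 2, 1f, rowMajor, 3, x, 0f, y);
        System.out.println(Arrays.toString(y)); // [14.0, 32.0]
    }
}
```

The SGEMM case rests on the same identity: a row-major product C = A * B occupies the same memory as the column-major product C^T = B^T * A^T, which is why swapping the operands is enough.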