Hybrid API: Native Library Tasks
================================

The hybrid API lets a :code:`TaskGraph` mix JIT-compiled Java tasks with calls
into vendor-optimized native libraries (e.g., NVIDIA cuBLAS). Library tasks
share TornadoVM-managed device buffers with regular tasks, so data produced by
a JIT kernel can feed a library call (and vice versa) without extra copies or
special memory management.

.. code:: java

   TaskGraph taskGraph = new TaskGraph("cuBLAS")
       .transferToDevice(DataTransferMode.EVERY_EXECUTION, matrix, vector)
       .task("preprocess", MyClass::preprocess, matrix)                    // JIT-compiled kernel
       .libraryTask("sgemv", CuBlas::cublasSgemv,                          // native cuBLAS call
               CuBlasOperation.CUBLAS_OP_T.operation(),
               m, n, alpha, matrix, lda, vector, incx, beta, output, incy)
       .task("postprocess", MyClass::postprocess, output)                  // JIT-compiled kernel
       .transferToHost(DataTransferMode.EVERY_EXECUTION, output);

The hybrid API requires the CUDA backend; build TornadoVM with
:code:`make BACKEND=cuda`. Runnable examples live in the ``tornado-cublas``
module:

.. code:: bash

   tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemv
   tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemvBeta
   tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemm
   tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.TestCuBlasSgemvWithTornadoVMTasksPOST

Performance
-----------

``BenchmarkSgemm`` compares the TornadoVM JIT-generated matrix-multiply kernel
against a cuBLAS SGEMM library task on the same device buffers:

.. code:: bash

   tornado -m tornado.cublas/uk.ac.manchester.tornado.cublas.tests.BenchmarkSgemm 2048 50

Reference numbers on an NVIDIA GeForce RTX 4090 (FP32, CUDA 12.6): the cuBLAS
library task reaches 24 / 46 / 51 TFLOP/s at matrix sizes 1024 / 2048 / 4096,
a 6-10x speedup over the JIT-generated kernel (~4-5 TFLOP/s). Both paths
produce identical results.
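
As a rough guide to reading these figures, the sketch below converts a square
SGEMM size and an elapsed time into TFLOP/s. It assumes the conventional count
of ``2 * N^3`` floating-point operations per multiply; whether
``BenchmarkSgemm`` uses exactly this formula is an assumption, and the class
and method names are hypothetical.

.. code:: java

   final class SgemmThroughput {

       /** Conventional FLOP count for C = A * B with square N x N matrices (assumption). */
       static double flops(int n) {
           // N^3 multiplies plus N^3 adds; computed in double to avoid int overflow
           return 2.0 * n * n * n;
       }

       /** Throughput in TFLOP/s for one multiply that took elapsedNanos nanoseconds. */
       static double teraflops(int n, double elapsedNanos) {
           // FLOP per nanosecond equals GFLOP/s; divide by 1000 for TFLOP/s
           return flops(n) / elapsedNanos / 1_000.0;
       }

       public static void main(String[] args) {
           // Hypothetical timing, not a measured value: one 2048^3 SGEMM in 0.4 ms
           System.out.printf("%.2f TFLOP/s%n", teraflops(2048, 400_000.0));
       }
   }
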

With :code:`--enableProfiler console`, library tasks appear in the profiler
output alongside regular tasks. Each one reports ``TASK_KERNEL_TIME``
(host-timed, bounded by stream markers) together with ``BACKEND``, ``DEVICE``
and ``METHOD``.

How it works
------------

A library task is a :code:`SchedulableTask` without a sketch (like a pre-built
task): its per-argument :code:`Access[]` comes from the library binding, and it
flows through the standard data-flow graph and bytecode pipeline. The regular
``ALLOC`` / ``TRANSFER`` / ``LAUNCH`` bytecodes manage its buffers and
dependencies; at ``LAUNCH`` the interpreter resolves each argument to the raw
device pointer of its TornadoVM buffer (past the array header) and dispatches
the call through a library provider instead of launching a kernel.
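
The argument-resolution step can be pictured with the following sketch. It is
not the interpreter's code: ``LaunchArgumentSketch``, the ``bufferBase`` map
and the ``headerBytes`` parameter are hypothetical stand-ins, and the header
size is left as a parameter because this page does not state it.

.. code:: java

   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   final class LaunchArgumentSketch {

       /**
        * Arguments that have a device buffer become the address of their payload
        * (buffer base plus header size); everything else is passed on as a boxed scalar.
        */
       static List<Object> resolve(List<Object> args, Map<Object, Long> bufferBase, long headerBytes) {
           List<Object> resolved = new ArrayList<>();
           for (Object arg : args) {
               Long base = bufferBase.get(arg);
               if (base != null) {
                   resolved.add(base + headerBytes); // skip the array header
               } else {
                   resolved.add(arg);                // scalar stays boxed
               }
           }
           return resolved;
       }
   }

Passing an ``IdentityHashMap`` for ``bufferBase`` keeps the lookup keyed on the
array object itself rather than on its contents.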

The native library runs on the **same CUDA stream** as the backend's kernels
and transfers (the provider binds its handle with :code:`cublasSetStream`), so
ordering with surrounding tasks is automatic and no host synchronization is
introduced.

Because library calls share the backend stream, they are also **CUDA Graph
compatible**: with :code:`executionPlan.withCUDAGraph()` the library call is
recorded into the captured graph together with the surrounding kernels and
transfers, and replayed with a single :code:`cuGraphLaunch` on subsequent
executions (see ``TestCuBlasSgemvWithTasksCudaGraph``). Native contexts are
created in the pre-compilation pass, before capture starts, because handle
creation allocates device memory. Per-call profiler timing is disabled while
capturing.

Provider SPI
------------

Library bindings are discovered with :code:`java.util.ServiceLoader`, so
adding a new library requires **no changes to the core runtime**:

1. Implement
   :code:`uk.ac.manchester.tornado.runtime.library.spi.TornadoLibraryProvider`
   (see ``CuBlasLibraryProvider`` in ``tornado-cublas``): create a native
   context per (device, execution plan), dispatch function calls from a
   :code:`LibraryInvocation` (resolved device pointers + boxed scalars), and
   release the context when the execution plan closes.
2. Declare it in the module descriptor:
   :code:`provides TornadoLibraryProvider with MyProvider;`
3. Expose user-facing factory methods that build a
   :code:`LibraryTaskDescriptor` (library name, function name, parameters,
   accesses), following ``CuBlas.cublasSgemv``.
4. Add a JNI module for the native binding (see
   ``tornado-drivers/cublas-jni``), linked under the ``cuda-backend`` Maven
   profile.

Backends expose their native stream to providers through
:code:`TornadoNativeStreamSupport` (implemented by the CUDA backend). A
provider's :code:`canHandle(device)` method rejects devices that lack
native-stream interop.
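
The discovery and filtering pattern described above can be sketched in plain
Java. ``SketchLibraryProvider`` is a hypothetical, much reduced stand-in for
``TornadoLibraryProvider`` (the real interface also creates contexts and
dispatches calls), and identifying a device by a string such as ``cuda:0`` is
an assumption made only for illustration.

.. code:: java

   import java.util.Optional;
   import java.util.ServiceLoader;

   /** Hypothetical, reduced provider interface; not the real SPI. */
   interface SketchLibraryProvider {
       String libraryName();

       boolean canHandle(String deviceName);
   }

   final class ProviderLookup {

       /** Returns the first provider for the named library that accepts the device. */
       static Optional<SketchLibraryProvider> find(Iterable<SketchLibraryProvider> providers,
                                                   String library, String device) {
           for (SketchLibraryProvider provider : providers) {
               if (provider.libraryName().equals(library) && provider.canHandle(device)) {
                   return Optional.of(provider);
               }
           }
           return Optional.empty();
       }

       public static void main(String[] args) {
           // ServiceLoader is Iterable; it yields only providers registered through
           // a module descriptor or META-INF/services, so none are found here.
           ServiceLoader<SketchLibraryProvider> discovered = ServiceLoader.load(SketchLibraryProvider.class);
           System.out.println(find(discovered, "cublas", "cuda:0").isPresent());
       }
   }

Because lookup goes through ``ServiceLoader``, the caller never names a
concrete provider class, which is what keeps the core runtime unchanged when a
new library module is added.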

Scope and roadmap
-----------------

Currently supported: FP32 :code:`cublasSgemv` and :code:`cublasSgemm` on the
CUDA backend. When :code:`beta != 0` the output operand is also read by cuBLAS;
the binding marks it ``READ_WRITE`` automatically (include it in
:code:`transferToDevice` if its initial values come from the host). Batch
processing (:code:`withBatch`) is not supported for library tasks.
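
The rule for the output operand can be summarized in a few lines. This is a
sketch, not the binding's code: the nested ``Access`` enum is a hypothetical
stand-in, and treating the output as write-only when :code:`beta == 0` is an
assumption (the text above only states the :code:`beta != 0` case).

.. code:: java

   final class OutputAccessSketch {

       /** Hypothetical access modes, named after the ones mentioned in the text. */
       enum Access { WRITE_ONLY, READ_WRITE }

       /** y = alpha * op(A) * x + beta * y reads the old y only when beta is non-zero. */
       static Access outputAccess(float beta) {
           return beta != 0.0f ? Access.READ_WRITE : Access.WRITE_ONLY;
       }
   }
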

The same SPI accommodates other host-API libraries (cuBLASLt, cuDNN, cuFFT,
cuSPARSE, cuSOLVER, cuTENSOR, NCCL): each is a module pair implementing the
provider interface, with per-(plan, device) contexts for cached descriptors
and plans, and a workspace-allocation hook planned for libraries that need
scratch buffers. Header-only device libraries (CUB, CUTLASS/CuTe) are a
different integration track: they plug into the CUDA-C backend's NVRTC
compilation, not the library-task path.

Note that cuBLAS assumes column-major storage. For row-major TornadoVM arrays,
pass the transpose operation (SGEMV) or swap the operands (SGEMM), as in the
example tests.
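
To see why the transpose operation works for SGEMV, the sketch below
implements a CPU reference of column-major SGEMV with unit strides and calls
it on a row-major matrix. A row-major ``rows x cols`` array, read as
column-major, is the ``cols x rows`` transpose, so requesting the transpose
operation with swapped dimensions recovers the intended product. The class and
method names are hypothetical, and this reference only mirrors the BLAS
indexing convention; it does not call cuBLAS.

.. code:: java

   import java.util.Arrays;

   final class ColumnMajorSketch {

       /** Reference column-major SGEMV: y = alpha * op(A) * x + beta * y, with A of size m x n. */
       static void sgemv(boolean transpose, int m, int n, float alpha, float[] a, int lda,
                         float[] x, float beta, float[] y) {
           int rowsOfOp = transpose ? n : m;
           int colsOfOp = transpose ? m : n;
           for (int i = 0; i < rowsOfOp; i++) {
               float sum = 0.0f;
               for (int j = 0; j < colsOfOp; j++) {
                   // Column-major: A(p, q) is stored at a[p + q * lda]
                   float element = transpose ? a[j + i * lda] : a[i + j * lda];
                   sum += element * x[j];
               }
               y[i] = alpha * sum + beta * y[i];
           }
       }

       /** Computes y = R * x for a row-major R (rows x cols) using the column-major routine. */
       static float[] rowMajorMatVec(float[] r, int rows, int cols, float[] x) {
           float[] y = new float[rows];
           // Seen as column-major, r is a (cols x rows) matrix with lda = cols
           sgemv(true, cols, rows, 1.0f, r, cols, x, 0.0f, y);
           return y;
       }

       public static void main(String[] args) {
           float[] r = {1f, 2f, 3f, 4f, 5f, 6f};   // [[1, 2, 3], [4, 5, 6]] in row-major order
           float[] y = rowMajorMatVec(r, 2, 3, new float[] {1f, 0f, 2f});
           System.out.println(Arrays.toString(y)); // prints [7.0, 16.0]
       }
   }
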