Dynamic Memory Management on GPUs with SYCL

By Russell Standish

Overload, 33(190):17-19, 23, December 2025


Programming graphical processing units can speed up code, but you may lose access to standard library features. Russell Standish investigates a cross-platform accelerator API: SYCL.

Dynamic memory allocation is not traditionally available in kernels running on graphical processing units (GPUs). This work aims to build on Ouroboros, an efficient dynamic memory management library for CUDA applications, by porting the code to SYCL, a cross-platform accelerator API. Since SYCL can be compiled to a CUDA backend, it is possible to compare the performance of the SYCL implementation with that of the original CUDA implementation, as well as test it on non-CUDA platforms such as Intel’s Xe graphics.

Introduction

Dynamic memory allocation is not traditionally part of the runtime environment of code running on accelerators, such as graphical processing units (GPUs), digital signal processors (DSPs) and field programmable gate arrays (FPGAs). Instead, memory is allocated by the host code in fixed sizes prior to launching the kernel code that runs on the accelerator. This works well for many applications with a fixed partitioning of space and time, such as finite element methods for partial differential equations, or algorithms based on matrix multiplication.

However, some applications, such as graph algorithms or agent-based models, require memory to be dynamically partitioned between the objects of the computation. Stopping the kernel, resizing memory allocations and relaunching is simply infeasible in those circumstances. Instead, what is needed is to preallocate a chunk of memory on the host to act as a heap, and to run a heap allocation algorithm on the accelerator device.

There are two main C++ APIs for programming the host and kernel code. CUDA [Luebke08] is an extension to standard C++, in which kernel code and host code are distinct. CUDA is developed by NVIDIA, and supports only GPUs manufactured by that company. SYCL [Reinders23], on the other hand, does not distinguish between host and kernel code: it is all standard C++. However, not all standard library features are available in a kernel context, which will be flagged by the compiler. Drivers are available for all major GPU types from Intel, AMD and NVIDIA, so SYCL is considered cross-platform.
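
To make the single-source model concrete, a minimal SYCL program (a sketch for illustration, not code from Ouroboros-SYCL) places the host-side setup and the kernel lambda in the same C++ translation unit:

  #include <sycl/sycl.hpp>

  int main() {
    sycl::queue q;                                  // host code: select a device
    int* data = sycl::malloc_shared<int>(1024, q);  // fixed-size allocation by the host
    q.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
      data[i] = int(i);                             // kernel code: plain C++
    }).wait();
    sycl::free(data, q);
  }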

At this point, it is worth mentioning a third option, also cross-platform, called OpenCL. It is an older C API, and consequently less feature-rich and more difficult to use than CUDA or SYCL.

CUDA gained dynamic memory allocation in 2009 [NVIDIA], but it is often considered slow and unreliable. Since then, a number of dynamic memory allocators have been proposed, mostly for CUDA, although one non-open-source OpenCL implementation exists [Spliet14]. Details of these can be found in a survey by Winter and Mlakar [Winter21]. The basic strategy is to divide the preallocated block into chunks of different sizes, each of which goes into a lock-free queue dedicated to that particular chunk size.
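
By way of illustration, such a size-class scheme might map a request to its queue along the following lines (a hypothetical sketch, with sizeClass and smallestChunk invented for this example, not taken from any of the surveyed allocators):

  // Hypothetical sketch: round a request up to the next power-of-two
  // size class; each class is served by its own lock-free queue.
  inline unsigned sizeClass(size_t bytes, size_t smallestChunk=16) {
    unsigned c=0;
    for (size_t s=smallestChunk; s<bytes; s<<=1) ++c;
    return c;  // index into the array of per-size queues
  }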

In the above survey, benchmarks indicate Ouroboros [Winter20], by the same authors, as being the most performant. Therefore this work took the CUDA Ouroboros code and translated it into SYCL, producing Ouroboros-SYCL1. Since SYCL has implementations that target CUDA, we can directly compare the resulting performance of Ouroboros-SYCL with the original Ouroboros code on the same NVIDIA hardware, as well as provide results on non-NVIDIA hardware (Intel Xe graphics).

Porting the code to SYCL

As a way of getting started, I attempted to use the automatic CUDA to SYCL translator, called SYCLomatic [Liang24]. As might be expected for a project with the complexity of Ouroboros, this tool failed to generate compilable SYCL code, but managed to convert perhaps half of the code into SYCL equivalents, leaving constructs it failed to convert in the original CUDA form, and annotating some other parts where it felt further attention from the programmer was needed to check that the conversion was valid. In the end, whilst the code conversion was a good start, pretty much all of the automatically converted code was modified manually. The first reason for this is that CUDA always uses a 3D layout of processing elements, whereas SYCL allows for the possibility of 1, 2 or 3D layouts, indicated by a template parameter. SYCLomatic always renders the translated code as 3D, to match the original source code; however, Ouroboros, being a library, needs to handle arbitrary layouts. Thus code to extract, for example, a processing element's rank (known as an id in SYCL) uses the get_global_linear_id() call, which returns a single number regardless of the dimensionality of the processing element layout.
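
For example, a rank-agnostic helper along these lines (a sketch of the approach rather than the exact Ouroboros-SYCL code) serves all three layouts:

  // Sketch: one template serves 1, 2 and 3D layouts, because
  // get_global_linear_id() flattens the coordinates into a single rank.
  template <int D>
  size_t globalRank(const sycl::nd_item<D>& item) {
    return item.get_global_linear_id();
  }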

In the process of getting the code to compile, SYCLomatic failed to convert atomic memory operations. SYCL has an atomic reference type that wraps a variable and allows a range of atomic operations, including the usual binary operation suspects (+, −, ∧, ∨, ⊕, min, max) and compare-and-swap. CUDA, by contrast, has a set of atomic library functions implementing the same operations. In the end, the simplest approach was to provide implementations of the CUDA library functions, written in terms of SYCL atomic reference types.
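
A minimal sketch of such a shim, assuming device scope and global address space (the shims in the actual repository may differ in detail), looks like this:

  // Sketch: CUDA-style atomicAdd on top of sycl::atomic_ref.
  // Returns the old value, as the CUDA function does.
  template <typename T>
  T atomicAdd(T* address, T val) {
    sycl::atomic_ref<T, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> ref(*address);
    return ref.fetch_add(val);
  }

  // Sketch: CUDA-style atomicCAS. compare_exchange_strong writes the
  // observed value back into expected on failure, so expected holds the
  // old value whether or not the exchange succeeded.
  template <typename T>
  T atomicCAS(T* address, T expected, T desired) {
    sycl::atomic_ref<T, sycl::memory_order::relaxed,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> ref(*address);
    ref.compare_exchange_strong(expected, desired);
    return expected;
  }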

The next issue was how to represent CUDA's global threadIdx and blockIdx variables. These refer to the coordinates (in CUDA's 3D space) of the thread within its block, and the coordinates of the block. In SYCL terminology, a CUDA block is a group. An nd_item object contains all the information about the current thread's rank within its group, and its group's rank within the global range. However, the nd_item object is not available via a global function, but must be passed as a parameter into the stack frame where it is needed. A similar issue relates to I/O: CUDA has a printf function callable in any kernel function, whereas the SYCL equivalent is a stream object, usable like std::cout, that must be created in command group scope and passed as a parameter to inner function scopes. In the Ouroboros-SYCL case, each class member function called in kernel code must take a template-parameterised parameter (Desc) that has an item (an nd_item of arbitrary rank) and an out object that supports operator<< serialisation. Being a template parameter, users of the library can choose to define the rank of item, and use a sycl::stream or a dummy out object as appropriate.
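
An outline of this pattern might look as follows (illustrative names only; this Desc is a sketch, and the real declarations are in the repository):

  // Sketch of the Desc pattern: bundle the nd_item and an output
  // object, and pass it down through the kernel call chain.
  template <int D, typename Out>
  struct Desc {
    sycl::nd_item<D> item;  // thread and group coordinates
    Out out;                // sycl::stream, or a dummy that swallows <<
  };

  struct DummyStream {      // no-op replacement when debug output is off
    template <typename T>
    const DummyStream& operator<<(const T&) const { return *this; }
  };

  template <class D>
  void someMemberFunction(const D& desc) {
    desc.out << "thread " << desc.item.get_global_linear_id() << "\n";
  }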

It should be noted that Intel's oneAPI SYCL compiler (called DPC++) also provides an experimental free function get_nd_item(), and an experimental free function printf(), that can be used for this purpose. These functions are proposed for a future SYCL standard.

Another point about the sycl::stream object is that it buffers the string data written to it, and the message is only written to the console when the stream object goes out of scope. Unfortunately, if the problem being diagnosed is a deadlock, or a crash, the stream object never goes out of scope, so any helpful debug messages written by way of this object will not be seen – a frustrating exercise indeed.

CUDA compute capability 7 introduced a nanosleep function for pausing threads for a specified period of time. Ouroboros uses this function to throttle threads demanding to allocate memory, so that other threads freeing memory can catch up. This would indeed be a useful optimisation technique, but the function is unavailable in the SYCL programming environment. Instead, all we can do is perform an atomic_fence(), which ensures that other threads catch up to the fence.
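
So, in the port, each point where the CUDA code slept becomes a fence, roughly as follows (the memory order and scope shown are an assumption for illustration):

  // Sketch: where the CUDA code called __nanosleep(ns), the SYCL port
  // issues a fence instead; order and scope here are assumptions.
  sycl::atomic_fence(sycl::memory_order::acq_rel,
                     sycl::memory_scope::device);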

The final difficulty in converting the Ouroboros CUDA code had to do with the use of warp vote functions, which allow multiple allocations to occur within one warp. In SYCL terminology, a warp is known as a subgroup, and corresponding group reduction algorithms exist, applied to subgroups, that are the equivalent of the warp functions. Unfortunately, there is an issue: the CUDA functions take a mask, so that threads not participating in the group operation can be masked out by passing the result of __activemask().

In SYCL, there is no real way of obtaining the active mask, because, according to the standard, group operations block until all threads in the group call the group operation function.

It would be more useful if, instead of requiring all threads within a group to participate, only the active threads were required to call the group operation. Were that the case, it would be possible to obtain the active mask by means of the following code:

  auto sg=i.get_sub_group();      // i is the enclosing nd_item
  // Each active thread contributes the bit at its position within the
  // subgroup; OR-reducing over the subgroup yields the active mask.
  auto activeMask=sycl::reduce_over_group(
    sg,
    1ULL<<sg.get_local_linear_id(),
    sycl::bit_or<>()
  );

Interestingly, when run on an Intel GPU, or on the CPU, this code runs as expected, and generates the active mask. But when run on an NVIDIA GPU, this code deadlocks, both with Intel’s oneAPI, and with the AdaptiveCpp compiler, unless all threads in the subgroup are active.

Methods

Ouroboros comes with a driver program for each of the six alternative heap algorithms: chunk, page, virtual array chunk, virtual array page, virtual list chunk and virtual list page. Arguments passed to the driver program specify the data size to be allocated, and the number of allocations to be performed in parallel. The program then iterates ten times through allocating memory, writing some data, checking that the data is correct when read back, and then freeing the memory. The average time for performing the allocations and frees is calculated.

In this work, one additional change was made to the original code, aside from a trivial change reducing the total amount of heap space requested in order to fit on the devices available to the author. SYCL implementations typically compile the kernel code into an intermediate representation, such as SPIR-V [Kessenich18], and transpiling this into the native machine code of the accelerator occurs in a just-in-time (JIT) fashion when the kernel is first launched. As a result, there can be a big disparity between the time recorded for the first iteration and the times recorded for subsequent iterations. So the code was modified to report both the average over all iterations and the average over all but the first iteration (ie subsequent iterations). This allows a more apples-to-apples comparison between the CUDA implementation and the SYCL implementation.
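
The reporting change amounts to something like the following sketch, where timeOneRound() is a hypothetical stand-in for the driver's alloc/write/check/free cycle:

  // Sketch: accumulate both averages, so the JIT-compilation cost
  // carried by iteration 0 can be excluded from the second figure.
  double total=0, afterFirst=0;
  for (int i=0; i<iterations; ++i) {
    double t=timeOneRound();   // hypothetical timing helper
    total+=t;
    if (i>0) afterFirst+=t;
  }
  double avgAll=total/iterations;
  double avgSubsequent=afterFirst/(iterations-1);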

The actual code can be found in the Ouroboros-SYCL GitHub repository2. The SYCL code is available in the master branch, and the original optimised CUDA code in the cuda-ouroboros branch.

Hardware:

  1. Dell Precision 7540 laptop with i9-9880H CPU @ 2.3GHz and NVIDIA Quadro T2000 GPU.
  2. Asus NUC 13, i5-1340P CPU with integrated Iris Xe graphics GPU.

Software:

  1. Intel oneAPI 2025.1 (icpx compiler)
  2. Codeplay’s oneAPI for NVIDIA GPUs plugin
  3. CUDA 12.8
  4. Adaptive C++, compiled from source code3, commit f336ab84. Adaptive C++ was previously known as HipSYCL.

After running cmake, it is necessary to set the compiler manually using ccmake: for oneAPI the compiler is icpx, and you need to add the option -fsycl, as well as, for NVIDIA use, the option -fsycl-targets=nvptx64-nvidia-cuda. For Adaptive C++, the compiler is acpp, which doesn't require any special command line flags.

The use of these compilers allows the comparison of the translated code with the original code, running on the same hardware. Adaptive C++ targets CUDA's PTX machine, so is closer to what the CUDA compiler nvcc produces. However, it does the final conversion of intermediate code to PTX in a just-in-time fashion, so for a proper comparison we should compare only the alloc/free times measured after the first iteration. Similarly, Codeplay's plugin performs JIT compilation of intermediate code to PTX.

The optimised Ouroboros code has a few instances of embedded PTX code, also making use of nanosleep(), and of the ability to mask warp voting functions by the active mask. To make the comparison with the SYCL versions fair, I created a deoptimised version, with the embedded PTX replaced by high-level code equivalents, nanosleep replaced by an atomic_fence, and the code using warp functions replaced by the simplified code used in the SYCL versions. This code can be found in the deoptimised branch of the Ouroboros-SYCL repository.

Results

In interpreting the algorithm results, it should be noted that the heap is divided into chunks of different sizes, and the allocation requests are served as pages from within each chunk.

Note that the Adaptive C++ compiled code would struggle as the number of threads increased, with loops timing out or becoming deadlocked.

The raw results files are available in the supplementary materials [Standish25], along with a Ravel4 file with the data loaded, to assist in the data analysis.

Page allocator

The simplest allocator is the page-based allocator, where pages of fixed size are allocated from a queue. Total heap memory is divided amongst the queues, each queue managing a different page size. Being the simplest allocator, it is also the fastest, but it suffers more from fragmentation than the other, more sophisticated schemes. Figure 1 shows the average subsequent timings of allocations as a function of allocation size when 1024 threads are attempting to allocate simultaneously, and the timings as a function of the number of threads simultaneously attempting to allocate 1000 bytes.

Figure 1

The performance of the SYCL code ends up being about half that of the CUDA code. Interestingly, the attempt to deoptimise the CUDA code to make it more comparable to the SYCL version, if anything, only made it more performant.

Chunk allocator

The chunk allocator maintains queues of chunks that have free pages, first obtaining a chunk index, then scanning the chunk for free pages. It is a more complex algorithm, but queue sizes are smaller.

Figure 2 shows the average time to allocate memory for different allocation sizes. The allocator is implemented as a linked list of chunk queues, each queue managing chunks sized according to powers of two. You can see the effect of having to walk through this linked list as the chunk size increases. On the right, you can see the effect of thread contention as more threads attempt to allocate chunks simultaneously. We can conclude from these figures that not only does the SYCL version work (data is written to the allocated chunks and checked), but that its performance is broadly in line with the original Ouroboros implementation when run on the same hardware.

Figure 2

Virtualised array and list allocators

Ouroboros also introduces virtual queues, which reduce queue sizes even further. Figures 3 through 6 show the equivalent results for the virtualised versions of the page and chunk allocators.

Figure 3
Figure 4
Figure 5
Figure 6

Conclusion

The results indicate that the conversion of Ouroboros's CUDA-based code into SYCL was successful, coming within a factor of two of the performance of the original code for the faster page-based algorithms, and within statistical noise of the performance of the chunk-based algorithms, using Intel's oneAPI toolset. Adaptive C++ unfortunately suffered from timeouts and deadlocks, which may limit the use of this code with that compiler. As it hasn't yet fully implemented the SYCL 2020 standard, perhaps this is just a matter of time.

The exercise also highlighted some deficiencies of SYCL with respect to CUDA: in particular, the need for global access to a thread's nd_item, for a global printf function for debugging purposes (both of these are proposed as experimental additions to SYCL in the oneAPI toolset), and for group reduction algorithms that can be masked to the active threads only.

References

[Kessenich18] John Kessenich, Boaz Ouriel, and Raun Krisch. SPIR-V specification. Khronos Group, 3:17, 2018.

[Liang24] Wentao Liang, Norihisa Fujita, Ryohei Kobayashi, and Taisuke Boku. Using SYCLomatic to migrate CUDA code to oneAPI adapting NVIDIA GPU. In 2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), pages 192–193. IEEE, 2024.

[Luebke08] David Luebke. CUDA: Scalable parallel programming for high-performance scientific computing. In 2008 5th IEEE international symposium on biomedical imaging: from nano to macro, pages 836–838. IEEE, 2008.

[NVIDIA] NVIDIA. CUDA C++ programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide (accessed 1st April 2025).

[Reinders23] James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, and Xinmin Tian. Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL. Springer Nature, 2023.

[Spliet14] Roy Spliet, Lee Howes, Benedict R Gaster, and Ana Lucia Varbanescu. KMA: A dynamic memory manager for OpenCL. In Proceedings of Workshop on General Purpose Processing Using GPUs, pages 9–18, 2014.

[Standish25] Russell K. Standish. Ouroboros-SYCL. https://osf.io/2zwrt/, 2025.

[Winter20] Martin Winter, Daniel Mlakar, Mathias Parger, and Markus Steinberger. Ouroboros: virtualized queues for dynamic memory management on GPUs. In Proceedings of the 34th ACM International Conference on Supercomputing, pages 1–12, 2020.

[Winter21] Martin Winter, Mathias Parger, Daniel Mlakar, and Markus Steinberger. Are dynamic memory managers on GPUs slow? A survey and benchmarks. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 219–233, 2021.

Footnotes

  1. https://github.com/highperformancecoder/Ouroboros-SYCL
  2. https://github.com/highperformancecoder/Ouroboros-SYCL
  3. https://github.com/AdaptiveCpp/AdaptiveCpp
  4. Ravel is a revolutionary product for interactively analysing multidimensional data, available from https://ravelation.net

Russell Standish
Russell gained a PhD in Theoretical Physics, and has had a long career in computational science and high performance computing. Currently, he operates a consultancy specialising in computational science and HPC, with a range of clients from academia and the private sector.





