Evaluating different memory managers for dynamic GPU memory allocation.
The framework was tested on Windows 10, Arch Linux <5.9.9> as well as Manjaro <5.4>
- CUDA Toolkit
  - Tested on `10.1`, `10.2`, `11.0` and `11.1`
  - Windows Download
  - Arch Linux (`pacman -S cuda`)
- C++ Compiler
  - Tested on `gcc 9.0` and `gcc 10.2`
    - Arch Linux (`pacman -S gcc`)
  - Tested on `VS 2019`
- boost (required for ScatterAlloc)
  - Tested with boost `1.66` and `1.74`
  - Windows Download
    - Set the installed location in `BaseCMake.cmake`
  - Arch Linux (`pacman -S boost`)
- CMake
  - Version `>= 3.16`, tested with `3.18`
  - Windows Download
  - Arch Linux (`pacman -S cmake`)
- Python
  - Tested with `Python 3.8` and `Python 3.9`
  - Windows Download or download via Windows Store
  - Arch Linux (`pacman -S python`)
  - Requires packages
    - `argparse` (`python -m pip install argparse`)
- Make sure all requirements are installed and configured correctly
  - On `Windows` also set the correct boost path in `BaseCMake.cmake`
- Setup
  - Option A: Setup from Archive
    - Extract archive
    - In top-level directory, call `git submodule init` and `git submodule update`
  - Option B: Setup from GitHub
    - `git clone --recursive -b AEsubmission https://github.com/GPUPeople/GPUMemManSurvey.git`
- `python init.py`
- Install
  - On `Windows` use the `Developer PowerShell for VS 20XX` (`msbuild` is needed) to call the scripts
  - Option A:
    - If you want to build everything, call `python setupAll.py --cc XX`, set correct CC (tested with 61, 70 and 75)
  - Option B:
    - You can build each testcase separately, there is a `setup.py` in each tests folder: `python setup.py --cc XX`, set correct CC (tested with 61, 70 and 75)
- To clean/reset the build folders, simply call `python cleanAll.py`
  - Once again, there is a separate `clean.py` in every test subfolder
To run a representative testsuite, simply call

`python testAll.py -mem_size 8 -device 0 -runtest -genres`
- `-mem_size`: the memory size in GB
- `-device`: the device ID of the device to use (has to match the CC passed in the build stage)
These runtimes were measured for the limited testcase set configured in `testAll.py` on a TITAN V and an Intel Core i7-8700X, on Windows 10 and Manjaro respectively.
Task | Time (min:sec) - Linux | Time (min:sec) - Windows |
---|---|---|
Overall | 28 min 28 sec | 1 h 3 min 47 sec |
Build | 9 min 45 sec | 28 min 15 sec |
Test All | 18 min 43 sec | 35 min 32 sec |
- | - | - |
Allocation | 1 min 48 sec | 2 min 39 sec |
Mixed Allocation | 0 min 35 sec | 2 min 46 sec |
Scaling | 2 min 52 sec | 4 min 03 sec |
Fragmentation | 1 min 47 sec | 2 min 36 sec |
Out-of-Memory | 7 min 15 sec | 8 min 21 sec |
Graph Init | 0 min 11 sec | 1 min 44 sec |
Graph Update | 0 min 11 sec | 1 min 44 sec |
Graph Update Range | 0 min 11 sec | 1 min 44 sec |
Register Footprint | 0 min 03 sec | 0 min 05 sec |
Initialization | 0 min 05 sec | 0 min 12 sec |
Synthetic Workload | 1 min 58 sec | 4 min 50 sec |
Synthetic Workload Write | 1 min 57 sec | 4 min 48 sec |
The framework does not perform many sanity checks. If something is not working as expected, for example because a parameter was not configured correctly, please read the documentation first.
- `frameworks` -> includes code for all frameworks
- `externals` -> not all CUDA versions have CUB yet
- `include` / `src` / `scripts` -> framework code
- `tests` -> all test implementations
  - `alloc_test` -> all allocation tests
    - `test_allocation.py`
    - `test_mixed_allocation.py`
    - `test_scaling.py`
  - `frag_test` -> all memory/fragmentation tests
    - `test_fragmentation.py`
    - `test_oom.py`
  - `graph_test` -> all graph tests
    - `test_graph_init.py`
    - `test_graph_update.py`
  - `synth_test` -> all synthetic tests
    - `test_registers.py`
    - `test_synth_init.py`
    - `test_synth_workload.py`
Framework | Status | Paper | Code |
---|---|---|---|
CUDA Device Allocator | ✔️ | - | - |
XMalloc (2010) | ✔️ | Webpage | - |
ScatterAlloc (2012) | ✔️ | Webpage | GitHub - Repository |
FDGMalloc (2013) | ❓ | Webpage | Webpage |
Register Efficient (2014) | ✔️ | Webpage | Webpage |
Halloc (2014) | ✔️ | Presentation | GitHub - Repository |
DynaSOAr (2019) | ❌ | Webpage | GitHub - Repository |
Bulk-Semaphore (2019) | ❌ | Webpage | - |
Ouroboros (2020) | ✔️ | Paper | GitHub - Repository |
Each testcase is controlled and executed via Python scripts. Common to all scripts: to run the testcase, pass `-runtest`; to gather all results into one file, pass `-genres`.
Pass `-h` to print a help screen with all parameters.
All testcases take a `-device` parameter to control which device should execute the GPU code (e.g. `0`) and an `-allocsize` parameter to specify how much memory on this device should be reserved for the memory manager (size in GB).
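The shared flags described above could be parsed along the following lines. This is a minimal sketch, not the repository's actual code: the function name `build_parser` and the default values are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the shared flags; defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Sketch of a testcase driver")
    parser.add_argument("-t", type=str, default="o+s+h+c+r+x",
                        help="approaches to test, separated by '+'")
    parser.add_argument("-runtest", action="store_true",
                        help="execute the testcase")
    parser.add_argument("-genres", action="store_true",
                        help="gather all results into one file")
    parser.add_argument("-allocsize", type=int, default=8,
                        help="manageable memory per memory manager in GB")
    parser.add_argument("-device", type=int, default=0,
                        help="GPU device ID")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["-t", "o+s", "-runtest", "-allocsize", "8", "-device", "0"])
approaches = args.t.split("+")  # one letter per approach
print(approaches, args.runtest, args.genres, args.allocsize, args.device)
```

Passing `-h` to such a parser prints the generated help screen, since `argparse` adds that option by default.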
This table shows which test file can be used to generate which plot used in the paper.
Figure/Section | Script | Command |
---|---|---|
Sec. 4.1 | `test_registers.py` | `python test_registers.py -t o+s+h+c+r+x -runtest -genres -allocsize 8 -device 0` |
Sec. 4.1 | `test_synth_init.py` | `python test_synth_init.py -t o+s+h+c+r+x -runtest -genres -allocsize 8 -device 0` |
Fig. 9.a | `test_allocation.py` | `python test_allocation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0` |
Fig. 9.b | `test_allocation.py` | `python test_allocation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0` |
Fig. 9.c | `test_allocation.py` | `python test_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-8192 -iter 100 -runtest -genres -warp -timeout 120 -allocsize 8 -device 0` |
Fig. 9.d | `test_mixed_allocation.py` | `python test_mixed_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-8192 -iter 100 -runtest -genres -timeout 120 -allocsize 8 -device 0` |
Fig. 10.x | `test_scaling.py` | `python test_scaling.py -t o+s+h+c+r+x -byterange 4-8192 -threadrange 0-20 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0` |
Fig. 11.a | `test_fragmentation.py` | `python test_fragmentation.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -iter 100 -runtest -genres -timeout 60 -allocsize 8 -device 0` |
Fig. 11.b | `test_oom.py` | `python test_oom.py -t o+s+h+c+r+x -num 100000 -range 4-8192 -runtest -genres -timeout 3600 -allocsize 2 -device 0` |
Fig. 11.c | `test_synth_workload.py` | `python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-64 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0` |
Fig. 11.d | `test_synth_workload.py` | `python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-4096 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0` |
Fig. 11.e | `test_synth_workload.py` | `python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-20 -range 4-64 -iter 100 -runtest -genres -timeout 300 -allocsize 8 -device 0 -testwrite` |
Fig. 11.f | `test_graph_init.py` | `python test_graph_init.py -t o+s+h+c+r+x -configfile config_init.json -runtest -genres -timeout 600 -allocsize 8 -device 0` |
Fig. 11.g | `test_graph_update.py` | `python test_graph_update.py -t o+s+h+c+r+x -configfile config_update_range.json -runtest -genres -timeout 600 -allocsize 8 -device 0` |
To test single-threaded or single-warp performance, navigate to `tests/alloc_tests` and call the script `test_allocation.py`

`python test_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0`
- This will start `10000` threads, each of them will start by allocating `4` Bytes and then increase linearly up to `64` Bytes
This will generate one csv file for each approach with `mean`, `min`, `max` and `median` performance computed over the number of iterations.
To generate one file with all approaches already executed, pass option `-genres` instead of or in addition to `-runtest`.
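As an illustration of the four statistics named above, the following sketch reduces a list of per-iteration timings to `mean`, `min`, `max` and `median`. The function name `summarize` and the sample values are made up for this example; the actual scripts may compute and store these values differently.

```python
import statistics


def summarize(timings_ms):
    """Reduce per-iteration timings to the four reported statistics."""
    return {
        "mean": statistics.mean(timings_ms),
        "min": min(timings_ms),
        "max": max(timings_ms),
        "median": statistics.median(timings_ms),
    }


sample = [1.0, 2.0, 4.0, 9.0]  # made-up timings in milliseconds
summary = summarize(sample)
print(summary)
```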
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-num` | `10000` | How many threads/warps to start, e.g. `10000` |
`-range` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-iter` | `50` | How often to run the test and average over runs, e.g. `50` |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-warp` | | Pass this flag to start 1 warp per allocation instead of 1 thread per allocation |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
To test allocation performance when threads are allocating with different sizes (constrained by a maximum/minimum allocation size), navigate to `tests/alloc_tests` and call the script `test_mixed_allocation.py`

`python test_mixed_allocation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0`
- This will start `10000` threads, each of them will allocate in the range of `4-64` Bytes
This will generate one csv file for each approach with `mean`, `min`, `max` and `median` performance computed over the number of iterations.
To generate one file with all approaches already executed, pass option `-genres` instead of or in addition to `-runtest`.
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-num` | `10000` | How many threads/warps to start, e.g. `10000` |
`-range` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-iter` | `50` | How often to run the test and average over runs, e.g. `50` |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-warp` | | Pass this flag to start 1 warp per allocation instead of 1 thread per allocation |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
To test performance scaling over a changing number of threads, navigate to `tests/alloc_tests` and call the script `test_scaling.py`

`python test_scaling.py -t o+s+h+c+r+x -byterange 4-64 -threadrange 0-10 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0`
- This will start with `2⁰` threads up to `2¹⁰` threads, testing all powers of 2 in-between, and for each number of threads test the range `4-64` Bytes
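To make the exponent notation concrete, the sketch below expands a thread range such as `0-10` into the thread counts it stands for. The helper name `expand_thread_range` is hypothetical and only illustrates the interpretation described above.

```python
def expand_thread_range(spec):
    """Turn an exponent range such as '0-10' into [2**0, 2**1, ..., 2**10]."""
    low, high = (int(part) for part in spec.split("-"))
    return [2 ** exponent for exponent in range(low, high + 1)]


thread_counts = expand_thread_range("0-10")
print(thread_counts)
```

For `0-10` this yields eleven thread counts, from 1 up to 1024.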
This will generate one csv file for each approach and for each number of threads with `mean`, `min`, `max` and `median` performance computed over the number of iterations.
To generate one file with all approaches already executed, pass option `-genres` instead of or in addition to `-runtest`.
Can also be started with one warp per allocation by passing `-warp`.
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-threadrange` | `0-10` | The range of threads to test, given as a power of 2, e.g. `0-10` would test `2⁰`, `2¹`, ..., `2¹⁰` threads for the given `-byterange` |
`-byterange` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-iter` | `50` | How often to run the test and average over runs, e.g. `50` |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-warp` | | Pass this flag to start 1 warp per allocation instead of 1 thread per allocation |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
This testcase tests the fragmentation of the returned addresses of a given allocation by reporting the maximum address range returned by each allocating thread. It also tracks the static maximum over a number of iterations.
It continues to allocate and free a number of allocations for the number of `-iter` and returns those ranges.

`python test_fragmentation.py -t o+s+h+c+r+x -num 10000 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0`
- This will start `10000` threads, each of them will start by allocating `4` Bytes and then increase linearly up to `64` Bytes, reporting the current range and static maximum range
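One plausible reading of "address range" and "static maximum" is sketched below: the range is the distance between the lowest and highest address returned in one iteration, and the static maximum is the largest range seen so far. Both function names and the sample addresses are assumptions for illustration; the testcase itself may define the measure differently.

```python
def address_range(addresses):
    """Distance in bytes between the lowest and highest returned address."""
    return max(addresses) - min(addresses)


def track_ranges(iterations):
    """Return (current_range, static_max) for every iteration."""
    static_max = 0
    results = []
    for addresses in iterations:
        current = address_range(addresses)
        static_max = max(static_max, current)  # never shrinks across iterations
        results.append((current, static_max))
    return results


# Made-up addresses returned in three allocate/free iterations.
runs = [[0x1000, 0x1040, 0x1080], [0x1000, 0x2000], [0x1000, 0x1010]]
ranges = track_ranges(runs)
print(ranges)
```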
This will generate one csv file for each approach with `min address range`, `max address range`, `min address range (static)` and `max address range (max)`.
To generate one file with all approaches already executed, pass option `-genres` instead of or in addition to `-runtest`.
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-num` | `10000` | Starts `10000` threads |
`-range` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-iter` | `50` | How often to run the test and average over runs, e.g. `50` |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
Tests out-of-memory behavior for a range of allocation sizes, i.e. how efficiently the memory is utilized. The range will be sampled at each power of 2 within the given `-range`.

`python test_oom.py -t o+s+h+c+r+x -num 10000 -range 4-64 -runtest -timeout 60 -allocsize 8 -device 0`
- This starts `10000` allocating threads, tests powers of 2 in the range `4-64` and continues to allocate until out-of-memory is reported, recording the number of iterations in the csv file
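The sampling of the range and a rough upper bound on the iteration count can be sketched as follows. Both helpers are hypothetical; `ideal_iterations` assumes an allocator with no metadata overhead or fragmentation and 1 GB = 2³⁰ Bytes, so real memory managers will report fewer iterations.

```python
def powers_of_two(spec):
    """List all powers of 2 inside a byte range such as '4-64'."""
    low, high = (int(part) for part in spec.split("-"))
    size = 1
    while size < low:
        size *= 2
    sizes = []
    while size <= high:
        sizes.append(size)
        size *= 2
    return sizes


def ideal_iterations(pool_bytes, num_threads, alloc_bytes):
    """Upper bound on full allocation rounds, assuming zero allocator overhead."""
    return pool_bytes // (num_threads * alloc_bytes)


sizes = powers_of_two("4-64")
bounds = {size: ideal_iterations(8 * 2**30, 10000, size) for size in sizes}
print(sizes)
print(bounds)
```

Comparing the recorded number of successful iterations against such a bound gives an impression of how much of the manageable memory is actually usable.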
This will generate one csv file for each approach, recording the number of successful iterations.
To generate one file with all approaches already executed, pass option `-genres` instead of or in addition to `-runtest`.
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-num` | `10000` | Starts `10000` threads |
`-range` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
Graph testcases require a `config.json` file, which has the following parameters:

Parameter | Value-Example | Description |
---|---|---|
`iterations` | `10` | How many iterations to do, in which the graph is initialized anew, e.g. `10` |
`update_iterations` | `10` | How many edge update iterations to perform |
`batch_size` | `10000` | How many edges to insert each iteration |
`range` | `0` | If range is `0`, the edge sources are randomly distributed amongst the available vertices; if `> 0`, then updates will be focused on this smaller range, which is shifted over the graph `update_iterations` times |
`test_init` | `true` | If this is set to true, only initialization will be measured |
`verify` | `false` | If this is set to true, then each operation will be verified against a host dynamic graph -> takes quite a long time |
`realistic_deletion` | `false` | If this is set to `false`, the deletion operation will delete exactly the same edges that were introduced during the insertion operation. Otherwise, random edges will be selected from the graph |
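A config with the parameters listed above could be produced as in the following sketch. The parameter names and example values are taken from the table; the flat JSON layout is an assumption, so compare the result against the config files shipped with the tests before relying on it.

```python
import json

# Values copied from the "Value-Example" column; the flat layout is an assumption.
config = {
    "iterations": 10,
    "update_iterations": 10,
    "batch_size": 10000,
    "range": 0,
    "test_init": True,
    "verify": False,
    "realistic_deletion": False,
}

text = json.dumps(config, indent=4)
print(text)  # redirect or write this text to the config file of your choice
```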
The testcase can handle `.mtx` (Matrix Market Format) files, which can be downloaded from the SuiteSparse Collection, and will automatically convert each file into a more efficient binary format, which greatly improves load times for multiple runs.
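The idea of converting a text-based `.mtx` file into a compact binary form can be sketched as follows. The parser handles only the coordinate format and ignores edge values; the binary layout (three `uint32` header fields followed by `uint32` pairs) and all function names are hypothetical and do not describe the format the framework actually writes.

```python
import io
import struct


def read_mtx_edges(stream):
    """Parse a Matrix Market coordinate stream into (rows, cols, edges)."""
    header = None
    edges = []
    for line in stream:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skips comments and the %%MatrixMarket banner
        fields = line.split()
        if header is None:
            header = tuple(int(value) for value in fields[:3])  # rows, cols, nnz
            continue
        # Matrix Market indices are 1-based; store them 0-based.
        edges.append((int(fields[0]) - 1, int(fields[1]) - 1))
    rows, cols, _ = header
    return rows, cols, edges


def pack_edges(rows, cols, edges):
    """Hypothetical layout: rows, cols, edge count, then (src, dst) pairs."""
    blob = struct.pack("<III", rows, cols, len(edges))
    for src, dst in edges:
        blob += struct.pack("<II", src, dst)
    return blob


def unpack_edges(blob):
    """Inverse of pack_edges; no text parsing is needed on later runs."""
    rows, cols, count = struct.unpack_from("<III", blob, 0)
    edges = [struct.unpack_from("<II", blob, 12 + 8 * i) for i in range(count)]
    return rows, cols, edges


sample = "%%MatrixMarket matrix coordinate pattern general\n% comment\n3 3 2\n1 2\n3 1\n"
rows, cols, edges = read_mtx_edges(io.StringIO(sample))
blob = pack_edges(rows, cols, edges)
print(rows, cols, edges, len(blob))
```

Reading fixed-width binary records avoids repeated text parsing, which is where the load-time savings on subsequent runs would come from.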
This testcase will test dynamic graph initialization. One has to pass a config file as described above; the list of graphs to test is given at the top of `test_graph_init.py`.

`python test_graph_init.py -t o+s+h+c+r+x -configfile config_init.json -runtest -timeout 120 -allocsize 8 -device 0`
- Tests initialization performance for all graphs noted in `test_graph_init.py`, configured according to `config_init.json`
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-configfile` | `config_init.json` | All the configuration details for this testcase, as described above |
`-graphstats` | | Writes out graph statistics, does not run the actual testcase afterwards |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
This testcase will test dynamic graph updates. One has to pass a config file as described above; the list of graphs to test is given at the top of `test_graph_update.py`.

`python test_graph_update.py -t o+s+h+c+r+x -configfile config_update.json -runtest -timeout 120 -allocsize 8 -device 0`
- Tests edge update performance for all graphs noted in `test_graph_update.py`, configured according to `config_update.json`, this will test random edge updates

`python test_graph_update.py -t o+s+h+c+r+x -configfile config_update_range.json -runtest -timeout 120 -allocsize 8 -device 0`
- Tests edge update performance for all graphs noted in `test_graph_update.py`, configured according to `config_update_range.json`, this will test pressured edge updates with a given range of source vertices shifted over the graph
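The difference between random and range-focused updates can be illustrated with the sketch below. How the framework actually shifts the window is not documented here; the even spacing of window positions, the function names and the seeding are all assumptions made for this example.

```python
import random


def source_window(num_vertices, range_size, iteration, update_iterations):
    """Half-open [start, stop) interval from which edge sources are drawn."""
    if range_size <= 0 or range_size >= num_vertices:
        return 0, num_vertices  # range 0: sources come from the whole graph
    last_start = num_vertices - range_size
    steps = max(update_iterations - 1, 1)
    # Assumption: window positions are spread evenly over the vertex range.
    start = (last_start * iteration) // steps
    return start, start + range_size


def sample_sources(num_vertices, range_size, iteration, update_iterations,
                   batch_size, seed=0):
    """Draw batch_size source vertices from the window of this iteration."""
    rng = random.Random(seed + iteration)
    start, stop = source_window(num_vertices, range_size, iteration,
                                update_iterations)
    return [rng.randrange(start, stop) for _ in range(batch_size)]


first = source_window(1000, 100, 0, 10)
last = source_window(1000, 100, 9, 10)
sources = sample_sources(1000, 100, 4, 10, batch_size=5)
print(first, last, sources)
```

With a small window, many insertions hit the same few adjacency lists, which is what puts pressure on the memory manager compared to uniformly random sources.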
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-configfile` | `config_update.json` | All the configuration details for this testcase, as described above |
`-graphstats` | | Writes out graph statistics, does not run the actual testcase afterwards |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-timeout` | `120` | Timeout in seconds; each individual testcase run will be canceled after this timeout (default: `600`) |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
This testcase will report the number of registers required for a respective call to `malloc` or `free`.

`python test_registers.py -t o+s+h+c+r+x -runtest -allocsize 8 -device 0`
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
This testcase will test how long it takes to initialize each memory manager.

`python test_synth_init.py -t o+s+h+c+r+x -runtest -allocsize 8 -device 0`
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `c` : cuda or `s` : scatteralloc |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
This testcase will test the classic case of a number of threads producing varying numbers of output elements and compares it to a baseline implemented with a `CUB::ExclusiveSum`.

`python test_synth_workload.py -t o+s+h+c+r+x -threadrange 0-10 -range 4-64 -iter 50 -runtest -timeout 60 -allocsize 8 -device 0`
- This will start with `2⁰` threads up to `2¹⁰` threads, testing all powers of 2 in-between, and for each number of threads test the range `4-64` Bytes
- The option `-testwrite` will test write performance to this memory area
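The baseline idea, an exclusive prefix sum over per-thread output counts, can be shown on the host in a few lines. This is a CPU-side sketch of the technique only, not the CUB-based GPU implementation used by the testcase; the function name and the counts are made up.

```python
from itertools import accumulate


def exclusive_sum(counts):
    """Exclusive prefix sum: offsets[i] equals the sum of counts[:i]."""
    if not counts:
        return []
    inclusive = list(accumulate(counts))
    return [0] + inclusive[:-1]


counts = [3, 0, 2, 5]            # made-up number of outputs per thread
offsets = exclusive_sum(counts)  # where each thread starts writing
total = sum(counts)              # size of the single shared output buffer

# Each "thread" writes its elements into its own slice of one buffer.
output = [None] * total
for thread_id, (offset, count) in enumerate(zip(offsets, counts)):
    for k in range(count):
        output[offset + k] = thread_id
print(offsets, total, output)
```

With the prefix sum, one buffer of known total size replaces per-thread dynamic allocations, which is why it serves as a reference point for the memory managers.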
Option | Parameter-Example | Description |
---|---|---|
`-t` | `o+s+h+c+f+r+x+b` | Specify which frameworks to test: first letter of each approach, separated by `+`, e.g. `b` : baseline (CUB exclusive sum) or `c` : cuda or `s` : scatteralloc |
`-threadrange` | `0-10` | The range of threads to test, given as a power of 2, e.g. `0-10` would test `2⁰`, `2¹`, ..., `2¹⁰` threads for the given `-range` |
`-range` | `4-64` | Which allocation range to test, e.g. `4-64` Bytes |
`-iter` | `50` | How often to run the test and average over runs, e.g. `50` |
`-runtest` | | Pass this flag to execute the testcase and run the approaches |
`-genres` | | Pass this flag to gather all results from existing csv files into one |
`-allocsize` | `8` | How large the manageable memory area per memory manager should be, in GB |
`-device` | `0` | Which GPU device to use |
`-testwrite` | | If this flag is passed, the write performance to these allocations is measured instead of the allocation performance |
Framework | Build | Init | Reg. | Perf. 10K | Perf. 100K | Warp 10K | Warp 100K | Mix 10K | Mix 100K | Scale | Frag. 1 | OOM | Graph Init. | Graph Up. | Graph Range | Synth.4-64 | Synth.4-4096 | Synth. Write |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CUDA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
XMalloc | ✔️ | ✔️ | ✔️ | 💥 | ✔️ | ✔️ | ✔️ | 💥 | 💥 | 💥 | 💥 | ❓ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |
ScatterAlloc | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Halloc | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Reg-Eff - AW | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Reg-Eff - C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | 💥 | 💥 | 💥 | |
Reg-Eff - CF | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Reg-Eff - CM | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Reg-Eff - CFM | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Oro - P - S | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Oro - P - VA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Oro - P - VL | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Oro - C - S | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Oro - C - VA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Oro - C - VL | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Framework | Build | Init | Reg. | Perf. 10K | Perf. 100K | Warp 10K | Warp 100K | Mix 10K | Mix 100K | Scale | Frag. 1 | OOM | Graph Init. | Graph Up. | Graph Range | Synth.4-64 | Synth.4-4096 | Synth. Write |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CUDA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ||
XMalloc | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |||
ScatterAlloc | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |||
Halloc | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |||
Reg-Eff - AW | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |||
Reg-Eff - C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |||
Reg-Eff - CF | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |||
Reg-Eff - CM | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |||
Reg-Eff - CFM | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | 💥 | ✔️ | ✔️ | ✔️ | |||
Oro - P - S | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Oro - P - VA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Oro - P - VL | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Oro - C - S | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
Oro - C - VA | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | 💥 | |
Oro - C - VL | 🆎 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |