Conformal Willmore flow

Based on the work of Keenan Crane, this is an implementation of Conformal curvature flow using spin transformations.

Dependencies

Eigen (header only, version 3.3.7)
IGL (header only)
qmake (version 5)

Dependencies (Device code only)

CUDA (version 9.0 or greater)
Thrust (shipped with CUDA)
CUSP (header only)

Dependencies (Tests and benchmarks)

Google Test
Google Benchmark

Build

Configuration

The following environment variables are required for compilation:

CUDA_PATH : Needs to point to the base directory of your cuda lib and includes CUDA_ARCH : Your device compute capability (e.g. 30, 35, ... 50, 52 ...) HOST_COMPILER : Your local g++ compiler (compatible with cuda compiles, g++ 4.8.5)

> git submodule update --init --recursive
> qmake
> make -j

References:

[1] K. Crane, U. Pinkall, and P. Schröder, “Spin transformations of discrete surfaces,” ACM Transactions on Graphics, vol. 30, no. 4, p. 1, Jul. 2011.
    
[2] K. Crane, U. Pinkall, and P. Schröder, “Robust fairing via conformal curvature flow,” ACM Transactions on Graphics, vol. 32, no. 4, p. 1, Jul. 2013.

Project Structure

The library contains a host and a device implementation, separated by namspaces.
The host implementation is header only and was designed to feel like an extension of the popular IGL library (it follows their style guide too.)
The device implementation requires compiling into a shared library and has been R-pathed into the tests and benchmarks for convenience.
The project comes with 2 host samples, and one device sample. The first host, and only device sample is a demo implementation of willmore flow. The second host sample named matrix_bake, will export all matrices required by the tests and benchmarks in matrix market format.
The project comes with a full set of host and device tests and benchmarks.
Currently all subdirectories dump an executable in their directory.

Coding standards

The host side follows the same style as IGL using single capital letters for parameter names of common data, and snake case for function names.
The device side uses a similar style but due to the increased number of parameters in functions, I decided to provide full parameter names.
Both use view types where ever possible to avoid needless copies, but also to avoid enforcing specific data types on the user.

##NOTE

Spin positions test is failing for the DEVICE side implementation as it finds an equivalent but different solution to the linear system
Before testing and benchmarking you must place the matrices and models folders in the test directory

Development of parallel implementation

This project was developed as a final year programming assignment, with the goal of reimplementing a sequential algorithm on the GPU, and benchmarking the performance gains. I chose to implement conformal willmore flow using spin tranformations as nature of a geometry processing project lends itself to data oriented design. I have outlined some of the more noteworthy parts of the project below.

Adjacency Matrix Indices

Overview

One of the main aspects of the project was to assemble sparse matrices in parallel. One possible approach would be to compute lists of triplets, which could then be sorted by row and column index, before being reduced. The draw back of this approach is of course using two sorts. The two sparse matrices required for this project both share the same sparsity pattern (laplacian), and we know that, all entries in the matrix will only be written to a maximum of twice (except the diagonals), so it makes sense to try and use atomics here as the number of collisions will be very low. To facilitate this, each thread working on the matrix would require an index to write out it's value, and this function provides that.

The goal is to produce the matrix column and row indices, as well as diagonal indices (useful for later), and entry indices for parallel assembly. The first two are essentially the vertex-vertex adjacency lists with added diagonals. The adjacency lists can be found by creating a list of all possible half edges and then using unique to remove duplicates. Once we've obtained them, we can find the locations where our diagonal entries should be inserted by comparing each row index against each column index and writing a one if, it is less. If we follow this with a reduce, we've obtained the amount of rows per column before the diagonal, so to finish we can add an ascending list and the cumulative valence of each row (prior to diagonals being inserted) to these reduced values and voi-la.

row_index: 0 0 0 1 1 2 2 3 3 3
col_index: 1 2 3 0 3 0 3 0 1 2
--------------------------------
compared : 0 0 0 1 0 1 0 1 1 1
--------------------------------
index    : 0 1 2 3
reduced  : 0 1 1 3
--------------------------------
valence  : 3 2 2 3
cumval   : 0 3 5 7 10
sequence : 0 1 2 3
(reduced + cumval + sequence =)
diagonals: 0 5 8 13

Handy optimizations are to pack the comparisson and reduce into one kernel using a thrust transform iterator on the way in, and then pack the final sequence and valence add in to the end using a transform output iterator, leaving us with one single kernel launch.

Given the diagonals, it is simple to produce the row and column indices by copying across the vertex-vertex adjacency lists, but making sure to skip our diagonals. Then in a second pass, the diagonals indices are copied to using an ascending sequence. In practice this can all be done with thrust and permutation iterators.

row_index: 0 0 0 1 1 2 2 3 3 3
col_index: 1 2 3 0 3 0 3 0 1 2
diagonals: 0 5 8 13
--------------------------------------
1st pass : - 0 0 0 1 - 1 2 - 2 3 3 3 -
2nd pass : 0 0 0 0 1 1 1 2 2 2 3 3 3 3

As for the regular entry indices, we use a parallel binary search method on each in each face (same order as the threads in the sparse matrix assembly kernels.) Given an edge A -> B we find the upper bound and lower bounds as the cumulative valence of vertex A, and one past A. We then perform a binary search within these bounds for B. Unfortunately this can yeild unbalanced work-loads for threads in the kernel, however it need only be computed once and executes quick enough for this purpose, especially when combined with the gains in the sparse matrix assembly.

row_index   : 0 0 0 1 1 2 2 3 3 3
col_index   : 1 2 3 0 3 0 3 0 1 2
cumval      : 0 3 5 7 10
------------------------------
f0          : 0 1 3
lower_bound : 0 3 7
upper_bound : 3 5 10
b-search    : 0 4 7

Cotangent Laplacian

Overview

The cotangent laplacian is a very useful discrete representation of a surfaces mean curvature, used heavily for this algorithm. It will be recomputed at every iteration, and so it was essential that the assembly be quick. Initially I began by launching 6 threads per face, meaning 2 threads per face edge, which directly corresponds to the amount of writes into the matrix required. This allowed for one atomic add per thread, and hence a fairly quick assembly, however the uneven number of threads per block, and the fact that threads working on the same face could be in different warps limited my ability to combat bank conflicts. My original solution was to instead have 8 threads per face, with 2 dummy threads doing no work, but aligning the rest. However I then realised that once the alignment was in place, the entire computation could be performed using only interwarp communcation rather than block-wide. This would mean I could get rid of shared memory and potential back conflicts, and instead use warp shuffling to share data between my threads. At this point it seemed wasteful to have 2 dead threads, so I settled for 4 threads per face, with one dead thread. Each alive thread would write to global memory twice, once in the upper triangular part and once in the lower. The thread communication is fairly simple, each thread loads in a vertex positon (1-per thread), and then using a warp shuffle, we subtract the neighbours vertex value to produce 3 edges. With these edges I first compute the face area, and then the final cotangent laplacian value. Writing these value back to global memory is simple since we have computed the entry indices in the previous step.

if lane < 3:
    read vertex_pos

shuffle (0 1 2 -) -> (1 2 0 1)
shuffle down subtract (1-2 2-0 0-1 1-2)
shuffle donw neighbor_edge
inv_area <- 0.5 * rsqrt(cross(edge, neighbor_edge))
result <- dot(edge, neigbor_edge) * inv_area

In the real implementation I don't just kill off the 4th thread, instead I use it as temporary storage to make shuffle downs easier. The main kernel leaves the diagonals untouched, as they are easier computed in a second pass, we need simply reduce each column and write the negation of the sum.

Intrinsic Dirac

Overview

The intrinsic dirac operator is the core of the algorithm and luckily has the same sparsity pattern as the cotangent laplacian, so we can reuse the row and column indices, as well as the diagonal and entry indices for computation. The resulting computation is more complex, involving quaternion hammilton products, and is not entirely symmetrical, however the resolution wasn't too bad. Another issue I faced (unresolved) is that we can't take advantage of vectorized writes to global memory when using atomics as they're only implemented for 4 byte primitives. The final diagonal computation was also more challenging to implement, involving more variables than the simple reduce for a cotangent laplacian. To optimize this I first noted that all contributions were coming from adjacent triangles, and so could probably be modelled using a reduce. In particular, the contribution would come from the case where we use the same vertex for both sides of our edge, which after a subtraction will result in zero, so we can therefor assume that all diagonals will be purely real quaternions with a zero imaginary vector. The real part can be computed as the dot product of the two edges stemming from our vertex, plus the regular real component computed from only our change in mean curvature. After all this we can model the diagonals as a reduce by vertex triangle adjacency, where at each adjacency pair we transform the two edges containing our vertex into the resulting diagonal.

Tri <- [v0, v1, v2]
find two edges:
e0 <- v1-v0
e1 <- v2-v0

result <- (rho^2 * area / 9) + dot(e0, e1) / (4 * area)

Solving the sparse Linear systems

Overview

This project involved solving two sparse linear systems. The first called Similarity transforms, produces a quaternion per vertex that best describes it's conformal deformation, using the intrinsic dirac matrix as a left hand side. The second called Spin positions, produces new vertex positions that best fit the edges calculated using the similarity transforms. This was probably the most challenging and important stage of the project. These two processes are the slowest host functions as well, meaning it was crucial to speed them up. I ended up implementing two methods for both, one using a direct solve via the cuSolverSp and cuSparse API, and the other using an iterative conjugate gradient solver via the CUSP API.

Interestingly the conjugate gradient solve out performs a direct cholesky decomposition drastically for the similarity transforms, and beats the host version for some of the smaller meshes. The direct solve wins for larger meshes in the Spin positions step, where the iterative solve performs poorly.

Future improvements

I ran out of time before I was able to implement the mean curvature calculation and projection on the device side, and would add this next. I would also work on moving duplicate code out of the sparse matrix assembly kernels which would allow the shuffle methods to be used by other functions, such as face area calculations. Finally I would like to have used streams more as there is definitely potential to overlap computation in this project.

Benchmarks

Host

Intel® Xeon(R) CPU E5-1650 v3 @ 3.50GHz × 12 processor

name	iterations	real_time	cpu_time	time_unit
HOST_cotangent_laplacian_cube	182627	3852.71	3852.37	ns
HOST_cotangent_laplacian_spot	90	7.68602e+06	7.65249e+06	ns
HOST_cotangent_laplacian_bunny	35	2.03471e+07	2e+07	ns
HOST_cotangent_laplacian_tyra	8	7.18162e+07	6.8106e+07	ns
HOST_divergent_edges_cube	471392	1477.79	1477.58	ns
HOST_divergent_edges_spot	100	5.24034e+06	5.2395e+06	ns
HOST_divergent_edges_bunny	50	1.39971e+07	1.39957e+07	ns
HOST_divergent_edges_tyra	12	5.04251e+07	5.04111e+07	ns
HOST_face_area_cube	4344953	164.346	164.308	ns
HOST_face_area_spot	2074	339835	339751	ns
HOST_face_area_bunny	723	974836	974671	ns
HOST_face_area_tyra	226	3.1432e+06	3.14241e+06	ns
HOST_intrinsic_dirac_cube	44546	15991	15988.7	ns
HOST_intrinsic_dirac_spot	8	7.53803e+07	7.53403e+07	ns
HOST_intrinsic_dirac_bunny	1	6.25203e+08	6.25151e+08	ns
HOST_intrinsic_dirac_tyra	1	1.83794e+09	1.83772e+09	ns
HOST_mean_curvature_cube	2044611	365.646	365.569	ns
HOST_mean_curvature_spot	1228	559729	559669	ns
HOST_mean_curvature_bunny	623	1.14741e+06	1.14732e+06	ns
HOST_mean_curvature_tyra	162	4.46348e+06	4.463e+06	ns
HOST_orthonormalize_cube	912301	768.27	768.2	ns
HOST_orthonormalize_spot	5326	135622	135611	ns
HOST_orthonormalize_bunny	2243	310662	310637	ns
HOST_orthonormalize_tyra	749	934909	934832	ns
HOST_project_basis_cube	1746431	397.31	397.274	ns
HOST_project_basis_spot	11059	62670	62664.8	ns
HOST_project_basis_bunny	4811	142988	142976	ns
HOST_project_basis_tyra	1644	419267	419232	ns
HOST_quaternion_matrix_cube	64695	10676.5	10675.6	ns
HOST_quaternion_matrix_spot	24	2.96319e+07	2.96293e+07	ns
HOST_quaternion_matrix_bunny	12	5.74237e+07	5.7419e+07	ns
HOST_quaternion_matrix_tyra	3	2.02537e+08	2.02519e+08	ns
HOST_similarity_xform_cube	29636	23613.9	23611.8	ns
HOST_similarity_xform_spot	1	6.35867e+08	6.35806e+08	ns
HOST_similarity_xform_bunny	1	4.63868e+09	4.63829e+09	ns
HOST_similarity_xform_tyra	1	8.65426e+09	8.65345e+09	ns
HOST_spin_positions_cube	38234	18642.5	18641	ns
HOST_spin_positions_spot	1	6.29275e+08	6.29218e+08	ns
HOST_spin_positions_bunny	1	4.65188e+09	4.65144e+09	ns
HOST_spin_positions_tyra	1	8.61795e+09	8.61715e+09	ns
HOST_vertex_mass_cube	3144441	224.885	224.867	ns
HOST_vertex_mass_spot	1726	412928	412894	ns
HOST_vertex_mass_bunny	575	1.21601e+06	1.21591e+06	ns
HOST_vertex_mass_tyra	180	3.92145e+06	3.9211e+06	ns
HOST_vertex_normals_cube	398109	1786.48	1786.32	ns
HOST_vertex_normals_spot	370	4.03304e+06	2.02852e+06	ns
HOST_vertex_normals_bunny	262	5.30947e+06	2.79652e+06	ns
HOST_vertex_normals_tyra	118	1.05496e+07	6.1032e+06	ns
HOST_vertex_triangle_adjacency_cube	209312	2846.6	2811.03	ns
HOST_vertex_triangle_adjacency_spot	1168	696248	674157	ns
HOST_vertex_triangle_adjacency_bunny	308	2.49694e+06	2.46715e+06	ns
HOST_vertex_triangle_adjacency_tyra	86	8.45339e+06	8.39191e+06	ns
HOST_vertex_vertex_adjacency_cube	372804	1905.26	1901.45	ns
HOST_vertex_vertex_adjacency_spot	172	4.13094e+06	4.12159e+06	ns
HOST_vertex_vertex_adjacency_bunny	55	1.28761e+07	1.28478e+07	ns
HOST_vertex_vertex_adjacency_tyra	16	4.38669e+07	4.37587e+07	ns

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

name	iterations	real_time	cpu_time	time_unit
HOST_cotangent_laplacian_cube	271031	2576.69	2574.4	ns
HOST_cotangent_laplacian_spot	103	6.9254e+06	6.71974e+06	ns
HOST_cotangent_laplacian_bunny	42	1.6869e+07	1.62396e+07	ns
HOST_cotangent_laplacian_tyra	12	5.43377e+07	5.20702e+07	ns
HOST_divergent_edges_cube	261742	2713.79	2711.13	ns
HOST_divergent_edges_spot	90	7.78507e+06	7.77556e+06	ns
HOST_divergent_edges_bunny	34	2.00418e+07	2.00014e+07	ns
HOST_divergent_edges_tyra	7	9.34462e+07	9.31885e+07	ns
HOST_face_area_cube	6017563	113.593	113.481	ns
HOST_face_area_spot	2523	280010	279535	ns
HOST_face_area_bunny	878	805238	803422	ns
HOST_face_area_tyra	270	2.583e+06	2.57878e+06	ns
HOST_intrinsic_dirac_cube	47654	13423.8	13411.1	ns
HOST_intrinsic_dirac_spot	10	6.21724e+07	6.20564e+07	ns
HOST_intrinsic_dirac_bunny	1	5.33196e+08	5.32015e+08	ns
HOST_intrinsic_dirac_tyra	1	1.6005e+09	1.59639e+09	ns
HOST_mean_curvature_cube	3933690	178.399	178.214	ns
HOST_mean_curvature_spot	1458	479014	478333	ns
HOST_mean_curvature_bunny	756	978688	976697	ns
HOST_mean_curvature_tyra	158	4.39168e+06	4.38173e+06	ns
HOST_orthonormalize_cube	2125151	330.327	329.96	ns
HOST_orthonormalize_spot	8477	83593.6	83457.2	ns
HOST_orthonormalize_bunny	3731	189577	189290	ns
HOST_orthonormalize_tyra	1255	565450	564393	ns
HOST_project_basis_cube	5539554	128.969	128.826	ns
HOST_project_basis_spot	18190	40110.3	40037.7	ns
HOST_project_basis_bunny	8605	84039.3	83884.8	ns
HOST_project_basis_tyra	2904	233260	232939	ns
HOST_quaternion_matrix_cube	78228	8942.42	8933.33	ns
HOST_quaternion_matrix_spot	29	2.40185e+07	2.39839e+07	ns
HOST_quaternion_matrix_bunny	15	4.68857e+07	4.68207e+07	ns
HOST_quaternion_matrix_tyra	4	1.63473e+08	1.63205e+08	ns
HOST_similarity_xform_cube	38645	18438.3	18414	ns
HOST_similarity_xform_spot	1	7.42056e+08	7.38969e+08	ns
HOST_similarity_xform_bunny	1	4.0992e+09	4.08846e+09	ns
HOST_similarity_xform_tyra	1	7.66838e+09	7.6493e+09	ns
HOST_spin_positions_cube	49689	13943.9	13925.5	ns
HOST_spin_positions_spot	1	5.45467e+08	5.44281e+08	ns
HOST_spin_positions_bunny	1	4.34899e+09	4.33687e+09	ns
HOST_spin_positions_tyra	1	7.72303e+09	7.69932e+09	ns
HOST_vertex_mass_cube	4696537	150.733	150.515	ns
HOST_vertex_mass_spot	1984	345572	344981	ns
HOST_vertex_mass_bunny	730	979333	977439	ns
HOST_vertex_mass_tyra	222	3.39383e+06	3.38649e+06	ns
HOST_vertex_normals_cube	603713	1159.48	1158.22	ns
HOST_vertex_normals_spot	1018	5.08585e+06	1.03785e+06	ns
HOST_vertex_normals_bunny	415	8.53295e+06	1.69286e+06	ns
HOST_vertex_normals_tyra	147	2.08555e+07	5.71142e+06	ns
HOST_vertex_triangle_adjacency_cube	450150	1691.17	1687.71	ns
HOST_vertex_triangle_adjacency_spot	957	672810	640440	ns
HOST_vertex_triangle_adjacency_bunny	366	1.85957e+06	1.8288e+06	ns
HOST_vertex_triangle_adjacency_tyra	102	6.66418e+06	6.64907e+06	ns
HOST_vertex_vertex_adjacency_cube	454305	1478.19	1476.08	ns
HOST_vertex_vertex_adjacency_spot	183	3.80413e+06	3.79766e+06	ns
HOST_vertex_vertex_adjacency_bunny	62	1.09962e+07	1.09678e+07	ns
HOST_vertex_vertex_adjacency_tyra	19	3.6735e+07	3.66314e+07	ns

Device

Quadro K2200

name	iterations	real_time	cpu_time	time_unit
DEVICE_adjacency_matrix_indices_cube	1859	341886	340272	ns
DEVICE_adjacency_matrix_indices_spot	952	815516	811983	ns
DEVICE_adjacency_matrix_indices_bunny	648	1043000	1040070	ns
DEVICE_adjacency_matrix_indices_tyra	303	2308040	2305360	ns
DEVICE_cotangent_laplacian_cube	2636	266425	264868	ns
DEVICE_cotangent_laplacian_spot	1822	389558	387782	ns
DEVICE_cotangent_laplacian_bunny	1180	592784	590895	ns
DEVICE_cotangent_laplacian_tyra	250	2780910	2779790	ns
DEVICE_divergent_edges_cube	2755	250017	248657	ns
DEVICE_divergent_edges_spot	1829	388904	386151	ns
DEVICE_divergent_edges_bunny	1227	573293	571326	ns
DEVICE_divergent_edges_tyra	416	1661450	1661310	ns
DEVICE_face_area_cube	174640	4013.78	4013.42	ns
DEVICE_face_area_spot	39931	18190.5	18189	ns
DEVICE_face_area_bunny	10000	70404.2	70398.4	ns
DEVICE_face_area_tyra	10000	228716	228697	ns
DEVICE_intrinsic_dirac_cube	11973	55211.1	55206.5	ns
DEVICE_intrinsic_dirac_spot	605	1157520	1156930	ns
DEVICE_intrinsic_dirac_bunny	265	2778810	2767580	ns
DEVICE_intrinsic_dirac_tyra	58	11694600	11693600	ns
DEVICE_quaternion_matrix_cube	64005	10361	10360.2	ns
DEVICE_quaternion_matrix_spot	1619	432756	432711	ns
DEVICE_quaternion_matrix_bunny	825	844616	844546	ns
DEVICE_quaternion_matrix_tyra	278	2.40125e+06	2.40105e+06	ns
DEVICE_similarity_xform_direct_cube	410	1.65968e+06	1.65722e+06	ns
DEVICE_similarity_xform_direct_spot	1	3.72199e+09	3.72164e+09	ns
DEVICE_similarity_xform_direct_bunny	1	1.48061e+10	1.48048e+10	ns
DEVICE_similarity_xform_direct_tyra	1	9.21995e+10	9.2191e+10	ns
DEVICE_similarity_xform_iterative_cube	354	1.97964e+06	1.97947e+06	ns
DEVICE_similarity_xform_iterative_spot	1	8.37923e+08	8.35415e+08	ns
DEVICE_similarity_xform_iterative_bunny	1	1.60284e+09	1.59807e+09	ns
DEVICE_similarity_xform_iterative_tyra	1	4.12793e+09	4.12634e+09	ns
DEVICE_spin_positions_direct_spot	1	3.68252e+09	3.68218e+09	ns
DEVICE_spin_positions_direct_bunny	1	1.4813e+10	1.48117e+10	ns
DEVICE_spin_positions_direct_tyra	1	9.26087e+10	9.25993e+10	ns
DEVICE_spin_positions_iterative_spot	1	3.20449e+09	3.19223e+09	ns
DEVICE_spin_positions_iterative_bunny	1	2.43237e+10	2.42893e+10	ns
DEVICE_spin_positions_iterative_tyra	1	4.5125e+10	4.5104e+10	ns
DEVICE_vertex_mass_cube	1850	382962	381300	ns
DEVICE_vertex_mass_spot	774	918686	914777	ns
DEVICE_vertex_mass_bunny	715	985748	981242	ns
DEVICE_vertex_mass_tyra	550	1.27652e+06	1.27083e+06	ns
DEVICE_vertex_triangle_adjacency_cube	7564	83150.4	83142.5	ns
DEVICE_vertex_triangle_adjacency_spot	975	645653	644782	ns
DEVICE_vertex_triangle_adjacency_bunny	687	951790	950354	ns
DEVICE_vertex_triangle_adjacency_tyra	341	2.00665e+06	2.00445e+06	ns
DEVICE_vertex_vertex_adjacency_cube	1754	402834	401607	ns
DEVICE_vertex_vertex_adjacency_spot	328	2.11945e+06	2.11526e+06	ns
DEVICE_vertex_vertex_adjacency_bunny	205	3.42353e+06	3.41836e+06	ns
DEVICE_vertex_vertex_adjacency_tyra	78	8.39362e+06	8.38729e+06	ns

Quadro M3000M

name	iterations	real_time	cpu_time	time_unit
DEVICE_adjacency_matrix_indices_cube	7884	79085	78952.7	ns
DEVICE_adjacency_matrix_indices_spot	3765	186365	186057	ns
DEVICE_adjacency_matrix_indices_bunny	1981	354997	354429	ns
DEVICE_adjacency_matrix_indices_tyra	698	1.00454e+06	1.00289e+06	ns
DEVICE_cotangent_laplacian_cube	19028	36727.9	36673.4	ns
DEVICE_cotangent_laplacian_spot	7318	95927	95779	ns
DEVICE_cotangent_laplacian_bunny	3788	185237	184952	ns
DEVICE_cotangent_laplacian_tyra	639	1.09654e+06	1.09491e+06	ns
DEVICE_divergent_edges_cube	22758	31024.4	30975	ns
DEVICE_divergent_edges_spot	5095	137633	137418	ns
DEVICE_divergent_edges_bunny	2589	270829	270412	ns
DEVICE_divergent_edges_tyra	714	882609	881253	ns
DEVICE_face_area_cube	242575	2905.48	2902.24	ns
DEVICE_face_area_spot	43658	17166.9	17149	ns
DEVICE_face_area_bunny	10000	59671.9	59607.6	ns
DEVICE_face_area_tyra	10000	167694	167522	ns
DEVICE_intrinsic_dirac_cube	12720	48998.3	48921.5	ns
DEVICE_intrinsic_dirac_spot	1224	571206	570304	ns
DEVICE_intrinsic_dirac_bunny	546	1.28079e+06	1.27895e+06	ns
DEVICE_intrinsic_dirac_tyra	115	5.41181e+06	5.40575e+06	ns
DEVICE_quaternion_matrix_cube	73965	8841.74	8831.27	ns
DEVICE_quaternion_matrix_spot	2955	237677	237398	ns
DEVICE_quaternion_matrix_bunny	1530	458183	457665	ns
DEVICE_quaternion_matrix_tyra	540	1.29391e+06	1.29254e+06	ns
DEVICE_similarity_xform_direct_cube	585	1.18767e+06	1.18497e+06	ns
DEVICE_similarity_xform_direct_spot	1	8.58772e+08	8.57566e+08	ns
DEVICE_similarity_xform_direct_bunny	1	2.73471e+09	2.7312e+09	ns
DEVICE_similarity_xform_direct_tyra	1	7.14028e+09	7.13095e+09	ns
DEVICE_similarity_xform_iterative_cube	422	1.63953e+06	1.63654e+06	ns
DEVICE_similarity_xform_iterative_spot	3	2.18979e+08	2.18621e+08	ns
DEVICE_similarity_xform_iterative_bunny	1	7.19855e+08	7.17361e+08	ns
DEVICE_similarity_xform_iterative_tyra	1	2.08458e+09	2.07955e+09	ns
DEVICE_spin_positions_direct_spot	1	8.20441e+08	8.1929e+08	ns
DEVICE_spin_positions_direct_bunny	1	2.71461e+09	2.71119e+09	ns
DEVICE_spin_positions_direct_tyra	1	6.88368e+09	6.87474e+09	ns
DEVICE_spin_positions_iterative_spot	1	1.39873e+09	1.39645e+09	ns
DEVICE_spin_positions_iterative_bunny	1	1.09265e+10	1.08924e+10	ns
DEVICE_spin_positions_iterative_tyra	1	2.23644e+10	2.23161e+10	ns
DEVICE_vertex_mass_cube	9017	72878.7	72766.8	ns
DEVICE_vertex_mass_spot	2724	256301	255283	ns
DEVICE_vertex_mass_bunny	2391	294982	293866	ns
DEVICE_vertex_mass_tyra	1370	512385	510719	ns
DEVICE_vertex_triangle_adjacency_cube	9364	68021.9	67925	ns
DEVICE_vertex_triangle_adjacency_spot	1339	470203	468639	ns
DEVICE_vertex_triangle_adjacency_bunny	1007	647694	645818	ns
DEVICE_vertex_triangle_adjacency_tyra	508	1.35423e+06	1.35101e+06	ns
DEVICE_vertex_vertex_adjacency_cube	4683	146918	146687	ns
DEVICE_vertex_vertex_adjacency_spot	501	1.36479e+06	1.36066e+06	ns
DEVICE_vertex_vertex_adjacency_bunny	338	2.0472e+06	2.04122e+06	ns
DEVICE_vertex_vertex_adjacency_tyra	132	4.88977e+06	4.88117e+06	ns

nitronoid/flo

Conformal Willmore flow

Dependencies

Dependencies (Device code only)

Dependencies (Tests and benchmarks)

Build

Configuration

References:

Project Structure

Coding standards

Development of parallel implementation

Overview

Overview

Overview

Overview

Future improvements

Benchmarks

Host

Intel® Xeon(R) CPU E5-1650 v3 @ 3.50GHz × 12 processor

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Device

Quadro K2200

Quadro M3000M