Non-reproducible hit finding on cuda
Closed this issue · 3 comments
hit search on cuda seems to be somewhat non-reproducible. It's expected that hits may come in a different order, but they should always be the same hits. Running a pipeline on voyager data with a drift rate range of -5 to 5 Hz/sec, then moving to the cpu for hit_search, gives a consistent 11 hits (I've run this a few dozen times):
hit: .start_freq_MHz=8419.921875 (.index=524288), .drift_rate_Hz_per_second=-0.000000 (.index=500), .SNR=24762.199219, .power=3523141632, bandwidth=0.0
hit: .start_freq_MHz=8419.565228 (.index=651993), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=127.647469, .power=12842161, bandwidth=449.8
hit: .start_freq_MHz=8419.542734 (.index=659988), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=1071.168579, .power=107766488, bandwidth=61.5
hit: .start_freq_MHz=8419.520239 (.index=668097), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=122.781815, .power=12352645, bandwidth=441.4
hit: .start_freq_MHz=8419.475402 (.index=684087), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=13.124967, .power=1320456, bandwidth=5.6
hit: .start_freq_MHz=8419.475080 (.index=684202), .drift_rate_Hz_per_second=-0.357149 (.index=535), .SNR=14.723750, .power=1481304, bandwidth=5.6
hit: .start_freq_MHz=8419.610230 (.index=635887), .drift_rate_Hz_per_second=-0.387762 (.index=538), .SNR=13.594336, .power=1116704, bandwidth=329.7
hit: .start_freq_MHz=8419.565711 (.index=651764), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=14.698190, .power=1478733, bandwidth=8.4
hit: .start_freq_MHz=8419.564756 (.index=652106), .drift_rate_Hz_per_second=-0.387762 (.index=538), .SNR=13.553493, .power=1113350, bandwidth=11.2
hit: .start_freq_MHz=8419.520717 (.index=667868), .drift_rate_Hz_per_second=-0.367353 (.index=536), .SNR=13.233637, .power=1331390, bandwidth=5.6
hit: .start_freq_MHz=8419.519762 (.index=668210), .drift_rate_Hz_per_second=-0.387762 (.index=538), .SNR=13.299211, .power=1092462, bandwidth=14.0
When the cuda connected_components implementation runs, we get ...more hits... A lot of them are in the range above and below 1.3 Hz/sec. If I reduce the search range to -1 to 1 Hz/sec, there are usually 11 hits, but often we get 12 and sometimes even 13. The extra hits can look like this:
hit: .start_freq_MHz=8419.542708 (.index=659990), .drift_rate_Hz_per_second=0.795932 (.index=22), .SNR=56.486515, .power=3594187, bandwidth=64.3
hit: .start_freq_MHz=8419.542594 (.index=660047), .drift_rate_Hz_per_second=0.612255 (.index=40), .SNR=70.239807, .power=4996826, bandwidth=290.6
hit: .start_freq_MHz=8419.921953 (.index=524253), .drift_rate_Hz_per_second=-0.540826 (.index=153), .SNR=1547.952393, .power=110120584, bandwidth=162.1
It does look like the expected hits are always present, so the problem isn't that hits go missing at random; rather, something is randomly producing spurious hits, and at higher drift rate ranges spurious hits are nearly guaranteed to show up.
I'm not sure how the code is set up, but inconsistent results from CUDA-land sound like a synchronization issue (e.g. between tasks, or maybe a missing __syncthreads() if you're using shared memory on the GPU). Just one of many possibilities... :P
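For concreteness, here is a minimal, hypothetical illustration (not taken from this pipeline) of the kind of shared-memory race a missing __syncthreads() causes: some threads read tile entries before other threads have written them, so the output varies from run to run.

```cuda
// Hypothetical kernel, launched with 256 threads per block, that averages each
// sample with its neighbors via shared memory. If the __syncthreads() below is
// removed, threads may read tile[] entries that other threads have not written
// yet, and the result becomes non-reproducible.
__global__ void smooth3(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];
    }
    __syncthreads();  // barrier: all writes to tile[] complete before any reads

    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1) {
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
    }
}
```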
Narrowing down issues:
- This is only related to the connected_components_cuda hit search. Running the rest of the pipeline on cuda with connected_components_cpu shows no inconsistency
- running local_maxima_cuda (on branch MR #50) always gives 64 hits when a range of -1 to 1 Hz/sec is used and 1235 hits when -5 to 5 Hz/sec is used. N = several runs (~10) of each with a neighbor L1 distance of 15 and an SNR threshold of 50. When SNR is 10 and the L1 distance is 7, it's always 5290 hits with -5 to 5 Hz/sec
- if I take out the label_spreading from connected_components I still see variable numbers of hits!
- I did find a place where the visited array was being initialized incorrectly by not taking into account sizeof(int) in the cudaMemset call; that's fixed and does reduce the variability, but an issue still lingers (see the sketch after this list)
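For reference, a sketch of the cudaMemset sizing fix described in the last bullet (the variable names here are illustrative, not the pipeline's actual ones): cudaMemset counts bytes, not elements, so the byte count has to include sizeof(int).

```cuda
int* visited = nullptr;
size_t num_elements = 1 << 20;  // e.g. number of frequency/drift cells (illustrative)
cudaMalloc(&visited, num_elements * sizeof(int));

// Buggy: clears only the first num_elements BYTES, i.e. a quarter of the ints,
// leaving the rest of the visited array uninitialized.
// cudaMemset(visited, 0, num_elements);

// Fixed: clear the whole allocation.
cudaMemset(visited, 0, num_elements * sizeof(int));
```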
I took out what should have been a simple optimization: setting the entire neighborhood to a label rather than just the maximum (which should save a call to spread_labels) but which winds up thrashing a bit. Without that optimization, results are consistent, and without spread_labels the number of protohits now matches local_maxima (the two kernels are very similar once label spreading is removed).
Adding spread_labels back in returns us to a non-reproducible number of hits, so the remaining issue is entirely within spread_labels. I need to think through a reproducible way to do that operation, which is OK-ish since it has been the optimization target so far and was at least part of #48.
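One possible direction for a reproducible spread_labels (a sketch only, under my own assumptions about the data layout, not the current kernel): propagate the minimum label through each neighborhood with atomicMin and iterate until nothing changes. Because min is commutative and the loop runs to a fixed point, the final labeling doesn't depend on thread scheduling.

```cuda
// Sketch of an order-independent label-spreading pass over a width x height grid
// of labels (negative = not part of any protohit). Launch repeatedly, clearing
// *changed before each pass, until *changed comes back 0.
__global__ void spread_labels_pass(int* labels, int width, int height, int* changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    int best = labels[idx];
    if (best < 0) return;

    // Minimum label over the 4-neighborhood (connectivity here is illustrative).
    if (x > 0          && labels[idx - 1]     >= 0) best = min(best, labels[idx - 1]);
    if (x < width - 1  && labels[idx + 1]     >= 0) best = min(best, labels[idx + 1]);
    if (y > 0          && labels[idx - width] >= 0) best = min(best, labels[idx - width]);
    if (y < height - 1 && labels[idx + width] >= 0) best = min(best, labels[idx + width]);

    if (best < labels[idx]) {
        atomicMin(&labels[idx], best);
        *changed = 1;  // every writer stores the same value, so this race is benign
    }
}
```

Intermediate passes can still interleave differently run to run, but since labels only ever decrease and the loop stops only when a full pass changes nothing, every run converges to the same per-component minimum label.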
side note: on that branch (MR #50) cpu local_maxima is borked, but not a priority at the moment