This is documentation for a little project to understand HMC for inference with a fixed dimensionality.

/---- Contents

- Lessons learned
- Design and goals
- Main modules
- Case studies
- Notes for the future

/---- Lessons learned

These are conclusions drawn from the study below.

- NUTS vs. random trajectory lengths: My exploration of the NUTS algorithm did not conclusively show that it is better than the random-trajectory sampler. The random sampler has three parameters to tune, [L_low, L_high] and dt, whereas NUTS has only one, dt. At the outset it may seem like NUTS should be easier to use; I did not find that to be the case. Some general observations:
  1) Both samplers produce independent samples at each step (ESS ~ total Nsample) *if properly tuned*.
  2) The NUTS sampler often terminates prematurely in high-covariance cases, taking shorter trajectories than the random sampler, and then progresses slowly by comparison.
  3) The NUTS sampler takes long sojourns when the target distribution does not exhibit large covariance. To make the journey happen in less compute time, dt has to be increased.
  4) For PCAT, we want to set a single step size at the outset and keep it fixed. The random sampler may waste a lot of computation retracing its own trajectories, but the payoff is generally longer travel.
  5) Burn-in: I also find that NUTS may exhibit worse burn-in.
  6) Even if properly tuned, NUTS's (# of expensive computations) / (effective sample) can be much larger, mostly in low-covariance cases.
- Optimal step size: Conservatism is appropriate for a well-functioning pipeline. dt should be chosen small enough that the acceptance rate hovers around 80% for the most challenging cases (e.g., high covariance, or small variance due to precise measurements). The random trajectory range should be chosen so that, for a truly high-dimensional problem, the particle has adequate time to move far; a trajectory range of 100 to 500 seems a modest compromise. Aim for ESS / (total number of stored samples) > 0.5.
- Momentum distribution: The momentum distribution cannot be adjusted to improve performance. Since we use the blocking method, it is impossible to choose a single momentum distribution that works for all the different blocks.

/---- Design and goals

- Implement HMC with the following sampling schemes:
  - Fixed trajectory length: not profiled extensively, as we already know of pathologies that arise from resonance.
  - Random trajectory length
  - NUTS sampler
  For all options, it should be possible to specify the momentum distribution and a different time step for each dimension.
- All studies are done with multivariate normal distributions.
- Default parameters:
  - Chains: 10
  - Steps: 1000 (the large number of chains/iterations is to get accurate statistics on the speed)
  - Default time step: 0.5 (a conservative choice so we don't have to worry about it)
  - Warm-up: first 200
  - Thinning rate: set to 1 to be able to calculate auto-correlation.
  - Random trajectory length interval: [10, 100]. Arbitrarily chosen.
- Sampling convention:
  - The initial point gets index 0.
  - The final point gets index Niter, so there are Niter+1 points in total.
  - If N_warmup points are asked to be discarded, then the first N_warmup points are discarded; that is, the first saved point is the result of taking N_warmup iterations.
  - Marginal energy and differential energy are recorded in each iteration, after momentum resampling. (A minimal sketch of one random-trajectory iteration that follows this convention is given just below.)
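To make the random-trajectory scheme and the sampling convention above concrete, here is a minimal, self-contained Python sketch of one HMC iteration: resample the momentum, record the energy, leapfrog for a uniformly drawn number of steps L in [L_low, L_high], and apply a Metropolis accept/reject. The names (hmc_step, potential, grad_potential) and the toy usage are illustrative assumptions, not the project's actual module API.

import numpy as np

def hmc_step(q, potential, grad_potential, dt, L_low, L_high, rng):
    """One HMC iteration: resample momentum, leapfrog for L ~ Uniform{L_low..L_high}
    steps, then Metropolis accept/reject."""
    # Momentum resampling from a unit normal (identity mass matrix).
    p = rng.standard_normal(q.shape)
    # Marginal energy recorded right after momentum resampling, per the convention above.
    E_initial = potential(q) + 0.5 * np.dot(p, p)

    L = rng.integers(L_low, L_high + 1)
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog integration for L steps.
    p_new -= 0.5 * dt * grad_potential(q_new)
    for _ in range(L - 1):
        q_new += dt * p_new
        p_new -= dt * grad_potential(q_new)
    q_new += dt * p_new
    p_new -= 0.5 * dt * grad_potential(q_new)

    E_proposed = potential(q_new) + 0.5 * np.dot(p_new, p_new)
    dE = E_proposed - E_initial          # energy change along the trajectory, used in the accept test
    accept = np.log(rng.uniform()) < -dE
    return (q_new if accept else q), E_initial, dE, accept

# Toy usage on a unit bivariate normal target.
rng = np.random.default_rng(0)
U = lambda q: 0.5 * np.dot(q, q)         # -log density up to a constant
gradU = lambda q: q
q = np.array([3.0, -2.0])                # deliberately bad starting point
for _ in range(5):
    q, E, dE, acc = hmc_step(q, U, gradU, dt=0.1, L_low=10, L_high=100, rng=rng)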
- Metrics for comparing performance:
  - Number of computational time steps per effective sample.
  - Show the estimated standard deviation for each quantity being inferred.
  - Diagnostics (a sketch of the Rhat/ESS computation appears after this list, before the case studies):
    - Rhat
    - Effective sample size
    - Energy distribution: the marginal distribution of E = U + K, and Del_E = (U + K)_current_initial - (U + K)_original_initial.
- Summary plot, 3 x 3:
  - Title: case number, # chains / # iters / R_thin / # warm-up.
  - (0, 0): Scatter plot of all points. Burn-in points are marked in red, the rest in black.
  - (0, 1), (1, 0): Histograms and true distributions, scaled to match the normalization of the histograms.
  - (0, 2): Energy and energy-difference histograms (from all chains combined?).
  - (1, 1): 1) Total time taken, 2) total computation steps, 3) # steps per effective sample (median), 4) worst, 5) best.
  - (1, 2): Rhat statistics distribution. Report mean and std.
  - (2, 0): Inferred bias (mean) vs. true variance (the mean is a more useful quantity to compute).
  - (2, 1): Inferred vs. true standard deviations.
  - (2, 2): Inferred/true standard deviations.
- Videos:
  - A video is made about the first
  - Videos should cover only the first 100 steps, though the whole inference is run regardless.
  - Once videos are created, all supporting files are immediately discarded.
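A minimal sketch of how the Rhat and ESS diagnostics above can be computed from the stored chains, assumed here to have shape (n_chains, n_iter, D). These are the textbook Gelman-Rubin Rhat and a simplified autocorrelation-based ESS; the exact formulas used by the case scripts may differ.

import numpy as np

def rhat(chains):
    """Gelman-Rubin potential scale reduction factor, per dimension."""
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)               # (m, D)
    B = n * chain_means.var(axis=0, ddof=1)         # between-chain variance
    W = chains.var(axis=1, ddof=1).mean(axis=0)     # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

def ess(chain):
    """Effective sample size of a single 1-D chain, using the initial positive
    sequence of autocorrelations (simplified Geyer-style truncation)."""
    x = chain - chain.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for t in range(1, n):
        if acf[t] < 0:
            break
        tau += 2.0 * acf[t]
    return n / tau

# Usage: chains drawn here from a known target just to exercise the functions.
rng = np.random.default_rng(1)
chains = rng.standard_normal((10, 1000, 2))
print("Rhat:", rhat(chains))
print("ESS (chain 0, dim 0):", ess(chains[0, :, 0]))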
/---- Case studies

- Case study 1: Unit multivariate normal of dimensionality [2, 10, 100]
  - Hyper-parameters: single time step, starting distribution with 2 * I
  - Common: dt=1e-1, L_low=5, L_high=20
  - Script used: case1-script.py
  - case-1a: D=2. Video and summary plots.
  - case-1b: D=10. Video and summary plots.
  - case-1c: D=100. Video and summary plots.
- Case study 2: Unit multivariate normal of dimensionality [2, 10, 100] with a bad starting point
  - Hyper-parameters: single time step, starting distribution with 100 * I. The first case gets an extra-large starting point for illustration purposes.
  - Common: dt=1e-1, L_low=5, L_high=20
  - Script used: case2-script.py
  - case-2a: D=2. Video and summary plots.
  - case-2b: D=10. Video and summary plots.
  - case-2c: D=100. Video and summary plots.
- Case study 3: Multivariate normal of dimensionality [2, 10, 100] with rho = 0.95
  - Hyper-parameters: single time step, starting distribution with 2 * I
  - Common: dt=1e-1, L_low=5, L_high=20
  - Script used: case3-script.py
  - case-3a: D=2. Video and summary plots.
  - case-3b: D=10. Video and summary plots.
  - case-3c: D=100. Video and summary plots.
  - case-3d: D=100, dt=1e-1, L_low=50, L_high=200. case-3c showed inadequate convergence: not enough steps for the HMC to travel far enough. Also, the acceptance rate was ~80%, which does not seem to be problematic. Video and summary plots.
- Case study 4: Multivariate normal of dimensionality [2, 10, 100] with rho = 0.99
  - Hyper-parameters: single time step, starting distribution with 2 * I
  - Common: dt=1e-1, L_low=5, L_high=20
  - Script used: case4-script.py
  - case-4a: D=2. Video and summary plots.
  - case-4b: D=10. Video and summary plots.
  - case-4c: D=100. Video and summary plots.
  - case-4d: D=100, dt=5e-2, L_low=50, L_high=200. case-4b and case-4c showed inadequate convergence; I only attempt to improve the last case, as it is the more challenging. Video and summary plots.
- Case study 5: Multivariate normal of dimensionality [2, 10, 100] with rho = 0.999
  - Hyper-parameters: single time step, starting distribution with 2 * I
  - Common: dt=1e-1, L_low=5, L_high=20
  - Script used: case5-script.py
  - case-5a: D=2. Video and summary plots.
  - case-5b: D=10. Video and summary plots.
  - case-5c: D=100. Video and summary plots.
  - case-5d: D=100, dt=1e-2, L_low=200, L_high=1000. case-5a, case-5b, and case-5c showed complete failures; I only attempt to improve the last case, as it is the more challenging. This worked quite adequately. Video and summary plots.
- Case study 6: Multivariate normal of dimensionality [2, 10, 100] with random variance
  - Hyper-parameters: single time step, starting distribution with 100 * I
  - Variance: log-uniform sampling from [0, 100]
  - Video: yes
- Case study 7: Multivariate normal of dimensionality [2, 10, 100] with random variance
  - Hyper-parameters: time step of each dimension matches its standard deviation, starting distribution with 100 * I
  - Variance: log-uniform sampling from [0, 100]
  - Video: yes
- Case study 8: Multivariate normal of dimensionality [2, 10, 100] with random variance
  - Hyper-parameters: time step of each dimension matches its standard deviation, starting distribution with 100 * I
  - Variance: log-uniform sampling from [0, 100]
  - Video: yes

# ----- Archived notes

utils.py

/---- Target distributions

All toy target distributions studied are multivariate normals (D >= 2) specified by mean mu0 and covariance cov0. (A sketch of such a target and its potential/gradient appears after the task list below.)

/----- Number of samples and convergence:

In general, we want to take independent samples (after the warm-up phase) from a distribution. The set of samples characterizes the posterior distribution. The more complicated a posterior distribution, the greater the number of samples required for a good characterization of it. Imagine a shallow, moon-shaped distribution: the sharper the curvature, the more samples we need to fully characterize it. Nonetheless, in many cases we can make a normal approximation to the full posterior. In that case, for each quantity being inferred we want to know the mean and covariance matrix of the posterior. Some observations:
- Covariance does not seem to affect the variance of the estimate of the mean.
- Nsample improves the mean estimate only as 1/sqrt(Nsample).
- Larger intrinsic variance gives noisier estimates of the mean, as expected.

/----- "Pathological" cases:

Regions of extreme curvature will be missed by HMC no matter how small one makes each time step. For these, one must use Riemannian HMC.

/----- Degeneracy:

/----- Cases to be studied:

- "Typical" object
- Very bright object
- Very dim object
- One bright, one faint
- Two bright
- Etc.

/---- List of tasks

- Basis cases
- Processing speed requirements
- Number of samples problem
- Understanding which machine to use
- Degeneracy or non-identifiability
  - One-dimensional case with two true objects.
- Other PCAT problems here
  - Model bias due to PSF mis-specification
  - Bias in quantities measured
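Since every toy target is a multivariate normal, the only pieces the samplers need from the target side are the potential U(q) = -log pi(q) (up to a constant) and its gradient. Below is a hedged sketch of how such a target might be constructed, including the uniform correlation rho used in case studies 3-5; make_target and its arguments are illustrative names, not the actual utils.py interface.

import numpy as np

def make_target(D, rho=0.0, variances=None):
    """MVN toy target: mean mu0, covariance cov0 with uniform correlation rho,
    plus the potential U(q) and its gradient needed by the leapfrog integrator."""
    sigma = np.ones(D) if variances is None else np.sqrt(np.asarray(variances, dtype=float))
    corr = np.full((D, D), rho)
    np.fill_diagonal(corr, 1.0)
    cov0 = corr * np.outer(sigma, sigma)    # covariance matrix
    mu0 = np.zeros(D)
    cov_inv = np.linalg.inv(cov0)

    def U(q):                               # potential energy, up to an additive constant
        d = q - mu0
        return 0.5 * d @ cov_inv @ d

    def grad_U(q):                          # gradient of the potential
        return cov_inv @ (q - mu0)

    return mu0, cov0, U, grad_U

# Example: the D=100, rho=0.95 target of case study 3, with a start drawn from
# a broad N(mu0, 2 * I) starting distribution.
rng = np.random.default_rng(2)
mu0, cov0, U, grad_U = make_target(100, rho=0.95)
q0 = rng.multivariate_normal(mu0, 2.0 * np.eye(100))
print(U(q0), np.linalg.norm(grad_U(q0)))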
/---- Optimal sampling from the micro-canonical distribution

- Static implementation -- inefficient approach: Generate a random trajectory, integrated L_b steps backward and L_f steps forward such that L = L_b + L_f. Save each point in the trajectory and its pi(z) = e^-H(z). Sample from the multinomial distribution given by pi(z) / sum_z' pi(z').
- Static implementation -- efficient, uniform progressive sampling approach: Start with the initial point. Integrate forward or backward (flip a coin). Compute pi(z) for both and save sum_z pi(z). Use Bernoulli sampling to get the live point. Integrate forward or backward again, and do Bernoulli sampling of the trajectory by comparing the previous sum_z pi(z) to pi(z_new). If the old trajectory is chosen, retain the old point; otherwise, retain the new point. Repeat until the desired trajectory length is reached.
- Static implementation -- efficient, doubling sampling approach: Start with an initial point. Integrate forward or backward, thereby doubling the trajectory length. Do Bernoulli sampling of the trajectory. Next, integrate backward or forward by 2^L_current steps (so initially this is 2), i.e., by the current trajectory length, doubling it again. While doing this, perform uniform progressive sampling: the first new point becomes the live point of the new trajectory, and the live point is then updated as in the uniform progressive sampling approach described above. While updates are being made, be sure to save sum_z pi(z) for the new trajectory. Then perform Bernoulli sampling between the trajectories.
- Static implementation -- efficient, biased sampling approach: Assume we use this in combination with the trajectory-doubling scheme above. Every time we compute a new trajectory (and pick a live point by uniform progressive sampling), we then perform Bernoulli sampling with probability min(1, w_new / w_old) for the new trajectory. If w_new = w_old, as in the case of the doubling scheme with exact energy computation, then we always pick points that are farther away from the initial point. If the trajectory becomes worse, the bias becomes less prominent. (A sketch of the doubling-plus-progressive-sampling bookkeeping is given after this list.)
- Dynamic implementation -- general idea: Necessary to explore all energy level sets equally well. The same as the static implementation, except the procedure is repeated until the termination condition is met. A naive scheme won't work, as it violates time reversibility; the fundamental problem is that certain sub-trajectories would prematurely terminate if they were allowed to be considered.
- Dynamic implementation -- NUTS sampler: Multiplicatively expand the trajectory until the termination condition is reached at both ends. When building the new sub-trajectory, check whether any sub-tree satisfies the termination condition and, if so, reject the expansion and stop building the tree. This requires 2^j - 1 sub-tree termination checks, effectively doubling the dot-product time. As before, perform uniform progressive (multinomial) sampling within each new trajectory, with biased Bernoulli sampling between the new and old trajectories.
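A hedged sketch of the static doubling scheme described above: uniform progressive (multinomial) sampling inside each new sub-trajectory, then biased Bernoulli sampling between the old and new trajectories with probability min(1, w_new / w_old), working in log-weights log pi(z) = -H(z). This only illustrates the bookkeeping; there are no U-turn termination checks or tree recursion, and leapfrog, doubling_sample, and the toy target are assumed names, not the project's API.

import numpy as np

def leapfrog(q, p, grad_U, dt, direction):
    """One leapfrog step, forward (direction=+1) or backward (direction=-1) in time."""
    eps = direction * dt
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def doubling_sample(q0, p0, U, grad_U, dt, n_doublings, rng):
    """At stage j, extend the trajectory by 2**j points in a random direction,
    pick a live point within the extension by progressive multinomial sampling,
    then accept it over the old live point with probability min(1, w_new / w_old)."""
    H = lambda q, p: U(q) + 0.5 * np.dot(p, p)
    ends = {+1: (q0.copy(), p0.copy()), -1: (q0.copy(), p0.copy())}  # trajectory endpoints
    q_live = q0.copy()
    logw_old = -H(q0, p0)                    # log total weight of the old trajectory
    for j in range(n_doublings):
        direction = rng.choice([-1, +1])
        q, p = ends[direction]
        logw_new = -np.inf                   # log total weight of the new sub-trajectory
        q_live_new = None
        for _ in range(2 ** j):              # new sub-trajectory has the same length as the old one
            q, p = leapfrog(q, p, grad_U, dt, direction)
            lw = -H(q, p)
            logw_new = np.logaddexp(logw_new, lw)
            # Uniform progressive sampling within the new sub-trajectory.
            if np.log(rng.uniform()) < lw - logw_new:
                q_live_new = q.copy()
        ends[direction] = (q, p)
        # Biased Bernoulli sampling between old and new trajectories.
        if np.log(rng.uniform()) < logw_new - logw_old:
            q_live = q_live_new
        logw_old = np.logaddexp(logw_old, logw_new)
    return q_live

# Toy usage on a 2-D unit normal target.
rng = np.random.default_rng(3)
U = lambda q: 0.5 * np.dot(q, q)
grad_U = lambda q: q
q0 = np.array([1.0, -1.0])
p0 = rng.standard_normal(2)
print(doubling_sample(q0, p0, U, grad_U, dt=0.1, n_doublings=5, rng=rng))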
Pathology checking: the length of the new sub-trajectory is 2**d. In the diagrams below, each row shows the memory state after one more point is added: x marks points that must still be held (the current point plus the checkpoints against which sub-tree termination is tested), o marks points that have been released.

d=0:
x
done!

d=1:
x
x x
done!

d=2:
x
x x
x o x
x o x x

d=3:
x
x x
x o x
x o x x
x o o o x
x o o o x x
x o o o x o x
x o o o x o x x

m = 1
q0, p0 = q_tmp, p_tmp
ql, pl = q0, p0
qr, pr = q0, p0
while m <= 2**d:
    m = 1 --> x
    m = 2 --> x x |
    m = 3 --> x o | x
    m = 4 --> x o | x x |
    m = 5 --> x o | o o | x
    m = 6 --> x o | o o | x x |
    m = 7 --> x o | o o | x o | x
    m = 8 --> x o | o o | x o | x x |
    m = 9 --> x o | o o | o o | o o | x
    m = 10 --> x o | o o | o o | o o | x x |
    m = 11 --> x o | o o | o o | o o | x o | x
    m = 12 --> x o | o o | o o | o o | x o | x x |
    m = 13 --> x o | o o | o o | o o | x o | o o | x
    m = 14 --> x o | o o | o o | o o | x o | o o | x x |
    m = 15 --> x o | o o | o o | o o | x o | o o | x o | x
    m = 16 --> x o | o o | o o | o o | x o | o o | x o | x x |
    m = 17 --> x o | o o | o o | o o | o o | o o | o o | o o | x
    m = 18 --> x o | o o | o o | o o | o o | o o | o o | o o | x x |
    m = 19 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | x
    m = 20 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | x x |
    m = 21 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | x
    m = 22 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | x x |
    m = 23 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | x o | x
    m = 24 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | x o | x x |
    m = 25 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x
    m = 26 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x x |
    m = 27 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | x
    m = 28 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | x x |
    m = 29 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | o o | x
    m = 30 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | o o | x x |
    m = 31 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | o o | x o | x
    m = 32 --> x o | o o | o o | o o | o o | o o | o o | o o | x o | o o | o o | o o | x o | o o | x o | x x
    m += 1  # Proceed to the next step

Check points -- examples:
m=2 --> 1
m=4 --> 1, 3
m=6 --> 5
m=8 --> 1, 5, 7
m=10 --> 9
m=12 --> 9, 11
m=14 --> 13
m=16 --> 1, 9, 13, 15
m=18 --> 17
m=20 --> 17, 19
m=22 --> 21
m=24 --> 17, 21, 23
m=26 --> 25
m=28 --> 25, 27
m=30 --> 29

Release -- examples:
m, l = 4, 3
m, l = 8, (5, 7) --> 4, (1, 3)
m, l = 12, 11 --> 4, 3
m, l = 20, 19 --> 4, 3
m, l = 24, (21, 23) --> 8, (5, 7) --> 4, (1, 3)
m, l = 28, 27 --> 12, 11 --> 4, 3
m, l = 10, 9 --> 2, 1 x
m, l = 14, 13 --> 6, 5 --> 2, 1 x

Rules:
- If m is a power of 2, then the release set is all odd points in (1, m).
- If m is not a power of 2, then while (d_last > 1) and (m > 4), keep reducing (m, l) by 2**d_last and d_last -= 1. If m == 4, then answer = True.

/---- Short term MHMC

- If the variables are uncorrelated and have unit variance, then MH is effective at doing high-dimensional inference. (A minimal random-walk Metropolis baseline is sketched at the end of these notes.)
- Intuition about why HMC is expected to work better.
- Trajectories with different parameters; 2D Gaussian examples.
  - Different integration time
  - Different momentum distribution
- Show simple cases in 1D and 2D. MH vs. HMC.
- Convergence plot (show for a Gaussian case how it works!). Which one is faster in real time?
- Even though each trajectory spends less time there, more trajectories pass the mode.
- Can choose any distribution, but it is nice to choose something in the right direction.
- Difficult target distribution
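For the MH-vs.-HMC comparisons sketched above, a plain random-walk Metropolis sampler is the natural baseline. The following is a minimal, self-contained sketch; the proposal scale sigma_prop, the 2-D unit-normal target, and the starting point are illustrative choices, not values taken from these notes.

import numpy as np

def metropolis(log_prob, q0, n_iter, sigma_prop, rng):
    """Random-walk Metropolis with an isotropic Gaussian proposal."""
    q = np.array(q0, dtype=float)
    lp = log_prob(q)
    samples = np.empty((n_iter + 1, q.size))
    samples[0] = q
    n_accept = 0
    for i in range(1, n_iter + 1):
        q_prop = q + sigma_prop * rng.standard_normal(q.size)   # propose a local move
        lp_prop = log_prob(q_prop)
        if np.log(rng.uniform()) < lp_prop - lp:                 # Metropolis accept/reject
            q, lp = q_prop, lp_prop
            n_accept += 1
        samples[i] = q
    return samples, n_accept / n_iter

# 2-D unit normal target, as in the simple comparison cases.
rng = np.random.default_rng(4)
log_prob = lambda q: -0.5 * np.dot(q, q)
samples, acc_rate = metropolis(log_prob, q0=[3.0, 3.0], n_iter=2000, sigma_prop=0.5, rng=rng)
print("acceptance rate:", acc_rate)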