Construct Evil Example
Make an evil example.
The idea would be to find an important function (from the literature) and then sample it very nonuniformly.
We could try the model in the (retracted, but still a model) paper:
https://www.nature.com/articles/s41586-021-04096-9
The idea is to find an explicit function in the literature,
which we will then sample. However, we won't sample this uniformly, but rather nonuniformly. (E.g., random numbers but not uniformly distributed.) We'll then see how well diverse-selector works versus random selection.
We could also use actuarial tables.
There are some online cancer-risk calculators.
https://knowyourchances.cancer.gov/big_picture_charts.html
https://www.calculators.org/health/cancer.php
https://www.sciencedirect.com/science/article/pii/S0090429505010071
https://www.mskcc.org/nomograms
https://riskcalc.org/
Some GitHub repositories have calculators.
https://github.com/raghav103/Lung_Cancer_Predictor
https://github.com/PanduDcau/Lung-Cancer-Project
https://github.com/ToshY/pca-riskcalculator
https://github.com/advikmaniar/ML-Healthcare-Web-App/tree/main
https://github.com/videntity/python-framingham10yr/tree/master
https://github.com/Jean-njoroge/Breast-cancer-risk-prediction
@xychem will look at these and decide which is best for generating data. The main goal is that it is favorable to
- have lots of descriptors/dimensions.
- have an easy model that we can run to generate lots of data (beyond what is actually in the training data).
- more training data is good.
- more stars/forks on the repository is good. This means one of the last three is best, probably.
I'd prefer the last one if it meets all the other criteria. However, it may be better to use the next-to-last one as it is a simple function that we can use. (It's the easiest choice.) The easiest way to sample nonuniformly would be to pick a variable (say, age) and sample it very inhomogeneously. Then our predictions should be bad unless we sample with diversity.
It seems that all the calculators on riskcalc.org from the Cleveland Clinic can be accessed in a free-and-open-source way.
https://github.com/orgs/ClevelandClinicQHS/repositories?type=all
There is a lot of data also on risk factors vs incidence at
https://github.com/kritikaparmar-programmer/HealthCheck
Consider the 4-well potential in the attached papers.
- Sample using random numbers in the range $-3 \le R_1 \le 3$ and $-3 \le R_2 \le 3$. Select a huge number of points (~1e6).
- Sample using Boltzmann sampling. For a given (inverse) temperature, keep each point with probability $\exp[-\beta (E - E_0)]$, where $E_0$ is the energy of the lowest-energy structure, 1.780 (see Table 1 in the paper). Since the highest reaction barrier is about 1.5 units above the lowest energy, it's reasonable to consider $\beta = 1.5^{-k}$ where k = 0 is a random sample (so it's not necessary), and k=1, k=2, k=3, etc. are increasingly selective samples.
For a given number of samples,
- Select $S$ points from the randomly generated data. ($\beta = 0$)
- Select $S$ points from Boltzmann sampling with different $\beta$. These $S$ points are chosen at random, because the Boltzmann sample already screened based on energy.
- Select the $S$ points with lowest energy. ($\beta \rightarrow \infty$)
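The three selection regimes above can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_potential` is a hypothetical four-minimum stand-in (not the 4-well potential from the attached papers), `boltzmann_screen` is an illustrative helper name, and $E_0$ is taken as the lowest energy found in the pool rather than the tabulated 1.780.

```python
import numpy as np

def toy_potential(r1, r2):
    # Hypothetical stand-in with minima at (+-1, +-1); NOT the potential from the papers.
    return (r1**2 - 1.0) ** 2 + (r2**2 - 1.0) ** 2

def boltzmann_screen(points, energies, beta, rng):
    """Keep each point with probability exp(-beta * (E - E_0))."""
    e0 = energies.min()  # assumption: lowest pool energy stands in for E_0
    keep_prob = np.exp(-beta * (energies - e0))
    mask = rng.random(len(points)) < keep_prob
    return points[mask], energies[mask]

rng = np.random.default_rng(0)
pool = rng.uniform(-3.0, 3.0, size=(100_000, 2))  # uniform pool on [-3, 3]^2
pool_e = toy_potential(pool[:, 0], pool[:, 1])

S = 50
# beta = 0: choose S points at random from the pool
random_sel = pool[rng.choice(len(pool), size=S, replace=False)]
# finite beta: screen by energy first, then choose S of the survivors at random
kept, kept_e = boltzmann_screen(pool, pool_e, beta=1.0, rng=rng)
boltzmann_sel = kept[rng.choice(len(kept), size=S, replace=False)]
# beta -> infinity: take the S points with lowest energy
lowest_idx = np.argsort(pool_e)[:S]
lowest_sel = pool[lowest_idx]
```

Screening the whole pool in one vectorized pass avoids a Python-level loop over candidate points.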
Try to fit the potential energy curve using a method like Gaussian processes. Measure the error on a regular grid of points in the range $-3 \le R_1, R_2 \le 3$, using:
- Mean absolute error.
- Root-mean-square error.
- Maximum absolute error.
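For concreteness, the three error measures can be computed as in this small sketch; the function name `error_metrics` and the dictionary keys are illustrative choices, not names from the project.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Return mean absolute, root-mean-square, and maximum absolute error."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return {
        "mae": err.mean(),
        "rmse": np.sqrt((err**2).mean()),
        "max": err.max(),
    }

# Example: reference values of zero, predictions off by 1, 1, 1 and 3
metrics = error_metrics([0, 0, 0, 0], [1, -1, 1, -3])
```

In this example the mean absolute error is 1.5, the root-mean-square error is $\sqrt{3}$, and the maximum absolute error is 3, which shows how a single bad point dominates the maximum error.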
The hypothesis is that the errors get worse when biased sampling is performed, but also that diverse selection helps "cure" some of the problems due to biased sampling. So if we randomly sample
- Dey, B. K., & Ayers, P. W. (2006). A Hamilton–Jacobi type equation for computing minimum potential energy paths. Molecular Physics, 104(4), 541-558.
- Liu, Y., Burger, S. K., Dey, B. K., Sarkar, U., Janicki, M. R., & Ayers, P. W. (2010). The Fast Marching Method for Determining Chemical Reaction Mechanisms in Complex Systems. Quantum Biochemistry, 171-195.
For breast cancer dataset.
- Can you fit an SVM on a (random) subset of the data? How does the error increase as the quantity of data decreases? Assume, for now, random sampling.
- Can you get away with less data if you use diverse sampling?
- If you sub-sample the data with a bias (e.g., reject data where the tumor has a large perimeter or large area, using Boltzmann-like screening of the data), can you still fit an SVM? Does the fit work better with a "diverse" sample or a "random" sample?
@PaulWAyers @FanwangM @FarnazH
Here are some crude results from my computer. It seems that:
- Random sampling performs better when k is smaller, whether measured by the maximum error, the mean squared error, or the mean absolute error.
- The mean squared error and mean absolute error decay slowly but behave better than the maximum error; the maximum error does not appear to converge when k is larger, with the largest sampling number being 200.
- The OptiSim method has a smaller maximum error than random sampling, and the maximum error of the OptiSim method converges in every case.
Question:
- Boltzmann sampling: I randomly choose 1e+6 points and calculate their probabilities, then choose 1e+3 points according to those probabilities, and iterate this procedure to get 1e+6 points. This is one of the reasons my calculation is so expensive. I don't know how to do the Boltzmann sampling more economically, so I need your help.
- An account for Compute Canada: I need a Compute Canada account. The job is large when the sampling number is over 100 and it runs slowly on my computer, so I need to submit the jobs to servers. Fanwang taught me how to submit jobs, and I have also registered for the Compute Canada SHARCNET New User / Refresher Webinar tomorrow.
The maximum error for different k with random sampling:
The mean absolute error for different k with random sampling:
The mean squared error for different k with random sampling:
Comparison of OptiSim sampling and random sampling for k=0, 10, 20:
I'd look at a smaller step in k. k=20 is very aggressive screening.
To sample with Boltzmann probability, you just compute the probability, which is
The plots you have are very jumpy, because you need to find a way to make multiple samples of the same size. That's easy for random sampling but hard for most of the methods in DiverseSelector, because (by default) they start with the medoid.
1. The basic idea is to first make a huge sample. I recommended 1e6 points, but you may need more. 1e7 will let you consider a larger value of k, for example.
2. Make a Boltzmann sample for a given $k \cdot \beta$. $k = 0$ corresponds to a random sample. Also consider the 1000 points with lowest energy; this corresponds to $k \rightarrow \infty$. At the end of step 2, you have a sample for each value of k. When the size of the Boltzmann sample is less than 1e3, the sample is too small and cannot be used.
3. Construct sub-samples. For each value of $k$, choose a random sub-sample of 1e3 points. Do this repeatedly (perhaps 10 times), so that you have 10 sub-samples for each $k$. By averaging your results over these sub-samples, you'll help smooth over the "bumps" in the curves.
4. Use random selection, OptiSim, and other algorithms to select $S$ points.
5. Fit using Gaussian Process Regression (or similar).
6. Compute mean-absolute error, root-mean-square error, and maximum absolute error for the regular grid of points from $-3.0 \le x,y \le 3.0$.
7. Average these errors over all of the subsamples from step 3 and plot the result.
I hope this will give smoother plots of error vs.
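The sub-sampling and averaging described in step 3 and the final step could look roughly like this. It is only a sketch: `average_over_subsamples` and `score_fn` are hypothetical names, and `score_fn` stands for whatever maps a sub-sample to an error value (selection, fit, and evaluation on the grid).

```python
import numpy as np

def average_over_subsamples(sample, score_fn, size=1000, n_sub=10, seed=0):
    """Draw n_sub random sub-samples of `size` points and average score_fn over them."""
    if len(sample) < size:
        # Mirrors the rule above: a Boltzmann sample smaller than `size` cannot be used.
        raise ValueError("sample is smaller than the requested sub-sample size")
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_sub):
        idx = rng.choice(len(sample), size=size, replace=False)
        scores.append(score_fn(sample[idx]))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(axis=0), scores.std(axis=0)

def score_fn(points):
    # Placeholder score (mean absolute coordinate); a real run would return an error metric.
    return np.abs(points).mean()

sample = np.random.default_rng(1).uniform(-3.0, 3.0, size=(5000, 2))
mean_score, std_score = average_over_subsamples(sample, score_fn)
```

Returning the standard deviation alongside the mean makes it easy to add error bars, which helps judge whether the remaining bumps in a curve are just noise.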
I used the Boltzmann sampling method that you described, but it takes a long time to get 1e7 points (more than several hours when k is large, so I killed the job). Maybe I need a CC (Compute Canada) account.
And there is a question about the symbol of k. If
New code for Boltzmann sampling:

    import numpy as np

    # Boltzmann (rejection) sampling on the square [-3, 3) x [-3, 3)
    def boltzmann_sample(k, sample_number):
        '''
        k : int
            exponent applied to 1.5; the inverse temperature used is 1.5**k
        sample_number : int
            number of points to collect
        '''
        E_0 = 1.780  # the minimum energy
        rng = np.random.default_rng()  # create the generator once, not on every iteration
        sample_list = []
        while len(sample_list) < sample_number:
            rn = rng.random()  # uniform random number in [0, 1)
            pd = (rng.random(2) - 0.5) * 6  # random point with both coordinates in [-3, 3)
            # acceptance probability; potential_energy is defined elsewhere in the notebook
            p = np.exp(-(1.5**k) * (potential_energy(pd[0], pd[1]) - E_0))
            if p >= rn:  # keep the point with probability p
                sample_list.append(pd)
        return np.array(sample_list)  # array of shape (sample_number, 2)

There are instructions on how to make an account on Compute Canada in the BootCamp repository that I think I shared with you when you started. It's good to use Compute Canada. Ideally, you can generate data for a day or two; more data is a good thing.
As
and
The CCRI is in the bootcamp.
If you fix
I realized there is a simpler way to do the Boltzmann sampling than described previously. See
#144 (comment)
1. The basic idea is to first make a huge sample. I recommended 1e6 points, but you may need more. 1e7 will let you consider a larger value of k, for example.
2. Make several (perhaps 10) Boltzmann samples for a given $k \cdot \beta$. $k = 0$ corresponds to a random sample. Also consider the 1000 points with lowest energy; this corresponds to $k \rightarrow \infty$. At the end of step 2, you have a sample for each value of k. When the size of any of the Boltzmann samples for a given $k$ is less than 1e3, the samples are too small for that value of $k$, and it cannot be used. See the note below on creating independent Boltzmann samples.
3. Use random selection, OptiSim, and other algorithms to select $S$ points.
4. Fit using Gaussian Process Regression (or similar).
5. Compute mean-absolute error, root-mean-square error, and maximum absolute error for the regular grid of points from $-3.0 \le x,y \le 3.0$.
6. Average these errors over all of the Boltzmann samples from step 2 and plot the result.
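To illustrate the selection step, here is a minimal sketch that compares random selection with a greedy MaxMin selection seeded at the medoid. It is written from scratch with NumPy as an illustration; it is not the DiverseSelector implementation or API, it does not implement OptiSim, and the clustered test data is synthetic.

```python
import numpy as np

def maxmin_select(points, n_select):
    """Greedy MaxMin: start at the medoid, then repeatedly add the point
    farthest from everything selected so far."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))      # full pairwise distance matrix
    first = int(np.argmin(dist.sum(axis=1)))    # medoid: smallest total distance
    selected = [first]
    min_dist = dist[first].copy()               # distance to nearest selected point
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.array(selected)

rng = np.random.default_rng(0)
biased = rng.normal(0.0, 0.5, size=(500, 2))    # synthetic, strongly clustered sample
diverse_idx = maxmin_select(biased, 20)
random_idx = rng.choice(len(biased), size=20, replace=False)
```

Because the first point is always the medoid, this selection is deterministic for a given sample, which is why independent Boltzmann samples are needed to obtain several equivalent diverse selections.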
Generating Independent Samples with Boltzmann Probability for a given
I'm sorry for misunderstanding what you wrote before. What I did before was to choose a point by random sampling, calculate its probability, and select it when its probability was larger than a random number. By iterating this, I could build a biased database of 1e7 points. In hindsight, it made little sense to try to generate a Boltzmann-sampled database of 1e7 points.
What I need to do is:
1. Use random sampling to generate a database $D$ of 1e6 to 1e7 points.
2. Apply Boltzmann sampling with different k to the database $D$ to get the sub-database $D_s'$.
3. Randomly select 1e3 points from the sub-database $D_s'$, 10 or more times, to get the sub-sub-databases $D_{ss}''$.
4. Use random selection, OptiSim, and other algorithms to select $S$ points from each sub-sub-database $D_{ss}''$.
5. Fit the PES with a Gaussian process.
6. Compute the mean absolute error, root-mean-square error, and maximum absolute error on a regular grid of 1e4 points from -3 to 3 (1e2 points per axis).
7. Average these errors to get smoother plots.
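The fitting step can be sketched with a bare-bones Gaussian process regressor. This is an illustration only: it uses an RBF kernel with a fixed, hand-picked `length_scale` and a small `noise` jitter (both assumptions), whereas a real study would more likely use a library regressor with hyperparameter optimization. `toy_surface` is a hypothetical target, not the PES from the papers.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel between two sets of points."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_fit_predict(x_train, y_train, x_test, length_scale=1.0, noise=1e-8):
    """Posterior mean of a GP with constant prior mean equal to mean(y_train)."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(k_tt, y_train - y_mean)
    return y_mean + rbf_kernel(x_test, x_train, length_scale) @ alpha

def toy_surface(x):
    # Hypothetical smooth target used only for this example.
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

axis = np.linspace(-3.0, 3.0, 5)
x_train = np.array([(a, b) for a in axis for b in axis])   # 25 training points
y_train = toy_surface(x_train)
x_test = np.random.default_rng(0).uniform(-3.0, 3.0, size=(200, 2))
y_pred = gp_fit_predict(x_train, y_train, x_test)
```

A poorly chosen length scale can make the fit arbitrarily bad regardless of how the points were selected, so the hyperparameter search range deserves as much attention as the sampling.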
Parameter details:
with
- I have changed the code and obtained some results, using 30 random sub-samples to get smoother plots, with the sample number in range(1,300,2). It is hard to use larger samples and more iterations on my computer, and I submitted my Compute Canada application last week.
- A strange thing is that when k = np.inf, the error increases with the sample number. I think there must be something wrong with my code, so I'm checking it.
It will be helpful to explain exactly what these tests are doing. Write your procedure in your own words. Also the label "sample number" should be "number of data points" or something like that. ("Sample number" sounds like you are comparing different (but equivalent) samples.)
Have you tried non-random-sampling?
What's the difference between the first set of plots and the second set of plots? The first set looks (more-or-less) like I expect.
Oh, I think I understand. The plots are the same, but the last one has infinity.
Keep in mind that infinity doesn't actually work in your code; you need to just take the n points with lowest energy. You expect very bad results from this strategy, so it doesn't surprise me if the numbers are bad. But this is deterministic, so there shouldn't be ups-and-downs, I think.
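A minimal sketch of that limit: rather than passing an infinite k to the sampler, sort by energy and keep the first n points. The helper name `lowest_energy_points` and the example values are illustrative.

```python
import numpy as np

def lowest_energy_points(points, energies, n):
    """Return the n points with lowest energy, in order of increasing energy."""
    order = np.argsort(energies, kind="stable")[:n]
    return points[order], energies[order]

energies = np.array([3.0, 1.0, 2.0, 0.5, 4.0])
points = np.arange(10, dtype=float).reshape(5, 2)   # five dummy 2-D points
low_pts, low_e = lowest_energy_points(points, energies, n=2)
```

Here `low_e` is `[0.5, 1.0]`, and the result is the same on every call, so repeated runs give identical selections.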
However, the k=8 case probably should look similar, though perhaps you need a larger value of k to see it. Was k=9 impossible because the sample was too small (there weren't 300 points left to use?)
- Sample number is the number of selected points S in step 4.
- Yes, the two plots are the same, but the last one includes the result for k=np.inf. I am doing the non-random sampling now.
- Yes, I agree that the error for np.inf may be large or may not converge, but it shouldn't increase with the sample number. I chose a wrong interval when optimizing the hyperparameters of the Gaussian process, so the fit is terrible when k is large.
- No, actually when k=np.inf , we also have 2130 points that can be used (I chose 1e7 points in the first step). I used 300 points before because of the computational cost. Is it appropriate to choose k from 0 to 20 with an interval of 2 (which has 10 different k)?
"k=np.inf , we also have 2130 points" is very strange. Do all these points have exactly the same energy?
The plots look right! Our impression was correct: when the sampling is very biased, using diverse sampling is really helpful!
I think we are basically in good shape now. We just need to make pretty plots to explain the story, and polish the notebook.
Can we perform more samples; in step 2 of the procedure maybe we can choose more Boltzmann samples? I think that will make the curves a bit smoother. Just going to 20 or 25 samples might smooth things a lot more.
We can probably consider just k=0,2,4,8,16. I think that already the k=16 is extremely biased; it seems that it has essentially no data in the high-density regions. Just making the same plots as you did above, again, for these values of k will give a lot of insight!
Also, we are computing errors from the grid, not the training data (as described in step 5), correct?
@FarnazH and @ramirandaq do you have any thoughts?
My only hesitation (but maybe k=20 is just far too severe!) is that the maximum error is still really big........
- Actually it is not just one point; the points are centred on the minimum-energy point (1.40,1.78). To briefly describe my work: first I draw 1e7 points and apply Boltzmann sampling, which gives databases of different sizes for different k (the smaller k is, the larger the database). Then I randomly choose 1000 points several times (30 times) to average the error, as you mention in step 3. Finally I test the error of the different selection methods (random, MaxMin, OptiSim) with different numbers of data points S. More details are in my previous comment.
- I used 30 sub-samples before. I think the curves look so jumpy because the change in the error is small (just 0.0002). I will increase the number of sub-samples to 40 or 50.
- I don't use the training data, but some training data may also appear in my test data. I test the error on a uniform grid over (-3,3) with 100 points in each dimension, which means the interval between two points is 6/100 = 0.06. So I get 1e4 points in two dimensions.
- I'm testing the different selection methods with different k. I think we will have more information this week.
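The test grid described above can be built as in this short sketch, which assumes both endpoints are included (the default for `np.linspace`). Under that assumption, 100 points per axis give 100 × 100 = 10 000 grid points and a spacing of 6/99 ≈ 0.0606; a spacing of exactly 0.06 would need `endpoint=False` or 101 points per axis.

```python
import numpy as np

axis = np.linspace(-3.0, 3.0, 100)          # 100 points per axis, endpoints included
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack([xx.ravel(), yy.ravel()])   # one (x, y) test point per row
spacing = axis[1] - axis[0]
```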
The attachment shows the distribution of points for different k before random sub-sampling.
Do these plots of the point distribution show all the points that are selected, or the points after sub-sampling 1000?
I'm curious what k=8 looks like. I'm curious whether there are still points in all 4 wells.
Based on these, I feel like using k=0, 2.5, 5, and 10 may be perfect for this study.
Also, we may want to select more points. How many points do we get with k=14? We might use some number of points close to that for the k-fold sampling (step 2). Little jumps will always be there, and will be a little bit less apparent if we (diversely/randomly) select points at larger intervals (e.g., steps of 5 or 10 points on the x axis in figures like #144 (comment)).
@xychem, I couldn't find your example notebook. Can you please share a link here asap? Preferably you can make a branch for this issue, or just make a PR.
Sorry for the late reply; I've been very busy these days. I have uploaded a folder under notebook. There is still something wrong, which I'm fixing.
(1) Minyaev, R. M.; Quapp, W.; Subramanian, G.; Schleyer, P. von R.; Mo, Y. Internal Conrotation and Disrotation in H2BCH2BH2 and Diborylmethane 1,3 H Exchange. Journal of Computational Chemistry 1997, 18 (14), 1792–1803.