lab-cosmo/glosim

How to handle a similarity/distance matrix with sketchmap?

Opened this issue · 7 comments

yiino commented

As far as I understand, the glosim tool generates a similarity or distance matrix between molecules or structures, whereas the sketchmap tool takes high-dimensional input data.
How can I feed the similarity or distance matrix into sketchmap?
Is there a mode like sklearn.manifold.MDS that handles pre-computed dissimilarities?
(sklearn.manifold.MDS can handle them with the dissimilarity='precomputed' option.)
Regards,

yiino commented

Thank you, I'll try it.

yiino commented

Does 'dimlandmark' work with similarity matrix input? I tried it as follows:

dimlandmark -similarity -n 1000 -mode minmax -w -lowmem < dist_rematch-564_peratom > output

The output looks strange because it contains many identical numbers.
'dist_rematch-564_peratom' is a 564x564 matrix.
The generated output file contains 1000 data lines, and each line contains 1001 numbers, of which the 4th to the 1000th are identical.
Is this the expected behavior? Are the options I passed wrong?
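A quick way to check what a matrix or output file actually contains is to count its rows, its row widths, and how many distinct values appear per row. The following is only a minimal sketch: `read_matrix` and `summarize` are hypothetical helper names, and the file format (whitespace-separated numbers, optional comment lines starting with `#`) is an assumption.

```python
def read_matrix(lines):
    """Parse whitespace-separated numeric rows, skipping blank and '#' lines.

    The '#' comment convention is an assumption about the file format.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(tok) for tok in line.split()])
    return rows


def summarize(rows):
    """Return (row count, set of row widths, largest number of distinct values in a row)."""
    widths = {len(r) for r in rows}
    max_distinct = max((len(set(r)) for r in rows), default=0)
    return len(rows), widths, max_distinct


# Tiny in-memory stand-in for a real file.
demo = ["# comment", "0.0 0.5 0.5", "0.5 0.0 0.5", "0.5 0.5 0.0"]
n_rows, widths, max_distinct = summarize(read_matrix(demo))
print(n_rows, widths, max_distinct)
```

For a real file, pass an open file handle instead of `demo`, e.g. `read_matrix(open("dist_rematch-564_peratom"))`. A single width of 564 and 564 rows would confirm the input is the square matrix described above.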

yiino commented

Hi Sandip, thank you for your quick response.

Sorry to ask again, but I'm stuck on projecting the whole data set using the high-d and low-d landmarks.
Is there anything to watch out for when using landmark files generated from a similarity matrix?

dimproj -P dist-landmark100.k -p dist100.gmds -similarity -fun-hd 0.05,3,2 -fun-ld 0.05,1,1 < dist_rematch-564_peratom > dist_rematch-564_peratom.lowd
Error in main:
HD and LD point list mismatch

'dist100.gmds' holds the low-d landmarks, while 'dist-landmark100.k' holds the high-d landmarks.
The former contains 100 lines with 3 columns each (comment lines removed).
The latter was generated with the 'select_landmarks.py' utility and contains 100x100 numbers.
'dist_rematch-564_peratom' is the similarity matrix, which contains 564x564 numbers.

'select_landmarks.py' was run as follows:

python select_landmarks.py --mode fps --output kernel --nland 100 --prefix dist dist_rematch-564_peratom

This gives me the 'dist-landmark100.k' file, from which I can get a low-dimensional representation using 'sketch-map.sh':

./sketch-map.sh
Please enter the dimensionality of input data 100
Are we reading the similarity matrix? y
Please enter the input data file name dist-landmark100.k
Please enter the output data prefix dist100
Please enter high dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 3 2
Please enter low dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 1 1

Regards,
Y.Iino

yiino commented

Hi Sandip, thank you for your kind advice.
I think I can finally get the sketch-map image.
As you pointed out, I was confusing similarity (distance) with kernel.
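To make the kernel versus distance distinction concrete: a kernel matrix can be turned into a distance matrix with the standard kernel-induced distance, d_ij = sqrt(k_ii + k_jj - 2 k_ij). The sketch below assumes that convention; it is not taken from the glosim source, and `kernel_to_distance` is a hypothetical helper name.

```python
import math


def kernel_to_distance(kernel):
    """Convert a square kernel matrix into a kernel-induced distance matrix.

    Uses d_ij = sqrt(k_ii + k_jj - 2*k_ij), which is an assumed convention,
    not necessarily the one glosim applies.
    """
    n = len(kernel)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = kernel[i][i] + kernel[j][j] - 2.0 * kernel[i][j]
            # Clamp tiny negative values caused by floating-point rounding.
            dist[i][j] = math.sqrt(max(sq, 0.0))
    return dist


# Two items with unit self-similarity and mutual similarity 0.5.
kernel = [[1.0, 0.5], [0.5, 1.0]]
dist = kernel_to_distance(kernel)
print(dist)
```

For a normalized kernel (ones on the diagonal) this reduces to sqrt(2 - 2 k_ij), so identical items have distance 0 and larger kernel values mean smaller distances.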

For selecting landmarks, the input is kernel data ("sim_rematch-564_peratom.ssv" is kernel data):

$ python select_landmarks.py --mode fps --output distance --nland 100 --prefix dist sim_rematch-564_peratom.ssv

This generates "dist-landmark100.sim" (the reduced distance matrix) and
"dist-landmark100-OOS.sim" (the coordinates of all points, based on the coordinate system of the landmarks).
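The `--mode fps` option presumably stands for farthest point sampling. The sketch below illustrates the general idea on a small distance matrix; it is not the code in select_landmarks.py, and `farthest_point_sampling` with its `start` parameter is a hypothetical helper.

```python
def farthest_point_sampling(dist, n_landmarks, start=0):
    """Greedily pick landmark indices from a square distance matrix.

    Each step adds the point whose distance to its nearest already-selected
    landmark is largest. Illustrative only; not the select_landmarks.py code.
    """
    n = len(dist)
    n_landmarks = min(n_landmarks, n)  # cannot select more landmarks than points
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected landmark
    min_dist = list(dist[start])
    while len(selected) < n_landmarks:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return selected


# Four points on a line at positions 0, 1, 2 and 10.
coords = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in coords] for a in coords]
landmarks = farthest_point_sampling(dist, 3)
print(landmarks)
```

Starting from point 0, the outlier at position 10 is picked first, then the point at position 2, which is the farthest from both existing landmarks.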

For the dimensionality reduction, "sketch-map.sh" takes "dist-landmark100.sim" as input.

$ ./sketch-map.sh
Please enter the dimensionality of input data 100
Are we reading the similarity matrix? y
Please enter the input data file name dist-landmark100.sim
Please enter the output data prefix dist100
Please enter high dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 3 2
Please enter low dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 1 1

The projection of all points is done as follows.
The key point seems to be using "dist-landmark100-OOS.sim" as input. (Is that right?)
Also, the comment lines and the third column must be removed from the file "dist100.gmds"
generated by sketch-map.sh.

$ dimproj -D 100 -d 2 -P dist-landmark100.sim -p lowd-landmarks_ -similarity -fun-hd 0.05,3,2 -fun-ld 0.05,1,1 -cgmin 3 < dist-landmark100-OOS.sim > dist-landmark100-OOS.sim.lowd
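The clean-up of "dist100.gmds" mentioned above (dropping comment lines and the third column) can be scripted. This is a minimal sketch under two assumptions: comment lines start with `#`, and columns are separated by whitespace. `strip_gmds` is a hypothetical helper name.

```python
def strip_gmds(lines, drop_column=2):
    """Drop comment/blank lines and one column (0-based index) from each row.

    Assumes '#' marks comment lines and columns are whitespace-separated.
    """
    cleaned = []
    for line in lines:
        s = line.strip()
        if not s or s.startswith("#"):
            continue
        fields = s.split()
        kept = [f for i, f in enumerate(fields) if i != drop_column]
        cleaned.append(" ".join(kept))
    return cleaned


# Tiny in-memory stand-in for the contents of a .gmds file.
demo = ["# header", "0.10 0.20 1.0", "0.30 -0.40 1.0"]
cleaned = strip_gmds(demo)
print("\n".join(cleaned))
```

With a real file, read the lines from "dist100.gmds" and write the cleaned lines to whichever file is passed to the `-p` option of dimproj.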

Thank you.
yiino