Possible non-random sampling
austinjp opened this issue · 2 comments
Hi there. Firstly, thanks for a3k, I'm finding it very useful.
I noticed a problem when using --sample 'random.random() < 0.0001' to randomly sample from the latest Crossref dataset. It seemed to produce identical samples each time, whereas I was expecting a different sample on each run. I've not yet looked through the code, but I wondered if it might be an issue with seeding the random generator? Perhaps this is expected behaviour, so apologies if I missed this in the docs.
An example:
$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/
$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id doi title
----- --------------------------- ------------------------------------------------------------------------------------------
0 10.1007/978-3-658-29701-5_1 Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383 10.18356/98a0368f-en-fr No. 47244 International Bank for Reconstruction and Development and Brazil
$ rm /tmp/crossref.db
$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/
$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id doi title
----- --------------------------- ------------------------------------------------------------------------------------------
0 10.1007/978-3-658-29701-5_1 Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383 10.18356/98a0368f-en-fr No. 47244 International Bank for Reconstruction and Development and Brazil
Notice the identical results after deleting and recreating the database with a 'fresh' sample. I was expecting a random sample, and hence a different one each time.
Some quick sanity checks:
$ sqlite3 -batch /tmp/crossref.db 'select count(*) from works;'
count(*)
--------
10000
$ ls -l data-files/crossref/ | head -n 4
total 185934MB
-rwxrwxrwx 1 austinjp austinjp 8MB 2023-08-10 19:35 0.json.gz
-rwxrwxrwx 1 austinjp austinjp 11MB 2023-08-10 20:10 10000.json.gz
-rwxrwxrwx 1 austinjp austinjp 7MB 2023-08-10 20:11 10001.json.gz
$ ls -1 data-files/crossref/ | wc -l
28702
Workaround
As a workaround, I use --sample '( random.seed() ) or random.random() < 0.0001' to re-seed the random generator at every sample decision. It's inefficient, but it gives the results I'd expected:
$ rm /tmp/crossref.db
$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/
$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id doi title
----- -------------------------------- -----------------------------------------------------------------------
0 10.1097/00001721-199206000-00004 Protein S negates the activated protein C inhibitory activity of plasma
37767 10.1177/109980040000200201 Summer Camp for Scientists
$ rm /tmp/crossref.db
$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/
$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id doi title
----- ----------------------------- --------------------------------------------------------------------------------------------------------------------------
0 10.15296/ijwhr.2017.33 Health Promoting Behaviors and Self-efficacy of Physical Activity During Pregnancy: An Interventional Study
21383 10.1016/s0973-0826(08)60299-9 Risk and uncertainty analysis of natural environmental assets threatened by hydropower projects: case study from Sri Lanka
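For anyone wondering why that expression works at all, here is a minimal sketch in plain Python. It relies only on the standard library's behaviour: random.seed() with no argument re-seeds from the operating system's entropy source (or the current time) and returns None. The helper name sample_decision is made up for illustration and is not part of a3k.

```python
import random


def sample_decision(threshold: float) -> bool:
    """Hypothetical stand-in for the workaround expression.

    random.seed() returns None, which is falsy, so `or` always falls
    through to the comparison on its right. `<` binds more tightly than
    `or`, so the expression reads as
    (random.seed()) or (random.random() < threshold).
    """
    return (random.seed()) or random.random() < threshold


# Even if the generator was given a constant seed earlier, each decision
# re-seeds it first, so the earlier seed no longer determines the draws.
random.seed(0)
decisions = [sample_decision(0.5) for _ in range(10)]
print(decisions)
```

The cost is one re-seed per record, which is where the inefficiency comes from.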
Best wishes.
Thank you for your kind words! The random number generator is indeed seeded with a constant at the beginning of the program's operation so that a3k results are repeatable. The workaround you suggest should help with your process. We could add an option to set the seed (or not set it at all) if you think this is an important feature.
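To illustrate the effect of a constant seed, here is a small standalone sketch. It is not a3k's code: the function sample_ids, the seed value 12345, and the record counts are all assumptions chosen for the demonstration.

```python
import random


def sample_ids(n_records: int, rate: float, seed=None) -> list:
    """Return the indices of records kept by a `random() < rate` filter.

    seed=None lets Python seed the generator from OS entropy or the
    clock; any fixed value makes the selection repeatable.
    """
    rng = random.Random(seed)
    return [i for i in range(n_records) if rng.random() < rate]


# A constant seed (12345 is an arbitrary, hypothetical value) selects
# exactly the same records on every run.
fixed_a = sample_ids(100_000, 0.001, seed=12345)
fixed_b = sample_ids(100_000, 0.001, seed=12345)
print(fixed_a == fixed_b)

# Without a seed, two runs will almost certainly select different records.
fresh_a = sample_ids(100_000, 0.001)
fresh_b = sample_ids(100_000, 0.001)
print(fresh_a[:5], fresh_b[:5])
```

This is the trade-off in the thread: a fixed seed gives repeatable databases, while an unseeded generator gives a new sample on each invocation.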
Looking forward to reading about your results.
Hi again. No problem! 😃 I just had a look through the code and yep, I spotted the deterministic seeding. I appreciate that it might be useful, I just wasn't anticipating it. Perhaps the docs could highlight the fact that the sampling is deterministic? I'll send a PR, feel free to use/ignore as you see fit.
My workaround 'works', although it's inefficient. I guess that's not really a problem in practice, since it's plenty fast enough for my needs. A CLI flag for setting the seed (or for requesting a fresh seed on every invocation) might be good, though, since it would give users more control. But this is more of a feature request than an issue, so I'm happy for this to be closed.
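As a rough idea of what such a flag could look like, here is a sketch using argparse. Everything in it is hypothetical: the --seed option, its "fixed" and "fresh" values, the DEFAULT_SEED constant, and the helper names are invented for this example and do not describe a3k's actual interface.

```python
import argparse
import random

DEFAULT_SEED = 0  # hypothetical constant; a3k's real seed value may differ


def build_parser() -> argparse.ArgumentParser:
    """Build a demo parser with a made-up --seed option."""
    parser = argparse.ArgumentParser(prog="seed-demo")
    parser.add_argument(
        "--seed",
        default="fixed",
        help="'fixed' (repeatable default), 'fresh' (new seed each run), "
        "or an integer chosen by the user",
    )
    return parser


def apply_seed(value: str) -> None:
    """Seed the global generator according to the option value."""
    if value == "fixed":
        random.seed(DEFAULT_SEED)
    elif value == "fresh":
        random.seed()  # seeds from OS entropy or the current time
    else:
        random.seed(int(value))  # raises ValueError on non-integer input


args = build_parser().parse_args(["--seed", "7"])
apply_seed(args.seed)
print(random.random() < 0.0001)
```

Keeping "fixed" as the default would preserve the current repeatable behaviour, and sampling would no longer need a re-seed on every record.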