
Cannabis Data Analysis Portal


Data

An Open Data Portal. This tool is designed to import data directly from BioTrack, LeafData, or METRC into a common schema. Then you can run reports on this dataset.

Loading Akerna/MJFreeway/MJPlatform/LeafData Data

Get the ZIP files or table dumps from LeafData, then:

  • Put the LeafData CSV/ZIP files in some origin directory (/mnt/leafdata).
  • Extract and prepare them as necessary.
  • Symlink those files into ./source-data/$NAME.csv, using the name required by the import script.
  • Remove the origin file, leaving an orphan symlink to track what's been completed.
  • Clean up when all the way done.
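A minimal sketch of that flow, assuming /mnt/leafdata as the origin directory; dump.zip and inventory.csv are hypothetical names, so adjust them to whatever the import script expects:

unzip /mnt/leafdata/dump.zip -d /mnt/leafdata
ln -s /mnt/leafdata/inventory.csv ./source-data/inventory.csv
./bin/leafdata.sh                        # import (see Importing below)
rm /mnt/leafdata/inventory.csv           # leaves the orphan symlink as a "done" marker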

BOM, UTF-16

Sometimes the files arrive with a BOM, and some of the files are UTF-16-LE.

iconv

  • Use iconv to convert them, e.g. iconv -f UTF-16LE -t UTF-8 -o $NAME.utf8.csv $NAME.csv
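A quick way to spot which files need converting (file reports UTF-16 encodings and BOMs):

file -L ./source-data/*.csv    # -L follows the symlinks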

Extracting

The files may be delivered as ZIP archives, and internally they are labeled as CSV files. The delimiter may be a comma or a TAB.
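One quick way to check the contents and the delimiter before importing (dump.zip and inventory.csv are placeholder names):

unzip -l dump.zip                  # list the files inside the archive
head -n1 inventory.csv | cat -A    # TABs show up as ^I, commas as-is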

Importing

Use ./bin/leafdata.sh. I usually let it import one file at a time, and time each run.
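For example, assuming the script imports whatever is symlinked into ./source-data:

time ./bin/leafdata.sh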


Loading BioTrack Data

Fetch the database from the state system; it's stored in ./var/biotrack_*

Then run the scripts in ./bin. Once that is finished, the BioTrack database will be mostly ready to go.

export SQL_FILE="/opt/data.openthc.org/var/biotrackthc_20170228.db"
./bin/load-biotrack-01.sh
./bin/load-biotrack-02.sh
./bin/load-biotrack-03.sh

Run php ./bin/load-biotrack-02.php to fill out the extended data.

Now prime the datasets for the web view:

  • bash -x ./bin/data-prime-01.sh

  • bash -x ./bin/data-prime-02.sh

To warm the cache, collect the most-requested paths from the Apache access logs and replay them:

    zgrep 'data.openthc.org' /var/log/apache2/access.log* >> request.raw
    cut -d'"' -f2 request.raw > request.log
    sed -i 's/GET //' request.log
    sed -i 's/OPTIONS //' request.log
    sed -i 's| HTTP/1.1||' request.log
    sed -i 's| HTTP/1.0||' request.log
    sed -i '/^$/d' request.log
    sort request.log > x && mv x request.log
    uniq -c request.log | sort -nr > x && mv x request.log
    awk '$1 > 4 { print $2 }' request.log
    for U in $(awk '{ print $2 }' request.log); do echo "curl -qs \"https://data.openthc.org$U\""; done > x.sh
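Then run the generated script (x.sh from the step above) to issue the requests:

    bash x.sh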

Data Information

Product Categories: all the Standard Numbers, plus Joints and BHO.

Group by Sizes, but find the common Sizes/Size Groups first!

  • Facet: License
  • Facet: Product
  • Facet: Variety

Retail Detail

Data!

The data scripts make TSV files from the SQL. Each TSV answers one question, built around these "base facets":

  • Category: Plants, Wholesale, Retail, QA
  • An Inventory Type (5, 13, 28, etc)
  • A Date Delta - Daily, Weekly, Monthly, Quarterly, Yearly

The answers are "business questions", like price per gram or volume. Details are in the ./data-*.php scripts; the question is selected by the 'a' parameter.

Each data category is answered by a PHP script, ./data-*.php. Those scripts are called by the wrapper ./data.php, like this:

./data.php wholesale d=w t=28 a=ppg-avg

That will in turn execute code in ./data-wholesale.php and pass the parameters in as $_GET values.

Those $_GET values control the logic that assembles the SQL, in the form of:

SELECT (value-expression)
FROM (necessary-tables)
	JOIN (may-need-one-or-more-of-these)
WHERE (desired-constructed-filter)
	AND (additional-conditions-we-discover-to-remove-crap-data)

It then iterates over the Date Delta, spewing the results into a TSV file. More precisely, the PHP scripts write to STDOUT, which is redirected to capture the output.
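For example (the output path here is a hypothetical naming scheme, not necessarily the one the wrappers use):

./data.php wholesale d=w t=28 a=ppg-avg > ./webroot/pub/data/wholesale-w-28-ppg-avg.tsv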

Shell Wrappers

The shell wrappers (data-*.sh) are designed to iterate over the "base facets":

# Sketch in bash: the actual facet lists live in each data-*.sh wrapper;
# any additional facet ("OtherThing") nests the same way.
for D in "${DATE_LIST[@]}"; do
	for T in "${TYPE_LIST[@]}"; do
		./data.php [category] d=$D t=$T a=ppg-avg \
			> ./webroot/pub/data/[predictable-path-here-somehow]
	done
done

We kind of know these to be the base queries for the data that people want, so we can pre-build them.

Time to First Harvest

Time from Harvest to Cure

Store-level summary, by inventory type, of what they paid for X and what they charged for X, where X is the number of units or grams sold during the time period in question. It should carry a level of detail that allows drill-down to a specific processor or producer.

This should enable a bunch of stuff I'll add to the document later, such as:

  • Markup by County
  • Chocolate vs Cookies
  • Regions/Strains on Sales