Primary language: V. License: MIT.

Pval - PDF and CDF Utility

Pval is a utility for generating a PDF (probability density function) and CDF (cumulative distribution function) from a data-input file. It is a two-pass utility: the output of the first execution can be fed back into the utility to generate "p-values" (e.g. P10, P50, P90). The p-value PXX is the value below which XX % of the observations are expected to fall, which makes it effectively the inverse of the CDF.

Pval is unique in that it can also run this analysis on dates that follow the ISO 8601 date convention (yyyy-mm-dd).

This utility can be run manually or integrated into an automated process or workflow. In general, these utilities follow the UNIX philosophy as much as possible.

Project Motivation

PDF and CDF distributions are not easy to generate outside of statistical packages such as R, Matlab, or Mathematica. Moreover, working with dates can be difficult.

Pval is particularly useful for importing the output into BI tools and generating a combination graph with a bar chart (PDF) and a line chart (CDF). P-values can also be plotted to show the inverse of the CDF.

Pre-Compiled Binaries

Binaries (.exe) for Windows have been pre-compiled and can be found in the 'bin' folder or in the "Releases" section on GitHub.

With git, you can download all the latest source and binaries with git clone https://github.com/chipnetics/pval

Alternatively, if you don't have git installed:

  1. Download the latest release here
  2. Unzip to a local directory.
  3. Navigate to 'bin' directory for executables.

Compiling from Source

Utilities are written in the V programming language and will compile under Windows, Linux, and MacOS.

V is syntactically similar to Go and as fast as C. You can read about V here.

After installing the latest V compiler, compiling is as easy as executing the commands below. Make sure the V compiler root directory is included in your PATH environment variable.

git clone https://github.com/chipnetics/pval
cd pval/src
v pval.v

Alternatively, if you don't have git installed:

  1. Download the bundled source here
  2. Unzip to a local directory
  3. Navigate to src directory and run v pval.v

Please see the V language documentation for further help if required.

Running Optional Command Line Arguments

For Windows users, if you want to pass optional command line arguments to an executable:

  1. Navigate to the directory of the utility.
  2. In the navigation (file-path) bar of Windows Explorer type "cmd" and hit enter to get a command prompt.
  3. Type the name of the exe along with the optional argument (e.g. pval.exe --help).

Options:
  -c, --column <int>        Column index to generate distribution from
  -d, --is-date             Indicate column is date (fmt: yyyy-mm-dd)
  -b, --binsize <int>       Bin sizing
  -n, --no-header           Indicate input file has no header
  -e, --expand              Expand output generated by pval
  -f, --file-in <string>    Input file
  -h, --help                display this help and exit
  --version                 output version information and exit
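
For automated workflows, the options above can also be assembled from a script. The following is a minimal Python sketch, not part of pval itself: the helper names pval_command and run_to_file are hypothetical, the file names are placeholders, and it assumes pval.exe is reachable from the working directory or PATH. It uses only the flags listed in the help text.

```python
import subprocess


def pval_command(exe, file_in, column=None, binsize=None,
                 is_date=False, no_header=False, expand=False):
    """Build an argument list from the documented pval options (hypothetical helper)."""
    cmd = [exe, "-f", file_in]
    if column is not None:
        cmd += ["-c", str(column)]
    if binsize is not None:
        cmd += ["-b", str(binsize)]
    if is_date:
        cmd.append("-d")
    if no_header:
        cmd.append("-n")
    if expand:
        cmd.append("-e")
    return cmd


def run_to_file(cmd, out_path):
    """Run the command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# First pass builds the PDF/CDF table; second pass expands it into PXX values.
first_pass = pval_command("pval.exe", "big_data_set.txt", column=3, binsize=1)
second_pass = pval_command("pval.exe", "result.txt", expand=True)
print(first_pass)
print(second_pass)
```

Calling run_to_file(first_pass, "result.txt") would then play the role of the shell redirection (>) used when running pval by hand.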

Viewing Large Files

As an aside, the author recommends the excellent tool EmEditor by Emurasoft for manually editing or viewing large text-based files for data science and analytics. Check out the tool here. EmEditor is paid software but well worth the investment to increase efficiency.

Examples

Numerical data

A relatively basic example on somewhat uniform data. The command below generates the PDF and CDF from column 3 of the input file big_data_set.txt, with a bin size of 1.

pval.exe -f big_data_set.txt -b 1 -c 3 > result.txt

bin     lower_bound   upper_bound  dsc_count  cml_count  pdf                  cdf
1       1             2            8223       8223       0.1318105313777351   0.1318105313777351
2       2             3            8223       16446      0.1318105313777351   0.2636210627554701
3       3             4            5250       21696      0.0841548449146429   0.3477759076701130
4       4             5            5121       26817      0.0820870401538832   0.4298629478239962
5       5             6            5091       31908      0.0816061553257995   0.5114691031497957
6       6             7            5082       36990      0.0814618898773744   0.59293099302717
7       7             8            5082       42072      0.0814618898773744   0.6743928829045444
8       8             9            5082       47154      0.0814618898773744   0.7558547727819187
9       9             10           5082       52236      0.0814618898773744   0.837316662659293
10      10            11           5079       57315      0.0814138013945660   0.918730464053859
11      11            12           5070       62385      0.0812695359461409   1.
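
To make the columns above concrete, here is a minimal Python sketch of how such a table could be built from raw numbers. It is an illustration rather than pval's actual implementation: the function name build_distribution is hypothetical, and it assumes that the first bin starts at the smallest value and that bins are half-open intervals [lower_bound, upper_bound).

```python
def build_distribution(values, binsize):
    """Bin numeric values and accumulate counts, PDF and CDF per bin."""
    start = min(values)  # assumption: first bin starts at the minimum value
    nbins = int((max(values) - start) // binsize) + 1
    counts = [0] * nbins
    for v in values:
        counts[int((v - start) // binsize)] += 1

    total = len(values)
    rows, running = [], 0
    for i, count in enumerate(counts):
        running += count
        rows.append({
            "bin": i + 1,
            "lower_bound": start + i * binsize,
            "upper_bound": start + (i + 1) * binsize,
            "dsc_count": count,       # discrete count in this bin
            "cml_count": running,     # cumulative count up to this bin
            "pdf": count / total,     # share of all values in this bin
            "cdf": running / total,   # share of all values up to this bin
        })
    return rows


rows = build_distribution([1, 1, 2, 3, 3, 3, 4], binsize=1)
for row in rows:
    print(row["bin"], row["lower_bound"], row["upper_bound"],
          row["dsc_count"], row["cml_count"])
```

In this sketch pdf is the discrete count divided by the total number of values and cdf is the running sum of pdf, which matches the relationship between the dsc_count, cml_count, pdf, and cdf columns in the output above.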

Bootstrapping the result to generate PXX values.

pval.exe -f result.txt -e

pxx     as_dec  bin_l   bin_u   p-value
P00     0.00    1       2       1.000
P01     0.01    1       2       1.076
P02     0.02    1       2       1.152
P03     0.03    1       2       1.228
P04     0.04    1       2       1.303
P05     0.05    1       2       1.379
P06     0.06    1       2       1.455
P07     0.07    1       2       1.531
P08     0.08    1       2       1.607
P09     0.09    1       2       1.683
P10     0.10    1       2       1.759
P10     0.11    1       2       1.835
P11     0.12    1       2       1.910
P12     0.13    1       2       1.986
P13     0.14    2       3       2.062
P15     0.15    2       3       2.138
P16     0.16    2       3       2.214
P17     0.17    2       3       2.290
P18     0.18    2       3       2.366
P19     0.19    2       3       2.441
P20     0.20    2       3       2.517
P21     0.21    2       3       2.593
P22     0.22    2       3       2.669
P23     0.23    2       3       2.745
P24     0.24    2       3       2.821
P25     0.25    2       3       2.897
P26     0.26    2       3       2.973
P27     0.27    3       4       3.076
P28     0.28    3       4       3.195
P29     0.29    3       4       3.313
P30     0.30    3       4       3.432
P31     0.31    3       4       3.551
P32     0.32    3       4       3.670
P33     0.33    3       4       3.789
P34     0.34    3       4       3.908
P35     0.35    4       5       4.027
P36     0.36    4       5       4.149
P37     0.37    4       5       4.271
P38     0.38    4       5       4.393
P39     0.39    4       5       4.514
P40     0.40    4       5       4.636
P41     0.41    4       5       4.758
P42     0.42    4       5       4.880
P43     0.43    5       6       5.002
P44     0.44    5       6       5.124
P45     0.45    5       6       5.247
P46     0.46    5       6       5.369
P47     0.47    5       6       5.492
P48     0.48    5       6       5.614
P49     0.49    5       6       5.737
P50     0.50    5       6       5.859
P51     0.51    5       6       5.982
P52     0.52    6       7       6.105
P53     0.53    6       7       6.227
P54     0.54    6       7       6.350
P55     0.55    6       7       6.473
P56     0.56    6       7       6.596
P57     0.57    6       7       6.719
P58     0.58    6       7       6.841
P59     0.59    6       7       6.964
P60     0.60    7       8       7.087
P61     0.61    7       8       7.210
P62     0.62    7       8       7.332
P63     0.63    7       8       7.455
P64     0.64    7       8       7.578
P65     0.65    7       8       7.701
P66     0.66    7       8       7.823
P67     0.67    7       8       7.946
P68     0.68    8       9       8.069
P69     0.69    8       9       8.192
P70     0.70    8       9       8.314
P71     0.71    8       9       8.437
P72     0.72    8       9       8.560
P73     0.73    8       9       8.683
P74     0.74    8       9       8.805
P75     0.75    8       9       8.928
P76     0.76    9       10      9.051
P77     0.77    9       10      9.174
P78     0.78    9       10      9.296
P79     0.79    9       10      9.419
P80     0.80    9       10      9.542
P81     0.81    9       10      9.665
P82     0.82    9       10      9.787
P83     0.83    9       10      9.910
P84     0.84    10      11      10.033
P85     0.85    10      11      10.156
P86     0.86    10      11      10.279
P87     0.87    10      11      10.401
P88     0.88    10      11      10.524
P89     0.89    10      11      10.647
P90     0.90    10      11      10.770
P91     0.91    10      11      10.893
P92     0.92    11      12      11.016
P93     0.93    11      12      11.139
P94     0.94    11      12      11.262
P95     0.95    11      12      11.385
P96     0.96    11      12      11.508
P97     0.97    11      12      11.631
P98     0.98    11      12      11.754
P99     0.99    11      12      11.877
P100    1.00    11      12      12
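
The PXX values above are consistent with linear interpolation inside the bin where the CDF first reaches the requested probability. The Python sketch below illustrates that idea using the pdf and cdf values from the first-pass table; the function name p_value is hypothetical, and the interpolation rule is inferred from the output rather than taken from pval's source.

```python
# (lower_bound, upper_bound, pdf, cdf) copied from the first-pass output above.
ROWS = [
    (1, 2, 0.1318105313777351, 0.1318105313777351),
    (2, 3, 0.1318105313777351, 0.2636210627554701),
    (3, 4, 0.0841548449146429, 0.3477759076701130),
    (4, 5, 0.0820870401538832, 0.4298629478239962),
    (5, 6, 0.0816061553257995, 0.5114691031497957),
    (6, 7, 0.0814618898773744, 0.59293099302717),
    (7, 8, 0.0814618898773744, 0.6743928829045444),
    (8, 9, 0.0814618898773744, 0.7558547727819187),
    (9, 10, 0.0814618898773744, 0.837316662659293),
    (10, 11, 0.0814138013945660, 0.918730464053859),
    (11, 12, 0.0812695359461409, 1.0),
]


def p_value(rows, p):
    """Interpolate the value at cumulative probability p (0 <= p <= 1)."""
    prev_cdf = 0.0
    for lower, upper, pdf, cdf in rows:
        # Skip empty bins; pick the first bin whose CDF reaches p.
        if pdf > 0 and p <= cdf:
            fraction = (p - prev_cdf) / pdf
            return lower + fraction * (upper - lower)
        prev_cdf = cdf
    return rows[-1][1]


for p in (0.10, 0.50, 0.90):
    print(f"P{int(round(p * 100)):02d} {p_value(ROWS, p):.3f}")
# P10 1.759
# P50 5.859
# P90 10.770
```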

Date data

An example using dates. The command below generates the PDF and CDF from column 5 of the input file dates_data.txt, with a bin size of 14 days (bi-weekly). Notice the -d flag, which indicates that column 5 is a date column.

pval.exe -f dates_data.txt -b 14 -c 5 -d > pdf_cdf_dates.txt

bin     lower_bound     upper_bound     dsc_count    cml_count   pdf                     cdf
1       2021-03-22      2021-04-05      87           87          0.0580386924616411      0.0580386924616411
2       2021-04-05      2021-04-19      64           151         0.0426951300867245      0.1007338225483656
3       2021-04-19      2021-05-03      88           239         0.0587058038692462      0.1594396264176118
4       2021-05-03      2021-05-17      28           267         0.0186791194129420      0.1781187458305537
5       2021-05-17      2021-05-31      90           357         0.0600400266844563      0.2381587725150100
6       2021-05-31      2021-06-14      87           444         0.0580386924616411      0.2961974649766511
7       2021-06-14      2021-06-28      112          556         0.0747164776517679      0.370913942628419
8       2021-06-28      2021-07-12      108          664         0.0720480320213476      0.4429619746497665
9       2021-07-12      2021-07-26      106          770         0.0707138092061374      0.513675783855904
10      2021-07-26      2021-08-09      127          897         0.0847231487658439      0.5983989326217478
11      2021-08-09      2021-08-23      152          1049        0.1014009339559707      0.6997998665777185
12      2021-08-23      2021-09-06      121          1170        0.0807204803202135      0.7805203468979319
13      2021-09-06      2021-09-20      116          1286        0.0773849232821881      0.8579052701801201
14      2021-09-20      2021-10-04      0            1286        0.0000000000000000      0.8579052701801201
15      2021-10-04      2021-10-18      0            1286        0.0000000000000000      0.8579052701801201
16      2021-10-18      2021-11-01      0            1286        0.0000000000000000      0.8579052701801201
17      2021-11-01      2021-11-15      0            1286        0.0000000000000000      0.8579052701801201
18      2021-11-15      2021-11-29      0            1286        0.0000000000000000      0.8579052701801201
19      2021-11-29      2021-12-13      0            1286        0.0000000000000000      0.8579052701801201
20      2021-12-13      2021-12-27      0            1286        0.0000000000000000      0.8579052701801201
21      2021-12-27      2022-01-10      1            1287        0.0006671114076051      0.8585723815877252
22      2022-01-10      2022-01-24      212          1499        0.1414276184122749      1.
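
Date binning can be pictured as counting whole days from the earliest date and grouping them into fixed-width bins. The following Python sketch shows this for ISO 8601 dates; the function name build_date_bins is hypothetical, and starting the first bin at the earliest date is an assumption, not a documented behaviour of pval.

```python
from datetime import date, timedelta


def build_date_bins(iso_dates, bin_days):
    """Group yyyy-mm-dd dates into consecutive bins of bin_days days."""
    days = [date.fromisoformat(s) for s in iso_dates]
    start = min(days)  # assumption: first bin starts at the earliest date
    nbins = (max(days) - start).days // bin_days + 1
    counts = [0] * nbins
    for d in days:
        counts[(d - start).days // bin_days] += 1

    rows, running = [], 0
    for i, count in enumerate(counts):
        running += count
        lower = start + timedelta(days=i * bin_days)
        rows.append({
            "bin": i + 1,
            "lower_bound": lower.isoformat(),
            "upper_bound": (lower + timedelta(days=bin_days)).isoformat(),
            "dsc_count": count,
            "cml_count": running,
            "cdf": running / len(days),
        })
    return rows


sample = ["2021-03-22", "2021-03-30", "2021-04-05", "2021-04-20"]
date_rows = build_date_bins(sample, bin_days=14)
for row in date_rows:
    print(row["lower_bound"], row["upper_bound"], row["dsc_count"])
```

Bins with no dates keep a count of zero and simply carry the previous cumulative values forward, as bins 14 to 20 do in the output above.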

Bootstrapping the result to generate PXX values.

pval.exe -f pdf_cdf_dates.txt -e

pxx     as_dec  bin_l           bin_u           p-value
P00     0.00    2021-03-22      2021-04-05      2021-03-22
P01     0.01    2021-03-22      2021-04-05      2021-03-24
P02     0.02    2021-03-22      2021-04-05      2021-03-26
P03     0.03    2021-03-22      2021-04-05      2021-03-29
P04     0.04    2021-03-22      2021-04-05      2021-03-31
P05     0.05    2021-03-22      2021-04-05      2021-04-03
P06     0.06    2021-04-05      2021-04-19      2021-04-05
P07     0.07    2021-04-05      2021-04-19      2021-04-08
P08     0.08    2021-04-05      2021-04-19      2021-04-12
P09     0.09    2021-04-05      2021-04-19      2021-04-15
P10     0.10    2021-04-05      2021-04-19      2021-04-18
P10     0.11    2021-04-19      2021-05-03      2021-04-21
P11     0.12    2021-04-19      2021-05-03      2021-04-23
P12     0.13    2021-04-19      2021-05-03      2021-04-25
P13     0.14    2021-04-19      2021-05-03      2021-04-28
P15     0.15    2021-04-19      2021-05-03      2021-04-30
P16     0.16    2021-05-03      2021-05-17      2021-05-03
P17     0.17    2021-05-03      2021-05-17      2021-05-10
P18     0.18    2021-05-17      2021-05-31      2021-05-17
P19     0.19    2021-05-17      2021-05-31      2021-05-19
P20     0.20    2021-05-17      2021-05-31      2021-05-22
P21     0.21    2021-05-17      2021-05-31      2021-05-24
P22     0.22    2021-05-17      2021-05-31      2021-05-26
P23     0.23    2021-05-17      2021-05-31      2021-05-29
P24     0.24    2021-05-31      2021-06-14      2021-05-31
P25     0.25    2021-05-31      2021-06-14      2021-06-02
P26     0.26    2021-05-31      2021-06-14      2021-06-05
P27     0.27    2021-05-31      2021-06-14      2021-06-07
P28     0.28    2021-05-31      2021-06-14      2021-06-10
P29     0.29    2021-05-31      2021-06-14      2021-06-12
P30     0.30    2021-06-14      2021-06-28      2021-06-14
P31     0.31    2021-06-14      2021-06-28      2021-06-16
P32     0.32    2021-06-14      2021-06-28      2021-06-18
P33     0.33    2021-06-14      2021-06-28      2021-06-20
P34     0.34    2021-06-14      2021-06-28      2021-06-22
P35     0.35    2021-06-14      2021-06-28      2021-06-24
P36     0.36    2021-06-14      2021-06-28      2021-06-25
P37     0.37    2021-06-14      2021-06-28      2021-06-27
P38     0.38    2021-06-28      2021-07-12      2021-06-29
P39     0.39    2021-06-28      2021-07-12      2021-07-01
P40     0.40    2021-06-28      2021-07-12      2021-07-03
P41     0.41    2021-06-28      2021-07-12      2021-07-05
P42     0.42    2021-06-28      2021-07-12      2021-07-07
P43     0.43    2021-06-28      2021-07-12      2021-07-09
P44     0.44    2021-06-28      2021-07-12      2021-07-11
P45     0.45    2021-07-12      2021-07-26      2021-07-13
P46     0.46    2021-07-12      2021-07-26      2021-07-15
P47     0.47    2021-07-12      2021-07-26      2021-07-17
P48     0.48    2021-07-12      2021-07-26      2021-07-19
P49     0.49    2021-07-12      2021-07-26      2021-07-21
P50     0.50    2021-07-12      2021-07-26      2021-07-23
P51     0.51    2021-07-12      2021-07-26      2021-07-25
P52     0.52    2021-07-26      2021-08-09      2021-07-27
P53     0.53    2021-07-26      2021-08-09      2021-07-28
P54     0.54    2021-07-26      2021-08-09      2021-07-30
P55     0.55    2021-07-26      2021-08-09      2021-08-01
P56     0.56    2021-07-26      2021-08-09      2021-08-02
P57     0.57    2021-07-26      2021-08-09      2021-08-04
P58     0.58    2021-07-26      2021-08-09      2021-08-05
P59     0.59    2021-07-26      2021-08-09      2021-08-07
P60     0.60    2021-08-09      2021-08-23      2021-08-09
P61     0.61    2021-08-09      2021-08-23      2021-08-10
P62     0.62    2021-08-09      2021-08-23      2021-08-11
P63     0.63    2021-08-09      2021-08-23      2021-08-13
P64     0.64    2021-08-09      2021-08-23      2021-08-14
P65     0.65    2021-08-09      2021-08-23      2021-08-16
P66     0.66    2021-08-09      2021-08-23      2021-08-17
P67     0.67    2021-08-09      2021-08-23      2021-08-18
P68     0.68    2021-08-09      2021-08-23      2021-08-20
P69     0.69    2021-08-09      2021-08-23      2021-08-21
P70     0.70    2021-08-23      2021-09-06      2021-08-23
P71     0.71    2021-08-23      2021-09-06      2021-08-24
P72     0.72    2021-08-23      2021-09-06      2021-08-26
P73     0.73    2021-08-23      2021-09-06      2021-08-28
P74     0.74    2021-08-23      2021-09-06      2021-08-29
P75     0.75    2021-08-23      2021-09-06      2021-08-31
P76     0.76    2021-08-23      2021-09-06      2021-09-02
P77     0.77    2021-08-23      2021-09-06      2021-09-04
P78     0.78    2021-08-23      2021-09-06      2021-09-05
P79     0.79    2021-09-06      2021-09-20      2021-09-07
P80     0.80    2021-09-06      2021-09-20      2021-09-09
P81     0.81    2021-09-06      2021-09-20      2021-09-11
P82     0.82    2021-09-06      2021-09-20      2021-09-13
P83     0.83    2021-09-06      2021-09-20      2021-09-14
P84     0.84    2021-09-06      2021-09-20      2021-09-16
P85     0.85    2021-09-06      2021-09-20      2021-09-18
P86     0.86    2022-01-10      2022-01-24      2022-01-10
P87     0.87    2022-01-10      2022-01-24      2022-01-11
P88     0.88    2022-01-10      2022-01-24      2022-01-12
P89     0.89    2022-01-10      2022-01-24      2022-01-13
P90     0.90    2022-01-10      2022-01-24      2022-01-14
P91     0.91    2022-01-10      2022-01-24      2022-01-15
P92     0.92    2022-01-10      2022-01-24      2022-01-16
P93     0.93    2022-01-10      2022-01-24      2022-01-17
P94     0.94    2022-01-10      2022-01-24      2022-01-18
P95     0.95    2022-01-10      2022-01-24      2022-01-19
P96     0.96    2022-01-10      2022-01-24      2022-01-20
P97     0.97    2022-01-10      2022-01-24      2022-01-21
P98     0.98    2022-01-10      2022-01-24      2022-01-22
P99     0.99    2022-01-10      2022-01-24      2022-01-23
P100    1.00    2022-01-10      2022-01-24      2022-01-24
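
The date p-values above can be reproduced by interpolating a day offset within the selected bin and truncating it to whole days. The Python sketch below shows this for a single bin; the function name interpolate_date is hypothetical, and the truncation rule is inferred from the output (for example, 0.10 falls 13.76 days into its bin and is reported as 2021-04-18), not taken from pval's source.

```python
from datetime import date, timedelta


def interpolate_date(bin_l, bin_u, prev_cdf, pdf, p):
    """Interpolate the date at cumulative probability p inside one bin.

    prev_cdf is the cdf of the preceding bin; pdf is the pdf of this bin.
    """
    lower = date.fromisoformat(bin_l)
    upper = date.fromisoformat(bin_u)
    fraction = (p - prev_cdf) / pdf
    # Assumption: partial days are truncated, not rounded.
    offset_days = int(fraction * (upper - lower).days)
    return (lower + timedelta(days=offset_days)).isoformat()


# Values taken from bins 8/9 and 21/22 of the first-pass date table above.
p50 = interpolate_date("2021-07-12", "2021-07-26",
                       0.4429619746497665, 0.0707138092061374, 0.50)
p90 = interpolate_date("2022-01-10", "2022-01-24",
                       0.8585723815877252, 0.1414276184122749, 0.90)
print(p50)  # 2021-07-23
print(p90)  # 2022-01-14
```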