Stata module for relative distribution analysis

reldist estimates and analyzes the relative distribution of outcomes between two groups (two-sample relative distribution) or between two variables (paired relative distribution). The relative distribution is the distribution of the relative ranks that the outcomes from one distribution take on in the other distribution. An example would be the relative positions that female wages take on in the distribution of male wages. reldist can be used to estimate and plot the relative density function (relative PDF), a histogram of the relative distribution, or the relative distribution function (relative CDF). Furthermore, it computes relative polarization indices as well as descriptive statistics of the relative data, and supports the decomposition of the relative distribution by adjusting for location, scale, and shape differences or for differences in covariate distributions. Statistical inference is implemented in terms of influence functions and supports estimation for complex samples.
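As a minimal sketch of the two-sample workflow described above, the following uses Stata's bundled nlsw88 example data. The dataset and the choice of wage as outcome and union as group variable are assumptions made purely for illustration; see the reldist help file for the authoritative syntax.

```stata
* Illustrative only: nlsw88 ships with Stata; wage and union are example variables
sysuse nlsw88, clear

* Relative density (relative PDF) of wages between the two groups defined by union
reldist pdf wage, by(union)
reldist graph

* Relative distribution function (relative CDF), plotted directly via option -graph-
reldist cdf wage, by(union) graph

* Relative polarization indices
reldist mrp wage, by(union)
```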

To install reldist from the SSC Archive, type

. ssc install reldist, replace

in Stata. Stata version 12 or newer is required. Furthermore, the moremata package is required. To install moremata from the SSC Archive, type

. ssc install moremata, replace

Installation from GitHub:

. net install reldist, replace from(https://raw.githubusercontent.com/benjann/reldist/master/)
. net install moremata, replace from(https://raw.githubusercontent.com/benjann/moremata/master/)
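To check that both packages can be found after installation, something along the following lines may help. This is a sketch that relies on Stata's general -which- and -mata which- commands, not on anything specific to reldist; mm_density() is used as the test function only because it is the moremata function named in the change log below.

```stata
* Locate the installed reldist ado-file and display its version header
which reldist

* moremata is a Mata library; check that one of its functions can be located
mata: mata which mm_density()
```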

Main changes:

04dec2022 (version 1.3.1)
- r() from reldist is now preserved if option -graph- is specified; this ensures
  that r(table) will be available after running -reldist- with both the -graph-
  option and the -table- option
- the display routine is now executed even if -quietly- is applied to -reldist-,
  so that r(table) will be created even if -quietly- is applied
- the display routine will now clear preexisting r() even if -notable- is applied

23feb2022 (version 1.3.0)
- reldist failed or returned invalid results if used with a string variable in
  vce(cluster); this is fixed

23feb2022 (version 1.2.9)
- observations with missing value for the cluster variable specified in 
  vce(cluster) were not excluded from the estimation sample (except by
  -reldist pdf-); this is fixed

30oct2021 (version 1.2.8)
- fixed header misalignment in Stata 17

19jun2021 (version 1.2.7)
- a different approach is now used to take account of balance() when computing 
  influence functions; this is only about organization of the code, results
  should not be affected
- there was a bug in the computation of the influence functions for the PDF in
  case of categorical data (one of the components had a wrong sign; in most
  situations this had no effect on results); this is fixed

05oct2020 (version 1.2.6)
- when applying replication-based variance estimation, e.g. vce(bootstrap), 
  reldist pdf used the SJPI bandwidth selector even if a different bandwidth 
  selector was specified in bwidth(method); this is fixed

29sep2020 (version 1.2.5)
- reldist divergence:
  o if compare() was specified, influence functions were not always processed
    correctly (such that results were wrong or an error occurred); this is fixed
  o option -compare()- did not work with replication vce(); this is fixed
  o option -compare- without argument did not work; this is fixed

28sep2020 (version 1.2.4)
- SEs were not correct if balance() was combined with -pooled-; this is fixed
- entropy balancing crashed if factor variable notation expanded into different
  vectors in the two subsamples; this is fixed

27sep2020 (version 1.2.3)
- now breaking ties in order of base weights if balance() is specified, not the
  balancing weights
- now using stable sort order unless -nosort- is specified

26sep2020 (version 1.2.1)
- balance() reimplemented; balancing weights are no longer assumed fixed when
  computing standard errors
- option -replace- is now allowed with all subcommands so that balance(, generate())
  can overwrite existing variables
- suboption -nord- added in bwidth() to omit the RD correction that is applied to
  bandwidth selectors by default
- reldist pdf used the SJPI bandwidth selector even if a different bandwidth 
  selector was specified in bwidth(method); this is fixed
- reldist div: compare(balance(,generate())) did not store the variable; this
  is fixed

18sep2020 (version 1.2.0)
- major update with many changes:
  o analytic standard errors are now computed for all estimates (based on 
    influence functions)
  o svy is supported through option vce(svy ...)
  o new -reldist divergence- command for estimation of divergence measures; 
    -reldist pdf- and -reldist histogram- no longer compute divergence
  o predict after -reldist- now computes influence functions
  o density estimation now based on moremata's new mm_density()
  o balance() option no longer relies on -kmatch-; supported reweighting 
    methods are IPW and entropy balancing
  o default kernel now "gaussian", leading to smoother results for the PDF
  o now using non-adaptive kernel estimation by default
  o -reldist histogram- now implemented in terms of CDF (or PDF, depending on 
  o -reldist summarize- no longer calls -tabstat-; list of supported statistics
  o option pooled now only allowed in syntax 1
  o option cross() discarded
  o aweights no longer allowed; iweights now treated like pweights
  o and various other changes ...

17jun2020 (version 1.1.8):
- [y]olabel() now prunes labels that are too close together; new suboptions
  -noprune- and -prune()- affect this behavior
- -reldist pdf/hist- now have option -cross()-
- -reldist hist- now also computes divergence measures (based on histogram density)
- -reldist pdf- now computes divergence measures based on output grid (not 
  the internal approximation grid); divergence measures are now also reported
  if option -exact- is specified
- -reldist pdf- now also computes the dissimilarity index (total variation 
  distance); e(divergence) renamed to e(entropy)
- -reldist pdf-: napprox() was set to max(512, n()+1) instead of max(512, n())
  if not specified; this is fixed
- -reldist mrp- now has option -reference-

12jun2020 (version 1.1.7):
- [y]olabel() now allows argument #n to generate n labels at (approximately)
  evenly spaced positions from min to max; [y]olabel without argument is equivalent
  to [y]olabel(#6); less than n labels may be produced if there is heaping
  in the data

11jun2020 (version 1.1.6):
- options atx(reference) and atx(comparison) added
- new option balance(, contrast): compare unbalanced with balanced distribution
- reldist CDF now has option -alt- to use an alternative estimation method
  based on relative ranks
- reldist graph has new option -[y]oline()-
- -reldist olabel- has new option -line()-
- option -otick- in -reldist olabel- is now called -tick()-
- [y]label() etc. may now be repeated
- [y]otitle() is now also printed if no labels or ticks are requested
- default for ogrid() has been increased to 401
- -reldist sum- returned error if -balance()- was specified; this is fixed
- minor changes to output header
- option -descending- added
- adjust(,multiplicative) now returns error if the adjustment factor is 0, 
  negative, or missing
- interpolation of relative CDF improved for 0<at<1 if an upright segment is
  hit: now using midpoint of upright segment instead of ceiling
- internal function _rd_quantile() returned error if X only had one row; this is fixed
- internal function _rd_uniq() returned error if X had fewer than two rows; this is fixed

07jun2020 (version 1.1.5):
- reldist made Stata freeze if the number of evaluation points was too large
  (due to size limits of -matrix-); an error message is now displayed if the
  number of evaluation points is too large

05jun2020 (version 1.1.4):
- made some speed improvements by avoiding repeated sorting (data is now sorted
  once when reading the data; subsequent computations then use functions that
  assume sorted data; density estimation, however, still involves repeated
  sorting; this could be further improved).

05jun2020 (version 1.1.3):
- new algorithm for computing relative ranks that breaks ties; use option 
  -nobreak- to employ old algorithm
- option atx() can now be used without arguments, i.e. as -atx-, to use the 
  observed values as evaluation points
- -reldist cdf- and -reldist pdf- now both have options -discrete- and 
  -categorical-; the two options do the same, but -categorical- requests that
  outcome values are positive integers and labels the coefficients using 
  factor-variable notation; -discrete- does not impose such a restriction and 
  labels the coefficients as x#
- atx() now only allows positive integers if -categorical- is specified
- -reldist cdf- now always computes the relative CDF based on exact data and
  interpolates between the exact points if necessary
- -reldist pdf- now allows -n()- or -at()- together with -discrete- or 
  -categorical-; in this case the discrete relative density is computed based
  on exact data and then mapped onto the evaluation grid requested by n() or at()
- -reldist pdf- with option -discrete- or -categorical- now automatically removes
  outcome-value evaluation points that do not exist in the reference distribution
  (the discrete relative density is infinity for these values, but the values 
  have zero mass on the y-axis, so it appears reasonable to ignore them)
- the option to affect the rendition of the reference line was documented as 
  -refline()- but implemented as -refopts()-; the option is now implemented
  as documented
- e(bwmethod) is now reset to "oversmoothed" if default bandwidth estimation

02jun2020 (version 1.1.2):
- -reldist pdf- and -reldist cdf- now have option -discrete- to treat data as 
  discrete; evaluation is performed at existing outcome values; pdf is displayed
  as a step function
- -reldist pdf- and -reldist cdf- now have option -atx- without 
  argument to evaluate at existing outcome values
- new -nomid- option to avoid using midpoints for relative ranks
- relative CDF is no longer computed as a step function; values are 
  interpolated between jumps; this is consistent with breaking ties
  randomly; option -nomid- has no effect on CDF (i.e. CDF is always computed
  as if nomid has been specified)
- no longer using midpoints when computing values of at() from atx()
- now using inverse empirical CDF for computing atx() from at() and for 
  computing e(ogrid) (i.e. no averaging where distribution function is flat)
- no longer using interpolation when computing the olabel() positions; the
  positions are now consistent with how a CDF is computed
- no longer using rangen() for creating evaluation grids (precision issue)
- now only returns e(at) with two rows instead of e(at) and e(atx)

06may2020 (version 1.1.1):
- option balance() added
- option pooled added
- changed approach for olabel()/otick(); reldist now stores quantiles in e(ogrid)
  from where the label positions are computed (instead of computing the positions
  from the original data or e(atx)); option ogrid() can be used to set the
  size of the grid; default is ogrid(201)
- option graph is no longer allowed with vce(bootstrap/jackknife)
- reldist failed if weights were specified and the variable containing the
  weights was abbreviated; this is fixed
- fixed some minor issues with output formatting

02may2020 (version 1.1.0):
- reldist released on GitHub and SSC