/cfout

Stata program to compare two datasets

Primary LanguageStataMIT LicenseMIT

cfout

cfout compares the dataset in memory (the master dataset) to a using dataset. It uses unique ID variables to match observations. cfout optionally saves the list of differences to file.

cfout is available through SSC: type ssc install cfout in Stata to install.

Certification script

The certification script of cfout is cscript/cfout.do. If you are new to certification scripts, you may find this Stata Journal article helpful. See this guide for more on cfout testing.

Stata help file

Converted automatically from SMCL:

log html cfout.sthlp cfout.md

The help file looks best when viewed in Stata as SMCL.

Title

cfout -- Compare two datasets, optionally saving the list of differences to file

Syntax

cfout [varlist] using filename, id(varlist) [options]

options Description ------------------------------------------------------------------------- Main * id(varlist) unique ID variables

String comparison lower convert string variables to lowercase before comparing upper convert string variables to uppercase before comparing nopunct remove punctuation in string variables before comparing strcomp(command) execute command for string variable pairs before comparing

Options saving(filename [, sopts]) save list of differences to filename numcomp(command) use command to determine differences within numeric variable pairs nostring do not compare string variables nonumeric do not compare numeric variables dropdiff do not include variables that differ on every observation nomatch suppress warnings about observations that are not in both master and using data nopreserve do not save original data; programmer's option ------------------------------------------------------------------------- * id() is required.

sopts Description ------------------------------------------------------------------------- variable(newvar) name of variable name variable; default is Question masterval(newvar) name of master value variable; default is Master usingval(newvar) name of using value variable; default is Using all[(newvar)] save all comparisons, not just differences, creating a variable named newvar to mark differences; default is diff keepmaster(varlist) variables to keep from master data keepusing(varlist) variables to keep from using data properties(popts) save variable properties as variables labval save labeled master and using values csv output in comma-separated format instead of as a Stata dataset replace overwrite existing filename -------------------------------------------------------------------------

popts Description ------------------------------------------------------------------------- type[(newvar)] save storage types as newvar; default is type format[(newvar)] save display formats as newvar; default is format vallabel[(newvar)] save value labels as newvar; default is vallabel varlabel[(newvar)] save variable labels as newvar; default is varlabel char(charnamelist) save characteristics charstub(stub) begin characteristic variable names with stub; default is char_ notes(numlist|_all) save notes notesstub(stub) begin notes variable names with stub; default is note -------------------------------------------------------------------------

Description

cfout compares varlist of the dataset in memory (the master dataset) to varlist of filename (the using dataset). It uses unique ID variables to match observations. cfout optionally saves the list of differences to file.

Remarks

If the master and using data contain value labels with the same name, the ones from the master data are used.

Among the applications of cfout is data entry, for which the command may facilitate the reconciliation of two separate entries of the same dataset. cfout can output a list of differences in a format useful for data entry teams. The related SSC program readreplace then inputs the correct values from a similarly formatted file.

The GitHub repository for cfout is here. Previous versions may be found there: see the tags. If outdated syntax is specified, cfout issues a warning message describing how the command will be interpreted.

Remarks for options strcomp() and numcomp()

Options strcomp() and numcomp() specify user-written comparison programs to determine which observation pairs are relevantly different. For instance, while the two strings "Vladimir Levenshtein" and "vladimir levenshtein" are different, they are not when converted to lowercase. The numbers 1 and 1.5 are different, but not when an acceptable difference of 0.75 is specified. strcomp() and numcomp() specify how to determine differences within pairs.

For each variable of varlist, cfout passes to the relevant comparison program two variables that contain the values in the master data and the values in the using, respectively. If the variable is string, the variable pair is passed to the program specified to strcomp(). If it is numeric, the pair is passed to the program specified to numcomp(). From here, the roles of the two comparison programs differ slightly.

strcomp()

The string comparison program specified to strcomp() is expected to change the variable pair so that they are actually different if and only if they are relevantly different. For instance, the program may convert the variables to lowercase, replace some strings with other strings, or make other changes. cfout will then compare the changed variables rather than the original ones.

An example may be helpful. Datasets firstEntry.dta and secondEntry.dta (available through SSC as ancillary files) share a string variable named firstname that stores respondents' first names. The variable is messy, containing punctuation in addition to letters.

use firstEntry cfout firstname using secondEntry, id(uniqueid)

Option nopunct removes much but not all of the punctuation:

cfout firstname using secondEntry, id(uniqueid) nopunct

We notice that firstname also contains errant brackets. We thus write a program to remove them:

program remove_brackets syntax varlist(min=2 max=2 string)

foreach var of local varlist { replace `var' = subinstr(`var', "[", "", .) replace `var' = subinstr(`var', "]", "", .) } end

We then specify the name of the program to strcomp():

cfout firstname using secondEntry, id(uniqueid) nopunct strcomp(remove_brackets)

The comparison command specified to strcomp() may also include options. For instance, we notice that the datasets contain both Lily and Lilly, and we decide not to count this as a difference. We need not write a program tailor-made for Lilys. Suppose we already have a program that replaces one string with another:

program fromto syntax varlist(min=2 max=2 string), from(string) to(string)

foreach var of local varlist { replace `var' = "`to'" if `var' == "`from'" } end

We may now specify this program to strcomp() with options:

cfout firstname using secondEntry, id(uniqueid) nopunct strcomp(fromto, from("Lilly") to("Lily"))

Even here, the command we specify to strcomp() does not include a varlist: cfout will insert one for each master-using variable pair. Further, while the command specified to strcomp() may include options, it may not include other syntactical elements, such as if or using. If command is program_name, options, cfout will always run:

program_name varname_master varname_using, options

numcomp()

The numeric comparison program specified to numcomp() is expected to create a new indicator variable that marks whether the master and using values are relevantly different. For instance, the program could create an indicator variable that is 1 if observation pairs differ by more than 0.75 and 0 if not. cfout will then use this indicator variable to determine relevant differences: only these will be reported and saved to file.

Like strcomp(), numcomp() helps cfout determine which differences are important. However, it differs slightly in its approach: while the string comparison program specified to strcomp() changes the variable pair specified to it, the numeric comparison program should not, instead only creating an indicator variable to mark differences.

Let's see an example. Datasets firstEntry.dta and secondEntry.dta store respondents' ages in variable age:

use firstEntry cfout age using secondEntry, id(uniqueid)

However, suppose we know that age may differ slightly between the two datasets, and this is not a source of concern. We may wish to limit the differences to only those that are more than 5 years.

To do so, we write a program that accepts the variable pair and an option generate(), to which cfout will specify the name of the new difference indicator variable.

program range5 syntax varlist(min=2 max=2 numeric), generate(name) gettoken var1 varlist : varlist gettoken var2 : varlist

generate `generate' = abs(`var1' - `var2') > 5 end

We then specify the name of the program to numcomp():

cfout age using secondEntry, id(uniqueid) numcomp(range5)

Do not specify the program's option generate() to numcomp(): cfout() will specify it when it creates the indicator variable.

Like strcomp(), numcomp() accepts comparison commands that include options. We could rewrite the program above so that the acceptable difference is specified to an option d():

program range syntax varlist(min=2 max=2 numeric), generate(name) d(real) gettoken var1 varlist : varlist gettoken var2 : varlist

generate `generate' = abs(`var1' - `var2') > `d' end

cfout age using secondEntry, id(uniqueid) numcomp(range, d(5))

Again, as with strcomp(), even here the command we specify to numcomp() does not include a varlist. We also continue not to specify option generate(). If command is program_name, options, cfout will always run:

program_name varname_master varname_using, generate(newvar) options

In the indicator variable that the comparison program creates, 0 means that an observation pair is the same, and nonzero values mean that it is different.

General advice

An alternative to strcomp() and numcomp() is to save the list of differences, then load it and drop irrelevant differences. Above, we ran:

use firstEntry cfout firstname using secondEntry, id(uniqueid) nopunct strcomp(remove_brackets)

Yet we could have achieved the same result without specifying strcomp() by executing remove_brackets only after loading the list of differences:

cfout firstname using secondEntry, id(uniqueid) nopunct saving(diffs) use diffs remove_brackets Master Using drop if Master == Using display _N

However, specifying strcomp() and numcomp() has advantages. By dropping observations as soon as possible, it limits the number of differences that ever reach the final list, thereby reducing memory requirements -- sometimes significantly so.

cfout expects the programs specified to strcomp() and numcomp() to behave in certain ways. If they do not, cfout may result in an error or produce incorrect results.

The comparison programs have access to a full dataset, not just the variable pair specified to them. However, they should not make assumptions about the rest of the dataset, and they should not modify it or its metadata, for instance, variable properties like variable labels or characteristics. While the string comparison program is expected to make changes to the variable pair specified to it, it should not make assumptions about or modify their metadata. The same holds for numcomp(), which should also not modify the variable pair, instead only creating an indicator variable.

As a rule, the comparison programs should not use information not passed to them, as this may change across cfout versions. The programs should restrict themselves to the variable names and their values. Order usually does not matter, but cfout will always specify the master value variable first in the pair. The master value variable retains its variable name from the master data, but the name of the using value variable will differ.

The comparison programs are free to sort the data without restoring the original order. In fact, this may reduce the time cost of the programs.

Finally, note that the comparison programs may be run noisily so that error messages are displayed correctly. To reduce this output, add quietly within the programs.

Options

+-------------------+ ----+ String comparison +------------------------------------------------

nopunct specifies changes to string variables before they are compared. It removes the following characters before comparing: ! ' ? It replaces the following strings with a space: ( ) , -- . / : ; It then removes leading or trailing blanks and multiple, consecutive internal blanks.

strcomp(command) specifies a command to execute for all string variable pairs before they are compared. See the remarks above for more information. strcomp() is implemented after the other string comparison options lower, upper, and nopunct.

+---------+ ----+ Options +----------------------------------------------------------

saving(filename [, sopts]) saves the list of differences to filename as a Stata dataset. This "differences dataset" contains an observation for each difference and variables for the unique ID values, the name of the variable that differs, and the values in the master and using data. The master and using values of string variables reflect the changes to the variables that the string comparison options implement. The variables for the master and using values are string if and only if one of the compared variables is string.

all(newvar) specifies that the differences dataset include all comparisons, not just differences. It creates an indicator variable named newvar that is 1 if the master and using values differ and 0 if not. If all is specified without newvar, the indicator variable is named diff. If option numcomp() is specified, the all() indicator variable reflects the indicator variable created by the numeric comparison program. In that case, all() may mark a different master-using value pair as not different, because numcomp() has specified that they are not relevantly different. If option labval is specified, the all() indicator variable marks whether the values actually differ, not whether they do after being formatted: two different values may appear the same after being formatted.

keepmaster(varlist) specifies variables from the master data to include in the differences dataset. They are merged into the differences dataset using the unique ID variables.

keepusing(varlist) specifies variables from the using data to include in the differences dataset. They are merged into the differences dataset using the unique ID variables.

properties(popts) saves the properties of variables in the differences dataset's variable name variable as their own variables. The variable properties of the master data are used.

char(charnamelist) saves the characteristics charnamelist of variables in the differences dataset's variable name variable as their own variables. Characteristic variable names are the combination of the characteristic variable name stub specified to charstub() and the name of the characteristic.

notes(numlist|_all) saves the notes of variables in the differences dataset's variable name variable as their own variables. Notes variable names are the combination of the notes variable name stub specified to notesstub() and the note number. numlist specifies the numbers of the notes to save. If _all is specified, notes are saved from 1 to the maximum note number among the variables specified to cfout.

labval specifies that the master and using values be labeled and formatted according to their value label and display format. By default, the variables that contain the master and using values store numeric values formatted as %24.0g. Value labels and display formats from the master data are used. The variables for the master and using values will be stored as string.

numcomp(command) specifies a command to determine differences within numeric variable pairs. See the remarks above for more information.

dropdiff specifies that variables that differ on every observation not be included. Results for these variables are not reported in the summary, returned in r() stored results, or saved in the differences dataset. If options strcomp() and/or numcomp() are specified, dropdiff follows them in determining which observations are different.

nopreserve is intended for use by programmers. It speeds the comparison by not saving the original data, which normally can be restored should things go wrong or if you press Break. Programmers can specify this option if they have already preserved the original data. nopreserve does not affect the comparison. If both options nopreserve and saving() are specified, the differences dataset is left in memory.

Examples

Compare the variables region-no_good_at_all of the datasets firstEntry.dta and secondEntry.dta, using variable uniqueid to match observations . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid)

Save the differences to the file diffs.dta . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs) . use diffs

Save the differences dataset with alternative variable names . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, variable(varname) masterval(master_value) usingval(using_value)) . use diffs

Save all comparisons to the differences dataset, not just differences . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, all) . use diffs . count if diff

Add variable deo from firstEntry.dta to the differences dataset . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, keepmaster(deo)) . use diffs

Save the storage types of the compared variables as an additional variable of the differences dataset . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, properties(type)) . use diffs . generate isstrvar = strmatch(type, "str*")

Save the storage types of the compared variables with an alternative variable name . use firstEntry . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, properties(type(storage_type))) . use diffs

For data that has been entered twice, compare the first and second entries, calculating discrepancy rates for each pair of data entry operators. This yields the same results as the SSC program cfby. . use firstEntry . * Variable deo identifies the data entry operator. . rename deo deo1 . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diffs, all keepmaster(deo1) keepusing(deo)) . use diffs, clear . rename deo deo2 . generate swap = deo1 > deo2 . generate t = deo1 if swap . replace deo1 = deo2 if swap . replace deo2 = t if swap . drop swap t . bysort deo*: generate total = _N . by deo*: egen total_diff = total(diff) . by deo*: generate error_rate = 100 * total_diff / total . format error_rate %9.2f . sort deo* . egen tag = tag(deo*) . list deo* total_diff total error_rate if tag, abbreviate(32) noobs

For twice entered data and a list of correct values, determine the error rates of individual data entry operators (not pairs as above) . use firstEntry . readreplace using correctedValues.csv, id(uniqueid) variable(question) value(correctvalue) . cfout region-no_good_at_all using firstEntry, id(uniqueid) saving(diff1, all keepusing(deo)) . cfout region-no_good_at_all using secondEntry, id(uniqueid) saving(diff2, all keepusing(deo)) . use diff1, clear . append using diff2 . bysort deo: generate total = _N . by deo: egen total_diff = total(diff) . by deo: generate error_rate = 100 * total_diff / total . format error_rate %9.2f . sort deo . egen tag = tag(deo) . list deo total_diff total error_rate if tag, abbreviate(32) noobs

Stored results

cfout stores the following in r():

Scalars r(N) number of values compared; includes only the variables of r(varlist) r(discrep) number of differences; includes only the variables of r(varlist) r(Nonlym) number of observations only in the master dataset r(Nonlyu) number of observations only in the using dataset

Macros r(varlist) variables compared; does not include r(varonlym), r(difftype), or (if option dropdiff is specified) r(alldiff) r(varonlym) variables only in the master dataset r(difftype) variables that are numeric in one dataset and string in the other r(alldiff) variables that differ on every observation

Acknowledgments

Christopher Robert of the Harvard Kennedy School suggested option nonumeric.

Authors

Ryan Knight Matthew White

For questions or suggestions, submit a GitHub issue or e-mail researchsupport@poverty-action.org.

Also see

Help: [D] cf, [D] compare, [P] dta_equal, [D] datasignature

User-written: readreplace, bcstats, mergeall