Parsing errors in CPS basic for version January 1998 to December 2002
rubenhm opened this issue · 5 comments
I'm getting all kinds of parsing failure notices and I think it's because of the dictionary file for that period.
The dictionary file should be http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan98dd.asc
The get_catalog_basic function only seems to look for *.txt files, so it misses the dictionary for that period and instead picks up https://thedataweb.rm.census.gov/pub/cps/basic/199801-/2000-2extract.txt, which does not seem to be the correct dictionary for that version.
The cps_dd_parser function may need special hardcoded fixes for that period.
Any suggestions on how to fix this?
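One possible direction, as a minimal sketch only: match the dictionary by its file name rather than by the .txt extension alone. The helper name pick_dictionary_file and the "dd.asc / dd.txt" naming pattern are assumptions made for illustration; this is not how lodown builds its catalog.

```r
# Hypothetical helper (not part of lodown): choose the data dictionary
# among the file names listed in one CPS basic directory.
pick_dictionary_file <- function( file_names ){

	# assumed naming pattern: dictionary files end in "dd.asc" or "dd.txt"
	is_dd <- grepl( "dd\\.(asc|txt)$" , file_names , ignore.case = TRUE )

	# signal "nothing found" instead of falling back to an unrelated file
	if( !any( is_dd ) ) return( NA_character_ )

	file_names[ is_dd ][ 1 ]
}

picked <- pick_dictionary_file( c( "2000-2extract.txt" , "jan98dd.asc" ) )
print( picked )
```

With the two file names from this report, the helper picks jan98dd.asc and ignores 2000-2extract.txt.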
thanks for reporting this! could you test out the latest version? not sure if we'd also need to edit the lodown:::cps_dd_parser function
Hi Anthony
Thanks for looking into this. The get_catalog_cpsbasic now builds the catalog df with the correct reference to the dictionary.
However, it seems that the parsing hangs, possibly caught in one of the while conditions.
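A backward search of this kind can run forever if it never reaches a variable row. As a hedged illustration with made-up line numbers (rows_to_keep and idp only mimic the roles those names play in the parser), the nearest variable row at or above each line can be found without a while loop:

```r
# made-up example: line numbers of variable rows and of "implied decimal" lines
rows_to_keep <- c( 3 , 10 , 25 )
idp <- c( 12 , 27 , 1 )

# for each idp, count how many variable rows sit at or above it
pos <- findInterval( idp , rows_to_keep )

# a count of zero means no variable row precedes that line, so return NA
nearest <- rows_to_keep[ ifelse( pos == 0 , NA , pos ) ]

print( nearest )
```

Lines 12 and 27 map to rows 10 and 25, and line 1 maps to NA rather than looping.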
The format of the dictionary is quite different from that of other vintages.
For example, here is the entry for PEEDUCA, the variable about educational attainment, in the 1998-2002 dictionary.
D PEEDUCA 2 137
T Demographics-highest level of school
completed
What is the highest level of school you
have completed or the highest degree you
have received? Edited Universe:
PEAGE>=15 Valid Entries
V 31 .Less Than 1st Grade
V 32 .1st,2nd,3rd Or 4th Grade
V 33 .5th Or 6th Grade
V 34 .7th Or 8th Grade
V 35 .9th Grade
V 36 .10th Grade
V 37 .11th Grade
V 38 .12th Grade No Diploma
V 39 .High School Grad-Diploma Or
V .Equiv (GED)
V 40 .Some College But No Degree
V 41 .Associate
V .Degree-Occupational/Vocationl
V 42 .Associate Deg.-Academic Program
V 43 .Bachelor's Degree(ex:ba,ab,bs)
V 44 .Master's
V .Degree(ex:MA,MS,MEng,MEd,MSW)
V 45 .Professional School
V .Deg(ex:MD,DDS,DVM)
V 46 .Doctorate Degree(ex:PhD,EdD)
Notice that variable rows start with a descriptor D before the variable name and only the length and beginning position are indicated. The ending position of the variable is not included.
T is the description, and V indicates valid entries. The same variable in the 2017-2018 dictionary is defined as follows:
PEEDUCA 2 HIGHEST LEVEL OF SCHOOL 137 - 138
COMPLETED OR DEGREE RECEIVED
EDITED UNIVERSE: PRPERTYP = 2 0R 3
VALID ENTRIES
31 LESS THAN 1ST GRADE
32 1ST, 2ND, 3RD OR 4TH GRADE
33 5TH OR 6TH GRADE
34 7TH OR 8TH GRADE
35 9TH GRADE
36 10TH GRADE
37 11TH GRADE
38 12TH GRADE NO DIPLOMA
39 HIGH SCHOOL GRAD-DIPLOMA OR EQUIV (GED)
40 SOME COLLEGE BUT NO DEGREE
41 ASSOCIATE DEGREE-OCCUPATIONAL/VOCATIONAL
42 ASSOCIATE DEGREE-ACADEMIC PROGRAM
43 BACHELOR'S DEGREE (EX: BA, AB, BS)
44 MASTER'S DEGREE (EX: MA, MS, MEng, MEd, MSW)
45 PROFESSIONAL SCHOOL DEG (EX: MD, DDS, DVM)
46 DOCTORATE DEGREE (EX: PhD, EdD)
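Since the 1998-2002 layout gives only the width and the starting position, the ending position has to be computed as start + width - 1. A minimal sketch in base R, using three lines copied from the PEEDUCA entry above (the object names are illustrative):

```r
# a few lines in the 1998-2002 layout, taken from the PEEDUCA entry
example_lines <- c(
	"D PEEDUCA 2 137" ,
	"T Demographics-highest level of school" ,
	"V 31 .Less Than 1st Grade"
)

# keep only the variable rows, which start with the descriptor "D"
d_rows <- grep( "^D [A-Z]" , example_lines , value = TRUE )

# split each row into descriptor, name, width, and start position
pieces <- strsplit( d_rows , "[ ]+" )
width <- as.numeric( sapply( pieces , "[[" , 3 ) )
start_position <- as.numeric( sapply( pieces , "[[" , 4 ) )

parsed <- data.frame(
	varname = sapply( pieces , "[[" , 2 ) ,
	width = width ,
	start_position = start_position ,
	# the ending position is implied by the other two fields
	end_position = start_position + width - 1 ,
	stringsAsFactors = FALSE
)

print( parsed )
```

For PEEDUCA this gives an end position of 138, which agrees with the 137 - 138 range printed in the 2017-2018 dictionary.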
Thanks,
Ruben
Anthony,
I tried modifying the cps_dd_parser to address only the 1998-2002 case.
I think it produces the correct the_result df with the variable names, widths, positions, and decimal divisors.
Would you take a look?
Thanks,
# Parse CPS basic monthly files during 1998-2002
# dd_url <- "http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan98dd.asc"
cps_dd_parser_1998 <-
	function( dd_url ){

		# Parser for the vintage 1998-2002
		# read in the data dictionary
		dd_con <- file( dd_url , encoding = 'windows-1252' )
		the_lines <- readLines( dd_con )
		close( dd_con )

		# Remove parentheses
		the_lines <- gsub( "\\(|\\)" , "" , the_lines )

		# Fix linebreak around Implied Decimal
		# find lines with "implied" after break
		idx_implied <- grep( "^[ ]+(implied decimal.*)$" , the_lines , ignore.case = TRUE )

		# Copy string to previous line
		the_lines[ idx_implied - 1 ] <- paste( the_lines[ idx_implied - 1 ] , gsub( "^[ ]+" , "" , the_lines[ idx_implied ] ) )
		the_lines[ idx_implied ] <- ""

		# pull the lines into a temporary variable
		the_dd <- stringr::str_trim( the_lines )
		the_dd <- iconv( the_dd , "" , "ASCII//TRANSLIT" )
		the_dd <- gsub( 'a?\"' , '-' , the_dd , fixed = TRUE )

		# remove any goofy tab characters
		the_dd <- gsub( "\t" , " " , the_dd )

		# look for lines indicating divisor, a single integer prior to the string "implied"
		idp <- grep( "([0-9]+)([ ]+)implied" , the_dd , ignore.case = TRUE )
		decimal_lines <- sub( ".* ([0-9]{1})[ ]+implied.*" , "\\1" , the_dd[ idp ] , ignore.case = TRUE )

		# keep only the variable lines
		rows_to_keep <- grep( "^D ([A-Z])(.*)([0-9])$" , the_dd )

		# walk each implied-decimal line back to the nearest variable row above it
		# stop at the first line so the loop cannot run forever
		for( this_line in seq_along( idp ) ){
			while( idp[ this_line ] > 1 && !( idp[ this_line ] %in% rows_to_keep ) ){
				idp[ this_line ] <- idp[ this_line ] - 1
			}
		}

		# Reduce the_dd and express idp in terms of rows_to_keep
		the_dd <- the_dd[ rows_to_keep ]
		idp <- match( idp , rows_to_keep )

		# collapse spaces around hyphens and runs of spaces
		the_dd <- gsub( "( +)-( +)" , "-" , the_dd )
		the_dd <- gsub( "-( +)" , "-" , the_dd )
		the_dd <- gsub( "( +)-" , "-" , the_dd )
		the_dd <- gsub( "( +)" , " " , the_dd ) # Only this case made any changes

		# keep only the first three items in the line
		# ( [A-z] also matches punctuation between Z and a, so spell the class out )
		the_dd <- gsub( "^D ([A-Za-z0-9_]+)[ ]+([0-9]+)[ ]+([0-9]+)" , "\\1 \\2 \\3" , the_dd )

		# Not sure if the following applies, but the PROLDRRP variable exists in this vintage
		# the_dd <- the_dd[ !grepl( "^D PROLDRRP" , the_dd ) ]

		# break the lines apart by spacing
		the_dd <- strsplit( the_dd , " " )

		# store the variable name, width, and position into a data.frame
		the_width <- as.numeric( sapply( the_dd , '[[' , 2 ) )
		the_start <- as.numeric( sapply( the_dd , '[[' , 3 ) )

		the_result <-
			data.frame(
				varname = sapply( the_dd , '[[' , 1 ) ,
				width = the_width ,
				start_position = the_start ,
				# the dictionary omits the end position, so derive it
				end_position = the_start + the_width - 1 ,
				divisor = 1 ,
				stringsAsFactors = FALSE
			)

		# skip any implied-decimal line that could not be matched to a variable row
		has_row <- !is.na( idp )
		the_result[ idp[ has_row ] , 'divisor' ] <- 10^-as.numeric( decimal_lines[ has_row ] )

		# fillers should be missings not variables
		the_result[ the_result$varname == 'FILLER' , 'width' ] <- -( the_result[ the_result$varname == 'FILLER' , 'width' ] )
		the_result[ the_result$varname == 'FILLER' , 'varname' ] <- NA

		# treat cps fields as exclusively numeric
		the_result$char <- FALSE

		the_result
	}
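To show how a layout shaped like the_result might be consumed, here is a small sketch with base R's read.fwf, where a negative width skips those columns (the same convention the FILLER rows use). The layout, the variable names VAR_A and VAR_B, and the two data lines are all made up; how lodown itself reads the files is not shown in this thread.

```r
# made-up layout in the same shape as the_result: one filler between two fields
layout <- data.frame(
	varname = c( "VAR_A" , NA , "VAR_B" ) ,
	width = c( 4 , -2 , 6 ) ,
	divisor = c( 1 , 1 , 10^-4 ) ,
	stringsAsFactors = FALSE
)

# made-up fixed-width records written to a temporary file
tf <- tempfile()
writeLines( c( "1998xx123456" , "1999xx000789" ) , tf )

# negative widths are skipped, so only the named fields come back
keep <- !is.na( layout$varname )
x <- read.fwf( tf , widths = layout$width , col.names = layout$varname[ keep ] )
unlink( tf )

# apply the implied decimals: the divisor column holds 10^-n, so multiply
kept_divisors <- layout$divisor[ keep ]
for( j in seq_along( kept_divisors ) ) x[[ j ]] <- x[[ j ]] * kept_divisors[ j ]

print( x )
```

Here the second field carries four implied decimals, so 123456 is read back as 12.3456.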
this is helpful, thanks! i'm not sure we need a separate script for just this period. could you see if my edits this morning get us closer to a working data dictionary parser? thanks for your time with this
Anthony
I downloaded 1998-01 and I think it's working fine!
Thanks a lot for looking into this. I'm downloading the rest of the 1998-2002 vintages now.
There are some warnings about HRSAMPLE (the sample ID) when reading the file, but I think I saw those in other years. They don't seem to be related to the 1998 dictionary, and they don't seem to affect the resulting dataframe.
Thanks!
Ruben