ajdamico/lodown

Parsing errors in CPS basic for version January 1998 to December 2002

rubenhm opened this issue · 5 comments

I'm getting all kinds of parsing failure notices and I think it's because of the dictionary file for that period.
The dictionary file should be http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan98dd.asc
The get_catalog_cpsbasic function seems to look only for *.txt files, so it misses the dictionary for that period and instead picks up https://thedataweb.rm.census.gov/pub/cps/basic/199801-/2000-2extract.txt, which does not seem to be the correct dictionary for that version.
The cps_dd_parser function may need special hardcoded fixes for that period.
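
To illustrate what I think is going on, here is a minimal sketch. It is not the actual lodown code: the two-file listing only reuses the names quoted above, and the patterns are my guess at the kind of filter involved.

```r
# Assumed directory listing for the 199801- folder; only these two
# file names are taken from the URLs quoted above.
files <- c( "jan98dd.asc" , "2000-2extract.txt" )

# A filter that only accepts .txt files misses the .asc dictionary
txt_only <- grep( "\\.txt$" , files , value = TRUE , ignore.case = TRUE )

# Accepting either extension keeps the real dictionary as a candidate
txt_or_asc <- grep( "\\.(txt|asc)$" , files , value = TRUE , ignore.case = TRUE )

print( txt_only )
print( txt_or_asc )
```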

Any suggestions on how to fix this?

thanks for reporting this! could you test out the latest version? not sure if we'd also need to edit the lodown:::cps_dd_parser function

Hi Anthony
Thanks for looking into this. The get_catalog_cpsbasic now builds the catalog df with the correct reference to the dictionary.
However, it seems that the parsing hangs, possibly caught in one of the while conditions.
The format of the dictionary is quite different from that of other vintages.

For example, here is the entry that defines PEEDUCA, the variable for educational attainment, in the 1998-2002 dictionary.

D PEEDUCA     2    137
T Demographics-highest level of school
  completed
     What is the highest level of school you 
     have completed or the highest degree you 
     have received? Edited Universe: 
     PEAGE>=15 Valid Entries 
V         31 .Less Than 1st Grade
V         32 .1st,2nd,3rd Or 4th Grade
V         33 .5th Or 6th Grade
V         34 .7th Or 8th Grade
V         35 .9th Grade
V         36 .10th Grade
V         37 .11th Grade
V         38 .12th Grade No Diploma
V         39 .High School Grad-Diploma Or
V            .Equiv (GED)
V         40 .Some College But No Degree
V         41 .Associate
V            .Degree-Occupational/Vocationl
V         42 .Associate Deg.-Academic Program
V         43 .Bachelor's Degree(ex:ba,ab,bs)
V         44 .Master's
V            .Degree(ex:MA,MS,MEng,MEd,MSW)
V         45 .Professional School
V            .Deg(ex:MD,DDS,DVM)
V         46 .Doctorate Degree(ex:PhD,EdD)

Notice that variable rows start with a descriptor D before the variable name, and only the width and starting position are given; the ending position is not included.
T marks the description, and V marks the valid entries. The same variable in the 2017-2018 dictionary is defined as follows:

PEEDUCA		2		HIGHEST LEVEL OF SCHOOL 						137 - 138   
					COMPLETED OR DEGREE RECEIVED    

					EDITED UNIVERSE:	PRPERTYP = 2 0R 3				

					VALID ENTRIES

					31	LESS THAN 1ST GRADE
					32	1ST, 2ND, 3RD OR 4TH GRADE
					33	5TH OR 6TH GRADE
					34	7TH OR 8TH GRADE
					35	9TH GRADE
					36	10TH GRADE
					37	11TH GRADE
					38	12TH GRADE NO DIPLOMA
					39	HIGH SCHOOL GRAD-DIPLOMA OR EQUIV (GED)
					40	SOME COLLEGE BUT NO DEGREE
					41	ASSOCIATE DEGREE-OCCUPATIONAL/VOCATIONAL
					42	ASSOCIATE DEGREE-ACADEMIC PROGRAM
					43	BACHELOR'S DEGREE (EX: BA, AB, BS)
					44	MASTER'S DEGREE (EX: MA, MS, MEng, MEd, MSW)
					45	PROFESSIONAL SCHOOL DEG (EX: MD, DDS, DVM)
					46	DOCTORATE DEGREE (EX: PhD, EdD)
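
A minimal sketch of how a 1998-2002 style D line could be turned into the same start/end information, assuming the end position is simply start + width - 1. Here `parse_d_line` is a hypothetical helper for illustration, not part of lodown:

```r
# Hypothetical helper: split a "D NAME WIDTH START" line into its pieces
# and derive the end position, which the 1998-2002 dictionary omits.
parse_d_line <-
  function( d_line ){
    pieces <- strsplit( trimws( d_line ) , "[ ]+" )[[ 1 ]]
    width <- as.numeric( pieces[ 3 ] )
    start_position <- as.numeric( pieces[ 4 ] )
    data.frame(
      varname = pieces[ 2 ] ,
      width = width ,
      start_position = start_position ,
      end_position = start_position + width - 1 ,
      stringsAsFactors = FALSE
    )
  }

peeduca <- parse_d_line( "D PEEDUCA     2    137" )
print( peeduca )
```

For PEEDUCA this gives positions 137 to 138, which matches the range printed in the 2017-2018 dictionary.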

Thanks,
Ruben

Anthony,
I tried modifying the cps_dd_parser to address only the 1998-2002 case.
I think it produces the correct the_result df with the variable names, width, positions, and decimal divisors.
Would you take a look?
Thanks,

# Parse CPS basic monthly files during 1998-2002
# dd_url <- "http://thedataweb.rm.census.gov/pub/cps/basic/199801-/jan98dd.asc"
cps_dd_parser_1998 <-
  function( dd_url ){
    # Parser for the vintage 1998-2002
    # read in the data dictionary
    dd_con <- file( dd_url , encoding = 'windows-1252' )
    the_lines <- readLines ( dd_con )
    close( dd_con )
    # Remove parentheses
    the_lines <- gsub( "\\(|\\)" , "" , the_lines )
    # Fix linebreak around Implied Decimal
    # find lines with "implied" after break
    idx_implied <- grep( "^[ ]+(implied decimal.*)$", the_lines, ignore.case = TRUE)
    # Copy string to previous line
    the_lines[idx_implied-1] <- paste(the_lines[idx_implied-1],gsub("^[ ]+","",the_lines[idx_implied]))
    the_lines[idx_implied] <- ""
    # pull the lines into a temporary variable
    the_dd <- stringr::str_trim( the_lines )
    the_dd <- iconv( the_dd , "" , "ASCII//TRANSLIT" )
    the_dd <- gsub( 'a?\"' , '-' , the_dd , fixed = TRUE )
    # remove any goofy tab characters
    the_dd <- gsub( "\t" , " " , the_dd )
    # look for lines indicating divisor, a single integer prior to the string "implied"
    idp <- grep( "([0-9]+)([ ]+)implied" , the_dd , ignore.case = TRUE )
    decimal_lines <- sub( ".* ([0-9]{1})[ ]+implied.*" , "\\1" , the_dd[idp], fixed=FALSE, ignore.case = TRUE )
    # keep only the variable lines
    rows_to_keep <- grep( "^D ([A-Z])(.*)([0-9])$" , the_dd )
    # Walk each implied-decimal line back to the variable (D) line it belongs to.
    # The lower bound on idp keeps the loop from running forever when no
    # variable line precedes the match.
    for( this_line in seq_along( idp ) ) {
      while( idp[ this_line ] > 1 && !( idp[ this_line ] %in% rows_to_keep ) ) {
        idp[ this_line ] <- idp[ this_line ] - 1
      }
    }
    # Reduce the_dd and express idp in terms of rows_to_keep
    the_dd <- the_dd[ rows_to_keep ]
    idp <- match( idp , rows_to_keep )
    # Remove spaces?
    the_dd <- gsub( "( +)-( +)" , "-" , the_dd )
    the_dd <- gsub( "-( +)" , "-" , the_dd )
    the_dd <- gsub( "( +)-" , "-" , the_dd )
    the_dd <- gsub( "( +)" , " " , the_dd ) # Only this case made any changes
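    # drop the leading D descriptor, keeping the variable name, width, and start position
    # ([A-z] also matches punctuation between Z and a, so the character class is spelled out)
    the_dd <- gsub( "^D ([A-Za-z0-9_]+)[ ]+([0-9]+)[ ]+([0-9]+)" , "\\1 \\2 \\3" , the_dd )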
    # Not sure if the following applies, but the PROLDRRP variable exists in this vintage
    # the_dd <- the_dd[ !grepl( "^D PROLDRRP" , the_dd ) ]
    # break the lines apart by spacing
    the_dd <- strsplit( the_dd , " " )
    # store the variable name, width, and position into a data.frame
    the_result <-
      data.frame(
        varname = sapply( the_dd , '[[' , 1 ) ,
        width = as.numeric( sapply( the_dd , '[[' , 2 ) ) ,
        start_position = as.numeric( sapply( the_dd , '[[' , 3 ) ) ,
        end_position = as.numeric( sapply( the_dd , '[[' , 3 ) ) + as.numeric( sapply( the_dd , '[[' , 2 ) )-1 ,
        divisor = 1 ,
        stringsAsFactors = FALSE
      )
    the_result[ idp , 'divisor' ] <- 10^-as.numeric( decimal_lines )
    # fillers should be missings not variables
    the_result[ the_result$varname == 'FILLER' , 'width' ] <- -( the_result[ the_result$varname == 'FILLER' , 'width' ] )
    the_result[ the_result$varname == 'FILLER' , 'varname' ] <- NA
    # treat cps fields as exclusively numeric
    the_result$char <- FALSE
    the_result
  }
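
For reference, this is roughly how I'd expect a layout shaped like the_result to be applied. It's only a sketch using base R's `read.fwf` on made-up data, not what lodown does internally; the variable names VARA and VARB, the layout, and the two records are all invented for illustration.

```r
# Invented layout in the same shape as the_result: a negative width marks
# a FILLER to skip, and divisor carries the implied decimals.
layout <-
  data.frame(
    varname = c( "VARA" , NA , "VARB" ) ,
    width = c( 2 , -3 , 4 ) ,
    divisor = c( 1 , 1 , 10^-2 ) ,
    stringsAsFactors = FALSE
  )

# Two invented fixed-width records: 2 chars, 3 filler chars, 4 chars
records <- c( "39xxx1250" , "43xxx0075" )

# read.fwf skips fields with negative widths
kept <- layout[ !is.na( layout$varname ) , ]
con <- textConnection( records )
x <- read.fwf( con , widths = layout$width , col.names = kept$varname )
close( con )

# apply the implied decimals column by column
for( i in seq_len( nrow( kept ) ) ) {
  this_var <- kept$varname[ i ]
  x[[ this_var ]] <- x[[ this_var ]] * kept$divisor[ i ]
}

print( x )
```

The negative width is what makes the FILLER columns drop out of the result, which is why the parser flips the sign rather than deleting those rows.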

this is helpful, thanks! i'm not sure we need a separate script for just this period. could you see if my edits this morning get us closer to a working data dictionary parser? thanks for your time with this

Anthony
I downloaded 1998-01 and I think it's working fine!
Thanks a lot for looking into this. I'm downloading the rest of the 1998-2002 vintages now.
There are some warnings about HRSAMPLE (the sample ID) when reading the file, but I think I saw those in other years. They don't seem to be related to the 1998 dictionary, and they don't seem to affect the resulting dataframe.
Thanks!
Ruben