kumc-bmi/naaccr-tumor-data

xsd:date YYYY-MM-DD vs NAACCR YYYYMMDD

Closed this issue · 3 comments

dckc commented

In today's gpc-dev call, @astoddard suggested that since NAACCR data is going to all be in XML by 2020 anyway, why not use XML Schema to define the data format?

It's a pretty good idea... in fact, XML Schema can express unions between integers and weird sentinel values such as XXX.5, XXX.8, XXX.9.
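To make the union idea concrete, here is a rough Python analogue of what an XML Schema union of an integer type and a small enumeration would accept. It is only a sketch: `parse_union` and `SENTINELS` are hypothetical names, the sentinel spellings are copied literally from the comment above, and only non-negative integers are handled.

```python
from typing import FrozenSet, Union

# Sentinel spellings copied literally from the comment above (assumption:
# real NAACCR items may use different codes).
SENTINELS: FrozenSet[str] = frozenset({'XXX.5', 'XXX.8', 'XXX.9'})


def parse_union(text: str,
                sentinels: FrozenSet[str] = SENTINELS) -> Union[int, str]:
    """Return an int for integer text, or a known sentinel unchanged."""
    value = text.strip()
    if value in sentinels:
        return value  # member of the enumeration branch of the union
    if value.isascii() and value.isdigit():
        return int(value)  # member of the integer branch
    raise ValueError(f'not an integer or known sentinel: {text!r}')
```

A value that matches neither branch is rejected, which is roughly what schema validation would do for a union type.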

But it wouldn't let us treat 20010101 as a date.

https://stackoverflow.com/questions/1899581/modify-xsddatetime-simple-type-to-use-different-date-and-time-separator
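The mismatch is purely lexical: `xsd:date` wants hyphens, NAACCR omits them. A minimal sketch of the conversion for complete dates follows; `naaccr_to_xsd_date` is a hypothetical helper, not part of this repository.

```python
from datetime import datetime


def naaccr_to_xsd_date(text: str) -> str:
    """Convert a complete YYYYMMDD string to the YYYY-MM-DD form of xsd:date."""
    # Check the shape first: strptime alone tolerates some shorter inputs.
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f'expected 8 digits (YYYYMMDD), got {text!r}')
    # strptime also rejects impossible calendar dates such as 20010230.
    return datetime.strptime(text, '%Y%m%d').date().isoformat()
```

With this, `naaccr_to_xsd_date('20010101')` gives `'2001-01-01'`. Partial dates (year only, or year and month) are deliberately out of scope here.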

dckc commented

In imsweb/layout#72 @depryf noted a relevant technique:

Another big difference between this library and NAACCR XML is that for convenience all date fields (which are YYYYMMDD) are considered "group fields" and define the three parts of the date as individual fields. This is a concept that NAACCR doesn't support, but it's REALLY useful for calling software since most algorithms are based on year, rarely month/day.
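The "group field" idea can be sketched in a few lines of Python. This is not the imsweb/layout API; `DateParts` and `split_date` are hypothetical names, and the sketch assumes a missing or non-numeric part should simply come back as `None`.

```python
from typing import NamedTuple, Optional


class DateParts(NamedTuple):
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]


def split_date(text: str) -> DateParts:
    """Split a YYYYMMDD string (possibly partial) into its three parts."""
    value = text.strip()

    def part(start: int, end: int) -> Optional[int]:
        piece = value[start:end]
        # A part counts only if it is fully present and numeric.
        if len(piece) == end - start and piece.isdigit():
            return int(piece)
        return None

    return DateParts(part(0, 4), part(4, 6), part(6, 8))
```

Calling software that only cares about the year can then read `split_date(value).year` without parsing a full date.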

dckc commented

I don't plan to argue for changes to the XML format.

I'm content with the parsing design in f1a2ad5

# ## NAACCR Dates
#
# - **ISSUE**: hide date flags in i2b2? They just say why a date is missing, which doesn't seem worth the screenspace.
# %%
def naaccr_dates(df: DataFrame, date_cols: List[str],
                 keep: bool = False) -> DataFrame:
    orig_cols = df.columns
    for dtcol in date_cols:
        strcol = dtcol + '_'
        df = df.withColumnRenamed(dtcol, strcol)
        dt = func.substring(func.concat(func.trim(df[strcol]), func.lit('0101')), 1, 8)
        # df = df.withColumn(dtcol + '_str', dt)
        dt = func.to_date(dt, 'yyyyMMdd')
        df = df.withColumn(dtcol, dt)
    if not keep:
        df = df.select(cast(Union[sq.Column, str], orig_cols))
    return df


IO_TESTING and naaccr_dates(
    _extract.select(['dateOfDiagnosis', 'dateOfLastContact']),
    ['dateOfDiagnosis', 'dateOfLastContact'],
    keep=True).limit(10).toPandas()
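For readers without a Spark session handy, the pad-and-truncate step can be mirrored in plain Python. This is a sketch of the same idea, not the code from the commit: `pad_and_parse` is a hypothetical name, and it assumes an unparseable value should become `None`, which is my understanding of how `to_date` behaves outside ANSI mode.

```python
from datetime import date, datetime
from typing import Optional


def pad_and_parse(text: Optional[str]) -> Optional[date]:
    """Pad a partial YYYYMMDD value with '0101', keep 8 characters, parse."""
    if text is None:
        return None
    # Same trick as concat(trim(col), '0101') followed by substring(..., 1, 8):
    # 'YYYY' -> 'YYYY0101', 'YYYYMM' -> 'YYYYMM01', 'YYYYMMDD' is unchanged.
    padded = (text.strip() + '0101')[:8]
    if len(padded) != 8 or not padded.isdigit():
        return None  # blank or malformed input
    try:
        return datetime.strptime(padded, '%Y%m%d').date()
    except ValueError:
        return None  # e.g. an impossible calendar date
```

So a year-only value lands on January 1 of that year, and a year-month value lands on the first of that month. The original string column is what tells you how much of the date was actually recorded, which is presumably why `keep=True` retains it.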