Performance improvements
hadley opened this issue · 4 comments
Currently read_csvy reads the complete file using readLines() - this means it will be slow for large files. I'd recommend (and can possibly help with) writing a C/C++ read_yaml_header() function that would parse from the first --- to the next ---. This metadata could then be used to generate the column specification that's passed to read.csv(), read_csv(), and fread(). (Will probably still need some additional cleanup afterwards).
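As a rough illustration of the column-specification step, the sketch below turns already-parsed header metadata into a colClasses vector for read.csv(). The fields/name/type layout, the type names, and the helper name col_classes_from_metadata() are assumptions made for illustration, not something settled in this issue.

```r
# Hypothetical helper: map parsed YAML metadata to read.csv() colClasses.
# Assumes metadata$fields is a list of entries with `name` and `type`.
col_classes_from_metadata <- function(metadata) {
  # Assumed mapping from header type names to R classes
  type_map <- c(string = "character", integer = "integer",
                number = "numeric", boolean = "logical")
  field_names <- vapply(metadata$fields, function(f) f$name, character(1))
  field_types <- vapply(metadata$fields, function(f) f$type, character(1))
  # Types missing from the map become NA, which lets read.csv() guess
  classes <- unname(type_map[field_types])
  names(classes) <- field_names
  classes
}

# Metadata as it might look after parsing the header (made-up example)
metadata <- list(fields = list(
  list(name = "id", type = "integer"),
  list(name = "label", type = "string")
))
col_classes <- col_classes_from_metadata(metadata)

# Named colClasses are matched to columns by name
csv_text <- "id,label\n1,a\n2,b"
df <- read.csv(text = csv_text, colClasses = col_classes)
str(df)
```

The same named vector could in principle be adapted for read_csv() or fread(), though each takes its column specification in a different form, which is presumably where the additional cleanup would come in.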
That would be awesome.
Not in C, but a first pass at this might look something like this. It uses the fact that if con <- file("/path/to/file", "r"), then readLines(con, n = 1) reads a file one line at a time, automatically advancing to the next line.
get_yaml_header <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  first_line <- readLines(con, n = 1)
  # An empty file yields character(0), so check the length before matching
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    warning("No YAML header found.")
    return(NULL)
  }
  tag_vec <- character()
  repeat {
    curr_line <- readLines(con, n = 1)
    # At end of file readLines() returns character(0); bail out instead of failing
    if (length(curr_line) == 0) {
      warning("YAML header is not terminated.")
      return(NULL)
    }
    # The closing --- ends the header and is not part of it
    if (grepl(yaml_rxp, curr_line)) break
    tag_vec <- c(tag_vec, curr_line)
  }
  tag_vec
}
parse_yaml_header <- function(yaml_header) {
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- gsub("^#", "", yaml_header)
  }
  yaml::yaml.load(paste(yaml_header, collapse = "\n"))
}
raw_header <- get_yaml_header("iris.csvy")
metadata <- parse_yaml_header(raw_header)
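The prefix-stripping step is easy to check in isolation. The sketch below repeats just that step in base R, without the yaml call; strip_comment_prefix() and the sample lines are hypothetical names and data used only for illustration.

```r
# Hypothetical helper mirroring the prefix-stripping step, without the yaml call
strip_comment_prefix <- function(yaml_header) {
  # Only strip when every line is commented, so a lone `#` comment inside
  # an uncommented header is left alone
  if (all(grepl("^#", yaml_header))) {
    yaml_header <- sub("^#", "", yaml_header)
  }
  yaml_header
}

commented <- c("#name: demo", "#rows: 2")
plain <- c("name: demo", "# a YAML comment", "rows: 2")

print(strip_comment_prefix(commented))  # leading `#` removed from both lines
print(strip_comment_prefix(plain))      # returned unchanged
```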
You should then be able to do something like csv_file <- fread(filename, skip = length(raw_header) + 2, ...), where the + 2 accounts for the opening and closing --- lines.
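To see how the skip count lines up, here is a self-contained sketch that writes a tiny CSVY-style file to a temporary path, counts the header lines from an open connection, and then hands the rest to read.csv(). It uses base R's read.csv() in place of fread() only to avoid a package dependency; the file contents and the count_header_lines() name are invented for illustration.

```r
# Made-up example file: a two-line YAML header fenced by `---`, then CSV data
path <- tempfile(fileext = ".csvy")
writeLines(c("---", "name: demo", "rows: 2", "---",
             "x,y", "1,a", "2,b"), path)

# Hypothetical helper: count the lines between the opening and closing fence,
# reading one line at a time so the CSV body is never loaded here
count_header_lines <- function(filename, yaml_rxp = "^#?---[[:space:]]*$") {
  con <- file(filename, "r")
  on.exit(close(con))
  first_line <- readLines(con, n = 1)
  if (length(first_line) == 0 || !grepl(yaml_rxp, first_line)) {
    return(NA_integer_)  # no opening fence
  }
  n <- 0L
  repeat {
    curr_line <- readLines(con, n = 1)
    if (length(curr_line) == 0) return(NA_integer_)  # no closing fence
    if (grepl(yaml_rxp, curr_line)) break
    n <- n + 1L
  }
  n
}

n_header <- count_header_lines(path)
# Skip the header lines plus the opening and closing fences
csv_data <- read.csv(path, skip = n_header + 2, stringsAsFactors = FALSE)
print(csv_data)
```

Returning NA when a fence is missing is just one possible choice for the sketch; a real implementation might prefer a warning or an error.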
If this looks OK, I can try to put together a more complete pull request later this week.
That would be awesome!