DiskFrame/disk.frame

Reading in chunks doesn't save to the external hard drive given as outdir

lime-n opened this issue · 5 comments

I am using disk.frame to save chunks of a .csv file to my external hard drive, where I have more storage space available.

However, even though I specify the outdir, it seems to save the data in the temp folder.

Here is my code:

library(disk.frame)
library(tidyverse)
library(data.table)

setup_disk.frame(workers = 8)
options(future.globals.maxSize = Inf)

zi.fl <- csv_to_disk.frame("species_all.csv", outdir = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df", in_chunk_size = 1e7)  %>%
  rbindlist.disk.frame() %>% filter(year > 2018)

Stage 1 of 2: splitting the file species_all.csv into smallers files:
Destination: /var/folders/mb/msr31f2d3vg6hsbtsjz06nn40000gn/T//RtmpUVcEQr/file3c3346c93c8

If you read in chunks, disk.frame needs a temporary folder to save each chunk before combining the chunks.
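
For what it's worth, the destination printed in your log is R's session temporary directory. Assuming the splitting stage writes its intermediates to tempdir() (I have not verified this), one workaround is to point that directory at the external drive before launching R; the path below is only an example:

# In the shell, before starting R, redirect R's temp directory
# (example path on the external drive):
#   export TMPDIR=/Volumes/Seagate/tmp

# Then, in a fresh R session, confirm where temporary files will go:
tempdir()
# The splitting-stage intermediates should then land under that folder,
# if disk.frame does indeed use tempdir() for them.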

Also, I don't think you need the rbindlist.disk.frame, as the chunks are combined for you automatically. This should suffice:

zi.fl <- csv_to_disk.frame("species_all.csv", outdir = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df", in_chunk_size = 1e7)  %>%
 filter(year > 2018)
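
Note that dplyr verbs on a disk.frame are lazy, so if you want the filtered result back in memory as a regular data.frame you would add a collect() at the end. A minimal sketch, assuming the filtered result actually fits in RAM:

library(disk.frame)
library(dplyr)

zi.fl <- csv_to_disk.frame(
  "species_all.csv",
  outdir = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df",
  in_chunk_size = 1e7
)

# filter() is applied chunk by chunk; collect() pulls the filtered
# rows back into a single in-memory data.frame
recent <- zi.fl %>%
  filter(year > 2018) %>%
  collect()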

I have a question about those chunks. If my file is 700GB and the chunks are, say, 10m rows each, will they keep being written until the full 700GB has been processed? I presume the result will be slightly smaller, as I am only selecting 6 of the 50 columns.

Is there a way to specify which row to start with? For example, if the above is true and I reach 300GB (my PC cannot hold more than that), can I pick up from that point, say around row 300m, and continue from there?

And how can I remove NAs and skip empty lines while reading?

Thank you for your time!

Hmm, that's a hard question.

csv_to_disk.frame by default just uses data.table's fread behind the scenes, so any option that works in fread (except, IIRC, the header options) should work.

Perhaps you can check out the documentation for fread and see if any of its options help?
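
For example, something like this should work, assuming the extra arguments are passed straight through to fread (the column names below are just placeholders):

library(disk.frame)

zi.fl <- csv_to_disk.frame(
  "species_all.csv",
  outdir        = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df",
  in_chunk_size = 1e7,
  # standard data.table::fread arguments:
  select           = c("species", "year"),  # read only the columns you need (placeholder names)
  na.strings       = c("", "NA"),           # treat empty strings as NA
  blank.lines.skip = TRUE                   # skip completely empty lines
  # skip = 3e8                              # fread can also skip a number of rows, which may
  #                                         # help with your "start at row X" question
)

Dropping rows that contain NAs would then be a per-chunk operation, e.g. in a later filter step.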

This is quite interesting. If you want, please DM me and I can set up a call with you. I will get value out of this as well, since I get to see what a real-world usage case from another user looks like.

Another option is to use bigreadr to split the file and then read the pieces.
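
Roughly along these lines; this is a sketch from memory, so double-check the bigreadr argument names, and I believe csv_to_disk.frame accepts a vector of input files (otherwise read the pieces individually):

library(bigreadr)
library(disk.frame)

# split the big csv into ~10m-row pieces
# (the prefix_out path is only an example)
infos <- split_file(
  "species_all.csv",
  every_nlines  = 1e7,
  prefix_out    = "/Volumes/Seagate/tmp/species_part",
  repeat_header = TRUE
)
parts <- get_split_files(infos)

# then load the pieces into one disk.frame on the external drive
zi.fl <- csv_to_disk.frame(
  parts,
  outdir = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df"
)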

I see you are using filter(year > 2018), so perhaps you can take advantage of that and reduce the amount of data that you need to process.
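
One way to do that at read time, if I recall the argument name correctly, is the per-chunk transform inmapfn, so rows from 2018 and earlier never get written to the disk.frame at all:

library(disk.frame)

zi.fl <- csv_to_disk.frame(
  "species_all.csv",
  outdir        = "/Volumes/Seagate/Work/Tickets/Third ticket/Extinction/species_all.df",
  in_chunk_size = 1e7,
  # each chunk arrives as a data.table; keep only the rows you need
  inmapfn = function(chunk) chunk[year > 2018]
)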

I have dropped you a message at your @gmail.

Perhaps a call would be better, as I am discovering a few more things and need your experience to understand what may be happening.

I am learning more about the functionality of this package, and I feel closer to a solution.