[FR] Resume hydration
cjbarrie opened this issue · 2 comments
Yeah, sure. But rather than doing as you suggest, we could easily avoid the previously errored tweet IDs by resuming from the last collected ID, with something like:
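resume_hydration <- function(ids, data_path, error = FALSE, verbose = TRUE) {
  if (!dir.exists(file.path(data_path))) {
    stop("Directory ", data_path, " doesn't exist.")
  }
  # Read what has already been collected and take the id of its last row
  existing_df <- bind_tweets(data_path, verbose = verbose)
  lastid <- tail(existing_df, n = 1)$id
  pos <- match(lastid, ids)
  if (length(pos) != 1 || is.na(pos)) {
    stop("Last collected id not found in `ids`.")
  }
  # Resume strictly after the last collected id so it is not hydrated twice
  uncollected_ids <- ids[seq_along(ids) > pos]
  hydrate_tweets(uncollected_ids, data_path = data_path, error = error, verbose = verbose)
}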
Anyway, will move this to separate item TBC. Happy to merge but will wait a bit to see if @justinchuntingho has anything to add
Originally posted by @cjbarrie in #264 (comment)
@cjbarrie The reason why `setdiff` is used in my suggestion is that the `errors` option (proposed in #264) will reorder the `id` column. One can't always assume that the `id` of the last row is the last `id` previously collected, although most of the time that is true, especially when there is no error.
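To make the ordering concern concrete, here is a small base-R sketch. The ids and the reordered `stored_ids` vector are made up for illustration; they only stand in for what the `id` column might look like after a reorder, and the position-based resume shown is a simplified stand-in (everything after the last row's id), not the package's code.

```r
# Illustrative data only: the ids requested, in their original order.
ids <- c("a", "b", "c", "d", "e")

# Assume a, b and c were already collected, but the stored rows were reordered.
stored_ids <- c("a", "c", "b")

# Position-based resume: take everything after the id found in the last row.
lastid <- tail(stored_ids, n = 1)
pos <- match(lastid, ids)
position_based <- ids[seq_along(ids) > pos]

# Set-based resume: drop every id that is already stored, whatever the row order.
set_based <- setdiff(ids, stored_ids)

print(position_based)  # "c" "d" "e": "c" would be collected a second time
print(set_based)       # "d" "e"
```

The position-based variant depends on the last row being the last id collected; the set-based one does not.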
There are two things one could do:
- Sort the `id` column according to the order of the 100 `ids` in a batch before serializing the batch result as `data` json files and merging the batch df (with the resorted `id`) into the big data.frame.
- Make `bind_tweets` aware of the `errors` json files, exclude errored ids (maybe depending on an option of whether or not to recollect errored ids), and then `setdiff`.
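The second option could look roughly like the sketch below. It is a minimal base-R illustration, not the package's implementation: `find_uncollected_ids()` and its `retry_errors` argument are hypothetical names, and the collected and errored ids are passed in as plain vectors rather than read from the `data` and `errors` json files.

```r
# Hypothetical helper (not part of the package): decide which ids still need
# hydrating, without relying on the row order of the collected data.
find_uncollected_ids <- function(ids, collected_ids,
                                 errored_ids = character(0),
                                 retry_errors = FALSE) {
  done <- collected_ids
  if (!retry_errors) {
    # Treat previously errored ids as finished so they are not requested again.
    done <- union(done, errored_ids)
  }
  # setdiff() keeps the order of its first argument and drops duplicates.
  setdiff(ids, done)
}

# Illustrative data only.
ids <- c("1", "2", "3", "4", "5", "6")
collected <- c("3", "1")  # rows may be reordered relative to `ids`
errored <- c("2")

skip_errors <- find_uncollected_ids(ids, collected, errored)
with_retry <- find_uncollected_ids(ids, collected, errored, retry_errors = TRUE)

print(skip_errors)  # "4" "5" "6"
print(with_retry)   # "2" "4" "5" "6"
```

The resulting vector could then be handed to `hydrate_tweets()` in place of the position-based slice.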
Ah, I understand better now. I'm assigning this to myself to look at. This will also force me to update my knowledge of package structure