cjbarrie/academictwitteR

[FR] Resume hydration

cjbarrie opened this issue · 2 comments

Yeah, sure. But instead of doing as you suggest, we could easily avoid the previously errored tweet IDs with something like:

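resume_hydration <- function(ids, data_path, error = FALSE, verbose = TRUE) {
  if (!dir.exists(data_path)) {
    stop("Directory ", data_path, " doesn't exist.")
  }
  # Read back everything already collected under data_path.
  existing_df <- bind_tweets(data_path, verbose = verbose)
  # Assumes the last row holds the most recently collected id.
  lastid <- tail(existing_df, n = 1)$id
  pos <- match(lastid, ids)
  if (is.na(pos)) {
    stop("Last collected id not found in `ids`.")
  }
  # Resume after the last collected id: a negative n drops the first `pos` ids.
  uncollected_ids <- tail(ids, -pos)
  hydrate_tweets(uncollected_ids, data_path = data_path, error = error, verbose = verbose)
}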

Anyway, will move this to a separate item TBC. Happy to merge, but will wait a bit to see if @justinchuntingho has anything to add.

Originally posted by @cjbarrie in #264 (comment)

@cjbarrie The reason I used setdiff in my suggestion is that the errors option (proposed in #264) will reorder the id column. Most of the time, especially when there is no error, the id in the last row is the last id previously collected, but one can't always assume that.
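
To make the ordering problem concrete, here is a small base-R sketch. The ids are made up, and `collected` is only a stand-in for the id column that bind_tweets would return after an interrupted run; it is not output from the package.

```r
# Requested ids, in the order they were passed to the hydration call.
ids <- c("101", "102", "103", "104", "105", "106")

# Hypothetical id column after a first run that stopped part-way:
# "101" to "104" were collected, but the rows came back reordered.
collected <- c("101", "104", "102", "103")

# Position-based resume: take everything from the last row's id onwards.
lastid <- tail(collected, n = 1)
pos <- match(lastid, ids)
by_position <- ids[pos:length(ids)]

# Set-based resume: everything requested that is not yet collected.
by_setdiff <- setdiff(ids, collected)

print(by_position)  # "103" "104" "105" "106": re-requests "103" and "104"
print(by_setdiff)   # "105" "106"
```

With reordered rows, the position-based version requests ids that are already on disk, while setdiff does not depend on row order at all.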

There are two things one could do:

  1. Sort the id column of each batch result according to the order of the 100 ids in that batch before serializing the result as data JSON files and merging the batch df (with the re-sorted ids) into the big data.frame.
  2. Make bind_tweets aware of the errors JSON files, exclude the errored ids (maybe depending on an option for whether or not to recollect errored ids), and then use setdiff.
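
Both options can be sketched in base R. This is only an illustration under stated assumptions: `reorder_batch()` and `remaining_ids()` are hypothetical helpers, not functions in academictwitteR, and the errored ids are passed in as a plain character vector rather than read from the errors files.

```r
# Option 1: put a batch result back into the order the ids were requested.
# `batch_df` is a hypothetical data.frame with an `id` column.
reorder_batch <- function(batch_df, batch_ids) {
  idx <- match(batch_ids, batch_df$id)
  # Drop requested ids that are absent from the result (e.g. errored ones).
  batch_df[idx[!is.na(idx)], , drop = FALSE]
}

# Option 2: work out what is left to collect from collected and errored ids.
# `retry_errors` is a hypothetical switch for recollecting errored ids.
remaining_ids <- function(ids, collected_ids, errored_ids = character(0),
                          retry_errors = FALSE) {
  done <- if (retry_errors) collected_ids else union(collected_ids, errored_ids)
  setdiff(ids, done)
}

# Made-up batch: rows arrive out of order and id "4" is missing.
batch_df <- data.frame(id = c("3", "1", "2"), text = c("c", "a", "b"),
                       stringsAsFactors = FALSE)
sorted_df <- reorder_batch(batch_df, c("1", "2", "3", "4"))

ids <- c("1", "2", "3", "4", "5")
todo_skip  <- remaining_ids(ids, sorted_df$id, errored_ids = "4")
todo_retry <- remaining_ids(ids, sorted_df$id, errored_ids = "4",
                            retry_errors = TRUE)

print(sorted_df$id)  # "1" "2" "3"
print(todo_skip)     # "5"
print(todo_retry)    # "4" "5"
```

Option 1 keeps the "last row is the last collected id" assumption valid, whereas option 2 drops that assumption and relies only on set membership.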

Ah, I understand better now. I'm assigning this to myself to look at. This will also force me to update my knowledge of the package structure.