Column Subset Inconsistency in Documentation and Code for load_tweets_json and load_twitter_users_json Functions
Closed this issue · 8 comments
Dear Nicola and Paul,
congrats on the update! I appreciate the well-conceived coordination mechanism and the polished architecture of the package.
I am currently preparing to test the system using a large dataset consisting of over 12M tweets. In order to fit this initial dataset into RAM, I am restricting my focus to only the necessary columns. Ideally, the load_tweets_json and load_twitter_users_json functions in the CooRTweet package would only load these necessary columns.
I attempted to modify these functions to achieve this goal; however, I had difficulty applying the query parameter of RcppSimdJson correctly. Despite this, I managed to subset the columns manually.
While doing so, I noticed a minor discrepancy between the documentation and the actual code found at https://github.com/nicolarighetti/CooRTweet/blob/master/R/preprocess_twitter.R. According to the documentation, the required columns are created_at, tweet_id, author_id, conversation_id, possibly_sensitive, lang, text, and in_reply_to_user_id. Yet, the code specifies a different set of required columns: entities, public_metrics, tweet_id, and created_at.
As a result of this inconsistency, omitting the columns that the code actually requires leads to an unhandled error. It's a very minor issue, but maybe worth addressing.
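For reference, a minimal sketch of the manual workaround I used, assuming load_tweets_json() returns a data.table and using the column set from the documentation (the directory path below is a placeholder):

```r
library(data.table)

# Columns listed in the documentation
keep_cols <- c("created_at", "tweet_id", "author_id", "conversation_id",
               "possibly_sensitive", "lang", "text", "in_reply_to_user_id")

# Load everything first, then drop what is not needed (path is invented)
tweets <- CooRTweet::load_tweets_json("data/tweets")
tweets <- tweets[, intersect(keep_cols, names(tweets)), with = FALSE]
```

This still pays the full memory cost at load time, which is why loading only the necessary keys in the first place would be preferable.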
Adding this issue here as it is related:
I'm encountering an error when calling the load_twitter_users_json function. The error message reads:
Error in .load_json(json = input, query = query, empty_array = empty_array):
NO_SUCH_FIELD: The JSON field referenced does not exist in this object.
This error appears to be caused by the query parameter passed to RcppSimdJson. Interestingly, it is the same error I encountered when I previously attempted to modify the load_twitter_users_json function to load only the necessary keys.
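For context, a small illustration of how a fload() query on a missing key produces this error; the file name and JSON Pointer below are invented:

```r
# fload()'s query argument takes a JSON Pointer; if the pointed-to field
# is missing from a file, fload() raises NO_SUCH_FIELD by default
# (the file name and pointer here are made up for illustration).
res <- tryCatch(
  RcppSimdJson::fload("users_sample.json", query = "/data/0/username"),
  error = function(e) conditionMessage(e)  # capture the error message
)
```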
Hi Fabio,
thanks for raising this issue. I will look into the query parameter and check how we can reduce the amount of data loaded into memory.
The error message you posted suggests that your JSON file has a different structure from those we used for testing. Would you mind sending me a sample file where the error appears (paul.balluff@univie.ac.at)?
The column inconsistency will be addressed soon. We will change that to a function parameter so that package users can adjust the columns accordingly.
Thanks for sending me some sample data and for the detailed report. That makes debugging much easier.
I investigated each issue and was able to fix them:
Loading Twitter data error
Error in .load_json(json = input, query = query, empty_array = empty_array):
NO_SUCH_FIELD: The JSON field referenced does not exist in this object.
The cause of this error was an empty JSON file (users_.json), and RcppSimdJson raises an error if the query fails. The parameter query_error_ok of RcppSimdJson::fload() is FALSE by default (which raises an error). I changed that to ignore errors, in case there are malformed input files.
The caveat is that package users do not know which files were skipped due to an error.
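A sketch of what that change looks like at the fload() call site (query_error_ok and on_query_error are documented fload() parameters; the surrounding variable names are placeholders):

```r
res <- RcppSimdJson::fload(
  json_files,              # placeholder: vector of input file paths
  query          = query,  # placeholder: the package's internal query
  query_error_ok = TRUE,   # do not abort when the query fails for a file
  on_query_error = NULL    # failed files simply yield NULL
)
# Drop the NULL entries left behind by files whose query failed
res <- Filter(Negate(is.null), res)
```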
Discrepancy in preprocess_tweets
I fixed the documentation and also added tweets_cols as an optional parameter where the package users can decide which columns to keep.
Keeping only necessary columns
I played around with the query parameter of RcppSimdJson::fload(), but it does not support wildcards, which makes it unsuitable for loading all elements in a JSON list.
Instead, I am thinking of implementing a batched loading process, where the input JSON files are split into batches of 50 (?) files and the load_tweets_json function drops unwanted columns before loading the next batch. This could also solve the above-mentioned caveat, because the function could then report back to the user which files failed parsing.
Note that loading the data that way would slow down the process, but probably conserve memory.
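A rough sketch of that batched process (the function name, the parse_batch() helper, and the failure reporting are hypothetical, not the package's actual implementation):

```r
# parse_batch() stands in for the package's internal JSON-to-data.table
# conversion; all names here are illustrative only.
load_tweets_batched <- function(files, keep_cols, batch_size = 50) {
  # Split the file vector into batches of at most batch_size files
  batches <- split(files, ceiling(seq_along(files) / batch_size))
  failed  <- character(0)
  parts <- lapply(batches, function(batch) {
    dt <- tryCatch(parse_batch(batch), error = function(e) NULL)
    if (is.null(dt)) {
      failed <<- c(failed, batch)  # remember which files failed parsing
      return(NULL)
    }
    # Drop unwanted columns before loading the next batch
    dt[, intersect(keep_cols, names(dt)), with = FALSE]
  })
  out <- data.table::rbindlist(parts, fill = TRUE)
  attr(out, "failed_files") <- failed  # report failures back to the user
  out
}
```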
Thanks! I like the approach of loading JSON files in chunks and dropping unnecessary columns on the fly. Feel free to close the issue when merged.
Could you try the development version by installing it like this:
remotes::install_github("nicolarighetti/CooRTweet@development")
I am not sure how much more efficient the new function is...
EDIT: the new function name is load_many_tweets_json
I've just verified it; the function works as expected. By efficiency, are you referring to memory usage or processing speed?
Thank you!
I mean in terms of memory usage. I noticed that the resulting data.table is about 20% smaller in memory size.
I understand your perspective. It appears that R's inefficient garbage collection might not be releasing all the RAM when columns are removed. The optimal solution would probably be not loading the unneeded JSON keys from the files at all. Perhaps you could suggest that the maintainers of RcppSimdJson explore adding wildcard support to the query parameter in a future release.
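To illustrate the garbage-collection point, a small sketch (not package code): data.table can delete a column by reference, and an explicit gc() asks R to return the freed memory to the operating system.

```r
library(data.table)

dt <- data.table(keep = 1:10, drop_me = letters[1:10])
dt[, drop_me := NULL]  # removed by reference; no copy of the table is made
invisible(gc())        # explicitly trigger garbage collection
```

Even so, memory already spent parsing the full JSON is only reclaimed after the fact, which is why skipping the keys at load time would be the cleaner fix.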