/xtdb-kaggle

A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.

Primary LanguageClojureMIT LicenseMIT

XTDB Kaggle

A small XTDB utility to download CSV datasets from Kaggle and turn them into XTDB transaction operations.

At the moment, it’s only got a transformer for one dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata. If you do implement transformers for others, please do submit them as PRs!

Setup

xtdb-kaggle is a REPL based tool at the moment. To get set up:

  • Clone the repo

  • Get yourself a Kaggle API key file - create an account, head to your account settings and download an API key JSON file

  • Set a KAGGLE_KEY_FILE environment variable pointing to the key file

  • Start a REPL and connect to it in your usual way.

Then, find yourself an interesting dataset on Kaggle

You need to tell xtdb-kaggle which files you’d like to download, and then how to turn each file into XTDB operations - this is done using multimethods.

Using that movie dataset as an example - we have an :owner-slug of "tmdb", a :dataset-slug of "tmdb-movie-metadata", and two files: "tmdb_5000_movies.csv" and "tmdb_5000_credits.csv".

We define dataset-file-names to specify the files, and one instance of csv-row→ops-fn for each file:

(defmethod dataset-file-names ["tmdb" "tmdb-movie-metadata"] [_]
  #{"tmdb_5000_movies.csv" "tmdb_5000_credits.csv"})

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_movies.csv"] [_]
  (fn [{:strs [id title runtime budget revenue keywords genres] :as row}]
    [[::xt/put {:xt/id (keyword (name 'tmdb.movie) id)
                :tmdb/type :movie
                :tmdb.movie/id (Long/parseLong id)
                :tmdb.movie/title title
                :tmdb.movie/budget (some-> budget Long/parseLong)
                :tmdb.movie/revenue (some-> revenue Long/parseLong)
                :tmdb.movie/keywords (->> (json/read-value keywords)
                                          (into #{} (map #(get % "name"))))
                :tmdb.movie/genres (->> (json/read-value genres)
                                        (into #{} (map #(get % "name"))))}]]))

(defmethod csv-row->ops-fn ["tmdb" "tmdb-movie-metadata" "tmdb_5000_credits.csv"] [_]
  (fn [{:strs [movie_id cast] :as row}]
    (let [movie-id (Long/parseLong movie_id)]
      (->> (for [{cast-name "name", :strs [credit_id id character]} (json/read-value cast)]
             [[::xt/put {:xt/id (keyword (name 'tmdb.cast) (str id))
                         :tmdb/type :cast
                         :tmdb.cast/id id
                         :tmdb.cast/name cast-name}]
              [::xt/put {:xt/id (keyword (name 'tmdb.credit) credit_id)
                         :tmdb/type :credit
                         :tmdb.movie/id movie-id
                         :tmdb.cast/id id
                         :tmdb.cast/character character}]])
           (apply concat)))))

Then, we can stream the dataset to a local file of XTDB transaction ops using:

(->> (dataset->ops {:owner-slug "tmdb", :dataset-slug "tmdb-movie-metadata"})
     (ops->stream (io/output-stream (io/file "/tmp/movies.edn"))))

Have fun!