This repository is dedicated to the prediction of the popularity of micro-videos on the TikTok platform, using the audio features extracted from non-original audio using Spotify's API. The task is modelled first as a regression task, then a multiclass and then a binary classification task.
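As a rough illustration of how one popularity signal can serve all three formulations, the sketch below derives a regression, a multiclass, and a binary target from raw view counts. The log scaling, the quantile bins, and the median threshold are assumptions made for this example only; the notebooks may define the targets differently, and `make_targets` is a hypothetical helper, not part of the package.

```python
import math
from statistics import median, quantiles


def make_targets(view_counts, n_classes=3):
    """Derive regression, multiclass, and binary targets from view counts.

    All thresholds here are illustrative assumptions, not the project's own.
    """
    # Regression target: log-scaled views (log1p keeps zero counts finite).
    regression = [math.log1p(v) for v in view_counts]

    # Multiclass target: quantile bins, so classes are roughly balanced.
    # quantiles() returns n_classes - 1 cut points.
    cuts = quantiles(view_counts, n=n_classes)
    multiclass = [sum(v > c for c in cuts) for v in view_counts]

    # Binary target: 1 if the video is above the median view count, else 0.
    mid = median(view_counts)
    binary = [int(v > mid) for v in view_counts]

    return regression, multiclass, binary
```

Quantile-based bins are used here only because they avoid heavily imbalanced classes on skewed view counts; fixed thresholds would work just as well for illustration.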
The project is structured into the following top-level directories, partially inspired by Cookiecutter Data Science:
src: Code for creating/updating the dataset and evaluation code for the models.
bin: The command-line interface for the code in src, explained in the "Usage" section below.
data: The collected data. The raw folder contains the track and view data from TikTok, and the track data with the features fetched from the Spotify API. The processed folder contains the final dataset after removing missing or invalid values.
notebooks: This folder contains the exploration and modelling of the dataset using different approaches.
The only requirement for this package to work is Python >= 3.9. Using venv is suggested.
Run pip install -r requirements.txt to install dependencies, then create a .env file as a copy of the .env-example file and fill out the values. To scrape TikTok data properly, three values are needed, which you can get by opening your browser to the front page of TikTok (no need to be logged in) and examining the outgoing requests:
TT_COOKIE: The cookie value sent with all requests. You can find this in any request header.
TT_DEVICE_ID: The device ID assigned to the requesting browser. It is available in the query parameters of any request.
TT_TOKEN: It is not entirely clear whether this value is needed, but it is contained in another cookie stored in the browser, named s_v_web_id. Copy the entire cookie value; it should be in the form verify_*.
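A quick way to catch a half-filled .env file before starting a long scrape is to parse it and check the three values listed above. The following is a minimal sketch using only the standard library; `parse_env`, `missing_keys`, and `token_looks_valid` are hypothetical helpers for illustration, and the package itself may load its configuration differently.

```python
REQUIRED_TIKTOK_KEYS = ("TT_COOKIE", "TT_DEVICE_ID", "TT_TOKEN")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only: cookie values contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_TIKTOK_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def token_looks_valid(values):
    """Check that TT_TOKEN has the verify_* form described above."""
    return values.get("TT_TOKEN", "").startswith("verify_")
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when all three TikTok values are filled in.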
Note that because this code uses the internal, undocumented TikTok API, the parameters and endpoints are subject to change, so there is no guarantee that it will continue to work.
For the SPOTIFY_* values, an app must be created from Spotify's dashboard for developers. The client ID and secret of this app are used here to access the API.
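To show what the client ID and secret are used for, here is a sketch of the token request in Spotify's Client Credentials flow, built with the standard library only. It is not necessarily how this package authenticates (it may rely on a client library instead), and `build_token_request` is a hypothetical helper. The function only constructs the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import base64
import urllib.parse
import urllib.request

# Token endpoint of Spotify's Client Credentials flow.
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret):
    """Build (but do not send) the POST request for an app access token."""
    # The app credentials go into an HTTP Basic Authorization header.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")

    # The form body only declares the grant type.
    body = urllib.parse.urlencode({"grant_type": "client_credentials"})

    return urllib.request.Request(
        TOKEN_URL,
        data=body.encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

The JSON response to this request carries an `access_token` field, which is then sent as a Bearer token on subsequent API calls.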
Optionally, a dataset containing tags for the tracks can be downloaded from LastFM by running download_tags in the same manner, using the LASTFM_API_KEY environment variable to access the API. There are usage limits to be aware of, but exact values are not specified in the documentation.
The data collection is a two-step process. First, the view data along with the tracks and album names are downloaded from the TikTok API, and then the audio features of each track are retrieved from the Spotify API.
To run the view data collection, run python -m bin.download_views from the root directory. You can use the -h flag to display available options. The command will run until no new tracks are encountered.
To run the feature data collection, run python -m bin.download_features, again from the root directory.
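The two steps can also be chained from a small script. The sketch below runs the two commands named above in order and stops if the first one fails; `build_commands` and `run_pipeline` are hypothetical names, and the script is assumed to be started from the repository root so that the bin package resolves.

```python
import subprocess
import sys

# The two collection steps, in the order they must run.
STEPS = ("bin.download_views", "bin.download_features")


def build_commands(python=sys.executable):
    """Return one `python -m <module>` command per collection step."""
    return [[python, "-m", module] for module in STEPS]


def run_pipeline(runner=subprocess.run):
    """Run each step in order; check=True aborts on a non-zero exit code."""
    for command in build_commands():
        runner(command, check=True)
```

Calling `run_pipeline()` with the default runner executes both steps. The `runner` parameter exists only so the sequence can be exercised without actually scraping.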
There is also a bin.merge_views command that can be used to merge view data files retrieved by the download_views command. This can be useful, for example, in a distributed setting where multiple scraping agents are launched, in order to merge the results from all of them into a single file.
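To make the idea of merging concrete, the sketch below combines batches of records from several agents into one de-duplicated list. It does not reproduce what bin.merge_views does: the record layout (dictionaries with `track_id` and `views` keys) and the rule of keeping the highest view count for a duplicate track are assumptions made purely for illustration.

```python
def merge_view_records(*batches, key="track_id", views="views"):
    """Merge record batches, keeping one record per track.

    Assumes each record is a dict; `key` and `views` are hypothetical
    field names. On duplicates, the record with more views wins.
    """
    merged = {}
    for batch in batches:
        for record in batch:
            current = merged.get(record[key])
            if current is None or record[views] > current[views]:
                merged[record[key]] = record
    # Dicts preserve insertion order, so tracks keep first-seen order.
    return list(merged.values())
```

Keeping the larger count is just one possible policy; summing or taking the most recent record would be equally plausible depending on what a view data file actually stores.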