pre_processing.py: Notebook to process the raw dataset downloaded from Kaggle. Running this script creates a directory named clean_dataset, which contains the cleaned .csv files. The script takes two arguments: frac and threshold. frac represents what fraction of the dataset to take from the raw data, and threshold will remove tags that appear less than threshold times.
create_rdf.py: Reads the cleaned data and uses Tarql to convert the .csv files into RDF triples. The CONSTRUCT queries in SPARQL are placed in the sparql folder. Running this script will create .ttl files. The script will place the .ttl files into the rdf_dataset directory. The .ttl files can now be loaded into a SPARQL endpoint to run queries.
SELECT ?category ?tag ?cnt WHERE {
{SELECT DISTINCT ?category (MAX(?cnt) as ?MaxCount) WHERE {
{SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
?category a ex:Category .
?video ex:category ?category .
?video ex:hasTag ?tag
}
GROUP BY ?category ?tag
ORDER BYDESC(?cnt)}
} GROUP BY ?category
ORDER BYDESC(?MaxCount)}
{SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
?category a ex:Category .
?video ex:category ?category .
?video ex:hasTag ?tag
}
GROUP BY ?category ?tag
ORDER BYDESC(?cnt)}
FILTER(?cnt = ?MaxCount)
}
4. Find the channels with at least 15 videos or categories that have fastest time for a video to become trending (time_to_trending = trending_date - publish_date).
## Query for categoriesSELECT ?category ?time_to_trending_average_hours WHERE{
{SELECT ?category (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
?video a ex:Video .
?video ex:category ?category .
?video ex:title ?title .
?video ex:publish_timestamp ?publish_timestamp .
?video ex:trending_timestamp ?trending_timestamp
BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
BIND(day(?time2trending) AS ?days)
BIND(hours(?time2trending) AS ?hours)
BIND(minutes(?time2trending) AS ?minutes)
BIND(seconds(?time2trending) AS ?seconds)
BIND( (?days*86400+ ?hours*3600+ ?minutes*60+ ?seconds) AS ?time_in_seconds)
}GROUP BY ?category}
BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
}ORDER BYASC(?time_to_trending_average_hours)
## Query for channels with at least 15 videosSELECT ?channel_title ?time_to_trending_average_hours WHERE{
{SELECT ?channel_title (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
?video a ex:Video .
?video ex:channel_title ?channel_title .
?video ex:publish_timestamp ?publish_timestamp .
?video ex:trending_timestamp ?trending_timestamp
BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
BIND(year(?time2trending) AS ?years)
BIND(month(?time2trending) AS ?months)
BIND(day(?time2trending) AS ?days)
BIND(hours(?time2trending) AS ?hours)
BIND(minutes(?time2trending) AS ?minutes)
BIND(seconds(?time2trending) AS ?seconds)
BIND( (?days*86400+ ?hours*3600+ ?minutes*60+ ?seconds) AS ?time_in_seconds)
}GROUP BY ?channel_title HAVING(COUNT(*)>4)}
BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
}ORDER BYASC(?time_to_trending_average_hours) LIMIT10