hackathon: A Jupyter Notebook repository from georgehuangcool

##### ACCESS THE KICK-OFF PRESENTATION ####

Welcome to the STIP Data Lab and OECD-TIP Data science for STI policy hackathon GitHub repository!

This page explains how to use GitHub for the hackathon, how to access the data, and what outcome we expect from the hackathon.

How to use GitHub for the hackathon

Once you have finalised your project, please upload all final outputs (code, visualisations, complementary materials, etc.) to your team folder in this repository. You can do so by dragging the files into the respective team folder in your browser or by using GitHub Desktop. To be able to upload files, please create a GitHub account and let us know your username so we can add you as a collaborator. Please note that GitHub will not accept individual files that are larger than 50MB.
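Before uploading, it may help to check your outputs against the file size limit mentioned above. The following is a minimal sketch, not an official tool: the function name `oversized_files` and the folder name `team_folder` are hypothetical, and "50MB" is assumed to mean 50 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 50 * 1024 * 1024


def oversized_files(folder, limit_bytes=SIZE_LIMIT_BYTES):
    """Return the files under `folder` (searched recursively) larger than `limit_bytes`."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.stat().st_size > limit_bytes
    )


if __name__ == "__main__":
    # "team_folder" is a hypothetical local directory holding your outputs.
    for path in oversized_files("team_folder"):
        size_mib = path.stat().st_size / (1024 * 1024)
        print(f"{path} is {size_mib:.1f} MiB and exceeds the limit")
```

Files reported by this check would need to be compressed, split, or shared by other means.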

If you have any questions during the hackathon, please use this repository's issues tab and tag your post with one of the existing labels so that the right people are notified. Posting your questions this way allows other participants to follow the Q&As. We will try to answer each question as soon as possible.

How to access the data

Please find below a short description of the two data sources as well as instructions on how to access the data.

STI strategies database

The TIP STI strategies database is a text corpus of more than 300 STI policy strategy documents (several million words overall) from 24 OECD countries. It covers the past several years, including both the COVID-19 pandemic and the period immediately before it. The documents have been collected in collaboration with national government officials working on STI policies in a range of public administrations and have been pre-processed and machine-translated to English by the OECD.

The dataset includes the following columns:

country: Name of the country that issued the document
year: Year when the document was issued
period: Indicator for whether the document was issued before or during the COVID-19 pandemic
doc_id: Identifier of the document
title: Title of the document
text original: Original text of the document
text translated: Translated text of the document
text clean: Translated and cleaned text of the document (no numbers and punctuation, no stopwords, lemmatization, n-grams)
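As an illustration of how the period indicator and the cleaned text could be combined, the sketch below counts terms separately for each period. It runs on toy records rather than the real data. The key names (`text_clean`, `period`) and the period values (`before`, `during`) are assumptions, so check the actual column names and values after loading the dataset.

```python
from collections import Counter, defaultdict

# Toy records standing in for rows of the dataset.
# Assumptions: the key names and the "before"/"during" values are illustrative only.
docs = [
    {"country": "A", "year": 2018, "period": "before", "text_clean": "innovation research funding"},
    {"country": "B", "year": 2019, "period": "before", "text_clean": "innovation skill"},
    {"country": "A", "year": 2021, "period": "during", "text_clean": "health research resilience"},
]


def term_counts_by_period(records, text_key="text_clean", period_key="period"):
    """Count whitespace-separated terms of the cleaned text, grouped by period."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[period_key]].update(record[text_key].split())
    return dict(counts)


counts = term_counts_by_period(docs)
for period, counter in counts.items():
    # Show the two most frequent terms per period.
    print(period, counter.most_common(2))
```

Splitting on whitespace is only reasonable here because the cleaned text is described as already lemmatised and stripped of punctuation and stopwords.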

You can download the data in .RData format here. The dataset is quite large, which is why we use the .RData format. You can open the file in R with the load() command, or in Python with the pyreadr package.
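For Python users, a minimal sketch of the pyreadr route could look like the following. `pyreadr.read_r` returns a dictionary-like object keyed by the names of the R objects stored in the file. The helper names and the file name `sti_strategies.RData` are hypothetical, and the name of the R object inside the file is not documented here, so the sketch picks the only object when there is exactly one.

```python
def load_rdata_frames(path):
    """Read every object stored in an .RData file into a dict of pandas DataFrames."""
    # Imported here so the helper below can be used without pyreadr installed.
    import pyreadr  # third-party package: pip install pyreadr

    return dict(pyreadr.read_r(path))


def pick_frame(frames, name=None):
    """Return the frame called `name`, or the only frame if `name` is omitted."""
    if name is None:
        if len(frames) != 1:
            raise ValueError(f"Expected exactly one object, found: {sorted(frames)}")
        name = next(iter(frames))
    return frames[name]


# Hypothetical usage; replace the file name with the one you downloaded:
# frames = load_rdata_frames("sti_strategies.RData")
# strategies = pick_frame(frames)
# print(strategies.columns)
```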

STIP Compass policy database

The STIP Compass policy database includes qualitative data on national STI policies. It is made up of close to 7000 initiatives from 57 countries and the European Union. The database covers all areas of STI policy, including initiatives spread across different ministries and national agencies, with competence over domains as broad as research, innovation, education, industry, environment, labour, finance/budget, among others. Its data is collected from a survey addressed to national government officials working on STI policies in a range of public administrations.

A few essential details about the dataset:

The data model used to structure STIP Compass can be understood by viewing this PDF file and the accompanying codebook.
The dataset has two header rows. The first row contains the variable names, whereas the second row includes a short description of the variable.
After the headers, each row provides data for a given initiative and instrument. As an initiative can have more than one instrument, subsequent rows can contain information on multiple instruments from the same initiative.
If you plan to load a CSV file, please select UTF-8 encoding and indicate the pipe character '|' (without quotes) as separator.
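The row structure described above can be sketched with Python's standard `csv` module. The example parses a small in-memory sample instead of the real file. The column names `InitiativeID` and `InstrumentID` and all values are assumptions made for illustration, so consult the codebook for the actual variable names.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample with the layout described above: a row of variable names,
# a row of descriptions, then one row per initiative-instrument pair.
# Assumption: the column names and values are made up for this sketch.
SAMPLE = (
    "InitiativeID|InstrumentID|NameEnglish\n"
    "Initiative identifier|Instrument identifier|Name of the initiative\n"
    "init-1|inst-a|Green research programme\n"
    "init-1|inst-b|Green research programme\n"
    "init-2|inst-c|Skills strategy\n"
)


def instruments_by_initiative(handle):
    """Group instrument identifiers by initiative, skipping the description row."""
    reader = csv.reader(handle, delimiter="|")
    names = next(reader)  # first header row: variable names
    next(reader)          # second header row: descriptions, not data
    grouped = defaultdict(list)
    for row in reader:
        record = dict(zip(names, row))
        grouped[record["InitiativeID"]].append(record["InstrumentID"])
    return dict(grouped)


grouped = instruments_by_initiative(io.StringIO(SAMPLE))
print(grouped)
```

For a file on disk, the same function could be called on `open("stip.csv", encoding="utf-8-sig", newline="")`, where `stip.csv` is whatever name you saved the download under.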

More detailed information about the database can be found here.

To load the data in Python into a Pandas dataframe you can use the following code:

import pandas as pd

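# Read the pipe-separated file directly from the URL.
# 'UTF-8-SIG' drops a leading byte-order mark if present; header=0 takes the
# variable names from the first row, so the description row stays in the
# frame as the first data row.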
url = 'https://stip.oecd.org/assets/downloads/STIP_Survey.csv'
compass_df = pd.read_csv(url, sep='|', encoding='UTF-8-SIG', header=0, low_memory=False)
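Because the second header row holds descriptions, it ends up as the first data row when the file is read with `header=0` alone, which can also prevent numeric columns from being parsed as numbers. One possible way to handle this, sketched on an in-memory sample with made-up column names, is to read the descriptions separately and skip that row when loading the data:

```python
import io

import pandas as pd

# Illustrative sample; the column names and values are assumptions, not the real schema.
SAMPLE = (
    "InitiativeID|InstrumentID|StartYear\n"
    "Initiative identifier|Instrument identifier|Start year\n"
    "init-1|inst-a|2019\n"
    "init-1|inst-b|2019\n"
    "init-2|inst-c|2021\n"
)

# Read only the first row after the header to get a name -> description mapping.
descriptions = (
    pd.read_csv(io.StringIO(SAMPLE), sep="|", header=0, nrows=1).iloc[0].to_dict()
)

# skiprows=[1] drops the second line of the file (the descriptions),
# so numeric columns such as StartYear are parsed as numbers.
data = pd.read_csv(io.StringIO(SAMPLE), sep="|", header=0, skiprows=[1])

print(descriptions)
print(data.dtypes)
```

The same `skiprows=[1]` argument could be added to a `read_csv` call on the real file, under the assumption that the description row is always the second line.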

You can also easily load the data in R using the following code:

library(readr)

url <- 'https://stip.oecd.org/assets/downloads/STIP_Survey.csv'

#download the dataset
download.file(url, destfile = 'stip.csv', mode = 'wb')

#load the dataset into our working environment
stip <- read_delim('stip.csv', '|', escape_double = FALSE, trim_ws = TRUE)

You may inspect and re-use code found in these projects:

STI.Scoreboard

A possible source of complementary data is the STI.Scoreboard infrastructure. It contains over 1000 indicators on research and development, science, business innovation, patents, education and the economy, drawing on the latest quality-assured statistics from the OECD and partner international organisations. These indicators are accessible via a dedicated API that accepts SDMX queries. The following Python and R tutorials provide more information and include the necessary code to access this infrastructure.

How to retrieve STI.Scoreboard indicators found in STIP Compass using Python and SDMX?
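To give a rough idea of what an SDMX query looks like, the sketch below builds a data query URL following the general SDMX REST convention: a `data` path, a dataflow identifier, a key whose dimensions are separated by dots (with `+` joining several values of one dimension), and optional `startPeriod` and `endPeriod` parameters. The base URL, dataflow and dimension codes shown are placeholders, not the real STI.Scoreboard endpoint or identifiers; please take those from the tutorials.

```python
from urllib.parse import urlencode


def sdmx_data_url(base_url, dataflow, key_parts, start_period=None, end_period=None):
    """Build an SDMX REST data query URL.

    key_parts is a list with one entry per dimension; each entry is a list of
    codes (joined with '+'), and an empty list leaves the dimension unfiltered.
    """
    key = ".".join("+".join(codes) for codes in key_parts)
    url = f"{base_url.rstrip('/')}/data/{dataflow}/{key}"
    params = {}
    if start_period is not None:
        params["startPeriod"] = start_period
    if end_period is not None:
        params["endPeriod"] = end_period
    return f"{url}?{urlencode(params)}" if params else url


# Placeholder values: "https://example.org/sdmx", "DEMO_FLOW" and the codes are hypothetical.
query_url = sdmx_data_url(
    "https://example.org/sdmx",
    "DEMO_FLOW",
    [["FRA", "DEU"], ["RD_EXP"]],
    start_period="2015",
    end_period="2020",
)
print(query_url)
```

The resulting URL could then be fetched with any HTTP client, once the placeholders are replaced with values from the tutorials.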

Expected outcome of the hackathon

For the closure event on 7 June (3:00-5:00 pm CEST), please prepare a 10-minute presentation summarising your findings. We recognise that teams will have only a short time to work on the policy questions, so we do not expect definitive answers. Rather, we hope teams will propose one or more innovative approaches and, where possible, present some initial observations from the data. We are excited to see what avenues the teams come up with!

Following the closure event, we will organise a separate debriefing seminar where participating teams can discuss their technical choices and share their experiences of working on the hackathon.
