nih_time_series_nlp: A Jupyter Notebook repository from atlas-github

Note

Solutions for days 1, 2, and 3 will be posted here. Code-CV is here.

Tip

The wide pivot table can be found here.

Tip

Code for concatenating two DataFrames can be found here.

Day 1

08:45 am: Introduction

Here's what I've achieved so far.

09:00 am: Introduction to Python

Using either your new Google account or your personal account, open Google Colab in another tab
Google Colab's interface and functions:
- Tools >> Settings >>
  - Editor >> Show line numbers (check if you prefer)
  - Miscellaneous >> Corgi mode, Kitty mode (turn on if you like)
- Test with some basic code:
  - Click Connect at top right
  - Write a simple definition
  - Test a simple math problem
- Runtime settings
  - Run cells
  - Reset
- Code and text cells
- Save
More about Colab’s Markdown here
Open a new notebook on Google Colab
Data Visualization
- matplotlib
- plotly
Machine Learning and AI
Scientific Computing
- scipy
- sympy
Automation and Web Scraping
- BeautifulSoup
- Scrapy
Database Access
- sqlalchemy
- psycopg2
Natural Language Processing (NLP)
- nltk
- spaCy
Image Processing
- pillow
- opencv
Data Analysis and Manipulation
- pandas
- numpy

09:45 am: Practical session 1

pandas
- Series
  - Series Creation: Create a Pandas Series from a list of integers.
  - Indexing and Slicing: Access specific elements and slices from the Series.
  - Operations: Perform basic arithmetic operations on the Series.
  - Filtering: Filter the Series to include only elements greater than a certain value.
  - Missing Data: Introduce NaN values into the Series and handle them (e.g., fill with a value or drop).
- DataFrame
  - DataFrame Creation: Create a DataFrame from a dictionary where the keys are column names and the values are lists of column data.
  - Exploring Data: Display the first few rows, summary statistics, and data types of the DataFrame.
  - Indexing and Selection: Select specific columns, rows, and subsets of the DataFrame.
  - Adding Columns: Add a new column to the DataFrame based on existing columns.
  - Handling Missing Data: Introduce NaN values and demonstrate methods to handle missing data (e.g., fillna, dropna).
- Data Manipulation
  - Reading Data: Read a CSV file into a Pandas DataFrame.
  - Filtering Data: Filter rows based on a condition.
  - Sorting Data: Sort the DataFrame by a specific column.
  - Grouping Data: Group the DataFrame by a column and compute aggregate statistics.
  - Merging DataFrames: Merge two DataFrames on a common column.
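The pandas exercises above can be sketched roughly as follows. All values and column names (`state`, `cases`, `population`, `region`) are made up for illustration, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import numpy as np
import pandas as pd

# --- Series: creation, slicing, arithmetic, filtering, missing data ---
s = pd.Series([10, 20, 30, 40, 50])
first_three = s.iloc[:3]             # positional slice
doubled = s * 2                      # element-wise arithmetic
large = s[s > 25]                    # boolean filter

s_missing = s.astype(float)
s_missing.iloc[1] = np.nan           # introduce a NaN
filled = s_missing.fillna(0)         # replace NaN with 0
dropped = s_missing.dropna()         # or drop the NaN entry

# --- DataFrame: creation from a dict, exploration, new column ---
df = pd.DataFrame({
    "state": ["A", "B", "A", "C"],   # hypothetical values
    "cases": [5, 3, 8, 2],
    "population": [100, 200, 100, 50],
})
preview = df.head()                  # first rows
summary = df.describe()              # summary statistics
types = df.dtypes                    # column data types
df["rate"] = df["cases"] / df["population"]

# --- Data manipulation: read, filter, sort, group, merge ---
csv_text = "state,region\nA,North\nB,South\nC,East\n"
regions = pd.read_csv(io.StringIO(csv_text))   # stand-in for a CSV file

high = df[df["cases"] > 2]
ranked = df.sort_values("cases", ascending=False)
totals = df.groupby("state")["cases"].sum()
merged = df.merge(regions, on="state", how="left")
```

In a notebook, ending a cell with a name such as `merged` displays the result.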
numpy
- Basic operations
  - Array Creation: Create a 1D NumPy array of integers from 0 to 9.
  - Reshape: Convert the 1D array into a 2D array with 2 rows and 5 columns.
  - Slicing: Extract the first row and the second column of the 2D array.
  - Arithmetic Operations: Create another 2D array of the same shape and perform element-wise addition, subtraction, multiplication, and division.
  - Statistical Operations: Compute the mean, median, and standard deviation of the elements in the 2D array.
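A minimal sketch of the NumPy exercises above. The second array `b` is an arbitrary choice (filled with 2s) so that the division is well defined.

```python
import numpy as np

a = np.arange(10)             # 1D array of integers 0..9
m = a.reshape(2, 5)           # 2 rows, 5 columns

first_row = m[0]              # [0 1 2 3 4]
second_col = m[:, 1]          # [1 6]

b = np.full((2, 5), 2)        # another array of the same shape
added = m + b
subtracted = m - b
multiplied = m * b
divided = m / b               # true division, float result

mean = m.mean()               # 4.5
median = np.median(m)         # 4.5
std = m.std()                 # population standard deviation (ddof=0)
```

`m.std()` uses `ddof=0` by default, while pandas' `Series.std()` defaults to `ddof=1`, so the two can differ on the same data.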

10:45 am: Break

11:00 am: Introduction to time series data

Understanding time series data
Common time series patterns and terminology
Loading and exploring time series data with Python

12:00 pm: Basic data cleaning techniques

Handling missing values, imputation and interpolation
Removing duplicates
Data type conversion and validation
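The three cleaning steps above might look like this on a small, made-up table. The column names `date` and `cases` are assumptions, not taken from the course dataset.

```python
import pandas as pd

# Made-up raw data: one duplicated row, one missing value, everything as text
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "cases": ["4", "6", "6", None, "10"],
})

clean = raw.drop_duplicates().copy()                   # remove exact duplicate rows
clean["date"] = pd.to_datetime(clean["date"])          # text -> datetime
clean["cases"] = pd.to_numeric(clean["cases"], errors="coerce")  # text -> number, invalid -> NaN
clean = clean.set_index("date").sort_index()

# Two common ways to fill the gap
imputed = clean["cases"].fillna(clean["cases"].mean())       # mean imputation
interpolated = clean["cases"].interpolate(method="linear")   # linear interpolation
```

Mean imputation ignores the ordering of observations, whereas interpolation uses the neighbouring points, which is often a better fit for time series.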

1:00 pm: Lunch

2:00 pm: Practical session 2

Download the dengue csv file
Filter to only relevant columns
Convert date to datetime format
Identify postcodes with most complete/missing data
Create a pivot table
Visualize case counts data
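As a rough sketch of the steps above, using a small stand-in table instead of the dengue CSV. The column names `date`, `postcode`, `cases`, and `notes` are hypothetical and the real file may differ.

```python
import pandas as pd

# Stand-in for the dengue CSV; column names and values are hypothetical
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-08", "2024-01-15"],
    "postcode": ["50000", "40000", "50000", "40000", "50000"],
    "cases": [3, 1, 5, 2, 4],
    "notes": ["", "", "", "", ""],
})

# Keep only the relevant columns and convert the date column
df = df[["date", "postcode", "cases"]].copy()
df["date"] = pd.to_datetime(df["date"])

# Wide pivot table: one row per date, one column per postcode
wide = df.pivot_table(index="date", columns="postcode", values="cases", aggfunc="sum")

# Missing entries per postcode, most complete first
missing_per_postcode = wide.isna().sum().sort_values()
```

With matplotlib installed, `wide.plot()` would draw one line per postcode as a first look at the case counts.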

3:00 pm: Advanced data cleaning techniques

Detecting and handling outliers
Smoothing time series data
Handling seasonal and trend components
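One possible way to detect and handle outliers before smoothing, sketched with the interquartile range (IQR) rule. The 1.5 multiplier and the choice to cap rather than drop outliers are assumptions made for illustration, and the numbers are invented.

```python
import pandas as pd

cases = pd.Series([4, 5, 6, 5, 40, 6, 5, 4, 6, 5], dtype=float)  # made-up counts

# IQR rule: flag points far outside the middle 50% of the data
q1, q3 = cases.quantile(0.25), cases.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (cases < lower) | (cases > upper)

# One way to handle outliers: cap them at the fences
capped = cases.clip(lower=lower, upper=upper)

# Smooth with a centered 3-point moving average
smoothed = capped.rolling(window=3, center=True, min_periods=1).mean()
```

Capping keeps the series length intact, which matters when later steps expect evenly spaced observations.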

4:00 pm: Practical session 3

Apply moving average smoothing and visualize
Use a for loop to run the code for a few postcodes
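A minimal sketch of this session, assuming a wide table with one column per postcode. The postcodes, dates, and counts below are invented.

```python
import pandas as pd

# Made-up wide table: weekly case counts for two hypothetical postcodes
dates = pd.date_range("2024-01-01", periods=6, freq="7D")
wide = pd.DataFrame(
    {"40000": [1, 2, 3, 4, 5, 6], "50000": [2, 2, 4, 4, 6, 6]},
    index=dates,
)

# Run the same smoothing code for each selected postcode
smoothed = {}
for postcode in ["40000", "50000"]:
    smoothed[postcode] = wide[postcode].rolling(window=3).mean()

smoothed_df = pd.DataFrame(smoothed)
```

The first `window - 1` values of each smoothed column are NaN because a full window is not yet available. With matplotlib installed, `smoothed_df.plot()` would overlay the smoothed lines.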

Day 2

09:00 am: Feature engineering for time series data

Creating lag features
Rolling statistics (moving average)
Fourier transform and other feature extraction
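The three kinds of features above can be sketched on a synthetic daily series with a 7-day cycle. The series itself and the feature names (`lag_1`, `roll_mean_7`, and so on) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: constant level plus a weekly sine wave
t = np.arange(28)
y = pd.Series(
    10 + 3 * np.sin(2 * np.pi * t / 7),
    index=pd.date_range("2024-01-01", periods=28, freq="D"),
)

features = pd.DataFrame({"y": y})
features["lag_1"] = y.shift(1)                  # value one day earlier
features["lag_7"] = y.shift(7)                  # value one week earlier
features["roll_mean_7"] = y.rolling(7).mean()   # 7-day moving average
features["roll_std_7"] = y.rolling(7).std()     # 7-day rolling standard deviation

# Fourier transform: find the dominant cycle length
spectrum = np.abs(np.fft.rfft(y.to_numpy() - y.mean()))
freqs = np.fft.rfftfreq(len(y), d=1.0)          # cycles per day
dominant_period = 1 / freqs[spectrum.argmax()]  # in days
```

Subtracting the mean before the transform removes the zero-frequency component, so the largest peak reflects the cycle rather than the overall level.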

09:45 am: Time series data normalization and scaling

Normalization technique
Standardization technique
Effects of scaling on time series analysis
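A small sketch contrasting the two techniques, taking "normalization" to mean min-max scaling to [0, 1] and "standardization" to mean z-scores. That reading of the terms is an assumption, and the values are made up.

```python
import pandas as pd

cases = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])   # made-up values

# Min-max normalization: rescale to the range [0, 1]
normalized = (cases - cases.min()) / (cases.max() - cases.min())

# Standardization (z-score): zero mean, unit standard deviation
standardized = (cases - cases.mean()) / cases.std(ddof=0)
```

For time series, the minimum, maximum, mean, and standard deviation are usually computed on the training period only and then reused on later data, so that information from the future does not leak into the model.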

10:45 am: Break

11:00 am: Practical session 4

Create a plot showing the number of cases for a selected location, with a lag
Overlay external weather data

12:00 pm: Time series decomposition

Decomposing time series into trend, seasonality, and residuals
Additive vs. multiplicative decomposition
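To show what an additive decomposition does, here is a hand-rolled sketch using only pandas and NumPy on a synthetic series (linear trend plus a weekly pattern, no noise). It is a simplified version of the classical approach, not the method used in the course notebooks.

```python
import numpy as np
import pandas as pd

period = 7
n = 28
t = np.arange(n)
seasonal_pattern = np.array([3.0, 1.0, -2.0, -4.0, -1.0, 1.0, 2.0])  # sums to zero
y = pd.Series(0.5 * t + np.tile(seasonal_pattern, n // period))     # trend + seasonality

# Trend: centered moving average over one full period
trend = y.rolling(period, center=True).mean()

# Seasonality: average detrended value at each position in the cycle
detrended = y - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means = seasonal_means - seasonal_means.mean()   # center on zero
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=y.index)

# Residual: what is left after removing trend and seasonality
residual = y - trend - seasonal
```

A multiplicative decomposition follows the same steps with division in place of subtraction (y = trend × seasonal × residual), which suits series whose seasonal swings grow with the level. The statsmodels function `seasonal_decompose` offers both through its `model` argument.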

1:00 pm: Lunch

2:00 pm: Practical session 5

Machine learning cheat sheet
Facebook Prophet

4:00 pm: Resampling and time series frequency conversion

Downsampling
Upsampling
Resampling with aggregation
Frequency conversion
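A brief sketch of these four operations with pandas, on a made-up two-week daily series that starts on a Monday.

```python
import pandas as pd

# Made-up daily series: values 1..14 starting on Monday 2024-01-01
daily = pd.Series(
    range(1, 15),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
    dtype=float,
)

# Downsampling with aggregation: daily -> weekly (weeks end on Sunday)
weekly_sum = daily.resample("W").sum()
weekly_mean = daily.resample("W").mean()

# Upsampling: weekly -> daily, filling the new rows by linear interpolation
upsampled = weekly_mean.resample("D").interpolate()

# Frequency conversion without aggregation: keep every second day
every_other = daily.asfreq("2D")
```

`resample` groups observations into bins and needs an aggregation or fill method, while `asfreq` simply picks the values that fall on the new frequency.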

Day 3

09:00 am: Practical session 6

Split data into training and testing sets
Time series cross-validation techniques
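Unlike a random split, a time series split keeps the test data strictly after the training data. The sketch below shows a chronological split and an expanding-window cross-validation in plain Python. The function names are hypothetical helpers, not part of any library.

```python
def chronological_split(values, test_fraction=0.2):
    """Hold out the last part of the series as the test set."""
    n_test = max(1, round(len(values) * test_fraction))
    cut = len(values) - n_test
    return values[:cut], values[cut:]


def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); each fold trains on all earlier data."""
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many splits")
        yield list(range(test_start)), list(range(test_start, test_end))


series = list(range(10))                       # stand-in for 10 ordered observations
train, test = chronological_split(series, test_fraction=0.2)
folds = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

scikit-learn provides a comparable splitter, `TimeSeriesSplit`, in `sklearn.model_selection`.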

09:45 am: Introduction to NLP data preprocessing

Tokenization
Stopword removal
Stemming
Lemmatization
Text normalization
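To make the first three steps concrete, here is a toy sketch in plain Python. The stopword list is a tiny invented sample and `crude_stem` is a deliberately naive suffix stripper, not the Porter algorithm. Real work would typically use nltk or spaCy, which also provide lemmatization.

```python
import re

# Tiny illustrative stopword list, not a complete one
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "of", "and", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The patients were reporting fevers in the clinics."
tokens = tokenize(text)
content = remove_stopwords(tokens)
stems = [crude_stem(t) for t in content]   # ['patient', 'report', 'fever', 'clinic']
```

Stemming chops suffixes by rule and can produce non-words, while lemmatization maps each word to a dictionary form and therefore needs a vocabulary.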

10:45 am: Break

11:00 am: Text cleaning techniques

Removing special characters and numbers
Handling case sensitivity
Removing stopwords and punctuation
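A minimal sketch of these cleaning steps using only the standard library. The example sentence and the three-word stopword set are invented.

```python
import re
import string

STOPWORDS = {"by", "in", "the"}   # tiny illustrative list


def clean_text(text):
    """Lowercase, then remove numbers and punctuation, then tidy spaces."""
    text = text.lower()                                   # handle case sensitivity
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)                          # punctuation -> spaces
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace


cleaned = clean_text("Dengue CASES rose by 25% in Week #12!!")
words = [w for w in cleaned.split() if w not in STOPWORDS]
```

Whether to remove numbers depends on the task: for addresses or postcodes they carry meaning and would normally be kept.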

12:00 pm: Practical session 7

Geocoding using Google Cloud Platform
Compare Google Cloud Platform's address details with original dataset

1:00 pm: Lunch

2:00 pm: Text normalization techniques

Converting text to lowercase
Expanding contractions
Handling special characters and numbers
Normalizing whitespace
Removing non-ASCII characters
Normalizing text using lemmatization
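Several of the steps above can be chained in one function, sketched here with the standard library only. The contraction table is a small invented sample, and lemmatization is left out because it needs a library such as nltk or spaCy.

```python
import re
import unicodedata

# Small illustrative contraction table, not a complete one
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
}


def normalize_text(text):
    text = text.lower()                                   # convert to lowercase
    text = text.replace("\u2019", "'")                    # curly apostrophe -> straight
    for short, full in CONTRACTIONS.items():              # expand contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Split accented letters into base letter + accent, then drop non-ASCII
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()              # normalize whitespace


normalized = normalize_text("It\u2019s   a caf\u00e9;\tpatients  CAN'T wait")
```

Here `normalized` is `"it is a cafe; patients cannot wait"`. Dropping non-ASCII characters is lossy for non-English text, so it is worth checking what the data contains first.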

3:00 pm: Feature extraction for NLP

Bag of Words
Term Frequency-Inverse Document Frequency (TF-IDF)
Word Embeddings (Word2Vec)
Document Embeddings (Doc2Vec)
N-grams
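Bag of Words, TF-IDF, and n-grams are simple enough to sketch by hand. The version below uses term frequency `count / len(doc)` and inverse document frequency `ln(N / df)`, which is one common textbook variant. Libraries such as scikit-learn use smoothed formulas, so their numbers will differ. The three documents are made up.

```python
import math
from collections import Counter

docs = [
    "dengue cases rise",
    "dengue cases fall",
    "rain brings mosquitoes",
]
tokenized = [d.split() for d in docs]

# Bag of Words: word counts per document, ignoring word order
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """TF-IDF weights for one document (unsmoothed variant)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tfidf(tokens) for tokens in tokenized]


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(tokenized[0], 2)   # [('dengue', 'cases'), ('cases', 'rise')]
```

In the first document, "rise" gets a higher weight than "dengue" because it appears in only one of the three documents. Word2Vec and Doc2Vec are learned models and are not covered by this sketch.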

4:00 pm: Practical session 8

Geocode with OpenStreetMap
Geocode places with missing postcodes
