nih_time_series_nlp

Primary language: Jupyter Notebook. License: MIT.

Note

Solutions for Days 1, 2, and 3 will be posted here. Code-CV is here.

Tip

The wide pivot table can be found here.

Tip

Code for concatenating two DataFrames can be found here.

Day 1

08:45 am: Introduction

Here's what I achieved so far.

09:00 am: Introduction to Python

  1. Using either your new Google account or your personal account, open Google Colab in another tab
  2. Google Colab's interface and functions:
    • Tools >> Settings >>
      • Editor >> Show line numbers (check if you prefer)
      • Miscellaneous >> Corgi mode, Kitty mode (turn on if you like)
    • Test with some basic code:
      • Click Connect at top right
      • Write a simple definition
      • Test a simple math problem
    • Runtime settings
      • Run cells
      • Reset
    • Code and text cells
    • Save
  3. More about Colab’s Markdown here
  4. Open a new notebook on Google Colab
  5. Data Visualization
  6. Machine Learning and AI
  7. Scientific Computing
  8. Automation and Web Scraping
  9. Database Access
  10. Natural Language Processing (NLP)
  11. Image Processing
  12. Data Analysis and Manipulation

09:45 am: Practical session 1

  1. pandas
    • Series
      • Series Creation: Create a Pandas Series from a list of integers.
      • Indexing and Slicing: Access specific elements and slices from the Series.
      • Operations: Perform basic arithmetic operations on the Series.
      • Filtering: Filter the Series to include only elements greater than a certain value.
      • Missing Data: Introduce NaN values into the Series and handle them (e.g., fill with a value or drop).
    • DataFrame
      • DataFrame Creation: Create a DataFrame from a dictionary where the keys are column names and the values are lists of column data.
      • Exploring Data: Display the first few rows, summary statistics, and data types of the DataFrame.
      • Indexing and Selection: Select specific columns, rows, and subsets of the DataFrame.
      • Adding Columns: Add a new column to the DataFrame based on existing columns.
      • Handling Missing Data: Introduce NaN values and demonstrate methods to handle missing data (e.g., fillna, dropna).
    • Data Manipulation
      • Reading Data: Read a CSV file into a Pandas DataFrame.
      • Filtering Data: Filter rows based on a condition.
      • Sorting Data: Sort the DataFrame by a specific column.
      • Grouping Data: Group the DataFrame by a column and compute aggregate statistics.
      • Merging DataFrames: Merge two DataFrames on a common column.
  2. numpy
    • Basic operations
      • Array Creation: Create a 1D NumPy array of integers from 0 to 9.
      • Reshape: Convert the 1D array into a 2D array with 2 rows and 5 columns.
      • Slicing: Extract the first row and the second column of the 2D array.
      • Arithmetic Operations: Create another 2D array of the same shape and perform element-wise addition, subtraction, multiplication, and division.
      • Statistical Operations: Compute the mean, median, and standard deviation of the elements in the 2D array.
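
The exercises above can be sketched in one short notebook cell. This is a minimal sketch, not the workshop's official solution: the values, column names (`name`, `x`, `y`, `group`) and the second array `other` are made up for illustration.

```python
import numpy as np
import pandas as pd

# --- pandas Series ---
s = pd.Series([10, 20, 30, 40, 50])
first_three = s.iloc[:3]           # positional slice
doubled = s * 2                    # element-wise arithmetic
large = s[s > 25]                  # boolean filtering

s_missing = s.astype(float)
s_missing.iloc[1] = np.nan         # introduce a missing value
filled = s_missing.fillna(0)       # replace NaN with a fixed value
dropped = s_missing.dropna()       # or drop it instead

# --- pandas DataFrame (hypothetical columns) ---
df = pd.DataFrame({"name": ["a", "b", "c"], "x": [1, 2, 3], "y": [4, 5, 6]})
df["total"] = df["x"] + df["y"]                       # new column from existing ones
subset = df.loc[df["total"] > 5, ["name", "total"]]   # filter rows, select columns
by_total = df.sort_values("total", ascending=False)   # sort by a column

labels = pd.DataFrame({"name": ["a", "b", "c"], "group": ["g1", "g1", "g2"]})
merged = df.merge(labels, on="name")                  # merge on a common column
group_means = merged.groupby("group")["total"].mean() # aggregate per group

# --- NumPy ---
arr = np.arange(10)                # 0..9
grid = arr.reshape(2, 5)           # 2 rows, 5 columns
first_row = grid[0]
second_col = grid[:, 1]
other = np.full((2, 5), 2)         # same shape, illustrative values
added, diffed = grid + other, grid - other
multiplied, divided = grid * other, grid / other
mean, median, std = grid.mean(), np.median(grid), grid.std()
```

For the "Reading Data" step, `pd.read_csv("your_file.csv")` returns a DataFrame that can be explored with `head()`, `describe()`, and `dtypes` in the same way.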

10:45 am: Break


11:00 am: Introduction to time series data

  1. Understanding time series data
  2. Common time series patterns and terminology
  3. Loading and exploring time series data with Python

12:00 pm: Basic data cleaning techniques

  1. Handling missing values, imputation and interpolation
  2. Removing duplicates
  3. Data type conversion and validation
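
A minimal sketch of these three cleaning steps with pandas. The tiny `raw` table and its `date` and `cases` columns are invented for illustration and are not the workshop dataset.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing value, everything stored as text.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "cases": ["5", "7", "7", None, "11"],
})

# 2. Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# 3. Convert types; errors="coerce" turns unparseable entries into NaT/NaN.
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
clean["cases"] = pd.to_numeric(clean["cases"], errors="coerce")
clean = clean.set_index("date").sort_index()

# Simple validation: case counts should never be negative.
all_non_negative = bool(clean["cases"].dropna().ge(0).all())

# 1. Handle the missing value: carry the last value forward, or interpolate linearly.
filled_ffill = clean["cases"].ffill()
filled_linear = clean["cases"].interpolate(method="linear")
```

Forward fill repeats the last observed count, while linear interpolation assumes the count changed steadily between the two neighbouring observations. Which is more plausible depends on the data.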

1:00 pm: Lunch


2:00 pm: Practical session 2

  1. Download the dengue csv file
  2. Filter to only relevant columns
  3. Convert date to datetime format
  4. Identify postcodes with most complete/missing data
  5. Create a pivot table
  6. Visualize case counts data
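
The steps above might look roughly like this. The column names (`date`, `postcode`, `cases`, `notes`) and the values are assumptions, since the layout of the dengue CSV is not described here; an in-memory table stands in for the downloaded file.

```python
import pandas as pd

# Stand-in for pd.read_csv(...) on the downloaded file (hypothetical columns and values).
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-08", "2024-01-08", "2024-01-15"],
    "postcode": ["10110", "10400", "10110", "10400", "10110"],
    "cases": [3, 1, 4, 2, 6],
    "notes": ["", "", "", "", ""],
})

# Keep only the relevant columns and convert the date column.
df = df[["date", "postcode", "cases"]].copy()
df["date"] = pd.to_datetime(df["date"])

# Wide pivot table: one row per date, one column per postcode.
wide = df.pivot_table(index="date", columns="postcode", values="cases", aggfunc="sum")

# Completeness per postcode: dates with no record show up as NaN in the wide table.
missing_per_postcode = wide.isna().sum().sort_values()
completeness = wide.notna().mean()
```

With matplotlib installed, `wide.plot()` draws one line per postcode, which is a quick way to visualize the case counts.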

3:00 pm: Advanced data cleaning techniques

  1. Detecting and handling outliers
  2. Smoothing time series data
  3. Handling seasonal and trend components
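
One common way to flag outliers is the interquartile range (IQR) rule. The sketch below uses made-up counts and the conventional 1.5 × IQR threshold; both are assumptions, and other rules (z-scores, rolling medians) may suit the workshop data better.

```python
import pandas as pd

# Hypothetical weekly counts with one obvious spike.
cases = pd.Series([4, 5, 6, 5, 60, 6, 5, 4, 5, 6], dtype=float)

# IQR rule: flag points far outside the middle 50% of the data.
q1, q3 = cases.quantile(0.25), cases.quantile(0.75)
iqr = q3 - q1
is_outlier = (cases < q1 - 1.5 * iqr) | (cases > q3 + 1.5 * iqr)

# Handle the outlier: blank it out, then interpolate from its neighbours.
cleaned = cases.mask(is_outlier).interpolate()

# Light smoothing with a centred 3-point moving average.
smoothed = cleaned.rolling(window=3, center=True, min_periods=1).mean()
```

Whether a spike is an error or a real outbreak is a judgment call, so it is worth inspecting flagged points before replacing them.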

4:00 pm: Practical session 3

  1. Apply moving average smoothing and visualize
  2. Use a for loop to run the code for a few postcodes
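
A sketch of this session, assuming a wide table with one column per postcode. The postcodes, dates, and counts are invented, and the window of 3 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical wide table: weekly counts for two postcodes.
dates = pd.date_range("2024-01-01", periods=6, freq="W-MON")
wide = pd.DataFrame(
    {"10110": [2, 4, 6, 8, 10, 12], "10400": [1, 1, 4, 1, 1, 4]},
    index=dates,
)

# Apply the same moving average to each postcode in a for loop.
results = {}
for postcode in ["10110", "10400"]:
    results[postcode] = wide[postcode].rolling(window=3).mean()

smoothed = pd.DataFrame(results)
```

The first `window - 1` rows are NaN because a full window is not yet available. With matplotlib installed, calling `wide[postcode].plot()` and `smoothed[postcode].plot()` inside the loop overlays the raw and smoothed series.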

Day 2

09:00 am: Feature engineering for time series data

  1. Creating lag features
  2. Rolling statistics (moving average)
  3. Fourier transform and other feature extraction
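
The three feature types above can be sketched on a synthetic series. The series (a sine wave with a 12-step period) and the feature names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic series: constant level plus a 12-step seasonal cycle.
n = 24
t = np.arange(n)
cases = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 12))

features = pd.DataFrame({"cases": cases})

# 1. Lag features: the value 1 and 2 steps earlier.
features["lag_1"] = cases.shift(1)
features["lag_2"] = cases.shift(2)

# 2. Rolling statistics over the last 3 observations.
features["roll_mean_3"] = cases.rolling(window=3).mean()
features["roll_std_3"] = cases.rolling(window=3).std()

# 3. Fourier transform: find the dominant period from the amplitude spectrum.
spectrum = np.abs(np.fft.rfft((cases - cases.mean()).to_numpy()))
freqs = np.fft.rfftfreq(n, d=1.0)
dominant_period = 1 / freqs[spectrum.argmax()]

# Sine/cosine terms at that period can then serve as seasonal features.
features["sin_season"] = np.sin(2 * np.pi * t / dominant_period)
features["cos_season"] = np.cos(2 * np.pi * t / dominant_period)
```

Lagged and rolling columns start with NaN rows, which usually need to be dropped before fitting a model.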

09:45 am: Time series data normalization and scaling

  1. Normalization technique
  2. Standardization technique
  3. Effects of scaling on time series analysis
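
A minimal sketch of the two techniques, taking "normalization" to mean min-max scaling to the range [0, 1] and "standardization" to mean z-scores. The values are made up.

```python
import pandas as pd

cases = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale to the range [0, 1].
normalized = (cases - cases.min()) / (cases.max() - cases.min())

# Standardization (z-score): zero mean, unit standard deviation.
standardized = (cases - cases.mean()) / cases.std(ddof=0)
```

For time series, the minimum, maximum, mean, and standard deviation are best computed on the training period only and then reused on later data, so that information from the future does not leak into the model.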

10:45 am: Break


11:00 am: Practical session 4

  1. Create a plot showing the number of cases for a selected location, with a lag
  2. Overlay external weather data

12:00 pm: Time series decomposition

  1. Decomposing time series into trend, seasonality, and residuals
  2. Additive vs. multiplicative decomposition
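
An additive decomposition can be sketched by hand with a centred moving average. This is a simplified illustration on synthetic daily data with a weekly pattern, not a replacement for a library routine; the series and the `pattern` values are invented.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend plus a weekly pattern that sums to zero.
period = 7
t = np.arange(28)
pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0])
y = pd.Series(10 + 0.5 * t + np.tile(pattern, 4))

# Trend: centred moving average over one full period (NaN at both ends).
trend = y.rolling(window=period, center=True).mean()

# Seasonality: average the detrended values at each position in the week.
detrended = y - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means = seasonal_means - seasonal_means.mean()   # centre around zero
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=y.index)

# Residual: whatever the trend and seasonal parts do not explain.
residual = y - trend - seasonal
```

In the additive model the series is trend + seasonality + residual. In the multiplicative model the components are multiplied instead, so for strictly positive data the same steps can be applied to `np.log(y)` and the results exponentiated.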

1:00 pm: Lunch


2:00 pm: Practical session 5

  1. Machine learning cheat sheet
  2. Facebook Prophet

4:00 pm: Resampling and time series frequency conversion

  1. Downsampling
  2. Upsampling
  3. Resampling with aggregation
  4. Frequency conversion
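
The four operations can be sketched with pandas on two weeks of invented daily counts. The choice of weeks ending on Sunday (`W-SUN`) is an assumption.

```python
import pandas as pd

# Hypothetical daily counts for two weeks, starting on a Monday.
daily = pd.Series(
    range(1, 15),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# 1. Downsampling: daily -> weekly totals (weeks ending on Sunday).
weekly_total = daily.resample("W-SUN").sum()

# 3. Resampling with several aggregations at once.
weekly_stats = daily.resample("W-SUN").agg(["sum", "mean", "max"])

# 2. Upsampling: weekly -> daily, filling the new rows.
upsampled_ffill = weekly_total.resample("D").ffill()
upsampled_linear = weekly_total.resample("D").interpolate()

# 4. Frequency conversion without aggregation: keep every second day.
every_other_day = daily.asfreq("2D")
```

Downsampling needs an aggregation rule (sum, mean, and so on), while upsampling needs a rule for filling the rows that did not exist before.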

Day 3

09:00 am: Practical session 6

  1. Split data into training and testing sets
  2. Time series cross-validation techniques
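
For time series, the test data must come after the training data, so random shuffling is not appropriate. The sketch below shows a chronological split and a hand-written expanding-window cross-validation; the helper `expanding_window_splits` is hypothetical and written only for illustration (scikit-learn's `TimeSeriesSplit` offers similar behaviour).

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); training always precedes testing."""
    if n_samples - n_splits * test_size <= 0:
        raise ValueError("Not enough samples for the requested splits.")
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


series = list(range(12))  # stand-in for 12 time-ordered observations

# 1. Simple chronological split: first 75% for training, the rest for testing.
split_point = int(len(series) * 0.75)
train, test = series[:split_point], series[split_point:]

# 2. Expanding-window cross-validation: each fold trains on everything before its test block.
folds = list(expanding_window_splits(len(series), n_splits=3, test_size=2))
```

Each successive fold has a longer training set, which mimics retraining a model as new data arrives.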

09:45 am: Introduction to NLP data preprocessing

  1. Tokenization
  2. Stopword removal
  3. Stemming
  4. Lemmatization
  5. Text normalization
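
The first four steps can be illustrated with the standard library alone. Everything here is a toy: the stopword list and lemma lookup are tiny hand-made examples, and `crude_stem` is a naive suffix stripper, not the Porter stemmer. In practice a library such as NLTK or spaCy would normally be used.

```python
import re

# Tiny illustrative resources (assumptions, not a real stopword list or lexicon).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "of", "and", "to"}
LEMMAS = {"studies": "study", "tracking": "track", "mosquitoes": "mosquito",
          "flooded": "flood", "areas": "area"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    return [tok for tok in tokens if tok not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """Map a token to its dictionary form using the toy lookup table."""
    return LEMMAS.get(token, token)


text = "The studies were tracking mosquitoes in flooded areas."
tokens = tokenize(text)
content = remove_stopwords(tokens)
stems = [crude_stem(tok) for tok in content]
lemmas = [lemmatize(tok) for tok in content]
```

Note the difference on "studies": the stemmer chops it to "studi", which is not a word, while the lemmatizer returns the dictionary form "study".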

10:45 am: Break


11:00 am: Text cleaning techniques

  1. Removing special characters and numbers
  2. Handling case sensitivity
  3. Removing stopwords and punctuation
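
A sketch of a single cleaning function that covers these three steps. The default stopword set is a tiny made-up list, and the function only strips ASCII punctuation, so other symbols would need extra handling.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stopwords=frozenset({"the", "at", "in", "of", "and"})):
    text = text.lower()                      # handle case sensitivity
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(PUNCT_TO_SPACE)    # remove punctuation / special characters
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = clean_text("Dengue cases: 120 reported in Area-7 (2023)!!")
```

Removing numbers is not always desirable. For address data, for example, house numbers and postcodes carry meaning and should usually be kept.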

12:00 pm: Practical session 7

  1. Geocoding using Google Cloud Platform
  2. Compare Google Cloud Platform's address details with original dataset

1:00 pm: Lunch


2:00 pm: Text normalization techniques

  1. Converting text to lowercase
  2. Expanding contractions
  3. Handling special characters and numbers
  4. Normalizing whitespace
  5. Removing non-ASCII characters
  6. Normalizing text using lemmatization
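
Several of these steps can be chained in one function using only the standard library. The contraction table is a small hand-made example, and the simple string replacement could also match inside longer words, so treat this as a sketch. Lemmatization is left out because it needs an external lexicon.

```python
import re
import unicodedata

# Small illustrative contraction table (an assumption, far from complete).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}


def normalize_text(text):
    text = text.lower()                              # lowercase
    text = text.replace("\u2019", "'")               # curly apostrophe -> straight
    for short, full in CONTRACTIONS.items():         # expand contractions
        text = text.replace(short, full)
    # Split accented letters into base letter + accent, then drop non-ASCII bytes.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text


example = normalize_text("  It\u2019s  a caf\u00e9;\tdon't   worry ")
```

Dropping non-ASCII characters is only sensible for English text. It would erase most of the content of text written in Thai or other non-Latin scripts.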

3:00 pm: Feature extraction for NLP

  1. Bag of Words
  2. Term Frequency-Inverse Document Frequency (TF-IDF)
  3. Word Embeddings (Word2Vec)
  4. Document Embeddings (Doc2Vec)
  5. N-grams
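
Bag of Words, TF-IDF, and n-grams are simple enough to compute by hand, which helps show what the libraries do. The three documents are invented, and the TF-IDF variant used here (term count divided by document length, times the natural log of N over document frequency) is one of several conventions; scikit-learn, for example, uses a smoothed formula by default, so its numbers differ. Word2Vec and Doc2Vec need a trained model and are not sketched.

```python
import math
from collections import Counter

docs = [
    "dengue cases rise after rain",
    "rain increases mosquito breeding",
    "dengue cases fall in dry season",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: one count vector per document over a shared vocabulary.
vocab = sorted({tok for doc in tokenized for tok in doc})
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# TF-IDF: down-weight terms that appear in many documents.
n_docs = len(tokenized)
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))


def tfidf(doc):
    counts = Counter(doc)
    return {
        tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
        for tok, count in counts.items()
    }


weights = tfidf(tokenized[0])


# N-grams: sequences of n consecutive tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(tokenized[0], 2)
```

In the first document, "rise" appears in only one document and so gets a higher weight than "dengue", which appears in two.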

4:00 pm: Practical session 8

  1. Geocode with OpenStreetMap
  2. Geocode places with missing postcodes