SenteraLLC/geoml

Create file management class that returns full X matrix for any feature group

Closed this issue · 2 comments

Class name: rtio ("research tools in/out")

This class should provide generic file management functionality for any input data source we use. In addition to reading and writing data and/or results, this class will hold all applicable training data to be passed on to a training object class (yet to be developed).

Requirements:

Read and store predictors

  1. Cropscan
  2. Hyperspectral .spec files (.bil/.hdr files with only a single pixel)

Read and store response variables

  1. Some basic checks that response variables can be joined to predictors (e.g., do both dataframes contain "study", "year", and "plot_id"?)
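
The join check described above could look roughly like the sketch below. The helper name `missing_join_keys` and the sample column names are hypothetical and not part of geoml; only the three key names come from this issue. It works on plain column names, so `df.columns` from a pandas dataframe can be passed directly.

```python
JOIN_KEYS = ("study", "year", "plot_id")  # key columns named in this issue


def missing_join_keys(predictor_cols, response_cols, keys=JOIN_KEYS):
    """Return the join keys missing from each table (hypothetical helper)."""
    predictor_cols = set(predictor_cols)
    response_cols = set(response_cols)
    return {
        "predictors": [k for k in keys if k not in predictor_cols],
        "response": [k for k in keys if k not in response_cols],
    }


# Illustrative column names only; the real tables will differ.
cropscan_cols = ["study", "year", "plot_id", "date", "710"]
response_cols = ["study", "year", "value"]

missing = missing_join_keys(cropscan_cols, response_cols)
can_join = not any(missing.values())  # False here: "plot_id" is absent
```

Returning the missing keys per table, rather than a bare True/False, makes it easier to report which file needs fixing.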

Architecture

  1. Should rtio depend on join_tables, or should join_tables perhaps be included in rtio?

Note: there is not a need to duplicate the checks in the join_tables class.

This class is being designed so that complete X and y arrays are built for any feature group (see Issue #10 for a protocol, with demo, for defining which dataframe columns to include in the X matrix for training).

That demo achieves what we intend for this class, but it requires several lines of code (see the demo code) that can get messy and may be prone to error. To retrieve the relevant data more simply, this class initially loads all research data, then provides functions that filter, join, and manipulate the relevant tables and return the X matrix and y vector.
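
As a rough illustration of the join-then-extract step, here is a minimal pandas sketch. The function `build_X_y`, the column names, and all values are assumptions made up for this example; it is not the rtio implementation and it ignores date matching entirely.

```python
import pandas as pd

JOIN_KEYS = ["study", "year", "plot_id"]  # key columns named in this issue


def build_X_y(df_predictors, df_response, x_labels, ground_truth):
    """Join predictors to the response, then return X, y, and the joined df."""
    df = df_predictors.merge(df_response, on=JOIN_KEYS, how="inner")
    # Drop rows that lack any selected predictor or the response value.
    df = df.dropna(subset=x_labels + [ground_truth])
    X = df[x_labels].to_numpy()
    y = df[ground_truth].to_numpy()
    return X, y, df


# Made-up values purely for illustration.
df_cs = pd.DataFrame({
    "study": ["a", "a", "b"],
    "year": [2019, 2019, 2019],
    "plot_id": [101, 102, 101],
    "dae": [30, 30, 45],
    "800": [0.41, 0.44, 0.52],
})
df_vine = pd.DataFrame({
    "study": ["a", "b"],
    "year": [2019, 2019],
    "plot_id": [101, 101],
    "vine_n_pct": [4.2, 3.9],
})

X, y, df = build_X_y(df_cs, df_vine, ["dae", "800"], "vine_n_pct")
```

The inner join keeps only plots present in both tables, so plot 102 (no response value) is dropped before X and y are built.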

A class called rtio was created (in rtio.py). Its main function is get_feat_group_X(), which is called with a "group features" name (must come from feature_groups.py) and a ground truth (must be one of "vine_n_pct", "pet_no3_ppm", or "tuber_n_pct").

This lets us get the X matrix and y vector (df and x_labels are also returned) with a single function call.

Demo

from research_tools import feature_groups
from research_tools import rtio

base_dir_data = 'I:/Shared drives/NSF STTR Phase I – Potato Remote Sensing/Historical Data/Rosen Lab/Small Plot Data/Data'
my_rtio = rtio(base_dir_data)

group_feats = feature_groups.cs_test2
X, y, df, x_labels = my_rtio.get_feat_group_X(
    group_feats=group_feats, ground_truth='vine_n_pct',
    date_tolerance=3, random_seed=0)

print(group_feats)
{'dae': 'dae',
 'rate_ntd': {'col_rate_n': 'rate_n_kgha', 'col_out': 'rate_ntd_kgha'},
 'cropscan_wl_range1': [400, 900]}
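
This issue does not show how a feature group dictionary like the one printed above is turned into column names. One possible interpretation is sketched below, assuming Cropscan band columns are named by their wavelength in nm and that the range bounds are inclusive. The helper `select_band_columns` and the column list are hypothetical.

```python
def select_band_columns(columns, wl_range):
    """Return columns whose name parses as a wavelength inside wl_range."""
    wl_min, wl_max = wl_range
    selected = []
    for col in columns:
        try:
            wl = float(col)
        except (TypeError, ValueError):
            continue  # not a band column, e.g. "plot_id"
        if wl_min <= wl <= wl_max:  # inclusive bounds are an assumption
            selected.append(col)
    return selected


group_feats = {'dae': 'dae',
               'rate_ntd': {'col_rate_n': 'rate_n_kgha', 'col_out': 'rate_ntd_kgha'},
               'cropscan_wl_range1': [400, 900]}

# Hypothetical dataframe columns; real band names may differ.
columns = ["study", "plot_id", "dae", "460", "710", "1650"]
bands = select_band_columns(columns, group_feats['cropscan_wl_range1'])
x_labels = [group_feats['dae']] + bands
```

The 'rate_ntd' entry is left out here because its handling depends on logic not described in this issue.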