- What you're using F# for today
- What you're looking to use F# for in the future
- Some specifics about "analytical" work you do
- Some things in the language and/or ecosystem you would need in order to address the second bullet point
Data Scientists focus on two key workflows: Time to Insight and Time to Model. The tools they choose typically center on optimizing one or both of these activities. Each workflow can be broken down into a loop.

Time to Insight loop:
1. Import data (CSV/Excel/DB)
2. Shape the data (Joins, Grouping, Aggregates)
3. Compute features
4. Explore the data
5. Plot the data
6. Perform ANOVA
7. Correlation Analysis
8. If we have insight, stop. If not, return to step 1.
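The insight loop above can be sketched end-to-end in plain Python (the inline dataset and the `pearson` helper are illustrative assumptions; in practice this would be pandas or R):

```python
import math

# Step 1: "import" the data (inline here; normally CSV/Excel/DB)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Steps 2-3: shape the data / compute a feature (element-wise ratio)
ratio = [y / x for x, y in zip(xs, ys)]

# Later steps: explore via correlation analysis (Pearson's r)
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

corr = pearson(xs, ys)
# strong correlation -> insight; otherwise return to step 1
```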
Note: a Pipeline is a series of Featurizers with a Model Trainer at the end.

Time to Model loop:
1. Import the data (CSV/Excel/DB)
2. Shape the data (Joins, Grouping, Aggregates)
3. Compute input features
4. Create Pipeline(s)
5. Train Pipeline(s)
6. Evaluate Pipeline(s)
7. If a Pipeline is "good", stop. If not, return to step 1.
8. Export Pipeline: the trained Pipeline needs to be exported in a format from which it can be reloaded and used for scoring in the future.
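The "series of Featurizers with a Model Trainer at the end" idea can be made concrete with a minimal sketch (the `Pipeline` class, its featurizer, and the slope-fitting "trainer" are all illustrative assumptions, not a real ML library):

```python
import pickle

class Pipeline:
    """A series of featurizers with a model trainer at the end."""
    def __init__(self, featurizers):
        self.featurizers = featurizers  # each: float -> float
        self.slope = None               # learned parameter

    def _featurize(self, x):
        for f in self.featurizers:
            x = f(x)
        return x

    def train(self, xs, ys):
        # "trainer": least-squares slope through the origin, y ~ slope * x
        fx = [self._featurize(x) for x in xs]
        self.slope = sum(f * y for f, y in zip(fx, ys)) / sum(f * f for f in fx)
        return self

    def predict(self, x):
        return self.slope * self._featurize(x)

    def evaluate(self, xs, ys):
        # mean absolute error on held-out data
        return sum(abs(self.predict(x) - y) for x, y in zip(xs, ys)) / len(ys)

def scale(x):  # an example featurizer
    return x / 10.0

pipe = Pipeline([scale]).train([10.0, 20.0, 30.0], [1.0, 2.0, 3.0])
err = pipe.evaluate([10.0, 20.0, 30.0], [1.0, 2.0, 3.0])

# Export: persist the learned parameters so the pipeline can be
# reloaded and used for scoring in the future
blob = pickle.dumps({"slope": pipe.slope})
```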
- DataFrame-like type/library to work with
  - Fast for working in-memory but can also scale to larger data sets
  - R's `data.table` library is best in class for in-memory operations
  - Spark is the industry-standard distributed DataFrame; Dask is a Python competitor
  - Scaling up the size of the data set you are working with should be as "seamless" as possible
- Vectorized operations/functions for Columns
  - As a Data Scientist, I am used to working with columns of data and performing operations along the entire column
  - R is a vector-based language, so these types of operations are baked into the language
  - The `Numpy` package supports these kinds of operations with the `NDArray` type
  - Julia does this by adding a dot (`.`); the documentation refers to this as "dot syntax"
  - I am not sure if we can bake this into a `Column` type, but being able to project/broadcast a function along an entire row is quite powerful
  - Examples:
    - Column Addition: `df.ColC = df.ColA + df.ColB`
    - Vectorized Function Call: `df.ColC = Stats.statsFunc df.ColA df.ColB` where `statsFunc: float -> float -> float`
    - Could we do something like `df.ColC = Series.map funcA df.ColA df.ColB`, but without having to create `map2`, `map3`, `map2withParam`? Feels very un-F# :/
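To illustrate the kind of API being asked for, here is a toy pure-Python `Series` (the class and its design are illustrative assumptions): `+` is overloaded element-wise, and a single variadic `map` projects a function across any number of columns, avoiding the `map2`/`map3` proliferation:

```python
class Series:
    """Toy column type with element-wise (vectorized) operations."""
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # element-wise column addition: df.ColC = df.ColA + df.ColB
        return Series(a + b for a, b in zip(self.values, other.values))

    @staticmethod
    def map(func, *columns):
        # One variadic map instead of map2/map3/map2withParam:
        # apply func row-by-row across any number of columns
        return Series(func(*row) for row in zip(*(c.values for c in columns)))

col_a = Series([1.0, 2.0, 3.0])
col_b = Series([10.0, 20.0, 30.0])

col_c = col_a + col_b                              # vectorized addition
col_d = Series.map(lambda a, b: a * b, col_a, col_b)  # vectorized function call
```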
- Full suite of Joins/Aligning data
  - Inner/Outer/Cross/Left Outer/Right Outer
  - AsOf/Rolling Joins:
    - Last Observation Carried Forward (LOCF), LOCF within range
    - Nearest Preceding
    - Nearest Succeeding
    - Nearest
    - Within Window (Preceding Window, Succeeding Window, Preceding or Succeeding)
    - Nearest Observation, Nearest Observation within range
  - Rolling Joins Example
  - Overlaps Joins Example
  - Best in class: `data.table` or kdb+
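An AsOf/LOCF join of the kind `data.table` and kdb+ provide can be sketched in plain Python with `bisect` (the quote/trade data is an illustrative assumption): each left-side row is matched to the nearest preceding right-side observation, optionally only within a range.

```python
import bisect

def asof_join(left_times, right_times, right_values, max_gap=None):
    """For each left time, take the last right observation at or before it
    (Last Observation Carried Forward), optionally within max_gap.
    right_times must be sorted ascending."""
    out = []
    for t in left_times:
        i = bisect.bisect_right(right_times, t) - 1  # nearest preceding index
        if i < 0 or (max_gap is not None and t - right_times[i] > max_gap):
            out.append(None)                          # no match within range
        else:
            out.append(right_values[i])
    return out

# trades matched to the prevailing quote
trade_times = [2, 4, 9]
quote_times = [0, 4, 8]
quotes      = [100.0, 101.0, 102.0]

matched = asof_join(trade_times, quote_times, quotes)             # plain LOCF
within  = asof_join(trade_times, quote_times, quotes, max_gap=1)  # LOCF within range
```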
- Easy grouping of rows and performing analysis on the groups
- Window functions
- A DataFrame where we could access columns in F# just by knowing the type (similar to R's `df.NewCol <- aSeries`)
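Grouped aggregation and a simple window function can both be sketched with the standard library (the sample data and helper names are illustrative assumptions):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"sym": "A", "px": 1.0},
    {"sym": "A", "px": 3.0},
    {"sym": "B", "px": 2.0},
    {"sym": "B", "px": 4.0},
]

# Group rows and analyze each group: mean price per symbol
rows.sort(key=itemgetter("sym"))  # groupby needs sorted input
mean_px = {}
for sym, grp in groupby(rows, key=itemgetter("sym")):
    pxs = [r["px"] for r in grp]
    mean_px[sym] = sum(pxs) / len(pxs)

# Window function: mean over a trailing window of size `window`
def rolling_mean(xs, window):
    out = []
    for i in range(len(xs)):
        w = xs[max(0, i - window + 1): i + 1]
        out.append(sum(w) / len(w))
    return out

roll = rolling_mean([1.0, 2.0, 3.0, 4.0], 2)
```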
- Build ecosystem is fractured and overwhelming
  - Do I use FAKE, msbuild, VSCode's tasks with FAKE, or VSCode's tasks with msbuild?
  - Python has Anaconda and R has CRAN
- Package management is fractured and overwhelming
  - Paket vs. NuGet