/Minimally-Sufficient-Pandas

There are always multiple ways to complete a task in Pandas. A minimal subset of the library is sufficient for almost everything.

Primary LanguageJupyter NotebookBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

Become an Expert!

If you are looking to become an expert, check out my books:

They are all extremely comprehensive and offer lots of exercises with detailed solutions.

Mininally Sufficient Pandas Guidelines

This notebook contains a summary of all the guidelines in this tutorial along with a list of attributes and methods that provide nearly all of the functionality of Pandas.

Minimally sufficient Pandas

  • is simple, explicit, straightforward, and boring

  • has one obvious way to accomplish a task

  • uses this obvious way every single time

  • is easier to retain in memory

  • is easier to read and debug by yourself an others

  • uses less of the library by eliminating methods that provide no additional functionality

  • avoids Pandas bugs because of less code

  • doesn't rely on being tricky to impress friends

  • makes it easier to use in production

  • Selecting Subsets of Data

    • Select a single column of data with the brackets
    • Do not use dot notation
    • Be explicit and use loc and iloc
    • Never use ix
    • No need to use at or iat
  • Handling the SettingWithCopyWarning

    • Know the three cases when it appears
      • Correct assignment with side effects
      • No assignment
      • Correct assignment without side effects
    • To handle the warning, you will be in one of two scenarios
      • You want to work with a new independent DataFrame - use the copy method
      • You want to work with original DataFrame. Assign data with a single indexer, loc. Avoid chained indexing.
  • Method Duplication

    • Many methods are aliases or provide no extra functionality. Only use one
    • All operators have methods. Only use methods when necessary
    • Always use Pandas methods and not builtin Python functions
  • Say No to apply

    • apply is an automated for loop that passes each column or row to a user-defined function
    • Use apply as a method of last resort
    • Using apply with axis='columns' is one of the slowest operations in all of Pandas
  • Standardizing groupby

    • Know the three components
      • Grouping columns
      • Aggregating columns
      • Aggregating functions
    • Use the syntax `df.groupby('grouping columns').agg({'aggregating column': 'aggregating function'})
  • Handling a MultiIndex

    • A MultiIndex is difficult to make selections and further process
    • I suggest having a single level index
    • Rename the columns and reset the index after a groupby
  • Say no to apply with groupby

    • Can be extremely slow to use apply with groupby
    • Call all methods independent of the group, outside of the custom function
  • Similarity between groupby, pivot_table, crosstab

  • Similarity between melt, pivot, stack, unstack

Minimal set of DataFrame attributes and methods

Below is a short list of DataFrame attributes and methods that allows you maximum coverage of the library.

  • T
  • abs
  • all
  • any
  • append
  • asfreq
  • astype
  • clip
  • columns
  • copy
  • corr
  • count
  • cov
  • cummax
  • cummin
  • cumprod
  • cumsum
  • describe
  • diff
  • drop
  • drop_duplicates
  • dropna
  • dtypes
  • equals
  • expanding
  • fillna
  • groupby
  • head
  • idxmax
  • idxmin
  • iloc
  • index
  • interpolate
  • isin
  • isna
  • loc
  • max
  • mean
  • median
  • melt
  • merge
  • min
  • mode
  • nlargest
  • notna
  • nsmallest
  • nunique
  • pct_change
  • pivot_table
  • plot
  • prod
  • quantile
  • rank
  • rename
  • replace
  • resample
  • reset_index
  • rolling
  • round
  • sample
  • select_dtypes
  • shape
  • shift
  • sort_index
  • sort_values
  • std
  • sum
  • tail
  • to_csv
  • to_sql
  • values
  • var