Functions and methods in pandas, matplotlib, seaborn
Dictionaries
- list of key-value pairs separated by : and surrounded by { }
- world = {'afghanistan': 30.55, 'albania':2.77}
- to find the population of Albania
- print(world['albania'])
- Add data to dictionary
- type the dictionary name followed by ['key'] = value
- world['sealand'] = 0.000027
- Delete a dictionary value - del(world['sealand'])
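- A minimal runnable sketch of these dictionary operations:

```python
# Build a dictionary of country -> population (millions)
world = {'afghanistan': 30.55, 'albania': 2.77}

print(world['albania'])      # look up a value by key -> 2.77

world['sealand'] = 0.000027  # add a new key-value pair
del world['sealand']         # delete a key-value pair
print(world)                 # {'afghanistan': 30.55, 'albania': 2.77}
```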
Pandas DataFrames
- import pandas as pd
- stores tabular data in a DataFrame
- brics = pd.DataFrame(dictionary)
- set index
- brics.index = ['BR', 'RU', 'IN', 'CH', 'SA']
- brics = pd.read_csv('path/to/brics.csv', index_col=0)
- index_col=0 sets the index to the values in the first column (column 0)
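- Putting the above together - a sketch building the brics DataFrame from a dictionary (column values here are illustrative):

```python
import pandas as pd

# Dictionary of column name -> list of column values
dictionary = {
    'country': ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
    'capital': ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'],
}
brics = pd.DataFrame(dictionary)
brics.index = ['BR', 'RU', 'IN', 'CH', 'SA']  # set row labels
print(brics)
```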
- Subsetting DataFrames
- to select a column as a DataFrame use double brackets [[ ]]: brics[['country']] or brics[['country','capital']] (single brackets return a Series)
- Slicing rows in pandas
- .loc or .iloc
- brics.loc[['RU']] - still use [[ ]] to return a DataFrame
- selecting country, capital and slicing rows - brics.loc[['RU','IN','CH'],['country','capital']]
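- Sketch of the bracket vs .loc subsetting above, continuing the brics frame built earlier:

```python
print(brics['country'])                # single brackets -> Series
print(brics[['country', 'capital']])   # double brackets -> DataFrame
print(brics.loc[['RU']])               # row by label, [[ ]] keeps it a DataFrame
print(brics.loc[['RU', 'IN', 'CH'], ['country', 'capital']])  # rows and columns
```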
- .head() returns first few rows of df
- .info() shows the data type and number of missing values for each column
- .isna() gives True/False for each value, True if missing (NaN)
- .isna().any() gives a boolean per column, True if the column has any missing values
- .shape (attribute, no parentheses) returns the number of rows and columns of the df
- .describe() calculates a few summary stats for each column
- .values (attribute) returns a NumPy array of the values
- .columns (attribute) is an index of column names
- .index (attribute) is the row index, either row numbers or names
- .sort_values(['col_1','col_2'], ascending = [True,False]) sorts asc then desc
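- Quick sketch of the inspection and sorting calls; dogs is a small hypothetical DataFrame reused in the sketches below:

```python
import pandas as pd

dogs = pd.DataFrame({
    'name': ['Bella', 'Stella', 'Max'],
    'breed': ['Labrador', 'Chihuahua', 'Labrador'],
    'color': ['Brown', 'Tan', 'Black'],
    'height_cm': [56, 18, 59],
    'weight_kg': [25, 2, 29],
})

print(dogs.head())      # first few rows
dogs.info()             # dtypes and non-null counts per column
print(dogs.shape)       # (3, 5) - attribute, no parentheses
print(dogs.describe())  # summary stats for numeric columns
print(dogs.sort_values(['breed', 'height_cm'], ascending=[True, False]))
```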
- subset by df['column']
- dogs[['breed','height_cm']]
- dogs[dogs['height_cm'] > 50]
- dogs[dogs['breed'] == 'Labrador']
- dogs[(dogs['height_cm'] > 50) & (dogs['breed'] == 'Labrador')]
- or assign the boolean Series to variables and use:
- is_lab = dogs['breed'] == 'Labrador'
- is_tall = dogs['height_cm'] > 50
- dogs[is_lab & is_tall]
- filter on multiple values of categorical data
- .isin()
- is_black_or_brown = dogs['color'].isin(['Black','Brown'])
- dogs[is_black_or_brown]
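- Sketch of the filtering patterns above, using the dogs frame defined earlier:

```python
is_tall = dogs['height_cm'] > 50      # boolean Series
is_lab = dogs['breed'] == 'Labrador'
print(dogs[is_lab & is_tall])         # combine conditions with &

is_black_or_brown = dogs['color'].isin(['Black', 'Brown'])
print(dogs[is_black_or_brown])        # filter on multiple categories
```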
- df['new_column'] = values or condition to produce values
- dogs['height_m'] = dogs['height_cm'] / 100
- example: let's find the tallest dogs with a BMI less than 100
- bmi_lt_100 = dogs[dogs['bmi'] < 100]
- bmi_lt_100_height = bmi_lt_100.sort_values('height_cm', ascending=False)
- bmi_lt_100_height[['name','height_cm','bmi']]
- .mean() average of values
- .median() middle of all values
- .mode() number appearing most often
- .min() smallest value - with dates .min() returns oldest
- .max() largest value - with dates .max() returns the most recent
- .var() measures how far values spread from the mean - larger variance means more risk or uncertainty - smaller variance means less risk, more certainty
- .std() square root of variance - low std = values close to the mean - high std = values spread farther from the mean
- .quantile(q) returns the value at quantile q, e.g. .quantile(0.5) is the median
- .cumsum() cumulative sum of values
- .cummax() cumulative maximum up to that point
- .cummin() cumulative minimum up to that point
- .cumprod() cumulative product
- .drop_duplicates(subset='column') - if uniqueness is a combination of two columns: unique_dogs = vet_visits.drop_duplicates(subset=['name','breed'])
- .value_counts(sort=True, normalize=True) - counts categorical data - sort=True puts the largest count on top - normalize=True turns counts into proportions of the total
- .groupby('column')['value'].mean() - .agg(['min','max','sum']) aggregates values by several statistics - dogs.groupby(['color','breed'])[['weight','height']].agg(['min','max','mean']) selects min, max, and mean height and weight of dogs by color and breed
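- Sketch of grouped aggregation on the dogs frame (note the method is .groupby, and aggfuncs can be passed as strings):

```python
# Mean weight per color
print(dogs.groupby('color')['weight_kg'].mean())

# Several stats for several columns, grouped by two keys
print(
    dogs.groupby(['color', 'breed'])[['weight_kg', 'height_cm']]
        .agg(['min', 'max', 'mean'])
)
```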
- Pivot tables are data frames with sorted indexes
- .pivot_table(values=, index=, columns=, fill_value=, aggfunc=['mean','median'], margins=True)
- values = values wanted in pivot table
- index = column to set as index
- columns = second grouping column, shown along the top of the pivot table
- fill_value=0 replaces all NaN with 0
- margins=True adds a row and column with the mean of all rows or columns at the end
- this allows summary stats for every group and for the whole table
- aggfunc=['mean','median'] allows aggregation by multiple stats - dogs.pivot_table(values='weight', index='color', columns='breed', fill_value=0, margins=True) places color as the index and breed as the columns with mean weight as the values; the last row and column provide summary stats per column or row
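- Sketch of the pivot table call described above, on the dogs frame:

```python
pivot = dogs.pivot_table(
    values='weight_kg',   # values to aggregate
    index='color',        # rows
    columns='breed',      # columns
    fill_value=0,         # replace NaN with 0
    margins=True,         # 'All' row/column with overall stats
    aggfunc='mean',       # or a list, e.g. ['mean', 'median']
)
print(pivot)
```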
- .set_index('col_name') sets the column as the index instead of row number - can tell column is index because values are left aligned - dogs.set_index('name') sets name as index
- .reset_index(drop=True) - returns to row numbers as the index - drop=True discards the old index instead of restoring it as a column - dogs_ind.reset_index(drop=True) removes the name values entirely
- Why use indexing
- makes subsetting cleaner
- without name as index
- dogs[dogs['name'].isin(['Bella','Stella'])]
- with name as index
- dogs_ind.loc[['Bella','Stella']] - index allows using .loc to subset
- can have multilevel indexing
- dogs.set_index(['breed','color'])
- breed is outer level index color inner level
- subsetting on the inner level needs tuples - dogs_ind3.loc[[('Labrador','Brown'),('Chihuahua','Tan')]]
- .sort_index(level = ['color','breed'], ascending = [True,False])
- OUTER LEVEL index slicing
- dogs.loc['Chow Chow':'Poodle']
- slices all index values from Chow Chow to Poodle (Poodle is included)
- INNER LEVEL index slicing
- dogs.loc[('Labrador','Brown'):('Schnauzer','Gray')]
- slicing rows and columns together
- dogs.loc[:, 'name':'height']
- selects all rows (:) and columns name to height
- dogs.loc[('Labrador','Brown'):('Schnauzer','Gray'), 'name':'height']
- slices rows Brown Labrador to Gray Schnauzer and columns name to height
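- Sketch of label-based and multi-level indexing on the dogs frame (labels come from the examples above):

```python
dogs_ind = dogs.set_index('name')          # 'name' column becomes the index
print(dogs_ind.loc[['Bella', 'Stella']])   # subset rows by label with .loc

dogs_ind2 = dogs.set_index(['breed', 'color']).sort_index()
# Inner levels need tuples
print(dogs_ind2.loc[[('Labrador', 'Brown'), ('Chihuahua', 'Tan')]])
# Slice the outer level (end label included), then rows and columns together
print(dogs_ind2.loc['Chow Chow':'Poodle'])
print(dogs_ind2.loc[:, 'height_cm':'weight_kg'])
```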
Matplotlib
- plt.plot(x,y) = line chart
- plt.scatter(x,y) = scatter plot
- s = changes size of marker
- c = color
- alpha = transparency, where 1 is fully opaque
- plt.xscale('log') = plots using a logarithmic scale
- plt.hist(values, bins=n) plots a histogram with n bins
- plt.xlabel('Year') names x axis Year
- plt.ylabel('Population') names y axis Population
- plt.title('World Population Projections') adds this title to chart
- plt.yticks([0,2,4,6,8,10], ['0','2B','4B','6B','8B','10B'])
- first list sets the y axis tick positions, starting at 0
- second list sets the tick labels, so 2 will show as 2B
- Annotations
- plt.text(x,y,'text to insert') inserts text at the x,y coordinates
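- A minimal sketch putting these plt calls together (data values are illustrative):

```python
import matplotlib.pyplot as plt

year = [1950, 1970, 1990, 2010]
pop = [2.5, 3.7, 5.3, 6.9]  # population in billions, illustrative

plt.plot(year, pop)                                # line chart
plt.scatter(year, pop, s=40, c='red', alpha=0.5)   # markers on top
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B'])
plt.text(1970, 3.7, 'sample annotation')           # text at (x, y)
plt.show()
```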
- Pandas time series
- from datetime import datetime - to manually create dates
- pd.Timestamp('2023-03-15')
- time will be midnight here 00:00:00
- timestamp.year returns 2023
- timestamp.day_name() Returns Wednesday
- time periods - pd.Period('2017-01') creates a period with monthly frequency, spanning the whole month
- period.asfreq('D') returns daily frequency
- pd.Period('2017-01', 'M') + 1
- a period with monthly frequency
- adding 1 moves one period ahead, to February 2017
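- Sketch of Timestamp and Period basics:

```python
import pandas as pd

ts = pd.Timestamp('2023-03-15')  # time defaults to midnight 00:00:00
print(ts.year)                   # 2023
print(ts.day_name())             # 'Wednesday'

p = pd.Period('2017-01')         # monthly frequency by default
print(p.asfreq('D'))             # convert to daily frequency
print(p + 1)                     # one period ahead -> 2017-02
```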
- Creating Time series
- index = pd.date_range(start, end, periods, freq)
- index = pd.date_range(start='2017-01-01', periods=12, freq='M')
- default is daily frequency
- can check using pd.DataFrame({'data':index}).info()
- Frequencies: H-hour, D-day, W-week, M-month, Q-quarter, Y-year, B-business days
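- Sketch of pd.date_range (specify any three of start, end, periods, freq):

```python
import pandas as pd

index = pd.date_range(start='2017-01-01', periods=12, freq='M')  # 12 month-ends
print(index[:3])
pd.DataFrame({'data': index}).info()  # confirm the dtype is datetime64[ns]
```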
- convert date string to datetime64
- pd.to_datetime(df['date'])
- convert date into index
- df.set_index('date', inplace=True)
- inplace=True modifies the DataFrame in place instead of creating a copy
- pd.read_csv('csv', parse_dates=['date'], index_col='date')
- Can then filter and slice
- df.loc['2023'] returns all rows for 2023
- df['2023-3':'2023-6'] slices data in that date range (inclusive of last date)
- df.loc['2023-3-15', 'price'] returns price on that date
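- Sketch of parsing dates on read and slicing by date (the file name and columns are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical CSV with 'date' and 'price' columns
df = pd.read_csv('prices.csv', parse_dates=['date'], index_col='date')

print(df.loc['2023'])                # all rows in 2023
print(df['2023-3':'2023-6'])         # slice, inclusive of the last date
print(df.loc['2023-3-15', 'price'])  # value on a specific date

df = df.asfreq('M')                  # set a monthly frequency
print(df[df['price'].isnull()])      # rows introduced as missing
```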
- Set frequency
- df.asfreq('M')
- df[df['price'].isnull()] returns rows with missing prices
- df.shift() moves the time series between past and future
- lag - df.shift() shifts the data forward by 1 period, so each row shows the previous value
- lead - df.shift(periods=-1) shifts the data back by 1 period, so each row shows the next value
- Financial return - calculate the 1-period percent change
- df.price.div(df.shifted)
- divides price by the shifted price column
- Find relative change in percentage terms
- df.change.sub(1).mul(100)
- subtracts 1 from the change column, then multiplies by 100 to get a percentage
- df.price.diff() difference in value for two periods
- df.column.pct_change(periods=n).mul(100) does the same in one step over n periods
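- Sketch of shift-based returns vs pct_change (df is assumed to have a 'price' column on a DatetimeIndex, as above):

```python
df['shifted'] = df['price'].shift()            # previous period's price (lag)
df['change'] = df['price'].div(df['shifted'])  # price relative to prior period
df['pct'] = df['change'].sub(1).mul(100)       # relative change in percent

df['diff'] = df['price'].diff()                # absolute change between periods
# One-liner equivalent of the shift/div/sub/mul chain:
df['pct2'] = df['price'].pct_change().mul(100)
```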
- Since prices start at different levels normalize price series to start at 100
- Divide all prices by first in series then multiply by 100
- Provides same starting point
- All prices relevent to starting point
- Difference to the starting point is in percentage points
- first_price = df.price.iloc[0]
- normalized = df.price.div(first_price).mul(100)
- normalized.plot()
- Comparing against a benchmark
- index = pd.read_csv('benchmark_csv', parse_dates=['date'], index_col='date')
- prices = pd.concat([prices, index], axis=1).dropna()
- normalized = prices.div(prices.iloc[0]).mul(100)
- normalized.plot()
- diff = normalized[tickers].sub(normalized['SP500'], axis=0)
- DatetimeIndex frequency set using .asfreq() or .reindex()
- Frequency conversions affect data
- Upsampling: fill or interpolate missing data
- Upsampling increases granularity, which introduces nulls
- Example: changing frequency from quarterly to monthly
- monthly = quarterly.asfreq('M') upsamples the quarterly series to monthly
- monthly = monthly.to_frame('baseline') converts the Series to a DataFrame with column 'baseline'
- Fill methods:
- monthly['ffill'] = quarterly.asfreq('M', method='ffill')
- monthly['bfill'] = quarterly.asfreq('M', method='bfill')
- monthly['value'] = quarterly.asfreq('M', fill_value=0)
- Downsampling: fewer rows, need to aggregate the existing data (see the sketch below)
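- Downsampling sketch using .resample(), the usual pandas tool for aggregating to a lower frequency (an assumption here; the notes above only name .asfreq() and .reindex()):

```python
import pandas as pd

# Daily series of 90 values
daily = pd.Series(range(90),
                  index=pd.date_range('2017-01-01', periods=90, freq='D'))

monthly_mean = daily.resample('M').mean()  # aggregate each month to its mean
print(monthly_mean)
```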