About Python
Some good material from arnaldog12
conda create --name new_name --clone old_name
The root environment is named base; to clone it, use:
conda create --name <env_name> --clone base
conda list -n <env_name>
conda create -n name_of_my_env python=2.7
activate name_of_my_env
CMD.exe (as administrator)
conda info --envs
python --help
cmd.exe (as administrator)
easy_install pip
pip install nbopen
or python -m nbopen.install_win
conda install pandas=0.20.3
or
pip install numpy==1.10.4
To install, use one of the following commands:
The latest version: pip install foo --user
A particular version (e.g., foo 1.0.3): pip install foo==1.0.3 --user
A minimum version (e.g., foo 2.0): pip install 'foo>=2.0' --user
Run cmd as an administrator:
>python --version
Python 3.4.3
pip install pandas==version --user
Once you have created a notebook file, you can easily convert it to an HTML file, which makes it easy to share or put on a website. From the prompt:
jupyter nbconvert --to html --execute YOUR_FILE.ipynb --output OUTPUT.html
column_names = df.columns
print(column_names)
df.dtypes
for i in column_names:
    print('{} is unique: {}'.format(i, df[i].is_unique))
import os
import sys

currDir = os.path.dirname(os.path.realpath(__file__))
rootDir = os.path.abspath(os.path.join(currDir, '..'))
sys.path.insert(1, os.path.join(rootDir, 'src'))  # make the func package importable
from func import Foo
df.index.values
'foo' in df.index.values
df.set_index('column_name_to_use', inplace=True)
columns_to_drop = [column_names[i] for i in [1, 3, 5]]
df.drop(columns_to_drop, inplace=True, axis=1)
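A minimal runnable sketch of dropping columns by position (the frame and column names here are made up for illustration):

```python
import pandas as pd

# Toy frame with four columns
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

column_names = df.columns
columns_to_drop = [column_names[i] for i in [1, 3]]  # positions 1 and 3 -> 'b', 'd'
df.drop(columns_to_drop, inplace=True, axis=1)

print(list(df.columns))  # ['a', 'c']
```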
df['col'] = df['col'].fillna(' ')
df['col'] = df['col'].fillna(99)
df['col'] = df['col'].fillna(df['col'].mean())
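The constant-fill and mean-fill strategies above, on a toy Series (values made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

filled_const = df['col'].fillna(99)               # fill with a constant
filled_mean = df['col'].fillna(df['col'].mean())  # mean of 1.0 and 3.0 is 2.0

print(filled_const.tolist())  # [1.0, 99.0, 3.0]
print(filled_mean.tolist())   # [1.0, 2.0, 3.0]
```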
Propagate non-null values forward or backward by passing method='pad' as the method argument. It fills each NaN with the previous non-NaN value. Maybe you just want to fill one value (limit=1), or you want to fill all the values. Whatever you choose, make sure it is consistent with the rest of your data cleaning.
df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})
col1
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0 # This is the value to fill forward
5 NaN
6 NaN
df.fillna(method='pad', limit=1)
col1
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
5 4.0 # Filled forward
6 NaN
Fill the first two NaN values with the first available value
df.fillna(method='bfill')
col1
0 2.0 # Filled
1 2.0 # Filled
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
df.dropna()
df.dropna(axis=1)
df.dropna(thresh=int(df.shape[0] * .9), axis=1)
The parameter thresh=N requires that a column has at least N non-NaNs to survive.
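A quick sketch of how thresh interacts with axis=1, on a made-up frame where one column is 80% populated and the other only 20%:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0, 4.0, np.nan],          # 4 of 5 non-NaN
    'mostly_nan': [1.0, np.nan, np.nan, np.nan, np.nan],  # 1 of 5 non-NaN
})

# thresh = int(5 * .9) = 4, so a column needs at least 4 non-NaNs to survive
cleaned = df.dropna(thresh=int(df.shape[0] * .9), axis=1)
print(list(cleaned.columns))  # ['mostly_full']
```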
Follow this syntax:
np.where(if_this_condition_is_true, do_this, else_this)
Example:
df['new_column'] = np.where(df['col1'] > 10, 'foo', 'bar')
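The same idea on a small deterministic frame, so the output is reproducible (values made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, 12, 8, 20]})

# 'foo' where col1 > 10, otherwise 'bar'
df['new_column'] = np.where(df['col1'] > 10, 'foo', 'bar')
print(df['new_column'].tolist())  # ['bar', 'foo', 'bar', 'foo']
```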
Simple dataframe to test:
df = pd.DataFrame(data={'col1':np.random.randint(0, 10, 10), 'col2':np.random.randint(-10, 10, 10)})
Index col1 col2
0 0 6
1 6 -1
2 8 4
3 0 5
4 3 -7
5 4 -5
6 3 -10
7 9 -8
8 0 4
9 7 -4
Test that all the values in col1 are >= 0 using an assert statement:
assert (df['col1'] >= 0).all()  # Should raise nothing
Let's test if any of the values are strings:
assert (df['col1'] != str).any()  # Should raise nothing
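Note that comparing values against the type str is a weak check, since a number is never equal to a type. A more direct sketch inspects each element's actual type (the Series here is made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# No element should be an actual str instance
assert not s.map(lambda v: isinstance(v, str)).any()

# For numeric columns, a dtype check is usually enough
assert s.dtype != object
```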
Testing the two columns to see if they are equal
assert(df['col1'] == df['col2']).all()
import pandas.util.testing as tm
tm.assert_series_equal(df['col1'], df['col2'])
AssertionError: Series are different
Series values are different (100.0 %)
[left]: [0, 6, 8, 0, 3, 4, 3, 9, 0, 7]
[right]: [6, -1, 4, 5, -7, -5, -10, -8, 4, -4]
The beautifier library can help you clean up some commonly used patterns, such as emails or URLs. It's nothing fancy, but it can speed up cleanup.
pip3 install beautifier
from beautifier import Email, Url
email_string = 'foo@bar.com'
email = Email(email_string)
print(email.domain)
print(email.username)
print(email.is_free_email)
>>
bar.com
foo
False
url_string = 'https://github.com/labtocat/beautifier/blob/master/beautifier/__init__.py'
url = Url(url_string)
print(url.param)
print(url.username)
print(url.domain)
>>
None
{'msg': 'feature is currently available only with linkedin urls'}
github.com
dedupe is a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.
More information at https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4
Set fields
fields = [
{'field' : 'Source', 'type': 'Set'},
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Address', 'type': 'String'},
{'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
{'field' : 'Phone', 'type': 'String', 'has missing' : True},
{'field' : 'Email Address', 'type': 'String', 'has missing' : True},
]
Pass in our model
deduper = dedupe.Dedupe(fields)
Check if it is working
deduper
>>
<dedupe.api.Dedupe at 0x11535bbe0>
Feed some sample data in ... 15000 records
deduper.sample(df, 15000)
dedupe.consoleLabel(deduper)