About Python
Some good material from arnaldog12
conda create --name new_name --clone old_name
The root environment is named base; to clone it, use:
conda create --name <env_name> --clone base
conda list -n <env_name>
conda create -n name_of_my_env python=2.7
activate name_of_my_env
CMD.exe (as administrator)
conda info --envs
python --help
cmd.exe (as administrator)
easy_install pip
pip install nbopen
or python -m nbopen.install_win
conda install pandas=0.20.3
or
pip install numpy==1.10.4
To install, use one of the following commands:
The latest version: pip install foo --user
A particular version (e.g., foo 1.0.3): pip install foo==1.0.3 --user
A minimum version (e.g., foo 2.0): pip install 'foo>=2.0' --user
Run cmd as an administrator:
>python --version
Python 3.4.3
pip install pandas==version --user
Once you have created a notebook file, you can easily convert it to an HTML file, which makes it easy to share or put on a website. From the prompt:
jupyter nbconvert --to html --execute YOUR_FILE.ipynb --output OUTPUT.html
column_names = df.columns
print(column_names)
df.dtypes
for i in column_names:
    print('{} is unique: {}'.format(i, df[i].is_unique))
import os
import sys

currDir = os.path.dirname(os.path.realpath(__file__))
rootDir = os.path.abspath(os.path.join(currDir, '..'))
sys.path.insert(1, os.path.join(rootDir, 'src'))  # make the func package importable
from func import Foo
df.index.values
'foo' in df.index.values
df.set_index('column_name_to_use', inplace=True)
columns_to_drop = [column_names[i] for i in [1, 3, 5]]
df.drop(columns_to_drop, inplace=True, axis=1)
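A minimal runnable sketch of dropping columns by position (the frame and column names here are made up for illustration):

```python
import pandas as pd

# Toy frame with four columns
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

column_names = df.columns
columns_to_drop = [column_names[i] for i in [1, 3]]  # positions 1 and 3 -> 'b', 'd'
df.drop(columns_to_drop, inplace=True, axis=1)

print(list(df.columns))  # ['a', 'c']
```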
df['col'] = df['col'].fillna(' ')
df['col'] = df['col'].fillna(99)
df['col'] = df['col'].fillna(df['col'].mean())
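The constant-fill and mean-fill strategies above, on a toy Series (values made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

filled_const = df['col'].fillna(99)               # fill with a constant
filled_mean = df['col'].fillna(df['col'].mean())  # mean of 1.0 and 3.0 is 2.0

print(filled_const.tolist())  # [1.0, 99.0, 3.0]
print(filled_mean.tolist())   # [1.0, 2.0, 3.0]
```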
Propagate non-null values forward or backward by passing method='pad' as the method argument. It fills each NaN with the previous non-NaN value. Maybe you just want to fill one value (limit=1), or you want to fill all the values. Whatever you choose, make sure it is consistent with the rest of your data cleaning.
df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})
col1
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0 # This is the value to fill forward
5 NaN
6 NaN
df.fillna(method='pad', limit=1)
col1
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
5 4.0 # Filled forward
6 NaN
Fill the first two NaN values with the first available value
df.fillna(method='bfill')
col1
0 2.0 # Filled
1 2.0 # Filled
2 2.0
3 3.0
4 4.0
5 NaN
6 NaN
df.dropna()
df.dropna(axis=1)
df.dropna(thresh=int(df.shape[0] * .9), axis=1)
The parameter thresh=N requires that a column has at least N non-NaNs to survive.
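A quick sketch of how thresh interacts with axis=1, on a made-up frame where one column is 80% populated and the other only 20%:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0, 4.0, np.nan],          # 4 of 5 non-NaN
    'mostly_nan': [1.0, np.nan, np.nan, np.nan, np.nan],  # 1 of 5 non-NaN
})

# thresh = int(5 * .9) = 4, so a column needs at least 4 non-NaNs to survive
cleaned = df.dropna(thresh=int(df.shape[0] * .9), axis=1)
print(list(cleaned.columns))  # ['mostly_full']
```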
Follow this syntax:
np.where(if_this_condition_is_true, do_this, else_this)
Example:
df['new_column'] = np.where(df['col1'] > 10, 'foo', 'bar')
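The same idea on a small deterministic frame, so the output is reproducible (values made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, 12, 8, 20]})

# 'foo' where col1 > 10, otherwise 'bar'
df['new_column'] = np.where(df['col1'] > 10, 'foo', 'bar')
print(df['new_column'].tolist())  # ['bar', 'foo', 'bar', 'foo']
```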
Simple dataframe to test:
df = pd.DataFrame(data={'col1':np.random.randint(0, 10, 10), 'col2':np.random.randint(-10, 10, 10)})
Index col1 col2
0 0 6
1 6 -1
2 8 4
3 0 5
4 3 -7
5 4 -5
6 3 -10
7 9 -8
8 0 4
9 7 -4
Test that all the values in col1 are >= 0 using an assert statement:
assert (df['col1'] >= 0).all()  # Should raise nothing
Let's test if any of the values are strings:
assert (df['col1'] != str).any()  # Should raise nothing
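Note that comparing values against the type str is a weak check, since a number is never equal to a type. A more direct sketch inspects each element's actual type (the Series here is made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# No element should be an actual str instance
assert not s.map(lambda v: isinstance(v, str)).any()

# For numeric columns, a dtype check is usually enough
assert s.dtype != object
```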
Testing the two columns to see if they are equal
assert(df['col1'] == df['col2']).all()
import pandas.util.testing as tm
tm.assert_series_equal(df['col1'], df['col2'])
AssertionError: Series are different
Series values are different (100.0 %)
[left]: [0, 6, 8, 0, 3, 4, 3, 9, 0, 7]
[right]: [6, -1, 4, 5, -7, -5, -10, -8, 4, -4]
The beautifier library can help you clean up some commonly used patterns, such as emails or URLs. It's nothing fancy, but it can speed up cleanup.
pip3 install beautifier
from beautifier import Email, Url
email_string = 'foo@bar.com'
email = Email(email_string)
print(email.domain)
print(email.username)
print(email.is_free_email)
>>
bar.com
foo
False
url_string = 'https://github.com/labtocat/beautifier/blob/master/beautifier/__init__.py'
url = Url(url_string)
print(url.param)
print(url.username)
print(url.domain)
>>
None
{'msg': 'feature is currently available only with linkedin urls'}
github.com
dedupe is a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.
More information at https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4
Set fields
fields = [
{'field' : 'Source', 'type': 'Set'},
{'field' : 'Site name', 'type': 'String'},
{'field' : 'Address', 'type': 'String'},
{'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
{'field' : 'Phone', 'type': 'String', 'has missing' : True},
{'field' : 'Email Address', 'type': 'String', 'has missing' : True},
]
Pass in our model
deduper = dedupe.Dedupe(fields)
Check if it is working
deduper
>>
<dedupe.api.Dedupe at 0x11535bbe0>
Feed some sample data in ... 15000 records
deduper.sample(df, 15000)
dedupe.consoleLabel(deduper)