Python Tips
A collection of Python tips from my everyday coding.
Python
1. parallel computing
I find the joblib package very convenient for running a loop across multiple processes. For example, the following code is a simple loop.
from math import sqrt
[sqrt(i ** 2) for i in range(10)]
With joblib, we can do it in a multiprocess fashion.
from math import sqrt
from joblib import Parallel, delayed
Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
Just wrap the function in delayed, as in delayed(function_name)(arguments). The result is a list in which each element is the output of one loop iteration. For more complicated usage, see the joblib documentation.
2. use pickle to save data
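Pickle can save most Python objects. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.
import pickle
## save the object
with open('train_test_id.pickle', 'wb') as f:
    pickle.dump(train_test_id, f)
## load it back
with open('train_test_id.pickle', 'rb') as f:
    train_test_id = pickle.load(f)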
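The round trip above can be tried end to end with a small sketch like the following. The contents of `train_test_id` are made up for illustration, and a temporary directory stands in for the working directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the object saved in the snippet above.
train_test_id = {"train": [0, 1, 2, 3], "test": [4, 5]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_test_id.pickle")
    # HIGHEST_PROTOCOL is usually more compact, but older Python
    # versions may not be able to read the resulting file.
    with open(path, "wb") as f:
        pickle.dump(train_test_id, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object.
print(restored == train_test_id)
```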
3. Use h5py to save data
import h5py
import numpy as np
## the with block closes the file even if an error occurs
with h5py.File('test.h5', 'w') as hdf5_file:
    hdf5_file.create_dataset('some_keys', data=some_numpy_array, dtype=np.uint8)
4. Choose conda env in Jupyter Notebook
Install the following package with conda.
conda install nb_conda
5. Python 2 and Python3 code compatibility
from __future__ import absolute_import, division, print_function
6. Plot numpy data in grid
import matplotlib.pyplot as plt
## assume numpy_data has dimension: W, H, C
n_rows = 5
n_cols = 5
plt.figure(figsize=(n_cols * 4, n_rows * 4))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        ## stop when there are fewer channels than grid cells
        if index >= numpy_data.shape[2]:
            break
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(numpy_data[:, :, index])
        plt.axis('off')
        plt.title('Fig %s' % (index + 1))
plt.show()
7. Shutdown Jupyter Notebook
## first list the running notebook
jupyter notebook list
## then shutdown the jupyter notebook with the specific port
jupyter notebook stop 8888
8. Install Jupyter Notebook extensions
pip install jupyter_contrib_nbextensions
Numpy
1. Save numpy array
### without compression
np.save(file_name,array_name)
### with compression
np.savez_compressed(file_name, array_name1=array1, array_name2=array2)
### load saved array
np.load(file_name)
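When the file was written with np.savez_compressed, np.load returns a dictionary-like object keyed by the names given at save time, rather than a single array. A minimal sketch, with made-up arrays and a temporary file path:

```python
import os
import tempfile

import numpy as np

# Made-up arrays for illustration.
array1 = np.arange(6).reshape(2, 3)
array2 = np.ones(4)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "arrays.npz")
    np.savez_compressed(file_name, array_name1=array1, array_name2=array2)
    # The with block closes the underlying file when done.
    with np.load(file_name) as data:
        names = sorted(data.files)      # names used as keyword arguments above
        loaded1 = data["array_name1"]
        loaded2 = data["array_name2"]

print(names)
```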
Pandas
1. groupby excludes NA
df = pd.DataFrame({'col1': [1, 2, np.nan, 1, np.nan, 2], 'col2': [0, 0, 1, 2, 2, 2]})
df.groupby(['col1']).size()
By default, groupby excludes the rows where col1 is NaN. We can fill the NaN with the string "NA" so that they are counted as their own group.
df = pd.DataFrame({'col1': [1, 2, np.nan, 1, np.nan, 2], 'col2': [0, 0, 1, 2, 2, 2]})
df.fillna('NA').groupby(['col1']).size()
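As an alternative to filling the missing values, pandas 1.1 and later accept dropna=False in groupby, which keeps NaN as a group without changing the column's dtype. A small sketch using the same data, assuming such a pandas version is installed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, 1, np.nan, 2], 'col2': [0, 0, 1, 2, 2, 2]})
# dropna=False keeps the rows whose key is NaN as a separate group.
sizes = df.groupby('col1', dropna=False).size()
print(sizes)
```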
2. rename columns
We can replace all column names at once with
df.columns = ['a', 'b']
Or we can rename specific columns by
df.rename(columns = {'a':'b'}, inplace = True)
3. select data by dates
Make sure the date variable is in datetime format
df['date_var'] = pd.to_datetime(df['date_var'])
mask = (df['date_var'] >= '2006-01-01') & (df['date_var'] <= '2006-12-30')
df.loc[mask]
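The same selection can also be written with Series.between, which is inclusive on both ends by default. The data frame below is made up for illustration:

```python
import pandas as pd

# Made-up data; only the rows dated within the range should be kept.
df = pd.DataFrame({
    'date_var': ['2005-12-31', '2006-01-01', '2006-06-15', '2006-12-30', '2007-01-01'],
    'value': [1, 2, 3, 4, 5],
})
df['date_var'] = pd.to_datetime(df['date_var'])
selected = df[df['date_var'].between('2006-01-01', '2006-12-30')]
print(selected)
```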
4. show all columns
Change the pandas settings to show all columns.
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 500)
pd.set_option('display.max_rows', 100)
## note: 'display.height' has been removed from pandas; 'display.max_rows' controls how many rows are shown
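If the wider display is only needed for one print, pd.option_context changes the settings temporarily and restores them afterwards. A minimal sketch (passing None as the value removes the limit):

```python
import pandas as pd

default_max_columns = pd.get_option('display.max_columns')
with pd.option_context('display.max_columns', None, 'display.width', 500):
    # Inside the block every column is shown.
    inside = pd.get_option('display.max_columns')
# Outside the block the previous setting is back.
after = pd.get_option('display.max_columns')
print(inside, after)
```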
Keras
1. Check GPU
Sometimes we want to check whether the GPU has been recognized. The following code assumes TensorFlow as the backend.
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
2. Plot the fitting curve
## save the fitting history
history = model.fit(...)
## list all variables in history
print(history.history.keys())
## plot accuracy (newer Keras versions name the keys 'accuracy' and 'val_accuracy'; check the keys printed above)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
## plot loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
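Because the accuracy key differs between Keras versions, a small helper can pick whichever one is present. This is only a sketch: `pick_metric_key` is a hypothetical helper, not part of Keras, and `fake_history` is made-up data standing in for history.history.

```python
def pick_metric_key(history_dict, candidates=('accuracy', 'acc')):
    """Return the first candidate key that exists in history_dict."""
    for key in candidates:
        if key in history_dict:
            return key
    raise KeyError('none of %s found in %s' % (list(candidates), sorted(history_dict)))


# Made-up numbers standing in for history.history.
fake_history = {
    'loss': [0.9, 0.5], 'accuracy': [0.6, 0.8],
    'val_loss': [1.0, 0.7], 'val_accuracy': [0.5, 0.7],
}
train_key = pick_metric_key(fake_history)
val_key = 'val_' + train_key
print(train_key, val_key)
```

The two keys can then be used in place of the hard-coded names, for example plt.plot(history.history[train_key]).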
3. Preprocess images for pre-trained models
When using the pre-trained models in Keras, the images need to be preprocessed in the way each model expects. Keras provides a separate preprocess_input function for each pre-trained model, so import it from the module of the model that you want to use. For example, for inception_v3:
from keras.applications.inception_v3 import InceptionV3,preprocess_input
x = preprocess_input(x)
For VGG16, import preprocess_input from keras.applications.vgg16 instead.
from keras.applications.vgg16 import VGG16,preprocess_input
x = preprocess_input(x)
The Keras pre-trained models are listed in the Keras Applications documentation.
Pytorch
1. Use multiple GPUs
import torch
import torch.nn as nn
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ... # define your model here
if torch.cuda.device_count() > 1:
    print('--' * 10)
    print("Use", torch.cuda.device_count(), "GPUs")
    print('--' * 10)
    model = nn.DataParallel(model)
model.to(device)
2. Define the Dataset for dataloader
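from torch.utils.data import Dataset
class ImageDataset(Dataset):
    def __init__(self, some_parameters):
        super(ImageDataset, self).__init__()
    def __len__(self):
        return len_of_the_dataset
    def __getitem__(self, index):
        ## load and return the sample at position index
        return sample_at_index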
3. Change channel order
PyTorch uses BCHW (batch, channel, height, width) order instead of BHWC, which is the default in TensorFlow and Keras. Use the permute function to change channel-last order to channel-first order.
image.permute(0, 3, 1, 2)
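The axis order is easy to get wrong, so it can help to check it on shapes alone. The sketch below uses NumPy's transpose, which takes the same axis tuple as permute, as a stand-in that needs no PyTorch install; the batch shape is made up.

```python
import numpy as np

# Made-up batch: 8 images, 32 pixels high, 24 pixels wide, 3 channels.
images_bhwc = np.zeros((8, 32, 24, 3))
# Same axis order as image.permute(0, 3, 1, 2).
images_bchw = np.transpose(images_bhwc, (0, 3, 1, 2))
print(images_bchw.shape)  # (8, 3, 32, 24)
```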
4. Save and load Pytorch model
The following code saves and loads the whole model.
torch.save(model, 'model.pt')
model = torch.load('model.pt')
The following code saves and loads only the weights. The model must be constructed before the weights are loaded.
torch.save(model.state_dict(), file_path)
model.load_state_dict(torch.load(file_path))
5. Use tensorboard with Pytorch
Use tensorboard-pytorch. First install tensorboardX.
pip install tensorboardX
You may also need to install tensorflow and tensorboard. In the PyTorch code:
import torch
import torchvision
from tensorboardX import SummaryWriter
writer = SummaryWriter()
for step in range(100):
    ## train your model
    saved_images = torchvision.utils.make_grid(train_image, nrow=6)
    writer.add_image('Image', saved_images, step)
    ## be careful: pass Python numbers (e.g. loss.item()), not PyTorch tensors
    writer.add_scalar('loss', loss, step)
    writer.add_scalar('acc', acc, step)
    ## if you want to show train and test loss in the same figure
    writer.add_scalars('loss', {'train': train_loss,
                                'test': test_loss,
                                }, step)
    ## check the learned weights
    for name, param in model.named_parameters():
        writer.add_histogram(name, param.clone().cpu().data.numpy(), step)
writer.close()
See the tensorboardX documentation for examples of other outputs you can visualize with tensorboard. The outputs will be written into the runs folder.
Then launch tensorboard.
tensorboard --logdir runs
Go to localhost:6006
6. Change learning rate during training
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', factor=0.2, patience=5, verbose=True)
for epoch in range(10):
    ## train your model
    ## calculate val_loss
    scheduler.step(val_loss)
See the torch.optim.lr_scheduler documentation for details.
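To get a feel for what factor and patience do, here is a simplified pure-Python sketch of the reduce-on-plateau idea. It is not PyTorch's implementation: it ignores the threshold, cooldown, and min_lr options, and `reduce_on_plateau` is a hypothetical helper written only for this illustration.

```python
def reduce_on_plateau(val_losses, lr=0.1, factor=0.2, patience=5):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    best = float('inf')
    bad_epochs = 0
    history = []
    for val_loss in val_losses:
        if val_loss < best:
            # Improvement: remember it and reset the counter.
            best = val_loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # More than `patience` epochs without improvement: shrink the rate.
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses: two improving epochs, then six flat ones.
losses = [1.0, 0.8] + [0.9] * 6
rates = reduce_on_plateau(losses, lr=0.1, factor=0.2, patience=5)
print(rates)
```

With patience=5 the rate stays at 0.1 through five epochs without improvement and is multiplied by 0.2 on the sixth.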
7. Print model details (layer name, number of parameters, trainable)
for name, para in model.named_parameters():
    print(name, para.numel(), para.requires_grad)
8. Match weight names
A great headache for me when using PyTorch is weight name mismatch. This occurs when I train a model on multiple GPUs and then use the trained model on a single GPU: nn.DataParallel wraps the model, so every key in its state_dict gets a 'module.' prefix. I ran into the densenet source code and found the following code, which shows how the keys of a state_dict can be renamed before loading.
def densenet121(pretrained=False, **kwargs):
    r"""Densenet-121 model from
    `"Densely Connected Convolutional Networks" <https://arxiv.org/pdf/1608.06993.pdf>`_
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 24, 16),
                     **kwargs)
    if pretrained:
        # '.'s are no longer allowed in module names, but pervious _DenseLayer
        # has keys 'norm.1', 'relu.1', 'conv.1', 'norm.2', 'relu.2', 'conv.2'.
        # They are also in the checkpoints in model_urls. This pattern is used
        # to find such keys.
        pattern = re.compile(
            r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')
        state_dict = model_zoo.load_url(model_urls['densenet121'])
        for key in list(state_dict.keys()):
            res = pattern.match(key)
            if res:
                new_key = res.group(1) + res.group(2)
                state_dict[new_key] = state_dict[key]
                del state_dict[key]
        model.load_state_dict(state_dict)
    return model
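The same renaming idea applies to the multi-GPU case: a model wrapped in nn.DataParallel saves its weights with a 'module.' prefix on every key. A minimal sketch of stripping that prefix follows. `strip_module_prefix` is a hypothetical helper, and plain integers stand in for the tensors so the example runs without PyTorch.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with the leading prefix removed from each key."""
    new_state_dict = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        new_state_dict[new_key] = value
    return new_state_dict


# Placeholder values; in practice these would be tensors from torch.load(file_path).
multi_gpu_state = OrderedDict([('module.conv1.weight', 1), ('module.fc.bias', 2)])
single_gpu_state = strip_module_prefix(multi_gpu_state)
print(list(single_gpu_state.keys()))
```

The result can then be passed to model.load_state_dict on the single-GPU model. Another option is to avoid the prefix when saving, by calling torch.save(model.module.state_dict(), file_path) on the wrapped model.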
Tensorflow
1. Unit test in tensorflow
import tensorflow as tf
class EvalUtilTest(tf.test.TestCase):
    def test_function(self):
        some_code_here
if __name__ == '__main__':
    tf.test.main()
2. Set the verbosity level
import tensorflow as tf
## in TensorFlow 2.x, tf.logging is available as tf.compat.v1.logging
tf.logging.set_verbosity(tf.logging.DEBUG)
#tf.logging.set_verbosity(tf.logging.ERROR)
#tf.logging.set_verbosity(tf.logging.INFO)
#tf.logging.set_verbosity(tf.logging.FATAL)
#tf.logging.set_verbosity(tf.logging.WARN)
3. Use tensorboard in notebook
## newer TensorBoard versions use: %load_ext tensorboard
%load_ext tensorboard.notebook
4. Install cuda and cudnn by conda
## if you get the following error, try installing cuda and cudnn
## ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory
conda install cudatoolkit
conda install cudnn
Docker
1. Remove unused containers
## besides stopped containers, this also removes unused networks, dangling images, and build cache
docker system prune
## remove a specific container
docker container ls -a
docker container rm xxxxxxx
## remove all stopped containers
docker container ls -a --filter status=exited --filter status=created
docker container prune
## remove a running container
docker rm --force container_name
Other
1. Use R in Jupyter Notebook
Install some R packages in R console first.
install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec()
### if you want to install other R packages
### make sure to add the new package to the anaconda R library path
##install.packages("newpackage", "/home/user/anaconda3/lib/R/library")
Use conda to install some necessary R packages in your Python env.
### if you want to run R in Jupyter Notebook
conda install -c r r-essentials
### if you want to run both R and Python in the same notebook
conda install -c r rpy2
Run both R and Python code in the same notebook. Note that the Python code and R code need to be in separate cells.
%load_ext rpy2.ipython
%R require(ggplot2)
import pandas as pd
df = pd.DataFrame({'group': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
                   'A': [1, 2, 5, 5, 1, 6, 7, 5, 9],
                   'B': [0, 2, 3, 6, 7, 6, 5, 9, 13],
                   'C': [3, 2, 3, 1, 3, 3, 4, 2, 1]})
%%R -i df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))
I found that there are also other magic functions that can run R code within the notebook. See the rpy2 documentation for details.
2. Profile code
I found that the %prun and %lprun functions in the Jupyter notebook are quite useful for profiling code and making it more efficient. First, install the following package.
pip install line-profiler
Then, in the notebook, load the package
import line_profiler
%load_ext line_profiler
Assume we want to profile the function called function_to_be_profile
%lprun -f function_to_be_profile function_to_be_profile(para)
The above code will generate a line-by-line profiling result. You can find which part of your code is the bottleneck.
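Outside a notebook, the standard-library cProfile module gives function-level timings of the kind %prun reports (it is not line-by-line like %lprun). A minimal sketch, where the body of `function_to_be_profile` is a made-up workload:

```python
import cProfile
import io
import pstats


def function_to_be_profile(n):
    """Made-up workload: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = function_to_be_profile(100000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(5)
report = stream.getvalue()
print(report)
```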
3. Create env
conda create -n env_name python=3.6
4. Use SSH tunneling for Jupyter Notebook
ssh -f -N -L 1234:localhost:8788 user@server.com
1234 is the local port and 8788 is the remote port used by Jupyter Notebook. Now you can connect to Jupyter Notebook at localhost:1234 in the local browser.
5. Add SSH public keys to the remote server
ssh-copy-id -i ~/.ssh/id_rsa user@remote_host
6. Git add manually deleted files
git add -u .
7. Reload bashrc without logging out
source ~/.bashrc
8. unrar with wildcard
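Without quotes, the shell expands the wildcard before unrar sees it, so unrar treats the extra file names as files to extract from the first archive. Quoting the pattern lets unrar handle the wildcard itself. The following solution works for me.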
unrar x '*.rar'
9. update nvidia driver
## it's better to remove the older drivers
sudo apt-get purge 'nvidia*'
sudo add-apt-repository ppa:graphics-drivers
sudo ubuntu-drivers autoinstall
## change 340 to whatever version you want
sudo apt install nvidia-340
10. Verify Checksum
## SHA256 is the checksum file that lists the expected hashes and file names
sha256sum -c SHA256
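When sha256sum is not available, the same digest can be computed in Python with hashlib. This is a minimal sketch: `sha256_of_file` is a hypothetical helper, and the temporary file and its contents are made up for the demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.bin')
    with open(path, 'wb') as f:
        f.write(b'hello world\n')
    actual = sha256_of_file(path)

# Digest of the same bytes computed in memory, for comparison.
expected = hashlib.sha256(b'hello world\n').hexdigest()
print(actual == expected)
```

Compare the returned string with the hash published next to the download.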