Guide for Theoretical/Computational Chemists

What is computational chemistry?

tl;dr

Computational chemistry describes the use of computer modelling and simulation - including ab initio approaches based on quantum chemistry, and empirical approaches - to study the structures and properties of molecules and materials. Computational chemistry is also used to describe the computational techniques aimed at understanding the structure and properties of molecules and materials.

Research topics

  • Chemical reactions, kinetics
  • Photochemistry
  • Catalyst design
  • Potential energy surface
  • Vibrational and electronic spectroscopy
  • Solid-state chemistry
  • Drug discovery, biomolecular docking
  • Cheminformatics
  • Machine learning
  • Quantum computing

Community

Skills needed to be a theoretical/computational chemist

Core skills

  • Understanding of theoretical principles including kinetics, thermodynamics, and electronic structure
  • Various levels of programming, code development, and software architecture skills
  • Problem-solving skills and an interest in solving basic and applied research problems
  • Skills in adapting and integrating computer software to solve new categories of problems
  • Critical thinking for analyzing and interpreting computational results and statistical data
  • Googling and stackoverflowing 🙃

General technical skills

  • For Windows:
    • Install programs and modify system variables such as PATH
    • Install Nvidia CUDA toolkit and driver
    • Setup VPN and local network
    • Setup dual boost for Linux, or install WSL for Ubuntu
  • For Linux and macOS:
    • Basic/intermediate commands: ls, cd, cp, rm, ssh, scp, and many more
    • Know some important files/folders: .bashrc, .ssh
    • Understand some environmen variables: $PATH, $LD_LIBRARY_PATH
  • Scripting programming language
    • Bash, awk, perl, Python + Jupyter notebook
  • Cluster / HPC
    • Understand terminology: master node, compute node, scheduler, CPU cores, processes, memory management
    • Scheduler: Slurm, PBS, SGE
    • Software manager: module (avail, load, unload, switch)
  • Quantum chemistry software
    • Commercial: Gaussian, Q-Chem, ADF, MOLPRO, MOLCAS, TURBOMOLE and many more
    • Non-commercial: PySCF, Psi4, OpenMOLCAS, GAMESS, ORCA, NWChem, DIRAC, DALTON, CP2K, LAMMPS, VASP, Quantum Espresso and many more
    • Full list is here
  • Graphic visualization
    • JMol, Molden, Gaussview, Avogadro, UCSF Chimera, VMD, Ovito, PyMol
    • 2D and 3D plots
  • Other useful tools
    • ASE, MDTraj, Pymatgen, RDKit, OpenBabel
  • Writing
    • Microsoft Word
    • LaTeX
      • Compiler: pdflatex, xelatex, lualatex
      • Distribution: TeX Live, MikTeX
      • Editor: OverLeaf, TeXstudio, Texmaker
  • Presentation
    • Powerpoint
    • LaTeX (LuaLaTeX) Beamer

Essential skills for method development

  • Linear algebra
    • Vectors and matrices
      • Geenral properties: Complex conjugate, transpose and conjugate transpose
      • Diagonalization
      • Matrix multiplication (Dense and sparse)
    • Operators & commutators
    • Eigen-problem
      • Jacobi iteration
      • Eigenvalue & eigenvector
      • Singular value decomposition
    • Optimization algorithms
    • Numerical analysis
  • Calculus
    • Numerical methods
    • Differential equation & ODE
    • Vector calculus
  • Data fitting
    • Taylor expansion
    • Polynomial interpolation
    • Least squeare approximation
  • Finding roots
    • Bilinear interpolation
    • Newton-Raphson method
    • PDE
  • Quantum chemistry
    • Wavefunctions and molecular orbitals
      • Wavefunction and its properties, Hilbert space, linearity, Bra-Ket notation
      • Born-Oppenheimer approximation, Slater determinant, linear combination of atomic orbtials (LCAO)
      • Basis functions, basis sets (Gaussian-type orbitals, GTOs)
    • Ab initio (wavefunction-based) method
      • HF, MPn, CI, CC, MRCI, MSSCF, CASSCF, CASPT2
      • DMRG (matrix product states), FCIQMC
    • Density functional theory method
      • KS equation, exchange and correlation functionals
      • (Real-time) TDDFT
      • Gaussian and plane wave method (GPW, GAPW)
      • Pseudopotential
        • Effective core potential (ECP)
    • Semi-empirical
      • AM1, PM3, PM6
      • Tight-binding methods (e.g. DFTB, xTB)
    • Excited state, transition state, atomic/molecular bond
      • Adiabatic state, non-adiabatic state, Delta-SCF, constrained DFT
      • Surface hopping, quantum dynamics
    • Vibrational spectroscopy
      • IR, Raman
      • Linear response (first, second response)
      • Perturbation theory
    • Ab initio molecular dynamic (AIMD)
      • Car-Parrinello MD (CPMD)
      • Born-Oppenheimer MD (BOMD)
    • Other methods
      • QM/MM
      • Energy decomposition analysis
  • Molecular dynamics
    • Classical mechanics
    • Force field
    • Statistical mechanics
    • Enhanced sampling: Free energy, unbrella sampling, etc.
    • Monte Carlo method
  • Material simulation
    • Multiscale modeling
    • Coarse grained
    • Condense matter simulation
  • Programming (for mathematical proof)
    • Scripting language: Bash, Python
    • Intensive subroutine with OOP: C++, Fortran
    • Symbolic programming (Mathematica, SymPy)

Essential skills for software development

  • Code editors
    • Vi/Vim, Nano
    • VS Code, Atom, Eclipse, Sublime, Notepad++
  • File format
    • XML, JSON
  • General programming skills
    • Type of variables
    • Loops and conditional statement
    • Input/output
  • High-level programming
    • Python
      • Pip and conda: Python helper
      • NumPy: Array (vector, matrix) computation
      • Numba: JIT compiler for NumPy
      • Jax: autograd of NumPy array
      • SciPy: a collection of math functions/routines
      • Scikit-learn: statistics routines, optimization, curve fitting
        • Intel Scikit-learn is 10x faster than the standard one
      • Matplotlib / Plotly for plotting graph
      • Theano: numerical computation
      • SCOOP: distributed modules for parallel programming
      • NetworkX: Graph library
  • Low-level programming
    • C
      • Function, pointer, storage class
      • Enum, struct, union
      • Preprocessor
      • Operator, memory management, array
      • File handling
    • C++
      • C++ 11 or newer
      • Type of variable: signed, unsigned, long, double, etc.
      • Loops, conditional statement
      • Standard libraries: vector, rand
      • Understanding header (.hpp) and source file (.cpp or .cc)
      • Preprocessor (#if, #ifdef, #ifndef, #define, etc.)
      • Function, class, struct, template
      • Declaration
        • namespace, const, attribute, pointer, pass by reference, static_assert
      • Initialization
      • Misc: casting, lambda expression, encapsulation, file handling, exception handling
    • Fortran
      • Learn either F77 or F90 or modern fortran (2003, 2008, 2018)
      • Module, subroutine, function
      • Array (allocatable and multidimentional) and string
      • Operator overloading
      • Flow control
      • Derived type
      • Callback
      • Interfacing to other language e.g. Python or C++
    • GNU library
      • GSL
      • Many more libraries here
  • Memory allocation
    • Stack, heap, global memory
  • Math libraries
    • BLAS (OpenBLAS)
    • LAPACK for linear algebra
    • ScaLAPACK - a higher level LAPACK
    • Intel MKL (Intel oneAPI)
    • FFTW: for computing the discrete Fourier transform in one or more dimensions, real and complex data
    • Eigen: linear algebra library
    • Boost: a collection of C++ functions e.g. regex, serialization
  • QM libraries
    • libxc: XC function library
    • libint: For computing Gaussian integral
    • libcint: general GTO integrals
  • Code optimization
    • Benchaming/scaling
    • Complexity (Big O)
  • GNU
    • Static and dynamic libraries
    • Archive
    • Compiling (g++, gcc) and linking (ld)
    • Useful flags for compiler and linker e.g. -O2, -O3, -fPIC
  • Compilng tools
    • autoconf
    • configure
    • Make, cmake, automake
  • Debugging
    • gdb for general debugging
    • Valgrind for memory leak analysis
  • Git (source code control)
    • Basic/intermediate commands
    • GitHub & GitLab
  • Documentation
    • Sphinx (for markdown and reStructuredText)
    • Doxygen

Essential skills for HPC

  • Architecture
    • Memory management
    • Threading, multithreading
    • Block
  • Parallel computing (SPMD)
    • Shared memory: OpenMP
    • Distributed memory: MPI
      • Implementations: OpenMPI, Intel MPI, MVAPICH
  • Intel ecosystem
    • OpenMP compiler: icc, ifort
    • MPI compiler: mpicc, mpiicc (for Intel C compiler), mpicxx (for C++), mpiifort (for Fortran)
  • Cloud computing (bonus)
  • Server and database
  • Networking

Essential skills for coding GPU

  • Intermediate/advanced C or C++ skills
  • Programming model: Kernels, thread hierarchy, memory hierarchy, heterogeneous hierarchy, asynchronous SIMT
  • CUDA
    • Understand CUDA operation:
      1. Declare and allocate host and device memory.
      2. Initialize host data.
      3. Transfer data from the host to the device.
      4. Execute one or more kernels.
      5. Transfer results from the device to the host.
    • CUDA C and CUDA C++ API
    • Compiler: nvcc

Essential skills for machine learning (bonus)

  • Basic math: linear algebra and calculus
  • Programming
    • Python, R, Julia, Matlab
    • TensorFlow, PyTorch, Scikit-learn
  • Python lib
    • NumPy
    • Pandas
  • Terminology: regression, classification, descriptor, feature, kernel, activation function
  • Data analysis/engineering: EDA, ETL
  • Graphical representation
    • Histogram, bar plot, heatmaps
  • ML algorithms
    • Decision tree
    • Random forest
    • Support vector machine
    • Principal component analysis
    • Kernel-ridge method
    • Neural network
      • Feedforward NN
        • Autoencoder
      • CNN
      • RNN (LSTM)
      • GNN
      • Adversarial NN
        • GAN
  • Model training and optimization
    • Hyperparameter optimization
  • Techniques to prevent overfittingTechniques
    • Data augmentation, early stopping, regularization, dropout, batch normalization
  • Deploying model

Essential skills for machine learning chemistry (bonus)

  • Atomic and molecular representation
    • Structural-based: SMILES, one-hot encoding, 1D/2D fingerprint
    • Electronic-based:
      • Coulomb matrix, BoB
      • Sine matrix, Ewald sum matrix
      • Smooth Overlap of Atomic Positions (SOAP)
      • Symmetry and Gaussian functions, and many more
      • Many-body tensor representation
  • Configurational space, chemical space
  • Target prediction
    • Energy and force
    • Molecular properties:
      • (transition) dipole moment, polarizability
      • Electron transfer matrix element
  • Database
    • PubChem
    • GDB
    • DrugBank
    • QM: QM7, QM8, QM9