Statistics and Data Analysis

This repository serves as a comprehensive resource for various topics in statistics and data analysis. It is an integral part of the curriculum for the "Statistics and Data Analysis" course in the M.Sc. studies in Computer Science at Reichman University.

Repository Content

The repository houses four distinct Jupyter notebooks, each delving into specific areas of statistics and data analysis. The notebooks include:

1. Distributions (HW1_Distributions.ipynb)

This notebook provides an in-depth exploration of probability distributions. Key highlights include:

  • Discussion and implementation of both discrete and continuous probability distributions.
  • Detailed exploration of various distributions, including normal, binomial, and Poisson.
  • Examination of mixture distributions.
  • Practical implementation of the Expectation-Maximization (EM) algorithm for fitting mixture distributions.
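
As a rough illustration of the last point, the sketch below fits a two-component one-dimensional Gaussian mixture with EM using only NumPy. It is not taken from the notebook: the function name `fit_gmm_em`, the quartile-based initialization, the stopping rule, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def fit_gmm_em(x, n_iter=200, tol=1e-8):
    """Fit a two-component 1-D Gaussian mixture with EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Initialization (an assumption): quartiles as means, the overall
    # variance for both components, and equal mixing weights.
    mu = np.percentile(x, [25, 75])
    var = np.full(2, x.var())
    w = np.full(2, 0.5)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: weighted component densities and responsibilities.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # Log-likelihood under the parameters used in this E-step.
        ll = np.log(total).sum()
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    order = np.argsort(mu)  # report components sorted by mean
    return w[order], mu[order], var[order]


# Synthetic data: 60% from N(-2, 1) and 40% from N(3, 0.5^2).
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-2.0, 1.0, 600), rng.normal(3.0, 0.5, 400)])
weights, means, variances = fit_gmm_em(data)
print("weights:", weights, "means:", means, "variances:", variances)
```

On well-separated data like this, the estimates should land close to the generating parameters. Overlapping components or a poor initialization can lead EM to a different local optimum.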

2. Data Exploration and Visualization (HW2_Data_exploration_and_visialization.ipynb)

This notebook concentrates on the initial yet crucial stages of data analysis: data exploration and visualization. It covers:

  • Techniques for understanding the structure and composition of data.
  • Identification and handling of missing data and outliers.
  • Data transformation procedures.
  • A showcase of data visualization techniques such as histograms, box plots, scatter plots, and heatmaps.
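
The non-plotting steps above could be sketched as follows with pandas. The toy table, the median imputation, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with two missing values and one implausible height.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 175.0, 168.0, 250.0],
    "weight_kg": [65.0, 59.0, 72.0, np.nan, 80.0, 62.0, 70.0],
})

# Structure and composition: column types and missing values per column.
print(df.dtypes)
missing_per_column = df.isna().sum()
print(missing_per_column)

# One simple way to handle missing data: fill with the column median.
filled = df.fillna(df.median())


def iqr_outliers(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outlier_mask = filled.apply(iqr_outliers)
print(filled[outlier_mask.any(axis=1)])

# A common transformation for skewed, non-negative data: log(1 + x).
log_transformed = np.log1p(filled)
```

Median imputation and the IQR rule are simple defaults. Whether to drop, cap, or keep a flagged value depends on the data and on the question being asked.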

3. Correlations (HW3_Correlations.ipynb)

This notebook delves into the concept of correlations, covering:

  • Calculation and interpretation of Pearson's correlation coefficient and Spearman's rank correlation coefficient.
  • Discussion on the significance and application of these correlation measures.
  • Visualizations including correlation matrices and heatmaps.
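
To make the difference between the two coefficients concrete, here is a small sketch using `scipy.stats.pearsonr` and `scipy.stats.spearmanr` on made-up data. The relationship is monotonic but non-linear, so the rank-based Spearman coefficient is exactly 1 while Pearson's coefficient is lower.

```python
import numpy as np
from scipy import stats

# Made-up data: y grows monotonically with x, but not linearly.
x = np.arange(1, 11, dtype=float)
y = x ** 3

# Pearson measures linear association; Spearman measures monotonic
# association by correlating the ranks of the two variables.
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.2g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.2g})")
```

For a full table of pairwise coefficients, `pandas.DataFrame.corr(method="pearson")` or `method="spearman"` returns a correlation matrix that can be passed to a heatmap function.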

4. Differential Gene Expression in Acute Myocardial Infarction (HW4_Differential_Gene_Expression_in_Acute_Myocardial_Infraction.ipynb)

This notebook presents a detailed case study on analyzing gene expression data in the context of acute myocardial infarction. The notebook includes:

  • Data processing procedures: data loading, handling of missing values, and data preparation for subsequent analysis.
  • High-level data analysis: An overview of the dataset, including key statistics such as the number of genes profiled, the total number of samples, and the distribution of samples across classes.
  • Gene expression visualization: Twenty genes are selected at random, and their expression levels in the two classes are compared visually using box plots.
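
The overview and box-plot steps might look roughly like the sketch below. It runs on a synthetic expression matrix rather than the course dataset, and the class names (`control`, `AMI`), matrix dimensions, and plot layout are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix (rows: genes, columns: samples).
n_genes, n_per_class = 200, 15
genes = [f"gene_{i}" for i in range(n_genes)]
samples = [f"s{i}" for i in range(2 * n_per_class)]
expr = pd.DataFrame(
    rng.normal(8.0, 1.0, size=(n_genes, 2 * n_per_class)),
    index=genes,
    columns=samples,
)
labels = pd.Series(["control"] * n_per_class + ["AMI"] * n_per_class, index=samples)

# High-level overview: genes profiled, total samples, samples per class.
n_profiled, n_samples = expr.shape
class_counts = labels.value_counts()
print(n_profiled, n_samples)
print(class_counts)

# Pick 20 genes at random, without replacement.
chosen = rng.choice(expr.index.to_numpy(), size=20, replace=False)

# One panel per gene, with one box per class.
fig, axes = plt.subplots(4, 5, figsize=(15, 10), sharey=True)
for ax, gene in zip(axes.ravel(), chosen):
    groups = [
        expr.loc[gene, labels[labels == cls].index].to_numpy()
        for cls in ("control", "AMI")
    ]
    ax.boxplot(groups)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["control", "AMI"])
    ax.set_title(gene, fontsize=9)
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. In a script, call `plt.show()` or `fig.savefig(...)` afterwards.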

Usage

To run the notebooks, make sure Jupyter Notebook is installed on your system. Each notebook is self-contained: it includes the instructions it needs and can be run independently of the others, provided the libraries listed under Dependencies are installed.

Dependencies

The notebooks use the following libraries, which should be installed in your Python environment:

  • NumPy
  • pandas
  • Matplotlib
  • seaborn
  • SciPy
  • statsmodels
  • scikit-learn
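
One way to check whether these libraries are available before opening a notebook is a short script like the one below. It is only a convenience sketch: the `REQUIRED` mapping is written for this example, pairing each library with the module name it is imported under.

```python
import importlib.util

# Display name -> module name used in `import` statements.
REQUIRED = {
    "NumPy": "numpy",
    "pandas": "pandas",
    "Matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "SciPy": "scipy",
    "statsmodels": "statsmodels",
    "scikit-learn": "sklearn",
}

# find_spec returns None when a top-level module cannot be located.
missing = [
    name for name, module in REQUIRED.items()
    if importlib.util.find_spec(module) is None
]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Any library reported as missing can then be installed with the package manager used for your Python environment.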

Contributing

This repository was created by Lior-Baruch as part of an academic project. Contributions, feature requests, and issues are always welcome!

License

This project is licensed under the terms of the MIT License.