Algorithms and Big Data in Chemistry and Materials

The aim of this project is to investigate the data processing procedure, which is an essential part of every data-driven approach. The project covers the main steps of data processing: data mining, data curation, data visualization and statistics, clustering, and feature engineering.

Data mining and steps in data mining

Data mining is the process of discovering hidden patterns and relationships in large datasets using advanced statistical and computational techniques. The process involves several steps:

  1. Data Preparation: This stage involves collecting, cleaning, and preparing the data for analysis. This includes tasks such as data integration, data cleaning, and data transformation. In practice this means removing duplicate records, filling in missing values, correcting wrong data types, analyzing value distributions, and identifying outliers.

  2. Data Exploration: This stage involves exploring and analyzing the data to gain insights and identify patterns. This includes tasks such as data visualization and statistical analysis. To choose a suitable visualization method, you can use Andrew Abela's chart-selection diagram.

  3. Data Preprocessing: This stage involves preparing the data for modeling. This includes feature engineering tasks: feature creation, encoding, extraction, selection, transformation, and aggregation. After preprocessing, the data is usually normalized or standardized so that it can be used in model building.

  4. Model Building: This stage involves building and training predictive models using the prepared data. This includes tasks such as choosing a suitable algorithm, defining model parameters, and training the model.

  5. Model Evaluation: This stage involves evaluating the performance of the trained model. This includes tasks such as testing the model on new data and measuring its accuracy, precision, recall, and other metrics.

  6. Deployment: This stage involves deploying the model into a production environment. This includes tasks such as integrating the model into an application or system and making it available for use.

  7. Maintenance: This stage involves monitoring the model's performance in the production environment and updating it as needed. This includes tasks such as tracking changes in the data and retraining the model periodically to ensure its accuracy and reliability.
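
A few of the steps above can be sketched in code. The following is a minimal illustration under stated assumptions, not a prescribed implementation: it uses only the Python standard library, and the sample records, the column name `melting_point`, and all function names are hypothetical. It covers duplicate removal, median imputation, and outlier detection with Tukey's IQR rule (step 1), z-score standardization (step 3), and accuracy, precision, and recall for binary labels (step 5).

```python
import statistics


def remove_duplicates(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def fill_missing(records, column):
    """Replace None in `column` with the median of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]}
            for r in records]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical records: one duplicate, one missing value, one outlier.
records = [
    {"compound": "A", "melting_point": 100.0},
    {"compound": "B", "melting_point": 102.0},
    {"compound": "B", "melting_point": 102.0},
    {"compound": "C", "melting_point": None},
    {"compound": "D", "melting_point": 98.0},
    {"compound": "E", "melting_point": 101.0},
    {"compound": "F", "melting_point": 500.0},
]

clean = fill_missing(remove_duplicates(records), "melting_point")
values = [r["melting_point"] for r in clean]
outliers = iqr_outliers(values)
scaled = standardize([v for v in values if v not in outliers])

print("records after cleaning:", len(clean))
print("outliers:", outliers)
print("standardized values:", [round(z, 3) for z in scaled])
print(classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
```

Median imputation and the IQR rule are used here because they are simple and robust to extreme values; in a real project the choice of imputation and outlier criteria depends on the dataset and would normally be made with a library such as pandas or scikit-learn.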