A key focus of astronomical research is finding Earth-like exoplanets in habitable zones, areas around stars where conditions are just right for liquid water to exist. The search for exoplanets, planets beyond our solar system, is driven by questions about their existence, diversity, and potential for life.
In this notebook, we use data science and machine learning techniques to predict which star systems may host exoplanets, using light-curve data (light intensity over time) derived from observations made by the NASA Kepler space telescope.
- Background
- Exploratory Data Analysis
- Data Preprocessing
- Summary of Machine Learning Model Evaluation
- Observations
- Built Using
- Improvements
- Contributors
The Kepler dataset, publicly available from NASA, includes flux readings from over 3000 stars, each labelled as either hosting an exoplanet or not. We will be analysing this data from the Kepler mission to identify stars that potentially host exoplanets.
Planets themselves do not emit light, but the stars that they orbit do. If such a star is observed over several months or years, there may be a regular 'dimming' of the flux (the light intensity). This is evidence that there may be an orbiting body around the star; such a star could be considered a 'candidate' system. Further study of a candidate system, for example by a satellite that captures light at a different wavelength, could strengthen the evidence to the point where the candidate can be 'confirmed'.
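To make the dimming idea concrete, here is a minimal sketch that simulates a light curve with periodic, box-shaped dips. It is not taken from the notebook: the function name, the period, the transit duration, and the depth are all hypothetical values chosen for illustration.

```python
import numpy as np

def simulate_transit_flux(n_points=1000, period=100, duration=5,
                          depth=0.01, noise=0.0, seed=0):
    """Synthetic light curve: baseline flux of 1.0 with periodic dips.

    All parameter values are illustrative assumptions, not Kepler values.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_points)
    flux = np.ones(n_points)
    # The planet is in front of the star during the first `duration`
    # samples of every orbital period.
    in_transit = (t % period) < duration
    flux[in_transit] -= depth
    # Optional Gaussian measurement noise.
    flux += rng.normal(0.0, noise, n_points)
    return t, flux, in_transit

t, flux, in_transit = simulate_transit_flux()
print("samples in transit:", int(in_transit.sum()))
print("mean flux in transit:", flux[in_transit].mean())
print("mean flux out of transit:", flux[~in_transit].mean())
```

A classifier trained on real flux readings is, in effect, looking for this kind of repeating dip among noise and instrument artefacts.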
The classes are heavily imbalanced: 99.3% of the stars are labelled as not hosting an exoplanet, while only 0.7% are labelled as hosting one.
We see clear periodic motion. The plots still show anomalies from detection errors, but periodic motion is evident in all of them. Even star 35 shows periodic motion, albeit with a smaller amplitude. This periodicity arises because a planet passes in front of the star at regular intervals, decreasing the flux received.
Boxplots were created for different FLUX features against the 'LABEL' column to identify outliers. These plots help to visually detect outliers as points that lie beyond the whiskers of the boxplot.
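The whisker rule behind a boxplot can be sketched numerically. The example below assumes the conventional 1.5 × IQR whisker length (the default in Matplotlib and Seaborn boxplots); the helper name and the flux values are made up for illustration.

```python
import numpy as np

def whisker_outliers(values, k=1.5):
    """Flag points that fall beyond the boxplot whiskers.

    Assumes whiskers at Q1 - k*IQR and Q3 + k*IQR, with k = 1.5.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical flux readings with one extreme value.
flux_1 = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0])
mask = whisker_outliers(flux_1)
print(flux_1[mask])  # values a boxplot would draw as outlier points
```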
- A heatmap was generated using Seaborn to visualize any missing values in the dataset.
- The heatmap indicated that there are no missing values in the dataset.
- Data points where 'FLUX.1' exceeded 250,000 were considered outliers and removed from the dataset.
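The missing-value check and the outlier filter described above might look roughly like this in pandas. The column names `LABEL` and `FLUX.1` and the 250,000 threshold come from the text; the toy values and the extra `FLUX.2` column are invented for the example.

```python
import pandas as pd

# Toy stand-in for the Kepler table (values are made up).
df = pd.DataFrame({
    "LABEL": [0, 0, 1, 0],
    "FLUX.1": [93.85, 300000.0, -38.88, 532.64],
    "FLUX.2": [83.81, 250500.0, -33.83, 535.92],
})

# Total count of missing cells; 0 means there is nothing to impute.
n_missing = int(df.isnull().sum().sum())
print("missing values:", n_missing)

# Drop rows whose first flux reading exceeds the threshold.
cleaned = df[df["FLUX.1"] <= 250_000].reset_index(drop=True)
print("rows before:", len(df), "rows after:", len(cleaned))
```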
- SMOTE was employed to address the class imbalance issue in the dataset.
- This technique generates synthetic samples for the minority class, resulting in an equal number of observations for each class.
- After applying SMOTE, both classes (LABEL 1 and LABEL 0) had an equal count of 5049 samples.
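The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain NumPy. This is a simplified illustration, not the implementation used in the notebook (the tools list suggests Imblearn, where the usual call is `SMOTE().fit_resample(X, y)`); the function name and the toy data are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows from minority-class rows.

    Simplified SMOTE-style sketch: each synthetic row lies on the line
    segment between a random minority row and one of its k nearest
    minority neighbours. Requires at least two minority rows.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy example: 4 minority rows, 10 majority rows (counts are made up).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
n_majority = 10
synthetic = smote_like_oversample(X_minority, n_new=n_majority - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])
print(X_minority_balanced.shape)  # (10, 2)
```

Because every synthetic row is derived from existing minority rows, oversampled data is usually kept out of the evaluation set so that test scores reflect genuinely unseen stars.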
- The normalization process was applied to the input features to scale the values to a common range without distorting differences in value ranges.
- This step is crucial for many algorithms to perform optimally.
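The text does not say which normalization was used. One common choice, shown here purely as an assumption, is `sklearn.preprocessing.normalize`, which rescales each row (each star's light curve) to unit L2 norm so that stars of different overall brightness become comparable.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical light curves with the same shape but different brightness.
X = np.array([[3.0, 4.0],
              [30.0, 40.0]])

# Default settings: L2 norm, applied independently to each row.
X_norm = normalize(X)
print(X_norm)
```

After this step both rows are identical, which illustrates that only the relative shape of each curve is kept.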
- Gaussian filters were used to smooth the input features.
- A Gaussian filter replaces each value with a weighted average of its neighbours, with weights following the Gaussian (normal) curve, which suppresses high-frequency noise in the flux readings.
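As a rough illustration of Gaussian smoothing, the sketch below applies `scipy.ndimage.gaussian_filter1d` to a noisy synthetic signal. The signal, the noise level, and `sigma=5` are assumptions for the example; the text does not state the filter width used in the notebook.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

# Synthetic periodic signal plus measurement noise (illustrative only).
t = np.linspace(0, 4 * np.pi, 400)
clean = np.sin(t)
noisy = clean + rng.normal(0.0, 0.3, t.size)

# sigma is the kernel width in samples; larger values smooth more.
smoothed = gaussian_filter1d(noisy, sigma=5)

print("MSE before smoothing:", np.mean((noisy - clean) ** 2))
print("MSE after smoothing: ", np.mean((smoothed - clean) ** 2))
```

Too large a `sigma` would also flatten short transit dips, so the width is a trade-off between noise suppression and signal preservation.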
- Feature scaling was performed using the StandardScaler.
- This step ensures that all input features have comparable ranges, which is important for many machine-learning algorithms.
- PCA was applied to reduce the dimensionality of the dataset while retaining 90% of the variance.
- The number of components required to retain this variance was determined to be 23.
- Finally, PCA was re-applied with 23 components to transform the dataset.
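The scaling and PCA steps above can be sketched with scikit-learn as follows. The random data is a stand-in for the flux matrix, so the number of components found here will not be 23; only the 90% threshold comes from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in data: 200 rows, 20 correlated features (not Kepler data).
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA and find the smallest number of components whose
# cumulative explained variance reaches 90%.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print("components needed for 90% variance:", n_components)

# Re-apply PCA with that number of components.
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
print(X_reduced.shape)
```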
- The processed dataset was split into training and testing sets.
- The split was done with a test size of 33% and a specific random state for reproducibility.
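A split like the one described can be written with scikit-learn's `train_test_split`. The 33% test size comes from the text; the random state value is not given, so `42` below is only a placeholder, and the arrays are toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels standing in for the processed dataset.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,    # 33% of rows held out for testing
    random_state=42,   # placeholder seed; fixes the shuffle for reproducibility
)
print(len(X_train), len(X_test))
```

Passing the same `random_state` again yields the same partition, which is what makes the reported metrics repeatable.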
Criteria: Accuracy, Precision, Recall, and F1-Score.
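For reference, the four criteria can be computed from the counts of true and false positives and negatives. The sketch below uses made-up labels and a hypothetical helper name; it is not the evaluation code from the notebook.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 1 = exoplanet, 0 = no exoplanet.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_classification_metrics(y_true, y_pred)
print(metrics)
```

On heavily imbalanced data, accuracy alone can look high even when the minority class is missed, which is why precision, recall, and F1 are reported alongside it.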
- Accuracy: 99.94%
- Precision, Recall, F1-Score: 100% for both classes (0 and 1).
- Accuracy: 99.97%
- Precision, Recall, F1-Score: 100% for both classes.
- Accuracy: 99.97%
- Precision, Recall, F1-Score: 100% for both classes.
- Accuracy: 99.97%
- Precision, Recall, F1-Score: 100% for both classes.
- Accuracy: 100%
- Precision, Recall, F1-Score: 100% for both classes.
- Accuracy: 99.97%
- Precision, Recall, F1-Score: 100% for both classes.
- Accuracy: 99.94%
- Precision, Recall, F1-Score: 100% for both classes.
- High Performance: All models displayed exceptionally high accuracy and perfect precision, recall, and F1 scores.
- Consistency: There was a remarkable consistency across different types of models in terms of performance metrics.
- Languages: Python
- Tools: Jupyter Notebook
- Frameworks: Numpy, Pandas, Seaborn, Matplotlib, Plotly, Scipy, Sklearn, Imblearn, Keras, and XGBoost.
- Explore more complex models, feature engineering techniques, and larger datasets to further enhance the detection of exoplanets.
- Investigate possible overfitting: perfect prediction is rare in real-world scenarios, so the near-perfect scores reported above should be checked against unseen data.
- Abhishek Aggarwal
- Sakshi