/BZCAT-Clustering

Clustering the blazars from the BZCAT catalog based on machine learning algorithms

Primary LanguageJupyter Notebook

BZCAT Clustering

1. About the project

This is a repository for the cluster analysis of the blazars from the Roma-BZCAT catalog (Massaro et al., 2015, Ap&SS, 357, 75). The aim is to divide the objects into groups with more or less similar properties to further analyze the differences between them and possibly obtain some insighths into the nature of this type of active galactic nuclei (AGNs).

The feature space for the clustering has been choosen with a simple general approach: all possible features related to the physics of the objects. We discuss the selection of characteristics for the model dataset in the paper mentioned below.

It is worth noting that blazars is quite a homogenious class of AGNs, which has been some kind of a challenge for the project. The other problem is a deficiency of data: out of the 3561 blazars from the catalog, only about 800 objects had all the model dataset characteristics measured. To fill the missing data, the probabilistic PCA (pPCA) approach have been used. We analyze both the "short" version of the catalog and the total result with the pPCA-imputed values, comparing the results (about 90% consistency).

The general clustering workflow is as follows:

  • probabilistic PCA to guess the missing values;
  • PCA dimensionality reduction
  • k-means clustering

Some other algorithms have also been tested to look for the best metrics, but they are not included in the final Jupiter notebooks for clarity.

A paper with more detailed description of the dataset, clustering, and results is accepted in Research in Astronomy and Astrophysics. The preprint is also available on arXiv:astro-ph.

Citation:

@article{10.1088/1674-4527/ad3d14,
	author={Kudryavtsev, D. and Sotnikova, Yu. and Stolyarov, V. and Mufakharov, T. and Vlasyuk, V. and Khabibullina, M. and Mikhailov, A. and Cherepkova, Yu.},
	title={Cluster analysis of the Roma-BZCAT blazars},
	journal={Research in Astronomy and Astrophysics},
	url={http://iopscience.iop.org/article/10.1088/1674-4527/ad3d14},
	year={2024},
}

2. The data

The dataset has been combined from the following sources:

We do not provide here the initial data due to the limitations on the use of disk space. The final dataset with cluster labels will be available on CDS VizieR after the accepted paper is processed by the publisher.

3. Project files

  • The ./scrapers/ folder contains the scripts used to get the data from the PanSTARRS, WISE, GALEX, NED, and SDSS catalogs. A script for the Selenium web driver has been also developed to mine the data on spectral energy distributions, "hiding behind the buttons" on the SED Builder web page. Thanks to Selenium, we can now winkle it out. :)

  • ./aver_spectra.ipynb Construction of averaged SED spectra.

  • ./data_comb.ipynb A notebook that combines the data.

  • ./feature_engeneering.ipynb A general preprocessing with some transformations made and new features created. Most of them are of astronomical kind, and some are further used in the clustering feature space while the others are only for the sake of possible further analysis.

  • ./final_df.ipynb Preparation of the final dataset (.csv) for publication (available on CDS VizieR).

  • ./main_model.ipynb This is the core of the project with the clustering, metrics, and preliminary analysis.

  • ./requirements.txt The requirements.

  • sweetviz_report.ipynb makes a Sweetviz HTML report used for faster data reviewing and cleansing.

The raa_revision branch of the repository containes the version where we checked the influence of gamma ray characteristics according to the recommendations of the referee and got a ~80% similarity of the results. Although it would have been better to use gamma rays in the clustering, the data on them was scarse, which had reduced the dataset dramatically in the number of object (see details in the paper.)