This project focuses on downloading, processing, and analyzing abstracts of academic papers related to solar physics and machine learning using Natural Language Processing (NLP) techniques. The project is structured into different directories to keep data, code, and results organized.
.
├── README.md
├── data
│   └── solar_ml_abstracts.csv  # Contains the downloaded abstracts data
├── models
├── notebooks
│   └── Astrophysics_NLP_Sentiment_Analysis.ipynb  # Jupyter Notebook for analysis
├── requirements.txt  # List of dependencies
├── results
└── src
    ├── fetch_solar_articles.py  # Script to fetch articles from arXiv
    ├── fetch_solar_articles..py  # (Possible duplicate, remove if unnecessary)
    └── model_training.py  # Placeholder for model training script
Before running the project, ensure you have Python installed on your system. It's recommended to use a virtual environment to manage dependencies.
- Create and Activate a Virtual Environment:
python3 -m venv venv
source venv/bin/activate
- Install the Required Packages:
Install the necessary Python packages listed in requirements.txt:
pip install -r requirements.txt
If you haven't created a requirements.txt yet, you can generate one from your current environment by running:
pip freeze > requirements.txt
- Install Additional Dependencies:
If matplotlib or other packages are not installed, you can add them using:
pip install matplotlib
The script fetch_solar_articles.py downloads abstracts from arXiv related to solar physics and machine learning.
- Run the Fetch Script:
python3 src/fetch_solar_articles.py
This script will download up to 1000 articles and save the data in data/solar_ml_abstracts.csv.
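The contents of fetch_solar_articles.py are not shown here, so the following is only a minimal sketch of how such a fetch step could work, not the script itself. It assumes the public arXiv API endpoint and its Atom response format. The function names, the search query, and the title and summary column names are hypothetical. Only the published column is confirmed by the analysis steps in this README.

```python
import csv
import io
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by arXiv feeds


def build_query_url(search_query, start=0, max_results=100):
    """Build the URL for one page of arXiv API results."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract title, abstract, and publication date from an Atom feed."""
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.findall(ATOM + "entry"):
        rows.append({
            # Collapse the line breaks arXiv inserts into long fields.
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
            "summary": " ".join(entry.findtext(ATOM + "summary", default="").split()),
            "published": entry.findtext(ATOM + "published", default="").strip(),
        })
    return rows


def write_csv(rows, handle):
    """Write parsed rows to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["title", "summary", "published"])
    writer.writeheader()
    writer.writerows(rows)


# A tiny hand-written feed standing in for a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example title
      spanning two lines</title>
    <summary>  Example abstract text.  </summary>
    <published>2023-01-15T12:00:00Z</published>
  </entry>
</feed>"""

if __name__ == "__main__":
    # Hypothetical query; the real script may use different search terms.
    print(build_query_url('all:solar AND all:"machine learning"'))
    buffer = io.StringIO()
    write_csv(parse_feed(SAMPLE_FEED), buffer)
    print(buffer.getvalue())
```

To reach 1000 articles, a real script would request several pages by increasing start and would pause between requests. The response body from urllib.request.urlopen can be passed to parse_feed directly.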
- Start Jupyter Notebook:
Launch Jupyter Notebook to begin analyzing the downloaded data:
jupyter notebook
- Load the Data:
In your notebook (e.g., Astrophysics_NLP_Sentiment_Analysis.ipynb), load the CSV data:
import pandas as pd

# Load the data
data = pd.read_csv('../data/solar_ml_abstracts.csv')

# Display the first few rows of the dataset
data.head()
To visualize the distribution of articles over time:
- Ensure matplotlib is Installed:
If matplotlib is not installed, add it to your environment:
pip install matplotlib
- Plot the Data:
import matplotlib.pyplot as plt

# Convert the 'published' column to datetime format
data['published'] = pd.to_datetime(data['published'])

# Plot the distribution of articles over time
plt.figure(figsize=(10, 6))
data['published'].hist(bins=30)
plt.title('Distribution of Articles Over Time')
plt.xlabel('Publication Date')
plt.ylabel('Number of Articles')
plt.show()
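As a textual cross-check for the histogram, the per-year counts can also be tallied directly from the CSV. This is a small sketch using only the standard library. It assumes the published values start with a four-digit year, as ISO 8601 timestamps do. The function name count_by_year and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_year(handle, column="published"):
    """Count rows per publication year, skipping values without a leading year."""
    counts = Counter()
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        # Assumption: dates look like '2021-03-01...' (year first).
        if len(value) >= 4 and value[:4].isdigit():
            counts[int(value[:4])] += 1
    return dict(sorted(counts.items()))


# Illustrative sample data, not taken from the real dataset.
sample = io.StringIO(
    "title,published\n"
    "Paper A,2021-03-01T00:00:00Z\n"
    "Paper B,2021-07-15T00:00:00Z\n"
    "Paper C,2023-01-10T00:00:00Z\n"
)
print(count_by_year(sample))  # {2021: 2, 2023: 1}
```

For the real file, open data/solar_ml_abstracts.csv with newline="" and pass the file handle to count_by_year.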
If you encounter issues with importing packages or running scripts, consider the following:
- Verify the Python Environment:
Ensure that the packages are installed in the correct Python environment. You can check which Python environment is being used by running:
import sys
print(sys.executable)
- Restart Jupyter Kernel:
After installing new packages, restart the Jupyter kernel:
Kernel > Restart Kernel
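The environment check above can be extended to report which packages the current interpreter cannot find. This is a minimal sketch. The helper name missing_packages is hypothetical, and the package list is only an example based on the imports used in this README.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Interpreter:", sys.executable)
print("Missing:", missing_packages(["pandas", "matplotlib"]))
```

If a package is listed as missing even though pip reported a successful install, the notebook kernel is probably using a different interpreter than the one pip installed into.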
With the data loaded, you can proceed with various NLP tasks such as sentiment analysis, topic modeling, or word cloud generation. These analyses can be documented and expanded upon in your Jupyter Notebook.
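As one possible starting point, word frequencies give a rough view of the vocabulary and are the counts a word cloud is built from. The sketch below uses only the standard library. The function name top_words, the short stopword list, and the sample sentences are illustrative assumptions, not part of the project.

```python
import re
from collections import Counter

# A deliberately small stopword list; a real analysis would use a fuller one.
STOPWORDS = {
    "the", "of", "and", "a", "to", "in", "for", "we", "is",
    "on", "with", "that", "this", "are", "an", "by",
}


def top_words(texts, n=10):
    """Return the n most common words across texts, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up example sentences, not real abstracts.
abstracts = [
    "Solar flares and solar wind",
    "Machine learning for solar flares",
]
print(top_words(abstracts, n=2))  # [('solar', 3), ('flares', 2)]
```

In the notebook, the same function could be applied to the column holding the abstract text. The name of that column depends on how the fetch script writes the CSV.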