This Python script provides a comprehensive set of functions for preprocessing Twitter data. It is designed to clean and normalize tweet text, making it suitable for further analysis or machine learning tasks.
- Remove Unicode characters
- Strip hashtag symbols from words
- Remove @user mentions
- Remove URLs and links
- Remove emojis and emoticons
- Normalize punctuation (e.g., multiple exclamation marks)
- Remove stopwords
- Convert text to lowercase
- Remove short words (2-3 letters)
- Identify and extract location entities
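To make the steps listed above concrete, here is a minimal sketch of what such a cleaning pipeline could look like. The function name `clean_tweet`, the regular expressions, and the example tweet are illustrative only (not the script's actual implementation), and the snippet assumes the NLTK stopwords corpus has already been downloaded:

```python
# Illustrative sketch only, not the script's actual implementation.
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs and links
    text = re.sub(r"@\w+", "", text)                     # remove @user mentions
    text = re.sub(r"#", "", text)                        # strip hashtag symbols
    text = text.encode("ascii", "ignore").decode()       # drop non-ASCII characters (emojis, other Unicode)
    text = re.sub(r"([!?.])\1+", r"\1", text)            # collapse repeated punctuation (e.g. "!!!")
    text = text.lower()                                   # lowercase
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    words = [w for w in words if len(w) > 3]              # drop words of 3 letters or fewer
    return " ".join(words)

print(clean_tweet("Flooding in #Chennai!!! Please RT @user https://example.com"))
```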
- NLTK
- spaCy
- locationtagger
- Clone the repository:
  git clone https://github.com/NGswati/Disaster_Automated_Form_Filling.git
  cd Disaster_Automated_Form_Filling
  unzip "zip se project.zip"
- Install the required dependencies:
  pip install nltk spacy locationtagger
- Download the necessary NLTK data:
  import nltk
  nltk.download(['punkt', 'stopwords', 'maxent_ne_chunker', 'words', 'treebank', 'maxent_treebank_pos_tagger', 'averaged_perceptron_tagger'])
- Download the spaCy model:
  python -m spacy download en_core_web_sm
- Place your input CSV file in the same directory as the script and update the input and output file names in the script:
  file_ = open("your_input_file.csv", "r", encoding="utf8", errors='replace').read()
  with open("your_output_file.csv", "w", encoding="utf8", errors='replace') as file:
- Run the script:
  python tweet_preprocessing.py
The processed tweets will be saved in the output file specified.
You can customize the preprocessing steps by modifying the processing_functions list in the script. Add or remove functions as needed for your specific use case.
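For example, a processing_functions list might be applied in sequence as in the sketch below. The helper names here are placeholders for illustration, not necessarily the names used in the script:

```python
# Illustrative sketch of applying a configurable list of preprocessing steps in order.
import re

def remove_urls(text):
    return re.sub(r"https?://\S+", "", text)

def to_lowercase(text):
    return text.lower()

# Add or remove steps here to customize the pipeline for your use case.
processing_functions = [remove_urls, to_lowercase]

def preprocess(text):
    for func in processing_functions:
        text = func(text)
    return text.strip()

print(preprocess("Check THIS out https://example.com"))  # prints: check this out
```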
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
A comprehensive machine learning project for classifying disaster-related tweets using various algorithms.
- Data preprocessing and TF-IDF vectorization
- Implementation of multiple classification algorithms
- Comparison of model performances
- pandas
- scikit-learn
- nltk
- tensorflow
- keras
- lightgbm
- xgboost
- Clone the repository:
  git clone https://github.com/NGswati/Disaster_Automated_Form_Filling.git
  cd Disaster_Automated_Form_Filling
  The models are implemented in Classification_models.ipynb.
- Install the required dependencies: pip install -r requirements.txt
Update the csv_file_path variable in the script with the path to your dataset, then run the script:
  python disaster_classification.py
The script expects a CSV file with the following columns:
- tweets: contains the tweet text
- category: contains the category label for each tweet
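To sanity-check an input file against this format, something like the following can be used (the file name is a placeholder; the column names are those listed above):

```python
import pandas as pd

# Placeholder file name; point this at your own dataset.
df = pd.read_csv("disaster_tweets.csv")

# The script expects these two columns.
missing = {"tweets", "category"} - set(df.columns)
if missing:
    raise ValueError(f"CSV is missing columns: {missing}")
print(df[["tweets", "category"]].head())
```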
The following models are implemented and compared:
- Decision Tree
- Random Forest
- SGD Classifier
- Naive Bayes (Multinomial)
- Logistic Regression
- Deep Neural Network (DNN)
- Long Short-Term Memory (LSTM)
- AdaBoost
- LightGBM
- XGBoost
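A minimal sketch of how a few of the scikit-learn models listed above might be trained and compared on TF-IDF features is shown below. This is an illustration only, not the project's exact pipeline; the CSV path is a placeholder:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Placeholder path; the script reads the dataset from csv_file_path.
df = pd.read_csv("disaster_tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["tweets"], df["category"], test_size=0.2, random_state=42)

# TF-IDF vectorization of the tweet text.
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes (Multinomial)": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train each model and report accuracy plus a classification report.
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(f"Model: {name}")
    print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")
    print(classification_report(y_test, preds))
```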
The script outputs accuracy scores and classification reports for each model. Example output:
  Model: Decision Tree
  Accuracy: 0.85
  ...
  Model: LSTM
  Accuracy: 0.92
  ...
Contributions are welcome! Please feel free to submit a Pull Request.
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This Python script implements a question answering system for disaster assessment, designed to process tweets and answer questions based on the tweet content.
- Parses tweets from a file
- Processes questions and extracts relevant information
- Uses natural language processing techniques to analyze text
- Implements various rules for different question types (what, when, where, why)
- Scores and ranks potential answers
- Outputs the best answer for each question
- nltk
- spacy
- string
- Clone the repository:
  git clone https://github.com/NGswati/Disaster_Automated_Form_Filling.git
  cd Disaster_Automated_Form_Filling
  unzip Disaster_Automated_Form_Filling.zip
- Install the required dependencies:
  pip install nltk spacy
- Download the necessary NLTK and spaCy data:
  import nltk
  nltk.download('punkt')
  nltk.download('averaged_perceptron_tagger')
  nltk.download('stopwords')
  import spacy
  spacy.cli.download("en_core_web_sm")
- Prepare the input files:
  - Tweet file: a CSV file containing the tweets
  - Question file: a CSV file containing the questions
  - Semantic class files: text files containing lists of names, locations, months, etc.
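As an example of what loading the semantic class files might look like, the sketch below reads each text file in a directory into a set of terms. The directory name and file layout are assumptions for illustration, not the exact behavior of the script's semantic_classes() function:

```python
import os

def load_semantic_classes(directory):
    """Load each .txt file in the directory into a set of lowercase terms."""
    classes = {}
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), encoding="utf8") as f:
                classes[filename[:-4]] = {line.strip().lower() for line in f if line.strip()}
    return classes

# e.g. classes = load_semantic_classes("semantic_classes/")
# classes["months"] might then contain {"january", "february", ...}
```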
Update the file paths in the main() function:
- input_path: Path to the directory containing input files
- semantic_classes(): Path to the directory containing semantic class files
- questions_file: Path to the questions CSV file
- tweet_dict: Path to the tweets CSV file
Run the script:
  python disaster_qa.py
The script will process each question and output the question ID along with the best answer found.
- parse_tweet(): Parses the tweet file
- cat_tweet(): Categorizes tweets
- extract_emphasized_phrases(): Extracts important phrases from questions
- AddTagPOS(): Adds part-of-speech tags to tweets
- semantic_classes(): Loads semantic class data
- when_rule(), what_rule(), why_rule(), where_rule(): Implement scoring rules for different question types
- wordMatch(): Calculates the word-matching score between a question and a potential answer
- data_forward(): Main processing function for questions and answers
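As an illustration of the scoring idea behind wordMatch() and the answer-ranking step, a simplified word-overlap scorer and best-answer selection could look like the sketch below. This is not the script's exact logic, and it assumes the NLTK punkt and stopwords data have been downloaded:

```python
# Simplified stand-in for wordMatch() and answer ranking, for illustration only.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def word_match(question, candidate):
    """Score a candidate answer by the number of non-stopword tokens it shares with the question."""
    q_tokens = {w.lower() for w in word_tokenize(question) if w.isalpha() and w.lower() not in STOPWORDS}
    c_tokens = {w.lower() for w in word_tokenize(candidate) if w.isalpha() and w.lower() not in STOPWORDS}
    return len(q_tokens & c_tokens)

def best_answer(question, tweets):
    """Return the tweet with the highest overlap score for the question."""
    return max(tweets, key=lambda t: word_match(question, t))

tweets = [
    "Heavy flooding reported in the city center this morning",
    "Volunteers needed at the shelter near the stadium",
]
print(best_answer("Where is flooding reported?", tweets))
```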
This script is designed for a specific format of input data and may require modifications to work with different data structures or sources.