Challenge Details:
Develop a text classification system to categorize movie reviews as positive or negative.
• Employ regular expressions to clean the text data and apply text normalization techniques.
• Construct N-grams from the text data. Discuss the choice of N-grams for feature extraction.
• Implement a Naive Bayes classifier to categorize the reviews. Optimize the classifier for better performance in sentiment analysis.
• Evaluate the classifier using metrics such as precision, recall, and F-measure. Perform cross-validation to assess the model’s robustness.
• Represent text data using vector semantics or word embeddings. Use cosine similarity to compare vectors and refine feature representation.
• Integrate neural networks or pretrained word embeddings into the model to enhance classification performance.
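As a starting point, the cleaning, N-gram, Naive Bayes, and evaluation tasks above could be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not a reference solution: the names clean, ngrams, features, NaiveBayes, and precision_recall_f1 are hypothetical, unigrams plus bigrams are an arbitrary choice, and the two held-out sentences are made up for the demonstration. Cross-validation is left out for brevity.

```python
import math
import re
from collections import Counter, defaultdict

def clean(text):
    """Lowercase, replace everything but letters/apostrophes with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z'\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text, max_n=2):
    """Unigrams up to max_n-grams of a cleaned review (max_n=2 is an assumption)."""
    tokens = clean(text).split()
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(features(text))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in features(text):
                if f in self.vocab:  # skip features never seen in training
                    score += math.log((counts[f] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

def precision_recall_f1(gold, predicted, positive="Positive"):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Training rows taken from the sample data in this brief.
train_texts = [
    "The movie was fantastic, I loved every minute of it!",
    "Terrible acting and a predictable plot, definitely not worth the watch.",
]
train_labels = ["Positive", "Negative"]
model = NaiveBayes().fit(train_texts, train_labels)

# Hypothetical held-out sentences, invented purely for illustration.
test_texts = ["I loved it, fantastic!", "Terrible plot, not worth it"]
test_labels = ["Positive", "Negative"]
predictions = [model.predict(t) for t in test_texts]
print(predictions, precision_recall_f1(test_labels, predictions))
```

With real data, scikit-learn's CountVectorizer (its ngram_range parameter) and MultinomialNB cover the same ground, and the smoothing parameter alpha is one natural knob for the optimization step.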
Programming Language: Python
• Libraries: NLTK, scikit-learn, TensorFlow/PyTorch (for neural networks), gensim (for word embeddings)
• Environment: Jupyter Notebooks
Sample Data for Training:
Review | Sentiment
The movie was fantastic, I loved every minute of it! | Positive
Terrible acting and a predictable plot, definitely not worth the watch. | Negative
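To illustrate the vector comparison task on these rows, the sketch below parses the pipe-delimited sample and computes the cosine similarity between the two reviews. It assumes simple word-count vectors as a stand-in for word embeddings (gensim vectors would replace bag_of_words in a fuller solution); parse_rows, bag_of_words, and cosine are hypothetical helper names.

```python
import math
import re
from collections import Counter

SAMPLE = """\
The movie was fantastic, I loved every minute of it! | Positive
Terrible acting and a predictable plot, definitely not worth the watch. | Negative
"""

def parse_rows(raw):
    """Split each 'review | sentiment' line on its last pipe."""
    rows = []
    for line in raw.strip().splitlines():
        review, sentiment = line.rsplit("|", 1)
        rows.append((review.strip(), sentiment.strip()))
    return rows

def bag_of_words(text):
    """Word-count vector; a stand-in for a real embedding (assumption)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

rows = parse_rows(SAMPLE)
vec_pos, vec_neg = (bag_of_words(review) for review, _ in rows)
similarity = cosine(vec_pos, vec_neg)
print(round(similarity, 3))  # the two reviews share only the word "the"
```

The low score reflects that count vectors only capture exact word overlap; dense embeddings are meant to place related words such as "fantastic" and "great" close together even when the surface forms differ.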