AI-vs-Human-Generated-Text

AIPD

AIPD (AI Persona Detector) is an open-source tool designed to detect whether a piece of text was generated by an AI model or written by a human.

Detecting Human vs AI Generated Text

Team Members

  • Shreyas Aswar
  • Hymavathi Gummudala
  • Hasaranga Jayathilake

Objective

In the recent past, the main concerns in academia were plagiarism and missing citations of original works. With recent technological advances, more people now use AI-generated text in their day-to-day tasks to save time on writing, for example when replying to emails. In academia, using AI-generated work to complete tasks has also become common (Ge & Wang, 2024). This makes it hard to verify the authenticity of ideas coming from the academic field, because it is a gray area whether a given work was written by a human or what share of it was produced with AI. As a result, more and more AI-detection mechanisms are being developed to identify AI-generated work.

Research Gap

From the literature, a research gap was identified in understanding how statistical linguistic features of language can be incorporated into a neural network framework (Ge & Wang, 2024). In addition, within natural language processing, text classification has not yet been researched as extensively as other areas, and most models still show low accuracy compared with other areas of NLP (Papers with Code, 2023).

Research Question and Goal

The research question, and the goal of this project, is how to combine statistical linguistic features with a neural network to build a model that distinguishes AI-generated text from human-written text.

Data Flow of the Project

  1. Data Collection:

    • Data was collected from Kaggle. The dataset consists of two columns: Text and Labels (0 for human, 1 for AI).
  2. Preprocessing:

    • Checked for missing values in the dataset.
    • Assessed the class balance of the dataset to confirm that it is not imbalanced.
  3. Data Splitting:

    • The dataset was split into training (80%) and testing (20%) sets.
  4. Tokenization and Padding:

    • Sentences were split into tokens.
    • All token sequences were padded to the same length.
  5. Feature Extraction:

    • Linguistic features such as average word length, text length, etc., were extracted from the text data.
  6. Model Building:

    • Two models were developed:
      • A CNN model for processing text data.
      • An MLP (Multi-Layer Perceptron) model for handling linguistic features.
    • The outputs of both models were concatenated.
  7. Additional Layer:

    • An additional feedforward layer with sigmoid activation was added.
  8. Model Compilation:

    • The model was compiled using the Adam optimizer and binary cross-entropy loss function.
  9. Model Training:

    • The model was trained on the training data for 10 epochs.
  10. Model Evaluation:

    • The model's performance was evaluated on the test data.
    • Achieved an accuracy of 99%.
  11. Testing on New Data:

    • A dataset containing AI and human-generated text was downloaded from GitHub.
    • Some samples of text were taken and tested with the trained model.
    • The model correctly predicted whether the text was AI-generated or human-generated.
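
Steps 4 and 5 above (tokenization, padding, and linguistic feature extraction) can be illustrated with a minimal pure-Python sketch. It is not the project's actual code: the function names, the regular expressions, the right-padding with zeros, and the features beyond average word length and text length (word count, average sentence length, type-token ratio) are assumptions chosen for illustration.

```python
import re


def linguistic_features(text):
    """Return a small, illustrative feature vector for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    # Guard against empty input so the ratios never divide by zero.
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    avg_sent_len = n_words / len(sentences) if sentences else 0.0
    type_token_ratio = len({w.lower() for w in words}) / n_words if n_words else 0.0
    return [float(len(text)), float(n_words), avg_word_len, avg_sent_len, type_token_ratio]


def build_vocab(texts):
    """Map each lowercase word to an integer id; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode_and_pad(text, vocab, max_len):
    """Convert text to token ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab[w] for w in re.findall(r"[a-z']+", text.lower()) if w in vocab]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))
```

In this sketch, `encode_and_pad` produces the fixed-length input for the CNN branch, while `linguistic_features` produces the input for the MLP branch. Words that are missing from the vocabulary are simply dropped, which is a simplification.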

Technologies Used

Technology          Purpose
Python              Programming language for writing the entire project
pandas              Data manipulation and analysis
sklearn             Machine learning model training and testing
TensorFlow/Keras    Building and training deep learning models
numpy               Numerical computing and feature engineering
spaCy               Natural language processing for POS tagging
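
As an illustration of how TensorFlow/Keras can express the two-branch architecture from the data flow (a CNN over padded tokens, an MLP over linguistic features, concatenation, and a final sigmoid layer compiled with Adam and binary cross-entropy), here is a minimal sketch. The function name `build_hybrid_model`, the layer sizes, and the default values of `vocab_size`, `max_len`, and `n_features` are assumptions, not the project's actual settings.

```python
import numpy as np
from tensorflow.keras import layers, Model


def build_hybrid_model(vocab_size=20000, max_len=200, n_features=5):
    """Build a CNN + MLP model with a shared sigmoid output (illustrative sizes)."""
    # Branch 1: CNN over padded token ids.
    token_input = layers.Input(shape=(max_len,), name="tokens")
    x = layers.Embedding(vocab_size, 64)(token_input)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)

    # Branch 2: MLP over the linguistic feature vector.
    feature_input = layers.Input(shape=(n_features,), name="features")
    f = layers.Dense(32, activation="relu")(feature_input)

    # Concatenate both branches and add the final sigmoid layer.
    merged = layers.Concatenate()([x, f])
    output = layers.Dense(1, activation="sigmoid")(merged)

    model = Model(inputs=[token_input, feature_input], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


if __name__ == "__main__":
    # Smoke test on dummy inputs: two samples, 20 tokens each, 5 features each.
    model = build_hybrid_model(vocab_size=100, max_len=20)
    dummy_tokens = np.zeros((2, 20), dtype="int32")
    dummy_features = np.zeros((2, 5), dtype="float32")
    print(model.predict([dummy_tokens, dummy_features], verbose=0).shape)
```

With real data, training for 10 epochs as described in the data flow would look roughly like `model.fit([X_tokens, X_features], y, epochs=10)`, where `X_tokens`, `X_features`, and `y` are hypothetical arrays holding the padded token ids, the linguistic features, and the 0/1 labels.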

Limitation of Research

  • The main limitation was the limited time available to complete the project.

Future Research Direction

  • Future research could aim to improve model accuracy by integrating deeper linguistic analysis and expanding the POS tagging features, using advanced NLP libraries such as spaCy.

References

Papers with Code. (2023). Leaderboard: Text Classification on MTEB. Retrieved March 24, 2024, from https://paperswithcode.com/sota/text-classification-on-mteb

Ge, Z., & Wang, Y. (2024). GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method. arXiv preprint arXiv:2403.07321. Retrieved March 23, 2024, from https://arxiv.org/abs/2403.07321