Primary language: Jupyter Notebook. License: MIT.

Text Mining with TF-IDF & Cosine Similarity

A simple Python repository for perceptron-based text mining. It covers linguistic preprocessing of a dataset for text classification, and retrieval of the texts most similar to a given query.
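To make the retrieval idea concrete, here is a minimal sketch of TF-IDF weighting followed by cosine-similarity ranking, written in plain Python. It is not the repository's code: the helper names (`tokenize`, `build_idf`, `tfidf_vector`, `most_similar`), the whitespace tokenizer, the smoothed IDF formula, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def build_idf(docs_tokens):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, corpus, top_k=1):
    """Return the top_k (score, document) pairs, best match first."""
    docs_tokens = [tokenize(d) for d in corpus]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(t, idf) for t in docs_tokens]
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine_similarity(query_vec, dv), doc)
              for dv, doc in zip(doc_vecs, corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "super produkt top verarbeitung",
    "langeweile pur nicht gut",
    "top service super schnell",
]
results = most_similar("super top", corpus, top_k=2)
```

Documents that share no term with the query score 0.0, so only the first and third toy documents can rank for the query "super top".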

New implementation: added PyTorch-based optimization, which works around the buggy loading of a sparse 'csr_matrix' into a CUDA tensor.
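The README does not say how the sparse loading issue is handled. One common route, assumed here purely for illustration, is to expand the CSR arrays (`data`, `indices`, `indptr`) into coordinate (COO) triplets, which is the layout PyTorch's `torch.sparse_coo_tensor` accepts. The sketch below does the conversion in plain Python; `csr_to_coo` is a hypothetical helper, not a function from the repository.

```python
def csr_to_coo(data, indices, indptr):
    """Expand CSR row pointers into explicit (row, col, value) triplets."""
    rows, cols, vals = [], [], []
    for row in range(len(indptr) - 1):
        # Entries of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            rows.append(row)
            cols.append(indices[k])
            vals.append(data[k])
    return rows, cols, vals


# CSR encoding of the 3x4 matrix
# [[0.5, 0.0, 0.0, 1.2],
#  [0.0, 0.0, 0.0, 0.0],
#  [0.0, 0.7, 0.0, 0.0]]
data = [0.5, 1.2, 0.7]
indices = [0, 3, 1]
indptr = [0, 2, 2, 3]

rows, cols, vals = csr_to_coo(data, indices, indptr)

# Assumption: with PyTorch installed, the triplets could then be passed to
# torch.sparse_coo_tensor([rows, cols], vals, size=(3, 4)) and moved to the
# GPU with .to("cuda").
```

With SciPy available, calling `.tocoo()` on the `csr_matrix` yields the same `row`, `col`, and `data` arrays without the explicit loop.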

Outcomes

  1. NumPy implementation

    Figures: Vanilla Optimization | Optimization with L2-Regularization

    Top 5 weighted terms:

    Terms        Weights   Terms (L2)   Weights (L2)
    langeweile   7.094     top          5.8911
    geilo        7.0535    langeweile   5.8396
    best         6.7828    geilo        5.7615
    love         6.376     perfekt      5.6325
    exzellent    6.3534    super        5.6279

  2. PyTorch implementation

    Figures: Vanilla Optimization | Optimization with L2-Regularization

    Histograms: Weights | Penalized Weights

    Top 5 weighted terms:

    Terms           Weights   Terms (L2)   Weights (L2)
    erfolgreichen   20.5452   cool         8.8814
    anmeldungen     20.0064   geil         8.0933
    angemessene     19.658    super        6.7332
    eonfach         19.5906   top          5.4004
    verarbeitung    19.5136   gut          4.8924
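
The comparison above (vanilla versus L2-regularized weights, then the top-weighted terms) can be sketched as follows. The training code is not shown in this README, so this is an assumed setup rather than the repository's implementation: a single-layer model trained by batch gradient descent on logistic loss, with an optional penalty `(l2 / 2) * ||w||^2`. The function names, hyperparameters, vocabulary, and toy data are hypothetical.

```python
import math


def train(X, y, lr=0.5, epochs=200, l2=0.0):
    """Batch gradient descent on mean logistic loss plus (l2 / 2) * ||w||^2."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / n_samples
        # The L2 term adds l2 * w to the gradient, shrinking weights each step.
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad)]
    return w


def top_terms(weights, vocab, k=5):
    """Return the k (term, weight) pairs with the largest weights."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical bag-of-words features: one column per vocabulary term.
vocab = ["super", "top", "langeweile", "schlecht"]
X = [
    [1, 1, 0, 0],  # positive example
    [1, 0, 0, 0],  # positive example
    [0, 0, 1, 1],  # negative example
    [0, 0, 0, 1],  # negative example
]
y = [1, 1, 0, 0]

w_plain = train(X, y)          # vanilla optimization
w_l2 = train(X, y, l2=0.1)     # optimization with L2 regularization

best_plain = top_terms(w_plain, vocab, k=2)
best_l2 = top_terms(w_l2, vocab, k=2)
```

On this toy data the penalized weights end up smaller in magnitude than the unpenalized ones, which matches the direction of the results reported in the tables above.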

Dependencies

Install dependencies using:

pip3 install -r requirements.txt 

Contact