String randomness score generator

A lightweight, zero-dependency package that generates a randomness score for a string. The score helps identify whether a string is gibberish or word-like. Some applications include:

Identify if a user is typing something or just banging the keyboard
Determine if a string is an API Key, Access Token, etc
Check if a string is something randomly generated by a computer

Usage

The tool returns a randomness score for a string. You can tune the threshold to suit your use case, but a score above 4 generally indicates that the input string is random. To try the tool on a single word, you can also use https://randomness-score-generator.web.app/

NPM package

Install the npm package
Import and use it in your code like this:

const Model = require('./Model');

// Remember to load the model before using it
Model.loadModel();

const score = Model.score("helloWorld");
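
Since the threshold is meant to be tuned per use case, it can help to wrap the score in a small helper. The following is a minimal sketch, not part of the package: `classifyString` and its `threshold` parameter are hypothetical names, and `scoreFn` stands in for a scoring function such as `Model.score`. A stub scorer is used here only so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of the package): turns a numeric score
// into a random / not-random decision using a tunable threshold.
function classifyString(input, scoreFn, threshold = 4) {
  const score = scoreFn(input);
  // A score strictly above the threshold is treated as random.
  return { input, score, isRandom: score > threshold };
}

// Stub scorer for demonstration only; in real code you would pass
// something like (s) => Model.score(s) after calling Model.loadModel().
const stubScore = () => 5;

console.log(classifyString('xkqzvj', stubScore));    // isRandom: true (5 > 4)
console.log(classifyString('xkqzvj', stubScore, 6)); // isRandom: false (5 <= 6)
```

Raising the threshold makes the check stricter about what counts as random, and lowering it makes the check more sensitive.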

How does it work?

Training

At its core, the model uses a bigram model to calculate the probability of the next character given the current character. (A higher-order n-gram model would likely give better results, but it is a work in progress.)
We parse through a comprehensive list of words in the English language to create a 2D table which stores the occurrence of each character following the current character.
While generating this table, we also add a special <.> character at the start and end of each word to count how often words start and end with each character. The table is then row-normalized so that each row sums to 1. This gives us the probability of each character following the current character. These probabilities are used in the score calculation.
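
The training step described above can be sketched as follows. This is an illustrative sketch rather than the project's actual generator (which lives in a Python notebook): the function name `buildBigramTable`, the use of `.` for the <.> marker, and the choice to keep only the letters a-z are all assumptions made for the example.

```javascript
// Hypothetical sketch of the training step; names are illustrative only.
const BOUNDARY = '.'; // stands in for the special <.> start/end marker

function buildBigramTable(words) {
  // counts: current character -> Map(next character -> occurrences)
  const counts = new Map();
  for (const rawWord of words) {
    // Assumption: lowercase the word and keep only the letters a-z.
    const word = rawWord.toLowerCase().replace(/[^a-z]/g, '');
    if (word.length === 0) continue;
    const padded = BOUNDARY + word + BOUNDARY;
    for (let i = 0; i < padded.length - 1; i++) {
      const current = padded[i];
      const next = padded[i + 1];
      if (!counts.has(current)) counts.set(current, new Map());
      const row = counts.get(current);
      row.set(next, (row.get(next) || 0) + 1);
    }
  }

  // Row-normalize: divide each count by its row total so each row sums to 1.
  const probabilities = new Map();
  for (const [current, row] of counts) {
    let total = 0;
    for (const count of row.values()) total += count;
    const normalized = new Map();
    for (const [next, count] of row) normalized.set(next, count / total);
    probabilities.set(current, normalized);
  }
  return probabilities;
}

// Tiny made-up word list, for illustration only.
const table = buildBigramTable(['cat', 'car', 'cot']);
console.log(table.get('c').get('a')); // 2 of the 3 pairs starting with "c" are "ca"
console.log(table.get('.').get('c')); // every word starts with "c", so this is 1
```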

Score Calculation

We first parse the word to convert it to lowercase and remove any extra characters.
Then, since we have a bigram model, we break the word down into pairs of adjacent characters (including the special start and end <.> character).
Next, we take the log of the probability of each pair. (As these probabilities are very small, their logs are easier to compare and combine.)
We add these log values for all the pairs in the word.
As this sum is a negative number, we invert it to get a positive value.
We divide this score by the number of characters in the word to get the final score.
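
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the package's implementation: `randomnessScore`, the flat `probabilities` lookup keyed by character pair, the natural logarithm, the fallback probability for unseen pairs, and the score of 0 for empty input are all choices made for the example. The toy probabilities are invented for illustration.

```javascript
// Hypothetical sketch of the scoring step; names and defaults are illustrative.
const BOUNDARY = '.';            // stands in for the special <.> marker
const UNSEEN_PROBABILITY = 1e-6; // assumption: fallback for pairs not in the table

function randomnessScore(input, probabilities) {
  // Step 1: lowercase and strip extra characters (assumed here: keep a-z only).
  const word = input.toLowerCase().replace(/[^a-z]/g, '');
  if (word.length === 0) return 0; // assumption: nothing to score

  // Step 2: break the padded word into adjacent pairs.
  const padded = BOUNDARY + word + BOUNDARY;
  let logSum = 0;
  for (let i = 0; i < padded.length - 1; i++) {
    const pair = padded[i] + padded[i + 1];
    const p = probabilities[pair] ?? UNSEEN_PROBABILITY;
    // Steps 3 and 4: take the log of each pair's probability and add them up.
    logSum += Math.log(p); // natural log is an assumption
  }

  // Steps 5 and 6: negate the sum, then divide by the number of characters.
  return -logSum / word.length;
}

// Invented toy probabilities, keyed by "current + next" character.
const toyProbabilities = { '.h': 0.5, 'hi': 0.5, 'i.': 0.5 };

console.log(randomnessScore('hi', toyProbabilities)); // three likely pairs: low score
console.log(randomnessScore('zq', toyProbabilities)); // three unseen pairs: high score
```

Dividing by the word length keeps scores comparable across strings of different lengths, since longer strings would otherwise accumulate a larger sum simply by having more pairs.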

Contribute

Create a fork and clone it.
To contribute to the model generation part, navigate to the modelGenerator/ folder. This contains a Python notebook used for generating the model. Feel free to suggest improvements to the model.
To contribute to the npm package, go into the modelGenerator/ directory, which contains the source code for the npm package as well as the latest model being used for calculation.

Bug Reporting

Report your issues at https://github.com/Pranav2612000/string_randomness_score_generator/issues

Gotchas & Improvements

The model is trained on English words and may not work for other languages.
To reduce training complexity, the model is case-insensitive.
The current model is not very accurate for very short strings.
The dataset the model is built on does not have first-class support for numbers and some special characters, so scores for strings containing these can be inaccurate.
The dataset does not include keyboard-common strings like "qwerty", so the results may not be correct for strings in this category.
The current model is a bigram model. We could use deep learning to replace it with a higher-order n-gram model for better results.

Maintainer

Pranav Joglekar

License

This project is licensed under the terms of the MIT open source license. Please refer to LICENSE.md for the full terms.
