A lightweight, zero-dependency package that generates a randomness score for a string. It can be used to tell whether a string is gibberish or word-like. Some applications include:
- Identify if a user is typing something or just banging the keyboard
- Determine if a string is an API key, access token, etc.
- Check if a string is something randomly generated by a computer
The tool returns a randomness score for a string. You can tune the threshold to suit your use case, but generally a score above 4 indicates that the input string is random. To try the tool on a single word, you can also test it at https://randomness-score-generator.web.app/
- Install the npm package
- Import and use it in your code as follows:
const Model = require('./Model');
// Remember to load the model before using it
Model.loadModel();
const score = Model.score("helloWorld");
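To turn the score into a yes/no decision, a small wrapper around the scoring call can apply the cutoff mentioned above. The following is a minimal sketch, not part of the package: `isLikelyRandom` and its `threshold` parameter are hypothetical names, and the default of 4 is simply the general guideline from this README.

```javascript
// Hypothetical helper (not part of the package): applies a cutoff to a score.
// `scoreFn` is any function that maps a string to a number, e.g. a wrapper
// around Model.score. The default threshold of 4 follows the guideline above.
function isLikelyRandom(scoreFn, input, threshold = 4) {
  const score = scoreFn(input);
  return { input, score, random: score > threshold };
}

// Example wiring (assumes Model.loadModel() has already been called):
//   const result = isLikelyRandom((s) => Model.score(s), "helloWorld");
//   if (result.random) { /* treat the input as gibberish */ }
```

Passing the scoring function in as an argument keeps the threshold logic independent of how the model is loaded, which also makes it easy to tune the cutoff per use case.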
- At its core, the model uses a bigram model to estimate the probability of the next character given the current character. (An n-gram model would give better results, but it is a work in progress.)
- We parse through a comprehensive list of words in the English language to create a 2D table which stores the occurrence of each character following the current character.
- While generating this table, we also add a special `<.>` character at the start and end of each word to count how many words start and end with each character. The table is then row-normalized so that each row sums to 1. This gives us the probability of a character following the current character. These probabilities are used in score calculation.
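The table construction described above can be illustrated in a few lines. This is a sketch under assumptions rather than the project's actual generator (which lives in a Python notebook): the function names `buildCounts` and `normalizeRows` are hypothetical, a plain `.` stands in for the special `<.>` marker, and the three-word list is toy data.

```javascript
const BOUNDARY = '.'; // stands in for the special <.> start/end marker

// Count how often each character follows another across a word list.
function buildCounts(words) {
  const counts = new Map(); // Map<prevChar, Map<nextChar, count>>
  for (const word of words) {
    const chars = [BOUNDARY, ...word.toLowerCase(), BOUNDARY];
    for (let i = 0; i < chars.length - 1; i++) {
      const row = counts.get(chars[i]) ?? new Map();
      row.set(chars[i + 1], (row.get(chars[i + 1]) ?? 0) + 1);
      counts.set(chars[i], row);
    }
  }
  return counts;
}

// Row-normalize: divide each count by its row total so every row sums to 1.
function normalizeRows(counts) {
  const probs = new Map();
  for (const [prev, row] of counts) {
    let total = 0;
    for (const n of row.values()) total += n;
    const probRow = new Map();
    for (const [next, n] of row) probRow.set(next, n / total);
    probs.set(prev, probRow);
  }
  return probs;
}

// Toy data: after 'c', 'a' appears twice and 'o' once.
const table = normalizeRows(buildCounts(['cat', 'car', 'cot']));
console.log(table.get('c')); // a -> 2/3, o -> 1/3
```

Because every word is wrapped in boundary markers, the row for the marker itself holds the probability of each character starting a word, and each character's entry for the marker holds the probability of it ending one.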
- We first parse the word to convert it to lowercase and remove any extra characters.
- Then, since we have a bigram model, we break the word down into pairs of adjacent characters (including the special start and end `<.>` character).
- Next, we get the log of the probability of each pair. (As these probabilities are very small, their log is a more uniform measure.)
- We add these log values for all the pairs in the word.
- As this sum is a negative number, we negate it to get a positive value.
- We divide this score by the number of characters in the word to get the final score.
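The scoring steps above can be sketched as a single function. This is an illustration under assumptions, not the package's implementation: `randomnessScore`, the `probs` lookup shape, the `floor` fallback for unseen pairs, the a-z cleaning rule, the use of the natural log, and the return value of 0 for empty input are all hypothetical choices made for the example.

```javascript
const BOUNDARY = '.'; // stands in for the special <.> start/end marker

// `probs` is a hypothetical lookup: probs[prev][next] = P(next | prev).
// `floor` is an assumed fallback probability for pairs missing from the table.
function randomnessScore(word, probs, floor = 1e-6) {
  // Lowercase and drop anything that is not a-z (assumed cleaning rule).
  const cleaned = word.toLowerCase().replace(/[^a-z]/g, '');
  if (cleaned.length === 0) return 0; // assumed convention for empty input

  // Break into adjacent pairs, including the start and end markers.
  const chars = [BOUNDARY, ...cleaned, BOUNDARY];
  let logSum = 0;
  for (let i = 0; i < chars.length - 1; i++) {
    const p = probs[chars[i]]?.[chars[i + 1]] ?? floor;
    logSum += Math.log(p); // natural log; the base is an assumption
  }

  // Negate the (negative) sum and divide by the number of characters.
  return -logSum / cleaned.length;
}

// Toy table in which "ab" is the only likely word.
const toyProbs = { '.': { a: 0.5 }, a: { b: 0.5 }, b: { '.': 0.5 } };
console.log(randomnessScore('ab', toyProbs)); // low: every pair is in the table
console.log(randomnessScore('ba', toyProbs)); // high: every pair falls back to `floor`
```

Dividing by the character count keeps long and short strings comparable, since the raw log sum grows with the number of pairs.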
- Create a fork and clone it.
- To contribute to the model generation part, navigate to the modelGenerator/ folder. It contains a Python notebook used for generating the model. Feel free to suggest improvements to the model.
- To contribute to the npm package, go into the modelGenerator/ directory, which contains the source code for the npm package as well as the latest model used for calculation.
Report your issues at https://github.com/Pranav2612000/string_randomness_score_generator/issues
- The model is trained on English words and may not work for other languages.
- To reduce training complexity, the model is case-insensitive.
- The current model is not very accurate for very short strings.
- The dataset the model is built on does not have first-class support for numbers and some special characters, so scores for strings containing these can be inaccurate.
- The dataset does not include keyboard-common strings like "qwerty", so the results may not be correct for strings of this category.
- The current model is a bigram model. Deep learning could be used to replace it with an n-gram model for better results.
- Pranav Joglekar
This project is licensed under the terms of the MIT open source license. Please refer to LICENSE.md for the full terms.