
Security tool that scans URLs and predicts whether they are malicious, using a Logistic Regression model.

Primary Language: Python

URL-Malware-Analyzer

Security tool that scans URLs and predicts whether they are malicious. The prediction relies on a list of URLs, their respective labels (many suitable lists are available online), and the Logistic Regression model from scikit-learn.
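As a rough illustration of that flow, the sketch below trains a Logistic Regression classifier on a handful of labeled URLs. The URLs, the labels, and the choice of `TfidfVectorizer` with character n-grams are assumptions made for this example and are not taken from the project's code; a real run would load one of the online lists instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration only: 1 = malicious, 0 = benign.
urls = [
    "example.com/index.html",
    "example.org/docs/getting-started",
    "example.net/blog/2020/01/welcome",
    "login-verify.example.test/account/update.php?id=123",
    "free-prizes.example.test/claim/now.exe",
    "secure-update.example.test/bank/confirm.php",
]
labels = [0, 0, 0, 1, 1, 1]

# Vectorize each URL (counting plus normalization), then fit the classifier.
# Character n-grams are an assumption; word-level tokens would also work.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(urls, labels)

# One predicted label per URL, plus one probability per class per URL.
predictions = model.predict(urls)
probabilities = model.predict_proba(urls)
```

Six URLs are far too few to learn anything meaningful; the point is only the shape of the pipeline: vectorize, fit, predict.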

Vectorization

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy used here (tokenization, counting, and normalization) is called the "Bag of Words" or "Bag of n-grams" representation.
Documents are described by word occurrences, while the relative position of the words in the document is ignored entirely.
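To make the "ignores position" point concrete, here is a minimal pure-Python sketch of the counting step. The helper `bag_of_words` and the two sample URLs are hypothetical, and the simple regular-expression tokenizer is an assumption, not the tokenizer scikit-learn uses.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Split text into lowercase alphanumeric tokens and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two hypothetical URLs containing the same tokens in a different order.
a = bag_of_words("login.example.com/secure/update")
b = bag_of_words("update.example.com/secure/login")

# The counts are identical because token order is discarded.
same = a == b
```

Both URLs produce the same bag, which is exactly the information loss the representation accepts in exchange for simple numeric features.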

sklearn.feature_extraction.text
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.
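A small sketch of how `CountVectorizer` from `sklearn.feature_extraction.text` addresses this: inputs of different lengths all map to rows of the same width. The two sample strings are hypothetical, and the default tokenizer settings are used as an assumption about what is appropriate for URLs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical inputs of very different lengths.
docs = [
    "example.com",
    "example.com/login/verify/account/update",
]

# Learn the vocabulary and build the document-term count matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Every document becomes one row; the number of columns equals the
# vocabulary size, regardless of how long each input string is.
n_rows, n_cols = X.shape
vocabulary_size = len(vectorizer.vocabulary_)
```

The resulting matrix is sparse, which keeps memory use manageable when the vocabulary grows to many thousands of tokens.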