spam-detection

Email spam detection experiments


Spam Detection via Deep Forest

This project uses machine learning models to predict whether a message is spam.
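A rough sketch of the overall flow is shown below. It is only an illustration: it vectorizes a few made-up messages and trains a plain scikit-learn random forest as a stand-in for the deep forest (gcforest) model used in the notebooks.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy messages and labels (1 = spam, 0 = ham); the real experiments use the UCI datasets listed below.
messages = [
    'Win a free prize now!!!',
    'Are we still meeting for lunch?',
    'Claim your reward, click here',
    'See you at the office tomorrow',
]
labels = [1, 0, 1, 0]

# Turn the raw text into word count vectors, then fit a forest-based classifier.
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(messages)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Score a new, unseen message.
print(clf.predict(vectorizer.transform(['Free reward waiting for you'])))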

Datasets

The experiments are performed on the following two datasets:

  • UCI-youtube
  • UCI-sms

Text Processing

The text processing methods include:

  • Remove stop words
  • Build word count vectors
  • TF-IDF (see the example below)
  • Texts to sequences

Example of building the word count vector:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
print(X.toarray())
# out: [[0 1 1 1 0 0 1 0 1]
#       [0 2 0 1 0 1 1 0 1]
#       [1 0 0 1 1 0 1 1 1]
#       [0 1 1 1 0 0 1 0 1]]
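The stop-word removal and TF-IDF steps can be sketched the same way. The snippet below is a minimal illustration (not the exact code from the notebooks) that uses scikit-learn's TfidfVectorizer on the same toy corpus, dropping English stop words and weighting the remaining terms by TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# stop_words='english' removes common words such as 'this', 'is' and 'the';
# the remaining terms are weighted by term frequency * inverse document frequency.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)   # only the non-stop words (e.g. 'document') remain in the vocabulary
print(X.toarray())              # one row of TF-IDF weights per document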

Example of the texts-to-sequences approach:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

text1 = 'Some ThING to eat !'
text2 = 'some thing to drink .'
texts = [text1, text2]
print(texts)
# out: ['Some ThING to eat !', 'some thing to drink .']
tokenizer = Tokenizer(num_words=100)  # num_words: None or an integer, the maximum number of words to keep based on frequency; less frequent words are dropped
tokenizer.fit_on_texts(texts)
print(tokenizer.word_counts)
# out: OrderedDict([('some', 2), ('thing', 2), ('to', 2), ('eat', 1), ('drink', 1)])
print(tokenizer.word_index)
# out: {'some': 1, 'thing': 2, 'to': 3, 'eat': 4, 'drink': 5}
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print(sequences)
# out: [[1, 2, 3, 4], [1, 2, 3, 5]]  converted to sequences; the two sentences have the same length here, so the outputs line up, but sentences of different lengths would yield sequences of different lengths
print('Found %s unique tokens.' % len(word_index))
# out: Found 5 unique tokens.
SEQ_LEN = 10
data = pad_sequences(sequences, maxlen=SEQ_LEN)  # pad (or truncate) every sequence to a fixed length of SEQ_LEN
print(data)
# out: [[0 0 0 0 0 0 1 2 3 4]
#       [0 0 0 0 0 0 1 2 3 5]]

Tree

.
├── [1.9K]  README.md
├── [ 30K]  ResultsRecord.xlsx
├── [ 160]  code
├── [ 352]  data
├── [ 448]  gcforest
├── [1.1K]  img
├── [ 704]  notebook
├── [ 640]  pkl
└── [ 190]  requirements.txt

data & pkl Extraction code: vt1f