Antivirus: A Java repository from aligeekk

====================
Antivirus 
====================

An antivirus program written in Java that can scan a file and detect if it is
a virus using Bayesian analysis.


USAGE
--------------------
There are 5 buttons:

1) Open Directory - Choose a directory. If "virusdb.ser" exists in the 
directory, the previous save state will automatically be loaded. Otherwise,
a new database will be created at runtime.

2/3) Learn Benign Files/Learn Viruses - Choose a directory containing the
known benign files/viruses in order to train the program.

4) Clear Database - Clears the current working database and chosen directory.
No files will be deleted.

5) Scan File - Choose a file. The program will then scan the file and
calculate the ratio of virus/benign based on the PROBABILITY CALCULATION
method below. Then, the program predicts whether the file is a virus or not
based on the ratio.


In order to use the program, you have to train it first. Start by clicking
"Learn Benign Files" and "Learn Viruses." Each button will prompt you to
choose a directory in which the known benign files or viruses are stored.
Then, the program will scan the files and count the n-grams for each file (my
program uses 4-character sequences). While the program is learning, nothing
is printed to the console; for some reason, all output is held back until
learning has finished. Learning may take up to 5 seconds, and the program
will prompt you when it is done.
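
As a rough illustration of the counting step, here is a minimal sketch. It is
not the repository's actual code: the class and method names are hypothetical,
and it assumes that the 4-character windows overlap and that the file is
decoded one byte per character.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class NGramCounter {
    // The README states that 4-character sequences are used.
    static final int N = 4;

    // Adds every overlapping N-character window of `text` to `counts`.
    // Overlapping windows are an assumption, not confirmed by the README.
    static void countNGrams(String text, Map<String, Integer> counts) {
        for (int i = 0; i + N <= text.length(); i++) {
            counts.merge(text.substring(i, i + N), 1, Integer::sum);
        }
    }

    // Reads the whole file and counts its n-grams. ISO-8859-1 maps each byte
    // to exactly one char, so binary files are not mangled (an assumption).
    static void countFile(Path file, Map<String, Integer> counts) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        countNGrams(new String(bytes, StandardCharsets.ISO_8859_1), counts);
    }

    // Convenience wrapper that returns a fresh table for one string.
    static Map<String, Integer> countNGrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        countNGrams(text, counts);
        return counts;
    }
}
```

Training on a directory would then amount to calling countFile for each file,
passing the virus table or the benign table as appropriate.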

On exit, the program will ask you if you want to save. If you want to save, 
you must first choose a directory by clicking "Open Directory." The serialized
data will be saved as "virusdb.ser" in the chosen directory.

The top-right panel contains the current directory as well as the number of
files that have been used to train the program in the current session.


PROBABILITY CALCULATION
-----------------------
I calculated probabilities using this method:

http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_Classification

When a file is scanned, I compute the natural log of the ratios. The formula
is as follows:

ln[p(virus|file)/p(not virus|file)] = sum{ln[p(word|virus)/p(word|not virus)]}

If the sum of the logs is greater than 0, then the file is a virus. If the 
sum is less than 0, then the file is benign.

N-grams that have not been seen in the training phase are skipped.
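
The scoring rule can be sketched as follows. This is a sketch under stated
assumptions, not the program's actual implementation: the names are
hypothetical, p(word|class) is assumed to be the n-gram's count divided by the
total count in that class's table, and an n-gram is skipped when it is missing
from either table.

```java
import java.util.Map;

// Hypothetical scorer for illustration only.
final class LogRatioScorer {
    static final int N = 4; // n-gram length stated in the README

    // Sums ln[p(gram|virus) / p(gram|benign)] over the n-grams of `text`.
    static double logRatio(String text,
                           Map<String, Integer> virusCounts,
                           Map<String, Integer> benignCounts) {
        long virusTotal = total(virusCounts);
        long benignTotal = total(benignCounts);
        double sum = 0.0;
        for (int i = 0; i + N <= text.length(); i++) {
            String gram = text.substring(i, i + N);
            Integer v = virusCounts.get(gram);
            Integer b = benignCounts.get(gram);
            if (v == null || b == null) {
                continue; // unseen n-gram: skipped (assumed for either table)
            }
            double pVirus = (double) v / virusTotal;    // assumed estimate
            double pBenign = (double) b / benignTotal;  // assumed estimate
            sum += Math.log(pVirus / pBenign);
        }
        return sum;
    }

    // A positive sum means "virus", matching the rule described above.
    static boolean isVirus(String text,
                           Map<String, Integer> virusCounts,
                           Map<String, Integer> benignCounts) {
        return logRatio(text, virusCounts, benignCounts) > 0;
    }

    // Total number of n-gram occurrences recorded in one table.
    private static long total(Map<String, Integer> counts) {
        long t = 0;
        for (int c : counts.values()) {
            t += c;
        }
        return t;
    }
}
```

Summing logs rather than multiplying the raw ratios avoids floating-point
underflow when a file contains many n-grams.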

Overall, this method is okay at categorizing files. There are quite a few
false negatives, meaning that virus files are classified as benign. I believe
that this is caused by the unevenness of the two training directories.
Although there are more virus files, there are more n-grams in the benign
directory. Therefore, the counts are generally higher in the benign hash
table, skewing the results a bit towards the benign side in cases where viruses

OTHER INFO
----------------------
When the state is saved as "virusdb.ser", the VirusDB object is serialized.
VirusDB contains two hash tables (one for viruses and one for benign files), a
list of the files used for training, the number of files used for training,
and the directory the file was saved in.
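
For readers unfamiliar with Java serialization, the save/load cycle might look
roughly like the sketch below. The class name, field names, and methods are
hypothetical stand-ins chosen to mirror the description above; the real VirusDB
class may be laid out differently.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Hypothetical stand-in for VirusDB; all names here are assumptions.
final class VirusDBSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    static final String FILE_NAME = "virusdb.ser";

    final Hashtable<String, Integer> virusCounts = new Hashtable<>();
    final Hashtable<String, Integer> benignCounts = new Hashtable<>();
    final List<String> trainedFiles = new ArrayList<>();
    int trainedFileCount;
    String directory;

    // Serializes this object to "virusdb.ser" inside the chosen directory.
    void save(File dir) throws IOException {
        directory = dir.getPath();
        File target = new File(dir, FILE_NAME);
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(target))) {
            out.writeObject(this);
        }
    }

    // Loads the saved state if "virusdb.ser" exists, else starts empty.
    static VirusDBSketch loadOrCreate(File dir)
            throws IOException, ClassNotFoundException {
        File source = new File(dir, FILE_NAME);
        if (!source.isFile()) {
            return new VirusDBSketch();
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(source))) {
            return (VirusDBSketch) in.readObject();
        }
    }
}
```

In this sketch, loadOrCreate corresponds to what "Open Directory" does, and
save corresponds to accepting the save prompt on exit.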