FiletypeDetection

You can read more about the project on its corresponding JIRA issue.

File Byte Histogram Machine Learning Classification

This code generates a model that Tika can use for content-based MIME type detection from byte frequency histograms.
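To make the feature concrete, here is a minimal Python sketch of a byte frequency histogram. It is an illustration only, not the code in main.r, and the function name `byte_histogram` is hypothetical. It returns raw counts with no companding; whether the model expects raw counts or scaled values is not stated here, so treat that as an assumption.

```python
def byte_histogram(data: bytes) -> list[int]:
    """Count how often each of the 256 possible byte values occurs in data.

    Assumption: raw counts, no scaling and no companding function applied.
    """
    counts = [0] * 256
    for byte in data:  # iterating over a bytes object yields integers in 0..255
        counts[byte] += 1
    return counts


if __name__ == "__main__":
    sample = b"%PDF-1.4"  # a few bytes standing in for a real file's content
    histogram = byte_histogram(sample)
    print(len(histogram), sum(histogram))
```

The result always has 256 entries regardless of file size, which is what lets files of different lengths share one fixed-width feature vector.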

Steps

Identify the type of file you want Tika to detect using this method.
Collect files of this type as well as files that are not of this type.
Create three datasets:
- Training
- Testing
- Validation
The dimensionality of each set is m*(256+1), where m is the number of examples in that set (training, testing, or validation), 256 is the number of features (the byte frequency histogram, not preprocessed with a companding function), and the extra 1 is the labeled output.
These can be generated as CSV files.
Run main.r
The output of main.r is tika.model, which can be used in Tika.
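The dataset steps above could be sketched in Python as follows. This is a hedged illustration, not part of the project: the function names, the output file names (`training.csv`, `testing.csv`, `validation.csv`), the 60/20/20 split, the 1/0 label values, the label in the last column, and the absence of a header row are all assumptions. Check what main.r actually expects before using it.

```python
import csv
import random
from collections import Counter
from pathlib import Path


def make_row(data: bytes, label: int) -> list[int]:
    """Return 256 raw byte counts followed by the label, 257 columns in total."""
    counts = Counter(data)  # maps byte value (0..255) to its number of occurrences
    return [counts[value] for value in range(256)] + [label]


def split_rows(rows, train_pct=60, test_pct=20, seed=0):
    """Shuffle rows and split them into training, testing, and validation lists.

    Assumption: a 60/20/20 split; the validation set receives the remainder.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


def write_csv(path, rows):
    """Write rows as comma-separated values without a header row (assumption)."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def build_datasets(positive_dir, negative_dir, out_dir):
    """Label files of the target type 1 and all other files 0, then write three CSVs."""
    rows = []
    for directory, label in ((positive_dir, 1), (negative_dir, 0)):
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                rows.append(make_row(path.read_bytes(), label))
    train, test, validation = split_rows(rows)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names; main.r may expect different ones.
    write_csv(out / "training.csv", train)
    write_csv(out / "testing.csv", test)
    write_csv(out / "validation.csv", validation)
    return len(train), len(test), len(validation)
```

A call such as `build_datasets("samples/target", "samples/other", "datasets")`, with hypothetical directory names, would produce three files whose rows each have 257 columns, matching the m*(256+1) shape described above.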

For more detailed documentation, download Documenation_NNModelIntegrationWithTika.docx from this project.