You can read more about the project on its corresponding JIRA issue.

File Byte Histogram Machine Learning Classification
This code generates a model that Tika can use for content-based MIME type detection from byte frequency histograms.
- Identify the type of file you want Tika to detect using this method.
- Collect files of this type as well as files that are not of this type.
- Create three datasets:
  - Training
  - Testing
  - Validation
Each dataset has dimensions m*(256+1), where m is the number of training, validation, or test examples; 256 is the number of features (the byte frequency histogram, not preprocessed with a companding function); and the extra 1 is the labeled output.
- The datasets can be generated as CSV files.
- Run main.r
- The output of main.r is tika.model, which can be used in Tika.
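
The dataset preparation steps above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's own tooling: the function names, the 1/0 label encoding, the 60/20/20 split, the use of raw (unnormalized) byte counts, and the header-less CSV layout are all illustrative choices. Check what main.r actually expects before using it.

```python
import csv
import random
from collections import Counter
from pathlib import Path


def byte_histogram(data):
    """Return 256 raw counts, one per possible byte value (no companding)."""
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return [counts.get(value, 0) for value in range(256)]


def build_rows(paths, label):
    """Turn each file into one row: 256 histogram features plus the label."""
    # Assumption: label is 1 for the target type and 0 for everything else.
    return [byte_histogram(Path(p).read_bytes()) + [label] for p in paths]


def split_rows(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets."""
    # Assumption: a 60/20/20 split; the source does not prescribe a ratio.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def write_csv(rows, path):
    """Write rows to a CSV file without a header, one example per line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

To use it, build rows for the files of the target type with `build_rows(paths, 1)` and for the other files with `build_rows(paths, 0)`, pass the combined list to `split_rows`, and write each of the three results with `write_csv`. Each resulting file then holds m rows of 257 columns, matching the m*(256+1) layout described above.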
For more detailed documentation, download the Documenation_NNModelIntegrationWithTika.docx in this project.
Send questions or comments to Chris A. Mattmann.