Foreign exchange market prediction using natural language processing.
- gensim v1.0.1
- Historical forex data: http://www.histdata.com/download-free-forex-data/?/ascii/tick-data-quotes
- News - Reuters: https://github.com/philipperemy/Reuters-full-data-set
- News - Bloomberg + Reuters: https://github.com/philipperemy/financial-news-dataset
- Primary observations
- Data were extracted with the script "read.py" provided in the above repo, saved into "reuters-all.txt", then chopped into separate files, reuters-link.txt, reuters-timestamp.txt and reuters-title.txt, using sed (a Python equivalent is sketched right after this list).
- Total days with at least one news item: 3514
- Total news: 8,551,441.
- Total lines in reuters-all.txt without preprocessing: 8,551,467, i.e. 26 more than the news count. "read.py" has since been modified to remove this corruption.
- Dates of news range from 20070101 to 20160816, a span of 3516 days.
- Timestamps are sorted by day but not necessarily by time.
- Number of news items per day: approx. 300-1700
- Holidays reduce the number of news items, e.g. 25 Dec has 539.
- According to the README there are 450,341 Bloomberg and 109,110 Reuters news items; however, I count 448,395 Bloomberg files.
- Total size of news is 2.3 GB for Bloomberg and 555 MB for Reuters.
- After preprocessing and concatenating, the corpus file has 737,222 lines, far more than expected; I suspect extra newline characters are inflating the count. File size is reduced to 1.3 GB (a 40% reduction, which seems surprisingly large).
- Training on a 770 MB file consumes about 35 GB of memory.
- Some news items include garbage in the original text; see the bottom of 2010-06-14/u-a-e-central-bank-head-sees-economy-growing-4-in-2010-after-contraction. The garbage starts at "Enlarge image".
- Does the preprocessing script corrupt the text?
- Word count: 205,970,873
- Multicore doesn't work on CentOS 7, as can be seen from the CPU % in top: it typically stays at 100% per thread while running the doc2vec training. On my home Mac, the CPU % increases as more workers are added (4 workers -> 250% CPU).
- On Linux, I tried both Anaconda and vanilla Python, but the max CPU % was about 200% in both cases (possibly the compiled C extension is missing; see the FAST_VERSION check further below).
- Hyperparameter tuning:
- Counting lines containing "http" in the text output of the original "read.py" gives 8,551,441. That's 26 fewer than the initial naive line count.
- Save the result as a plain (non-decorated) text file. When I do this, entries containing \n are wrapped in "" while those without are not; this can be used to test the data cleaning.
- More text is better for word embedding; use mixed sources.
- Clean title.txt by removing duplicates etc.
- Unify timezones.
- Check that all txt files can be loaded as pandas DataFrames; currently the timestamps don't parse properly.
- Make a histogram of the number of news items per day (see the pandas sketch after this list).
- Use Bloomberg dataset.
- See if I can split the input file into multiple ones.
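As referenced above, a rough Python equivalent of the sed split. It assumes "read.py" emits three lines per news item (link, timestamp, title) in that order; the record layout is an assumption, so adjust the offsets to the actual output.

```python
# Hedged sketch: split reuters-all.txt into the three per-field files,
# assuming each news item occupies three consecutive lines
# (link, timestamp, title). Adjust if "read.py" writes a different layout.
with open("reuters-all.txt") as src, \
     open("reuters-link.txt", "w") as links, \
     open("reuters-timestamp.txt", "w") as stamps, \
     open("reuters-title.txt", "w") as titles:
    outs = [links, stamps, titles]
    for i, line in enumerate(src):
        outs[i % 3].write(line)
```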
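And a sketch for the pandas load check plus the news-per-day histogram. It assumes the timestamps are in a format pandas can parse automatically; since parsing currently fails, errors="coerce" surfaces the bad rows instead of raising.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the (assumed one-timestamp-per-line) file and flag unparsable rows.
stamps = pd.read_csv("reuters-timestamp.txt", header=None, names=["timestamp"])
stamps["timestamp"] = pd.to_datetime(stamps["timestamp"], errors="coerce")
print(stamps["timestamp"].isnull().sum(), "rows failed to parse")

# Histogram of the number of news items per calendar day.
valid = stamps.dropna()
per_day = valid.groupby(valid["timestamp"].dt.date).size()
per_day.hist(bins=50)
plt.xlabel("news items per day")
plt.show()
```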
-
Training was successful on the server with the doc2vec-lee notebook.
-
Using the notebook "doc2vec-reuters.ipynb". Original: doc2vec-lee.ipynb (available in the gensim repo).
-
model.train() takes a very long time. Use the servers for this.
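A minimal sketch of the slow step, following the doc2vec-lee recipe. "titles" (a list of token lists built from the preprocessed news) is a hypothetical stand-in for the real corpus.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# "titles" is hypothetical: a list of token lists from the preprocessed news.
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(titles)]
model = Doc2Vec(size=100, min_count=5, workers=4, iter=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count)  # the long-running call
# (gensim 1.0.x signature; gensim >= 2 also requires epochs=model.iter)
```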
-
Check the availability of BLAS in the Python that I installed on the server -> training takes a few seconds, which means BLAS is available.
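A quick way to verify this, assuming standard numpy/gensim installs. A FAST_VERSION of -1 would also explain the poor multicore scaling noted earlier, since gensim then falls back to pure-Python training.

```python
import numpy
numpy.show_config()  # prints the BLAS/LAPACK libraries numpy is linked against

from gensim.models.doc2vec import FAST_VERSION
print(FAST_VERSION)  # -1 means the compiled C extension is missing, so
                     # training runs as slow, poorly-scaling pure Python
```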
-
Parameter tuning:
- vocabulary (size; which parts of speech to keep: nouns, verbs, adjectives)
- vector dimensionality
- Doc2vec only? Word2vec also?
- number of epochs
- model type (DBOW, DM)
- window size over the corpus
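For reference, how these knobs map onto gensim 1.0.x keyword arguments (size/iter are the old names for the newer vector_size/epochs). The values are placeholders, not tuned results; the part-of-speech filtering belongs in preprocessing rather than here.

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    dm=1,         # model type: 1 = DM, 0 = DBOW
    size=100,     # vector dimensionality
    window=8,     # window size over the corpus
    min_count=5,  # frequency cutoff controlling vocabulary size
    iter=10,      # number of training epochs
    workers=4,    # worker threads (needs the compiled extension to scale)
)
```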
-
Saving trained model: model.save("filename")
-
Loading trained model: model = gensim.models.doc2vec.Doc2Vec.load("filename") (use Doc2Vec.load rather than np.load; np.load does not reconstruct a usable model)