data_science

implicit signals beat explicit ones (almost always): clickbait, rating psychology
your model will learn what you teach it to learn: feature, function, f score
sup + unsup = life
everything is ensemble
model sequences: output of the model is input of others
FE: reusable, transformable, interpretable, reliable
ML infra: experimentation phase: ease of use, flexibility, reusability. production phase: performance, scalability
Debugging feature values
you don't need to distribute ML algo
DS + ML engineering = perfection

31.5

pycon2016: https://www.youtube.com/channel/UCwTD5zJbsQGJN75MwbykYNw
andreas, intro ML/sklearn for DS: https://github.com/amueller/introduction_to_ml_with_python
Berkeley ds intro: https://data-8.appspot.com/sp16/course

30.5

dirichlet process: http://stiglerdiet.com/blog/2015/Jul/28/dirichlet-distribution-and-dirichlet-process/
pycon 2016: https://github.com/justmarkham/pycon-2016-tutorial/
romance in word2vec: http://www.ghostweather.com/files/word2vecpride/
topic quality coherence: http://palmetto.aksw.org/palmetto-webapp/
https://spacy.io/docs
https://spacy.io/docs/tutorials/twitter-filter
http://sebastianraschka.com/Articles/2014_naive_bayes_1.html

29.5

cry analysis: http://www.robinwe.is/explorations/cry.html
spacy preprocessing: https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py
spacy Tweet: https://spacy.io/docs/tutorials/twitter-filter
lda2vec: full http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=
probabilistic approach: http://chirayukong.github.io/infsci2725/resources/09_Probabilistic_Approaches.pdf
lda curation: https://datawarrior.wordpress.com/2016/04/20/local-and-global-words-and-topics/
why hdbscan: http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb
auto ml: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
http://www.kdnuggets.com/2016/05/five-machine-learning-projects-cant-overlook.html
topic2vec: https://www.cs.cmu.edu/~diyiy/docs/naacl15.pdf

26.5

25.5

In summary, here is what I recommend if you plan to use word2vec: choose the right training parameters and training data for word2vec, use an average predictor for queries, sentences and paragraphs (code here) after picking a dominant word set, and apply deep learning to the resulting vectors.
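
A minimal sketch of the "avg predictor" idea, assuming it means taking the mean of the word vectors in a query, sentence or paragraph. The toy vectors, the `average_vector` helper and the optional `dominant_words` filter are hypothetical names made up for illustration; with a real model the vectors would come from the trained word2vec lookup instead.

```python
import numpy as np

# Hypothetical toy embeddings; a trained word2vec model would supply these.
toy_vectors = {
    "car": np.array([0.9, 0.1, 0.0]),
    "auto": np.array([0.8, 0.2, 0.0]),
    "insurance": np.array([0.1, 0.9, 0.0]),
    "pizza": np.array([0.0, 0.0, 1.0]),
}


def average_vector(tokens, vectors, dominant_words=None):
    """Mean of the vectors of known tokens; zero vector if none are known.

    dominant_words (assumption): optional set used to keep only the
    words considered important for the task.
    """
    dim = len(next(iter(vectors.values())))
    kept = [
        vectors[t]
        for t in tokens
        if t in vectors and (dominant_words is None or t in dominant_words)
    ]
    if not kept:
        return np.zeros(dim)
    return np.mean(kept, axis=0)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Out-of-vocabulary words ("cheap", "quote") are simply skipped.
q1 = average_vector("cheap car insurance".split(), toy_vectors)
q2 = average_vector("auto insurance quote".split(), toy_vectors)
q3 = average_vector("pizza".split(), toy_vectors)

print(cosine(q1, q2), cosine(q1, q3))
```

The averaged vectors can then be fed to any downstream model; the mean keeps the dimensionality fixed regardless of text length.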

===

For SGNS, here is what I believe really happens during training: if two words appear together, training tries to increase their cosine similarity. If two words never appear together, training reduces their cosine similarity. So if there are a lot of user queries such as “auto insurance” and “car insurance”, then the “auto” vector will be similar to the “insurance” vector (cosine similarity ~= 0.3) and the “car” vector will also be similar to the “insurance” vector. Since “insurance”, “loan” and “repair” rarely appear together in the same context, their vectors have small mutual cosine similarity (cosine similarity ~= 0.1). We can treat them as orthogonal to each other and think of them as different dimensions. After training is complete, the “auto” vector will be very similar to the “car” vector (cosine similarity ~= 0.6) because both of them are similar in the “insurance” dimension, the “loan” dimension and the “repair” dimension. This intuition is useful if you want to design your training data to better meet the goal of your text learning task.
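
The geometry behind this intuition can be checked with a toy construction. This is only a sketch under simplifying assumptions that are not in the note: the context words are taken as exactly orthogonal unit vectors (0 instead of ~0.1), each of the two words has cosine 0.3 with every shared context word, and whatever is left of each word vector points in a private direction. Under those assumptions the similarity of the pair is `num_contexts * 0.3**2`, so many weak shared contexts add up to one strong similarity. `toy_word_pair` is a hypothetical helper.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def toy_word_pair(num_contexts, sim_to_context=0.3):
    """Build unit vectors for two words that share num_contexts contexts.

    Axes 0..num_contexts-1 are the (orthogonal) context words; the last
    two axes are private directions, one per word.
    """
    shared_mass = num_contexts * sim_to_context ** 2
    if shared_mass > 1.0:
        raise ValueError("too many contexts for this similarity level")
    contexts = np.eye(num_contexts + 2)[:num_contexts]
    shared = np.full(num_contexts, sim_to_context)
    private = np.sqrt(1.0 - shared_mass)
    auto = np.concatenate([shared, [private, 0.0]])
    car = np.concatenate([shared, [0.0, private]])
    return contexts, auto, car


for k in (1, 3, 7):
    contexts, auto, car = toy_word_pair(k)
    print(k, cosine(auto, contexts[0]), cosine(auto, car))
```

In this toy setup three shared contexts alone do not reach the quoted ~0.6; it takes roughly seven contexts at cosine 0.3, which suggests that in real data the similarity comes from many shared contexts, not just three.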

===

For short sentences/phrases, Tomas Mikolov recommends simply adding up the individual word vectors to get a "sentence vector" (see his recent NIPS slides).

For longer documents, it is an open research question how to derive their representation, so no wonder you're having trouble :)

I like the way word2vec runs (no need for heavy hardware to process a huge collection of text). It's more usable than LSA or any system that requires a term-document matrix.

Actually LSA requires less structured data (only a bag-of-words matrix, whereas word2vec requires exact word sequences), so there's no fundamental difference in input complexity.
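
To make the input difference concrete, here is a small sketch: two token lists with the same words in a different order give identical bag-of-words counts (all that a term-document matrix sees), but different (center, context) pairs of the kind a skip-gram model trains on. The `skipgram_pairs` helper is a simplified illustration with a fixed window, not the actual word2vec preprocessing (which also applies subsampling and a randomly shrunk window).

```python
from collections import Counter


def bag_of_words(tokens):
    """Word counts only; word order is discarded."""
    return Counter(tokens)


def skipgram_pairs(tokens, window=1):
    """(center, context) pairs within a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


doc_a = "new york pizza place".split()
doc_b = "new pizza york place".split()

print(bag_of_words(doc_a) == bag_of_words(doc_b))
print(Counter(skipgram_pairs(doc_a)) == Counter(skipgram_pairs(doc_b)))
```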

24.5

TSNE:

Conferences:

word2vec tree: https://github.com/pvthuy/word2vec-visualization
flask, api, mongo, d3: http://adilmoujahid.com/posts/2015/01/interactive-data-visualization-d3-dc-python-mongodb/
https://github.com/RaRe-Technologies/movie-plots-by-genre
wmd: http://vene.ro/blog/word-movers-distance-in-python.html
word2vec viz: https://ronxin.github.io/wevi/
news analytics in finance: https://vimeo.com/67901816
table2vec: http://www.slideshare.net/SparkSummit/using-data-science-to-transform-opentable-into-delgado-das
data by the bay: http://data.bythebay.io/schedule.html
pydataberlin: http://pydata.org/berlin2016/

20.5

scatter with images: https://gist.github.com/lukemetz/be6123c7ee3b366e333a

19.5

wise 203 classes, vocab = 300k, sample = 64k, test = 34k, http://alexanderdyakonov.narod.ru/wise2014-kaggle-Dyakonov.pdf
yelp review to multi label: food, deal, ambience,... http://www.ics.uci.edu/~vpsaini/
instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
emoji embedding http://www.danielforsyth.me/nba-twitter-emojis-and-word-embeddings/
tweetmap in websummit event: http://blog.aylien.com/post/133931414053/analyzing-tweets-from-web-summit-2015-semantic
topic2vec: http://arxiv.org/pdf/1506.08422.pdf
http://googleresearch.blogspot.com/2016/05/chat-smarter-with-allo.html
https://en.wikipedia.org/wiki/Limited-memory_BFGS

18.5

building data processing at budget: http://www.slideshare.net/GaelVaroquaux/building-a-cuttingedge-data-processing-environment-on-a-budget
https://radimrehurek.com/gensim/wiki.html
calibration: http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#example-calibration-plot-calibration-curve-py
http://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html
which feature selection: http://sebastianraschka.com/faq/docs/feature_sele_categories.html
which learning algos: http://sebastianraschka.com/faq/docs/best-ml-algo.html
for interpretability use tree: http://sebastianraschka.com/faq/docs/model-selection-in-datascience.html
LR vs NB: http://sebastianraschka.com/faq/docs/naive-bayes-vs-logistic-regression.html
yelp review classifier: https://github.com/parulsingh/FlaskAppCS194
sgns is not mf yet: https://building-babylon.net/2016/05/12/skipgram-isnt-matrix-factorisation/
http://blog.aylien.com/post/133931414053/analyzing-tweets-from-web-summit-2015-semantic
http://aylien.com/web-summit-2015-tweets-part1

sentifi:

https://github.com/bdhingra/tweet2vec
tweet2vec https://arxiv.org/abs/1605.03481
syntaxnet: https://github.com/tensorflow/models/tree/master/syntaxnet
hijack compromise user account http://www.icir.org/vern/papers/twitter-compromise.ccs2014.pdf
user classification: name + loc http://www.cs.jhu.edu/~vandurme/papers/broadly-improving-user-classfication-via-communication-based-name-and-location-clustering-on-twitter.pdf
chrispot: http://sentiment.christopherpotts.net/tokenizing.html
https://github.com/cbuntain/TwitterFergusonTeachIn
mining tweet: https://rawgit.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter%201%20-%20Mining%20Twitter.html
NE: https://noisy-text.github.io/pdf/WNUT10.pdf
tokenizer: http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py
twitter tokenizer online: http://sentiment.christopherpotts.net/tokenizing/results/
cs224u understanding nlp: http://nbviewer.jupyter.org/github/cgpotts/cs224u/
https://spacy.io/blog/german-model?utm_source=News&utm_campaign=87a64aae50-German_release_newsletter&utm_medium=email&utm_term=0_89ad33e698-87a64aae50-64293797
jupyter theme: http://sherifsoliman.com/2016/01/11/theming-ipython-jupyter-notebook/
noisy text need to be normalized: https://noisy-text.github.io/norm-shared-task.html
understanding user profile/twitter: https://blog.twitter.com/2015/guest-post-understanding-users-through-twitter-data-and-machine-learning
word2vec with numba: https://d10genes.github.io/blog/2016/05/03/word2vec/
analyzing text data at Firefox: http://web.stanford.edu/~rjweiss/public_html/MozFest2013/
pretrained word2vec https://github.com/3Top/word2vec-api
twitter music word2vec: http://www.netbase.com/blog/understanding-beliebers-word2vec-twitter/
text + images with CNN: https://www.scribd.com/doc/305710656/Convolutional-Neural-Networks-for-Multimedia-Sentiment-Analysis
feature pivot: http://www.hpl.hp.com/techreports/2011/HPL-2011-98.pdf
nlp with cnn: http://www.slideshare.net/devashishshanker/deep-learning-for-natural-language-processing
event detection http://www.hpl.hp.com/techreports/2011/HPL-2011-98.pdf
http://www.zdnet.com/article/big-data-what-to-trust-data-science-or-the-bosss-sixth-sense/
tf is winning: https://medium.com/@mjhirn/tensorflow-wins-89b78b29aafb#.6lebzwbyx
a vc blog: http://avc.com
us president prediction: http://www.aioptify.com/predictinguselection.php
https://thestack.com/world/2015/05/08/three-steps-to-building-a-twitter-driven-trading-bot/
http://file.scirp.org/pdf/SN_2015070917142293.pdf
tweet latent attributes: http://boingboing.net/2014/09/01/twitter-uses-an-algorithm-to-f.html
user gender inference: http://www.aclweb.org/anthology/W14-5408
https://blog.bufferapp.com/the-5-types-of-tweets-to-keep-your-buffer-full-and-your-followers-engaged
classifying user latent attributes: http://www.cs.jhu.edu/~delip/smuc.pdf
http://myownhat.blogspot.com/
http://bugra.github.io/work/notes/2015-01-17/mining-a-vc/
NER with w2v, 400M tweet: http://www.fredericgodin.com/software/

http://davidrosenberg.github.io/ml2016/#home

pydatalondon 2016:

spotify:

lda asyn, auto alpha: http://rare-technologies.com/python-lda-in-gensim-christmas-edition/

mapk: https://github.com/benhamner/Metrics/tree/master/Python/ml_metrics

iclr2016: https://tensortalk.com/?cat=conference-iclr-2016

l.m.thang

https://github.com/jxieeducation/DIY-Data-Science

http://drivendata.github.io/cookiecutter-data-science/

http://ofey.me/papers/sparse_ijcai16.pdf

a few useful things to know about ML:

tdb: https://github.com/ericjang/tdb

dask for task parallel, delayed: http://dask.pydata.org/en/latest/examples-tutorials.html

skflow:

pip install git+git://github.com/tensorflow/skflow.git
http://www.kdnuggets.com/2016/02/scikit-flow-easy-deep-learning-tensorflow-scikit-learn.html

http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/

https://medium.com/a-year-of-artificial-intelligence/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a#.ecj0iv4n8

https://github.com/andrewt3000/DL4NLP/blob/master/README.md

tf:

tf chatbot: https://github.com/nicolas-ivanov/tf_seq2seq_chatbot

deep inversion : https://github.com/TaddyLab/gensim/blob/deepir/docs/notebooks/deepir.ipynb
encoder decoder with attention: http://arxiv.org/pdf/1512.01712v1.pdf
keras tut: http://web.cs.hacettepe.edu.tr/~aykut/classes/spring2016/bil722/tutorials/keras.pdf

Bayesian Opt: https://github.com/fmfn/BayesianOptimization/blob/master/examples/visualization.ipynb

click-o-tron rnn: http://clickotron.com
auto generated headline clickbait: https://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-neural-networks/

http://blog.computationalcomplexity.org/2016/04/the-master-algorithm.html
http://jyotiska.github.io/blog/posts/python_libraries.html

LSTM: http://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

CS224d:

SOTA of SA, Mikolov and me :)

Thang M. L: http://web.stanford.edu/class/cs224n/handouts/cs224n-lecture16-nmt.pdf

CS224d reports:

classify online forum answer/non-answer: https://cs224d.stanford.edu/reports/AbajianAaron.pdf
gender classification: https://cs224d.stanford.edu/reports/BartleAric.pdf
job prediction: https://cs224d.stanford.edu/reports/BoucherEric.pdf
text sum: https://cs224d.stanford.edu/reports/ChaiElaina.pdf
email spam: https://cs224d.stanford.edu/reports/EugeneLouis.pdf
jp2en: https://cs224d.stanford.edu/reports/GreensteinEric.pdf
improve PV: https://cs224d.stanford.edu/reports/HongSeokho.pdf
twitter sa: https://cs224d.stanford.edu/reports/YuanYe.pdf
yelp sa: https://cs224d.stanford.edu/reports/YuApril.pdf
author detector: https://cs224d.stanford.edu/reports/YaoLeon.pdf
IMDB to Yelp: https://cs224d.stanford.edu/reports/XingMargaret.pdf
Reddit: https://cs224d.stanford.edu/reports/TingJason.pdf
Quora: https://cs224d.stanford.edu/reports/JindalPranav.pdf

QA in keras:

Chinese LSTM + word2vec:

DL with SA: https://cs224d.stanford.edu/reports/HongJames.pdf

MAB:

mab book: http://pdf.th7.cn/down/files/1312/bandit_algorithms_for_website_optimization.pdf
yhat: http://blog.yhat.com/posts/the-beer-bandit.html
test significance with AB, conversion rate opt with MAB: https://vwo.com/blog/multi-armed-bandit-algorithm/
when to use multiarmed bandits: http://conversionxl.com/bandit-tests/
multibandit: http://stevehanov.ca/blog/index.php?id=132

cnn nudity detection: http://blog.clarifai.com/what-convolutional-neural-networks-see-at-when-they-see-nudity/#.VxbdB0xcSko

sigopt: https://github.com/sigopt/sigopt_sklearn

first contact with TF: http://www.jorditorres.org/first-contact-with-tensorflow/

eval of ML using A/B or multibandit: http://blog.dato.com/how-to-evaluate-machine-learning-models-the-pitfalls-of-ab-testing

how to make mistakes in Python: www.oreilly.com/programming/free/files/how-to-make-mistakes-in-python.pdf

keras tut: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/keras_tutorial.pdf

Ogrisel word embedding: https://speakerd.s3.amazonaws.com/presentations/31f18ad0522c0132b9b662e7bb117668/Word_Embeddings.pdf

Tensorflow whitepaper: http://download.tensorflow.org/paper/whitepaper2015.pdf

Arimo distributed tensorflow: https://arimo.com/machine-learning/deep-learning/2016/arimo-distributed-tensorflow-on-spark/

Best ever word2vec in code: http://nbviewer.jupyter.org/github/fbkarsdorp/doc2vec/blob/master/doc2vec.ipynb

TF japanese: http://www.slideshare.net/yutakashino/tensorflow-white-paper

TF tut101: https://github.com/aymericdamien/TensorFlow-Examples

Jeff Dean: http://learningsys.org/slides/NIPS-Learning-Systems-Workshop-TensorFlow-Jeff-Dean.pdf
DL: http://www.thoughtly.co/blog/deep-learning-lesson-1/
Distributed TF: https://www.tensorflow.org/versions/r0.8/how_tos/distributed/index.html

playground: http://playground.tensorflow.org/

Hoang Duong blog: http://hduongtrong.github.io/
Word2vec short explanation: http://hduongtrong.github.io/2015/11/20/word2vec/

ForestSpy: https://github.com/jvns/forestspy/blob/master/inspecting%20random%20forest%20models.ipynb

keras for mnist: https://github.com/wxs/keras-mnist-tutorial/blob/master/MNIST%20in%20Keras.ipynb
lasagne installation https://martin-thoma.com/lasagne-for-python-newbies/

Netflix:

Lessons learned

WMD:

word mover distance: https://github.com/mkusner/wmd
gensim wmd: https://speakerdeck.com/tmylk/same-content-different-words

Hanoi trip:

tensorflow scan: learn the cum sum https://nbviewer.jupyter.org/github/rdipietro/tensorflow-notebooks/blob/master/tensorflow_scan_examples/tensorflow_scan_examples.ipynb
https://jayantj.github.io/posts/project-gutenberg-word2vec
stacking: http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
Learn and think like human: http://arxiv.org/pdf/1604.00289v1.pdf
predictive modeling + AI: https://speakerd.s3.amazonaws.com/presentations/30ad41b99258471f9485118f904f8cfb/predictive_modeling_and_deep_learning.pdf
sklearn vs tf: https://github.com/rasbt/python-machine-learning-book/blob/master/faq/tensorflow-vs-scikitlearn.md
advances in DL for NLP: http://cs.nyu.edu/~zaremba/docs/Advances%20in%20deep%20learning%20for%20NLP.pdf
Xavier, 10 lessons learned: https://medium.com/@xamat/10-more-lessons-learned-from-building-real-life-ml-systems-part-i-b309cafc7b5e#.klowhfq10
pizza analysis: http://yoavz.com/potd/
R at airbnb: https://medium.com/airbnb-engineering/using-r-packages-and-education-to-scale-data-science-at-airbnb-906faa58e12d#.deo3t37vr
450 hours in data science: http://studiy.co/path/data-science/
LR + SGD + FM: https://gist.github.com/kalaidin/9ea737ad771fcf073e57
libFM: http://www.ics.uci.edu/~smyth/courses/cs277/papers/factorization_machines_with_libFM.pdf
intro FM: http://www.slideshare.net/0x001/intro-to-factorization-machines
fastFM: https://github.com/ibayer/fastFM
winning data science competition: https://speakerdeck.com/datasciencela/jeong-yoon-lee-winning-data-science-competitions-data-science-meetup-oct-2015
python for data analyst: https://www.kevinsheppard.com/images/0/09/Python_introduction.pdf
risk modeling: https://risk-engineering.org/static/PDF/slides-stat-modelling.pdf
pyfm: https://github.com/coreylynch/pyFM
mlss2014: http://www.mlss2014.com/materials.html
xavier: https://www.slideshare.net/slideshow/embed_code/key/gt6HuUzZ4Z7flf
Pedro: http://www.thetalkingmachines.com/blog/
Machine Intelligence 2.0: https://cdn-images-1.medium.com/max/2000/1*A9exqeQ69XjjSJgMyDEo6Q.jpeg
Quora - all about data scientists: https://www.quora.com/What-are-the-best-blogs-for-data-scientists-to-read
World of thought vectors: http://www.pamitc.org/cvpr15/files/lecun-20150610-cvpr-keynote.pdf
newbie nlp lab: https://github.com/piskvorky/topic_modeling_tutorial/
why and when log-log is used: http://www.forbes.com/sites/naomirobbins/2012/01/19/when-should-i-use-logarithmic-scales-in-my-charts-and-graphs/#41c6dc0c3cd8
lzma: https://parezcoydigo.wordpress.com/2011/10/09/clustering-with-compression-for-the-historian/
Tom Vincent: http://insightdatascience.com/blog/tom_vincent_qanda.html
Normalized Compression Distance: http://tamediadigital.ch/2016/03/20/normalized-compression-distance-a-simple-and-useful-method-for-text-clustering-2/
Yoav Goldberg: https://www.youtube.com/watch?v=xw5HL5h1wxY
Sklearn production on Dato: https://www.youtube.com/watch?v=AwjeRg1u5VI

VinhKhuc:

how many k for CV: k = N e.g. LOOCV http://vinhkhuc.github.io/2015/03/01/how-many-folds-for-cross-validation.html
backprop http://vinhkhuc.github.io/2015/03/29/backpropagation.html
qa bAbI task: https://github.com/vinhkhuc/MemN2N-babi-python
lstm/rnn: http://vinhkhuc.github.io/2015/11/19/rnn-lstm.html

RS:

Data science bootcamp: https://cambridgecoding.com/datascience-bootcamp#outline

CambridgeCoding NLP:

Annoy:

RPForest: https://github.com/lyst/rpforest
LightFM: https://github.com/lyst/lightfm
Secure because of math: https://www.youtube.com/watch?v=TYVCVzEJhhQ
Talking machines: http://www.thetalkingmachines.com/
Dive into DS: https://github.com/rasbt/dive-into-machine-learning

DS process: https://www.oreilly.com/ideas/building-a-high-throughput-data-science-machine
Friendship paradox: https://vuhavan.wordpress.com/2016/03/25/ban-ban-ban-nhieu-hon-ban-ban/

AB test:

EMNLP 2015:

semantic sim of embedding: https://www.cs.cmu.edu/~ark/EMNLP-2015/tutorials/34/34_OptionalAttachment.pdf
social text analysis: https://www.cs.cmu.edu/~ark/EMNLP-2015/tutorials/3/3_OptionalAttachment.pdf
personality research in NLP: https://www.cs.cmu.edu/~ark/EMNLP-2015/tutorials/2/2_OptionalAttachment.pdf

To read:

Idols:

Alex Pinto: MLSec
Peadar Coyle: https://peadarcoyle.wordpress.com/, https://github.com/springcoil/pydataamsterdamkeynote, http://slides.com/springcoil/dataproducts-11#/27, https://medium.com/@peadarcoyle/three-things-i-wish-i-knew-earlier-about-machine-learning-54cb0d23ca29#.uc6e049rl
Radim: gensim
Delip Rao: http://deliprao.com/archives/129
Alex: http://alexanderdyakonov.narod.ru/engcontests.htm
Yoav: https://www.cs.bgu.ac.il/~yoavg/uni/
Andrej: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Sebastian: http://www.kdnuggets.com/2016/02/conversation-data-scientist-sebastian-raschka-podcast.html
Joel Grus: http://joelgrus.com/
Bugra: http://bugra.github.io/

IPython/Jupyter:

https://docs.google.com/presentation/d/1PHnnkKYgjq1lcSDaVyhZP0Fs7qC70iA07b2Jv0uisUE/mobilepresent?slide=id.g10d199ad72_0_20

LSTM:

RNN for music: http://erikbern.com/2014/06/28/recurrent-neural-networks-for-collaborative-filtering/
skflow: https://github.com/tensorflow/skflow/tree/master/examples
dropout: http://arxiv.org/abs/1409.2329
seq2seq: http://arxiv.org/abs/1409.3215
simple char rnn: https://gist.github.com/karpathy/d4dee566867f8291f086
https://www.tensorflow.org/versions/r0.7/tutorials/recurrent/index.html#the-model
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

RNN:

Unicode:

EVENTS:

April 8-10 2016: PyData Madrid
April 15-17 2016: PyData Florence
May 6-8 2016: PyData London hosted by Bloomberg
May 20-21 2016: PyData Berlin
September 14-16 2016: PyData Carolinas hosted by IBM
October 7-9 2016: PyData DC hosted by Capital One
November 28-30 2016: PyData Cologne

Other Conference Dates Coming Soon!

PyData Chicago
PyData NYC
PyData Paris
PyData Silicon Valley
pydata amsterdam: http://pydata.org/amsterdam2016/schedule/ https://speakerdeck.com/maciejkula/hybrid-recommender-systems-at-pydata-amsterdam-2016
gcp 23-24 March
pycon sg: June 23-25
emnlp: june, austin, us
pydata

QUOTES:

My name is Sherlock Holmes. It is my business to know what other people don't know.
Take the first step in faith. You don't have to see the whole staircase, just take the first step. [M. L. King Jr.]
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." [Arthur Conan Doyle]

STATS:

http://vietsciences.free.fr/vietnam/bienkhao-binhluan/tuoithovuachuavn.htm

BOOKS:

CLUSTER:

EMBEDDING:

https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
http://www.offconvex.org/2016/02/14/word-embeddings-2/
improving sem embedding words rep: https://levyomer.wordpress.com/2015/03/30/improving-distributional-similarity-with-lessons-learned-from-word-embeddings/
whiskey: http://wrec.herokuapp.com/methodology
lda: topic eva: http://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
lda2vec: http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec-57135994
http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda.ipynb
text2vec: http://dsnotes.com/articles/glove-enwiki
Swivel Submatrix Wise Vector Embedding Learner http://arxiv.org/pdf/1602.02215v1.pdf
https://sense2vec.spacy.io/?natural_language_processing%7CNOUN

Linux:

http://randyzwitch.com/gnu-parallel-medium-data/

BENCHMARK:

DIY:

Products:

Full stack:

Must seen:

Must read:

Curated:

Cool blogs:

Visualizations:

cohort analysis: https://blog.clevertap.com/how-to-use-cohort-analysis-to-improve-retention/
bokeh 101: http://felipegalvao.com.br/blog/2016/03/15/data-visualization-python-now-with-bokeh/
4 story telling strategies: http://annkemery.com/four-storytelling-strategies/
http://cs.stanford.edu/people/karpathy/svmjs/demo/
https://www.oreilly.com/ideas/jupyter-at-oreilly

Writing:

Teaching:
