Based on the MoMA, Kadist and Tate datasets, this project explores art through NLP and machine learning.
The first step is to enrich the MoMA dataset by adding additional text for each work. This is a rich source of interpretive text that describes the work beyond the metadata (size, date, etc.). Unfortunately, not many of the works have additional text, but for those that do I scrape the site and add the data to the CSV.
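The enrichment step can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the column names (`ObjectID`, `Title`, `ExtraText`), the helper functions and the `scraped` dictionary are all assumptions made for the example, and the scraping itself is left out.

```python
import csv
import io


def enrich_rows(rows, scraped, id_field="ObjectID", text_field="ExtraText"):
    """Return copies of rows with scraped text attached under text_field.

    Works without any scraped text get an empty string.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row[text_field] = scraped.get(row[id_field], "")
        enriched.append(new_row)
    return enriched


def write_csv(rows, fieldnames):
    """Serialize rows to CSV text (an open file handle would work the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the collection CSV and scraped pages.
sample_csv = "ObjectID,Title\n1,Untitled\n2,Composition\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
scraped = {"2": "Interpretive text scraped from the work's page."}

enriched = enrich_rows(rows, scraped)
output = write_csv(enriched, ["ObjectID", "Title", "ExtraText"])
```

Keeping an empty string for works without extra text means every row has the same columns, which tends to make later ingestion simpler.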
This textual data will be used for:
- key phrase extraction
- building word clouds from the extra text scraped from MoMA
- building a word cloud from the Kadist Collection
- scrape and enrich the metadata ✓
- ingest into Elasticsearch ✓
- write a Kibana app to explore the data ✓
- investigate gensim LDA
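Word clouds and simple key phrase lists are both driven by term counts, so a first pass over the scraped text might look like the sketch below. The stop-word list, the sample texts and the `term_frequencies` helper are illustrative assumptions, not the project's actual code.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def term_frequencies(texts, stop_words=STOP_WORDS):
    """Count lowercase word occurrences across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Hypothetical snippets standing in for scraped interpretive text.
texts = [
    "The painting explores colour and the limits of abstraction.",
    "Abstraction in colour: a study of the painting as object.",
]
frequencies = term_frequencies(texts)
top_terms = frequencies.most_common(3)
```

A word cloud library can then size each term by its count, and the same counts give a crude baseline to compare key phrase extraction against.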
MoMA requests that you actively acknowledge and give attribution to MoMA wherever possible. If you use the dataset for a publication, please cite it using the digital object identifier. Attribution supports efforts to release other data. It also reduces the amount of “orphaned data,” helping retain links to authoritative sources.
For gensim installation, the following are prerequisites:
- $ sudo apt-get install liblapack-dev
- $ sudo apt-get install gfortran
-
Some interesting background on art and machine learning
-
Re word2vec models:
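The en_1000_no_stem model can only be opened using this format:
from gensim.models import Word2Vec
# Load a model saved in gensim's native format
model = Word2Vec.load("en_1000_no_stem/en.model")
# In gensim 1.0 and later, word vectors live on model.wv; very old releases used model.similarity(...)
model.wv.similarity('woman', 'man')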
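The other models can be opened with the usual word2vec-format loader (in gensim 1.0 and later it lives on KeyedVectors; older releases exposed it as Word2Vec.load_word2vec_format):
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(model_path, binary=True)
(or drop binary=True for non-binary models)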
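For context, the similarity score returned for a word pair is the cosine similarity between the two word vectors. The sketch below shows that computation on made-up three-dimensional vectors. The `toy_vectors` values and the `cosine_similarity` helper are illustrative only and are not taken from the models above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors; real word2vec vectors have hundreds of dimensions.
toy_vectors = {
    "woman": [0.9, 0.1, 0.3],
    "man": [0.8, 0.2, 0.3],
    "canvas": [0.1, 0.9, 0.0],
}
sim_close = cosine_similarity(toy_vectors["woman"], toy_vectors["man"])
sim_far = cosine_similarity(toy_vectors["woman"], toy_vectors["canvas"])
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score near 0.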
LSI provides a way to expand terms (synonyms, hypernyms and hyponyms). To do: experiment with LSI, and with pyLDAvis for visualization.