/art-dataset-nlp-experiments

Based on datasets from MOMA, Kadist, Brooklyn Museum, a project to explore art through NLP and Machine Learning



Art Project - Experiment with Art Datasets using NLP

Based on MOMA, Kadist & Tate datasets, this project attempts to explore art through NLP and Machine Learning.

The first step is to enrich MoMA's dataset by adding additional text for each work. This is a rich source of interpretive text that describes the work beyond the metadata (size, date, etc.). Unfortunately, not many of the works have additional text, but for those that do, I scrape the site and add the data to the CSV.
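As a rough illustration of the "add the data to the CSV" step, here is a minimal sketch using only the standard library. The column names `ObjectID`, `Title` and `ExtraText`, the sample rows, and the `scraped` dictionary are assumptions made for the example; they are not taken from the notebooks in this repo.

```python
import csv
import io


def enrich_rows(rows, scraped_text, id_field="ObjectID", text_field="ExtraText"):
    """Return copies of the rows with scraped text attached.

    Works without scraped text get an empty string, so every row
    ends up with the same set of columns.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row[text_field] = scraped_text.get(row[id_field], "")
        enriched.append(new_row)
    return enriched


# Tiny in-memory stand-in for the collection CSV (made-up rows).
sample_csv = "ObjectID,Title\n1,Untitled\n2,Composition\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Hypothetical scrape result: object id -> interpretive text.
scraped = {"2": "A study in colour and rhythm."}

enriched = enrich_rows(rows, scraped)

# Write the enriched rows back out as CSV (to a string here, to a file in practice).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["ObjectID", "Title", "ExtraText"])
writer.writeheader()
writer.writerows(enriched)
print(out.getvalue())
```

Keeping an empty `ExtraText` for works without scraped text means later steps can filter on that column instead of handling missing keys.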

This textual data will be used for:

  • key phrase extraction
  • word clouds built from the scraped MoMA text
  • word clouds built from the Kadist collection
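A word cloud is driven by word frequencies, so the counting step can be sketched independently of any plotting library. The stopword list, the tokenizer regex, and the sample texts below are assumptions for illustration only; the notebooks may well use a different tokenizer or a fuller stopword list.

```python
import re
from collections import Counter

# Deliberately small stopword list, just for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "with"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lower-cased word tokens across texts, skipping stopwords and very short tokens."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords and len(t) > 2)
    return counts


# Made-up descriptions standing in for the scraped text.
texts = [
    "The painting explores colour and light.",
    "Light and shadow shape the sculpture.",
]

freqs = word_frequencies(texts)
print(freqs.most_common(3))
```

A frequency mapping like `freqs` is the kind of input that word cloud libraries typically accept, for example `WordCloud.generate_from_frequencies` in the `wordcloud` package, if that is the library in use.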

Plan

  • scrape and enrich the metadata ✓
  • ingest into Elasticsearch ✓
  • write a Kibana app to explore the data ✓
  • investigate gensim LDA
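For the Elasticsearch ingestion step, one common route is the bulk API, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that payload with the standard library. The index name `artworks`, the `ObjectID` field, and the sample records are assumptions, and older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json


def to_bulk_ndjson(records, index_name, id_field):
    """Build a bulk-API payload: one action line plus one source line per record."""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name, "_id": record[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk format requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up enriched records.
records = [
    {"ObjectID": "1", "Title": "Untitled", "ExtraText": ""},
    {"ObjectID": "2", "Title": "Composition", "ExtraText": "A study in colour and rhythm."},
]

payload = to_bulk_ndjson(records, "artworks", "ObjectID")
print(payload)
```

The resulting string would be sent to the cluster's `_bulk` endpoint with an NDJSON content type; a client library can do the same job, this version just makes the wire format visible.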

Other Datasets

Required Attribution:

MoMA requests that you actively acknowledge and give attribution to MoMA wherever possible. If you use the dataset for a publication, please cite it using the digital object identifier DOI. Attribution supports efforts to release other data. It also reduces the amount of “orphaned data,” helping retain links to authoritative sources.

Notes

For gensim installation, the following are prerequisites:

The en_1000_no_stem model can only be opened using this format:

from gensim.models import Word2Vec
model = Word2Vec.load("en_1000_no_stem/en.model")
# On recent gensim versions the word vectors live under model.wv
model.wv.similarity('woman', 'man')

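The other models can be opened with the usual:

# Recent gensim versions expose this loader on KeyedVectors rather than Word2Vec
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

(or drop binary=True for models stored in the text format)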
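The `similarity` call above returns the cosine similarity between two word vectors. A minimal sketch of that calculation in plain Python, with made-up three-dimensional vectors standing in for real embeddings (the values are assumptions, not taken from any of the models):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy vectors only; real models use hundreds of dimensions.
vectors = {
    "woman": [0.9, 0.1, 0.3],
    "man": [0.8, 0.2, 0.3],
    "canvas": [0.0, 1.0, 0.1],
}

print(cosine_similarity(vectors["woman"], vectors["man"]))
print(cosine_similarity(vectors["woman"], vectors["canvas"]))
```

Values close to 1 mean the two vectors point in nearly the same direction, which is how the models express that two words appear in similar contexts.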

LSI

LSI provides a way to expand terms with related words such as synonyms and hypernyms/hyponyms. To do: experiment with LSI, and with pyLDAvis for visualization.