See TODO.md for a todo list of background reading.
This README documents some of the actual code / processing / data science that's been completed or planned. Skip ahead to section 2c for some interactive code.
This project will use this dataset provided by Reuters: Reuters News Archive (30 Days). It is briefly described as:
Reuters’ Text Archive provides the full corpus of English articles that have been published. This will include breaking news in the financial and general news space as well as global coverage in politics, sports, entertainment, and technology. This comprehensive corpus of content makes this dataset ideal for any natural language processing (NLP) algorithms or ML applications.
There are 59,542 documents in this corpus.
$ find ./ -type f | wc -l
59542
Each file is an XML document in the IPTC NewsML-G2 format (a minimal parsing sketch follows the TODO list below).
- TODO: review NewsML-G2
- TODO: document this XML structure
- TODO: map XML structure to human-meaningful description
- TODO: ontology? review XML limitations and consider how to expand the semantics of this dataset
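As a starting point before those TODOs are done, here is a minimal sketch of pulling a headline and body text out of one file. The namespace is the standard NewsML-G2 one, but the element paths are assumptions until the XML structure is properly documented:

import xml.etree.ElementTree as ET

# standard NewsML-G2 namespace; the element paths below are assumptions
# pending the "document this XML structure" TODO above
NS = {'nar': 'http://iptc.org/std/nar/2006-10-01/'}

def parse_item(path):
    root = ET.parse(path).getroot()
    headline = root.find('.//nar:headline', NS)
    content = root.find('.//nar:contentSet', NS)
    # itertext() flattens whatever inline markup the body uses into plain text
    title = headline.text.strip() if headline is not None and headline.text else ''
    body = ' '.join(content.itertext()).strip() if content is not None else ''
    return title, body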
The item_xml_docs_to_csv.py script, coupled with the Jupyter notebook, yields some early insights into the data and its dimensions:
import pandas as pd
df = pd.read_csv('../output.csv')
df.head()
df.describe()
# rough estimates about the text body:
# - text body tends to be about 595 words long, but with extreme outliers (46297!)
# - text body tends to be about 2904 chars long, but with extreme outliers (203K!)
# - average word length is 4-5 chars long (bodyLengthChars / bodyLengthWords, for mean and max)
df.info()
# rough observations:
# - genres are missing for the majority of items (could disregard genre, or this
#   could be an interesting problem: predict genre for items where it's missing)
# - subjects are missing for a small number of items, about 6.5%
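# average word length per document, backing the 4-5 char estimate above (and
# the "how to compute average word length" note in the challenges section);
# assumes output.csv has the bodyLengthChars / bodyLengthWords columns named
# as in the comments above
avg_word_len = df['bodyLengthChars'] / df['bodyLengthWords']
avg_word_len.mean(), avg_word_len.max()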
There are 414 subjects in total. The top 50 are overwhelmingly geographic: Americas, North America, Europe, United States, Emerging Market Countries.
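A sketch of how a tally like that can be computed from the CSV; it assumes subjects are stored as a ';'-delimited string in a 'subjects' column (column name and separator are assumptions -- adjust to match output.csv):

subject_counts = (
    df['subjects']
      .dropna()
      .str.split(';')
      .explode()
      .str.strip()
      .value_counts()
)
len(subject_counts)        # distinct subjects (414 in this corpus)
subject_counts.head(50)    # top 50 -- dominated by geographic subjects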
- Proof of concept: follow Neo4j tutorial
- Sample Cypher queries
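For the proof of concept, something like the following runs a sample Cypher query from Python. This is a sketch only: the connection details and the simple (:Item)-[:HAS_SUBJECT]->(:Subject) model are assumptions that may differ from whatever the tutorial builds.

from neo4j import GraphDatabase

# local instance and credentials are placeholders for illustration
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

query = '''
MATCH (i:Item)-[:HAS_SUBJECT]->(s:Subject)
RETURN s.name AS subject, count(i) AS items
ORDER BY items DESC LIMIT 10
'''

with driver.session() as session:
    for record in session.run(query):
        print(record['subject'], record['items'])
driver.close()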
Consider:
- evaluating and improving information retrieval (search)
- text classification
  - [ ] Neo4j: items have phrases, phrases have topics, same topics are same class?
  - [ ] similarity/clustering
- Neo4j cloud hosting
- Expose Cypher query interface
- Expose information retrieval search API
- Chatbot
- finalize code/pipeline -- identify key code snippets
- diagram of data flow: XML docs -> CSV/Neo4j -> NLP pipeline
- define the problem: improve information retrieval with Neo4j? apply Neo4j to improve NLP applications? use Neo4j to improve semantic value of text data? compare Neo4j to non-graph DBs for information retrieval or NLP applications?
Notes about challenges faced (and solved):
- text encoding: emojis in tweets (�� should render as the left-facing fist emoji 🤛) in tag:reuters.com,2019:newsml_CqtHM2P1a. Solutions considered: (1) handle exceptions, (2) substitute &#\d\d\d\d\d; character references with � (U+FFFD), (3) TBD correct solution
- text encoding: apostrophes become ’ instead of ' (see breakingviews CSV file); a two-line reproduction of this mojibake follows this list
- scale of data -- had to modify the script so it doesn't take hours to run
- how to store / process text files (CSV?)
- loading into Neo4j
- how to split words -- split on 1 or more whitespace chars so that tag:reuters.com,2019:newsml_L3N2602HC:991614233.XML doesn't have 76783 words (still has 294233 chars); see the splitting sketch after this list
- weird document (table, lots of whitespace): tag:reuters.com,2019:newsml_L3N2602HC:991614233.XML
- weird document (full transcript of a committee hearing): tag:reuters.com,2019:newsml_CqtYP8GSa:1548902168.XML
- how to compute average word length
- weird document (in Greek): tag:reuters.com,2019:newsml_L5N25X2BB:219326942.XML
- cleaning/categorizing of subjects and genres (see data-dimensions.ipynb notebook)
- populating wikidata URLs (manually) and wikipedia URLs (via the wikidata_to_wikipedia_url.py script) for all subjects and genres; a sketch of the wikidata-to-wikipedia mapping follows below
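A two-line reproduction of the apostrophe problem above: the UTF-8 bytes for the right single quote (U+2019) decoded with the wrong codec produce exactly the three-character garble seen in the CSV.

# '\u2019' is the right single quote; reading its UTF-8 bytes as cp1252
# yields the 'â€™' garble from the breakingviews CSV
'\u2019'.encode('utf-8').decode('cp1252')   # -> 'â€™'
# fix: pin the codec explicitly, e.g. open(path, encoding='utf-8')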
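The word-splitting fix, sketched out: splitting on single spaces counts every extra space in that whitespace-heavy table document as an empty "word", which is how it ballooned to 76783 words.

import re

text = 'PRICE      VOLUME\n1.23       456'    # stand-in for the table document
len(text.split(' '))                  # inflated: runs of spaces produce empty 'words'
len(re.split(r'\s+', text.strip()))   # 4 -- split on 1 or more whitespace chars
# note: text.split() with no argument does the same whitespace-run splitting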
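And a hedged sketch of the wikidata-to-wikipedia mapping; the real wikidata_to_wikipedia_url.py may work differently, but Wikidata's Special:EntityData endpoint exposes the enwiki sitelink needed for this step.

import requests

def wikipedia_url(wikidata_url):
    # extract the entity ID from the URL, e.g. 'Q30'
    qid = wikidata_url.rstrip('/').rsplit('/', 1)[-1]
    data = requests.get(
        f'https://www.wikidata.org/wiki/Special:EntityData/{qid}.json'
    ).json()
    # the enwiki sitelink holds the English Wikipedia article title
    title = data['entities'][qid]['sitelinks']['enwiki']['title']
    return 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')

wikipedia_url('https://www.wikidata.org/wiki/Q30')   # -> .../United_States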