See TODO.md for a todo list of background reading.
This README documents some of the actual code / processing / data science that's been completed or planned. Skip ahead to section 2c for some interactive code.
This project will use this dataset provided by Reuters: Reuters News Archive (30 Days). It is briefly described as:
Reuters’ Text Archive provides the full corpus of English articles that have been published. This will include breaking news in the financial and general news space as well as global coverage in politics, sports, entertainment, and technology. This comprehensive corpus of content makes this dataset ideal for any natural language processing (NLP) algorithms or ML applications.
There are 59,542 documents in this corpus.
$ find ./ -type f | wc -l
59542
Each file is an XML document in the IPTC NewsML-G2 format (a minimal parsing sketch follows the TODO list below).
- TODO: review NewsML-G2
- TODO: document this XML structure
- TODO: map XML structure to human-meaningful description
- TODO: ontology? review XML limitations and consider how to expand the semantics of this dataset
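As a starting point before those TODOs are done, here is a minimal sketch of pulling a headline and body text out of one file. The namespace is the standard NewsML-G2 one, but the element paths are assumptions until the XML structure is properly documented:

import xml.etree.ElementTree as ET

# standard NewsML-G2 namespace; the element paths below are assumptions
# pending the "document this XML structure" TODO above
NS = {'nar': 'http://iptc.org/std/nar/2006-10-01/'}

def parse_item(path):
    root = ET.parse(path).getroot()
    headline = root.find('.//nar:headline', NS)
    content = root.find('.//nar:contentSet', NS)
    # itertext() flattens whatever inline markup the body uses into plain text
    title = headline.text.strip() if headline is not None and headline.text else ''
    body = ' '.join(content.itertext()).strip() if content is not None else ''
    return title, body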
The item_xml_docs_to_csv.py script, coupled with the Jupyter notebook, yields some early insights into the data and its dimensions:
import pandas as pd
df = pd.read_csv('../output.csv')
df.head()
df.describe()
# rough estimates about the text body:
# - text body tends to be about 595 words long, but with extreme outliers (46297!)
# - text body tends to be about 2904 chars long, but with extreme outliers (203K!)
# - average word length is 4-5 chars long (bodyLengthChars / bodyLengthWords, for mean and max)
df.info()
# rough observations:
# - genres are missing for the majority of items (could disregard genre, or this
#   could be an interesting problem: predict genre for items where it's missing)
# - subjects are missing for a small number of items, about 6.5%
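# average word length per document, backing the 4-5 char estimate above (and
# the "how to compute average word length" note in the challenges section);
# assumes output.csv has the bodyLengthChars / bodyLengthWords columns named
# as in the comments above
avg_word_len = df['bodyLengthChars'] / df['bodyLengthWords']
avg_word_len.mean(), avg_word_len.max()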
There are 414 subjects in total. The top 50 are overwhelmingly geographic: Americas, North America, Europe, United States, Emerging Market Countries.
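A sketch of how a tally like that can be computed from the CSV; it assumes subjects are stored as a ';'-delimited string in a 'subjects' column (column name and separator are assumptions -- adjust to match output.csv):

subject_counts = (
    df['subjects']
      .dropna()
      .str.split(';')
      .explode()
      .str.strip()
      .value_counts()
)
len(subject_counts)        # distinct subjects (414 in this corpus)
subject_counts.head(50)    # top 50 -- dominated by geographic subjects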
- Proof of concept: follow Neo4j tutorial
- Sample Cypher queries
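For the proof of concept, something like the following runs a sample Cypher query from Python. This is a sketch only: the connection details and the simple (:Item)-[:HAS_SUBJECT]->(:Subject) model are assumptions that may differ from whatever the tutorial builds.

from neo4j import GraphDatabase

# local instance and credentials are placeholders for illustration
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

query = '''
MATCH (i:Item)-[:HAS_SUBJECT]->(s:Subject)
RETURN s.name AS subject, count(i) AS items
ORDER BY items DESC LIMIT 10
'''

with driver.session() as session:
    for record in session.run(query):
        print(record['subject'], record['items'])
driver.close()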
Consider:
- evaluating and improving information retrieval (search)
- text classification
  - [ ] Neo4j: items have phrases, phrases have topics, same topics are same class?
  - [ ] similarity/clustering
- Neo4j cloud hosting
- Expose Cypher query interface
- Expose information retrieval search API
- Chatbot
- finalize code/pipeline -- identify key code snippets
- diagram of data flow: XML docs -> CSV/Neo4j -> NLP pipeline
- define the problem: improve information retrieval with Neo4j? apply Neo4j to improve NLP applications? use Neo4j to improve semantic value of text data? compare Neo4j to non-graph DBs for information retrieval or NLP applications?
Notes about challenges faced (and solved):
- text encoding: emojis in tweets (�� should render as the left-facing fist emoji 🤛) in tag:reuters.com,2019:newsml_CqtHM2P1a. Solutions considered: (1) handle exceptions, (2) substitute &#\d\d\d\d\d; character references with � (U+FFFD), (3) TBD correct solution
- text encoding: apostrophes become ’ instead of ' (see breakingviews CSV file); a two-line reproduction of this mojibake follows this list
- scale of data -- had to modify the script so it doesn't take hours to run
- how to store / process text files (CSV?)
- loading into Neo4j
- how to split words -- split on 1 or more whitespace chars so that tag:reuters.com,2019:newsml_L3N2602HC:991614233.XML doesn't have 76783 words (still has 294233 chars); see the splitting sketch after this list
- weird document (table, lots of whitespace): tag:reuters.com,2019:newsml_L3N2602HC:991614233.XML
- weird document (full transcript of a committee hearing): tag:reuters.com,2019:newsml_CqtYP8GSa:1548902168.XML
- how to compute average word length
- weird document (in Greek): tag:reuters.com,2019:newsml_L5N25X2BB:219326942.XML
- cleaning/categorizing of subjects and genres (see data-dimensions.ipynb notebook)
- populating wikidata URLs (manually) and wikipedia URLs (via the wikidata_to_wikipedia_url.py script) for all subjects and genres; a sketch of the wikidata-to-wikipedia mapping follows below
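A two-line reproduction of the apostrophe problem above: the UTF-8 bytes for the right single quote (U+2019) decoded with the wrong codec produce exactly the three-character garble seen in the CSV.

# '\u2019' is the right single quote; reading its UTF-8 bytes as cp1252
# yields the 'â€™' garble from the breakingviews CSV
'\u2019'.encode('utf-8').decode('cp1252')   # -> 'â€™'
# fix: pin the codec explicitly, e.g. open(path, encoding='utf-8')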
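The word-splitting fix, sketched out: splitting on single spaces counts every extra space in that whitespace-heavy table document as an empty "word", which is how it ballooned to 76783 words.

import re

text = 'PRICE      VOLUME\n1.23       456'    # stand-in for the table document
len(text.split(' '))                  # inflated: runs of spaces produce empty 'words'
len(re.split(r'\s+', text.strip()))   # 4 -- split on 1 or more whitespace chars
# note: text.split() with no argument does the same whitespace-run splitting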
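And a hedged sketch of the wikidata-to-wikipedia mapping; the real wikidata_to_wikipedia_url.py may work differently, but Wikidata's Special:EntityData endpoint exposes the enwiki sitelink needed for this step.

import requests

def wikipedia_url(wikidata_url):
    # extract the entity ID from the URL, e.g. 'Q30'
    qid = wikidata_url.rstrip('/').rsplit('/', 1)[-1]
    data = requests.get(
        f'https://www.wikidata.org/wiki/Special:EntityData/{qid}.json'
    ).json()
    # the enwiki sitelink holds the English Wikipedia article title
    title = data['entities'][qid]['sitelinks']['enwiki']['title']
    return 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')

wikipedia_url('https://www.wikidata.org/wiki/Q30')   # -> .../United_States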