data_incub

Challenge questions for The Data Incubator

  1. Run Challenge.ipynb for challenge questions 1 and 2.
  2. Run Project.ipynb for challenge question 3, the project proposal.

Images generated by Challenge.ipynb are displayed inline in the notebook, while images generated by Project.ipynb are displayed inline and also saved as .png files in the imgs directory.

Project proposal: arX-Live

As a doctoral student, I grew used to feeling hopeless about keeping up with the latest research in my field. At different times I have had the experience of being scooped on a project, and of discovering after many hours of grueling labor that a newly developed technique would have saved me time. There have also been moments, after finishing a project, when I had a choice of direction for my next piece of research but simply didn't know which topics were in high enough demand to obtain funding. With these relatable problems in mind, I propose arX-Live, the arXiv trend-prediction engine!

Using the arXiv API, I scraped a year's worth of eprint abstracts in my chosen category (high energy physics - theory, though it could be any category). I then cleaned the data by removing MathJax/LaTeX tags, extra whitespace, punctuation, and common stopwords, and by normalizing the case of the remaining text. My next goal was to filter out the non-technical terminology in each abstract. However, technical terms are often phrases built from more mundane words (usually no more than three), so we must be careful here! Using Python's Natural Language Toolkit (NLTK), I analyzed the remaining text of each abstract to find the most likely bigram and trigram collocations and merged each occurrence into a single token. I could then safely filter out any word whose stem appears in NLTK's convenient corpus of common English words. At this stage, each abstract has been reduced to a normalized bag of buzzwords.

Computing the frequency of each buzzword in a rolling six-month window, offset by a week between iterations to keep the volume of data manageable, I obtained a series of buzzword frequencies over time. These can be visualized in a variety of ways, but I chose to start with a word cloud and a heatmap showing the evolution of buzzword frequencies over time.

Much of the work done so far is in data scraping and munging, and it is now fully automated in preparation for implementation in an app. However, there is still a great deal of analysis to do. I propose to train a neural network on (buzzword, date) -> frequency data in order to predict future trends in research. One could also incorporate the number of citations and references, or even the full text of each eprint, and analyze how often groups of buzzwords appear in the same abstract. My end goal is an app (backed by PySpark's parallel processing, since the data-wrangling stage can be expensive) that helps researchers plan the trajectory of their work.
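For concreteness, here is a minimal sketch of the scraping-and-cleaning stages described above. It assumes the feedparser and NLTK packages are installed (with the stopwords and words corpora downloaded); the helper names and the choice of PMI as the collocation measure are illustrative rather than taken from the notebooks.

```python
# Sketch of the arX-Live data pipeline: scrape abstracts via the arXiv API,
# strip LaTeX/stopwords, merge likely collocations, and keep only the
# "buzzwords" that survive a common-English filter.
# One-time setup: nltk.download("stopwords"); nltk.download("words")

import re
from collections import Counter

import feedparser
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords, words
from nltk.stem import PorterStemmer

ARXIV_API = "http://export.arxiv.org/api/query"
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
# Stems of common English words, used below to discard non-technical terms.
ENGLISH_STEMS = {stemmer.stem(w.lower()) for w in words.words()}


def fetch_abstracts(category="hep-th", start=0, max_results=100):
    """Return (published_date, abstract) pairs from the arXiv Atom feed."""
    url = (f"{ARXIV_API}?search_query=cat:{category}"
           f"&start={start}&max_results={max_results}"
           "&sortBy=submittedDate&sortOrder=descending")
    feed = feedparser.parse(url)
    return [(entry.published, entry.summary) for entry in feed.entries]


def clean(abstract):
    """Strip inline math, LaTeX commands, punctuation, and stopwords; lowercase."""
    text = re.sub(r"\$[^$]*\$", " ", abstract)        # inline MathJax/LaTeX
    text = re.sub(r"\\[A-Za-z]+", " ", text)          # remaining LaTeX commands
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()  # punctuation and digits
    return [tok for tok in text.split() if tok not in STOPWORDS]


def merge_collocations(tokens, top_n=20):
    """Join the most likely bigram collocations into single tokens.

    Collocations are found per abstract here for brevity; over a full year of
    abstracts one would fit the collocation finder on the whole corpus.
    """
    finder = BigramCollocationFinder.from_words(tokens)
    best = set(finder.nbest(BigramAssocMeasures().pmi, top_n))
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in best:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def buzzwords(tokens):
    """Keep only tokens whose stem is not a common English word."""
    return [t for t in tokens if stemmer.stem(t) not in ENGLISH_STEMS]


# Example: aggregate buzzword counts for the 50 most recent hep-th abstracts.
counts = Counter()
for published, abstract in fetch_abstracts("hep-th", max_results=50):
    counts.update(buzzwords(merge_collocations(clean(abstract))))
print(counts.most_common(10))
```

Joining collocations with underscores before applying the common-English filter is what lets phrases like "gauge theory" survive as single buzzwords instead of being discarded word by word.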

Thank you, and enjoy!