Machine Learning Trend Analysis using Big Data Techniques

Short Description:

This project uses big data programming tools and techniques, including Bash scripting, Hadoop MapReduce, Spark, and database management systems (MySQL/MongoDB), to capture, manage, and analyze data. The primary data source is the revision history of the Wikipedia page on Machine Learning. The goal is to trace how the machine learning field has grown and evolved in recent years.

The project unfolds in several phases:

  1. Data Collection: Bash scripts fetch historical revisions of the "Machine Learning" Wikipedia page, giving a view of the field's evolution over time (see the first sketch after this list).
  2. Data Management: After cleaning the data, the optimal database model (relational or document-based) is selected, the schema is designed, and the database is populated; I chose a relational (SQL) model based on the structure of the data (see the schema sketch after this list).
  3. Data Processing with Hadoop: MapReduce jobs compute word co-occurrence and n-gram statistics over the Wikipedia content (a Hadoop Streaming sketch follows this list).
  4. Data Processing with Spark: This phase runs in parallel with the Hadoop phase but uses PySpark for the same textual analyses (see the final sketch below).
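
To illustrate the collection step, here is a minimal Python sketch of pulling revision history through the standard MediaWiki revisions API (the project's actual scripts are in Bash; the revision limit and output filename scheme below are illustrative assumptions):

```python
# Sketch: fetch recent revisions of the "Machine learning" Wikipedia page
# via the MediaWiki API. The project used Bash scripts; this Python version
# is illustrative only. Requires the `requests` package.
import requests

API = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Machine learning",
    "rvprop": "ids|timestamp|content",
    "rvslots": "main",
    "rvlimit": 10,          # number of historical revisions to pull (assumed)
    "format": "json",
    "formatversion": 2,
}

resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()
page = resp.json()["query"]["pages"][0]

for rev in page["revisions"]:
    # Save each revision's wikitext under an illustrative filename scheme.
    fname = f"ml_{rev['timestamp'].replace(':', '-')}.txt"
    with open(fname, "w", encoding="utf-8") as f:
        f.write(rev["slots"]["main"]["content"])
    print("saved", fname)
```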
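
For the data management phase, the sketch below shows one plausible relational schema for storing cleaned revisions. Table and column names are assumptions, and Python's built-in sqlite3 stands in for MySQL so the example is self-contained:

```python
# Sketch of a possible relational schema for page revisions. The project
# uses MySQL; sqlite3 (Python stdlib) stands in here so the example runs
# anywhere. Table and column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("wiki_revisions.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS revisions (
    rev_id     INTEGER PRIMARY KEY,   -- MediaWiki revision id
    page_title TEXT    NOT NULL,      -- e.g. 'Machine learning'
    rev_time   TEXT    NOT NULL,      -- ISO-8601 timestamp
    content    TEXT    NOT NULL       -- cleaned wikitext for this revision
);
CREATE INDEX IF NOT EXISTS idx_revisions_time ON revisions(rev_time);
""")

# Illustrative insert: one cleaned revision row.
conn.execute(
    "INSERT OR IGNORE INTO revisions VALUES (?, ?, ?, ?)",
    (123456789, "Machine learning", "2023-01-01T00:00:00Z",
     "Machine learning is ..."),
)
conn.commit()
conn.close()
```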
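
For the Hadoop phase, a bigram (2-gram) count is one concrete instance of the n-gram analysis; the sketch below expresses it as a Hadoop Streaming job in Python. The file name and the choice of bigrams are assumptions, and the project's actual jobs may be written as native MapReduce:

```python
# Sketch of a bigram-counting job for Hadoop Streaming (mapper and reducer
# in one file for brevity; in practice they could be separate scripts).
# Bigrams are one plausible instance of the n-gram work described above.
import sys

def mapper():
    # Emit "word1 word2<TAB>1" for each adjacent word pair in a line.
    for line in sys.stdin:
        words = line.lower().split()
        for w1, w2 in zip(words, words[1:]):
            print(f"{w1} {w2}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so consecutive equal
    # keys can be summed with a running counter.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Illustrative invocation via the streaming jar:
    # hadoop jar hadoop-streaming.jar -input ... -output ... \
    #   -mapper "python ngrams.py map" -reducer "python ngrams.py reduce"
    mapper() if sys.argv[1] == "map" else reducer()
```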
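
For the Spark phase, the same bigram count translates naturally to PySpark. The input path and application name below are assumptions:

```python
# Sketch of the analogous bigram count in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-wiki-ngrams").getOrCreate()
sc = spark.sparkContext

# Assumed location of the collected revision text files.
lines = sc.textFile("hdfs:///data/ml_revisions/*.txt")

bigram_counts = (
    lines.map(lambda line: line.lower().split())
         .flatMap(lambda ws: [((w1, w2), 1) for w1, w2 in zip(ws, ws[1:])])
         .reduceByKey(lambda a, b: a + b)
)

# Show the 20 most frequent word pairs.
for (w1, w2), n in bigram_counts.takeOrdered(20, key=lambda kv: -kv[1]):
    print(w1, w2, n)

spark.stop()
```

Expressing the job as flatMap/reduceByKey keeps it directly comparable to the MapReduce version above.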

This project demonstrates effective use of big data programming tools, deliberate selection of data models, and the ability to extract insights from large text collections.