Big Data

This repository contains code examples and experiments for performing word count operations on large datasets using PySpark. The goal is to demonstrate how to handle big data efficiently by leveraging Apache Spark's distributed computing capabilities. Each code example is run three times to measure execution time and CPU usage.
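As a rough illustration of how such measurements could be taken, here is a minimal timing sketch (not the exact notebook code) that assumes the `psutil` package is available for CPU statistics:

```python
import time
import psutil

def measure(job, runs=3):
    """Run `job` several times, reporting wall time and CPU usage.
    Assumes psutil is installed; `job` is any zero-argument callable."""
    for i in range(1, runs + 1):
        psutil.cpu_percent(interval=None)        # reset the CPU counter
        start = time.perf_counter()
        job()
        elapsed = time.perf_counter() - start
        cpu = psutil.cpu_percent(interval=None)  # usage since the reset
        print(f"run {i}: {elapsed:.2f}s, CPU {cpu:.1f}%")
```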

I mainly use this repository to upload and save results so I can review and compare them at any time; most of the code examples are the same.

Introduction

Word count is a common task in big data analysis, often used to demonstrate the capabilities of data processing frameworks like Apache Spark. This repository provides sample code to read large text datasets from CSV files, process them using PySpark, and perform a word count using a MapReduce-like approach.
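The following is a minimal sketch of that approach; the file name `data.csv` is an illustrative assumption, and the actual notebooks in this repository may differ:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the CSV as plain text lines and split them into words (map phase).
lines = spark.sparkContext.textFile("data.csv")
words = lines.flatMap(lambda line: line.lower().split())

# Pair each word with 1, then sum the counts per word (reduce phase).
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Show the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda x: -x[1]):
    print(word, count)

spark.stop()
```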

Requirements

  • Python 3.6+
  • Google Colab or a local Spark installation
  • PySpark (see the setup sketch below)
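
A possible Google Colab setup, sketched under the assumption that installing PySpark with pip is sufficient (no separate Spark download needed):

```python
# In a Colab notebook, install PySpark first (notebook shell syntax):
# !pip install pyspark

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("BigDataSetupCheck")
         .master("local[*]")      # use all available local cores
         .getOrCreate())
print(spark.version)              # confirm the installation works
```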