Monish6864/Big-Data-Analytics-and-Visualization-Using-PySpark
According to industry studies, more than 2.5 quintillion bytes of data are produced every day, and the figure continues to grow; by 2020, every person on the planet was expected to create about 1.7 MB of data each second. Google is the largest player in the sector, holding 87.35 percent of the global search engine market share in 2021, which equates to roughly 1.2 trillion searches per year and over 40,000 queries every second. This activity generates the huge volumes of data known as big data: massive data sets, produced from a wide range of sources, that cannot be collected, stored, or analysed with traditional procedures because of their size and complexity. As a result, tools such as NoSQL, Hadoop, and Spark are used to investigate them. With big data analysis technologies, many types of data can be collected from increasingly diverse sources, such as digital media, web services, business applications, and machine log data.

Artificial intelligence, and machine learning in particular, has become increasingly important in today's fast-paced world of information technology, and a wide range of techniques is applied to process products and services quickly while maintaining high standards. Machine learning methods, however, are difficult to deploy because of their high cost and their heavy demands on storage, CPU, and memory, so sophisticated systems have emerged to handle data at this scale. Spark is a well-known distributed computing framework for data analysis that offers strong performance on huge data sets. While Spark supports several programming languages, Python was chosen for this study because of its strengths in data analysis and data visualisation and its fast development cycle. PySpark, the Python API for Spark, is therefore used to manage Spark data in this study.

Jupyter Notebook is a free, web-based tool for writing and sharing documents that combine live code, equations, visualisations, and text. It is maintained by Project Jupyter and grew out of the IPython Notebook project. The name Jupyter refers to the three primary programming languages it supports: Julia, Python, and R. Jupyter ships with the IPython kernel for writing Python programs, but more than 100 additional kernels are currently available. The open-source ecosystem includes Jupyter Notebook, JupyterLab, and JupyterHub, along with a variety of plug-ins and modules that support collaborative development.

The data used in this study comes from the Google Play Store and was obtained from Kaggle; it is used here to analyse and visualise a large volume of app data. Although the raw data comprises 24 attributes, only a subset was selected as the most appropriate for this study. Once the data set was chosen, PySpark was installed and started, the data set was loaded, and duplicate records and columns with incomplete data were removed before the analysis was carried out. The discussion section explains the resulting graphics and the associated algorithms.
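As a rough sketch of this workflow (not the study's actual code), the PySpark snippet below shows how the loading and cleaning steps might look. It assumes the Kaggle export is saved locally as googleplaystore.csv and uses illustrative column names, which may differ from the attributes actually selected in the study.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (in a Jupyter environment this typically
# follows a simple `pip install pyspark`).
spark = SparkSession.builder.appName("PlayStoreAnalysis").getOrCreate()

# Load the Google Play Store CSV, inferring column types from the header.
# "googleplaystore.csv" is an assumed local file name for the Kaggle data.
df = spark.read.csv("googleplaystore.csv", header=True, inferSchema=True)

# Keep only the attributes relevant to the analysis (names are illustrative).
df = df.select("App", "Category", "Rating", "Reviews", "Installs", "Type", "Price")

# Remove duplicate rows and rows with missing values.
df = df.dropDuplicates().dropna()

# Quick sanity checks before moving on to analysis and visualisation.
df.printSchema()
print("Rows after cleaning:", df.count())

# Example analysis step: average rating per category. The aggregated result
# is small, so converting it to pandas for plotting in the notebook is safe.
avg_ratings = df.groupBy("Category").avg("Rating").toPandas()
avg_ratings.plot(kind="bar", x="Category", y="avg(Rating)", figsize=(12, 4))
```

Keeping the heavy operations (deduplication, filtering, aggregation) inside Spark and only converting small, aggregated results to pandas for plotting keeps the notebook responsive even on large data sets.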