portfolio_PDA_Data_Science

This repository summarizes the work carried out in the Data Analysis with Python part of the course for a PDA in Data Science with Code Division (with futureCodersSE).

In this course we also worked on data manipulation with Microsoft Excel, a very useful piece of software with the great advantage of being familiar to users at every level in countries across the world. As a very user-friendly tool, it handles basic and not-so-basic tasks in data analysis, manipulation and visualization. However, as the task at hand becomes more complicated (e.g. because of the size of the data or the type of computation required), Excel demands more attention and expertise, help from other tools (such as Power BI) or add-ins (such as the Data Analysis ToolPak), or the use of its own programming language (VBA).

At this point, it's worth exploring alternatives for data manipulation... and nothing better than open-source alternatives... like Python...

In the first place, writing a bit of Python code makes it A LOT easier than in Excel to import and manipulate large datasets or less common types of data. While Excel is great for manageable datasets, Python is very good for both manageable and more difficult datasets, with the same or very similar code.
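As a small illustration of the "same or very similar code" idea, here is a minimal sketch that loads the same records from CSV text and from JSON text. The helper `load_records` and the sample data are hypothetical, made up for this example. It uses only the standard library so it runs anywhere; for real analysis work a library such as pandas would be the more usual choice.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Return a list of dicts from CSV or JSON text (hypothetical helper)."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        return json.loads(text)
    raise ValueError(f"unsupported format: {fmt}")


# Made-up sample data: the same two records in two formats.
csv_text = "city,temp_c\nLeeds,11.5\nYork,9.0\n"
json_text = '[{"city": "Leeds", "temp_c": 11.5}, {"city": "York", "temp_c": 9.0}]'

csv_rows = load_records(csv_text, "csv")
json_rows = load_records(json_text, "json")

# CSV values arrive as strings, so convert before doing arithmetic.
csv_temps = [float(row["temp_c"]) for row in csv_rows]
json_temps = [float(row["temp_c"]) for row in json_rows]
```

Once the data is loaded, the rest of the analysis does not need to care which format it came from.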

Processes can become very time-consuming in Excel. The longer the process, the more time it takes... and the larger the dataset, the slower everything seems to get...
The difficulty and time required to write Python code don't increase with the size of the dataset... at all... Processing time does still grow with the amount of data, but Python generally copes with far larger datasets before that becomes a problem.
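To make the point about code size concrete, here is a minimal sketch in which one function summarizes a three-value list and a 200,000-value list without any change. The function `summarise` and the data are hypothetical, and the sketch uses only the standard library.

```python
import random
import statistics


def summarise(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


small = [3.0, 5.0, 10.0]

# A much larger made-up dataset; the seed keeps the run reproducible.
random.seed(0)
large = [random.random() for _ in range(200_000)]

# Exactly the same call works for both sizes.
small_summary = summarise(small)
large_summary = summarise(large)
```

The larger list takes longer to process, but the code that had to be written is identical.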

Data manipulation processes in Excel can be automated, but eventually some knowledge of VBA programming will be required (which means the advantage of being that user-friendly might be gone at this point... a Python vs VBA comparison can be found here). On the other hand, automation is really easy in Python: it only requires knowing a few concepts that are well worth learning. Why is automation important? On top of saving time on repetitive tasks, it gives you reproducibility. And why is reproducibility important?

  • to be able to explain every detail of your work whenever required.
  • to make it easier to double-check and audit the work, confirm it, and find and fix errors quickly.
  • to be able to show evidence of the reliability of your work.
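As a rough sketch of what this kind of automation can look like, the example below applies one function to several monthly reports in a single loop. The report contents, the `amount` column and the helper `total_sales` are all hypothetical; the reports are kept in memory as text so the example is self-contained, whereas in practice they would usually be files on disk.

```python
import csv
import io


def total_sales(csv_text):
    """Sum the 'amount' column, skipping rows where the value is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row["amount"]) for row in reader if row["amount"].strip())


# Made-up monthly reports; February has one row with a missing amount.
monthly_reports = {
    "january": "item,amount\npens,12.50\npaper,7.50\n",
    "february": "item,amount\npens,10.00\nink,\nstaples,2.25\n",
}

# The same steps run on every report, in the same way, every time.
totals = {month: total_sales(text) for month, text in monthly_reports.items()}
```

Because every step is written down in the code, anyone can rerun it on the same inputs and check that they get the same totals.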

Sometimes an error in Excel is really clear, transparent and localized, but that's not the norm... Error handling in Python is way more transparent by default, and can get even clearer with a bit of customization.
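A minimal sketch of that kind of customization: the built-in `ValueError` already says which value could not be converted, and a small wrapper can add where in the data it was found. The exception class `DataError` and the function `parse_amounts` are hypothetical names invented for this example.

```python
class DataError(ValueError):
    """Hypothetical custom error that records where the bad value was found."""


def parse_amounts(raw_values):
    """Convert a list of strings to floats, reporting the row of any failure."""
    amounts = []
    for position, raw in enumerate(raw_values, start=1):
        try:
            amounts.append(float(raw))
        except ValueError as exc:
            # Re-raise with the row number, keeping the original error attached.
            raise DataError(
                f"row {position}: cannot read {raw!r} as a number"
            ) from exc
    return amounts


try:
    parse_amounts(["4.5", "7", "n/a"])
except DataError as err:
    message = str(err)
```

Here `message` ends up as `row 3: cannot read 'n/a' as a number`, which points straight at the problem instead of leaving it hidden somewhere in a sheet.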

In addition, Excel has statistical capabilities (especially with specific add-ins) and allows the creation of a good variety of graphs (again, especially with complementary tools). However, Python and its specialized libraries go far beyond it in both fields. That's a good time to remember:

All modern versions of Python are released under a GPL-compatible license certified by the Open Source Initiative.

Python also has thousands of third-party modules available in the Python Package Index (PyPI), covering many different fields of expertise.

Python has an enormous user community, which means there is a good chance that capable people are already working on a solution to whatever issue you might have.

Finally, Python scripts can run on virtually any platform. Who doesn't know someone who has had compatibility issues while trying to work with Excel on different operating systems? (At the very least, that means lost time... and maybe a light headache.)

Depending on the size and format of your raw data and what you want to achieve with it, either can be the optimal tool. Hence, the best idea is to know at least the fundamentals of both, and to be able to use whichever is more convenient and efficient for each task or project.