Overview

As a data scientist, I aim to keep improving my skills in data analysis and visualization, and I believe that proficiency in programming languages such as Java supports that goal. I took on this project to learn Java and put it to practical use right away. The program reads a data file, performs statistical calculations on the data, and writes a summary of the results to an output file in a readable format.

Specifically, the program asks the user for an input file path, a delimiter, and an output file path. It expects a CSV file as input and writes a Markdown file as output. After reading the input file, it separates the columns that hold numeric values from the columns that hold string values. For each numeric column, it calculates the minimum, maximum, mean, and sum and writes the results to the output file. It also sorts each string column alphabetically and writes the first ten values to the output file.
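The flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class and method names (`CsvSummarySketch`, `isNumeric`, `summarize`) are hypothetical, and it assumes the first line is a header row, that a column counts as numeric only when every non-empty cell parses as a double, that fields contain no quoted delimiters, and that string values are sorted case-insensitively. The Markdown layout and two-decimal formatting are also assumptions.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; not taken from the project's source.
class CsvSummarySketch {

    // Assumption: a column is numeric only if it has at least one value
    // and every non-empty cell parses as a double.
    static boolean isNumeric(List<String> cells) {
        boolean sawValue = false;
        for (String cell : cells) {
            if (cell.isEmpty()) continue;
            try {
                Double.parseDouble(cell);
                sawValue = true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return sawValue;
    }

    // Builds a Markdown summary from the raw lines of a delimited file.
    // Assumption: the first line is a header; quoted fields are not handled.
    static String summarize(List<String> lines, String delimiter) {
        String splitter = Pattern.quote(delimiter); // treat the delimiter literally
        String[] header = lines.get(0).split(splitter, -1);
        StringBuilder md = new StringBuilder("# Summary\n");

        for (int col = 0; col < header.length; col++) {
            // Collect the cells of this column, skipping short rows.
            List<String> cells = new ArrayList<>();
            for (String line : lines.subList(1, lines.size())) {
                String[] parts = line.split(splitter, -1);
                if (col < parts.length) cells.add(parts[col].trim());
            }

            md.append("\n## ").append(header[col].trim()).append("\n\n");
            if (isNumeric(cells)) {
                // Numeric column: minimum, maximum, mean, and sum.
                DoubleSummaryStatistics stats = cells.stream()
                        .filter(c -> !c.isEmpty())
                        .mapToDouble(Double::parseDouble)
                        .summaryStatistics();
                md.append(String.format(Locale.ROOT,
                        "- min: %.2f\n- max: %.2f\n- mean: %.2f\n- sum: %.2f\n",
                        stats.getMin(), stats.getMax(),
                        stats.getAverage(), stats.getSum()));
            } else {
                // String column: sort alphabetically and keep the first ten values.
                List<String> firstTen = cells.stream()
                        .filter(c -> !c.isEmpty())
                        .sorted(String.CASE_INSENSITIVE_ORDER)
                        .limit(10)
                        .collect(Collectors.toList());
                for (String value : firstTen) {
                    md.append("- ").append(value).append("\n");
                }
            }
        }
        return md.toString();
    }

    public static void main(String[] args) {
        // Inline sample data stands in for the user-supplied input file.
        List<String> sample = List.of("name,score", "Cleo,3.5", "Abel,1.5", "Bram,4.0");
        System.out.print(summarize(sample, ","));
    }
}
```

To work with real files, the lines could be loaded with `Files.readAllLines(Path.of(inputPath))` and the returned string saved with `Files.writeString(Path.of(outputPath), markdown)`, both available in Java 11. Keeping `summarize` free of file I/O makes it easy to try out on small in-memory samples.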

The purpose of this software is to give data scientists a tool for quickly analyzing large CSV files and generating summary reports with basic statistical calculations. Building it gave me a deeper understanding of the Java language and object-oriented programming principles, and furthered my skills in data analysis and visualization.

Software Demo Video

Development Environment

  • VS Code 1.76.0
  • Java 19
  • openjdk 11.0.16.1 2022-08-12 LTS

Useful Websites

Future Work

  • Error handling
  • Support for more data types and file types
  • More statistical calculations
  • More output formats
  • Better optimization for large data sets
  • Support for multiple delimiters
  • Count and list the unique values in each string column