In this lab, we'll be building a CSV parser with Python! This will help us get comfortable with using standard Python libraries, data manipulation, and working with tuples.
- Read and parse
csv
files into an easier to manipulate format. - Build a functional Python CSV parser
- Write data to an easier to read JSON file
Fork
andclone
this repository- We'll be working in the
parser.py
file
We have a raw CSV file in our data
folder. It contains useful information about employees for a company but the data isn't accessible in its current state.
Our backend engineers would love to work with the data, but aren't having any success with it. This is why we're being hired, to convert the data into a useful JSON file!
How are we going to do this?
- First, we'll need to get the data from the CSV file
- Then we'll need to format each employee into a Python dictionary
- Finally, we'll need to convert a list of employee dictionaries into a JSON file
To start out, we'll need to import a few standard Python modules and create a few global variables at the top of our script
Let's add in our imports first at the top of the file
import csv
import json
- Were importing the CSV module to make reading and writing CSV files more accessible. More on the CSV module here.
- We're importing the JSON library to give ourselves access to useful JSON encoding/decoding methods like
json.dump()
andjson.loads()
. More on the JSON module here
Now let's add in a few global variables. They'll all be empty lists to start:
read_data = []
# for CSV data read and stored as tuples
fields = []
# for the name of each column in the CSV file's rows
employee_list = []
# for formatted data, a dictionary for each row in the CSV file
And with that, we've set up our script with the proper imports and global variables! We can move on to the fun part now - manipulating the data in our parsing function!
Let's start out with defining our function. It should take in 2 arguments: one for the path of our CSV file, and on for an output path to create a new JSON file. This way, we'll be able to take in any CSV file and convert it to JSON.
def csv_to_json(csv_path, json_path):
# your code here
...
csv_to_json('data/employee_data.csv','data/employees.json')
We'll write out each part of the function in the steps below. Make sure that each part is within the indented block of csv_to_json
!
The first thing we'll want to do here is extract the data from our CSV file. Once we've read the data, we'll append each entry into our read_data
list.
But first, how do we read a CSV file?
with open(csv_path, mode='r', encoding='utf-8-sig') as csv_file:
csv_data = csv.reader(csv_file, delimiter=',')
- More on Python's
open()
method here. - More on Python
with ... as
statements here. - More on using the CSV module's
reader()
method here.
With our read CSV data stored in csv_data
, we can now add it into our read_data
list.
- After setting the data to
csv_data
you'll need to loop through eachentry in csv_data
andappend
each entry as atuple
to ourread_data
list
We're setting these entries as tuples to ensure that the original data entries aren't changed.
Finally, outside of our with open
block, let's create two new variables for the headers of the CSV rows and the employee entries. These variables will be sub-lists of read_data
.
- One variable should be for the
headers
, the column titles of the CSV data.
HINT
print(data_list[0])
- The next variable should be for the
employees
, the data entries representing each employee indata_list
. Since theheaders
will be at the first index ofread_data
, everything else should be an employee's data entries. everything after the first index of read_data
Try print()
ing your headers
variable below to check that you've set it correcty.
print(headers)
('Emp ID', 'First Name', 'Middle Initial', 'Last Name', 'Gender', 'EMail', 'Date of Birth', 'Age', 'Weight', 'Date of Joining', 'Quarter of Joining', 'Year of Joining', 'Month of Joining', 'Month Name of Joining', 'Short Month', 'Day of Joining', 'Age in Company', 'Salary', 'Last % Hike', 'SSN', 'Phone Num', 'Place Name', 'County', 'City', 'State', 'Zip')
We know that we have access to the data in read_data
now, but what about formatting it before we convert it into JSON?
First, let's format the headers
and store them in our fields
list in a more accessible form.
Since the headers
are currently stored as a tuple, we'll need to use Python's enumerate() method to iterate over them with a for loop.
HINT
for i, some_value in enumerate(some_tuple):
- Set a variable
name
to represent each header inside the loop. It's value should:- Replace the empty spaces in each
header
with underscores and lowercase the whole string - Append the formatted header
name
to ourfields
list
- Replace the empty spaces in each
A few useful Python string methods:
If done properly, your fields
list should now have properly formatted Python strings! Try printing the fields
list below the loop to check.
print(fields)
['emp_id', 'first_name', 'middle_initial', 'last_name', 'gender', 'email', 'date_of_birth', 'age', 'weight', 'date_of_joining', 'quarter_of_joining', 'year_of_joining', 'month_of_joining', 'month_name_of_joining', 'short_month', 'day_of_joining', 'age_in_company', 'salary', 'last_%_hike', 'ssn', 'phone_num', 'place_name', 'county', 'city', 'state', 'zip']
Now that we have a list of the fields needed for each employee, let's add their data! We'll store each employee as a dictionary within employee_list
.
We'll need to loop through the employees
list that we set as a variable above. Inside the loop we'll also need to enumerate
through each employee tuple, just like we did with the headers
above. Inside the main loop of employees
, the following needs to happen:
- An empty dictionary should be created for each
employee
- Each employee's data entries should be accessed by
enumerating
over theemployee
- For each
entry
in the enumeratedemployee
loop, we'll need to create a key value pair for the dictionary- The
key
should be equal to thefield
at the same index as theentry
- The
value
should be equal to theentry
- The
- Update the dictionary with each key value pair for an employee's entries
- After updating the dictionary, append it to our
employee_list
and clear the dictionary by setting it back to an empty dictionary so the next employee can be added
Once finished, try printing employee_list[1]
to check if you're getting the right values. You should be getting a properly formatted dictionary in your console.
print(employee_list[1])
{'emp_id': '178566',
'first_name': 'Juliette',
'middle_initial': 'M',
'last_name': 'Rojo',
'gender': 'F', 'email':
'juliette.rojo@yahoo.co.uk',
'date_of_birth': '5/8/1967',
'age': '50.26',
'weight': '55',
'date_of_joining': '6/4/2011',
'quarter_of_joining': 'Q2',
'year_of_joining': '2011',
'month_of_joining': '6',
'month_name_of_joining': 'June',
'short_month': 'Jun',
'day_of_joining': '4',
'age_in_company': '6.15',
'salary': '193912',
'last_%_hike': '27%',
'ssn': '671-48-9915',
'phone_num': '215-254-9594',
'place_name': 'Glenside',
'county': 'Montgomery',
'city': 'Glenside',
'state': 'PA',
'zip': '19038'}
If you're getting the right return, we're done with formatting! Yay!
The final step in this process is converting our employee_list
into JSON and writing it into a readable JSON file.
Like we did to open the CSV file to start, we'll use the with open as
python statement, but this time we'll be giving it a mode of 'w'
to specify that we're writing a new file.
with open(json_path, mode='w') as output_file:
We've done two things here:
- Specified the path for a new JSON file we're creating from the path input in our function call
- Stored the file that we'll be writing to in a variable called
output_file
We need to do one last thing to create our new employees.json
file from the CSV data:
- Use the JSON package to dump our
employee_list
data into ouroutput_file
. We'll give it an indent of 2 for it to be nicely formatted.
with open(json_path, mode='w') as output_file:
# json.dump(???, ???, indent=2)
If everything was done correctly, you should have a brand new, cleanly formatted employees.json
file in your data
folder! Congrats, we've done it!
- The
csv_to_json
function should create a newemployees.json
file with properly formatted data when it is run
- Read about one important reason that CSV files are used with Python