In this section, you started by learning about how Python code interacts with files, and how to use the `open` built-in function to read data from files on disk into Python objects in memory. Then you learned about the CSV and JSON formats for serializing data, and how to use the built-in `csv` and `json` Python modules to parse and extract data from files using those formats.
While there are many popular third-party libraries used for data analysis with Python, you can create simple, powerful analyses with just the built-in language data types and modules. When working with data in base Python:
- Data is stored in familiar data structures such as lists and dictionaries, which are often nested
- Analysis is performed using familiar control structures such as `for` loops and `if` statements
- Files are opened using the `open` built-in function, and CSV or JSON files are parsed using the built-in `csv` or `json` modules
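As a rough sketch of what this kind of base-Python analysis can look like, the snippet below stores a few records as a list of dictionaries and summarizes them with `for` loops and an `if` statement. The records and the field names (`name`, `city`, `score`) are made up purely for illustration.

```python
# Hypothetical records: a list of dictionaries (made-up data for illustration)
records = [
    {"name": "Ada", "city": "London", "score": 91},
    {"name": "Grace", "city": "New York", "score": 85},
    {"name": "Alan", "city": "London", "score": 78},
]

# Count the records and total the scores for each city
totals = {}
for record in records:
    city = record["city"]
    if city not in totals:
        totals[city] = {"count": 0, "total_score": 0}
    totals[city]["count"] += 1
    totals[city]["total_score"] += record["score"]

# Compute the average score for each city
averages = {}
for city, stats in totals.items():
    averages[city] = stats["total_score"] / stats["count"]

print(averages)
```

The result is a nested dictionary of counts and totals, plus a flat dictionary of averages, built with nothing beyond the built-in data types.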
As you were initially learning to work with Python, you often used variables that were declared in the code of a Jupyter Notebook, manipulated using that code, and then ceased to exist once the notebook was shut down. These are variables "in memory". In many cases data scientists will want to work with larger data files that persist after the code has finished running. This is data "on disk". Python has built-in functionality to facilitate communication between variables in memory and files on disk.
Key takeaways:
- Variables in memory are the current active variables in your code, which go away if you shut down the kernel or shut down your computer
- Files on disk are the things you see printed out when you type `ls` in the terminal, which persist when you shut down the kernel or shut down your computer
- In the basic usage covered here, there is no concept like "editing" when working with files in Python. You are either reading from the first line to the last line of the file (sometimes all at once), or writing lines one after another
- In Python, the typical way to interact with these files is to use the `open` built-in function
  - The first step is identifying the path to the file, which is a string and typically represents the relative path from the current code execution context to the location of the file on the computer. It is represented as `path_to_file` in the examples below
  - To open a file for reading, that looks like:

        with open(path_to_file) as f:
            # Read data from file object f
            data = f.read()

  - To open a file for writing, that looks like:

        with open(path_to_file, "w") as f:
            # Write data to the file object f
            f.write(data)

  - The `with...:` syntax means that the file is automatically closed after the `with` block ends. If you use the syntax `f = open(path_to_file)` instead, you also need to call `f.close()` when you are finished working with the file
  - You will find that some third-party libraries handle opening and closing the files for you. In this case, you only need to identify the file path and do not need to call `open` and `.close` yourself
  - The `open` function assumes by default that you are working with a file encoded as plain text, such as an unstructured `.txt` file, CSV file, or JSON file. If you are working with a file encoded as bytes, such as a JPEG image, you need to specify a mode of `"rb"` to read or `"wb"` to write when you call `open`
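The takeaways above can be sketched end to end as follows. This is a minimal illustration rather than a recommended project layout: the file names `notes.txt` and `data.bin` are hypothetical, and a temporary directory is used only so the example does not touch any real files.

```python
import os
import tempfile

# Hypothetical file locations inside a temporary directory
temp_dir = tempfile.mkdtemp()
path_to_file = os.path.join(temp_dir, "notes.txt")

# Write text to disk; the file is closed automatically when the with block ends
with open(path_to_file, "w") as f:
    f.write("first line\n")
    f.write("second line\n")

# Read the whole file back into a string in memory
with open(path_to_file) as f:
    contents = f.read()

# The same kind of read without a with block: close() must be called explicitly
f = open(path_to_file)
lines = f.readlines()
f.close()

# For bytes rather than text, use "wb" to write and "rb" to read
path_to_bytes = os.path.join(temp_dir, "data.bin")
with open(path_to_bytes, "wb") as f:
    f.write(bytes([0, 255, 16]))
with open(path_to_bytes, "rb") as f:
    raw = f.read()

print(contents)
print(lines)
print(raw)
```

Here `contents` is a single string, `lines` is a list with one string per line, and `raw` is a `bytes` object, which shows how the same file on disk can become different objects in memory depending on how it is read.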
CSV stands for "comma-separated values". It is possibly the most common data serialization format used by data scientists, and we will use it very frequently in this course.
Key takeaways:
- CSV is a file format. The file names typically end with the extension `.csv`
- CSV is a plain text format. This means that it is encoded as text that is readable without specialized software, and you can explore the contents using VS Code, Vim, or any other general-purpose editor
- CSV is a delimited format. This means that there are characters in the text of the file that are intended to separate the pieces of data from one another
  - As you might assume from the name, the most common delimiter is a comma. A line of comma-delimited CSV content representing four pieces of data might look like:

        10,2,1.5,"Firstname Lastname"

  - It is also possible to use some other delimiter. One of the most common is a tab (`\t`), to the extent that there is a name "TSV" (tab-separated values) for that specific kind of file. You could also use a pipe (`|`), semicolon (`;`), or really any other character as long as the markup is consistent and the code knows how to interpret it. Loosely speaking, all of these files are often still referred to as CSVs, even though the delimiter is not a comma
- In Python, we can use the `csv` module to parse data from CSV files (see the official `csv` module documentation)
  - If the data file does not contain headings, the `csv.reader` function works well. It will return a reader iterable that produces a Python list for each row of the file. There is a matching `csv.writer` option for writing data to files rather than reading data from files
  - If the data file does contain headings, the `csv.DictReader` class works well. It will return a reader iterable that produces a Python dictionary for each row of the file, keyed by the headings. There is a matching `csv.DictWriter` option
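To make the `csv` takeaways concrete, here is a small sketch that writes a file with headings, reads it back with both `csv.DictReader` and `csv.reader`, and then repeats the round trip with a tab delimiter. The file names, column names, and rows are invented for illustration, and passing `newline=""` to `open` follows the recommendation in the `csv` module documentation.

```python
import csv
import os
import tempfile

# Hypothetical file locations inside a temporary directory
temp_dir = tempfile.mkdtemp()
path_to_file = os.path.join(temp_dir, "people.csv")
path_to_tsv = os.path.join(temp_dir, "people.tsv")

# Made-up rows; the values are strings because CSV stores everything as text
rows = [
    {"name": "Firstname Lastname", "age": "10", "height": "1.5"},
    {"name": "Other Person", "age": "12", "height": "1.6"},
]

# Write a heading row followed by one line per dictionary
with open(path_to_file, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "height"])
    writer.writeheader()
    writer.writerows(rows)

# DictReader uses the first row as keys and yields one dictionary per data row
with open(path_to_file, newline="") as f:
    dict_rows = list(csv.DictReader(f))

# csv.reader yields one list per row, so the headings are simply the first row
with open(path_to_file, newline="") as f:
    list_rows = list(csv.reader(f))

# The same data with a tab delimiter instead of a comma
with open(path_to_tsv, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(list_rows)
with open(path_to_tsv, newline="") as f:
    tsv_rows = list(csv.reader(f, delimiter="\t"))

print(dict_rows[0]["name"])
print(list_rows[0])
```

Note that every value comes back as a string, so numeric columns such as `age` would need an explicit conversion like `int(row["age"])` before doing arithmetic.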
JSON stands for "JavaScript Object Notation". It is a widely adopted format for storing and transferring data between applications, particularly on the web.
Key takeaways:
- JSON is a file format. The file names typically end with the extension `.json`
- JSON is a plain text format, like CSV
- The markup of JSON is more complex than CSV. While it typically does contain comma delimiters, it can also contain square brackets (`[` and `]`) and curly braces (`{` and `}`)
  - For example, the same line represented as CSV above might look like this in JSON:

        [10, 2, 1.5, "Firstname Lastname"]

  - Unlike CSV (which mainly works for "flat" tabular data) it is also possible to store nested data with JSON. For example, this comes from the `json` docs:

        ["foo", {"bar": ["baz", null, 1.0, 2]}]
- In Python, we can use the `json` module to parse data from JSON files (see the official `json` module documentation)
  - Typically you will want to load the entire file contents at once using `json.load`
  - You can also use `json.dump` to serialize an object in memory and write it to a file on disk
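A short sketch of the `json.dump` and `json.load` round trip, using the nested example from the `json` docs quoted above. The file name `example.json` and the temporary directory are assumptions made only to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical file location inside a temporary directory
temp_dir = tempfile.mkdtemp()
path_to_file = os.path.join(temp_dir, "example.json")

# Nested Python data in memory: a list containing a string and a dictionary
data = ["foo", {"bar": ["baz", None, 1.0, 2]}]

# Serialize the object and write it to a file on disk
with open(path_to_file, "w") as f:
    json.dump(data, f)

# Load the entire file contents back into Python objects
with open(path_to_file) as f:
    loaded = json.load(f)

# Read the raw text to see how the data is represented on disk
with open(path_to_file) as f:
    text = f.read()

print(text)
print(loaded[1]["bar"][0])
```

Python's `None` is written as JSON `null` and converted back to `None` on load, and the nested list and dictionary structure survives the round trip unchanged.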
A major part of data science is loading the data, and you just learned the fundamental building blocks of how to do this with Python! In future sections we will cover additional libraries and data formats, but you already have the knowledge to dig up interesting data sets and use Python to answer questions about the data.