The goal of this exercise is to learn the basics of reading and writing files in Python.
- Getting set up
- Learning objective
- Learning the basics
- The Exercise
- Extra challenges (not required)
- Acknowledgments
- License
At this point, you should have (1) an account on Github and (2) been introduced to the very basics of Git.
-
Login to your Github account.
-
Fork this repository, by clicking the 'Fork' button on the upper right of the page.
After a few seconds, you should be looking at your copy of the repo in your own Github account.
-
Click the 'Clone or download' button, and copy the URL of the repo via the 'copy to clipboard' button.
-
In your terminal, navigate to where you want to keep this repo (you can always move it later, so just your home directory is fine). Then type:
$ git clone the-url-you-just-copied
and hit enter to clone the repository. Make sure you are cloning your fork of this repo.
-
Next,
cd
into the directory:$ cd the-name-of-directory-you-just-cloned
-
At this point, you should be in your own local copy of the repository.
-
As you work on the exercise below, be sure to frequently
add
andcommit
your work andpush
changes to the remote copy of the repo hosted on GitHub. Don't enter these commands now; this is just to jog your memory:$ # Do some work $ git add file-you-worked-on.py $ git commit $ git push origin master
To become fluent in reading and writing files in Python.
In Python, we use the open
function to open a file.
The basic syntax is:
file_object = open(path_to_file, access_mode)
The access_mode
is usually 'r'
for reading from the file or 'w'
for
writing to the file.
There are other modes, but read and write are the most common and the only
modes we will worry about here.
Let's work in the Python interpreter to learn some basics about how to work with files in Python:
$ python3
Let's try opening the dummy.txt
file in this repo:
>>> file_object = open('dummy.txt', 'r')
>>> type(file_object)
Just like any object in Python, we can use help
to find out more about it and
see its methods:
>>> help(file_object)
We can use the read
method to get a string of the entire file:
>>> s = file_object.read()
>>> print(s)
For larger files, getting a string of the entire file is not very efficient. Below, we will learn a more efficient approach.
When we are done with the file, whether reading from it or writing to it, make sure you close the file:
>>> file_object.close()
Below, we will see how we can use with
statements so that files get closed
automatically.
Now, let's create a new file and write a line in it:
>>> new_file = open('my-new-file.txt', 'w')
>>> new_file.write('Hello, this is my new file!')
>>> new_file.close()
Quit out of the Python interpreter:
>>> quit()
In your directory you should now see a new file called my-new-file.txt
:
$ ls #In Windows: dir
dummy.txt
my-new-file.txt
origin.txt
README.md
$ cat my-new-file.txt #In Windows: type my-new-file.txt
Hello, this is my new file!
Now, let's write a short Python script to parse the dummy.txt
file in this
repo.
Create a new file called dummy.py
and open it with your preferred text editor.
Add the following code to your file:
print('Opening dummy.txt')
with open('dummy.txt', 'r') as in_stream:
print('Opening output.txt')
with open('output.txt', 'w') as out_stream:
for line in in_stream:
line = line.strip()
word_list = line.split()
word_list.sort()
for word in word_list:
out_stream.write('{0}\n'.format(word))
print("Done!")
print('dummy.txt is closed?', in_stream.closed)
print('output.txt is closed?', out_stream.closed)
Run this script from the command line:
$ python3 dummy.py
Take a look at the output.txt
file that was created by the script:
$ cat output.txt
Notice two things about this script:
- We looped over the file object for
dummy.txt
line by line using a for loop (for line in in_stream
). This is more efficient than reading the entire file as a string. - We used
with open(...) as variable_name
statements to ensure the files get closed. It is VERY easy to forget to close a file, so usewith
statements!
In this repo is a text version of On the Origin of Species by Charles Darwin. Check it out:
$ less origin.txt #In Windows, using Powershell: type file.txt | Out-Host -Paging OR type file.txt | oh -p
Your goal is to write a script that will read origin.txt
and find all
occurrences of certain words and write occurrences to a new file.
Below, I outline which words to find and the format to use when writing them to
the new file.
Find all words related to heritability; your script should find heritable, inherit, inheritance, and other forms of these words. Use a regular expression pattern to find these.
When you find such a word, write the line number it was found on and the word
itself separated by a tab ('\t'
) character.
For example, here is how the first 10 lines of the file should look:
471 Inheritance
660 inherited
799 inheritance
850 Inheritance
914 inheritance
991 INHERITANCE
993 inherited
1000 inherited
1046 inherited
1047 inheritable
If two words are found on the same line, both should be written to the file on separate lines (they will have the same line number).
Remember what you learned from the best practices exercise:
- Put meaningful units of code into functions.
- Make your script importable as a module; i.e., use the
if __name__ == '__main__':
flow control. - Write informative docstrings to document your script and functions.
Also, try to keep your code general/flexible. For example, write your functions so that they can be reused for other purposes; can you write them so they can be used with other files and other regular expression patterns?
If you look at the contents of origin.txt
, you will notice that the Gutenberg
Project has added text to the beginning and end of the file.
Can you you update your script to avoid searching this added text?
Try updating your script so that it can search for any regular expression pattern that is provided via the command line.
Try updating your script so that a path to any text file can be specified via
the command line.
Your script should then search this file rather than origin.txt
.
Thanks to Project Gutenberg for providing a text version of On the Origin of
Species. The origin.txt
file was copied from
http://www.gutenberg.org/cache/epub/2009/pg2009.txt.
This work was made possible by funding provided to Jamie Oaks from the National Science Foundation (DEB 1656004).
The text version of On the Origin of Species in origin.txt
was copied from
Project Gutenberg under the Project Gutenberg License, which states:
This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org
The full version of the Project Gutenberg License can be found at the end of
the origin.txt
file.
All content in this repository other than the origin.txt
is licensed under a Creative Commons Attribution 4.0 International License.