In this lab you will measure just how much faster the built-in sorted
function is than the functions you implemented on the last homework assignment.
You will also learn about how to use git submodules to connect multiple git repos together.
You will need a partner for portions of this lab.
Task 1: initialize the repo
Do not clone this repository. Instead:
-
Create a new empty repo. Ensure that the name of your repo is different than the name of your partner's repo (so that you can fork it later).
-
Copy the contents of this repo to your new repo with the following commands
$ git clone https://github.com/mikeizbicki/lab-sorting $ git remote rm origin $ git remote add origin <URL> $ git push origin master
NOTE: In the past I used
$URL
syntax to denote a "variable" that you need to substitute into a command. Both the$URL
and<URL>
style syntaxes are commonly used.NOTE: At this point you should ensure you understand what the
git remote
commands are doing fully. I will stop giving these commands to you explicitly in the future.
Task 2: setup the submodule
Run the runtimes.py
file.
You should get a ModuleNotFoundError
that looks something like
$ python3 runtimes.py
Traceback (most recent call last):
File "runtimes.py", line 9, in <module>
from sorting.sorting import merge_sorted, quick_sorted
ModuleNotFoundError: No module named 'sorting'
NOTE: In the code block above, I put both the command and the entire output of the command. Before the code block, I put a short summary of my steps using English intermixed with inline-code. When reporting error messages, it is standard to have a short English language summary and a full output of the error message in a code block. (Even if the output is hundreds of lines long, people will still include the full output, since all of those lines provide important information.) If you report your errors in this same way (e.g. on github issues/stackoverflow), you will be more likely to get good help from people.
The problem is that the runtimes.py
file is trying to load the functions that you wrote in the sorting homework,
but it can't find those files.
The most obvious way to get access to those files is to clone your sorting homework into the repository.
BUT DON'T DO THIS!
It is never a good idea to clone one git repository inside another repository directly.
Instead, we will use something called a git submodule which will allow the two git repositories to "talk" to each other when needed.
You can initialize a git submodule by running the command:
$ git submodule add <URL>
where <URL>
is the url to your sorting homework.
If this command works successfully, then the command
$ ls -a
should have a new sorting
directory in its output and a file .gitmodules
.
If you run
$ ls sorting
then you should see all of the contents of your sorting repo. Running the command
$ python3 runtimes.py
should now generate no errors. If you run
$ git status
you should see that the sorting
repo and the .gitmodules
file have has been added into the staging area.
You should commit theses changes and push them to github now.
If you visit the github website for your lab repo,
you should see that the sorting
folder is now visible in the repo,
and it will have a symbol next to it that looks like @ f43eacf
where the hash number after the @
is the hash of the commit of the sorting repo and will be different.
Clicking on the folder should take you to the url for your sorting repo and away from this repo.
There will be no indication on your sorting repo that it is a git submodule in another repo.
Task 3: runtimes of sorting algorithms on random lists
Run the command
$ python3 runtimes.py --max_x=5
This will use the timeit
module on 5 different lists of increasing size to measure the runtimes of the built-in sort
function (which uses timsort) versus the sorting functions that you implemented in your last homework.
The output of this command, however, is currently not nicely formatted.
At the end of the file is a line labeled FIXME 1
.
Follow the instructions so that the runtimes.py
file generates output in markdown table format.
NOTE: It is very common to write python code that generates code in another programming language, especially markdown or html.
After you've completed the FIXME, run the following command
$ python3 runtimes.py --max_x=22
and copy/paste the resulting table into this README file below this line.
len(xs) | timsort | merge_sorted | quicksort |
---|---|---|---|
2**0 | 4.54e-06 | 2.60e-06 | 2.48e-06 |
2**1 | 2.37e-06 | 8.64e-06 | 1.03e-05 |
2**2 | 2.05e-06 | 1.27e-05 | 1.34e-05 |
2**3 | 2.63e-06 | 2.82e-05 | 2.90e-05 |
2**4 | 3.73e-06 | 6.12e-05 | 5.68e-05 |
2**5 | 5.82e-06 | 1.33e-04 | 1.35e-04 |
2**6 | 1.36e-05 | 3.19e-04 | 3.23e-04 |
2**7 | 2.53e-05 | 7.07e-04 | 7.08e-04 |
2**8 | 5.37e-05 | 1.50e-03 | 1.86e-03 |
2**9 | 1.15e-04 | 3.43e-03 | 3.62e-03 |
2**10 | 2.64e-04 | 7.71e-03 | 8.94e-03 |
2**11 | 5.77e-04 | 1.73e-02 | 1.93e-02 |
2**12 | 1.43e-03 | 5.04e-02 | 4.19e-02 |
2**13 | 3.11e-03 | 9.14e-02 | 9.64e-02 |
2**14 | 5.53e-03 | 2.00e-01 | 2.02e-01 |
2**15 | 1.55e-02 | 4.23e-01 | 4.08e-01 |
2**16 | 3.52e-02 | 9.12e-01 | 8.44e-01 |
2**17 | 7.89e-02 | 1.93e+00 | 1.90e+00 |
2**18 | 1.80e-01 | 4.07e+00 | 4.08e+00 |
You should observe that python's built-in sort function is 10-100x faster than yours. All functions have the same wort-case asymptotic complexity (i.e. $\Theta(n \log n)$), but python's built-in sorting function uses lots of optimization tricks to achieve this extra speedup. Native python code is not very good at performing these types of optimizations, and so the built-in function is written in the C programming language. (You can find the source code of the built-in function on github). Functions that must be very fast are generally written in C instead of Python. One of the differences between a computer sciencee major and a data science major is that the CS major focuses on how to write these fast functions, and the DS major focuses on how to use these fast functions to accomplish interesting tasks.
Push your changes to github and verify that the table is being displayed correctly.
Task 4: cloning
Fork your partner's repo and clone your partner's repo to the lambda server. In this forked repo, re-run the command
$ python3 runtimes.py --max_x=22
You should get a ModuleNotFoundError
again.
The problem is that when you clone a repo with submodules, the submodules are by default not cloned as well.
ASIDE: Facebook announced in 2014 that the size of the git repo for facebook.com was 54GB, and it is undoubtably much larger today. This 54GB only includes the code directly responsible for facebook.com, and not any of the libraries that facebook.com depends on. Facebook's releases many public libraries on github at https://github.com/facebook, and most of these are going to be included in the main facebook.com repo as submodules. Some popular examples include the react web framework and pytorch machine learning library. Since these are their own separate git repos (and therefore submodules of facebook.com), developers can work on these libraries without needing access to the main facebook.com repo. Or if they are working at Facebook and do have access, they don't need to waste their time downloading all of the 54GB of code for facebook.com just to work on react or pytorch.
The main repo for facebook.com is not public, and not even hosted with github. Recall that one of the advantages of git is that it is a distributed version control system not tied to any hosting provider. Facebook only uses github for its public repos, and uses an internal git service for storing its proprietary repos like the one for facebook.com.
Now run the command
$ ls
and notice that the sorting
folder does exist in your forked repo.
The problem is that it is empty,
and you can verify that with the command
$ ls sorting
We need to populate or hydrate this folder, which is currently unpopulated or dehydrated. You can do that with the following sequence of commands:
$ git submodule init
$ git submodule update
Now, you should be able to run the original command successfully
$ python3 runtimes.py --max_x=22
Task 5: runtime of sorting an already sorted list
Remain in the forked repo for this task.
We will now measure the runtime of the sort functions on data that has already been sorted.
Do this by adding the --input=sorted
argument to get the following
$ python3 runtimes.py --max_x=23 --input=sorted
Traceback (most recent call last):
File "runtimes.py", line 38, in <module>
xs = FIXME
NameError: name 'FIXME' is not defined
You'll notice that you get a NameError
because the word FIXME
is undefined.
NOTE: It is common practice in python to put the word FIXME as a variable in a code path that you haven't implemented yet. If that code path accidentally gets run, then python will raise the
NameError
to let you know that you are running code that you shouldn't be running. From python's perspective, there's nothing special about the wordFIXME
: it's just a variable name that has not been defined.
Follow the instructions in the comments to provide a proper definition of xs
,
then rerun the command above to generate a markdown table of runtimes.
Copy/paste the table into the README file below this line.
x | timsort | merge_sorted | quicksorted |
---|---|---|---|
2**0 | 3.74e-06 | 2.68e-06 | 2.51e-06 |
2**1 | 2.32e-06 | 8.73e-06 | 9.26e-06 |
2**2 | 2.15e-06 | 1.59e-05 | 1.51e-05 |
2**3 | 2.43e-06 | 3.47e-05 | 3.03e-05 |
2**4 | 2.46e-06 | 7.46e-05 | 6.58e-05 |
2**5 | 3.08e-06 | 1.87e-04 | 1.49e-04 |
2**6 | 5.00e-06 | 3.63e-04 | 3.40e-04 |
2**7 | 6.80e-06 | 7.92e-04 | 7.59e-04 |
2**8 | 1.13e-05 | 1.79e-03 | 1.70e-03 |
2**9 | 2.14e-05 | 3.73e-03 | 3.77e-03 |
2**10 | 3.72e-05 | 8.26e-03 | 8.42e-03 |
2**11 | 7.91e-05 | 2.16e-02 | 1.84e-02 |
2**12 | 1.31e-04 | 1.83e-02 | 2.45e-02 |
2**13 | 2.12e-04 | 4.06e-02 | 7.41e-02 |
2**14 | 6.55e-04 | 1.62e-01 | 1.73e-01 |
2**15 | 1.23e-03 | 3.18e-01 | 3.44e-01 |
2**16 | 2.37e-03 | 6.98e-01 | 8.11e-01 |
2**17 | 5.20e-03 | 1.53e+00 | 1.73e+00 |
2**18 | 1.13e-02 | 3.19e+00 | 3.23e+00 |
You should notice that the built-in sorted
function ran much faster on this input,
but your merge_sorted
and quick_sorted
functions have essentially the same runtimes.
This is because TimSort is designed to not have to resort already sorted data,
and its best-case runtime is therefore
Add/commit your changes and push them to the forked github repo. Issue a pull request to your partner with your changes, and accept your partner's pull request when you receive it.
Submit the link to both your and your partner's github repos to sakai.