RSE-Sheffield/pando-python

Notes and comments from trial run 2024-02-15

Opened this issue · 19 comments

Hopefully useful capture of some of the points raised.

Introduction to Profiling

  • Disseminate setup instructions before the course; perhaps bundle the various files that users will download into a single zip so they can download them in one go.
  • @willfurnass better distinction between benchmarking and profiling
  • Alt text on timeline profiling.

Function Level Profiling

  • Missing a "to" in "What is a Stack?".
  • Quote some of the terms and highlight them in the glossary.
    • What is the best way to emphasise terms that can be found in the glossary?
  • Perhaps a diagram to help clarify traceback.print_stack() (a small code sketch follows this list).
    • Perhaps more closely link diagram to code in the same call out
  • Good explanation of the basic output.
  • Demonstrate how to use SnakeViz to filter the output to the depth that users may want so they can avoid seeing the internals of libraries that they have called.
  • Farhad: clarify how to interpret the overall output of the follow-along example.
  • How much time does it take for people to run through the exercises? travellingsales.py with 10 cities took about 5 minutes, which needs accounting for in the class; perhaps get them to start it, talk about something else while it runs, and return to the output afterwards. predprey.py ran in about 20 seconds.
    • Add an instructor note once timings are documented. They will vary based on hardware (managed desktops can be assumed slower than students' laptops).
  • Predator prey requires NumPy, which isn't included by default.
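
As mentioned above, a small self-contained sketch of what traceback.print_stack() shows (function names are purely illustrative) might help when drawing that diagram:

```python
import traceback

def level_two():
    # Prints the current call stack, outermost frame first,
    # ending at this call site.
    traceback.print_stack()

def level_one():
    level_two()

level_one()
# The output lists <module> -> level_one -> level_two, mirroring the
# stack diagram the episode could show.
```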

Line Level Profiling

  • People might not be familiar with putting things in functions when they come from using Notebooks, but this is covered nicely in the fizzbuzz example before they undertake the exercise.
  • @fred should perhaps separate <script name/arguments> throughout.
  • @edwin emphasise how you narrow down from cProfile to line_profiler.

Optimisation

Testing

  • Is it worth going into the detail of test directory structure? More generally, how much testing should be covered?

Data Structures and Algorithms

  • Minor typo: "They allows direct and sequential element access, with the convenience to append items." has an extra s on "allows".

List

  • Lots of detail here; make it clear that people might not get everything (an instructor callout perhaps?).
  • Some diagrams to show the concepts would be useful.

Generators

  • No great performance increase here, so it could be an example of Knuth's principle that small gains aren't worth chasing. But highlight that this isn't part of memory profiling. Perhaps ditch this section.

Sets

  • Whilst it's true that these are keys, in that they are unique and hashable, perhaps use the more general term "items".

Searching

  • Explain load factor and collisions

Minimise Python

  • Perhaps remove zip() from built-in operators?

Numpy

  • Callout could demonstrate the use of dtype when having arrays of mixed types (see the sketch after this list).
  • What about having the examples of speed differences between lists and arrays as tasks for the attendees to do? It would break up the talking, and an interactive task to demonstrate the point might improve concentration.
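
As a starting point for the dtype callout suggested above, a minimal sketch of what that demo could look like (values are arbitrary examples):

```python
import numpy as np

# Mixing types makes NumPy pick a common dtype, which may be surprising:
print(np.array([1, 2.5]).dtype)    # float64: the int is upcast to a float
print(np.array([1, "a"]).dtype)    # a unicode string dtype such as <U21

# Forcing dtype=object keeps the original Python objects, but the array
# then stores references rather than packed native values, losing most
# of NumPy's performance benefits.
mixed = np.array([1, "a", 2.5], dtype=object)
print(mixed.dtype)                 # object
```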

Pandas

  • A common convention many are likely to use is import numpy as np.

Keeping Python up-to-date

  • Pedantic typo: "such changes to the JIT and GIL will provide" is missing an "as".
  • Highlight that they need to be careful that there aren't any breaking changes when updating.
  • Highlight the benefits of the number of cores used when doing NumPy calculations; see "Which NumPy Functions Are Multithreaded" on Super Fast Python.

Memory

  • Alt text on the first diagram is broken.

Accessing Disk

  • Is it worth mentioning/describing the differences in performance between CSV and other formats such as Parquet or HDF5?

Latency Overview

  • Clearer label for London > Canada > London

Optimisation Conclusion

  • Keypoints are not rendering correctly.

Useful resources to point people to (from @ns-rse)

  • It could be useful to point people to additional material for the various topics. I (@ns-rse) know of the following for NumPy.

File formats - reasons for covering:

  • perf gains if not having to do lots of type inference/type conversion (plus data validation easier to enforce within raw data)
  • perf gains if working with single Parquet or HDF5 file (containing many tables/ndarrays) vs lots of tiny files and data stored on network/parallel filesystem: lots of overheads from all the metadata lookups associated with the many tiny files, particularly on Lustre filesystems.
  • some binary formats allow for reading just subsets of data - perf and mem benefits.

Could suggest creating HDF5 or Parquet caches of CSV files if repeated reads of the files are needed?

EDIT: pd.read_csv() vs pd.read_hdf() is good for demoing perf differences.
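
A rough sketch of that demo (file names and table size are made up; to_hdf/read_hdf need the optional `tables` dependency):

```python
import timeit

import numpy as np
import pandas as pd

# Build a reasonably large table and write it in both formats.
df = pd.DataFrame(np.random.rand(500_000, 10), columns=list("abcdefghij"))
df.to_csv("cache_demo.csv", index=False)
df.to_hdf("cache_demo.h5", key="table", mode="w")

# CSV reads re-parse text and re-infer dtypes every time; the HDF5 cache
# stores typed binary data, so repeated reads are typically much cheaper.
print("csv :", timeit.timeit(lambda: pd.read_csv("cache_demo.csv"), number=5))
print("hdf5:", timeit.timeit(lambda: pd.read_hdf("cache_demo.h5", "table"), number=5))
```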

  • Free benefits of NumPy/pandas built against a good BLAS/LAPACK/FFTW library, e.g. Intel MKL:
  • many operations might be multi-threaded by default - can experiment by requesting more cores on Stanage (up to 64 per node)
  • BLAS/LAPACK lib might be able to auto-detect and make use of advanced CPU features e.g. AVX512 hardware vectorisation (enabled in Intel Icelake CPUs in Stanage)

See https://github.com/RSE-Sheffield/hi-perf-ipynb/blob/master/tutorials/01-multithreading.ipynb

Generate a diagram of or text info on the CPU core, CPU cache, mem and peripheral device connectivity/affinity within your own machine: lstopo or lstopo-no-graphics (mentioned briefly on https://github.com/RSE-Sheffield/hi-perf-ipynb/blob/master/tutorials/01-multithreading.ipynb)

Native Python array datatype: rarely used anywhere; suggest don't mention.

@willfurnass: point people to Advent of Code and other toy examples.

https://projecteuler.net/ is fab and language agnostic.

Re references/objects in NumPy arrays and Pandas DataFrames being bad: recommend people check whether my_arr.dtype is object and/or whether any of my_df.dtypes are object. This is a particularly valuable/important check after running my_df = pd.read_csv(..) with type inference left as defaults.
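
Something like the following could be the suggested check (the CSV file name is hypothetical):

```python
import pandas as pd

# Hypothetical CSV; with default type inference, messy or mixed columns
# often fall back to the slow 'object' dtype.
my_df = pd.read_csv("data.csv")

# List any columns that ended up as Python object references.
object_cols = [col for col, dtype in my_df.dtypes.items() if dtype == object]
print("object-dtype columns:", object_cols)

# The same check for a NumPy array derived from the frame.
my_arr = my_df.to_numpy()
if my_arr.dtype == object:
    print("array holds object references, not packed native values")
```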

Use of decorators when profiling: add suggestions for how to enable profiling on dubious-quality 3rd party code? Edit files within packages in virtualenv/conda env or something cleaner than that?

'function' vs 'method': use 'function' everywhere for consistency unless explicitly meaning method of an object?

Predator prey requires NumPy, which isn't included by default.

  • And matplotlib

Function profiling: could comment that it is easier to introduce if you have a somewhat modular software architecture (a reminder of the issues of having functions 1000s of lines long)?

Function Level Profiling

  • In Worked Example, change placement of the “follow along” callout to after you've introduced the Python code (so that people are not distracted by downloading code / running commands).
  • "Sunburst" callout is very large for something that you are saying isn't much use. Perhaps callout should be "Alternative visualisations" with smaller version of image?
  • Exercise 1 solution "Here it distance()" doesn't make sense

Profiling Summary

  • Capitalisation is a bit weird e.g. "When to Profile"

Optimisation

  • I don't understand this sentence: "Even a high-level understanding of a typical computer architecture; the most common data-structures and algorithms; and how Python executes your code, enable the identification of suboptimal approaches."

Testing

  • I think this could be cut down to "we recommend the use of a testing framework such as pytest - here's a quick example of how a test can be used to check your function's output against an expected value, so that you can be sure that when you modify your code the outputs are the same. You can find out more about testing... <link to pytest docs/signpost to FAIR4RS-testing>". A minimal sketch of such a test follows this list.

  • "Coming up" list seems a bit weird/pointless here

Data Structures and Algorithms

  • Capitalisation of "Lists", "Tuples" in Questions/Objectives (and elsewhere in episode) is weird.
  • Should this be one sentence?: "If it doesn’t, it will reallocate a larger array, copy across the elements, and deallocate the old array. Before copying the item to the end and incrementing the counter which tracks the list’s length."
  • "Callout" titles (here and subsequent episodes) - replace with something related to the content as in earlier episodes
  • Add "Exercise:" to exercise title

Minimise Python

  • Change "NumPY" to "NumPy"
  • Filter early: move up to before "Using NumPy/Pandas effectively" sections and provide example?

Latency Overview

  • re label for London > Canada > London:

James KW Moore:
Could just say cross-Atlantic round trip.

General comments

  • Where episode titles contain "&" the prev/next links look a bit weird.

I spent some time this weekend profiling and trying to do some optimisation on one of my personal Python code projects. (Bear in mind that I am not a Python specialist and I wasn't working on particularly scientific or complex code.)

The profiling part of the course worked as advertised and helped me identify exactly which bits of my code were slow and would benefit from some effort to improve. Unfortunately, the result was that the major slowdown in my code was caused by poor coding on my part and can only be improved by a better algorithm for tackling the problem, not, as far as I can see, by taking advantage of any Python-specific quirks.

A few bits of the optimisation side of the course were still quite helpful, however. I think by far the most useful thing I learned was about variable scope and function calls causing slowdowns. The easiest and largest speed gains I got came from copying non-local variables into local ones and inlining functions that were only called once. The scope change in particular sped those functions up by around 10%, so it might be worth putting more emphasis on this than just a single callout.
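
For anyone curious, a minimal sketch of the local-copy pattern described above (names and numbers are illustrative, not taken from my project):

```python
import math

TOLERANCE = 1e-9  # module-level (global) constant


def check_all(values):
    # Copy the global, and the attribute lookup math.isclose, into locals
    # once; lookups of local names inside the hot loop are cheaper than
    # repeated global/attribute lookups.
    tol = TOLERANCE
    is_close = math.isclose
    return [is_close(v, round(v), abs_tol=tol) for v in values]
```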

I think it would be beneficial to acknowledge at some point in the course that there might not be any optimisations to be made. I would not like to put a researcher in the position of "these things should be helping me but I can't get them to work, I feel so disheartened".

Thanks Fred, useful comments.

I appreciate all this feedback; I'm not too sure when I will have time to address it though. I've got a bit of a busy month.

I discovered by chance that IPython, an enhanced interactive Python shell, has support for both line and memory profiling via the "line magics" %lprun and %mprun respectively. Not entirely sure how useful it would be, but thought it worth mentioning.
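
A quick sketch of how those magics are used (requires the line_profiler and memory_profiler packages to be installed; the function here is just an example):

```python
# In an IPython / Jupyter session:
%load_ext line_profiler
%load_ext memory_profiler

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Line-by-line timings for slow_sum while running the given statement:
%lprun -f slow_sum slow_sum(100_000)

# Line-by-line memory usage; note %mprun expects the function to be
# defined in a file on disk (e.g. imported from a module).
%mprun -f slow_sum slow_sum(100_000)
```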

Disseminate setup instructions before the course; perhaps bundle the various files that users will download into a single zip so they can download them in one go.

I think @gyengen currently plans for it to run on managed desktops in Hicks. So this may not be that simple.

With a wider view, though, is it possible that some might want to use their own laptops?

I have never used the managed desktops so I don't know if it's possible for people to install software in advance, i.e. whether they work like VMs/remote desktops. If so, it would seem sensible to ask people to download and install software and data in advance, as doing so at the start of a session wastes valuable face-to-face time.

Also, this course has the possibility of feeding upstream into the Carpentries Incubator, where it could be used by others and may see contributions, so making it as general as possible would be useful.

In that regard, having instructions for participants to download and set things up beforehand would be really useful.

With a wider view, though, is it possible that some might want to use their own laptops?

Yes, eventually. Still a lot to resolve before then. I'm acknowledging the feedback (not going to hide it away), it's just not an immediate priority. Afaik the Carpentries format does have a data page, which would serve this purpose. I'm just not a huge fan of having individual downloads that also need to be manually archived if they change, so I would want to look at whether I can fudge the Carpentries CI to do that for me.

Also, this course has the possibility of feeding upstream into the Carpentries Incubator, where it could be used by others and may see contributions, so making it as general as possible would be useful.

There's already Sheffield-specific stuff in here (such as the theme); I expect the Carpentries Incubator version would end up being a fork of this repository.

Cool. The main reason I mentioned it is that with the Git course it can delay the start of the session if people haven't followed the setup instructions.

If/when you get round to creating archives, there seems to be a GitHub Action for everything... Create Archive · Actions · GitHub Marketplace!

Removed the scope callout whilst removing generator functions. Need to work out where it fits.

  • As suggested by Fred, likely worth promoting it beyond a passing callout.
    • Worth renaming the physical file minimise-python to understanding-python?

::::::::::::::::::::::::::::::::::::: callout

The use of `max_val` in the previous example moves the value of `N` from global to local scope.

The Python interpreter checks local scope first when resolving variable names, so accessing local-scope variables is slightly faster than accessing global-scope ones. This is most visible when a variable is accessed frequently, such as within a loop.

Replacing the use of `max_val` with `N` inside `test_generator()` causes the function to consistently perform a little slower than `test_list()`, whereas before the change it would normally be a little faster.

:::::::::::::::::::::::::::::::::::::::::::::
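
For reference, a self-contained sketch of the effect this callout describes (`test_generator()`/`test_list()` belong to the episode code; the functions below are just illustrative stand-ins):

```python
import timeit

N = 10  # module-level (global) variable


def use_global(iterations=1_000_000):
    total = 0
    for _ in range(iterations):
        total += N        # global name lookup on every iteration
    return total


def use_local(iterations=1_000_000):
    max_val = N           # copy the global into a local once
    total = 0
    for _ in range(iterations):
        total += max_val  # local name lookup on every iteration
    return total


print("global lookups:", timeit.timeit(use_global, number=10))
print("local lookups: ", timeit.timeit(use_local, number=10))
```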