RSE-Sheffield/pando-python

Notes and comments from trial run 2024-02-15

Opened this issue · 19 comments

Hopefully useful capture of some of the points raised.

Introduction to Profiling

  • Disseminate setup instructions before the course; perhaps bundle the various files that users will download into a single zip so they can download them in one go.
  • @willfurnass better distinction between benchmarking and profiling
  • Alt text on timeline profiling.

Function Level Profiling

  • Missing a "to" in "What is a Stack?".
  • Quote some of the terms and highlight them in the glossary.
    • What is the best way to emphasise terms that can be found in the glossary?
  • Perhaps a diagram to help clarify traceback.print_stack() (a small code sketch follows this list).
    • Perhaps more closely link diagram to code in the same call out
  • Good explanation of the basic output.
  • Demonstrate how to use SnakeViz to filter the output to the depth that users may want so they can avoid seeing the internals of libraries that they have called.
  • Farhad: clarify how to interpret the overall output of the follow-along example.
  • How much time does it take for people to run through the exercises? travellingsales.py with 10 cities took about 5 minutes, which needs accounting for in the class; perhaps get them to start it, talk about something else while it runs, and return to the output afterwards. predprey.py ran in about 20 seconds.
    • Add an instructor note once timings are documented. They will vary based on hardware (managed desktops can be assumed slower than students' laptops).
  • Predator prey requires NumPy, which isn't included by default.
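
As mentioned above, a small self-contained sketch of what traceback.print_stack() shows (function names are purely illustrative) might help when drawing that diagram:

```python
import traceback

def level_two():
    # Prints the current call stack, outermost frame first,
    # ending at this call site.
    traceback.print_stack()

def level_one():
    level_two()

level_one()
# The output lists <module> -> level_one -> level_two, mirroring the
# stack diagram the episode could show.
```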

Line Level Profiling

  • People might not be familiar with putting things in functions when they come from using Notebooks, but this is covered nicely in the fizzbuzz example before they undertake the exercise.
  • @fred should perhaps separate <script name/arguments> throughout.
  • @edwin emphasise how you narrow down from cProfile to line_profiler.

Optimisation

Testing

  • Is it worth going into the detail of test directory structure? More generally, how much testing should be covered?

Data Structures and Algorithms

  • Minor typo: "They allows direct and sequential element access, with the convenience to append items." has an extra s on "allows".

List

  • Lots of detail here; make it clear that people might not get everything (an instructor callout perhaps?).
  • Some diagrams to show the concepts would be useful.

Generators

  • No great performance increase here, so it could be an example of Knuth's principle that small gains aren't worth chasing. But highlight that this isn't part of memory profiling. Perhaps ditch this section.

Sets

  • Whilst it's true that these are keys, in that they are unique and hashable, perhaps use the more general term "items".

Searching

  • Explain load factor and collisions

Minimise Python

  • Perhaps remove zip() from built-in operators?

Numpy

  • Callout could demonstrate the use of dtype when having arrays of mixed types (see the sketch after this list).
  • What about having the examples of speed differences between lists and arrays as tasks for the attendees to do? It would break up the talking, and an interactive task to demonstrate the point might improve concentration.
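
As a starting point for the dtype callout suggested above, a minimal sketch of what that demo could look like (values are arbitrary examples):

```python
import numpy as np

# Mixing types makes NumPy pick a common dtype, which may be surprising:
print(np.array([1, 2.5]).dtype)    # float64: the int is upcast to a float
print(np.array([1, "a"]).dtype)    # a unicode string dtype such as <U21

# Forcing dtype=object keeps the original Python objects, but the array
# then stores references rather than packed native values, losing most
# of NumPy's performance benefits.
mixed = np.array([1, "a", 2.5], dtype=object)
print(mixed.dtype)                 # object
```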

Pandas

  • A common convention many are likely to use is import numpy as np.

Keeping Python up-to-date

  • Pedantic typo: "such changes to the JIT and GIL will provide" is missing an "as".
  • Highlight that they need to be careful that there aren't any breaking changes when updating.
  • Highlight the benefits of the number of cores used when doing NumPy calculations; see "Which NumPy Functions Are Multithreaded" on Super Fast Python.

Memory

  • Alt text on the first diagram is broken.

Accessing Disk

  • Is it worth mentioning/describing the differences in performance between CSV and other formats such as Parquet or HDF5?

Latency Overview

  • Clearer label for London > Canada > London

Optimisation Conclusion

  • Keypoints are not rendering correctly.

Useful resources to point people to (from @ns-rse)

  • It could be useful to point people to additional material for the various topics. I (@ns-rse) know of the following for NumPy.

File formats - reasons for covering:

  • perf gains if not having to do lots of type inference/type conversion (plus data validation easier to enforce within raw data)
  • perf gains if working with single Parquet or HDF5 file (containing many tables/ndarrays) vs lots of tiny files and data stored on network/parallel filesystem: lots of overheads from all the metadata lookups associated with the many tiny files, particularly on Lustre filesystems.
  • some binary formats allow for reading just subsets of data - perf and mem benefits.

Could suggest creating HDF5 or Parquet caches of CSV files if repeated reads of the files are needed?

EDIT: pd.read_csv() vs pd.read_hdf() is good for demoing perf differences.
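
A rough sketch of that demo (file names and table size are made up; to_hdf/read_hdf need the optional `tables` dependency):

```python
import timeit

import numpy as np
import pandas as pd

# Build a reasonably large table and write it in both formats.
df = pd.DataFrame(np.random.rand(500_000, 10), columns=list("abcdefghij"))
df.to_csv("cache_demo.csv", index=False)
df.to_hdf("cache_demo.h5", key="table", mode="w")

# CSV reads re-parse text and re-infer dtypes every time; the HDF5 cache
# stores typed binary data, so repeated reads are typically much cheaper.
print("csv :", timeit.timeit(lambda: pd.read_csv("cache_demo.csv"), number=5))
print("hdf5:", timeit.timeit(lambda: pd.read_hdf("cache_demo.h5", "table"), number=5))
```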

  • Free benefits of NumPy/pandas built against a good BLAS/LAPACK/FFTW library, e.g. Intel MKL:
  • many operations might be multi-threaded by default - can experiment by requesting more cores on Stanage (up to 64 per node)
  • BLAS/LAPACK lib might be able to auto-detect and make use of advanced CPU features e.g. AVX512 hardware vectorisation (enabled in Intel Icelake CPUs in Stanage)

See https://github.com/RSE-Sheffield/hi-perf-ipynb/blob/master/tutorials/01-multithreading.ipynb

Generate a diagram of or text info on the CPU core, CPU cache, mem and peripheral device connectivity/affinity within your own machine: lstopo or lstopo-no-graphics (mentioned briefly on https://github.com/RSE-Sheffield/hi-perf-ipynb/blob/master/tutorials/01-multithreading.ipynb)

Native Python array datatype: rarely used anywhere; suggest don't mention.

@willfurnass: point people to Advent of Code and other toy examples.

https://projecteuler.net/ is fab and language agnostic.

Re references/objects in NumPy arrays and Pandas DataFrames being bad: recommend people check whether my_arr.dtype is object and/or whether any of my_df.dtypes are object. This is a particularly valuable/important check after running my_df = pd.read_csv(..) with type inference left as defaults.
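
Something like the following could be the suggested check (the CSV file name is hypothetical):

```python
import pandas as pd

# Hypothetical CSV; with default type inference, messy or mixed columns
# often fall back to the slow 'object' dtype.
my_df = pd.read_csv("data.csv")

# List any columns that ended up as Python object references.
object_cols = [col for col, dtype in my_df.dtypes.items() if dtype == object]
print("object-dtype columns:", object_cols)

# The same check for a NumPy array derived from the frame.
my_arr = my_df.to_numpy()
if my_arr.dtype == object:
    print("array holds object references, not packed native values")
```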

Use of decorators when profiling: add suggestions for how to enable profiling on dubious-quality 3rd party code? Edit files within packages in virtualenv/conda env or something cleaner than that?

'function' vs 'method': use 'function' everywhere for consistency unless explicitly meaning method of an object?

Predator prey requires NumPy, which isn't included by default.

  • And matplotlib

Function profiling: could comment that it is easier to introduce if you have a somewhat modular software architecture (a reminder of the issues of having functions 1000s of lines long)?

Function Level Profiling

  • In Worked Example, change placement of the “follow along” callout to after you've introduced the Python code (so that people are not distracted by downloading code / running commands).
  • "Sunburst" callout is very large for something that you are saying isn't much use. Perhaps callout should be "Alternative visualisations" with smaller version of image?
  • Exercise 1 solution "Here it distance()" doesn't make sense

Profiling Summary

  • Capitalisation is a bit weird e.g. "When to Profile"

Optimisation

  • I don't understand this sentence: "Even a high-level understanding of a typical computer architecture; the most common data-structures and algorithms; and how Python executes your code, enable the identification of suboptimal approaches."

Testing

  • I think this could be cut down to "we recommend the use of a testing framework such as pytest - here's a quick example of how a test can be used to check your function's output against an expected value, so that you can be sure that when you modify your code the outputs are the same. You can find out more about testing... <link to pytest docs/signpost to FAIR4RS-testing>". A minimal sketch of such a test follows this list.

  • "Coming up" list seems a bit weird/pointless here

Data Structures and Algorithms

  • Capitalisation of "Lists", "Tuples" in Questions/Objectives (and elsewhere in episode) is weird.
  • Should this be one sentence?: "If it doesn’t, it will reallocate a larger array, copy across the elements, and deallocate the old array. Before copying the item to the end and incrementing the counter which tracks the list’s length."
  • "Callout" titles (here and subsequent episodes) - replace with something related to the content as in earlier episodes
  • Add "Exercise:" to exercise title

Minimise Python

  • Change "NumPY" to "NumPy"
  • Filter early: move up to before "Using NumPy/Pandas effectively" sections and provide example?

Latency Overview

  • re label for London > Canada > London:

James KW Moore:
Could just say cross-Atlantic round trip.

General comments

  • Where episode titles contain "&" the prev/next links look a bit weird.

I spent some time this weekend profiling and trying to do some optimisation on one of my personal Python code projects. (Bear in mind that I am not a Python specialist and I wasn't working on particularly scientific or complex code.)

The profiling part of the course worked as advertised and helped me identify exactly which bits of my code were slow and would benefit from some effort to improve. Unfortunately, the result was that the major slowdown in my code was caused by poor coding on my part and can only be improved by a better algorithm for tackling the problem, not, as far as I can see, by taking advantage of any Python-specific quirks.

A few bits of the optimisation side of the course were still quite helpful, however. I think by far the most useful thing I learned was about variable scope and function calls causing slowdowns. The easiest and largest speed gains I got came from copying non-local variables into local ones and inlining functions that were only called once. The scope change in particular sped those functions up by around 10%, so it might be worth putting more emphasis on this than just a single callout.
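
For anyone curious, a minimal sketch of the local-copy pattern described above (names and numbers are illustrative, not taken from my project):

```python
import math

TOLERANCE = 1e-9  # module-level (global) constant


def check_all(values):
    # Copy the global, and the attribute lookup math.isclose, into locals
    # once; lookups of local names inside the hot loop are cheaper than
    # repeated global/attribute lookups.
    tol = TOLERANCE
    is_close = math.isclose
    return [is_close(v, round(v), abs_tol=tol) for v in values]
```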

I think it would be beneficial to acknowledge at some point in the course that there might not be any optimisations to be made. I would not like to put a researcher in the position of "these things should be helping me but I can't get them to work, I feel so disheartened".

Thanks Fred, useful comments.

I appreciate all this feedback; I'm not too sure when I will have time to address it though. I've got a bit of a busy month.

I discovered by chance that IPython, an enhanced interactive Python shell, has support for both line and memory profiling via the "line magics" %lprun and %mprun respectively. Not entirely sure how useful it would be, but thought it worth mentioning.
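
A quick sketch of how those magics are used (requires the line_profiler and memory_profiler packages to be installed; the function here is just an example):

```python
# In an IPython / Jupyter session:
%load_ext line_profiler
%load_ext memory_profiler

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# Line-by-line timings for slow_sum while running the given statement:
%lprun -f slow_sum slow_sum(100_000)

# Line-by-line memory usage; note %mprun expects the function to be
# defined in a file on disk (e.g. imported from a module).
%mprun -f slow_sum slow_sum(100_000)
```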

Disseminate setup instructions before the course; perhaps bundle the various files that users will download into a single zip so they can download them in one go.

I think @gyengen currently plans for it to run on managed desktops in Hicks. So this may not be that simple.

With a wider view, though, is it possible that some might want to use their own laptops?

I have never used the managed desktops so I don't know if it's possible for people to install software in advance, i.e. whether they work like VMs/remote desktops. If so, it would seem sensible to ask people to download and install software and data in advance, as doing so at the start of a session wastes valuable face-to-face time.

Also, this course has the possibility of feeding upstream into the Carpentries Incubator, where it could be used by others and may see contributions, so making it as general as possible would be useful.

In that regard, having instructions for participants to download and set things up beforehand would be really useful.

With a wider view, though, is it possible that some might want to use their own laptops?

Yes, eventually. Still a lot to resolve before then. I'm acknowledging the feedback (not going to hide it away), it's just not an immediate priority. Afaik the Carpentries format does have a data page, which would serve this purpose. I'm just not a huge fan of having individual downloads that also need to be manually archived if they change, so I would want to look at whether I can fudge the Carpentries CI to do that for me.

Also, this course has the possibility of feeding upstream into the Carpentries Incubator, where it could be used by others and may see contributions, so making it as general as possible would be useful.

There's already Sheffield-specific stuff in here (such as the theme); I expect the Carpentries Incubator version would end up being a fork of this repository.

Cool. The main reason I mentioned it is that with the Git course it can delay the start of the session if people haven't followed the setup instructions.

If/when you get round to creating archives, there seems to be a GitHub Action for everything... Create Archive · Actions · GitHub Marketplace!

Removed the scope callout whilst removing generator functions. Need to work out where it fits.

  • As suggested by Fred, likely worth promoting it beyond a passing callout.
    • Worth renaming the physical file minimise-python to understanding-python?

::::::::::::::::::::::::::::::::::::: callout

The use of `max_val` in the previous example moves the value of `N` from global to local scope.

The Python interpreter checks local scope first when resolving variable names, so accessing local-scope variables is slightly faster than accessing global-scope ones. This is most visible when a variable is accessed frequently, such as within a loop.

Replacing the use of `max_val` with `N` inside `test_generator()` causes the function to consistently perform a little slower than `test_list()`, whereas before the change it would normally be a little faster.

:::::::::::::::::::::::::::::::::::::::::::::
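
For reference, a self-contained sketch of the effect this callout describes (`test_generator()`/`test_list()` belong to the episode code; the functions below are just illustrative stand-ins):

```python
import timeit

N = 10  # module-level (global) variable


def use_global(iterations=1_000_000):
    total = 0
    for _ in range(iterations):
        total += N        # global name lookup on every iteration
    return total


def use_local(iterations=1_000_000):
    max_val = N           # copy the global into a local once
    total = 0
    for _ in range(iterations):
        total += max_val  # local name lookup on every iteration
    return total


print("global lookups:", timeit.timeit(use_global, number=10))
print("local lookups: ", timeit.timeit(use_local, number=10))
```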