aguinane/nem-reader

Improve performance

matt-telstra opened this issue · 15 comments

I'm deploying this library to AWS Lambda functions, which have a maximum runtime of 15 minutes.
For 14MB files I'm getting timeouts.

I'm not here to complain though. I want to write a PR to improve this library to make it faster.
I'm raising this ticket so we can start the discussion to identify where the bottleneck is, and so I can come up with a plan to improve it.

Do you know off the top of your head where the bottleneck is?

I suspect it's the list comprehension that instantiates one named tuple per row, here. Does that sound like it's the bottleneck?

I'm still not super familiar with the code, or even the nem12 format.

parse_reading(val) sounds pretty trivial. If that's the bottleneck we can probably find a way to do that casting with numpy.

Is it the fact that this is a list of named tuples, instead of a pandas dataframe from the start? Pandas can parse CSVs pretty quickly, and you can tell it to ignore the first M rows and last N rows. My use case only needs the output as a pandas dataframe (to dump to parquet). Would it be acceptable if every use involved casting the data to a pandas dataframe (and then possibly back to an Iterable of named tuples or whatever, to avoid making it a breaking change)?
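
For concreteness, here's roughly the kind of pandas call I have in mind (a naive sketch only; the file name is a placeholder and NEM12's variable-length record types would almost certainly need more handling than this):

import pandas as pd

# Treat the file as a plain CSV and trim the header/footer records.
# skipfooter requires the slower python engine, and the ragged rows
# (100/200/300/400/900 record types) may still break this as written.
df = pd.read_csv(
    "some_nem12_file.csv",
    header=None,
    skiprows=1,      # skip the 100 header record
    skipfooter=1,    # skip the 900 end-of-data record
    engine="python",
)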

Is it a big deal if we have to do two passes of the file?
(e.g. for certain streamable file-like objects, they're not seekable, so we'd have to cache it in memory.)

I don't think pandas would be capable of reading in the rows directly, as you need to be able to handle the 400 rows.
The 400 rows don't always show up, but if there is a power outage or something then they do, and they change the quality attribute; see this example file.
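
Roughly what that looks like (made-up values just to illustrate, not taken from the real example file), since the row lengths and meanings differ per record type:

# Hypothetical NEM12 fragment: a 300 interval-data row whose quality flag is
# "V" (variable), followed by 400 rows that override the quality for ranges
# of intervals (start interval, end interval, quality method, reason).
sample_rows = """\
300,20200101,1.11,1.22,1.05,...,V,,,20200102120000,
400,1,28,F52,,
400,29,48,A,,
"""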

15 minutes seems pretty slow - my use case originally was just for parsing my personal export data which has never taken that long, so I haven't tried to optimise. I can try profiling it and see if there are any easy wins - not sure when I'll get time though.

Presumably you are trying to import a daily file with data from many meters/NMIs? I imagine the bottleneck might even be different for that use case, because most of the example files are only for a single NMI anyway.

Yes I'm processing large daily files with many NMIs.

I'll try profiling.
It could be an issue with AWS Lambda (e.g. by default they only give you a fraction of a CPU, though I thought I had already solved that one). So I'll do some profiling on my laptop.

I'll also see if I can replace the confidential data with meaningless junk that causes the same behavior, so you can test it. (Does GitHub let users attach 15 MB files to issues?)

That's alright - I can track down a similar example file to profile on - just wanted to confirm that was your use case.

I tried this with a large example with lots of NMIs (82 MB).

from nemreader import output_as_data_frames
dfs = output_as_data_frames("large_example.csv")
nmi, df = dfs[0] # Return data for first NMI in file
print(nmi)
print(df)

As a side note, #31 wants to include meter serial no in the data frame so it might make more sense to return a giant data frame with all NMIs instead of splitting them out. This would probably improve performance for your use case and be more useful as well?

Are dataframes the best output method to use?

Running the profiling function on the example code above:

python -m cProfile -o output.prof example.py
snakeviz output.prof

[snakeviz profiling screenshot]

So improving the dataframe building might help a bit.
Also, split_days takes about 12 s and can probably be skipped for NEM12, as it was added more for NEM13.
But yeah, parse_nem12_rows is where most of the time goes.

Yes for my use case I add NMI as a column, and then concatenate all dataframes. (Since I save as parquet, which is compressed and column-based, it's not wasteful to repeat the NMI so many times.)

import pandas as pd
from nemreader import output_as_data_frames, read_nem_file

# tmp_path and data come from the surrounding code (the raw NEM12 file bytes)
with open(tmp_path, 'wb') as f:
    f.write(data)
dfs = output_as_data_frames(tmp_path)
for (nmi, nmi_df) in dfs:
    nmi_df['nmi'] = nmi
df = pd.concat(nmi_df for (nmi, nmi_df) in dfs)
m = read_nem_file(tmp_path)
df['creation_date'] = m.header.creation_date
assert m.header.creation_date is not None

So I'd like just one dataframe. But is there a way we can make that change that's performant whilst also backwards compatible?

I don't really understand what split_days does. (The docs don't mention the word "split".) I see that for output_as_data_frames, split_days defaults to True. So if I set it to False, should that be faster? I just tested and it's not substantially faster.

Also, I currently need to parse the file twice: once to get the dataframes, and once to get the creation date. Even though the creation date is only in the first line, to get it using this library as it is today I need to read the whole file, which is slow.
So I'll write a function that calls parse_header_row directly. What if I submit a pull request to add such a function to the library?
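
Something like this sketch is what I have in mind (assuming an unzipped NEM12 file whose first row is the 100 header record; the function name is a placeholder, not an existing library function):

from datetime import datetime

def read_creation_date(file_path: str) -> datetime:
    # 100 header record: 100,NEM12,<creation datetime yyyymmddHHMM>,<from>,<to>
    with open(file_path) as f:
        header_fields = f.readline().strip().split(",")
    return datetime.strptime(header_fields[2], "%Y%m%d%H%M")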

I did some more testing on my end. The performance inside an AWS Lambda seems about twice as slow as running locally. So there's probably something on my end/AWS' end that I need to investigate too.

split_days() is a generator that yields one Reading per day for the given Iterable, even if the actual Readings contain gaps bigger than a day. If your data already has daily samples, it should be lightweight.
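
Conceptually it does something like this (a simplified sketch of the idea, not the actual implementation):

from datetime import datetime, time, timedelta

def split_into_days(t_start: datetime, t_end: datetime, value: float):
    """Yield one (start, end, value) chunk per calendar day, pro-rated by duration."""
    total = t_end - t_start
    cursor = t_start
    while cursor < t_end:
        next_midnight = datetime.combine(cursor.date() + timedelta(days=1), time())
        chunk_end = min(t_end, next_midnight)
        yield cursor, chunk_end, value * ((chunk_end - cursor) / total)
        cursor = chunk_end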

The reading-the-file-twice part should be an easy-ish fix. I've made a start by making the parser a class. I just need to make the bit that reads the header a separate function, and then you could call that separately. There are a few edge cases to work through though (empty first row/missing first row).

To be clear, when I said I was ingesting daily files, I meant files with all the intervals for one day, received each day. They're not files with one row per day.

Re a change for the read-twice part, great. Thanks.

v0.8.0 now has a way to load the file as a class with the header row retained as a variable, so you don't need to parse the file twice.

I added a new file to the examples directory that has a bunch of daily records for many NMIs.
I only included 100 as I didn't want the tests to be too slow (this file is now loaded in the test suite), but here is the code I used to make it, if you want to generate more.

import pandas as pd
from datetime import datetime, timedelta
from random import randrange
from nemwriter import NEM12


interval = 5
num_intervals = int(60 / interval) * 24 
index = [
    datetime(2020, 1, 1) + timedelta(minutes=interval * x)
    for x in range(1, num_intervals + 1)
]

m = NEM12(to_participant='EXAMPLE')
nmis = [f"nmi{x}" for x in range(1, 100)]
for nmi in nmis:
    e1 = [randrange(1, 10) for x in range(1, num_intervals + 1)]
    e2 = [randrange(1, 5) for x in range(1, num_intervals + 1)]
    s1 = pd.Series(data=e1, index=index, name="E1")
    s2 = pd.Series(data=e2, index=index, name="E2")
    df = pd.concat([s1, s2], axis=1)
    m.add_dataframe(nmi=nmi, df=df, uoms={"E1": "kWh", "E2": "kWh"})

output_file = f"{m.nem_filename()}.zip"
fp = m.output_zip(file_path=output_file)

Also, here's some new example code for the profiling:

from nemreader import NEMFile
nf = NEMFile("examples/Example_NEM12_ManyNMIs.zip", strict=True)
df = nf.get_data_frame()
print(df)
print(nf.header)

When I call output_as_data_frames on data generated with that code snippet, I get this warning many times:

/Users/matthew/Documents/nem12/nemreader/outputs.py:86: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nmi_df.rename(columns={"quality": "quality_method"}, inplace=True)

This doesn't happen if there's exactly 1 NMI.

This doesn't happen for get_data_frame().

Looking through the code here and here, I don't think this is a bug, since we don't expect the rename to propagate to the original. But I thought I'd just point that out.
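
If the warning ever did need to go away, the usual fix is an explicit copy of the slice before mutating it, e.g. (the slicing line here is just my guess at what outputs.py does):

nmi_df = df[df["nmi"] == nmi].copy()  # explicit copy, so the rename is clearly local
nmi_df.rename(columns={"quality": "quality_method"}, inplace=True)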

I've looked through the code to find ways to improve performance. None of my ideas made much difference, except one risky change.

I tried replacing:

with archive.open(csv_file) as csv_text:
    # Zip file is open in binary mode
    # So decode then convert back to list
    nmi_file = csv_text.read().decode("utf-8").splitlines()

with

with archive.open(csv_file) as csv_bin:
    # io.TextIOWrapper decodes lines lazily as they are read
    with TextIOWrapper(csv_bin) as csv_text:
        ...

That didn't make a noticeable difference.

I tried passing the list of named tuples directly to pandas, which does work (i.e. it doesn't throw an error). I hoped that might be faster than the list comprehension, but there are two unused columns that then have to be dropped, so it ended up being slightly slower.
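
Roughly what I tried (channel_readings and the dropped column names are placeholders, not the library's actual names):

import pandas as pd

# pandas expands a list of named tuples into columns named after their fields
df = pd.DataFrame(channel_readings)
# ...but two columns the output doesn't need then have to be dropped,
# which ate the gain and made it slightly slower overall
df = df.drop(columns=["placeholder_col_a", "placeholder_col_b"])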

I tried giving pandas just a single nmi value, rather than a list of the same nmi repeated. Of course we still end up with a numpy array of the repeated nmi, but I hoped pandas would build that faster. It made only a very slight difference.
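
That is, roughly the difference between these two, with the scalar left for pandas to broadcast:

df["nmi"] = nmi              # scalar, broadcast to every row by pandas
df["nmi"] = [nmi] * len(df)  # list of the same value repeated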

For NEMData and NEMReadings I tried adding some pydantic config:

    class Config:
        # don't re-validate readings and transactions
        # since that results in re-work
        copy_on_model_validation = 'none'

That resulted in a slight speedup.
My hunch is that NEMData and NEMReadings both perform the same validation, which is redundant.

I considered trying to leverage Pydantic's serialisation functions for NEMData, but I doubt that would be faster than the list comprehension.

I tried replacing those two classes with normal, non-Pydantic classes (i.e. no validation). That sped things up substantially: from 35 seconds to 9 seconds with my particular test file. Perhaps that's a solution to improve speed?
What would be the implications of converting those two classes in particular to non-pydantic classes? Which pydantic functions do you use? Is pydantic used to validate the contents of each NamedTuple? (Since NamedTuples can have anything assigned regardless of the type annotation.)
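
For reference, the kind of replacement I tried looks something like this (the shape is my guess based on the readings type hint, not the library's actual code):

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NEMReadings:
    # plain container: no per-reading validation or copying on construction
    readings: Dict[str, Dict[str, List["Reading"]]] = field(default_factory=dict)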

Even if we must keep both classes as pydantic classes, perhaps there's some way to factor out the readings: Dict[str, Dict[str, List[Reading]]] field, so we only validate each reading once?

Or perhaps I could just monkeypatch the library in my code to use non-pydantic models, then do the validation on the dataframe with a few assertions on pandas methods, which I assume would be faster than pydantic.
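
That is, something like the following on the assembled frame (column names are illustrative only):

# bulk sanity checks instead of per-row pydantic validation
assert df["t_start"].dtype == "datetime64[ns]"
assert df["read_value"].notna().all()
assert (df["read_value"] >= 0).all()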

You can remove the pydantic dependency. I added it as an extra way to validate the datatypes and as an easy way to add classes (but now that the whole thing is a class, that isn't needed), so if it's causing a big performance hit I'm happy to remove it. It should still work without it.