NREL/rex

Question on API Limits / Fetching Bulk Data

mike-welch opened this issue · 8 comments

Over the past two weeks I have been working with rex to extract data from the offshore_ca dataset, ultimately running a snippet similar to this:

import pandas as pd
from pathlib import Path
from rex import WindX

sites = Path('sites.csv')
data = pd.read_csv(sites)

for _, row in data.iterrows():
    print(row.Name)
    parameters = {
        'lat_lon': (row.Latitude, row.Longitude),
        'hub_height': row.hub_height,
        'require_wind_dir': True,
    }

    outDir = sites.parent.joinpath('data', row.Name)
    outDir.mkdir(parents=True, exist_ok=True)

    for yyyy in range(2000, 2020):
        h5file = f'/nrel/wtk/offshore_ca/Offshore_CA_{yyyy}.h5'
        outFile = outDir.joinpath(
            f'wtk_{row.Latitude}_{row.Longitude}_60min_{row.hub_height}_{yyyy}.csv')
        if not outFile.exists():
            with WindX(h5file, hsds=True) as f:
                f.get_SAM_lat_lon(
                    lat_lon=(row.Latitude, row.Longitude),
                    hub_height=row.hub_height,
                    # require_wind_dir=True,
                    out_path=str(outFile),
                )
                # time.sleep(1800)  # requires "import time" above when enabled

This has worked pretty well, with the exception that I can only get data for three years before I hit the hourly API limit of 1000 calls, receiving an OSError(429) partway through the fourth. To work around this, I added time.sleep(1800) (commented out in the snippet above), which let me pull two years per hour over 10 or so hours to get the full 20 years of data. The rationale for requesting the full 20 years is that we process the data with PySAM and then feed it into a stochastic PLEXOS simulation, where any of the 20 years of data can be used in each sample.
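
For what it's worth, one way I could automate that wait is sketched below. fetch_with_backoff is my own hypothetical helper (not part of rex), and it assumes the rate-limit response keeps surfacing as an OSError containing 429, as described above:

import time

def fetch_with_backoff(fetch, max_retries=5, wait=3600):
    """Retry a zero-argument fetch callable when the hourly rate limit is hit."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError as e:
            # Assumption: h5pyd surfaces the HTTP 429 rate-limit response as an OSError
            if '429' in str(e) and attempt < max_retries - 1:
                time.sleep(wait)  # wait out the hourly quota before retrying
            else:
                raise

Each get_SAM_lat_lon call in the loop above could then be wrapped in a zero-argument lambda and passed to this helper.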

I did some rough profiling of how many API calls are being made when calling get_SAM_lat_lon:

  • 3 initial calls (during the constructor or elsewhere)
  • 46 calls when getting the metadata for the datasets available in the h5 file
  • 246 calls to get the data (excludes wind direction)

I measured this by adding the following debug print statement after line 890 in base.py from h5pyd. All calls but the initial 3 were captured here (to the best of my knowledge).

print(f"API Requests Remaining: {rsp.headers._store['x-ratelimit-remaining'][1]}\n")

I also measured the number of calls using get_lat_lon_df to get a windspeed dataset for a single lat/long and observed 128 calls there.
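
For reference, the call I timed looked roughly like the following; the year, dataset name, and coordinates here are placeholders rather than the actual sites:

from rex import WindX

with WindX('/nrel/wtk/offshore_ca/Offshore_CA_2012.h5', hsds=True) as f:
    # Pull the full time series of one dataset for the site nearest this lat/lon
    df = f.get_lat_lon_df('windspeed_100m', (33.0, -120.0))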

I'm not sure whether my use case is typical, and I'm wondering if there is a better way to fetch the data. I know the MultiYearX class exists and could be used, but I don't think it will get around the fundamental API limit, since the data is spread across multiple h5 files and would still require pulling from each. However, this approach might at least avoid refetching the metadata or certain scalar datasets; a rough sketch of what I had in mind follows.
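
Something along these lines, assuming the class is MultiYearWindX and that it accepts a wildcard path plus a years list (I have not verified the exact signature):

from rex import MultiYearWindX

years = list(range(2000, 2020))
h5path = '/nrel/wtk/offshore_ca/Offshore_CA_*.h5'

with MultiYearWindX(h5path, years=years, hsds=True) as f:
    # Ideally the metadata is fetched once and shared across all 20 files (assumption)
    df = f.get_lat_lon_df('windspeed_100m', (33.0, -120.0))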

Ultimately, I'm not sure whether the rate limit of 1000 calls per hour is appropriate given the nature of h5 files, or whether my use case makes me a candidate for an increased API limit.

I appreciate any feedback and insight from the NREL team.

Hey Mike, glad to hear you're using the wind data and the rex software. I'm hoping that @MRossol will chime in, but my understanding is that we have an open "demo" HSDS server set up for small data retrieval and prototyping, but if you're trying to pull TB of data you will need to stand up your own HSDS production server. Michael might have an example of how to do so.

Hi @mike-welch, @grantbuster is spot on. Our public HSDS server is meant as a resource to let people get familiar with the code and data, but it doesn't have the throughput (as you've noticed) to handle "production" analysis. Regardless, because our service is publicly available, even if we upped your API limit you'd end up running into 505 errors due to contention with all of our other users. As Grant mentioned, your best bet is to:

  1. Stand up your own HSDS server: https://github.com/HDFGroup/hsds
  2. Use the HDF Group's Kita Lab (a managed HSDS service on AWS): https://www.hdfgroup.org/solutions/hdf-kita/

If you'd like more info about the above I can put you in touch with the HSDS architect at the HDF Group; he's happy to help migrate users to their own service!

Cheers!

@MRossol, for my own edification, are the following two links (in order) the best resources to get started on this?

https://github.com/HDFGroup/hsds/blob/master/docs/docker_install_aws.md
https://nrel.github.io/reV/misc/examples.running_with_hsds.html#setting-up-hsds

Thanks for the feedback!

In the example docs for reV (thanks for the link @grantbuster):

Please note that our HSDS service is for demonstration purposes only, if you would like to use HSDS for production runs of reV please setup your own service: https://github.com/HDFGroup/hsds and point it to our public HSDS bucket: s3://nrel-pds-hsds

If I'm reading this correctly, we can set up our own HSDS service which will pull data from the NREL bucket. So there is no need for me to maintain a copy of the Wind Toolkit data, just the need to have a different way to access it. Does that sound right?

@mike-welch that is 100% correct. As mentioned above, you just need to point to our public HSDS bucket (s3://nrel-pds-hsds) when you set up your HSDS service.
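
Once your service is up, a quick sanity check that it can see our data might look like the following; the endpoint and credentials are placeholders for whatever your HSDS deployment uses (you can also put them in ~/.hscfg or, in recent rex versions, pass them through the hsds_kwargs argument):

import h5pyd

with h5pyd.File('/nrel/wtk/offshore_ca/Offshore_CA_2012.h5', mode='r',
                endpoint='http://localhost:5101',
                username='admin', password='admin') as f:
    # If the server is pointed at s3://nrel-pds-hsds this lists the wind datasets
    print(list(f))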

Okay, very cool.

I'm leaning towards Kita and would love to get connected with your HDF Group contact to see what the options are for setting this up and if it makes sense for us to do so.

Great! What's the best e-mail address for you? I can send an introduction!

michael.welch@telos.energy