lifewatch/eurobis

Slow requests when trying to access a large volume of data

salvafern opened this issue · 1 comment

I am currently trying to use the eurobis R package to download all records of benthic animals from the North Sea. Previously, I used the robis package to download all data from a bounding box. Now, I have defined the North Sea as the area and requested all data with traits=benthos. I copied the URL from the interface and used it in the eurobis package.

What I notice is that the response is very slow: in one hour I got approximately 80,000 records out of about 1 million, so this may take much longer. Is that due to heavy traffic on the server, or does it have to do with the eurobis package?

This is a large amount of data, so it is expected to take a long time. We should, however, check how we can improve the way data is downloaded in this package.


Edit: here is the request that was performed:

library("eurobis")

# Full WFS GetFeature request copied from the EurOBIS data portal:
# North Sea area (geoobjectsids = 2350), restricted to taxa with the
# trait "Benthos", returned as CSV.
quer_ben_ns <- "http://geo.vliz.be/geoserver/wfs/ows?service=WFS&version=1.1.0&request=GetFeature&typeName=Dataportal%3Aeurobis-obisenv&resultType=results&viewParams=where%3A%28%28up.geoobjectsids+%26%26+ARRAY%5B2350%5D%29%29+AND+measurement_type_group_ids+%26%26+ARRAY%5B3%5C%2C14%5C%2C1%5C%2C28%5C%2C27%5D+AND+aphiaid+IN+%28+SELECT+aphiaid+FROM+eurobis.taxa_attributes+WHERE+selectid+IN+%28%27Benthos%27%29%29%3Bcontext%3A0100&propertyName=datasetid%2Cdatecollected%2Cdecimallatitude%2Cdecimallongitude%2Ccoordinateuncertaintyinmeters%2Cscientificname%2Caphiaid%2Cscientificnameaccepted%2Cmodified%2Cinstitutioncode%2Ccollectioncode%2Cyearcollected%2Cstartyearcollected%2Cendyearcollected%2Cmonthcollected%2Cstartmonthcollected%2Cendmonthcollected%2Cdaycollected%2Cstartdaycollected%2Cenddaycollected%2Cseasoncollected%2Ctimeofday%2Cstarttimeofday%2Cendtimeofday%2Ctimezone%2Cwaterbody%2Ccountry%2Cstateprovince%2Ccounty%2Crecordnumber%2Cfieldnumber%2Cstartdecimallongitude%2Cenddecimallongitude%2Cstartdecimallatitude%2Cenddecimallatitude%2Cgeoreferenceprotocol%2Cminimumdepthinmeters%2Cmaximumdepthinmeters%2Coccurrenceid%2Cscientificnameauthorship%2Cscientificnameid%2Ctaxonrank%2Ckingdom%2Cphylum%2Cclass%2Corder%2Cfamily%2Cgenus%2Csubgenus%2Cspecificepithet%2Cinfraspecificepithet%2Caphiaidaccepted%2Coccurrenceremarks%2Cbasisofrecord%2Ctypestatus%2Ccatalognumber%2Creferences%2Crecordedby%2Cidentifiedby%2Cyearidentified%2Cmonthidentified%2Cdayidentified%2Cpreparations%2Csamplingeffort%2Csamplingprotocol%2Cqc%2Ceventid%2Cparameter%2Cparameter_value%2Cparameter_group_id%2Cparameter_measurementtypeid%2Cparameter_bodcterm%2Cparameter_bodcterm_definition%2Cparameter_standardunit%2Cparameter_standardunitid%2Cparameter_imisdasid%2Cparameter_ipturl%2Cparameter_original_measurement_type%2Cparameter_original_measurement_unit%2Cparameter_conversion_factor_to_standard_unit%2Cevent%2Cevent_type%2Cevent_type_id&outputFormat=csv"

tt <- getEurobisData(geourl = quer_ben_ns)

Potential solutions (non-exclusive):

  1. Cache: download the data to disk instead of reading it into memory, e.g. using httr::GET as in osmextract (https://github.com/ropensci/osmextract/blob/master/R/download.R#L134), or explore httr2::req_cache() in detail. See the first sketch after this list.
  2. Pagination: WFS allows pagination, so the data could be downloaded in batches and written to disk (second sketch below).
  3. Compressed data: look into all the possible WFS output formats for vector data so the data travels as compressed as possible and is decompressed locally. There are GeoServer extensions providing other formats that may be smaller or better suited to requesting and aggregating large volumes of data: https://docs.geoserver.org/latest/en/user/extensions/index.html (third sketch below)
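
For option 1, a minimal sketch of streaming the response to disk with httr; `wfs_url` stands for the full GetFeature request built above, and the real integration would of course live inside the package:

library(httr)

# Stream the WFS response straight to a file instead of holding it in memory.
# `wfs_url` is a placeholder for the full GetFeature request.
destfile <- tempfile(fileext = ".csv")
resp <- GET(wfs_url, write_disk(destfile, overwrite = TRUE), progress())
stop_for_status(resp)

# Read the downloaded file afterwards, e.g. with data.table::fread.
occ <- data.table::fread(destfile)

The httr2::req_cache() alternative mentioned above works differently: it caches whole responses between sessions based on HTTP caching headers, so repeated identical requests would not hit the server again.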
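For option 2, a rough paging loop, assuming the endpoint honours the WFS 2.0.0 startIndex/count parameters (the request above uses version 1.1.0, so this would need checking against the server); `base_url` and `page_size` are placeholders:

library(httr)

page_size <- 50000
start <- 0
destfile <- "eurobis_benthos_ns.csv"

repeat {
  # `base_url` is the GetFeature request without paging parameters.
  page_url <- paste0(base_url, "&count=", page_size, "&startIndex=", start)
  resp <- GET(page_url)
  stop_for_status(resp)
  batch <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))
  # Append each batch to disk; write the CSV header only for the first page.
  write.table(batch, destfile, append = start > 0, sep = ",",
              col.names = start == 0, row.names = FALSE)
  if (nrow(batch) < page_size) break  # last (possibly partial) page reached
  start <- start + page_size
}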
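For option 3, one compressed format GeoServer ships with out of the box is the zipped shapefile (outputFormat=SHAPE-ZIP). A sketch, where `zip_url` is assumed to be the same request with the outputFormat parameter swapped:

library(httr)
library(sf)

# Download the zipped shapefile and read it locally.
zipfile <- tempfile(fileext = ".zip")
resp <- GET(zip_url, write_disk(zipfile, overwrite = TRUE), progress())
stop_for_status(resp)

exdir <- tempfile()
unzip(zipfile, exdir = exdir)
shp <- list.files(exdir, pattern = "\\.shp$", full.names = TRUE)
occ <- read_sf(shp[1])

One caveat with this particular format: shapefile attribute names are truncated to 10 characters, which matters for a request with this many columns, so another compressed format (or HTTP-level gzip compression, where server and client negotiate it) may be preferable.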