mbloch/mapshaper

Memory usage almost 4x size of input geojson file

Closed this issue · 4 comments

Hi, I understand this might not be a real bug in the mapshaper code, but I could use some suggestions on how to reduce mapshaper's memory usage for my task, since it will be deployed behind an Elixir web server on an EC2 instance in AWS.

My use case is converting a PDF (containing vector data, up to about 5 MB in size) into a GeoJSON file to be rendered in the browser with Leaflet.

I'm using MuPDF to clean the input PDF, Ghostscript to filter out text and images, GDAL to convert the PDF to GeoJSON, and mapshaper to reduce the GeoJSON file size and apply transformations as needed:

$ mutool clean -gggg input.pdf clean.pdf
$ gs -o no-text-no-images.pdf -sDEVICE=pdfwrite -dFILTERTEXT -dFILTERIMAGE clean.pdf
$ ogr2ogr --config OGR_PDF_READ_NON_STRUCTURED YES --config GDAL_PDF_DPI 200 \
    -oo RENDERING_OPTIONS=VECTOR \
    -lco COORDINATE_PRECISION=3 \
    big.geojson no-text-no-images.pdf
$ mapshaper big.geojson no-topology \
    -target type=polyline -dissolve \
    -target type=polyline -affine shift=-0.0,-0.0 \
    -clip bbox=0,0,9600,7200 \
    -target type=polyline -o precision=0.001 final.json

The GeoJSON output by ogr2ogr is about 1 GB, which mapshaper reduces to about 14 MB.
The mapshaper step uses almost 4 GB of memory, presumably because of the size of the ogr2ogr output.

I tried replacing the mapshaper step with a Python script using GeoPandas, but the memory usage was about the same and it took much longer (~5 minutes vs. ~30 seconds for mapshaper).

I'm open to any ideas on how to get that memory usage down (or whether that's even realistic).

Thanks!

Tiffany

Hi,
Your options for reducing memory usage may depend on the particular features of your dataset. If you can send me a representative sample of the GeoJSON you are processing, I'd be happy to take a look and make suggestions.

sample.json
I've attached a sample of the GeoJSON output by ogr2ogr. You can also get the full 1 GB file by following the steps above with the originally attached PDF. Thanks for taking a look!

Your dataset has a large number of small simple vector paths. That's going to require a lot more RAM to process than a dataset with the same number of vertices contained in a few complex paths.
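(If you want to check this yourself, mapshaper's -info command prints the feature count and geometry type of each layer:

$ mapshaper big.geojson -info
)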

I see that you are using mapshaper to clip to a bounding box (-clip bbox=0,0,9600,7200). You could perform that clipping operation in ogr2ogr and then process the clipped file in mapshaper. If your clipping box represents a small part of the total file, then this approach will greatly reduce the memory used by mapshaper.
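For example, something like the following (untested) should work, using ogr2ogr's -clipsrc option with the same bounding box (-clipsrc requires GDAL to be built with GEOS; -spat alone would only filter by envelope rather than clip):

$ ogr2ogr --config OGR_PDF_READ_NON_STRUCTURED YES --config GDAL_PDF_DPI 200 \
    -oo RENDERING_OPTIONS=VECTOR \
    -lco COORDINATE_PRECISION=3 \
    -clipsrc 0 0 9600 7200 \
    big.geojson no-text-no-images.pdf

You could then drop the -clip step from the mapshaper command.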

If your clipping operation doesn't remove most of the paths in the file, then the only other optimization I can think of is to split your large file into parts and then process each part separately.
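One way to do that (a rough sketch, not tested on your data) would be to use ogr2ogr's -spat option to pull out spatial tiles, run mapshaper on each tile, and then combine the outputs. For example, splitting the page into two halves:

# -spat selects features whose bounding box intersects the given rectangle,
# so paths straddling the boundary will appear in both halves
$ ogr2ogr -spat 0 0 4800 7200 left.geojson big.geojson
$ ogr2ogr -spat 4800 0 9600 7200 right.geojson big.geojson
$ mapshaper left.geojson no-topology -target type=polyline -dissolve -o left.out.json
$ mapshaper right.geojson no-topology -target type=polyline -dissolve -o right.out.json
$ mapshaper -i left.out.json right.out.json combine-files -merge-layers -o final.json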

Hope my feedback was useful...