microsoft/USBuildingFootprints

geoJSON state datasets too large

elijaflores6 opened this issue · 36 comments

Hello,

I see that the datasets are available at the state level. However, the per-state files are very large, and I had no luck converting the GeoJSON files to shapefiles (the process took far longer than usual and overheated my computer).

Setting aside the obvious suggestion of a better computer: is it possible to get building footprint datasets for individual cities, so that the files are smaller and easier on my machine to convert?

I would prefer that the data all come from one source (for consistency) rather than piecing it together from multiple sources. My team is searching for building footprints for the following cities:

• Minneapolis, MN
• Atlanta, GA
• Raleigh, NC
• St Paul, MN
• Charlotte, NC
• Winston-Salem, NC
• Chicago, IL
• Dallas, TX

Any other suggestions/solutions would be very much appreciated!

Thanks,
Elija

@elijaflores6 How big is your team? Nobody with a desktop PC (→ no overheating)?

I subset the Texas dataset on a MacBook Pro with no problems; it took about 5 minutes. I just had to get a little more familiar with the sf package. Using rgeos/rgdal was so computationally intensive that it hung the computer for about 6 hours on Maine, and never finished on Texas. The README has a nice walkthrough of how to do this with the R sf package; it is pretty simple. Let me know if you need further help. I appreciate that MS is using an open-source format (GeoJSON), but a shapefile version would be a welcome alternative for most folks.

Hey there @antifa-ev

Right now we are a team of 3 (myself included); one of us has a developer machine and he'll be working on that, but we would like to split the work amongst ourselves.

We also all work remotely in different states, so we were only issued laptops for work. We're hoping to find a better solution with the resources we have at the moment.

Look at using Safe Software's FME. It took me 1 hour 26 minutes to convert California from GeoJSON to a file geodatabase. FME runs on Mac, too.

The following script worked for me. It took ~10 minutes on a 2017 MacBook Pro. After that, any intersection runs pretty fast.

library(sf)

# read the state GeoJSON from the download folder, then write it out as a shapefile
setwd("./USBuildingFootprints/")
Texas_buildings <- st_read("./Texas.geojson")
st_write(Texas_buildings, "Texas_buildings.shp")
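
For example, once the shapefile exists, pulling out a single metro area is quick. A rough sketch (the Dallas bounding box coordinates here are approximate; adjust them to your area of interest):

library(sf)

# read the converted shapefile back in
Texas_buildings <- st_read("Texas_buildings.shp")

# rough bounding box around Dallas, TX (approximate coordinates)
dallas_box <- st_as_sfc(st_bbox(c(xmin = -97.0, ymin = 32.6, xmax = -96.5, ymax = 33.0), crs = st_crs(Texas_buildings)))

# keep only the footprints that intersect the box
dallas_buildings <- st_intersection(Texas_buildings, dallas_box)
st_write(dallas_buildings, "Dallas_buildings.shp")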

@elijaflores6 Has either of the two suggestions worked for you?

  1. Generate the workspace
  2. Edit the inputs
  3. Run the translation

It took 30 minutes to complete Texas.

I used:
ogr2ogr -nlt POLYGON dest.shp source.geojson
on my Mac and it worked for me (-nlt POLYGON forces the output layer's geometry type, since shapefiles can't mix geometry types). Thanks!

Re large GeoJSONs: I'm able to download and open the others but not California (e.g., I opened NY and DC, and NY is one of the largest). Is the California file corrupted? If not, can it be broken into northern and southern California, or some other partition of 3 or 4 parts, to make processing easier? Thanks

The following script worked for me. It took ~10 minutes on a 2017 MacBook Pro. After that, any intersection runs pretty fast.

library(sf)
setwd("./USBuildingFootprints/")
Texas_buildings <- st_read("./Texas.geojson")
st_write(Texas_buildings, "Texas_buildings.shp")

Hey @abuabara, thanks for sharing the R script.
Trying to replicate it gives an error on the Texas dataset:
"Error in CPL_read_ogr(dsn, layer, as.character(options), quiet, type, :
Open failed.
In addition: Warning messages:
1: In CPL_read_ogr(dsn, layer, as.character(options), quiet, type, :
GDAL Error 1: JSON parsing error: buffer size overflow (at offset 0)
2: In CPL_read_ogr(dsn, layer, as.character(options), quiet, type, :
GDAL Error 4: Failed to read GeoJSON data"
Running on a Dell XPS 13 with 8 GB RAM and a 5th-gen i5 processor.

Your help would be appreciated.


I'm trying to replicate your error here. So far I have two guesses ...

  1. Available memory. I can test on a 16 GB MacBook Pro and a 32 GB iMac. The iMac is substantially faster and uses most of its memory; the MBP works but reaches full memory pressure. 8 GB is probably not enough.

  2. The GDAL installation. After loading the sf package I have: "Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3" (see the snippet below to check yours).
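
You can print the versions your own sf build links against with sf's built-in report:

library(sf)
# report the GEOS, GDAL and PROJ versions this sf build was compiled against
sf_extSoftVersion()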

It seems the problem with GeoJSON is that most software tries to read the whole file at once.
However, I have written a small script that converts the JSON to an ASCII format which, say, Global Mapper reads easily.

"""
Prints out ascii format from the building footprint files.
usage:
python get_ascii_from_json.py json.geojson > out.file
output in form <feature number>,<x>,<y>:
0,-120.800643,46.963025
0,-120.800727,46.96307
0,-120.800825,46.962986
0,-120.800741,46.96294
0,-120.800643,46.963025
1,-120.686818,47.038457
...
"""
import sys
c = 0
maxc = -1;

for i in file(sys.argv[1]):
    if not "Poly" in i:
        continue
    if not '[[[' in i or not ']]]' in i:
        raise Exception("assumptions on format are incorrect")

    if maxc > 0: print i
    
    r = i.split('[[[')[1]
    arr = r.split(']]]')[0]
    arr = "".join(("".join(arr.split('['))).split(']'))
    pp = arr.split(",")

    if maxc > 0: print pp

    prefix = str(c)
    j = 0
    while j < len(pp) / 2:
        print prefix + "," + pp[j*2]+"," + pp[j*2+1]
        j+=1;

    c = c+1;
    if (c == maxc):
        break;
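
If you would rather rebuild the polygons in R than in Global Mapper, a minimal sketch (this assumes the sfheaders package and that each feature is a single simple ring):

library(sf)
library(sfheaders)

# read the "<feature>,<x>,<y>" rows produced by the script above
pts <- read.csv("out.file", header = FALSE, col.names = c("id", "x", "y"))

# rebuild one polygon per feature id and tag the footprints' CRS
polys <- sfheaders::sf_polygon(pts, x = "x", y = "y", polygon_id = "id")
st_crs(polys) <- 4326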

@elijaflores6 I transformed the geojson into shapefiles and saved one indexed shapefile per county in an s3 bucket, here: s3://glr-ds-us-building-footprints. Files are named {county_fips}.{extension}, e.g. 19075.shx. A 'full' shapefile consists of five files, with the extensions .dbf, .prj, .qix, .shp, and .shx.

I created a repo for the derived shapefiles, with an example of downloading a shapefile, querying it, and computing some building attributes of plots of land in the county. Check it out! https://github.com/granularag/USBuildingFootprints

jwhyb commented


I am having the same issue. I have tried to open the Michigan GeoJSON with QGIS and ArcGIS Pro; it crashes both my laptop and my desktop each time I try. When I tried the R script, it froze the Windows desktop computer I was using and crashed several programs before erroring out 2 hours later with pretty much the same error:

> library(sf)
Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
> setwd("C:/UPX/DATA/MI/Data/Polygons/Michigan")
> Buildings <- st_read("Michigan.geojson")
Cannot open data source C:\UPX\DATA\MI\Data\Polygons\Michigan\Michigan.geojson
Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  : 
  Open failed.
In addition: Warning message:
In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Error 4: Failed to read GeoJSON data

My laptop and desktop both have 8 GB of RAM and run 64-bit Windows 10.
I work at a university research lab and this is all we have access to.
@ledusledus Any ideas?

@jwhyb - did you try pushing it through the script above and then opening the result in QGIS?

jwhyb commented

@ledusledus It did work! I had to load the output as CSV points and then use the Points2One plugin, grouping by the first field, to turn them into polygons. Thanks!

The FME method works great for Texas and California, and I found that R can handle everything else. I downloaded all the remaining states, unzipped them into an empty folder, and then used the following loop in R to convert them all to shapefiles:

library(geojsonsf)
library(sf)

# list every state GeoJSON in the download folder
files <- list.files(path = "Folder/Path/Building_jsons", pattern = "*.geojson", full.names = TRUE, recursive = FALSE)

for (f in files) {
  # strip the directory and the ".geojson" extension to get the state name
  filename <- basename(f)
  state <- substr(filename, 1, nchar(filename) - 8)
  outfile <- paste0("Folder/Path/Building_shps/", state, ".shp")
  sf <- geojsonsf::geojson_sf(f)
  st_write(sf, outfile)
  print(paste0("Completed ", state))
}

I'm having the same problem with California, but I am limited by hardware: loading California's footprints reliably crashes my computer whether I open or read the file through FME, R, Python, cmd, or Notepad++. Conversion to other file types has been equally unsuccessful. At this point my goal is to subset the state into smaller, bite-size portions my machine can handle.

Does anyone have a subset, or even a converted format, of the CA data they would be willing to share?

@aquaraider333,
I had the same problem, which I solved by building tilesets with tippecanoe (https://github.com/mapbox/tippecanoe):

tippecanoe -o out.mbtiles -zg --drop-densest-as-needed California.geojson -P

worked like a charm (-zg guesses an appropriate max zoom, --drop-densest-as-needed drops features in the densest tiles to stay under the size limit, and -P reads the input in parallel).

I can share the tileset if you are interested

I've got county-level shapefiles on s3. https://github.com/granularag/USBuildingFootprints

@chipfranzen I'm having trouble getting Fiona working properly on my Windows build, and it looks like there's a charge for downloading large amounts of data from S3. Is there an alternative way I might access the data?

I can share the tileset if you are interested

@mattmar If you are willing to share, I'd be happy to use them.

@aquaraider333 yeah, you will have to pay for data transfer from s3. It's insanely cheap, like $0.0007 per GB. https://aws.amazon.com/s3/pricing/

I used PostGIS to transform the GeoJSON files into county-level CSV files with the geometry as WKT. These compress really nicely compared with GeoJSON and shapefiles, and the zip files can be opened directly in QGIS. It took a few days to figure out the process and run it for the entire dataset (all 50 states plus DC). My workflow is documented here: https://github.com/dlab-geo/msfootprints_by_county. You can see the output of this processing for CA in this google drive folder. Hope that helps. - Patty
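
If you want one of those county CSVs in R rather than QGIS, a minimal sketch (the file name 06001.csv and the geometry column name geom are assumptions; check the repo for the actual schema):

library(sf)

# read a county CSV and rebuild sf geometries from the WKT column
county <- read.csv("06001.csv", stringsAsFactors = FALSE)
county_sf <- st_as_sf(county, wkt = "geom")
st_crs(county_sf) <- 4326  # the footprints are published in WGS84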

@elijaflores6 I transformed the geojson into shapefiles and saved one indexed shapefile per county in an s3 bucket, here: s3://glr-ds-us-building-footprints. Files are named {county_fips}.{extension}, eg. 19075.shx. A 'full' shapefile consists of five files, with extensions: .dbf, .prj, .qix, .shp, .shx.

Hi, I am not a developer or programmer, but I do need California building shapefiles by county. I don't know what an s3 bucket is. How can I get them (if that's okay with you)?

@aquaraider333 I've got county-level shapefiles on s3. https://github.com/granularag/USBuildingFootprints

I actually signed up for an AWS S3 account, but somehow I cannot connect to your link, s3://glr-ds-us-building-footprints.

Is there a chance you could put the files somewhere else?

@MimiS2008 That link, s3://glr-ds-us-building-footprints, is an s3 path, not a URL for a webpage. The easiest way to interact with an s3 bucket is probably the aws cli: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

If you install that and run aws configure, you should be able to enter your AWS credentials, then use aws s3 <command> to interact with the bucket.

aws s3 ls s3://glr-ds-us-building-footprints will list the bucket contents, and you can use aws s3 cp to download the files you want.
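
For example, to pull all five files for a single county (using the 19075 FIPS code from the example above; this assumes the cli is installed and configured):

aws s3 cp s3://glr-ds-us-building-footprints/ . --recursive --exclude "*" --include "19075.*"

The county then loads in R with sf::st_read("19075.shp").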

https://github.com/woodb/geojsplit works for me.

node --max-old-space-size=20480 /usr/lib/node_modules/geojsplit/bin/geojsplit -a 1 -l 3000000 -v -o ~/data ~/data/California.geojson

I divided California.geojson into four files with at most 3,000,000 features each.

By default, node has only a ~512 MB memory limit, so --max-old-space-size=20480 raises it to 20 GB.