- GIS: Geographic Information System
- ESRI: A company whose GIS mapping software is some of the most powerful mapping and spatial data analytics technology available.
- Esri JSON: A proprietary JSON format for GIS data created by Esri.
- GeoJSON: http://geojson.org/ In 2015, the Internet Engineering Task Force (IETF), in conjunction with the original specification authors, formed a GeoJSON WG to standardize GeoJSON. RFC 7946 was published in August 2016 and is the new standard specification of the GeoJSON format, replacing the 2008 GeoJSON specification.
- Shapefile: https://en.wikipedia.org/wiki/Shapefile The shapefile format is a popular geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a (mostly) open specification for data interoperability among Esri and other GIS software products.
- If you need to install Apache Maven: https://maven.apache.org/install.html
- Esri examples and sample data: https://github.com/Esri/gis-tools-for-hadoop
git clone https://github.com/Esri/geometry-api-java
git clone https://github.com/Esri/spatial-framework-for-hadoop
hdfs dfs -mkdir /user/hive/udf_jars
hdfs dfs -put esri-geometry-api-2.0.0.jar /user/hive/udf_jars
hdfs dfs -put spatial-sdk-json-2.1.0-SNAPSHOT.jar /user/hive/udf_jars
hdfs dfs -put spatial-sdk-hive-2.1.0-SNAPSHOT.jar /user/hive/udf_jars
wget https://raw.githubusercontent.com/Esri/gis-tools-for-hadoop/master/samples/data/counties-data/california-counties.json
wget https://raw.githubusercontent.com/Esri/gis-tools-for-hadoop/master/samples/data/earthquake-data/earthquakes.csv
hdfs dfs -mkdir gis
hdfs dfs -mkdir gis/california-counties
hdfs dfs -mkdir gis/earthquakes
hdfs dfs -put california-counties.csv gis/california-counties
hdfs dfs -put earthquakes.csv gis/earthquakes
Building ogr2ogr to convert Shape files to GEOJson:
Compiling a recent ogr2ogr from source on CentOS (RHEL)
- wget http://download.osgeo.org/gdal/2.2.3/gdal-2.2.3.tar.gz
- wget http://download.osgeo.org/proj/proj-4.9.3.tar.gz
https://mapzen.com/data/metro-extracts/
Search for New York ---> Export
https://mapzen.com/data/metro-extracts/metro/new-york_new-york/
Shapefile 2 Geo JSON:
new-york_new-york.imposm-shapefiles]$ ogr2ogr -f GeoJSON -t_srs crs:84 new-york_new-york_osm_places.geojson new-york_new-york_osm_places.shp
All Starbucks locations in the world
US States:
https://github.com/shawnbot/topogram/blob/master/data/us-states.geojson
Esri Binning:
https://github.com/Esri/gis-tools-for-hadoop/wiki/Aggregating-CSV-Data-%28Spatial-Binning%29
Precision of Accruacy of Lat / Long:
https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude/8674#8674
Big Data and Analytics: A Conceptual Overview
http://proceedings.esri.com/library/userconf/proc15/tech-workshops/tw_445-253.pdf
ll geometry-api-java/target/esri-geometry-api-2.0.0.jar
ll spatial-framework-for-hadoop/hive/target/spatial-sdk-hive-2.1.0-SNAPSHOT.jar
ll spatial-framework-for-hadoop/json/target/spatial-sdk-json-2.1.0-SNAPSHOT.jar
https://www.census.gov/cgi-bin/geo/shapefiles2010/main
[ztrew@foo zipcode]$cat ../bin/ogr2ogr.sh
ogr2ogr -f GeoJSON -t_srs crs:84 ${1}.geojson ${1}.shp
[ztrew@foo zipcode]$ ../bin/ogr2ogr.sh tl_2010_us_zcta510
set hive.auto.convert.join=false;
add jar hdfs:///user/hive/udf_jars/esri-geometry-api-2.0.0.jar;
add jar hdfs:///user/hive/udf_jars/spatial-sdk-json-2.1.0-SNAPSHOT.jar;
add jar hdfs:///user/hive/udf_jars/spatial-sdk-hive-2.1.0-SNAPSHOT.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_GeomFromGeoJson as 'com.esri.hadoop.hive.ST_GeomFromGeoJson';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';
create temporary function ST_AsText as 'com.esri.hadoop.hive.ST_AsText';
create temporary function ST_AsGeoJson as 'com.esri.hadoop.hive.ST_AsGeoJson';
create temporary function ST_Bin as 'com.esri.hadoop.hive.ST_Bin';
create temporary function ST_BinEnvelope as 'com.esri.hadoop.hive.ST_BinEnvelope';
DROP TABLE IF EXISTS zipcode;
CREATE EXTERNAL TABLE IF NOT EXISTS zipcode (
ZCTA5CE10 STRING,
GEOID10 STRING,
CLASSFP10 STRING,
MTFCC10 STRING,
FUNCSTAT10 STRING,
ALAND10 STRING,
AWATER10 STRING,
INTPTLAT10 STRING,
INTPTLON10 STRING,
geometry binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.GeoJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedGeoJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/ztrew/zipcode';
select ZCTA5CE10 from zipcode limit 5;
SELECT z.ZCTA5CE10, s.addr, s.long, s.lat FROM zipcode z
JOIN starbucks s
WHERE ST_Contains(z.geometry, ST_Point(s.long, s.lat))
limit 5;
set hive.auto.convert.join=false;
add jar hdfs:///user/hive/udf_jars/esri-geometry-api-2.0.0.jar;
add jar hdfs:///user/hive/udf_jars/spatial-sdk-json-2.1.0-SNAPSHOT.jar;
add jar hdfs:///user/hive/udf_jars/spatial-sdk-hive-2.1.0-SNAPSHOT.jar;
create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
drop table if exists earthquakes;
drop table if exists counties;
CREATE TABLE earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
CREATE TABLE counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Load data into the respective tables:
LOAD DATA INPATH 'earthquake-demo/earthquake-data/earthquakes.csv' OVERWRITE INTO TABLE earthquakes;
LOAD DATA INPATH 'earthquake-demo/counties-data/california-counties.json' OVERWRITE INTO TABLE counties;
SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;
Total MapReduce CPU Time Spent: 15 seconds 590 msec
OK
Kern 36
San Bernardino 35
Imperial 28
Inyo 20
Los Angeles 18
Riverside 14
Monterey 14
Santa Clara 12
Fresno 11
San Benito 11
San Diego 7
Santa Cruz 5
San Luis Obispo 3
Ventura 3
Orange 2
San Mateo 1
Time taken: 67.654 seconds, Fetched: 16 row(s)
hive>