possible to avoid multiple dijkstra's path algorithms for each distance band
Closed this issue · 9 comments
I spent some more time looking at the single_source_dijkstra_path_length
function. It appears that it returns a dictionary in which each node is a key and the value is the weighted path length to that node. If so, this information could be used to avoid calling the single_source_dijkstra_path_length
function three times for each hospital.
To take advantage of this, it'd be better to save the dictionary object first (rather than simply passing it to the subgraph
function), and then see if the two lower distances can be extracted from it.
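As a quick sanity check on that idea, here is a minimal sketch on a toy graph (hypothetical edge weights, not the real street network) showing that one call with the largest cutoff returns a {node: length} dict from which the smaller bands can be filtered:

```python
import networkx as nx

# Toy graph with made-up "time" weights
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5), (1, 2, 10), (2, 3, 20)], weight="time")

# One call with the largest cutoff returns {node: shortest weighted length}
lengths = nx.single_source_dijkstra_path_length(G, 0, cutoff=30, weight="time")
# node 3 is 35 away, so it falls outside the 30 cutoff

# A smaller band can then be filtered out without re-running Dijkstra
within_10 = {n, } if False else {n: d for n, d in lengths.items() if d <= 10}
```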
So far... it doesn't seem possible to call nx.single_source_dijkstra_path_length
once with multiple (or a list of) maximum weights. I may be misinterpreting the concept but, I think, either way the nx.single_source_dijkstra_path_length
function will have to be run three times.
True: you can't provide more than one maximum weight to single_source_dijkstra_path_length
. However, it might be possible to be more creative with its output. The authors have just been using it to get a list of nodes for creating an ego graph or subgraph. Try to see whether the function returns a dictionary that also includes the path length to each node. If it does, then the two shorter distances could be selected from the output without re-running Dijkstra's shortest path algorithm.
I was able to figure out how to get separate dictionaries... so I didn't make any sweeping breakthroughs, but I am a little bit closer.
I have been trying to figure out how to take three separate dictionaries, turn them into subgraphs, and then into a list of polygons that can be "differenced" into the two donuts + donut hole. It might just take a few days of tinkering to actually get there....
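For the differencing step itself, here is a hedged sketch with stand-in buffered circles (not the real catchments) of how geopandas' overlay difference could cut three nested polygons into two donuts plus the donut hole:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical stand-ins for the 30/20/10-minute catchments: three
# nested circles, largest first
rings = [Point(0, 0).buffer(r) for r in (3.0, 2.0, 1.0)]
polys = [gpd.GeoDataFrame(geometry=[g]) for g in rings]

# Difference each larger polygon against the next smaller one, working
# from the outside in, so the three areas no longer overlap
for i in range(len(polys) - 1):
    polys[i] = gpd.overlay(polys[i], polys[i + 1], how="difference")

# polys is now [outer donut, middle donut, innermost donut hole]
```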
def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2])
    # ^ creating the largest dictionary, from which the 10 and 20 minute drive times can be extracted
    # extract values within 20 minute drive times
    nearest_nodes_20 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 20:
            nearest_nodes_20[key] = value
    # extract values within 10 minute drive times
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 10:
            nearest_nodes_10[key] = value

... from here we want to construct three subgraphs and create a list of three polygons (per hospital)
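On the subgraph step: G.subgraph() accepts any iterable of nodes, and iterating a dict yields its keys, so the Dijkstra output dicts can be passed in directly. A toy sketch:

```python
import networkx as nx

# Toy path graph; edges have the default weight of 1
G = nx.path_graph(5)

# Dijkstra returns a {node: length} dict; passing the dict to
# G.subgraph() iterates its keys, so no separate node list is needed
lengths = nx.single_source_dijkstra_path_length(G, 0, cutoff=2)
sub = G.subgraph(lengths)  # view containing nodes 0, 1, 2
```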
Side note: I was able to get closer to a reproduction using the buffered street network... The figures are the 'classified' maps in the results/reproduction section.
Ok, here is a pretty long-winded way of getting the three catchment areas while only running nx.single_source_dijkstra_path_length
once ... I haven't yet compared the run-time of this one to dijkstra_cca
and hospital_measure_acc
because they are nested into and/or dependent on some other functions.
Tomorrow I'm going to compare the run times, incorporate the difference geometric operation, and hopefully find a more efficient way to write the function.
def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    ## CREATE DICTIONARIES
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2]) # the largest dictionary, from which the 10 and 20 minute drive times can be extracted
    # extract values within 20 minute drive times
    nearest_nodes_20 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 20:
            nearest_nodes_20[key] = value
    # extract values within 10 minute drive times
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 10:
            nearest_nodes_10[key] = value

    ## DERIVE SUBGRAPHS FROM DICTIONARY NODES
    # create service area subgraphs for each
    service_area_30 = G.subgraph(nearest_nodes_30)
    service_area_20 = G.subgraph(nearest_nodes_20)
    service_area_10 = G.subgraph(nearest_nodes_10)

    ## TURN SUBGRAPHS INTO GDFS
    # create empty list to store polygons
    polygons = []

    # For 30 min drive time
    # create an x and y point for each node
    nodes_30 = [Point((data['x'], data['y'])) for node, data in service_area_30.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_30 = gpd.GeoSeries(nodes_30).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_30 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_30))
    # rename the geometry column
    polygon_30 = polygon_30.rename(columns={0:'geometry'}).set_geometry('geometry')

    # For 20 min drive time
    # create an x and y point for each node
    nodes_20 = [Point((data['x'], data['y'])) for node, data in service_area_20.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_20 = gpd.GeoSeries(nodes_20).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_20))
    # rename the geometry column
    polygon_20 = polygon_20.rename(columns={0:'geometry'}).set_geometry('geometry')

    # For 10 min drive time
    # create an x and y point for each node
    nodes_10 = [Point((data['x'], data['y'])) for node, data in service_area_10.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_10 = gpd.GeoSeries(nodes_10).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_10))
    # rename the geometry column
    polygon_10 = polygon_10.rename(columns={0:'geometry'}).set_geometry('geometry')

    # append each to list
    polygons.append(polygon_30)
    polygons.append(polygon_20)
    polygons.append(polygon_10)
    return polygons
The output is a list of our three catchment areas:
[ geometry
0 POLYGON ((-87.61689 41.89684, -87.68071 41.934...,
geometry
0 POLYGON ((-87.66875 41.94062, -87.67356 41.943...,
geometry
0 POLYGON ((-87.67064 41.95433, -87.67433 41.957...]
The new Dijkstra function is both working (producing the same results) and noticeably faster.
Using 1 processor (with multiple kernels open):
- catchment calc w new function: avg 6 min 51 seconds (3 tests)
- catchment calc w old function: avg 7 min 27 seconds (3 tests)
Using 4 processors (with multiple kernels open):
- catchment calc w new function: 2 min 31 seconds (1 test)
- catchment calc w old function: 2 min 45 seconds (1 test).
After condensing the 20 and 10 min node dictionary construction into one for loop, using 4 processors:
- catchment calc w new function: 2 min 16 seconds (4 tests)
- catchment calc w old function: 3 min 12 seconds (4 tests).
I haven't yet committed the changes to the code because the notebooks are a bit messy from troubleshooting.
There should be only two changes to the code to get the new function working:
- substitute the function below for
dijkstra_cca
:
def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    ## CREATE DICTIONARIES
    # create dictionary of nearest nodes for each distance (drive time) benchmark
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, distances[2], distance_unit) # the largest dictionary, from which the 10 and 20 minute drive times can be extracted
    # extract values within 20 and 10 minute drive times, respectively
    nearest_nodes_20 = dict()
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 20:
            nearest_nodes_20[key] = value
        if value <= 10:
            nearest_nodes_10[key] = value

    ## DERIVE SUBGRAPHS FROM DICTIONARY NODES
    # create service area subgraphs for each
    service_area_30 = G.subgraph(nearest_nodes_30)
    service_area_20 = G.subgraph(nearest_nodes_20)
    service_area_10 = G.subgraph(nearest_nodes_10)

    ## TURN SUBGRAPHS INTO GDFS
    # create empty list to store polygons
    polygons = []

    # For 30 min drive time
    # create an x and y point for each node
    nodes_30 = [Point((data['x'], data['y'])) for node, data in service_area_30.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_30 = gpd.GeoSeries(nodes_30).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_30 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_30))
    # rename the geometry column
    polygon_30 = polygon_30.rename(columns={0:'geometry'}).set_geometry('geometry')

    # For 20 min drive time
    # create an x and y point for each node
    nodes_20 = [Point((data['x'], data['y'])) for node, data in service_area_20.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_20 = gpd.GeoSeries(nodes_20).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_20))
    # rename the geometry column
    polygon_20 = polygon_20.rename(columns={0:'geometry'}).set_geometry('geometry')

    # For 10 min drive time
    # create an x and y point for each node
    nodes_10 = [Point((data['x'], data['y'])) for node, data in service_area_10.nodes(data=True)]
    # construct a convex hull out of the subgraph's nodes
    polygon_10 = gpd.GeoSeries(nodes_10).unary_union.convex_hull
    # turn the polygon into a GeoDataFrame
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_10))
    # rename the geometry column
    polygon_10 = polygon_10.rename(columns={0:'geometry'}).set_geometry('geometry')

    # append each to list
    ## they must be appended to the list in this order so that
    ## the difference operation below properly cuts out overlap
    ## (if done the other way, the largest polygon simply erases the
    ## geometries of its two counterparts)
    polygons.append(polygon_10)
    polygons.append(polygon_20)
    polygons.append(polygon_30)
    for i in reversed(range(1, len(distances))):
        polygons[i] = gpd.overlay(polygons[i], polygons[i-1], how="difference")
    return polygons
- And then replace the two for loops that create and difference the catchment areas with the one-liner below the
# create polygons
comment:
def hospital_measure_acc(_thread_id, hospital, pop_data, distances, weights):
    # create polygons
    polygons = dijkstra_cca_polygons(G, hospital['nearest_osm'], distances)
    num_pops = []
    for j in pop_data.index:
        point = pop_data['geometry'][j]
        for k in range(len(polygons)):
            if len(polygons[k]) > 0: # to exclude the weirdo (convex hull is not a polygon)
                if (point.within(polygons[k].iloc[0]["geometry"])):
                    num_pops.append(pop_data['pop'][j]*weights[k])
    total_pop = sum(num_pops)
    for i in range(len(distances)):
        polygons[i]['time'] = distances[i]
        polygons[i]['total_pop'] = total_pop
        polygons[i]['hospital_icu_beds'] = float(hospital['Adult ICU'])/polygons[i]['total_pop'] # proportion of # of ICU beds to population
        polygons[i]['hospital_vents'] = float(hospital['Total Vent'])/polygons[i]['total_pop'] # proportion of # of ventilators to population
        polygons[i].crs = {'init': 'epsg:4326'}
        polygons[i] = polygons[i].to_crs({'init': 'epsg:32616'})
    print('\rCatchment for hospital {:4.0f} complete'.format(_thread_id), end="")
    return (_thread_id, [polygon.copy(deep=True) for polygon in polygons])
Great!
Now I'm thinking further... why are we creating points and iterating through lists of nodes so many times? It shouldn't be necessary... so I worked on adding point geometry data to the graph attributes, and then ways of using that to create convex hulls.
Prior to any network analysis and after the network setting algorithm, create point geometries for the whole graph:
for node, data in G.nodes(data=True):
    data['geometry'] = Point(data['x'], data['y'])
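For example, on a toy graph with made-up coordinates, the attached geometries can later be read back with nx.get_node_attributes instead of rebuilding Points from x and y:

```python
import networkx as nx
from shapely.geometry import Point

# Toy graph with hypothetical x/y coordinates on each node
G = nx.Graph()
G.add_node(1, x=-87.6, y=41.9)
G.add_node(2, x=-87.7, y=41.95)

# Attach a Point geometry to every node once, up front
for node, data in G.nodes(data=True):
    data['geometry'] = Point(data['x'], data['y'])

# Later steps can then pull all geometries back out in one call
geoms = nx.get_node_attributes(G, 'geometry')
```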
Then, inside the Dijkstra function, use the geopandas and pandas libraries to get this work done without creating points from x
and y
and without calling the subgraph
function more than once:
# if the graph already has a geometry attribute with point data,
# this line will create a GeoPandas GeoDataFrame from the nearest_nodes_30 dictionary
points = gpd.GeoDataFrame(gpd.GeoSeries(nx.get_node_attributes(G.subgraph(nearest_nodes_30), 'geometry')))
# this line converts the nearest_nodes_30 dictionary into a Pandas data frame and joins it to points
# left_index=True and right_index=True are options for merge() to join on the index values
points = points.merge(pd.Series(nearest_nodes_30).to_frame(), left_index=True, right_index=True)
# re-name the columns and set the geodataframe geometry to the geometry column
points = points.rename(columns={'0_x':'geometry','0_y':'z'}).set_geometry('geometry')
# create a convex hull polygon from the points
polygon = gpd.GeoDataFrame(gpd.GeoSeries(points.unary_union.convex_hull))
polygon = polygon.rename(columns={0:'geometry'}).set_geometry('geometry')
# select nodes less than or equal to 20
points = points.query("z <= 20")
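To make the merge/rename behavior above concrete, here is a self-contained toy version (hypothetical node IDs, geometries, and travel times). Because both frames carry a column named 0, merge() suffixes them as '0_x' and '0_y':

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Stand-ins for the node geometry attributes and the Dijkstra output dict
geoms = {1: Point(0, 0), 2: Point(1, 1), 3: Point(2, 2)}
times = {1: 5, 2: 12, 3: 25}

# GeoDataFrame of geometries, joined to the travel times on the index
points = gpd.GeoDataFrame(gpd.GeoSeries(geoms))
points = points.merge(pd.Series(times).to_frame(), left_index=True, right_index=True)
# both inputs had a column named 0, so merge() produced '0_x' and '0_y'
points = points.rename(columns={'0_x': 'geometry', '0_y': 'z'}).set_geometry('geometry')

# query() then filters a band by travel time without another Dijkstra run
within_10 = points.query("z <= 10")
```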
Other things I learned while trying this:
- nx.single_source_dijkstra_path_length creates a dictionary of node IDs and travel times
- subgraph creates a view of a subset of a graph, and from that...
- G.nodes(data=True) creates a dictionary of nodes and their attributes, including x and y, from a graph or subgraph G
- nx.get_node_attributes(G, 'geometry') can then get a dictionary of all the geometries for graph G's nodes
I got the new function working (just calling one subgraph and going from there) and ran a few isolated tests (just calculating the catchment areas). It seemed a bit quicker. I tried implementing it into the whole notebook but I kept running into different errors. I'm going to take the afternoon off but will troubleshoot the errors tomorrow morning.
Here's the (semi) working function:
def dijkstra_cca_polygons_2(G, nearest_osm, distances, distance_unit = "time"):
    ## CREATE DICTIONARIES
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, distances[2], distance_unit) # the largest dictionary, from which the 10 and 20 minute drive times can be extracted
    # extract values within 20 and 10 minute drive times, respectively
    nearest_nodes_20 = dict()
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= 20:
            nearest_nodes_20[key] = value
        if value <= 10:
            nearest_nodes_10[key] = value

    ## CREATE POLYGONS FOR 3 DISTANCE CATEGORIES (10 min, 20 min, 30 min)
    # 30 MIN
    # if the graph already has a geometry attribute with point data,
    # this line will create a GeoPandas GeoDataFrame from the nearest_nodes_30 dictionary
    points = gpd.GeoDataFrame(gpd.GeoSeries(nx.get_node_attributes(G.subgraph(nearest_nodes_30), 'geometry')))
    # convert the nearest_nodes_30 dictionary into a pandas data frame and join it to points
    # left_index=True and right_index=True tell merge() to join on the index values
    points = points.merge(pd.Series(nearest_nodes_30).to_frame(), left_index=True, right_index=True)
    # rename the columns and set the GeoDataFrame geometry to the geometry column
    points = points.rename(columns={'0_x': 'geometry', '0_y': 'z'}).set_geometry('geometry')
    # create a convex hull polygon from the points
    polygon = gpd.GeoDataFrame(gpd.GeoSeries(points.unary_union.convex_hull))
    polygon = polygon.rename(columns={0: 'geometry'}).set_geometry('geometry')

    # 20 MIN
    # select nodes less than or equal to 20
    points_20 = points.query("z <= 20")
    # create a convex hull polygon from the points
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(points_20.unary_union.convex_hull))
    polygon_20 = polygon_20.rename(columns={0: 'geometry'}).set_geometry('geometry')

    # 10 MIN
    # select nodes less than or equal to 10
    points_10 = points.query("z <= 10")
    # create a convex hull polygon from the points
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(points_10.unary_union.convex_hull))
    polygon_10 = polygon_10.rename(columns={0: 'geometry'}).set_geometry('geometry')

    # create empty list and append polygons, smallest first,
    # so the difference operation below properly cuts out overlap
    polygons = []
    polygons.append(polygon_10)
    polygons.append(polygon_20)
    polygons.append(polygon)
    for i in reversed(range(1, len(distances))):
        polygons[i] = gpd.overlay(polygons[i], polygons[i-1], how="difference")
    return polygons
Ok so the newer algorithm seems to be running a bit quicker but the kernel was giving quite varied run-times. So, I'm going to do another set of tests either over the weekend or on Monday. For now, the newest version of the function (calling the subgraph once and constructing the geometries by querying point geometries) ran an avg of 2 min 35 sec over 4 tests. The older version of the function (creating 3 separate subgraphs and constructing polygons from each one) ran an avg of 3 min 20 sec over 4 tests.
I'm going to leave the original notebook as is for now. I'll incorporate the fastest running function to the improvements notebook but I want to further investigate these run-times when they are running a bit more consistently. I'll keep track of all these different functions in the Dijkstra-revisions.ipynb.
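Since the kernel timings were varying so much, a simple harness for averaging repeated trials with time.perf_counter might help make the comparison fairer (average_runtime is a hypothetical helper, not something in the notebooks; func would stand in for either dijkstra_cca_polygons variant):

```python
import time

def average_runtime(func, *args, n_trials=4, **kwargs):
    """Run func(*args, **kwargs) n_trials times and return the mean wall time."""
    elapsed = []
    for _ in range(n_trials):
        start = time.perf_counter()
        func(*args, **kwargs)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)
```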
Cool! Even if the run-time is about the same, I think this new way is more intuitive? The more that these notebooks can use the higher-level pandas and geopandas functions instead of lower-level code like for loops, the easier it will be for GIS people to understand.