HEGSRR/RPr-Kang-2020

possible to avoid multiple Dijkstra's path algorithms for each distance band

Closed this issue · 9 comments

I spent some more time looking at the single_source_dijkstra_path_length function. It returns a dictionary keyed by node, and the value for each node appears to be the weighted path length from the source. If so, this information could be used to avoid calling single_source_dijkstra_path_length three times for each hospital.

To take advantage of this, it would be better to save the dictionary object first (rather than simply passing it to the subgraph function) and then see whether the two lower distance bands can be extracted from it.

https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.shortest_paths.weighted.single_source_dijkstra_path_length.html#networkx.algorithms.shortest_paths.weighted.single_source_dijkstra_path_length
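For illustration, a minimal sketch of the idea, assuming G is a NetworkX graph with a 'time' edge attribute and source_node is a node in it (both names are placeholders):

import networkx as nx

# run Dijkstra once, out to the largest cutoff (e.g. 30 minutes),
# weighting edges by the 'time' attribute
lengths = nx.single_source_dijkstra_path_length(G, source_node, cutoff=30, weight='time')

# lengths maps node -> travel time; the smaller bands are just filters of it
nodes_20 = {n: t for n, t in lengths.items() if t <= 20}
nodes_10 = {n: t for n, t in lengths.items() if t <= 10}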

So far... it doesn't seem possible to call nx.single_source_dijkstra_path_length once with multiple (or a list of) cutoff weights. I may be misinterpreting the concept, but I think either way the nx.single_source_dijkstra_path_length function will have to be run three times.

True: you can't provide more than one maximum weight (cutoff) to single_source_dijkstra_path_length. However, it might be possible to be more creative with its output. The authors have just been using it to get a list of nodes for creating an ego graph or subgraph. Try to see whether the function returns a dictionary that also includes the path length to each node. If it does, then the two shorter distance bands could be selected from that output without re-running Dijkstra's shortest path algorithm.

I was able to figure out how to get separate dictionaries... so I didn't make any sweeping breakthroughs, but I am a little bit closer.

I have been trying to figure out how to take the three separate dictionaries, turn them into subgraphs, and then produce a list of polygons that can be "differenced" into the two donuts plus the donut hole (one possible sketch of the differencing step follows the code below). It might just take a few days of tinkering to actually get there...

def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2], weight=distance_unit)
    # ^ creating the largest graph, from which the 10 and 20 minute drive times can be extracted
    
    # extract values within the 20 minute drive time
    nearest_nodes_20 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= distances[1]:
            nearest_nodes_20[key] = value

    # extract values within the 10 minute drive time
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= distances[0]:
            nearest_nodes_10[key] = value

    # ... from here we want to construct three subgraphs and create a list of three polygons (per hospital)
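For what it's worth, a minimal sketch of the differencing idea, assuming the three convex-hull GeoDataFrames already exist (variable names hypothetical, ordered smallest to largest):

import geopandas as gpd

# polygons ordered smallest to largest: [10 min, 20 min, 30 min]
polygons = [polygon_10, polygon_20, polygon_30]

# subtract each smaller polygon from the next larger one, leaving
# two "donuts" around the innermost "donut hole"
for i in reversed(range(1, len(polygons))):
    polygons[i] = gpd.overlay(polygons[i], polygons[i - 1], how="difference")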

Side note: I was able to get closer to a reproduction using the buffered street network... The figures are the 'classified' maps in the results/reproduction section.

Ok, here is a pretty long-winded way of getting the three catchment areas while running nx.single_source_dijkstra_path_length only once... I haven't yet compared its run time to that of the original dijkstra_cca and hospital_measure_acc because they are nested into and/or dependent on some other functions.

Tomorrow I'm going to compare the run times, incorporate the difference geometric operation, and hopefully find a more efficient way to write the function.

def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    
    ## CREATE DICTIONARIES
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2], weight=distance_unit) # creating the largest graph, from which the 10 and 20 minute drive times can be extracted
    
    # extract values within the 20 minute drive time
    nearest_nodes_20 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= distances[1]:
            nearest_nodes_20[key] = value

    # extract values within the 10 minute drive time
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= distances[0]:
            nearest_nodes_10[key] = value
            
    ## DERIVE SUBGRAPHS FROM DICTIONARY NODES
    # create service areas subgraphs for each
    service_area_30 = G.subgraph(nearest_nodes_30)
    service_area_20 = G.subgraph(nearest_nodes_20)
    service_area_10 = G.subgraph(nearest_nodes_10)

    ## TURN SUBGRAPHS INTO GDFS
    
    # create empty list to store polygons
    polygons = []
        
    # For 30 min drive time
    # creates an x and y point for each node
    nodes_30 = [Point((data['x'], data['y'])) for node, data in service_area_30.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_30 = gpd.GeoSeries(nodes_30).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_30 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_30)) ## change polygon to geopandas
    # renames geometry column
    polygon_30 = polygon_30.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # For 20 min drive time
    # creates an x and y point for each node
    nodes_20 = [Point((data['x'], data['y'])) for node, data in service_area_20.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_20 = gpd.GeoSeries(nodes_20).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_20)) ## change polygon to geopandas
    # renames geometry column
    polygon_20 = polygon_20.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # For 10 min drive time
    # creates an x and y point for each node
    nodes_10 = [Point((data['x'], data['y'])) for node, data in service_area_10.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_10 = gpd.GeoSeries(nodes_10).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_10)) ## change polygon to geopandas
    # renames geometry column
    polygon_10 = polygon_10.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # append each to list
    polygons.append(polygon_30)
    polygons.append(polygon_20)
    polygons.append(polygon_10)

    return polygons
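For context, a hypothetical call (G, hospital, and the distance bands are assumed from the surrounding notebook):

polygons = dijkstra_cca_polygons(G, hospital['nearest_osm'], [10, 20, 30])
print(polygons)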

The output is a list of our three catchment areas:

[ geometry
0 POLYGON ((-87.61689 41.89684, -87.68071 41.934...,
geometry
0 POLYGON ((-87.66875 41.94062, -87.67356 41.943...,
geometry
0 POLYGON ((-87.67064 41.95433, -87.67433 41.957...]

The new Dijkstra function is both working (producing the same results) and noticeably faster.

Using 1 processor (with multiple kernels open):

  • catchment calc w new function: avg 6 min 51 seconds (3 tests)
  • catchment calc w old function: avg 7 min 27 seconds (3 tests)

Using 4 processors (with multiple kernels open):

  • catchment calc w new function: 2 min 31 seconds (1 test)
  • catchment calc w old function: 2 min 45 seconds (1 test)

After condensing the 20 and 10 min node dictionary construction into one for loop, again using 4 processors:

  • catchment calc w new function: 2 min 16 seconds (4 tests)
  • catchment calc w old function: 3 min 12 seconds (4 tests)
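For reference, a minimal sketch of one way such comparisons could be timed (a hypothetical helper, not necessarily how the notebook tests were run):

import time

def avg_runtime(fn, *args, n_tests=3):
    # average wall-clock runtime of fn(*args) over n_tests runs (hypothetical helper)
    elapsed = []
    for _ in range(n_tests):
        start = time.perf_counter()
        fn(*args)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / n_tests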

I haven't yet committed the changes to the code because the notebooks are a bit messy from troubleshooting.

There should be only two changes to the code to get the new function working:

  1. Substitute the function below for dijkstra_cca:
def dijkstra_cca_polygons(G, nearest_osm, distances, distance_unit = "time"):
    
    ## CREATE DICTIONARIES 
    # create dictionary of nearest nodes for each distance (Drive Time) benchmark
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2], weight=distance_unit) # creating the largest graph, from which the 10 and 20 minute drive times can be extracted
    
    # extract values within the 20 and 10 minute drive times, respectively
    nearest_nodes_20 = dict()
    nearest_nodes_10 = dict()
    for key, value in nearest_nodes_30.items():
        if value <= distances[1]:
            nearest_nodes_20[key] = value
        if value <= distances[0]:
            nearest_nodes_10[key] = value
            
    ## DERIVE SUBGRAPHS FROM DICTIONARY NODES
    # create service areas subgraphs for each
    service_area_30 = G.subgraph(nearest_nodes_30)
    service_area_20 = G.subgraph(nearest_nodes_20)
    service_area_10 = G.subgraph(nearest_nodes_10)

    ## TURN SUBGRAPHS INTO GDFS
    # create empty list to store polygons
    polygons = []
        
    # For 30 min drive time
    # creates an x and y point for each node
    nodes_30 = [Point((data['x'], data['y'])) for node, data in service_area_30.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_30 = gpd.GeoSeries(nodes_30).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_30 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_30)) ## change polygon to geopandas
    # renames geometry column
    polygon_30 = polygon_30.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # For 20 min drive time
    # creates an x and y point for each node
    nodes_20 = [Point((data['x'], data['y'])) for node, data in service_area_20.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_20 = gpd.GeoSeries(nodes_20).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_20)) ## change polygon to geopandas
    # renames geometry column
    polygon_20 = polygon_20.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # For 10 min drive time
    # creates an x and y point for each node
    nodes_10 = [Point((data['x'], data['y'])) for node, data in service_area_10.nodes(data=True)]
    # constructs a convex hull around the subgraph's nodes
    polygon_10 = gpd.GeoSeries(nodes_10).unary_union.convex_hull ## to create convex hull
    # turns the polygon into a GeoDataFrame
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(polygon_10)) ## change polygon to geopandas
    # renames geometry column
    polygon_10 = polygon_10.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # append each to list
    ## they must be appended to the list in this order so that
    ## the difference operation below properly cuts out overlap
    ## (if done the other way, the largest polygon simply erases the
    ## geometries of its two counterparts)
    polygons.append(polygon_10)
    polygons.append(polygon_20)
    polygons.append(polygon_30)
    
    for i in reversed(range(1, len(distances))):
        polygons[i] = gpd.overlay(polygons[i], polygons[i-1], how="difference")

    return polygons
  2. And then substitute the two for loops that create and difference the catchment areas with the one-liner below the # create polygons comment:
def hospital_measure_acc (_thread_id, hospital, pop_data, distances, weights):
    
    # create polygons
    polygons = dijkstra_cca_polygons(G, hospital['nearest_osm'], distances)
    
    num_pops = []
    for j in pop_data.index:
        point = pop_data['geometry'][j]
        for k in range(len(polygons)):
            if len(polygons[k]) > 0: # to exclude degenerate cases where the convex hull is not a polygon
                if (point.within(polygons[k].iloc[0]["geometry"])):
                    num_pops.append(pop_data['pop'][j]*weights[k])  
    total_pop = sum(num_pops)
    for i in range(len(distances)):
        polygons[i]['time']=distances[i]
        polygons[i]['total_pop']=total_pop
        polygons[i]['hospital_icu_beds'] = float(hospital['Adult ICU'])/polygons[i]['total_pop'] # ICU beds per person within the catchment
        polygons[i]['hospital_vents'] = float(hospital['Total Vent'])/polygons[i]['total_pop'] # ventilators per person within the catchment
        polygons[i] = polygons[i].set_crs(epsg=4326).to_crs(epsg=32616)
    print('\rCatchment for hospital {:4.0f} complete'.format(_thread_id), end="")
    return(_thread_id, [ polygon.copy(deep=True) for polygon in polygons ])

Great!

Now I'm thinking further... why are we creating points and iterating through lists of nodes so many times? It shouldn't be necessary... so I worked on adding point geometry data to the graph attributes, and then on ways of using that to create convex hulls.

Prior to any network analysis and after the network setting algorithm, create point geometries for the whole graph:

for node, data in G.nodes(data=True):
    data['geometry'] = Point(data['x'], data['y'])
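(Since subgraph views share their node attribute dictionaries with the parent graph, geometry added to G this way is also visible in any subgraph view of G.)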

Then, inside the Dijkstra function, use the geopandas and pandas libraries to get this work done without creating points from x and y and without calling more than one subgraph function:

# if the graph already has a geometry attribute with point data,
# this line will create a GeoPandas GeoDataFrame from the nearest_nodes_30 dictionary
points = gpd.GeoDataFrame(gpd.GeoSeries(nx.get_node_attributes(G.subgraph(nearest_nodes_30), 'geometry')))

# this line converts the nearest_nodes_30 dictionary into a Pandas data frame and joins it to points
# left_index=True and right_index=True are options for merge() to join on the index values
points = points.merge(pd.Series(nearest_nodes_30).to_frame(), left_index=True, right_index=True)

# rename the columns and set the geodataframe geometry to the geometry column
points = points.rename(columns={'0_x':'geometry','0_y':'z'}).set_geometry('geometry')

# create a convex hull polygon from the points
polygon = gpd.GeoDataFrame(gpd.GeoSeries(points.unary_union.convex_hull))
polygon = polygon.rename(columns={0:'geometry'}).set_geometry('geometry')

# select nodes less than or equal to 20
points = points.query("z <= 20")

Other things I learned while trying this (tied together in the toy example below):

  • nx.single_source_dijkstra_path_length creates a dictionary of node IDs and travel times
  • subgraph creates a view of a subset of a graph, and from that...
  • G.nodes(data=True) creates a dictionary of nodes and their attributes, including x and y, from a graph or subgraph G
  • nx.get_node_attributes(G, 'geometry') can then get a dictionary of all the geometries for graph G's nodes
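A self-contained toy example (the graph and its attribute values are made up for illustration):

import networkx as nx
from shapely.geometry import Point

# toy graph: three nodes on a line, with edge 'time' weights in minutes
G = nx.Graph()
G.add_node(1, x=0.0, y=0.0)
G.add_node(2, x=1.0, y=0.0)
G.add_node(3, x=2.0, y=0.0)
G.add_edge(1, 2, time=5)
G.add_edge(2, 3, time=30)

# dictionary of node ID -> travel time from node 1, cut off at 20 minutes
lengths = nx.single_source_dijkstra_path_length(G, 1, cutoff=20, weight='time')
print(lengths)  # {1: 0, 2: 5}

# view of the subset of G reachable within the cutoff
sub = G.subgraph(lengths)

# attach point geometries, then read them back as a dictionary
for node, data in sub.nodes(data=True):
    data['geometry'] = Point(data['x'], data['y'])
print(nx.get_node_attributes(sub, 'geometry'))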

I got the new function working (calling just one subgraph and going from there) and ran a few isolated tests (just calculating the catchment areas). It seemed a bit quicker. I tried implementing it in the whole notebook, but I kept running into different errors. I'm going to take the afternoon off but will troubleshoot the errors tomorrow morning.

Here's the (semi) working function:

def dijkstra_cca_polygons_2(G, nearest_osm, distances, distance_unit = "time"):
    
    ## CREATE DICTIONARY
    # create dictionary of nearest nodes
    nearest_nodes_30 = nx.single_source_dijkstra_path_length(G, nearest_osm, cutoff=distances[2], weight=distance_unit) # creating the largest graph, from which the 10 and 20 minute drive times can be extracted
    # (no separate 20 and 10 minute dictionaries are needed here;
    # the smaller bands are selected below by querying travel times)
    
    ## CREATE POLYGONS FOR 3 DISTANCE CATEGORIES (10 min, 20 min, 30 min)
    # 30 MIN
    # if the graph already has a geometry attribute with point data,
    # this line will create a GeoPandas GeoDataFrame from the nearest_nodes_30 dictionary
    points = gpd.GeoDataFrame(gpd.GeoSeries(nx.get_node_attributes(G.subgraph(nearest_nodes_30), 'geometry')))

    # this line converts the nearest_nodes_30 dictionary into a Pandas data frame and joins it to points
    # left_index=True and right_index=True are options for merge() to join on the index values
    points = points.merge(pd.Series(nearest_nodes_30).to_frame(), left_index=True, right_index=True)

    # rename the columns and set the geodataframe geometry to the geometry column
    points = points.rename(columns={'0_x':'geometry','0_y':'z'}).set_geometry('geometry')

    # create a convex hull polygon from the points
    polygon = gpd.GeoDataFrame(gpd.GeoSeries(points.unary_union.convex_hull))
    polygon = polygon.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # 20 MIN
    # select nodes with travel times less than or equal to 20 minutes
    points_20 = points.query(f"z <= {distances[1]}")
    
    # create a convex hull polygon from the points
    polygon_20 = gpd.GeoDataFrame(gpd.GeoSeries(points_20.unary_union.convex_hull))
    polygon_20 = polygon_20.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # 10 MIN
    # select nodes with travel times less than or equal to 10 minutes
    points_10 = points.query(f"z <= {distances[0]}")
    
    # create a convex hull polygon from the points
    polygon_10 = gpd.GeoDataFrame(gpd.GeoSeries(points_10.unary_union.convex_hull))
    polygon_10 = polygon_10.rename(columns={0:'geometry'}).set_geometry('geometry')
    
    # Create empty list and append polygons
    polygons = []
    
    # append
    polygons.append(polygon_10)
    polygons.append(polygon_20)
    polygons.append(polygon)
    
    for i in reversed(range(1, len(distances))):
        polygons[i] = gpd.overlay(polygons[i], polygons[i-1], how="difference")

    return polygons

Ok, so the newer algorithm seems to be running a bit quicker, but the kernel was giving quite varied run-times, so I'm going to do another set of tests either over the weekend or on Monday. For now, the newest version of the function (calling subgraph once and constructing the geometries by querying point geometries) ran an average of 2 min 35 sec over 4 tests. The older version (creating three separate subgraphs and constructing polygons from each one) ran an average of 3 min 20 sec over 4 tests.

I'm going to leave the original notebook as is for now. I'll incorporate the fastest-running function into the improvements notebook, but I want to further investigate these run-times once they are a bit more consistent. I'll keep track of all these different functions in Dijkstra-revisions.ipynb.

Cool! Even if the run-time is about the same, I think this new way is more intuitive? The more that these notebooks can use the higher-level pandas and geopandas functions instead of lower-level code like for loops, the easier it will be for GIS people to understand.
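As a hypothetical sketch of that idea (not from the notebooks), the nested point-in-polygon loop in hospital_measure_acc could become a vectorized within() test; pop_data, polygons, and weights are assumed from the function above:

# hypothetical sketch: vectorized point-in-polygon weighting, assuming
# pop_data is a GeoDataFrame with 'geometry' and 'pop' columns, and
# polygons/weights come from hospital_measure_acc above
total_pop = 0
for k, poly in enumerate(polygons):
    if len(poly) > 0:  # skip degenerate hulls
        inside = pop_data.within(poly.iloc[0]['geometry'])  # boolean Series
        total_pop += (pop_data.loc[inside, 'pop'] * weights[k]).sum()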