Efficient Computation for Large City Networks on Personal Computers
ruoyuch opened this issue · 27 comments
Is there a more efficient way to compute large city networks on a personal computer?
@dapugacheva and I (@ruoyuch) have attempted to reduce the chunk size, but the computation still fails to complete. Specifically, the Los Angeles Metro street network, which contains over 100,000 nodes and more than 60 different GTFS feeds, presents significant challenges. When running this computation locally, even on a Mac Studio workstation with 64 GB RAM and an M2 chip, the process crashes.
Given that the mentioned PC is more powerful than an average personal computer, we are seeking a solution to handle such extensive datasets more efficiently. Feel free to provide any possible assistance on this. @carlhiggs @gboeing
Feel free to use the shared Los Angeles data and .yml file to reproduce our problem: https://drive.google.com/file/d/1Y0jZXYRqnGisnGXjoc8l4QSwJJO-cz9y/view?usp=sharing
Common situations when the process crashes:
- When performing _03_create_network_resources.py, the error is psycopg2.errors.IndexCorrupted: index "idx_edges_geom_4326" contains unexpected zero page at block 1122.
- When performing _06_open_space_areas_setup.py, the error is psycopg2.errors.InternalError_: shared buffer hash table corrupted.
- When performing _11_neighbourhood_analysis.py, the error is: program killed due to memory issue.
Can you give some more details? How far does the process get? What step is it on when it finally crashes from memory exhaustion?
@gboeing Some details of the common crash points are given in the main issue comment. @dapugacheva feel free to provide some common crash points in your experience.
Hi @ruoyuch @dapugacheva @gboeing , sounds like something has happened that has corrupted the underlying spatial database. Perhaps the Docker postgis was restarted while indexing was occurring, or something like that ---- but I'm thinking the errors mentioned above are not simply memory issues (typically display as 'Killed'), but actually more serious ones that probably mean it will be worth dropping that database and starting again.
I will have limited capacity to work on this project until Friday due to other project commitments, but some considerations may be:
- memory allocated to Docker (we are agreed that developing a good guide to customising this in FAQ based on previous resolved issues is a priority)
- study region size (is it the metro area, or some larger conurbation? If it's a very large conurbation of multiple cities, perhaps these could be analysed separately, e.g. using the urban portion within administrative jurisdictions)
- again, suggestions for modifying the code to be more robust for large study regions are welcome (robustness to error is priority over optimisation)
- are you running this directly within python, or within Jupyter Lab? If the latter, maybe try doing it directly in python, as Jupyter may have additional memory settings that could require tweaking
Hope this helps,
Carl
Hi @carlhiggs, I tested the code using Jupyter Lab. I agree that running it directly in Python is a good idea. I'll give it a try later this week and let you know how it goes. Thanks. - Ruoyu
Hi everyone,
@gboeing Some details of the common crash points are given in the main issue comment. @dapugacheva feel free to provide some common crash points in your experience.
I ran the LA County analysis on a MacBook M2 with 16GB, and it crashed in the 11th step of the analysis, where the process was killed. The Docker memory was set to a maximum of 16GB.
@ruoyuch @dapugacheva can you share a Google link to your input so others can try with your exact inputs?
Hi Carl @carlhiggs and Geoff @gboeing
I had some time today to try the methods Carl suggested last week. Unfortunately, the analysis got stuck at step 06 multiple times, displaying the same error code we encountered before:
sqlalchemy.exc.InternalError: (psycopg2.errors.InternalError_) shared buffer hash table corrupted
Additionally, Docker showed the following warning message (see screenshot):
Based on this, I believe the issue is not primarily related to my machine's memory (where 64 GB allocated memory was normally used ~50% during the analysis process) but rather with Docker PostGIS. The PostGIS server is not fully compatible with my Apple silicon chip, causing the crash. However, other constraints might also appear after solving this one.
For reference, here are the detailed settings based on Carl's suggestions:
- memory allocated to Docker (we are agreed that developing a good guide to customising this in FAQ based on previous resolved issues is a priority)
The memory and swap allocated were set to maximum values, and the memory usage never exceeded 60% during the analysis.
- study region size (is it the metro area, or some larger conurbation? If it's a very large conurbation of multiple cities, perhaps these could be analysed separately, e.g. using the urban portion within administrative jurisdictions)
The LA Metro is a single, large urban area, which is difficult to calculate separately.
- are you running this directly within python, or within Jupyter Lab? If the latter, maybe try doing it directly in python, as Jupyter may have additional memory settings that could require tweaking
I tried both methods, and both crashed. I ran the code directly in Python by creating a single run_analysis.py file in the ./process folder and executing it from the terminal. This should avoid potential constraints of Jupyter Lab. Here are the command lines from the run_analysis.py:
# run_analysis.py #
import ghsci
ghsci.help() # Display help information
r = ghsci.example() # Run an example function
r = ghsci.Region('US_Los_Angeles') # Create a Region object for Los Angeles
r.analysis() # Perform analysis on the Region object
Any suggestions on how to proceed would be greatly appreciated.
Thanks,
Ruoyu
Hi @ruoyuch , thanks for the update; I'll aim to look into this later today (I think you shared access to your files somewhere; will try to find).
One thing I'm wondering --- have you tried dropping your database (r.drop()), and starting analysis again?
It sounds like your database may have been corrupted by an unfortunately timed restart or some other crash, and maybe just deleting it and starting over might help. Probably can leave the files in study regions output folder, which might make reprocessing quicker; but then again, if you really wanted a clean start, delete those too.
If you give that a go will be interesting to hear if it helps.
Otherwise, I'll write back here after I've looked at this in more depth later today.
Hi @ruoyuch ,
I've just started having a first look at your project --- looks like you've made a great start, but I do have some thoughts on where things could perhaps be improved to make them run more smoothly.
First, small thing, but I'd add a _2024 suffix to your study region name --- will make things easier later (eg. if you go to analyse 2025, etc).
The main issue I see is in your configuration file, where you have,
ghsl_urban_intersection: false
I very much recommend you set this to true, because the area you have supplied for your study region boundary is vast and includes a lot of desert from what I can see; it mostly is not urban. It may be the politically relevant area, but it is not an urban area, so arguably it doesn't make sense conceptually to be calculating urban walkability metrics for much of this space.
Restricting to the urban area has a few benefits:
- it is the area for which urban metrics like walkability are relevant (they're not designed/intended for rural areas; they have predominantly been developed for cities and towns)
- open data is often more accurate/reliable for cities than non-urban regions
- important for you >>> it makes processing so much easier working with a smaller study region; it will make it less memory and processing intensive, and quicker to run
The easiest path forwards will be to use the GHSL urban centres database to identify empirical urban areas. It's not perfect, but it's a good starting point. Get that working, critique it if you like, find something better perhaps; but I strongly recommend you restrict to some kind of urban boundary in some way.
Just to illustrate this, here is a map showing your configured boundary (yellow) juxtaposed with Greater Melbourne (white shading) and urban centres from GHSL (including Los Angeles) in orange --- it shows how truly vast the study region you are using is. Greater Melbourne includes non-urban areas --- I wouldn't analyse that whole area as though it were urban area, but it is a fraction the size of what you are analysing.
So --- my first recommendation is, use the urban intersection and make things easier on yourself, with important side benefit of increasing quality of data and valid usage of urban metrics.
Analysis is running for me, and I'll leave it running --- but I won't be surprised if it has some sort of issue, because the software hasn't been designed to analyse areas of this scale.
Having said that, there is plenty of scope for optimisation of the software -- that's absolutely welcomed.
But in the first instance, I don't believe it is a bug if the software can't analyse an area of this scale; it's just not what it's meant to do.
If the above suggestion, restricting the study region to urban area resolves this issue, perhaps we can close this issue?
Just an update on this: as mentioned, I continued processing using the very large study region --- as anticipated, it took a very long time to make progress; after an hour and a half it seemed to have processed population, but had also stalled -- not sure if that was a OneDrive sync issue, but in any case, no further apparent progress was made in the four hours following that, so I stopped the process, dropped the database and deleted the generated files for that study region.
As suggested above, while contributions to the software to support optimising analysis for very large regional areas are welcome, that's not the current intent of this software and I don't think the current boundary is the right one to use for this task.
I followed the recommendation above and set
ghsl_urban_intersection: true
This meant that the configured urban region and urban query for the LA study region would be used --- this should make things run much better for you. The comparison with a sprawling city like Greater Melbourne makes more sense now!
Actually, I suspect this may have been your intent all along, and it may have just been an oversight/typo with the ghsl_urban_intersection set to false. If that's the case, please feel free to close this issue. Thanks for checking in; I hope the above helped.
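Since a single false value had such a large effect here, a quick pre-flight check of the region configuration may be worthwhile. The following is only a minimal sketch using the standard library: the helper name urban_intersection_enabled is hypothetical, it scans the configuration text line by line rather than parsing YAML properly, and it only recognises the literal value true.

```python
import re

# Hypothetical helper: not part of ghsci. It only looks for an uncommented
# "ghsl_urban_intersection: <value>" line in the configuration text.
_PATTERN = re.compile(r"^\s*ghsl_urban_intersection\s*:\s*(\S+)", re.MULTILINE)


def urban_intersection_enabled(config_text: str) -> bool:
    """Return True if the text sets ghsl_urban_intersection to true."""
    match = _PATTERN.search(config_text)
    if match is None:
        # Key missing (or commented out): treat as not enabled.
        return False
    return match.group(1).strip("'\"").lower() == "true"


if __name__ == "__main__":
    sample = "name: Los Angeles\nghsl_urban_intersection: false\n"
    if not urban_intersection_enabled(sample):
        print("Warning: study region is not restricted to the GHSL urban area")
```

In practice the text would be read from the study region's .yml file before calling r.analysis().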
@carlhiggs Hi Carl,
I conducted several quick tests after adjusting the ghsl_urban_intersection setting you mentioned. I understand the logic behind this change and agree that it should help. However, the analysis still crashes. There might be some underlying issues causing this, so I will spend more time cleaning up the environment and eliminating potential factors, such as iCloud sync issues. I'll report back with my findings.
BTW, @dapugacheva, if your analysis is successful following Carl's suggestion, please let us know!
Thank you all!
Ruoyu
Sounds good Ruoyu; when you re-run after cleaning up, if you experience a crash it would be great to know more details.
Just an update on my re-running using the urban intersection, and time taken for analysis (as could be useful for us identifying pain points for optimisation of computation for large city networks as per thread title)
16:40 2024-06-19 : Commenced analysis using GHSL urban intersection for LA
Successfully processed the following in 2 hours and 20 minutes for LA:
- _00_create_database.py (instant)
- _01_create_study_region.py (a few seconds)
- _02_create_osm_resources.py (8 minutes)
- _03_create_network_resources.py (52.4 minutes; there's double handling downloading OpenStreetMap here)
- _04_create_population_grid.py (31.44 minutes)
- _05_compile_destinations.py (less than a minute)
- _06_open_space_areas_setup.py (36.2 minutes)
- _07_locate_origins_destinations.py (10 minutes)
- _08_destination_summary.py (a few seconds)
- _09_urban_covariates.py (1.43 minutes; should be much quicker but not a big deal in scheme of things)
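As a rough check on where the time goes, the step durations listed above can be summed and ranked. This is only a sketch: durations reported as "instant", "a few seconds" or "less than a minute" are approximated as 0 minutes, which is an assumption, so the total slightly understates the reported 2 hours and 20 minutes.

```python
# Minutes per step, taken from the list above. Steps reported without a
# number ("instant", "a few seconds", "less than a minute") are set to 0.0,
# which is an approximation.
step_minutes = {
    "_00_create_database": 0.0,
    "_01_create_study_region": 0.0,
    "_02_create_osm_resources": 8.0,
    "_03_create_network_resources": 52.4,
    "_04_create_population_grid": 31.44,
    "_05_compile_destinations": 0.0,
    "_06_open_space_areas_setup": 36.2,
    "_07_locate_origins_destinations": 10.0,
    "_08_destination_summary": 0.0,
    "_09_urban_covariates": 1.43,
}

total = sum(step_minutes.values())

# Rank steps from slowest to fastest to see the main bottlenecks.
ranked = sorted(step_minutes.items(), key=lambda item: item[1], reverse=True)

for name, minutes in ranked[:4]:
    print(f"{name}: {minutes:.1f} min ({minutes / total:.0%} of total)")
print(f"total: {total:.2f} min")
```

On these figures, network resources, open space setup and the population grid together account for most of the time before the neighbourhood analysis starts.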
19:00 2024-06-19: The GTFS configuration for LA had incorrect folder configured, so path didn't resolve (probably relates to how files were packaged up in zip when you sent them)
I found this error when I arrived at work in the morning next day (ideally, GTFS data paths should be checked in data checking to identify this before commencing analysis - #459).
So, changed
folder: us_los_angeles_gtfs
to
folder: la_transit_gtfs/us_los_angeles_gtfs
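A check along the lines suggested above (verifying GTFS paths before analysis starts) could look something like the following. It is a minimal sketch, not the project's data-checking code: the function name missing_gtfs_folders and the idea of passing a list of configured folder strings are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def missing_gtfs_folders(data_dir, folders):
    """Return the configured GTFS folders that do not exist under data_dir.

    Hypothetical helper: 'folders' holds the 'folder:' values from the
    region configuration, relative to the GTFS data directory.
    """
    base = Path(data_dir)
    return [folder for folder in folders if not (base / folder).is_dir()]


# Demonstration with a temporary directory standing in for the data folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "la_transit_gtfs" / "us_los_angeles_gtfs").mkdir(parents=True)
    configured = ["us_los_angeles_gtfs", "la_transit_gtfs/us_los_angeles_gtfs"]
    # Only the first entry is reported, since it is missing one directory level.
    print(missing_gtfs_folders(tmp, configured))
```

Running a check like this before step 00 would surface a wrong folder in seconds rather than hours into the workflow.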
09:12 2024-06-20: Recommenced analysis from the beginning so the remaining steps were included in the processing log.
09:25 2024-06-25: GTFS analysis commenced (i.e. it ran through the other bits; in theory we should be able to make re-running completed steps quicker, we just have to take care how we do it); took under 3 minutes to process the 67 configured feeds. No doubt could be made quicker if we used parallel processing, but there are bigger bottlenecks than that.
09:27 2024-06-27: Commenced neighbourhood analysis (the main bit).
- logging is excessive (see #457 for simple proposal to make more manageable )
- took 40 minutes to generate 1000m neighbourhoods for nodes (All pairs Dijkstra shortest path analysis; first pass pre-processing)
- took a further 40 minutes to iterate again and summarise attributes (average value from unique associated grid cells within nh buffer distance)
- the reason these steps were separated, and not done using parallel processing, was to avoid crashes relating to running out of memory/resources on lower spec computers
- but we could refactor to use parallel processing where possible, and perhaps avoid the double loop; that would save time
- but the next step 'Generating contraction hierarchies with 16 threads.' is the real bottleneck
- this is where we set up Pandana, a package we use for network analysis; it's been stuck on that bit for 2+ hours with no indication of progress.
- Container CPU usage: 99.95% / 1600% (16 CPUs available)
- Container memory usage: 18.68GB / 30.48GB
- Pandana hasn't been updated since 2021
- there's an open issue re slow network initiation for large networks: UDST/pandana#174
- Big opportunities for optimisation through re-factoring our approach to network analysis, e.g.
- follow advice in that Pandana thread re: chunking up study region (seems like there is a limit beyond which the set up becomes impossibly slow; that will be where CPU hits 100% presumably); one poster reported that had a dramatic impact
- use pgRouting (we already have this installed and ready to go; why not leverage the spatial database, which already has edges and nodes indexed for analysis?). Side benefit is by removing Pandana we reduce dependencies on software that is no longer being updated.
- incorporate r5 (would involve a larger re-factor but opens other possibilities for multimodal transport analysis, optionally using additional data like DEM to account for slope or other aspects in walking). This would add more dependencies too though.
Anyway --- it's still on that step for me, but I suspect this is the step that may have been crashing for you, in which case it traces back to Pandana, with possible workarounds (other than waiting/using smaller study region/using more powerful computer) as per above.
So things to consider re the topic of this thread, in what might be order of opportunities for gains:
- refactor network analysis (perhaps retire Pandana)
- make better use of our spatial database; do more in there and less in Python
- find ways to avoid double handling of openstreetmap data (using pbf, and overpass; should have option to use one or other)
- implement optional parallel processing, and identify which tasks might be able to be undertaken concurrently without risking overload of memory/cpu
Those are my general thoughts on the topic of this thread, about how to improve efficiency. But I think one thing we need to remember is that guiding people to use the urban intersection and correctly define study regions is very important to reduce the scope of analysis to a more manageable scale --- that was a big win, and I'm still hopeful that by using the urban intersection, even if the LA analysis is relatively slow, it will complete successfully on a competent computer.
yup --- 7 hours later and it's still stuck on Pandana initialisation for LA urban region; I agree, we need to find another solution for this. I won't be able to look at this tomorrow as I have other things I need to do, but will look to experiment some time next week. @ruoyuch @dapugacheva @shiqin-liu @gboeing , if you have thoughts/capacity for experimenting on resolving this with/without Pandana that's also welcome, as I've got quite a bit reduced capacity only working one day/week on this project now. Thoughts on resolving this step, ideally in a way that doesn't have any notable impact on analysis results, would be great!
The issue with a solution lodged on Pandana's GitHub, shared in the above post, looks like low-hanging fruit --- finding a way to optionally chunk up a region's analysis and then join it back after; if there's an easy fix that works identically to current methods, at least for now, that would be ideal. That's what I'll first look into next week when I get time, but if someone else gets to it first or has other thoughts to share, that is great.
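One way the "chunk up and join back" idea might be sketched is to tile the study region's bounding box and expand each tile by a buffer at least as large as the neighbourhood distance (1000 m here), so that nodes near a tile edge still see their full neighbourhood. This is an illustrative sketch only, not the approach from the Pandana thread or the project's code: the names Tile and make_tiles are hypothetical, and coordinates are assumed to be in a projected CRS with metre units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    core: tuple      # (minx, miny, maxx, maxy): results are kept for nodes here
    buffered: tuple  # core expanded by the buffer: network is loaded for this area


def make_tiles(bounds, nx, ny, buffer_m):
    """Split bounds into an nx-by-ny grid of tiles, each with a buffered extent."""
    minx, miny, maxx, maxy = bounds
    width = (maxx - minx) / nx
    height = (maxy - miny) / ny
    tiles = []
    for i in range(nx):
        for j in range(ny):
            x0, y0 = minx + i * width, miny + j * height
            x1, y1 = x0 + width, y0 + height
            core = (x0, y0, x1, y1)
            buffered = (x0 - buffer_m, y0 - buffer_m, x1 + buffer_m, y1 + buffer_m)
            tiles.append(Tile(core, buffered))
    return tiles


# Example: a 10 km x 10 km extent split into 4 tiles with a 1000 m buffer.
for tile in make_tiles((0, 0, 10_000, 10_000), nx=2, ny=2, buffer_m=1000):
    print(tile.core, tile.buffered)
```

Each tile's network would be built from the buffered extent, results kept only for nodes inside the core extent, and the per-tile results concatenated afterwards. Whether this reproduces the unchunked results exactly would need to be tested.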
update --- the LA urban region did process successfully in the end, but that particular step did seem to take unreasonably long. But, it didn't crash, so will be interested to hear if it succeeds for you in the end. Good to know that if you do have a powerful computer, it should be possible (I can post stats tomorrow, but Docker stats in the above post give an idea: 16 cores (underutilised), 32 GB RAM).
so that script 11, took 798 minutes, most of which was the Pandana pre-set up of contraction hierarchies.
But... on the other hand, although it took a long time it didn't crash for me (other than the other configuration issues noted above w/ GTFS data dir). Because it did successfully make it through analysis workflow, it will be less of a priority for me to refactor things. I fully support you if you want to explore ways to optimise and get equivalent results.
Hi @ruoyuch and @carlhiggs.
I just wanted to update you on my re-running of the same LA metro analysis in Python directly. I used ghsl_urban_intersection: true and chunks_size: 20 to reduce the risk of memory issues, with the maximum memory allocated to Docker resources (CPU 10 and memory limit 16GB). However, I ran the analysis multiple times and it always got stuck and was killed at 95% of generated neighbourhoods, at "Generate 1000m neighbourhoods for nodes (All pairs Dijkstra shortest path analysis)" in the _11_neighbourhood_analysis.py step, crashing with "returned non-zero exit status 137" (memory limits).
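For readers unfamiliar with exit status 137: by shell convention, statuses above 128 mean the process was terminated by signal number (status - 128), and 137 - 128 = 9 is SIGKILL on POSIX systems. A kill by the out-of-memory killer is a common cause but not the only possible one. The small sketch below decodes such statuses; the function name describe_exit_status is hypothetical.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a process exit status using the 128 + signal number convention."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the signal name on this platform (e.g. 9 -> SIGKILL on POSIX).
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {status}"


print(describe_exit_status(137))
```

A SIGKILL at a consistent point in a memory-heavy step, with Docker capped at 16GB, is consistent with the container running out of memory.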
The last run time breakdowns:
- _00_create_database.py (instant)
- _01_create_study_region.py (instant)
- _02_create_osm_resources.py (less than a minute; OSM data has been already imported)
- _03_create_network_resources.py (less than a minute; OSM data has been already imported)
- _04_create_population_grid.py (Population grid already exists)
- _05_compile_destinations.py (less than a minute)
- _06_open_space_areas_setup.py (less than a minute, AOS has been previously prepared)
- _07_locate_origins_destinations.py (61.07 minutes)
- _08_destination_summary.py (less than a minute)
- _09_urban_covariates.py (less than a minute)
- _10_gtfs_utils.py (1.60 minutes)
- _11_neighbourhood_analysis.py (95%|███████▌| 597116/629062 [13:06<19:19:18, 2.18s/nodes]Killed)
@carlhiggs that's great the LA metro analysis was successful on your machine - thank you for your time and your machine's time :) So, I guess my issue falls under memory limits. My suggestion would be to pre-process nodes_pop_intersect_density; however, there is a possibility that it still fails even when running independently. We will brainstorm and plan with @ruoyuch how we can increase efficiency and experiment further with our machines' capacities.
To add to the discussion, speaking from my current work on large network and accessibility analysis, things we have done to handle memory starvation: 1. trigger the garbage collector, that is, call gc.collect() explicitly after major computation steps, which helps reduce memory usage; 2. convert pandas dataframes to multi-index dataframes before joins, which helps reduce the size of the dataframe and improve memory efficiency. These steps could help improve memory usage but might not necessarily solve the issue when it comes to large network handling and the limited computation power of our personal laptops...
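The first suggestion might look roughly like this in a chunked loop. It is a generic sketch using only the standard library, not code from the project: iter_chunks and process_in_chunks are hypothetical names, and summarise stands in for whatever per-chunk computation is being run.

```python
import gc
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(items, chunk_size, summarise):
    """Apply summarise to each chunk, keeping only the (small) summaries."""
    results = []
    for chunk in iter_chunks(items, chunk_size):
        results.append(summarise(chunk))
        # Drop the reference to the chunk and ask the collector to run now,
        # rather than waiting for its next automatic pass.
        del chunk
        gc.collect()
    return results


# Example: sum node ids in chunks of four.
print(process_in_chunks(range(10), 4, sum))
```

Explicit collection mainly helps when large intermediate objects contain reference cycles; it cannot reduce the memory needed by a single chunk, so chunk size still matters.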
For processing time and efficiency, Polars might be a tool to look into. It is very similar to pandas, but it is natively multithreaded and designed to handle very large datasets. We have been using polars in our national accessibility work, the performance is very impressive when it comes to aggregating very large datasets.
I am also interested in conversations about refactoring network and routing analysis. I have not followed Pandana lately, but it seems like it has been inactive for many years? r5 might be an option as @carlhiggs suggests; it is a very well developed routing engine and has both R and Python interfaces, but this would be a major change in methodology. Should we add this to our discussion items in the spatial meeting?
Hello @carlhiggs and @dapugacheva,
I wanted to provide a quick update on my situation. Unfortunately, the process was not successfully completed on my M2 Mac as it kept crashing during step 03. The error message I am receiving is: psycopg2.errors.InternalError_: shared buffer hash table corrupted. This error persists even after re-downloading the software and clearing all existing data and the Docker environment. I am unsure if this issue is specific to the M2 chip or MacOS.
I will attempt to resolve this problem in late July, after my travel, when I have access to my workstation. Alternatively, I may try running the software using pure Python on my Intel MacBook with 16 GB of RAM (which suffered from the memory issue before).
I will keep you updated on any progress. Thank you for your assistance.
Hi Ruoyu, I am just wondering, perhaps your issue might not be due to M2 or MacOS but could be a hardware issue with your RAM, and maybe it is only noticeable because the LA analysis is intensive enough to try to use all of it?
Perhaps if a portion of your memory has some issues, and an attempt is made to use it because of the processing demands, the bad memory causes the failure on your system (independent of OS or hardware architecture)? Here's a couple of threads that suggest the 'shared buffer hash table corrupted' may relate to RAM errors:
https://postgrespro.com/list/thread-id/1651324 (from 2004)
https://postgrespro.com/list/thread-id/2478970 (from 2022)
I don't have a MacOS/M2 computer so can't test that aspect myself, but could be worth doing some kind of memtest as suggested in that thread perhaps?
Re: @shiqin-liu's suggestions -- I like these suggestions, but also agree that moving to r5 will be a big change and something to plan for and test the implications of well before we did anything like this. Could be worth considering the explicit garbage collection management as a first step, although Ruoyu's issue sounds like it's on the database side. Polars sounds interesting too --- could make things quicker, but again, not sure what is causing the memory issues experienced by Ruoyu in postgres. I don't think I've heard of others experiencing that specific issue.
Hi @carlhiggs , cc: @gboeing @dapugacheva ,
The LA analysis ran successfully on my Mac Studio this afternoon. The only change I made was updating my MacOS to Sequoia and Docker Desktop to version 4.34.0. Thanks for mentioning the potential RAM bug, @carlhiggs, but it doesn’t seem to be the issue in my case, as the RAM diagnostics came back normal.
I ran the analysis on a Mac Studio with the M2 Max chip and 64GB RAM. The entire process was completed in around 260 minutes, with Step 11 (similar to @carlhiggs ) taking approximately 210 minutes. I observed that the CPU usage was quite low throughout the analysis, with only one core actively in use. Memory usage hovered around 15–20GB.
I have attached a map from the result:
It is good to finish the analysis successfully on Mac after the upgrade, but it is sad to be unable to reproduce the error and find the actual reason.
Let me know if you have some thoughts on this.
Thanks,
Ruoyu
Thanks @ruoyuch. It looks like this resolves it then. @carlhiggs we may want to document a required "minimum versions" of Docker and MacOS somewhere if other users hit this problem.
@gboeing @carlhiggs If we need such documentation, I personally suggest "updating the system and docker desktop to the newest available version" and "doing a hardware check if it keeps crashing randomly."
Thanks @ruoyuch and @gboeing --- great you were able to run the analysis. Also, great that there is a lot of unused capacity in the computer --- that means, there could be a lot of scope for optimisation e.g. through implementation of optional parallel processing (when someone has time to implement that!).
While we aren't certain of the specific cause of the error, I suppose that's like many diseases/syndromes where we don't know the cause but recognise symptoms and ways of managing these ---- in that way, we could add/draft a brief entry in our FAQ record regarding this specific Docker issue, how it presents and how users have managed it successfully. That could link to this issue for more details, potentially. Would you be interested in drafting this, in line with your recommendations @ruoyuch ?
@ruoyuch thanks for sharing the results–great to see it finally running successfully on a Mac!
I'm wondering if the analysis would go through on my M2 Mac with 16 GB RAM after the latest MacOS version update. I doubt it, since your machine's memory usage was 15 GB at a minimum, but it's worth a try. Anyway, we might want to add to this FAQ section something like 'large metro areas may require higher usage of memory– if the analysis crashes/is killed at the same stage, it could be due to a memory capacity issue'.
How does the draft below look? @gboeing @carlhiggs
What to Consider if a Large Study Area Fails to Run Successfully?
- Insufficient Memory
Running large study areas requires substantial memory. For example, analyzing the Los Angeles Metro network (over 1 million edges) typically demands at least 32 GB of RAM. If you don’t have enough, consider reducing the chunk size to optimize performance. The chunk_size can be found in the config.yml file; follow the comments in the file.
- Hardware Issues
Hardware problems, such as faulty RAM, may only surface during intensive tasks. Ensure your hardware, particularly RAM, is functioning correctly before troubleshooting further.
- System and Docker Updates
Unexpected issues can arise with large study areas. Make sure your operating system and Docker Desktop are fully updated before beginning the analysis. To download the latest version of Docker Desktop, visit the official site or check updates in your existing software.
- Anticipate Time Consumption
Step 11, i.e., network-building for neighborhood analysis, often takes the longest in large urban areas. Consider running smaller test areas first, and plan to leave your machine running overnight for the full analysis.
Refer to a discussion thread on running Los Angeles Metro to get a sense of potential challenges.
Hi @ruoyuch , thank you for drafting this. Broadly I think this is great!
I wondered -- could you add some advice on how users can reduce the chunk size or link to one of our threads where we advise people to do that? It could also help with other points adding links, e.g. to Docker Desktop's site, just to make it easier for people.
Overall, I am sure others will find it really helpful. We can always build on it later if we want.
If you would like to add this in as an edit to the wiki FAQ, that would be great. Alternatively, @zarayousefi is in the process of planning updates to the wiki and FAQ more broadly, so may be happy to add this on your behalf.
@carlhiggs Thanks for checking it! I made some revisions to the previous comment. @zarayousefi Feel free to add it to the FAQ and let me know if you need any help from my end.