Why do the number and x columns of '201105.shp' become 0 in the output of this code?

Question

1jiangxd opened this issue a year ago · 1 comment

(Two shp files have been uploaded to my GitHub repository)
https://github.com/1jiangxd/daskgeopandasproblems

The code I used is below. When I checked the processed '201105.shp', only the first 2 million rows had been processed; the original values in the remaining rows had all changed to 0.
Where does the problem lie in this code? If anyone can answer, I would greatly appreciate the help.

import geopandas as gpd
import time

import dask_geopandas

def process_row(row):
    outwen = r'201105.shp'
    bianjie = r'2023xian.shp'
    jiabianjie = r'E:\201105out'
    
    start_time3 = time.time()
    
    # Read input and clipped boundary shapefiles
    target_gdf = gpd.read_file(outwen)
    join_gdf = gpd.read_file(bianjie)
    
    # Switch to dask approach
    target_gdfnew = dask_geopandas.from_geopandas(target_gdf, npartitions=4)
       
    # Reproject the boundary participating in the join to match the CRS of the target geometry
    join_gdf = join_gdf.to_crs(target_gdf.crs)
    
    # Switch to dask approach
    join_gdfnew = dask_geopandas.from_geopandas(join_gdf, npartitions=4)
    
    # Use spatial join to find intersecting parts
    joined = gpd.sjoin(target_gdfnew, join_gdfnew, how='inner', predicate='intersects')
    
    # Add attributes from 'bianjie' to 'outwen'
    joined = joined.drop(columns='index_right')  # Remove redundant index column
    result = target_gdfnew.merge(joined, how='left', on=target_gdfnew.columns.to_list())
    
    # Save the result to the output boundary
    result.to_file(jiabianjie, encoding='utf-8-sig')  # Ensure the correct encoding is used
    
    end_time3 = time.time()
    execution_time3 = end_time3 - start_time3
    
    print(f"'{jiabianjie}' has added boundaries. Start time: {start_time3:.2f}, End time: {end_time3:.2f}, Execution time: {execution_time3:.2f} seconds")

process_row()

print('Finish')

Answer 1 · 2024-05-06T14:33:08.000Z

@1jiangxd apologies for the slow reply, but looking at your code, the following lines

    # Add attributes from 'bianjie' to 'outwen'
    joined = joined.drop(columns='index_right')  # Remove redundant index column
    result = target_gdfnew.merge(joined, how='left', on=target_gdfnew.columns.to_list())

are typically not needed. The result of the spatial join, joined, already has the columns of the original target_gdf, so this additional merge does nothing except bring back the original rows of target_gdf that had no match in the spatial join. To get the same result, you can do a left join directly (by specifying how='left' in the sjoin call).
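
As a rough illustration of the left-join suggestion, here is a minimal sketch on made-up data. The points, regions, and column values are invented for the example and are not taken from the uploaded shapefiles. It uses plain geopandas; whether dask_geopandas.sjoin accepts how='left' depends on the installed version, so that is worth checking before relying on it.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-in for the target layer (three points with a "number" column).
points = gpd.GeoDataFrame(
    {"number": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# Hypothetical stand-in for the boundary layer (two square regions).
regions = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# A left spatial join keeps every row of the left frame, matched or not,
# so no separate merge back onto the original frame is needed.
joined = gpd.sjoin(points, regions, how="left", predicate="intersects")

# Drop the helper column that records the index of the matched right-hand row.
joined = joined.drop(columns="index_right")

# The third point lies outside both regions, so its "name" is missing (NaN),
# while its own "number" value is left untouched.
print(joined)
```

The unmatched row keeps its original attribute values and only the columns coming from the boundary layer are empty, which is the behaviour the extra merge in the question was trying to reproduce.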

Also, I assume that the gpd.sjoin in your code above should be dask_geopandas.sjoin?