newfront/hitchhikers_guide_to_deltalake_streaming

Datasets dir does not exist

Closed this issue · 3 comments

Running cell 3 of the ecomm_csv_to_parquet.ipynb notebook:

# read and process initial dataset

ecomm_df = (
    spark.read.format("csv")
    .option("header", True)
    .schema(schema)
    .load(f"{dataset_dir}/{datasets[0]}")
)

gives the following error:
AnalysisException: Path does not exist: file:/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data/2019-Oct-sm.csv
Looking at the parent directory, I only see:

os.listdir('/opt/spark/work-dir/hitchhikers_guide')

['first-steps', 'pre-processing', 'when-things-go-bump-in-the-night']

I think fetching the dataset may not be included in the docker compose setup.
I did see it in the repo, and after copying it in manually everything works fine.
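
For anyone else hitting this, a quick check before the read makes the cause obvious. This is only a rough sketch: `missing_datasets` is a hypothetical helper (not part of the repo), and it assumes `dataset_dir` and `datasets` are the notebook's existing variables and that the path is on the local filesystem, as the `file:` prefix in the error suggests.

```python
from pathlib import Path


def missing_datasets(dataset_dir, names):
    """Return the dataset paths that do not exist under dataset_dir.

    Hypothetical helper for illustration; assumes a local filesystem path.
    """
    base = Path(dataset_dir)
    return [str(base / name) for name in names if not (base / name).exists()]


# Possible usage in the notebook, before calling spark.read:
# missing = missing_datasets(dataset_dir, datasets)
# if missing:
#     raise FileNotFoundError(f"Datasets not found (check the mount): {missing}")
```

An empty list means every file is where the notebook expects it; anything else points at the mount rather than at the Spark code.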

I looked at the PR; this might be moot.

Yeah, I botched the mount location in the non-ARM docker-compose file. I just made the change.