Reading a file again incorrectly restricts to subset of schema

Question

ingomueller-net opened this issue 5 years ago · 5 comments

Thanks a lot for the great project!

I ran into a bug that occurs when reading the same file twice. The following program exhibits the problem:

import argparse

import pyspark.sql

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='Input ROOT file')
parser.add_argument('-t', '--tree',  help='Name of tree to open', default='Events')
args = parser.parse_args()


spark = pyspark.sql.SparkSession.builder \
    .config('spark.jars.packages', 'edu.vanderbilt.accre:laurelin:1.0.0') \
    .getOrCreate()
sc = spark.sparkContext
df = spark.read.format('root') \
                .option('tree', args.tree) \
                .load(args.input)
df.printSchema()
df.select('Int32').show()

#df = spark.read.format('root') \
#                .option('tree', args.tree) \
#                .load(args.input)
df.select('Int64').show()

Run it as follows (using uproot-small-flat-tree.root from this repository):

spark-submit \
    --packages edu.vanderbilt.accre:laurelin:1.0.0 \
    hello-world.py -i uproot-small-flat-tree.root -t tree

If executed as is, the program prints the first 20 rows of the Int32 and Int64 columns, respectively. If you uncomment the three lines towards the end, which read the file anew, the following error is thrown:

Traceback (most recent call last):
  File "/home/muellein/git/root-playground/laurelin/python/hello-world.py", line 24, in <module>
    df.select('Int64').show()
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1321, in select
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`Int64`' given input columns: [Int32];;\n'Project ['Int64]\n+- RelationV2 root[Int32#44] (Options: [tree=tree,path=../../testdata/small-flat-tree.root,paths=[]])\n"

It seems like something caches the restriction to a subset of the schema: after the first select, a fresh read of the same file only sees the Int32 column.

The same happens when using the library in Java.
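
To make the suspicion concrete, here is a toy model in plain Python of how a cache keyed by file path could produce exactly this symptom. It is only a sketch of the suspected mechanism, not Laurelin's actual code: `FILE_COLUMNS`, `Frame`, `load`, and `run` are all hypothetical names made up for illustration.

```python
# Toy model of the suspected bug. Hypothetical names throughout;
# this is NOT how Laurelin is implemented.

FILE_COLUMNS = ["Int32", "Int64"]  # pretend on-disk schema of the file


class Frame:
    def __init__(self, columns, cache, path):
        self.columns = columns
        self.cache = cache
        self.path = path

    def select(self, name):
        if name not in self.columns:
            raise ValueError(
                f"cannot resolve '{name}' given input columns: {self.columns}"
            )
        # Suspected bug: the pruned schema is written back into a cache
        # that later reads of the same path will pick up.
        self.cache[self.path] = [name]
        return Frame([name], self.cache, self.path)


def load(path, cache):
    # The first read stores the full schema; later reads reuse the cache entry.
    columns = cache.setdefault(path, list(FILE_COLUMNS))
    return Frame(list(columns), cache, path)


def run(reread):
    cache = {}
    df = load("small-flat-tree.root", cache)
    df.select("Int32")
    if reread:
        # Corresponds to uncommenting the second spark.read in the script.
        df = load("small-flat-tree.root", cache)
    try:
        df.select("Int64")
        return "ok"
    except ValueError as err:
        return str(err)


if __name__ == "__main__":
    print(run(reread=False))
    print(run(reread=True))
```

Without the second read, the original frame still holds the full schema and both selects succeed. With the second read, the new frame is built from the pruned cache entry and the select on Int64 fails, which matches the error above.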

Answer 1 · 2020-03-19T15:05:50.000Z

Hello Ingo! Thanks for the report -- that's definitely a bug in a cache I had built. I'll release a 1.0.1 shortly that disables that cache. By the way -- I'm quite close to finishing Spark 3 support, which I'm excited about. Cheers, Andrew

…

Answer 2 · 2020-03-19T15:14:19.000Z

Thanks a lot, Andrew! Much appreciated.

Answer 3 · 2020-03-20T17:23:57.000Z

I've made a 1.0.1 with a fix for this. It should propagate to Maven Central in the next hour or two.
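
For anyone picking up the fix, a minimal sketch of pointing the reproduction script at the new release. It assumes 1.0.1 is published under the same group and artifact ID as 1.0.0, which this thread does not state explicitly; `LAURELIN_VERSION`, `LAURELIN_COORDS`, and `make_session` are names made up for this example.

```python
# Assumption: 1.0.1 uses the same Maven coordinates as 1.0.0,
# with only the version number changed.
LAURELIN_VERSION = "1.0.1"
LAURELIN_COORDS = f"edu.vanderbilt.accre:laurelin:{LAURELIN_VERSION}"


def make_session():
    # Imported here so the coordinates above can be inspected
    # without having pyspark installed.
    import pyspark.sql

    return (
        pyspark.sql.SparkSession.builder
        .config("spark.jars.packages", LAURELIN_COORDS)
        .getOrCreate()
    )
```

The `--packages` argument to spark-submit would need the same version bump.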

Answer 4 · 2020-03-20T18:19:52.000Z

Awesome, thanks for the quick fix!

Answer 5 · 2020-03-23T09:35:31.000Z

Wonderful, that was fast! We'll try this out.