spark-root/laurelin

Reading a file again incorrectly restricts to subset of schema

ingomueller-net opened this issue · 5 comments

Thanks a lot for the great project!

I ran into a bug that happens when reading the same file twice. The following program exhibits the problem:

import argparse

import pyspark.sql

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='Input ROOT file')
parser.add_argument('-t', '--tree',  help='Name of tree to open', default='Events')
args = parser.parse_args()


spark = pyspark.sql.SparkSession.builder \
    .config('spark.jars.packages', 'edu.vanderbilt.accre:laurelin:1.0.0') \
    .getOrCreate()
sc = spark.sparkContext
df = spark.read.format('root') \
                .option('tree', args.tree) \
                .load(args.input)
df.printSchema()
df.select('Int32').show()

#df = spark.read.format('root') \
#                .option('tree', args.tree) \
#                .load(args.input)
df.select('Int64').show()

Run it as follows (using uproot-small-flat-tree.root from this repository):

spark-submit \
    --packages edu.vanderbilt.accre:laurelin:1.0.0 \
    hello-world.py -i uproot-small-flat-tree.root -t tree

If executed as is, the program prints the top 20 rows of the Int32 and Int64 columns, respectively. If you uncomment the three lines towards the end, which read the file anew, the following error is thrown:

Traceback (most recent call last):
  File "/home/muellein/git/root-playground/laurelin/python/hello-world.py", line 24, in <module>
    df.select('Int64').show()
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1321, in select
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`Int64`' given input columns: [Int32];;\n'Project ['Int64]\n+- RelationV2 root[Int32#44] (Options: [tree=tree,path=../../testdata/small-flat-tree.root,paths=[]])\n"

It seems that the column restriction from the first select ('Int32') is cached somewhere, so the freshly read DataFrame exposes only that subset of the schema.
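
To make that suspicion concrete, here is a toy model in plain Python of how a cached reader could produce this symptom. It is only a sketch of one possible mechanism: ToyReader, ToyFrame, load and the file name toy.root are all made up for illustration, and nothing here is taken from Laurelin or Spark.

```python
# Toy model only: every name here is hypothetical, none of it is Laurelin's code.


class ToyReader:
    """Stands in for a data source reader that is cached per path."""

    def __init__(self, full_schema):
        self.schema = list(full_schema)

    def prune_columns(self, required):
        # The assumed bug: pruning overwrites the reader's schema in place,
        # so the restriction outlives the query that asked for it.
        self.schema = [c for c in self.schema if c in required]


class ToyFrame:
    """Stands in for a DataFrame; it snapshots the schema at load time."""

    def __init__(self, reader):
        self.reader = reader
        self.columns = list(reader.schema)

    def select(self, column):
        if column not in self.columns:
            raise LookupError(
                f"cannot resolve '{column}' given input columns: {self.columns}"
            )
        self.reader.prune_columns([column])
        return column


_reader_cache = {}


def load(path, full_schema):
    # Reuse the reader for a path that has been loaded before.
    reader = _reader_cache.setdefault(path, ToyReader(full_schema))
    return ToyFrame(reader)


df = load("toy.root", ["Int32", "Int64"])
df.select("Int32")

# The second read reuses the reader that was pruned by the first select.
df = load("toy.root", ["Int32", "Int64"])
try:
    df.select("Int64")
    message = None
except LookupError as exc:
    message = str(exc)
print(message)
```

In this model a second select on the original frame still works, because the frame keeps the schema it saw at load time; only a new load picks up the pruned schema. That matches the behaviour described above, but it is not evidence that the library works this way.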

The same happens when using the library in Java.

Thanks a lot, Andrew! Much appreciated.

I've made a 1.0.1 release with a fix for this. It should propagate to Maven Central in the next hour or two.

Awesome, thanks for the quick fix!

Wonderful, that was fast! We'll try this out.