Reading a file again incorrectly restricts to subset of schema
ingomueller-net opened this issue · 5 comments
Thanks a lot for the great project!
I ran into a bug that happens when reading the same file twice. The following program exhibits the problem:
import argparse
import pyspark.sql
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='Input ROOT file')
parser.add_argument('-t', '--tree', help='Name of tree to open', default='Events')
args = parser.parse_args()
spark = pyspark.sql.SparkSession.builder \
    .config('spark.jars.packages', 'edu.vanderbilt.accre:laurelin:1.0.0') \
    .getOrCreate()
sc = spark.sparkContext
df = spark.read.format('root') \
    .option('tree', args.tree) \
    .load(args.input)
df.printSchema()
df.select('Int32').show()
#df = spark.read.format('root') \
# .option('tree', args.tree) \
# .load(args.input)
df.select('Int64').show()
You run it as follows (using uproot-small-flat-tree.root from this repository):
spark-submit \
--packages edu.vanderbilt.accre:laurelin:1.0.0 \
hello-world.py -i uproot-small-flat-tree.root -t tree
If executed as is, the program prints the first 20 rows of the Int32 and Int64 columns, respectively. If you uncomment the three lines towards the end, which read the file anew, the following error is thrown:
Traceback (most recent call last):
File "/home/muellein/git/root-playground/laurelin/python/hello-world.py", line 24, in <module>
df.select('Int64').show()
File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1321, in select
File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/mnt/scratch/muellein/download/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`Int64`' given input columns: [Int32];;\n'Project ['Int64]\n+- RelationV2 root[Int32#44] (Options: [tree=tree,path=../../testdata/small-flat-tree.root,paths=[]])\n"
It seems that the restriction to a subset of the schema (here, the Int32 column selected after the first read) is cached somewhere and then applied to the second read.
The same happens when using the library in Java.
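To make the suspicion above concrete, here is a toy model in plain Python of how a reused, already pruned reader could produce exactly this symptom. It is only a sketch of one possible mechanism: ToyReader, ToySource, ToyFrame and run are hypothetical names made up for illustration, and none of this is Laurelin's or Spark's actual code.

```python
class ToyReader:
    """Hypothetical reader that remembers which columns were requested."""

    def __init__(self, columns):
        self.columns = list(columns)

    def prune_columns(self, wanted):
        # Narrow the reader to the requested columns (mutates the reader).
        self.columns = [c for c in self.columns if c in wanted]

    def read_schema(self):
        return list(self.columns)


class ToySource:
    """Hypothetical data source; optionally hands out the same reader again."""

    def __init__(self, file_columns, cache_readers):
        self.file_columns = list(file_columns)
        self.cache_readers = cache_readers
        self._reader = None

    def create_reader(self):
        if self.cache_readers and self._reader is not None:
            return self._reader  # reused, possibly already pruned
        self._reader = ToyReader(self.file_columns)
        return self._reader


class ToyFrame:
    """Hypothetical frame whose schema is resolved once, at load time."""

    def __init__(self, source):
        self.source = source
        self.schema = source.create_reader().read_schema()

    def select(self, column):
        if column not in self.schema:
            raise LookupError(
                f"cannot resolve '{column}' given input columns: {self.schema}")
        self.source.create_reader().prune_columns([column])
        return column


def run(cache_readers, reread):
    """Select Int32, optionally load the file again, then select Int64."""
    source = ToySource(['Int32', 'Int64'], cache_readers)
    df = ToyFrame(source)
    selected = [df.select('Int32')]
    if reread:
        df = ToyFrame(source)  # corresponds to the three commented-out lines
    selected.append(df.select('Int64'))
    return selected
```

In this model, run(cache_readers=True, reread=False) succeeds because the first frame keeps the schema it resolved at load time, while run(cache_readers=True, reread=True) raises a LookupError because the second load sees only the pruned Int32 column. With cache_readers=False both variants succeed.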
Thanks a lot, Andrew! Much appreciated.
I've released 1.0.1 with a fix for this. It should propagate to Maven Central in the next hour or two.
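For anyone who wants to check the fix, the reproduction from the report can be condensed into a small script along these lines. This is a sketch under assumptions: it presumes that 1.0.1 is published under the same edu.vanderbilt.accre:laurelin coordinates as 1.0.0, and the helper names laurelin_package, read_tree and check_reread are made up here.

```python
# Coordinates taken from the report above; that 1.0.1 uses the same
# group and artifact is an assumption.
LAURELIN_GROUP_ARTIFACT = 'edu.vanderbilt.accre:laurelin'


def laurelin_package(version='1.0.1'):
    """Build the value for spark.jars.packages."""
    return f'{LAURELIN_GROUP_ARTIFACT}:{version}'


def read_tree(spark, path, tree):
    """Load one tree of a ROOT file as a DataFrame."""
    return spark.read.format('root').option('tree', tree).load(path)


def check_reread(path, tree='Events', columns=('Int32', 'Int64')):
    """Read the file anew before each select, as in the failing variant."""
    import pyspark.sql  # imported lazily so the helpers above work without Spark

    spark = pyspark.sql.SparkSession.builder \
        .config('spark.jars.packages', laurelin_package()) \
        .getOrCreate()
    for column in columns:
        df = read_tree(spark, path, tree)
        df.select(column).show()
```

Calling check_reread('uproot-small-flat-tree.root', tree='tree') mirrors the failing case from the report, since the file is loaded again before the Int64 column is selected.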
Awesome, thanks for the quick fix!
Wonderful, that was fast! We'll try this out.