vepadulano/PyRDF

Report more expressive error message when Numcluster = 0

Closed this issue · 4 comments

During the creation of ranges (BuildRanges), the input dataset is extracted from the arguments passed to RDataFrame. If the dataset cannot be read properly, e.g. because of incorrect user input or files that do not exist or are unreachable, BuildRanges fails with a division by zero.
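A minimal sketch of the kind of guard that would make this failure more expressive (function and variable names here are illustrative, not PyRDF's actual internals):

def build_ranges(nclusters, npartitions):
    # Hypothetical guard proposed by this issue: raise a descriptive error
    # instead of letting the partition arithmetic below fail with a bare
    # ZeroDivisionError.
    if nclusters == 0:
        raise RuntimeError(
            "Cannot build ranges: the dataset has zero clusters. Check that "
            "the files passed to RDataFrame exist and are readable."
        )
    # With zero clusters, min() would return 0 here and divmod would then
    # divide by zero.
    npartitions = min(npartitions, nclusters)
    step, rest = divmod(nclusters, npartitions)
    ranges, start = [], 0
    for i in range(npartitions):
        end = start + step + (1 if i < rest else 0)
        ranges.append((start, end))
        start = end
    return ranges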

Does it ever reach this point? Isn't the program stopped earlier by ROOT if the RDataFrame cannot be properly instantiated?

It does: the RDataFrame object is instantiated only after the ranges have been calculated, so the range-building step happens before any of ROOT's own checks.
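Roughly, the order of operations looks like this (a schematic sketch; run_distributed and its arguments are illustrative, not PyRDF's actual call chain):

def run_distributed(treename, filenames, nclusters, npartitions):
    # Step 1: the driver computes entry ranges from dataset metadata alone.
    # An unreadable dataset yields nclusters == 0, and this is where the
    # bare ZeroDivisionError surfaces.
    ranges = build_ranges(nclusters, npartitions)  # see the sketch above

    # Step 2: only inside each distributed task is an RDataFrame actually
    # created, so ROOT's "file does not exist" error fires too late to
    # prevent step 1 from failing first.
    def mapper(rng):
        import ROOT
        rdf = ROOT.RDataFrame(treename, filenames)
        return rdf.Range(*rng)  # restrict to this task's share of entries

    return [mapper(rng) for rng in ranges]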

So I tried to reproduce this issue; this is the current state:

>>> import PyRDF
>>> PyRDF.use("spark")
19/07/17 16:11:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> df = PyRDF.RDataFrame("random","random.root")
>>> mydef = df.Define("x","rdfentry_")
>>> myhisto = mydef.Histo1D("x")
>>> nentries = myhisto.GetEntries()
Error in <TFile::TFile>: file random.root does not exist
/home/.local/lib/python3.7/site-packages/PyRDF-0.1.0-py3.7.egg/PyRDF/Proxy.py:80: UserWarning: No entries in the Tree, falling back to local execution!
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist
Error in <TFile::TFile>: file random.root does not exist

Is this enough, or should we catch the error ourselves and be more specific? Or am I missing something?

There is currently an early check that the arguments to the RDataFrame constructor make sense, plus another check in the distributed execution: if the dataset has zero entries and cannot be processed, an error is raised.
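For reference, a hedged sketch of what such an early argument check might look like (the actual PyRDF checks may differ; only standard ROOT calls are used):

import ROOT

def validate_dataset(treename, filenames):
    # Hypothetical fail-fast validation at construction time, so the user
    # sees a clear message instead of a failure deep inside range building.
    for fname in filenames:
        f = ROOT.TFile.Open(fname)
        if not f or f.IsZombie():
            raise RuntimeError("File '{}' does not exist or cannot be read".format(fname))
        if not f.Get(treename):
            raise RuntimeError("Tree '{}' not found in file '{}'".format(treename, fname))
        f.Close()

def check_entries(nentries):
    # Mirror of the second check described above: refuse distributed
    # execution on an empty dataset.
    if nentries == 0:
        raise RuntimeError("No entries in the dataset: nothing to distribute")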