nightscape/spark-excel

input_file_name returns empty string

btelle opened this issue · 6 comments

Hello,

I'm trying to create a dataframe from an Excel doc and then append a column with the input file's name as returned by input_file_name. Instead of returning the file path, input_file_name is returning an empty string.

Test code

from pyspark.sql.functions import input_file_name

df = (sql_context.read.format("com.crealytics.spark.excel")
      .option('sheetName', "Sheet1")
      .option('useHeader', True)
      .load("/home/btelle/test.xlsx")
      )

df = df.withColumn('file_name', input_file_name())
df.select('file_name').show(1)

Expected result

+--------------------+
|           file_name|
+--------------------+
|file:///home/btel...|
+--------------------+

Actual result

+---------+
|file_name|
+---------+
|         |
+---------+

Does the same code work with other DataFrames?

I've tested the same code with multiple xls and xlsx files and get the same empty string every time. Other formats such as csv produce the expected result.

TBH I don't have any idea how input_file_name is supposed to work. Can you dig out the corresponding documentation and maybe some code examples showing what one has to do to make it work?
Maybe you can find the corresponding code in the CSV package.

Closing this due to inactivity.

Hi, I'm facing the same issue: input_file_name() returns an empty string when used with this package.

Here is an example of how this is supposed to work:

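from pyspark.sql.functions import input_file_name

file_read = spark.read.option("header", "true").csv(blob_location)
file_loc = file_read.withColumn("file_loc", input_file_name())
display(file_loc)  # display() is a notebook helper, not part of PySpark itself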

result:
col1 | col2 | file_loc
merin | 25 | wasbs://temp-blob@xxxxxxx.blob.core.windows.net/folder/file.csv

So this gives the full path of the file that was processed. But when using spark-excel, the file path comes back empty.

See above:

TBH I don't have any idea how input_file_name is supposed to work. Can you dig out the corresponding documentation and maybe some code examples showing what one has to do to make it work?
Maybe you can find the corresponding code in the CSV package.
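
While input_file_name() returns an empty string for this data source, one possible workaround is to load each Excel file on its own and attach the path as a constant column. The following is only a minimal sketch, not something provided by spark-excel: the helper name read_excel_with_file_name and its column parameter are hypothetical, the sheetName and useHeader options are copied from the test code in this issue, and it assumes all files share the same column names.

```python
from functools import reduce


def read_excel_with_file_name(spark, paths, column="file_name"):
    """Load each Excel file separately and tag its rows with the path.

    Hypothetical helper for illustration; not part of spark-excel.
    """
    # Imported lazily so this module can be imported without Spark installed.
    from pyspark.sql.functions import lit

    if not paths:
        raise ValueError("paths must contain at least one file")

    frames = []
    for path in paths:
        df = (
            spark.read.format("com.crealytics.spark.excel")
            .option("sheetName", "Sheet1")  # options taken from the test code in this issue
            .option("useHeader", True)
            .load(path)
        )
        # lit(path) is a constant column, so it does not rely on input_file_name().
        frames.append(df.withColumn(column, lit(path)))

    # Assumption: every file has the same column names, so a by-name union is valid.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Usage would look like df = read_excel_with_file_name(spark, ["/home/btelle/test.xlsx"]). Note that the new column holds the path exactly as it was passed in, so it will not carry the file:/// prefix that input_file_name() produces for other formats.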