databricks/spark-avro

Partition according to the block size of HDFS

Closed this issue · 1 comment

I am using spark-avro to read an Avro file of around 900MB from HDFS (our HDFS block size is 64MB). Spark creates 7 tasks for the job, which is roughly 900MB / 128MB, so I assume it is using 128MB as the default partition size.

I ran the following code to make sure Spark knows that the HDFS block size is 64MB:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("read seq")
sc = SparkContext(conf=conf)

# Tell the Hadoop configuration that the HDFS block size is 64MB
blocksize = 67108864
sc._jsc.hadoopConfiguration().setLong("dfs.blocksize", blocksize)

spark = SparkSession \
    .builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

df = spark.read.format("com.databricks.spark.avro").load("Test.avro")
vfile = df.rdd
vfile.count()

However, I still get 7 tasks. Is there a way to tell spark-avro about the HDFS block size?

Any suggestions are appreciated!!

After some trials, setting the spark.sql.files.maxPartitionBytes property works in Spark 2.1, thanks!
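
For anyone landing here later, a minimal sketch of how that property can be set when building the session. The 67108864 value (64MB), the app name, and the "Test.avro" path are just carried over from this report for illustration, not recommended settings:

from pyspark.sql import SparkSession

# Cap each file-based input partition at 64MB instead of the 128MB default,
# so a ~900MB Avro file is split into roughly twice as many tasks.
spark = SparkSession.builder \
    .appName("read seq") \
    .config("spark.sql.files.maxPartitionBytes", 67108864) \
    .getOrCreate()

df = spark.read.format("com.databricks.spark.avro").load("Test.avro")
print(df.rdd.getNumPartitions())  # check how many partitions Spark actually created

Note that spark.sql.files.maxPartitionBytes only affects the DataFrame file-source reader (Spark 2.0+); it does not change the HDFS block size itself.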