Partition according to the block size of HDFS
Closed this issue · 1 comments
phonchi commented
I just use spark-avro to read an avro file on HDFS (The block size of our HDFS is 64MB) which is around 900MB. The spark gives 7 tasks for the job, thus I assume that it uses 128MB as default block size.
I run the following code to ensure that spark know the block size of HDFS is 64MB
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("read seq")
sc = SparkContext(conf=conf)

# 64 MB in bytes
blocksize = 67108864
sc._jsc.hadoopConfiguration().setLong("dfs.blocksize", blocksize)

spark = SparkSession \
    .builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

df = spark.read.format("com.databricks.spark.avro").load("Test.avro")
vfile = df.rdd
vfile.count()
However, I still get 7 tasks. Is there a way to tell spark-avro about the HDFS block size?
Any suggestions are appreciated!
phonchi commented
After some trials, setting the spark.sql.files.maxPartitionBytes property works in Spark 2.1, thanks!