cerndb/hdfs-metadata

How to find small files (size smaller than block size)?

Closed this issue · 4 comments

I think the hdfs-metadata project is great. It shows us a detailed view of the metadata about data blocks and replicas stored in HDFS.
But in my project, I want to find the small files (size smaller than the block size) on HDFS. How can I do that? Please give me a hand, thank you!

Hi @jiangshouzhuang ,

You would need to loop through all files, similar to what I am doing here: https://github.com/cerndb/hdfs-metadata/blob/master/src/main/java/ch/cern/db/hdfs/DistributedFileSystemMetadata.java#L174

For each file, you would then compare its size with the block size and check whether the file size is smaller.

If your main class extends Configured and implements Tool, you can get the configured default block size through the configuration returned by getConf(). Block size can be different per file, though, so you can instead read the block size of each file from its FileStatus (getBlockSize()) once you have it.
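
The comparison described above could be sketched roughly as follows. This is only an illustration, not code from hdfs-metadata: `SmallFileFinder` and `FileEntry` are hypothetical names, and `FileEntry` is a plain stand-in for the values you would read from Hadoop's FileStatus (getPath(), getLen() and getBlockSize()) while iterating, for example, over the result of FileSystem.listFiles(path, true). The example paths are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SmallFileFinder {

    /** Hypothetical stand-in for the FileStatus fields used in the comparison. */
    static final class FileEntry {
        final String path;     // would come from FileStatus.getPath()
        final long length;     // would come from FileStatus.getLen(), in bytes
        final long blockSize;  // would come from FileStatus.getBlockSize(), in bytes

        FileEntry(String path, long length, long blockSize) {
            this.path = path;
            this.length = length;
            this.blockSize = blockSize;
        }
    }

    /** A file counts as "small" when it is smaller than its own block size. */
    static boolean isSmall(FileEntry file) {
        return file.length < file.blockSize;
    }

    /** Returns the paths of all small files, keeping the input order. */
    static List<String> findSmallFiles(List<FileEntry> files) {
        List<String> small = new ArrayList<>();
        for (FileEntry file : files) {
            if (isSmall(file)) {
                small.add(file.path);
            }
        }
        return small;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<FileEntry> files = Arrays.asList(
                new FileEntry("/data/a.parquet", 4 * mb, 128 * mb),
                new FileEntry("/data/b.parquet", 128 * mb, 128 * mb),
                // Block size is per file, so this one is compared against 256 MB.
                new FileEntry("/data/c.parquet", 200 * mb, 256 * mb));
        System.out.println(findSmallFiles(files));
    }
}
```

Using the per-file block size rather than a single default keeps the check correct for files that were written with a non-default block size.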

Hope it helps.

@dlanza1
Thank you very much for your advice.
In fact, I would like to find the number of small files behind each Hive/Impala table, and then tell the users to optimize those tables, because we know that small files affect the performance of HDFS.
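
One possible way to get a per-table count, once the small file paths are known, is to group them by table directory. The sketch below assumes that tables follow the default Hive warehouse layout `<warehouse>/<db>.db/<table>/...`, with partitions nested below the table directory. Tables with a custom location, and tables in the default database, would need a different mapping (for example a metastore lookup). `SmallFilesPerTable`, `tableDirOf` and `countByTable` are hypothetical names, and the paths are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SmallFilesPerTable {

    /**
     * Maps a file path to its table directory, assuming the layout
     * <warehouseRoot>/<db>.db/<table>/[partition dirs/]<file>.
     * Returns null when the path does not match that layout.
     */
    static String tableDirOf(String filePath, String warehouseRoot) {
        String prefix = warehouseRoot.endsWith("/") ? warehouseRoot : warehouseRoot + "/";
        if (!filePath.startsWith(prefix)) {
            return null;
        }
        String[] parts = filePath.substring(prefix.length()).split("/");
        // Need at least <db>.db, <table> and a file name.
        if (parts.length < 3 || !parts[0].endsWith(".db")) {
            return null;
        }
        return prefix + parts[0] + "/" + parts[1];
    }

    /** Counts small files per table directory; paths outside the layout are skipped. */
    static Map<String, Integer> countByTable(List<String> smallFilePaths, String warehouseRoot) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String path : smallFilePaths) {
            String table = tableDirOf(path, warehouseRoot);
            if (table != null) {
                counts.merge(table, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> smallFiles = Arrays.asList(
                "/user/hive/warehouse/sales.db/orders/dt=2017-01-01/part-00000",
                "/user/hive/warehouse/sales.db/orders/dt=2017-01-02/part-00000",
                "/user/hive/warehouse/sales.db/customers/part-00000",
                "/tmp/scratch/part-00000"); // not under the warehouse, skipped
        countByTable(smallFiles, "/user/hive/warehouse")
                .forEach((table, count) -> System.out.println(table + " " + count));
    }
}
```

Sorting the result by count would then give the list of tables whose owners should be contacted first.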

Definitely, it affects performance.

Compacting them would be a good practice.

Yes, that's what I mean. But first I want to find the small files, and then tell users to compact them.