Hadoop is now the de facto standard Big Data platform in the enterprise world. In particular, HDFS – the Hadoop Distributed File System, the Hadoop module implementing the distributed storage layer – is the most widespread solution for storing the files that compose the so-called “Data Lake”. In this article we will analyze one of the most frequent and insidious antipatterns that you can run into when using this technology improperly: storing small files.
HDFS allows you to manage large files in an efficient and reliable way, guaranteeing high throughput and high availability of contents even in case of failure of individual cluster nodes, thanks to an architecture that partitions files into chunks (known as “blocks”) that are stored redundantly across the cluster’s nodes.
The architecture of HDFS is based on two main modules: the Namenode, the “master” node that handles the metadata of the distributed file system and coordinates the read/write operations on files and directories, and the Datanode, the “worker” node that manages the physical storage of the blocks that compose the files.
Every time we create a new file on HDFS, it is first divided into blocks of 128 MB (using the default configuration) that are written by clients to the Datanodes and then replicated three times each (again according to the default configuration values). At the end of the write operations, the information necessary for the subsequent retrieval of the file is written to the Namenode, which keeps it in memory in order to guarantee maximum read performance.
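As a back-of-the-envelope illustration of the mechanism described above, assuming the default 128 MB block size (`dfs.blocksize`) and replication factor 3 (`dfs.replication`):

```python
import math

BLOCK_SIZE_MB = 128  # dfs.blocksize default
REPLICATION = 3      # dfs.replication default

def block_replicas(file_size_mb):
    """Number of HDFS blocks and physical block replicas for a file of the given size."""
    blocks = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    return blocks, blocks * REPLICATION

# A single 1 GB file: 8 blocks, 24 physical replicas spread across the Datanodes,
# and a single file entry for the Namenode to track.
print(block_replicas(1024))  # (8, 24)

# The same 1 GB of data split into 1024 files of 1 MB each: still 3072 replicas,
# but now 1024 blocks and 1024 file entries weighing on the Namenode instead of 8 and 1.
per_file = block_replicas(1)
print(1024 * per_file[0], 1024 * per_file[1])  # 1024 3072
```

Note that a block only occupies as much physical disk space as the data it actually contains, so the cost of small files is paid in metadata and block count, not in wasted disk.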
In enterprise contexts, when migrating legacy procedures that originally wrote to an RDBMS or a NAS so that they write to HDFS instead, it is very common to overlook the impact of poorly optimized file sizes with respect to the characteristics of the Hadoop distributed file system.
Storing a large number of small files on HDFS results in a series of problems, ranging from high memory consumption on the Namenode to an increase in network traffic between the nodes of the cluster caused by a large number of RPC calls. It also leads to performance degradation in analytical jobs implemented with distributed parallel processing engines such as MapReduce and Spark, or with query engines such as Hive and Impala.
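To get a feel for the Namenode memory pressure, we can use the commonly cited rough figure of about 150 bytes of Namenode heap per metadata object (file inode or block); the exact number depends on the Hadoop version, so treat this as an order-of-magnitude sketch:

```python
OBJ_BYTES = 150  # rough, commonly cited heap cost per Namenode object (inode or block)

def namenode_heap_mb(num_files, blocks_per_file=1):
    """Approximate Namenode heap (MB) needed to track the given files.
    Replication does not multiply the object count: replicas share the block object."""
    objects = num_files * (1 + blocks_per_file)  # one inode plus its blocks
    return objects * OBJ_BYTES / 1024 / 1024

# 10 TB stored as 10 million files of 1 MB: ~2.8 GB of Namenode heap just for metadata.
print(round(namenode_heap_mb(10_000_000)))      # ~2861 MB

# The same 10 TB stored as 10,240 files of 1 GB (8 blocks each): ~13 MB of heap.
print(round(namenode_heap_mb(10_240, 8), 1))    # ~13.2 MB
```

Since all of this metadata must fit in the Namenode's JVM heap, the small-files layout is roughly 200 times more expensive for exactly the same data volume.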
Where do all my small files come from?
Storing configuration files, for example in XML or .properties format, is one of the most frequent cases that generate small files on HDFS. The placeholder _SUCCESS and _failure files, used by several distributed processing engines to track the output status of jobs, are also a common cause of small files proliferation (actually empty files, in this case).
Ingestion procedures in streaming or micro-batch scenarios are another fairly common cause of an increase in the number of small files on HDFS: every flush interval typically closes the current file and opens a new one, regardless of how little data has arrived.
How to locate small files?
One of the most commonly used methods for detecting small files is the analysis of the fsimage file, the on-disk representation of the Namenode’s in-memory database. This file can be interpreted and converted into various formats, e.g. CSV, using the Hadoop command line tools, and then inspected with analytical tools such as ad-hoc Hive/Impala queries or Spark/MapReduce jobs.
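For example, the Offline Image Viewer (`hdfs oiv -p Delimited`) can dump the fsimage as tab-separated text, which is then easy to filter. The sketch below assumes the column layout of the Delimited processor (Path, FileSize, Permission, etc.); verify the exact header against your Hadoop version:

```python
import csv
import io

# Sample rows in the tab-delimited layout assumed from
#   hdfs oiv -i fsimage_0000... -o fsimage.tsv -p Delimited
# (column names here are an assumption; check your distribution's output).
SAMPLE = (
    "Path\tReplication\tModificationTime\tAccessTime\tPreferredBlockSize\t"
    "BlocksCount\tFileSize\tNSQUOTA\tDSQUOTA\tPermission\tUserName\tGroupName\n"
    "/data\t0\t2019-01-01 00:00\t2019-01-01 00:00\t0\t0\t0\t-1\t-1\tdrwxr-xr-x\thdfs\thdfs\n"
    "/data/part-00000\t3\t2019-01-01 00:00\t2019-01-01 00:00\t134217728\t1\t524288\t0\t0\t-rw-r--r--\tetl\thdfs\n"
    "/data/big.parquet\t3\t2019-01-01 00:00\t2019-01-01 00:00\t134217728\t2\t268435456\t0\t0\t-rw-r--r--\tetl\thdfs\n"
)

SMALL_FILE_THRESHOLD = 1 << 20  # flag files under 1 MB

def find_small_files(tsv_text, threshold=SMALL_FILE_THRESHOLD):
    """Return the paths of regular files smaller than the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["Path"] for row in reader
            if not row["Permission"].startswith("d")  # skip directories
            and int(row["FileSize"]) < threshold]

print(find_small_files(SAMPLE))  # ['/data/part-00000']
```

On a real cluster you would run the same filter at scale, e.g. by loading the TSV into a Hive table or a Spark DataFrame and grouping the offenders by directory or owning user.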
As an alternative to the information provided by the fsimage, the output of the fsck command can also provide useful insights. Open source tools leveraging this command are available on the Internet, such as this script that you can download from GitHub: https://github.com/shashanknaikdev/fsck-small-files-analyzer.
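The summary that `hdfs fsck <path> -files` prints already contains enough to spot a small-files problem: dividing total size by total files gives the average file size. A minimal sketch, assuming a representative excerpt of the summary (the exact wording may vary slightly between Hadoop versions):

```python
import re

# Representative excerpt of an `hdfs fsck /data -files` summary (assumed format).
FSCK_SUMMARY = """
 Total size:    10737418240 B
 Total files:   81920
 Total blocks (validated):      81920 (avg. block size 131072 B)
"""

def avg_file_size_mb(summary):
    """Average file size (MB) computed from the fsck summary counters."""
    size = int(re.search(r"Total size:\s+(\d+)", summary).group(1))
    files = int(re.search(r"Total files:\s+(\d+)", summary).group(1))
    return size / files / 1024 / 1024

# An average far below the 128 MB block size is a strong small-files signal.
print(avg_file_size_mb(FSCK_SUMMARY))  # 0.125
```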
What solutions can be taken to avoid the problem?
When there is no way to modify or adapt the ingestion procedures so that they produce bigger files, one of the following solutions can be evaluated:
- Create batch procedures, e.g. Spark jobs, that “assemble” small files together while keeping the original format/compression, or that remove unnecessary _SUCCESS or _failure job status files;
- Use the HAR (Hadoop ARchive) format when creating files on HDFS, paying attention to the fact that such files are immutable once archived (more info: https://hadoop.apache.org/docs/current/hadoop-archives/HadoopArchives.html)
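The core of any compaction procedure is deciding which small files to merge together so that each output file approaches the block size. The following is a minimal, HDFS-agnostic sketch of that batching logic (file names and sizes are illustrative; the actual read/merge/write would be done with a Spark job or the HDFS client API):

```python
TARGET_SIZE = 128 * 1024 * 1024  # aim for roughly one HDFS block per output file

def plan_batches(sizes, target=TARGET_SIZE):
    """Greedily group files into batches whose total size approaches the target.
    `sizes` maps file name -> size in bytes; returns a list of lists of names."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(sizes.items()):
        if current and current_size + size > target:
            batches.append(current)      # flush the batch before it overflows
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Toy example: six 48 MB part files fit two per 128 MB output file.
MB = 1024 * 1024
sizes = {f"part-{i:05d}": 48 * MB for i in range(6)}
print(plan_batches(sizes))
# [['part-00000', 'part-00001'], ['part-00002', 'part-00003'], ['part-00004', 'part-00005']]
```

In a Spark implementation the same effect is typically obtained by reading the small files and writing them back with a reduced number of partitions (e.g. via `coalesce`), so the batching above is handled by the engine rather than by hand.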
For further insights, you can read this interesting article published on the official Cloudera’s blog: https://blog.cloudera.com/blog/2019/05/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/.
Written by Mario Cartia – Agile Skill Big Data Evangelist & Trainer