File_Combiner4Hadoop
This shell script can be used to combine a set of small files into one or more big files.
This script is very useful when working with Hadoop (at least it was for me).
With Hadoop, the overhead involved in processing small files is very high, so it is better to combine all the small files into one or more big files and use those for Hadoop processing. At the same time, if the combined file is larger than the Hadoop block size (64MB by default), the file may get split during processing (i.e. one half of the file is processed by one node and the other half by another node). If you do not want the files to be split, this is one of the easiest solutions: combine the small files into one or more big files and make sure each combined file's size does not go above the Hadoop block size (64MB in my case). This shell script has a "-size" parameter where you can specify the maximum allowed size for the combined file.
There are other solutions, such as using a customized InputFormat, creating a SequenceFile, or using HBase to store all the small files, but this is one of the easiest methods and it was sufficient for my requirements.
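To illustrate the grouping logic only (this is not the actual script; the directory names and the 64MB limit below are example values), the core idea looks roughly like this:

#!/bin/sh
# Sketch of the grouping idea only -- not the actual script.
# SRC_DIR, OUT_DIR and the 64MB limit are example values.
MAX_BYTES=$((64 * 1024 * 1024))   # stay under the 64MB HDFS block size
SRC_DIR="/data/small_files"
OUT_DIR="/data/combined"

count=1
current_size=0
for f in "$SRC_DIR"/*; do
    [ -f "$f" ] || continue
    fsize=$(wc -c < "$f")
    # start a new bigfileN once the current one would exceed the limit
    if [ "$current_size" -gt 0 ] && [ $((current_size + fsize)) -gt "$MAX_BYTES" ]; then
        count=$((count + 1))
        current_size=0
    fi
    cat "$f" >> "$OUT_DIR/bigfile$count"
    current_size=$((current_size + fsize))
done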
Parameters
-size | size_in_MB | Mandatory. Small files are grouped together into one or more big file(s). Each grouped file will not exceed the maximum size passed in this argument. Grouped files are named like bigfile1, bigfile2 and so on. If no size is given, then all files are grouped into one single big file.
-path | parent dir | Mandatory. Path of the folder where all the small files are available. By default, only files which are directly under this folder will be grouped. If you want to pick up the files from the sub-folders too, then use the -a or --all argument. If you have WILDCARDS in your path argument, please enclose the value in double quotes (") to avoid command-line globbing.
-name | filename with wildcards | This argument helps to select files with specific name patterns.
-opath | out file dir path | This is the directory path to which the output file is written.
-a | --all | If you would like to pick up all the files under the given -path, including the files under the sub-folders, then use this parameter.
-h | --help | Usage details.
Download the Source Code
Installation Steps
Download the above file and run it like you would run any other shell script.
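For example, you may need to make the script executable first before running it:

chmod +x hadoop_file_combiner.sh
./hadoop_file_combiner.sh --help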
Usage
./hadoop_file_combiner.sh --help
hadoop_file_combiner.sh -size <size_in_MB> -path <parent_dir> [-name <filename_with_wildcards>] [-opath <out_file_dir_path>] [-a] [-h]
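For example, to combine all files matching *.log under /data/logs (including its sub-folders) into combined files of at most 64MB each, written to /data/combined (the paths and name pattern here are just examples):

./hadoop_file_combiner.sh -size 64 -path /data/logs -name "*.log" -opath /data/combined -a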