Linux | File_Combiner4Hadoop

File_Combiner4Hadoop

This shell script can be used to combine a set of small files into one or more big files.

This script is very useful when working with Hadoop (at least it was for me).


With Hadoop, the overhead involved in processing small files is very high, so it is better to combine the small files into one or more big files and use those big files for Hadoop processing. At the same time, if the "big/combined" file is larger than the Hadoop block size (64 MB by default in the version I used; newer Hadoop releases default to 128 MB), the file may get split during processing, i.e. one part of the file is processed by one node and another part by another node. If you don't want the files to be split, this is one of the easiest solutions: combine the small files into one or more big files and make sure each big file's size does not go above the Hadoop block size (in my case 64 MB). This shell script has a parameter "-size" where you can specify the maximum allowed size for the "big/combined" file.
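
To make the grouping idea concrete, here is a minimal sketch of size-capped combining. It is not the actual script from this post: `combine_small_files` and its three positional arguments are names made up for illustration, and it only looks at files directly under the input folder.

```shell
# Sketch only: a hypothetical helper, not the original hadoop_file_combiner.sh.
# Usage: combine_small_files <max_size_mb> <input_dir> <output_dir>
combine_small_files() {
    local max_bytes=$(( $1 * 1024 * 1024 ))
    local in_dir="$2"
    local out_dir="$3"
    local index=1 current=0 size f

    mkdir -p "$out_dir"
    for f in "$in_dir"/*; do
        [ -f "$f" ] || continue                 # skip sub-folders and anything not a regular file
        size=$(( $(wc -c < "$f") ))             # size of this small file in bytes
        # Start a new big file when appending this one would exceed the limit.
        # A single file larger than the limit still ends up alone in its own big file.
        if [ "$current" -gt 0 ] && [ $(( current + size )) -gt "$max_bytes" ]; then
            index=$(( index + 1 ))
            current=0
        fi
        cat "$f" >> "$out_dir/bigfile$index"    # output names follow the bigfile1, bigfile2 pattern
        current=$(( current + size ))
    done
}
```

The sketch fills each big file greedily in directory-listing order, which is simple but does not try to pack files optimally; that is usually fine when the goal is just to stay under the block size.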


There are other solutions, like using a customized InputFormat, packing the files into a SequenceFile, or using HBase to store all the small files. But this is one of the easiest methods, and it sufficed for my requirements.

Parameters

  -size  size_in_MB
  Mandatory.
  Small files are grouped into one or more big files. Each grouped file will
  not exceed the maximum size passed in this argument. Grouped files are named
  bigfile1, bigfile2 and so on. If no size is given, all files are grouped into
  one single big file.
  -path  parent_dir
  Mandatory.
  Path of the folder where all the small files are available. By default only files
  directly under this folder are grouped.
  If you want to pick up files from the sub-folders too, use the -a or --all argument.
  If you have WILDCARDS in your path arguments, please enclose the values
  in double quotes (") to avoid command-line globbing.
  -name  filename_with_wildcards
  This argument helps to select files with specific name patterns.
  -opath  out_file_dir_path
  This is the directory path to which the output files are written.
  -a | --all
  If you would like to pick up all the files under the given -path, including the
  files under the sub-folders, use this parameter.
  -h | --help
  Usage details.
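
The note about quoting wildcards is easy to get wrong, so here is a small demonstration of what the shell does before any script sees its arguments. `count_args` and the temporary folder are hypothetical and only there to make the effect visible.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a.log" "$demo_dir/b.log" "$demo_dir/c.log"
cd "$demo_dir" || exit 1

unquoted=$(count_args -name *.log)     # the shell expands *.log first: 4 arguments
quoted=$(count_args -name "*.log")     # the pattern is passed through intact: 2 arguments
echo "unquoted: $unquoted, quoted: $quoted"
# prints "unquoted: 4, quoted: 2"
```

With the quotes, the pattern reaches the script as a single value, so the script itself can decide which files match.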

Download the Source Code


Installation Steps

Download the file above and run it like any other shell script.

Usage


./hadoop_file_combiner.sh --help

hadoop_file_combiner.sh -size -path [-name] [-opath] [-a] [-h]
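
For readers curious how a synopsis like this maps onto argument handling, below is a rough sketch of a parser for the same flags. It is an assumption about how such a script could be written, not the code of hadoop_file_combiner.sh; the function name, the variable names, and the defaults for -name and -opath are all made up for illustration.

```shell
# Hypothetical parser for the flags in the synopsis above (not the original script's code).
parse_args() {
    size_mb="" in_path="" name_pattern="*" out_path="." recurse=0   # assumed defaults
    while [ $# -gt 0 ]; do
        case "$1" in
            -size|-path|-name|-opath)
                # These flags take a value; stop if it is missing.
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                case "$1" in
                    -size)  size_mb="$2" ;;
                    -path)  in_path="$2" ;;
                    -name)  name_pattern="$2" ;;
                    -opath) out_path="$2" ;;
                esac
                shift 2 ;;
            -a|--all)
                recurse=1
                shift ;;
            -h|--help)
                echo "hadoop_file_combiner.sh -size -path [-name] [-opath] [-a] [-h]"
                return 0 ;;
            *)
                echo "unknown argument: $1" >&2
                return 1 ;;
        esac
    done
    # -size and -path are the two arguments marked Mandatory in the parameter list.
    if [ -z "$size_mb" ] || [ -z "$in_path" ]; then
        echo "-size and -path are mandatory" >&2
        return 1
    fi
}
```

A typical call to the real script would then look something like `./hadoop_file_combiner.sh -size 64 -path /data/small -name "*.log" -opath /data/combined`, where the paths are placeholders.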
