File_Combiner4Hadoop
This shell script combines a set of small files into one or more big files. It is very useful when working with Hadoop (at least it was for me). With Hadoop, the overhead involved in processing many small files is very high, so it is better to combine the small files into one or more big files and use those big files for Hadoop processing.

At the same time, if a combined file is larger than the Hadoop block size (64 MB by default), the file may get split during processing (i.e., one half of the file is processed by one node and the other half by another node). If you don't want your files to be split, one of the easiest solutions is to combine the small files into one or more big files while making sure each big file's size does not go above the Hadoop block size (64 MB in my case). This shell script has a "-size" parameter where you can specify that maximum allowed size.
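The core idea is simple enough to sketch in a few lines of shell. The snippet below is a minimal, hypothetical illustration of the combining logic, not the actual script: it concatenates files from an input directory into numbered output files, starting a new output file whenever the next input would push the current one past the size cap. The variable names, output file names, and 64 MB default here are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: combine small files from INPUT_DIR into numbered
# output files, each at most MAX_BYTES in size.

INPUT_DIR="${1:?usage: $0 <input_dir> [max_bytes]}"
MAX_BYTES="${2:-67108864}"    # assumed default: 64 MB, the old HDFS block size

part=1
current=0
out="combined_${part}.txt"
: > "$out"                    # start with an empty output file

for f in "$INPUT_DIR"/*; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f")
    # Roll over to a new output file if appending this input
    # would push the current one past the cap.
    if [ "$current" -gt 0 ] && [ $((current + size)) -gt "$MAX_BYTES" ]; then
        part=$((part + 1))
        out="combined_${part}.txt"
        : > "$out"
        current=0
    fi
    cat "$f" >> "$out"
    current=$((current + size))
done
```

Under these assumptions, a run like `./combine.sh ./small_files 67108864` would produce combined_1.txt, combined_2.txt, and so on, each no larger than 64 MB and ready to load into HDFS without being split across blocks.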