Hadoop data compression

MelodyYN · 2022-02-13

**Purpose:** reduce disk I/O and reduce disk storage space.

Principles for using compression:

  • Compute-intensive jobs: use less compression.
  • I/O-intensive jobs: use more compression.

1、 Compression codecs supported by MapReduce

  1. Compression format comparison:

| Compression format | Bundled with Hadoop? | Algorithm | File extension | Splittable? | Changes needed after switching to this format |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | Yes | DEFLATE | .deflate | No | None; handled like plain text |
| Gzip | Yes | DEFLATE | .gz | No | None; handled like plain text |
| bzip2 | Yes | bzip2 | .bz2 | Yes | None; handled like plain text |
| LZO | No; must be installed | LZO | .lzo | Yes | Files must be indexed, and an input format must be specified |
| Snappy | Yes | Snappy | .snappy | No | None; handled like plain text |
  2. Compression performance:

| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
| Snappy | | | 250 MB/s | 500 MB/s |
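
The extensions in the first table are not just a naming convention: each codec knows its own default extension and can be used directly as a stream wrapper. Below is a minimal sketch of compressing one local file with GzipCodec; the class name and the file path are hypothetical, chosen only for illustration:

```java
package com.hpu.hadoop.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class CompressDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec via ReflectionUtils so it receives the Configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        String src = "E:\\Test\\input\\words.txt";      // hypothetical input file
        String dst = src + codec.getDefaultExtension(); // appends ".gz"
        try (FileInputStream in = new FileInputStream(src);
             CompressionOutputStream out = codec.createOutputStream(new FileOutputStream(dst))) {
            IOUtils.copyBytes(in, out, 4096);           // stream the raw bytes through the codec
        }
    }
}
```

Decompression is symmetric: `codec.createInputStream(...)` wraps an input stream instead.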

2、 Choosing a compression codec

Consider three factors:

  • Compression/decompression speed
  • Compression ratio
  • Whether the compressed file can be split

| Compression method | Advantages | Disadvantages |
| --- | --- | --- |
| Gzip | High compression ratio | Does not support Split; average compression/decompression speed |
| Bzip2 | High compression ratio; supports Split | Slow compression and decompression |
| LZO | Relatively fast compression/decompression; supports Split | Average compression ratio; an extra index must be built to support splitting |
| Snappy | Very fast compression and decompression | Does not support Split; average compression ratio |

Selection of compression codec at different positions in a job:

*(figure omitted)*

3、 Compression parameter configuration

  1. Codec classes for each compression format:

| Compression format | Codec class |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
  2. To enable compression in Hadoop, configure the following parameters:

| Parameter | Default value | Stage | Recommendation |
| --- | --- | --- | --- |
| io.compression.codecs (in core-site.xml) | None (run `hadoop checknative` on the command line to see what is available) | Input compression | Hadoop uses the file extension to determine whether a codec is supported |
| mapreduce.map.output.compress (in mapred-site.xml) | false | Mapper output | Set to true to enable compression |
| mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | In production, LZO or Snappy is often used to compress data at this stage |
| mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | Reducer output | Set to true to enable compression |
| mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use a standard tool or codec, such as gzip or bzip2 |
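
As the input-compression row notes, Hadoop picks a codec at the input stage by file extension; the class that performs this lookup is CompressionCodecFactory, and it can be exercised on its own. A small sketch (the class name and file names here are made up for illustration):

```java
package com.hpu.hadoop.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {

    public static void main(String[] args) {
        // The factory honors io.compression.codecs and falls back to the built-in codec list
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String name : new String[]{"a.gz", "b.bz2", "c.snappy", "d.txt"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            // getCodec returns null when no registered codec matches the extension
            System.out.println(name + " -> " + (codec == null ? "no codec" : codec.getClass().getSimpleName()));
        }
    }
}
```

When getCodec returns null, the file is simply read as-is, uncompressed.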

4、 Case study

4.1 Compressing map output

```java
package com.hpu.hadoop.compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WCDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and the job object
        Configuration conf = new Configuration();
        // Enable compression of the map output
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Set the codec used for the map output
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf);
        // 2. Associate the driver class
        job.setJarByClass(WCDriver.class);
        // 3. Associate the Mapper and Reducer
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // 4. Set the mapper output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\Test\\input\\inputwc"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Test\\w1"));
        // 7. Submit the job
        job.waitForCompletion(true);
    }
}
```
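
Note that map-output compression only affects the intermediate data written during the shuffle; the final files in E:\Test\w1 are still uncompressed.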

4.2 Compressing reducer output

```java
package com.hpu.hadoop.compress.R;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WCDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Associate the driver class
        job.setJarByClass(WCDriver.class);
        // 3. Associate the Mapper and Reducer
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        // 4. Set the mapper output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Enable compression of the reducer output
        FileOutputFormat.setCompressOutput(job, true);
        // Set the codec used for the reducer output
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\Test\\input\\inputwc"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\Test\\w2"));
        // 7. Submit the job
        job.waitForCompletion(true);
    }
}
```
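
Here the compressed files are the job's actual output: the files in E:\Test\w2 carry the codec's default extension, e.g. part-r-00000.bz2.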

Alternatively, all four settings can be written uniformly as string properties on the Configuration:

conf.set("mapreduce.map.output.compress","true");
conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.BZip2Codec");
conf.set("mapreduce.output.fileoutputformat.compress","true");
conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.BZip2Codec");
Copyright: MelodyYN. Please include the original link when reprinting: https://en.javamana.com/2022/02/202202130826487256.html