How to efficiently write data (terabytes per day) from Kafka to HDFS?

CSDN Q & A, 2022-02-13 06:44:00


The problem background:

Currently there are about 10 TB of data in Kafka that need to be written to HDFS every day. The incoming rate is about 10,000 records per second, and each record is about 10 kB.
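As a rough sanity check, 10,000 records/s × 10 kB ≈ 100 MB/s, which works out to roughly 8.6 TB per day, consistent with the ~10 TB/day figure above.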

What has been tried:

1. Using Logstash to consume directly from Kafka and write to HDFS in the output stage. Result: throughput of about 500 records/s. (A minimal config sketch of this setup follows this list.)

2. Using Logstash to write the data to local disk, then uploading it every hour with hadoop fs -put. That is about 400 GB per hour, which can barely keep up; if the data volume keeps growing, a backlog is inevitable.
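For reference, a minimal sketch of the direct Logstash pipeline from attempt 1, assuming the standard kafka input and webhdfs output plugins are installed; the broker addresses, topic, consumer group, namenode host and HDFS path are all placeholders:

```
input {
  kafka {
    bootstrap_servers => "broker1:9092,broker2:9092"   # placeholder brokers
    topics            => ["events"]                     # placeholder topic
    group_id          => "logstash-hdfs"
    consumer_threads  => 4                              # should not exceed the topic's partition count
  }
}

output {
  webhdfs {
    host => "namenode.example.com"                      # placeholder namenode
    port => 50070
    user => "hdfs"
    path => "/data/events/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
  }
}
```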

Then I considered compressing the data before uploading:

1. Compressing each hour's data before the hourly upload: in tests, lz4 compresses at about 3 GB/min and gzip at about 1 GB/min. Clearly the compression step alone takes longer than the time available to upload the data (see the arithmetic after this list).

2. Having Logstash apply streaming gzip compression when writing its output: Logstash's processing rate dropped sharply, to about 1,000 records/s.
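At roughly 400 GB of raw data per hour, compression would need to sustain about 400 / 60 ≈ 6.7 GB/min just to keep pace, so the measured 3 GB/min (lz4) and 1 GB/min (gzip) both fall well short of the ingest rate.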

I would appreciate any other ideas for moving large volumes of data from Kafka to HDFS, not limited to the components mentioned above. I am new to big data; thank you in advance for your answers.




Accepted answer:

Solution:
First determine whether the bottleneck is upstream (Kafka is producing more data than you can consume) or downstream (HDFS write throughput is poor).

  1. If the Kafka output volume is too high to keep up with, you can add partitions and increase the number of records pulled per batch.
  2. If HDFS write throughput is the problem, consider multi-threaded or concurrent writes.
    The 500 records/s you are getting with Logstash looks quite poor; check whether you can raise the poll-related parameters and set the multithreading/worker parameters (a tuning sketch follows this answer).
    In this scenario I usually use the StreamSets tool, which handles TB-scale volumes without trouble. I recently implemented Kafka-to-HBase synchronization with it; if you are interested, see:
    https://blog.csdn.net/BlackArmand/article/details/118367522?spm=1001.2014.3001.5502
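As a concrete illustration of points 1 and 2, a hedged sketch of Logstash tuning, assuming the standard kafka input plugin (adding partitions themselves is done on the Kafka side, e.g. with the kafka-topics tool); the thread counts, batch sizes and broker/topic names below are placeholder values to adjust against the actual partition count and hardware:

```
# in logstash.yml (pipeline-level parallelism; values are illustrative)
pipeline.workers: 8          # filter/output worker threads
pipeline.batch.size: 1000    # events handled per worker per batch

# in the pipeline config: pull more per poll and consume partitions in parallel
input {
  kafka {
    bootstrap_servers => "broker1:9092"
    topics            => ["events"]
    consumer_threads  => 8                 # at most one thread per topic partition
    max_poll_records  => "5000"            # records fetched per Kafka poll
  }
}
```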


Other answer 2:

You can use Flume as the Kafka consumer and then use a Flume HDFS sink to write to HDFS; most requirements can be met just by editing the Flume configuration file.
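A minimal sketch of such a Flume agent, assuming Flume 1.x with its built-in Kafka source and HDFS sink; the broker addresses, topic, channel sizing and HDFS path/roll settings are placeholders to tune:

```
# agent "a1": Kafka source -> file channel -> HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type                    = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sources.r1.kafka.topics            = events
a1.sources.r1.kafka.consumer.group.id = flume-hdfs
a1.sources.r1.batchSize               = 10000
a1.sources.r1.channels                = c1

# a file channel survives restarts; use type = memory if throughput matters more than durability
a1.channels.c1.type                = file
a1.channels.c1.capacity            = 1000000
a1.channels.c1.transactionCapacity = 10000

a1.sinks.k1.type                   = hdfs
a1.sinks.k1.channel                = c1
a1.sinks.k1.hdfs.path              = hdfs://namenode:8020/data/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType          = DataStream
a1.sinks.k1.hdfs.rollInterval      = 300          # seconds; roll files by time...
a1.sinks.k1.hdfs.rollSize          = 1073741824   # ...or when they reach ~1 GB
a1.sinks.k1.hdfs.rollCount         = 0            # disable count-based rolling
a1.sinks.k1.hdfs.batchSize         = 10000
```

Rolling by time or size (rather than record count) keeps the number of HDFS files manageable at this volume.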


Other answer 3:

You can use Hudi: take the data from Kafka and write it to HDFS through Hudi. You can write Structured Streaming code that reads from Kafka and writes through Hudi to HDFS; the streaming throughput is quite good.
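A minimal Scala sketch of that pipeline, assuming Spark with the Kafka source and the Hudi Spark bundle on the classpath; the broker addresses, topic, table name, paths, and the choice of record key and partition columns are all placeholder assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.streaming.Trigger

object KafkaToHudi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate()
    import spark.implicits._

    // Read the raw Kafka stream; key/value arrive as binary columns
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
      .option("subscribe", "events")                        // placeholder topic
      .option("maxOffsetsPerTrigger", "600000")             // cap records per micro-batch
      .load()

    // Keep the payload as a string and derive a date column for partitioning;
    // assumes each record has a non-null Kafka key usable as the Hudi record key
    val events = kafkaDf
      .selectExpr("CAST(key AS STRING) AS key",
                  "CAST(value AS STRING) AS value",
                  "timestamp")
      .withColumn("dt", to_date($"timestamp"))

    events.writeStream
      .format("hudi")
      .option("hoodie.table.name", "events_hudi")
      .option("hoodie.datasource.write.recordkey.field", "key")
      .option("hoodie.datasource.write.precombine.field", "timestamp")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.operation", "insert")   // append-only ingest, cheaper than upsert
      .option("checkpointLocation", "/checkpoints/events_hudi")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start("hdfs://namenode:8020/warehouse/events_hudi")     // placeholder HDFS path
      .awaitTermination()
  }
}
```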
