Problem background:
Kafka currently holds about 10 TB of data, and this data needs to be written into HDFS every day. The incoming rate is about 10,000 records per second, each record around 10 KB.
What we have tried:
1. Used Logstash to consume the data from Kafka directly and write to HDFS in the output stage (see the config sketch after this list). Result: throughput of about 500 records/s.
2. Used Logstash to write the data to local disk, then uploaded it on a schedule every hour with `hadoop fs -put`. That is roughly 400 GB per hour, which we can barely keep up with; if the data volume keeps growing, a backlog is inevitable.
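For reference, here is roughly what the pipeline from attempt 1 looked like, assuming the `logstash-output-webhdfs` output plugin; the broker, topic, and paths below are placeholders:

```conf
input {
  kafka {
    bootstrap_servers => "kafka1:9092"   # placeholder broker
    topics            => ["app_logs"]    # placeholder topic
    group_id          => "logstash_hdfs"
  }
}

output {
  webhdfs {
    host => "namenode.example.com"       # HDFS namenode (placeholder)
    port => 50070                        # WebHDFS port
    user => "hdfs"
    path => "/data/logs/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
  }
}
```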
We also considered compressing the data before uploading it:
1. Compress each hour's data before uploading it. In our tests LZ4 compresses at about 3 GB/min and gzip at about 1 GB/min; at 400 GB per hour, compression alone takes longer than the hour available, leaving no time to actually upload the data.
2. Have Logstash stream-compress its output into gzip format (see the sketch after this list). Logstash throughput drops sharply, to about 1,000 records/s.
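The streaming-compression setting in question, assuming attempt 2's local-disk workflow and Logstash's file output (the path is a placeholder):

```conf
output {
  file {
    # Hourly file on local disk, stream-compressed as it is written.
    # Turning this flag on is what dropped throughput to ~1000 records/s.
    path => "/data/staging/app_logs-%{+YYYY-MM-dd-HH}.log.gz"
    gzip => true
  }
}
```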
I would like to ask for advice on other ways to move large volumes of data from Kafka to HDFS, not limited to the components mentioned above. I am a novice in big data; thank you for your answers.
First you have to determine whether the bottleneck is upstream (Kafka is producing too much data) or downstream (HDFS write throughput is poor):
- If Kafka is producing data faster than you can consume it, you can increase the number of partitions and increase the number of records pulled per batch.
- If HDFS write throughput is the problem, consider multi-threaded or concurrent writes.
Syncing only 500 records/s with Logstash sounds like poor performance to me. Check whether you can raise Logstash's poll parameters, set its multithreading parameters, and so on; a sketch follows.
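For example, something like this in the kafka input (these option names come from the `logstash-input-kafka` plugin; the values are only starting points to tune from):

```conf
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics            => ["app_logs"]
    group_id          => "logstash_hdfs"
    consumer_threads  => 4               # rule of thumb: up to one thread per partition
    max_poll_records  => "5000"          # pull more records per poll
    fetch_max_bytes   => "52428800"      # allow larger fetches (50 MB)
  }
}
```

Also consider raising `pipeline.workers` in logstash.yml so the filter and output stages can keep up with the input.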
In this scenario I usually use the StreamSets tool; it handles TB-scale volumes without pressure. I recently built Kafka-to-HBase synchronization with it; take a look if you are interested.
You can use Flume as the Kafka consumer and then use a Flume HDFS sink to write into HDFS; all sorts of requirements can be met just through the Flume configuration file, for example:
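A minimal agent definition along these lines (broker, topic, and paths are placeholders; the sink rolls a compressed file every hour or at 128 MB, whichever comes first):

```properties
# Agent "a1": Kafka source -> file channel -> HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = kafka1:9092
a1.sources.r1.kafka.topics = app_logs
a1.sources.r1.kafka.consumer.group.id = flume_hdfs
a1.sources.r1.batchSize = 5000
a1.sources.r1.channels = c1

# File channel survives agent restarts without losing events
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /data/logs/dt=%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write compressed streams, rolling hourly or at 128 MB
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 5000
```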
You can use Hudi: take the data from Kafka and write it into HDFS through Hudi. Structured Streaming can read from Kafka and write through Hudi to HDFS, and its streaming throughput is very good; a sketch follows.
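A minimal sketch of that idea, assuming Spark 3.x with Hudi 0.9+ (where `format("hudi")` is available; older releases use `format("org.apache.hudi")`). The broker, topic, table name, and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.streaming.Trigger

object KafkaToHudi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate()
    import spark.implicits._

    // Stream records out of Kafka; cap each micro-batch so writes stay bounded.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092") // placeholder broker
      .option("subscribe", "app_logs")                  // placeholder topic
      .option("maxOffsetsPerTrigger", 500000L)
      .load()
      .selectExpr("CAST(key AS STRING) AS id",          // assumes the Kafka key is a usable record key
                  "CAST(value AS STRING) AS payload",
                  "timestamp AS ts")
      .withColumn("dt", to_date($"ts"))                 // daily partition column

    // Continuously write micro-batches into a Hudi table on HDFS.
    events.writeStream
      .format("hudi")
      .option("hoodie.table.name", "app_logs")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("checkpointLocation", "hdfs:///checkpoints/app_logs")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start("hdfs:///data/hudi/app_logs")
      .awaitTermination()
  }
}
```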