An introduction to Hadoop 3.x

User 8639654 2022-06-24 06:32:47


Anyone familiar with Hadoop knows that Hadoop 1.x was the second generation of open-source releases. It mainly fixed problems in the Hadoop 0.x versions, and as big data technology has been updated and iterated, it has been retired. With the arrival of Hadoop 2.x the architecture changed significantly, introducing the YARN platform and many new features, and it is the current mainstream version.

How does Hadoop 3.x differ from Hadoop 1.x and Hadoop 2.x? To understand it better, let's take a closer look at Hadoop 3.x.

I. A brief introduction to Hadoop 3.x

Hadoop 3.x is developed on JDK 1.8. Compared with the previous two versions, it changes substantially in both features and optimizations, including HDFS erasure coding, support for multiple NameNodes, and the MapReduce native task optimization.

According to news attributed to Apache Hadoop, the architecture of Hadoop 3.x was to be adjusted so that MapReduce processes data using memory, I/O, and disk together. The biggest change was said to be in HDFS, which computes on the nearest block: following a nearest-computation (data locality) principle, local blocks are loaded into memory for computation, a shared in-memory compute area is then used through I/O, and the results are produced quickly. The speed is claimed to be 10 times that of Spark, though no benchmark or source is given for this figure.

II. New features in Hadoop 3.x

Hadoop 3.x makes a number of significant improvements to the Hadoop kernel in both functionality and performance, mainly in generality, HDFS, MapReduce, and YARN resources:

(I) Generality improvements

1. Slimming down the Hadoop kernel, including removing expired APIs and implementations and replacing default component implementations with the most efficient ones.

2. Classpath isolation, to prevent conflicts between different versions of jar packages.

3. Shell script refactoring. The startup scripts differ from those in Hadoop 2.x: Hadoop 3.x refactors Hadoop's management scripts, fixes many bugs, adds new features, and adds dynamic commands.

(II) HDFS improvements

1. HDFS erasure coding

In Hadoop 3.x, HDFS implements the new Erasure Coding feature. Erasure coding, abbreviated EC, is a data protection technique. It was first used in the communications industry to recover data lost during transmission, and it is a fault-tolerant coding technique. With erasure coding support, HDFS can save about half of the storage space compared with replication, without reducing reliability.
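The "half the storage" figure can be checked with simple arithmetic. The sketch below assumes three-way replication as the baseline and a Reed-Solomon layout with 6 data units and 3 parity units; both numbers are assumptions for illustration, not values stated in this article. The recovery part uses single XOR parity, a much simpler code than Reed-Solomon, only to show how a lost block can be rebuilt from the survivors. All function names are hypothetical.

```python
from typing import List, Sequence


def storage_multiplier_replication(replicas: int) -> float:
    """Raw bytes stored per byte of user data when keeping full copies."""
    return float(replicas)


def storage_multiplier_ec(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data with erasure coding."""
    return (data_units + parity_units) / data_units


def xor_parity(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (single-parity code, not Reed-Solomon)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing data block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # Assumed baseline: 3 replicas. Assumed EC layout: 6 data + 3 parity units.
    print("replication:", storage_multiplier_replication(3))
    print("erasure coding:", storage_multiplier_ec(6, 3))

    blocks: List[bytes] = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(blocks)
    # Pretend the middle block was lost and rebuild it.
    print(recover_missing([blocks[0], blocks[2]], parity))
```

Under these assumptions the multiplier drops from 3.0 to 1.5, which is where the roughly 50% saving comes from.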

2. Support for multiple NameNodes

The original HDFS NameNode high-availability implementation provides only one Active NameNode and one Standby NameNode, with the edit log replicated to three JournalNodes. This architecture can tolerate the failure of any single node in the system. Some deployments need greater fault tolerance, which this new feature provides by allowing users to run multiple Standby NameNodes.
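How many failures a deployment can tolerate can be estimated with a small calculation. The sketch below is a simplified model, assuming that edits must reach a majority of JournalNodes and that at least one NameNode must survive to act as Active; it ignores other components such as the failover coordinator. The function names are hypothetical.

```python
def tolerated_journal_failures(journal_nodes: int) -> int:
    """A majority of JournalNodes must stay up, so only a minority may fail."""
    return (journal_nodes - 1) // 2


def tolerated_namenode_failures(namenodes: int) -> int:
    """At least one NameNode must survive to serve as Active."""
    return max(namenodes - 1, 0)


def tolerated_failures(namenodes: int, journal_nodes: int) -> int:
    """Failures the whole setup tolerates in this simplified model."""
    return min(
        tolerated_namenode_failures(namenodes),
        tolerated_journal_failures(journal_nodes),
    )


if __name__ == "__main__":
    # Classic setup: 1 Active + 1 Standby NameNode, 3 JournalNodes.
    print(tolerated_failures(namenodes=2, journal_nodes=3))
    # Larger setup: 1 Active + 2 Standby NameNodes, 5 JournalNodes.
    print(tolerated_failures(namenodes=3, journal_nodes=5))
```

In this model, adding a second Standby NameNode only raises the overall tolerance if the number of JournalNodes grows as well.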

(III) MapReduce improvements

  • MapReduce task-level native optimization: to speed up MapReduce, a C/C++ implementation of the map output collector (including Spill, Sort, IFile, etc.) has been added. You can switch to this implementation by adjusting job-level parameters. For shuffle-intensive applications, performance can improve by about 30%.
  • Automatic inference of MapReduce memory parameters: in Hadoop 2.0, setting memory parameters for MapReduce jobs was very tedious, and an unreasonable setting caused a serious waste of memory resources. Hadoop 3.x avoids this situation.
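The memory inference can be illustrated with a short sketch. It assumes the behaviour described for the properties `mapreduce.map.memory.mb`, `mapreduce.map.java.opts`, and `mapreduce.job.heap.memory-mb.ratio` (default 0.8): when only one of container size and heap size is set, the other is derived from the ratio, and when neither is set the container falls back to 1024 MB. These property names and defaults are not given in this article and should be checked against the documentation of your Hadoop version; the rounding and the function name `infer_memory` are hypothetical simplifications.

```python
from typing import Optional, Tuple

DEFAULT_CONTAINER_MB = 1024  # assumed fallback when nothing is configured
DEFAULT_HEAP_RATIO = 0.8     # assumed default heap-to-container ratio


def infer_memory(
    container_mb: Optional[int],
    heap_mb: Optional[int],
    heap_ratio: float = DEFAULT_HEAP_RATIO,
) -> Tuple[int, int]:
    """Return (container_mb, heap_mb), filling in whichever value is missing."""
    if container_mb is None and heap_mb is None:
        container_mb = DEFAULT_CONTAINER_MB
    if container_mb is None:
        # Only the heap (-Xmx) is known: size the container around it.
        container_mb = round(heap_mb / heap_ratio)
    elif heap_mb is None:
        # Only the container is known: give the JVM a fraction of it.
        heap_mb = int(container_mb * heap_ratio)
    return container_mb, heap_mb


if __name__ == "__main__":
    print(infer_memory(container_mb=2048, heap_mb=None))
    print(infer_memory(container_mb=None, heap_mb=1024))
    print(infer_memory(container_mb=None, heap_mb=None))
```

The point of the design is that a user sets one value and the other follows, instead of keeping two settings consistent by hand.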

(IV) YARN resource types

The YARN resource model has been extended to support user-defined countable resource types beyond CPU and memory. For example, a cluster administrator can define resources such as GPUs, software licenses, or locally-attached storage. YARN tasks can then be scheduled according to the availability of these resources.
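Scheduling on countable resources amounts to checking that every requested amount is available on a node. The following is a minimal sketch of that idea, not YARN's actual scheduler; the resource names (`memory-mb`, `vcores`, `yarn.io/gpu`), node names, and functions are illustrative assumptions.

```python
from typing import Dict, Optional

Resources = Dict[str, int]  # resource name -> countable amount


def fits(request: Resources, available: Resources) -> bool:
    """A request fits if every requested resource is available in full."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def pick_node(request: Resources, nodes: Dict[str, Resources]) -> Optional[str]:
    """Return the first node that can satisfy the request, or None."""
    for node_name, available in nodes.items():
        if fits(request, available):
            return node_name
    return None


def allocate(request: Resources, available: Resources) -> Resources:
    """Return the node's remaining resources after granting the request."""
    if not fits(request, available):
        raise ValueError("request does not fit on this node")
    remaining = dict(available)
    for name, amount in request.items():
        remaining[name] -= amount
    return remaining


if __name__ == "__main__":
    nodes = {
        "node-a": {"memory-mb": 8192, "vcores": 8},
        "node-b": {"memory-mb": 8192, "vcores": 8, "yarn.io/gpu": 2},
    }
    gpu_task = {"memory-mb": 2048, "vcores": 1, "yarn.io/gpu": 1}
    chosen = pick_node(gpu_task, nodes)
    print(chosen)
    if chosen is not None:
        print(allocate(gpu_task, nodes[chosen]))
```

Because resources are plain name-to-count pairs, a new type such as a software license needs no change to the matching logic.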

Hadoop 3.x improves HDFS, MapReduce, YARN, and other components. It also introduces some important features and optimizations, including HDFS erasure coding, support for multiple NameNodes, the MapReduce native task optimization, cgroup-based memory and disk I/O isolation in YARN, and YARN container resizing.

Copyright: author [User 8639654]. Please include the original link when reprinting, thank you.