HDFS Disk Balancer: introduction, functions, and related commands (plan, execute, query, cancel, report)

yida&yueda 2022-02-13 07:51:47


HDFS Disk Balancer


Compared with a personal PC, a server can generally mount multiple disks to expand the storage capacity of a single machine.

Figure: server disks

In Hadoop HDFS, the DataNode is responsible for the final storage of data blocks and distributes them across the disks on its machine. When a new block is written, the DataNode chooses a disk (volume) for the block according to its volume-choosing policy, which is either the round-robin policy or the available-space policy.

Round-robin policy: distributes new blocks evenly across the available disks. This is the default policy.

Available-space policy: writes data to the disk that has more free space (measured as a percentage).

Figure: percentage of disk storage
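The difference between the two policies can be illustrated with a short Python sketch. This is a simplified model, not Hadoop's implementation: the `Volume` class, the chooser names, and the sample sizes are made up for illustration, and the sketch does not update usage after each write.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


@dataclass
class Volume:
    name: str
    capacity_gb: float
    used_gb: float

    @property
    def free_ratio(self) -> float:
        # Fraction of this volume's capacity that is still free.
        return (self.capacity_gb - self.used_gb) / self.capacity_gb


class RoundRobinChooser:
    """Cycle through the volumes in order, ignoring how full they are."""

    def __init__(self) -> None:
        self._next = count()

    def choose(self, volumes: list[Volume]) -> Volume:
        return volumes[next(self._next) % len(volumes)]


def choose_by_free_space(volumes: list[Volume]) -> Volume:
    """Pick the volume with the highest percentage of free space."""
    return max(volumes, key=lambda v: v.free_ratio)


# Hypothetical volumes: (name, capacity in GB, used in GB).
volumes = [
    Volume("disk1", 200, 100),
    Volume("disk2", 300, 76),
    Volume("disk3", 350, 300),
]

rr = RoundRobinChooser()
round_robin_picks = [rr.choose(volumes).name for _ in range(4)]
free_space_pick = choose_by_free_space(volumes).name

print(round_robin_picks)  # ['disk1', 'disk2', 'disk3', 'disk1']
print(free_space_pick)    # disk2
```

Round-robin keeps writing to disk3 even though it is nearly full, while the free-space chooser keeps preferring disk2 until its free percentage drops below the others.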

However, when the round-robin policy is used in a long-running cluster, a DataNode sometimes fills its storage directories (disks/volumes) unevenly, so that some disks are full while others are barely used. This can be caused by a large number of write and delete operations, or by the replacement of a disk. On the other hand, with a policy based on available space, every new write goes to a newly added empty disk, which leaves the other disks idle during that period and creates a bottleneck on the new disk.

Intra-DataNode balancing (an even distribution of data blocks within a DataNode) is therefore needed to correct intra-DataNode skew (an uneven distribution of blocks across disks), which arises from disk replacement or from random writes and deletes. For this purpose, Hadoop 3.0 introduced a tool called Disk Balancer, which focuses on distributing data within a DataNode.

Introduction to HDFS Disk Balancer

HDFS Disk Balancer is a command-line tool introduced in Hadoop 3 that evens out data that is unevenly distributed across the disks of a DataNode. Note that HDFS Disk Balancer is different from HDFS Balancer: HDFS Disk Balancer operates on a given DataNode and moves blocks from one disk to another, balancing data across the disks inside that DataNode, whereas HDFS Balancer balances the distribution of data between DataNodes.

HDFS Disk Balancer functions

HDFS Disk Balancer supports two main functions: reporting and balancing.

Data distribution report

To measure which machines in the cluster suffer from uneven data distribution, HDFS Disk Balancer defines two metrics: the HDFS volume data density metric (for a volume/disk) and the node data density metric.

The volume data density metric compares how data is distributed across the different volumes of a given node.

The node data density metric allows comparisons between nodes.

  • Calculating the volume data density metric

Suppose a machine has four volumes (disks), Disk1, Disk2, Disk3, and Disk4, with the following usage:

                    Disk1     Disk2     Disk3     Disk4
capacity            200 GB    300 GB    350 GB    500 GB
dfsUsed             100 GB    76 GB     300 GB    475 GB
dfsUsedRatio        0.5       0.25      0.85      0.95
volumeDataDensity   0.20      0.45      -0.15     -0.24

Total capacity = 200 + 300 + 350 + 500 = 1350 GB
Total used = 100 + 76 + 300 + 475 = 951 GB

The ideal storage on each volume (disk) is therefore:

Ideal storage = total used ÷ total capacity = 951 ÷ 1350 = 0.70

That is, each disk should ideally be filled to 70% of its capacity.

volumeDataDensity = idealStorage - dfsUsedRatio

For example, the volume data density of Disk1 = 0.70 - 0.50 = 0.20. The other disks are calculated in the same way.

A positive volumeDataDensity indicates that the disk is under-utilized, and a negative value indicates that the disk is over-utilized relative to the current ideal storage target.
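As a sanity check, the calculation for the four-disk example can be reproduced in a few lines of Python. This is only a sketch of the arithmetic described above; the variable names are chosen for illustration. It keeps full precision, so the last digit can differ slightly from the two-decimal values in the table (for instance, Disk4 comes out as about -0.2456).

```python
# Capacity and DFS usage in GB, taken from the example table.
disks = {
    "Disk1": (200, 100),
    "Disk2": (300, 76),
    "Disk3": (350, 300),
    "Disk4": (500, 475),
}

total_capacity = sum(capacity for capacity, _ in disks.values())
total_used = sum(used for _, used in disks.values())

# Ideal storage: the usage ratio every disk would have if data were even.
ideal_storage = total_used / total_capacity

# volumeDataDensity = idealStorage - dfsUsedRatio, per disk.
volume_data_density = {
    name: ideal_storage - used / capacity
    for name, (capacity, used) in disks.items()
}

print(f"ideal storage = {ideal_storage:.4f}")
for name, density in volume_data_density.items():
    state = "under-utilized" if density > 0 else "over-utilized"
    print(f"{name}: {density:+.4f} ({state})")
```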

  • Calculating the node data density

Node data density = the sum of the absolute values of the volume data density of all volumes (disks) on the node.

Node data density in the example above = |0.20| + |0.45| + |-0.15| + |-0.24| = 1.04

A lower nodeDataDensity value indicates that data is well spread across the node's disks, and a higher value indicates that the node has a more skewed data distribution.

Once volumeDataDensity and nodeDataDensity are available, you can find the nodes in the cluster whose data distribution is skewed, or obtain the volumeDataDensity of a given node.
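The following Python sketch shows how node data density could be used to rank nodes by skew. The values for "node-a" are the volume data densities from the example; "node-b" and both node names are invented for illustration.

```python
def node_data_density(volume_densities):
    """Sum of the absolute volume data densities of one node."""
    return sum(abs(d) for d in volume_densities)


# "node-a" uses the values from the example; "node-b" is made up.
cluster = {
    "node-a": [0.20, 0.45, -0.15, -0.24],
    "node-b": [0.02, -0.01, 0.03, -0.04],
}

densities = {node: node_data_density(v) for node, v in cluster.items()}

# Nodes with the most skewed data distribution come first.
most_skewed_first = sorted(densities, key=densities.get, reverse=True)

for node in most_skewed_first:
    print(f"{node}: {densities[node]:.2f}")
```

Under these assumptions, node-a (1.04) is listed before node-b (0.10), so node-a would be the first candidate for disk balancing.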

Disk balancing

When a DataNode is selected for disk balancing, the current volumeDataDensity (disk data density) can be calculated or read. With this information, it is easy to determine which volumes are over-provisioned and which are under-provisioned. To move data from one volume to another within the DataNode, Hadoop implements Disk Balancer on top of an RPC protocol.

Enabling HDFS Disk Balancer

HDFS Disk Balancer works by creating a plan and then executing it on the DataNode. A plan is a set of statements that describe how much data should be moved between two disks. It consists of multiple move steps, and each move step has a source disk, a destination disk, and the number of bytes to move. The plan is executed against an operational DataNode.
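To make the structure of a plan concrete, here is a rough Python sketch that models a plan as a list of move steps and derives one from disk usage with a naive greedy pairing. This is not Hadoop's actual planner or its plan file format: `MoveStep`, `make_plan`, and the pairing strategy are assumptions made for illustration, and the threshold only imitates the idea behind the thresholdPercentage option.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class MoveStep:
    source: str         # volume to move data from
    destination: str    # volume to move data to
    bytes_to_move: int


def make_plan(volumes, threshold_percent=10.0):
    """Build move steps for one DataNode.

    volumes maps a volume name to (capacity_bytes, used_bytes).
    """
    total_capacity = sum(capacity for capacity, _ in volumes.values())
    total_used = sum(used for _, used in volumes.values())
    ideal = total_used / total_capacity

    # Bytes each volume holds above (surplus) or below (deficit) its
    # ideal share; volumes within the threshold are left alone.
    surplus, deficit = {}, {}
    for name, (capacity, used) in volumes.items():
        deviation = used / capacity - ideal
        if abs(deviation) * 100 < threshold_percent:
            continue
        excess = int(used - ideal * capacity)
        if excess > 0:
            surplus[name] = excess
        elif excess < 0:
            deficit[name] = -excess

    # Greedy pairing: drain the fullest volumes into the emptiest ones.
    steps = []
    for src in sorted(surplus, key=surplus.get, reverse=True):
        for dst in sorted(deficit, key=deficit.get, reverse=True):
            amount = min(surplus[src], deficit[dst])
            if amount > 0:
                steps.append(MoveStep(src, dst, amount))
                surplus[src] -= amount
                deficit[dst] -= amount
    return steps


# The four-disk example: (capacity, used) in bytes.
volumes = {
    "Disk1": (200 * GB, 100 * GB),
    "Disk2": (300 * GB, 76 * GB),
    "Disk3": (350 * GB, 300 * GB),
    "Disk4": (500 * GB, 475 * GB),
}

plan = make_plan(volumes)
for step in plan:
    moved_gb = step.bytes_to_move / GB
    print(f"{step.source} -> {step.destination}: {moved_gb:.2f} GB")
```

For the example disks this sketch yields three move steps: Disk4 to Disk2, Disk3 to Disk1, and Disk3 to Disk2, moving roughly 176 GB in total.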

By default, Disk Balancer is enabled on a Hadoop cluster (early Hadoop 3.0 releases shipped with it disabled, so check the default for your version). You can choose whether Disk Balancer is enabled by adjusting the value of the dfs.disk.balancer.enabled parameter in hdfs-site.xml.

HDFS Disk Balancer commands

Plan

Command: hdfs diskbalancer -plan <datanode>

-out // Controls the output location of the plan file.
-bandwidth // Sets the maximum bandwidth used to run Disk Balancer. The default bandwidth is 10 MB/s.
-thresholdPercentage // Defines the value at which a disk starts to participate in data redistribution or balancing. The default thresholdPercentage is 10%, which means a disk is used in the balancing operation only if it contains 10% more or less data than the ideal storage value.
-maxerror // Specifies the number of errors to ignore for a move operation between two disks before that move step is aborted.
-v // Verbose mode. Specifying this option forces the plan command to print a summary of the plan on stdout.
-fs // Specifies the NameNode to use. If it is not specified, Disk Balancer uses the default NameNode from the configuration.
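To show how these options combine on one command line, the following Python sketch assembles the arguments for the plan command. The helper name `plan_command`, its parameter names, and the host name are hypothetical; only the flags listed above are used, and their order here is an arbitrary choice.

```python
import shlex


def plan_command(datanode, out=None, bandwidth=None,
                 threshold_percentage=None, max_error=None,
                 verbose=False, fs=None):
    """Return the argument list for `hdfs diskbalancer -plan`."""
    argv = ["hdfs", "diskbalancer", "-plan", datanode]
    options = [
        ("-out", out),
        ("-bandwidth", bandwidth),
        ("-thresholdPercentage", threshold_percentage),
        ("-maxerror", max_error),
        ("-fs", fs),
    ]
    # Only emit the options the caller actually set.
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    if verbose:
        argv.append("-v")
    return argv


# Hypothetical DataNode host name, used only for illustration.
argv = plan_command("datanode1.example.com", bandwidth=20,
                    threshold_percentage=5, verbose=True)
command_line = shlex.join(argv)
print(command_line)
```

This prints `hdfs diskbalancer -plan datanode1.example.com -bandwidth 20 -thresholdPercentage 5 -v`. The sketch only builds the command; running it requires a machine with the HDFS client installed.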


Execute

Command: hdfs diskbalancer -execute <JSON file path>

The execute command runs the plan against the DataNode for which the plan was generated.

Query

Command: hdfs diskbalancer -query <datanode>

The query command gets the current status of the disk balancer from the DataNode on which the plan is running.

Cancel

Command: hdfs diskbalancer -cancel <JSON file path>
         hdfs diskbalancer -cancel <planID> -node <nodename>

The cancel command cancels a running plan.

Report

Command: hdfs diskbalancer -fs https://namenode.uri -report <file://>

The report command reports on nodes that would benefit from running the disk balancer.
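A typical session runs these commands in sequence: execute a generated plan, query its status, and cancel it if necessary. The Python sketch below only assembles those command lines. The function names, the host name, and the plan file path are hypothetical, and the cancel step uses the plan-file form of the command.

```python
import shlex


def execute_command(plan_file):
    return ["hdfs", "diskbalancer", "-execute", plan_file]


def query_command(datanode):
    return ["hdfs", "diskbalancer", "-query", datanode]


def cancel_command(plan_file):
    return ["hdfs", "diskbalancer", "-cancel", plan_file]


def balancing_session(datanode, plan_file, cancel=False):
    """Commands to run, in order, once a plan file exists."""
    commands = [execute_command(plan_file), query_command(datanode)]
    if cancel:
        commands.append(cancel_command(plan_file))
    return commands


# Hypothetical host name and plan file path, for illustration only.
session = balancing_session(
    "datanode1.example.com",
    "/tmp/datanode1.example.com.plan.json",
    cancel=True,
)
lines = [shlex.join(argv) for argv in session]
for line in lines:
    print(line)
```

To actually run the session, each argument list could be passed to `subprocess.run(argv, check=True)` on a host where the `hdfs` client is available.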
copyright:author[yida&yueda],Please bring the original link to reprint, thank you. https://en.javamana.com/2022/02/202202130751446949.html