Hive table data volume statistics principle and source code analysis

Book Recalls Jiangnan 2022-09-23 07:58:37


When you run EXPLAIN in Hive to obtain an execution plan, you often see table data volume statistics like those shown in the following figure:

Then how does Hive calculate this amount of data?

1. Data size statistics

1.1 Hive source code

After Hive parses the SQL into an abstract syntax tree (AST) with the Antlr parser and generates a logical execution plan validated against the metadata, statistics rules are applied during the optimization phase, as shown in the following figure:

In the AnnotateWithStatistics class, the TableScanStatsRule rule is invoked while the execution plan is being transformed, as shown in the following figure:

When the TableScanStatsRule rule matches, it first obtains the list of partitions that remain after pruning (PrunedPartitionList) and then calls the collectStatistics() method to start collecting the table's Statistics information, as shown in the following figure:


After gathering the columns referenced by the select statement and related information, it calls an overloaded method of the same name, as shown in the following figure:

In the final collectStatistics() overload, the getDataSize() method is called to compute the data volume (there are also calls that compute the number of rows), as shown in the following figure:

You can see that the logic for computing the data volume is to first read the raw data size from the table parameters in the Hive metastore (stored in MySQL). If that is missing, it falls back to the total data size, also from the metastore, and failing that it counts the size of the files in the table's HDFS directory directly. The on-disk figure is then multiplied by the deserialization factor, because the table files may be compressed and serialized, so their size on disk is smaller than the original data.
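The fallback order can be sketched in a few lines of Python. This is a simplified illustration rather than Hive's actual Java code: the function name `estimate_data_size` and the `file_size_on_disk` parameter are hypothetical, and the default factor of 1.0 is an assumption standing in for the `hive.stats.deserialization.factor` setting. `rawDataSize` and `totalSize` are the table parameter keys stored in the metastore.

```python
def estimate_data_size(table_params, file_size_on_disk, deser_factor=1.0):
    """Return an estimated data size in bytes using the fallback order above.

    table_params: dict of metastore table parameters (values may be strings).
    file_size_on_disk: size of the table directory in bytes (hypothetical input,
        standing in for the HDFS directory scan).
    deser_factor: assumed multiplier compensating for compression/serialization.
    """
    raw = int(table_params.get("rawDataSize", 0))
    if raw > 0:
        # Raw data size is already an uncompressed figure: use it as-is.
        return raw

    size = int(table_params.get("totalSize", 0))
    if size <= 0:
        # No usable metastore statistics: fall back to the directory size.
        size = file_size_on_disk

    # On-disk sizes are scaled up to approximate the uncompressed data.
    return int(size * deser_factor)


print(estimate_data_size({"totalSize": "1024"}, 512, deser_factor=2.0))
```

The factor is only applied on the fallback paths, since those are the ones that measure files as stored on disk.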

1.2 Hadoop source code

So how does Hive get the size of the files under the table's HDFS directory? The getDataSize() function calls into Hadoop's HDFS API, as shown in the following figure:

In the Hadoop source code, you can see that for a file the length is returned directly, while for a directory the sizes of its contents are summed recursively, as shown in the following figure:
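The file-versus-directory recursion looks roughly like the following Python sketch. It runs against the local file system with the standard `os` module, not against HDFS, and the function name `content_length` is hypothetical; it only illustrates the shape of the recursion.

```python
import os


def content_length(path):
    """Return the size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        # A file contributes its own length directly.
        return os.path.getsize(path)

    total = 0
    for entry in os.scandir(path):
        # A directory contributes the sum over its children, recursively.
        total += content_length(entry.path)
    return total
```

Calling `content_length` on a table directory would add up every data file in every partition subdirectory beneath it.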

So how is this length calculated? The getLen() function returns a length variable, which is ultimately set here:

This eventually reaches the JDK level, which returns the size in bytes, as shown in the following figure:

2. Num rows statistics

In Hive's collectStatistics() function above, getNumRows() is called to obtain the number of rows in the table. If the row count cannot be obtained from the Hive metastore, an estimate is used instead, as shown below:

In the estimateRowSizeFromSchema() function, after obtaining each column of the table, Hive checks the column's field type and adds up the size that one value of that type is expected to occupy. The types fall into variable-length types such as string, varchar, struct, and map, and fixed-length types such as int, double, and boolean, as shown in the following figure:
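A minimal Python sketch of this per-column accumulation follows. The byte sizes used here are illustrative assumptions, not the constants Hive uses, and `estimate_row_size`, `FIXED_SIZES`, and `ASSUMED_VARIABLE_SIZE` are hypothetical names.

```python
# Illustrative sizes in bytes; these are assumptions, not Hive's constants.
FIXED_SIZES = {"boolean": 1, "int": 4, "double": 8}
VARIABLE_TYPES = {"string", "varchar", "struct", "map"}
ASSUMED_VARIABLE_SIZE = 100  # hypothetical average size of one variable-length value


def estimate_row_size(columns):
    """Sum an estimated size for one value of each column.

    columns: iterable of (name, type) pairs, e.g. ("name", "varchar(20)").
    """
    total = 0
    for _name, col_type in columns:
        # Strip parameters such as "(20)" or "<string,int>" to get the base type.
        base = col_type.split("(")[0].split("<")[0].strip().lower()
        if base in FIXED_SIZES:
            total += FIXED_SIZES[base]
        elif base in VARIABLE_TYPES:
            total += ASSUMED_VARIABLE_SIZE
        else:
            raise ValueError(f"unhandled type: {col_type}")
    return total


schema = [("id", "int"), ("name", "varchar(20)"),
          ("score", "double"), ("tags", "map<string,int>")]
print(estimate_row_size(schema))  # 4 + 100 + 8 + 100 = 212
```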

Summing the per-value size over all columns of the table gives the estimated size of one row. Back in getNumRows(), the table's data size is divided by this estimated row size to obtain the estimated number of rows, as shown in the following figure:

If the number of rows still cannot be obtained this way (for example, because the table's data size was never obtained), one row is returned. Hive's statistics logic thus takes both response speed (values are taken from the metastore when possible) and fault tolerance (fallbacks for when the statistics cannot be found) into account.
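The row-count logic, including the one-row fallback, can be sketched as follows. This is a Python illustration under stated assumptions, not Hive's Java implementation: `estimate_num_rows` is a hypothetical name, and `numRows` is the table parameter key stored in the metastore.

```python
def estimate_num_rows(table_params, data_size, avg_row_size):
    """Return the row count from the metastore, or an estimate, or 1 as a last resort.

    table_params: dict of metastore table parameters (values may be strings).
    data_size: estimated table data size in bytes.
    avg_row_size: estimated size of one row in bytes.
    """
    num_rows = int(table_params.get("numRows", 0))
    if num_rows > 0:
        # Prefer the stored statistic: no computation needed.
        return num_rows

    if data_size > 0 and avg_row_size > 0:
        # Estimate: total size divided by the estimated size of one row.
        return max(1, data_size // avg_row_size)

    # Nothing usable: report a single row.
    return 1


print(estimate_num_rows({}, 10_000, 100))  # 10000 bytes / 100 bytes per row = 100
```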

Copyright: Book Recalls Jiangnan. Please include the original link when reprinting, thank you.