What are the differences among order by, sort by, distribution by and cluster by in Apache hive?

Shockang 2022-01-27 05:02:57 阅读数:488

differences order sort distribution cluster

This is my participation 8 The fourth of the yuegengwen challenge 8 God , Check out the activity details :8 Yuegengwen challenge

Text

  1. order by It will sort all the data given globally , No matter how much data comes , Start only one reducer To deal with it .
  2. sort by yes Local sorting ,sort by One or more... Will be started according to the size of the data volume reducer Come to work , also , It will enter reduce Before for each reducer All produce a sort file .
  3. distribute by control map Distribution of results , It will... With the same field map The output is distributed to a reduce Processing on nodes .
  4. cluster by It can be understood as a special distribute by and sort by The combination of , When distribute by and sort by The following column names are the same , It's equivalent to using cluster by Keep up with the column name . But be cluster by The final sorting result of the specified column can only be descending , And you can't specify asc and desc.

Add

1. order by Global ordering

Global ordering , only one reduce

Use order by Clause ordering

  1. asc (ascend)—— Ascending ( Default )
  2. desc (descend)—— Descending

order by Clause in select End of statement

2. distribute by Partition sorting

distribute by similar MapReduce in partition,== collection hash Algorithm , stay map The end will query the results in hash Results with the same value are distributed to the corresponding reduce In file ==. Need to combine sort by Use .

Be careful : Hive requirement distribute by The statement is written in sort by The statement before .

3. cluster by

  • When distribute by and sort by Same field , have access to cluster by The way

  • except distribute by Function outside , It also sorts the fields , therefore cluster by = distribute by + sort by

-- The following two ways are equivalent 
insert overwrite local directory '/home/hadoop/hivedata/distribute_sort'
select * from student distribute by score sort by score;
insert overwrite local directory '/home/hadoop/hivedata/cluster'
select * from student cluster by score;
 Copy code 

practice

1. Check the student's grades , And in descending order of scores

select * from student s order by score desc;
 Copy code 

2. Sort by alias

  • Sort according to the average of students' scores
select s.sid,s.tname, avg(score) as score_avg from student s group by s.sid,s.tname order by score_avg desc;
 Copy code 

3. Multi column sorting

  • Sort in ascending order of students' scores and age
select * from student s order by score,age;
 Copy code 

4. Every MapReduce Internal sorting (Sort By) Local sorting

sort by: Every reducer Sort internally , Not sort for global result sets .

1、 Set up reduce Number

set mapreduce.job.reduces=3;
 Copy code 

2、 Check the Settings reduce Number

set mapreduce.job.reduces;
 Copy code 

3、 The query results are arranged in descending order

select * from student s sort by s.score;
 Copy code 

4、 Import the query results into a file ( In descending order of grades )

insert overwrite local directory '/home/hadoop/hivedata/sort' select * from student s sort by s.score;
 Copy code 

5. First according to the students sid partition , Then sort according to the students' grades

1、 Set up reduce The number of

set mapreduce.job.reduces=3;
 Copy code 

2、 adopt distribute by Partition data , Will be different sid Divided into corresponding reduce Go to of

insert overwrite local directory '/home/hadoop/hivedata/distribute' select * from student distribute by sid sort by score;
 Copy code 
copyright:author[Shockang],Please bring the original link to reprint, thank you. https://en.javamana.com/2022/01/202201270502509068.html