# After reading this article, you will master the use of hive over window function with exercises

Big data learning monk 2022-02-13 07:15:51 阅读数:393

Preface ： We are learning hive Window function , Be sure to understand the structure of the window function first . Instead of directly Baidu sum() over()、row_number() over()、 perhaps count() over() Usage of , If you do , You will never master the core of window function , Of course, I did the same thing when I first started .

Fortunately, I am more tenacious , stay HIVE I've been tossing about window function for half a month 、 After reading a lot of articles, I know over() Is the window function , and sum、row_number、count Only with over() Collocation analysis function , Of course, there are other functions besides these three functions .

Individual to over() Window understanding ： This is always a line corresponding to a window , As for the scope of this window, it depends on over() What are the constraints on the window range in the function （partition by order by between … and） adopt partition by Keyword to group windows , Special attention ： adopt order by Come on order by Open a window for the row sorted by the field , Just note that the window size of the first row of data is 1, The window range of the second row of data is the first 2 That's ok , The first n The window range of the row is the front n That's ok , And so on . If there are no conditions , Then each row corresponds to the whole table .

Special window functions such as rank(),rownumber(),dense() etc. , Even behind over() There are no conditions , The default window is similar to order by effect , That is, the window size of the first line is 1, The window size of the second line is 2, And so on , But the data just don't make any statistical sense , So it's usually in over() Add in partiton by and order by（ grouping , Sort ） etc. , Give meaning to it Such as ranking .

over(partition by ) and ordinary group by The difference between , Why is it different group by, Because there is group by, Can only select group by Later fields , And some aggregate functions sum(),avg(),max(),min() etc. , And I used it over(partition by), Can also select Other non partition by Field Or you can just “select *”, And for join Wait for better support .

## One 、hive Window function syntax

1、over() Syntax structure of window function
3.over(partition by name) Each line is based on name To distinguish windows
4.over(partition by name order by id) Each line is based on name To distinguish windows , According to order by Take the specific scope
2、 Often with over() Analysis functions used together ：
3、 Summary of window functions ：

## Two 、hive Window function exercise 28 Problem

The first set of exercises ：hive Simple window function over()
1、 Use over() Function for data statistics , Count the total number of each user and data in the table
Test data
Create tables and import data ：
Let's first look at the data in the table below ：
The second set of exercises
Need to be seek
Demand analysis
The third set of exercises

## One 、hive Window function syntax

In the preface, we have said avg()、sum()、max()、min() Is the analysis function , and over() Is the window function , Let's take a look at over() Syntax structure of window function 、 And often with over() Analysis functions used together

over() Syntax structure of window function
Often with over() Analysis functions used together
Summary of window functions

1、over() Syntax structure of window function
Analysis function over(partition by Name order by Name rows between Starting position and End position )
over() The function includes three functions ： Including partitions partition by Name 、 Sort order by Name 、 Specify the window range rows between Starting position and End position . We are using over() Window function ,over() The three functions in the function can be used in combination or not .

over() If these three functions are not used in the function , The window size is for all data generated by the query , If a partition is specified , The window size is the data for each partition .

1.over() By default, the window of each row is all rows

``````select *,count(1) over() from business;
``````

2.over(order by orderdate)

orderdate=1 The window has only one line ,orderdate=2 Your windows include orderdate=2017-01-01,orderdate=2017-01-02

``````select *,count(1) over(order by orderdate) from business;
``````

3.over(partition by name) Each line is based on name To distinguish windows

``````select *,sum(cost) over(partition by name) from business;
``````

4.over(partition by name order by id) Each line is based on name To distinguish windows , According to order by Take the specific scope

``````select *,sum(cost) over(partition by name order by orderdate) from business;
``````

1.1、over() Explain the three functions in the function
order by
order by It means sort , Is... In this window

A、partition by
partition by Can be understood as group by grouping .over(partition by Name ) When collocation analysis function , The analysis function is calculated according to each group of data .

B、rows between Starting position and End position
Is to specify the window range , For example, from the first line to the current line . And this range changes with the data .over(rows between Starting position and End position ) When collocation analysis function , The analysis function calculates according to this range .

Window range description ：
The window range we often use is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW（ Indicates from the starting point to the current line ）, This window is often used to calculate the accumulation .

``````PRECEDING： forward
FOLLOWING： Back up
CURRENT ROW： Current row
UNBOUNDED： The starting point （ General combination PRECEDING,FOLLOWING Use ）
UNBOUNDED PRECEDING Represents the front row of the window （ The starting point ）
UNBOUNDED FOLLOWING： Represents the last line of the window （ End ）
for instance ：
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW（ Indicates from the starting point to the current line ）
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING（ It means forward 2 Go back 1 That's ok ）
ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW（ It means forward 2 Line to current line ）
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING（ Indicates that the current line reaches the end point ）
``````

2、 Often with over() Analysis functions used together ：
2.1 Aggregate class

``````avg()、sum()、max()、min()
``````

2.2 Ranking class

``````row_number() A self incrementing number is generated when sorting by value , No repetition （ Such as ：1、2、3、4、5、6）
rank() A self incrementing number is generated when sorting by value , Repeat when values are equal , There will be vacancies （ Such as ：1、2、3、3、3、6）
dense_rank() A self incrementing number is generated when sorting by value , Repeat when values are equal , There is no vacancy （ Such as ：1、2、3、3、3、4）
``````

2.3 Other categories

``````lag( Name , The number of lines ahead ,[ The number of rows is null The default value of , Not specified as null]), You can calculate the last purchase time of the user , Or the user's next purchase time . Or last login time and next login time
lead( Name , Next rows ,[ The number of rows is null The default value of , Not specified as null])
ntile(n) Distribute rows in ordered partitions to groups of specified data , Each group has a number , Number from 1 Start , For each line ,ntile Returns the number of the group to which this row belongs
``````

3、 Summary of window functions ：
In fact, the logic of window function is more around , We can understand the window as grouping the data in the table , Sort, etc . To really understand HIVE Window function , It's still necessary to combine exercises . Now let's start HIVE Window function practice bar ！

## Two 、hive Window function exercise 28 Problem

The first set of exercises ：hive Simple window function over()
1、 Use over() Function for data statistics , Count the total number of each user and data in the table
Find the details of users and count the total number of users every day
Calculate all... From the first day to the present score Greater than 80 Total number of users per minute
Calculate that the score of each user to the current date is greater than 80 Days of
Test data

``````20191020,11111,85
20191020,22222,83
20191020,33333,86
20191021,11111,87
20191021,22222,65
20191021,33333,98
20191022,11111,67
20191022,22222,34
20191022,33333,88
20191023,11111,99
20191023,22222,33
``````

Create tables and import data ：

``````create table test_window
(logday string, #logday Time
userid string,
score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/home/xiaowangzi/hive_test_data/test_window.txt' into table test_window;
``````

Let's first look at the data in the table below ：

``````select * from test_window;
``````

test_window

1、 Use over() Function for data statistics , Count the total number of each user and data in the table

``````select *, count(userid) over() as total from test_window;
``````

Use here over() And select count(*) It has the same effect , The advantage is , When the total number needs to be calculated, there is no need to correlate again .

2、 Find the details of users and count the total number of users every day
have access to partition by Partition data by date column , Such as ：over(partition by logday)

``````select *,count()over(partition by logday)as day_total from test_window;
``````

For the number of users per day, you can use select logday, count(userid) from recommend.test_window group by logday, But when you want to get userid Information , The advantage of this usage is obvious .

3、 Calculate all... From the first day to the present score Greater than 80 Total number of users per minute
At this point, simple partitioning cannot meet the requirements , Need to put order by and Use in conjunction with window definitions .

``````select *,count()over(order by logday rows between unbounded preceding and current row)as total
from test_window
where score > 80;
``````

adopt over() Calculate the cumulative value by date .
In fact, I was wrong when I first calculated , I don't have to think about it , direct select *,count()over()as total from test_window where score > 80; Calculate like this , If calculated in this way, only all items greater than... In the table will be displayed 80 The number of people , If I want to see 20191021 Or look 20191022 The number of people can't be seen .

4、 Calculate that the score of each user to the current date is greater than 80 Days of

``````select *,
count()over(partition by userid order by logday rows between unbounded preceding and current row) as total
from test_window
where score > 80 order by logday, userid;
``````

The second set of exercises
There are the following data ： Field called ：

``````name、orderdate、cost
Jack,2017-01-01,10
Tony,2017-01-02,15
Jack,2017-02-03,23
Tony,2017-01-04,29
Jack,2017-01-05,46
Jack,2017-04-06,42
Tony,2017-01-07,50
Jack,2017-01-08,55
Mark,2017-04-08,62
Mart,2017-04-09,68
Meil,2017-05-10,12
Mart,2017-04-11,75
Meil,2017-06-12,80
Mart,2017-04-13,94
``````

Need to be seek
1、 Inquire about 2017-04 Total number of customers who bought

2、 Customer purchase details and monthly total amount

3、 The above scenario , take cost Add up by date

4、 Query the customer's last purchase time

5、 Before query 20% Purchase order information
One 、 Create tables and import data ：

– Build table

``````create table business(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ",";
``````

– Import data

``````load data local inpath "/business.txt" into table business;
``````

– Query table ：

``````select * from business;
``````

Demand analysis
One 、 Inquire about 2017-04 Total number of customers who bought

a、 First think of using aggregate functions count()

– Find out first 2017-04 How many records are there this month

``````select count(*) from business where substr(orderdate,1,7) = "2017-04";
``````

The results are as follows ：

b、 Now group by customer

``````select name,count(*) from business where substr(orderdate,1,7) = "2017-04" group by name;
``````

The results are as follows ：

The data was divided into three groups ：

Use over() function ：over Works only on aggregate functions ,count Count the above three groups respectively ,over Count the total number of groups ( There is one count Add up once )

``````select
name,count(*) over() total_num
from business where substr(orderdate,1,7) = "2017-04" group by name;
``````

The results are shown below ：

Two 、 Query customer purchase details and monthly total amount

a、 First select all details ：

``````select * from business;
``````

b、 Total amount ：( This is the sum of all the data , Because there is no grouping (group by), therefore over() For every piece of data )

``````select *, sum(cost) over() from business;
``````

c、 For April data , We need to calculate the total amount ,

Ideas ： Partition or grouping , But use group by date, Can only query date,(select date ,name group by date) Other fields cannot be queried

solve ： Use window functions , And partition the window function over(distribute by()) perhaps over(partition by())

``````select *,sum(cost) over(distribute by month(orderdate)) from business;
``````

The result is shown in the figure ：

3、 ... and 、 The above scenario , take cost Accumulate by time

a、 First sort by purchase time

``````select * from business sort by orderdate;
``````

The result is shown in the figure ：

– Parameters on

``````-- sort by orderdate： Sort by purchase date
-- UNBOUNDED PRECEDING： From the beginning
-- CURRENT ROW： To the current line
-- Calculate the total cost from start to current time
``````
``````select
*,
sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
``````

The results are shown in the following figure ：

row function ：

``````current row： Current row
n PRECEDING： forward n That's ok
n FOLLOWING： Back up n That's ok
UNBOUNDED： The starting point
UNBOUNDED PRECEDING： Start from the front
UNBOUNDED FOLLOWING： To the back end
LAG(col,n)： The first step forward n That's ok
LEAD(col,n)： The next n That's ok
``````

– Parameters on
– sort by orderdate： Sort by time
– 1 preceding： Before the current line 1 That's ok
– 1 following： The next line after the current line
– Calculate the values of three adjacent rows ( The first line calculates the current line + The next line ; The last line calculates the current line + The previous line )

``````select
*,
sum(cost) over(sort by orderdate rows between 1 preceding and 1 following)
``````

The results are as follows ：

demo2：

– Parameters, ：
– distribute by name： Partition by name
– sort by orderdate： Sort by time in each partition
– UNBOUNDED PRECEDING and current row： From the starting line to the current line
– Calculate the total cost of each person

``````select
*,
sum(cost) over(distribute by name
sort by orderdate
rows between UNBOUNDED PRECEDING and current row)
``````

The results are shown in the following figure ：

demo3：

– Parameters on ：
– sort by orderdate： Sort by time
– current row and unbounded following： Current line to end line

``````select
*,
sum(cost) over(sort by orderdate
rows between current row and unbounded following)
``````

The results are shown in the following figure ：

Four 、 Query the customer's last purchase time , And the time of next purchase ( E-commerce websites are often used to find the time before and after page Jump )

analysis ：lag(clo,n)： Returns the first... Of the current line n That's ok

– Parameters, ：
– distribute by name： Group by name
– sort by orderdate： Sort by time
– lag(orderdate,1)： Returns the current orderdate The line before the line
– lead(orderdate,1)： Returns the current orderdate The last line of the line

``````select *,
lag(orderdate,1) over(distribute by name sort by orderdate),
lead(orderdate,1) over(distribute by name sort by orderdate)
``````

The results are shown in the following figure ：

5、 ... and 、 Before query 20% Purchase order information

analysis ： It can be divided into five equal parts according to time , Then return to the first one

NTILE(n)： Divide the data into n Share

``````select *, ntile(5) over(sort by orderdate) from business;
``````

The results are shown in the following figure ：

– The following statement reports an error , because ntile、sum、agg Equal functions cannot be placed in where Later as query criteria

``````select
*, ntile(5) over(sort by orderdate) as sorted
where sorted = 1;
``````

– The following statement reports an error , because having Must be with the group by Statement behind

``````select
*, ntile(5) over(sort by orderdate) as sorted
having sorted = 1;
``````

– So sub query is used , Put the result of the previous query in the clause

``````select
name,orderdate,cost
from (
select *,ntile(5) over(order by orderdate) sorted from business
) t
where sorted = 1;
``````

– Tips： Subquery cannot be used select *

The third set of exercises
Student achievement ranking of each subject ( Whether to rank side by side 、 There are three implementations of vacancy ranking )
Ranking of each subject top n Of the students
Raw data ( Student achievement information )

``````name subject score
The Monkey King Chinese language and literature 87
The Monkey King mathematics 95
The Monkey King English 68
The sea Chinese language and literature 94
The sea mathematics 56
The sea English 84
Song song Chinese language and literature 64
Song song mathematics 86
Song song English 84
tingting Chinese language and literature 65
tingting mathematics 85
tingting English 78
``````

``````create table score
(
name string,
subject string,
score int
) row format delimited fields terminated by "\t";
``````

``````load data local inpath '/home/fengGG/hive_test_data/score.txt' into table score;
``````

View the data

``````select * from score;
``````

1、 Student achievement ranking of each subject ( Whether to rank side by side 、 There are three implementations of vacancy ranking )

``````select *,
row_number()over(partition by subject order by score desc),
rank()over(partition by subject order by score desc),
dense_rank()over(partition by subject order by score desc)
from score;
``````

2、 Ranking of each subject top n Of the students

``````select *
from(
select
*,row_number() over(partition by subject order by score desc) rmp
from score
) t
where t.rmp<=3;
``````