After reading this article, you will master the use of hive over window function with exercises

Big data learning monk 2022-02-13 07:15:51 阅读数:393

reading article master use hive

Preface : We are learning hive Window function , Be sure to understand the structure of the window function first . Instead of directly Baidu sum() over()、row_number() over()、 perhaps count() over() Usage of , If you do , You will never master the core of window function , Of course, I did the same thing when I first started .

Fortunately, I am more tenacious , stay HIVE I've been tossing about window function for half a month 、 After reading a lot of articles, I know over() Is the window function , and sum、row_number、count Only with over() Collocation analysis function , Of course, there are other functions besides these three functions .

Individual to over() Window understanding : This is always a line corresponding to a window , As for the scope of this window, it depends on over() What are the constraints on the window range in the function (partition by order by between … and) adopt partition by Keyword to group windows , Special attention : adopt order by Come on order by Open a window for the row sorted by the field , Just note that the window size of the first row of data is 1, The window range of the second row of data is the first 2 That's ok , The first n The window range of the row is the front n That's ok , And so on . If there are no conditions , Then each row corresponds to the whole table .

Special window functions such as rank(),rownumber(),dense() etc. , Even behind over() There are no conditions , The default window is similar to order by effect , That is, the window size of the first line is 1, The window size of the second line is 2, And so on , But the data just don't make any statistical sense , So it's usually in over() Add in partiton by and order by( grouping , Sort ) etc. , Give meaning to it Such as ranking .

over(partition by ) and ordinary group by The difference between , Why is it different group by, Because there is group by, Can only select group by Later fields , And some aggregate functions sum(),avg(),max(),min() etc. , And I used it over(partition by), Can also select Other non partition by Field Or you can just “select *”, And for join Wait for better support .

One 、hive Window function syntax

1、over() Syntax structure of window function
3.over(partition by name) Each line is based on name To distinguish windows
4.over(partition by name order by id) Each line is based on name To distinguish windows , According to order by Take the specific scope
2、 Often with over() Analysis functions used together :
3、 Summary of window functions :

Two 、hive Window function exercise 28 Problem

The first set of exercises :hive Simple window function over()
1、 Use over() Function for data statistics , Count the total number of each user and data in the table
Test data
Create tables and import data :
Let's first look at the data in the table below :
The second set of exercises
Need to be seek
Demand analysis
The third set of exercises

One 、hive Window function syntax

In the preface, we have said avg()、sum()、max()、min() Is the analysis function , and over() Is the window function , Let's take a look at over() Syntax structure of window function 、 And often with over() Analysis functions used together

over() Syntax structure of window function
Often with over() Analysis functions used together
Summary of window functions

1、over() Syntax structure of window function
Analysis function over(partition by Name order by Name rows between Starting position and End position )
over() The function includes three functions : Including partitions partition by Name 、 Sort order by Name 、 Specify the window range rows between Starting position and End position . We are using over() Window function ,over() The three functions in the function can be used in combination or not .

over() If these three functions are not used in the function , The window size is for all data generated by the query , If a partition is specified , The window size is the data for each partition .

1.over() By default, the window of each row is all rows

select *,count(1) over() from business;

 Insert picture description here
2.over(order by orderdate)

orderdate=1 The window has only one line ,orderdate=2 Your windows include orderdate=2017-01-01,orderdate=2017-01-02

select *,count(1) over(order by orderdate) from business;

 Insert picture description here
3.over(partition by name) Each line is based on name To distinguish windows

select *,sum(cost) over(partition by name) from business;

 Insert picture description here
4.over(partition by name order by id) Each line is based on name To distinguish windows , According to order by Take the specific scope

select *,sum(cost) over(partition by name order by orderdate) from business;

 Insert picture description here
1.1、over() Explain the three functions in the function
order by
order by It means sort , Is... In this window

A、partition by
partition by Can be understood as group by grouping .over(partition by Name ) When collocation analysis function , The analysis function is calculated according to each group of data .

B、rows between Starting position and End position
Is to specify the window range , For example, from the first line to the current line . And this range changes with the data .over(rows between Starting position and End position ) When collocation analysis function , The analysis function calculates according to this range .

Window range description :
The window range we often use is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW( Indicates from the starting point to the current line ), This window is often used to calculate the accumulation .

PRECEDING: forward
FOLLOWING: Back up
CURRENT ROW: Current row
UNBOUNDED: The starting point ( General combination PRECEDING,FOLLOWING Use )
UNBOUNDED PRECEDING Represents the front row of the window ( The starting point )
UNBOUNDED FOLLOWING: Represents the last line of the window ( End )
for instance :
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW( Indicates from the starting point to the current line )
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING( It means forward 2 Go back 1 That's ok )
ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW( It means forward 2 Line to current line )
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING( Indicates that the current line reaches the end point )

2、 Often with over() Analysis functions used together :
2.1 Aggregate class

avg()sum()max()min()

2.2 Ranking class

row_number() A self incrementing number is generated when sorting by value , No repetition ( Such as :123456)
rank() A self incrementing number is generated when sorting by value , Repeat when values are equal , There will be vacancies ( Such as :123336)
dense_rank() A self incrementing number is generated when sorting by value , Repeat when values are equal , There is no vacancy ( Such as :123334

2.3 Other categories

lag( Name , The number of lines ahead ,[ The number of rows is null The default value of , Not specified as null]), You can calculate the last purchase time of the user , Or the user's next purchase time . Or last login time and next login time
lead( Name , Next rows ,[ The number of rows is null The default value of , Not specified as null])
ntile(n) Distribute rows in ordered partitions to groups of specified data , Each group has a number , Number from 1 Start , For each line ,ntile Returns the number of the group to which this row belongs

3、 Summary of window functions :
In fact, the logic of window function is more around , We can understand the window as grouping the data in the table , Sort, etc . To really understand HIVE Window function , It's still necessary to combine exercises . Now let's start HIVE Window function practice bar !

Two 、hive Window function exercise 28 Problem

The first set of exercises :hive Simple window function over()
1、 Use over() Function for data statistics , Count the total number of each user and data in the table
Find the details of users and count the total number of users every day
Calculate all... From the first day to the present score Greater than 80 Total number of users per minute
Calculate that the score of each user to the current date is greater than 80 Days of
Test data

20191020,11111,85
20191020,22222,83
20191020,33333,86
20191021,11111,87
20191021,22222,65
20191021,33333,98
20191022,11111,67
20191022,22222,34
20191022,33333,88
20191023,11111,99
20191023,22222,33

Create tables and import data :

create table test_window
(logday string, #logday Time 
userid string,
score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
# Load data 
load data local inpath '/home/xiaowangzi/hive_test_data/test_window.txt' into table test_window;

Let's first look at the data in the table below :

select * from test_window;

 Insert picture description here
test_window

1、 Use over() Function for data statistics , Count the total number of each user and data in the table

select *, count(userid) over() as total from test_window;

 Insert picture description here
Use here over() And select count(*) It has the same effect , The advantage is , When the total number needs to be calculated, there is no need to correlate again .

2、 Find the details of users and count the total number of users every day
have access to partition by Partition data by date column , Such as :over(partition by logday)

select *,count()over(partition by logday)as day_total from test_window;

 Insert picture description here
For the number of users per day, you can use select logday, count(userid) from recommend.test_window group by logday, But when you want to get userid Information , The advantage of this usage is obvious .

3、 Calculate all... From the first day to the present score Greater than 80 Total number of users per minute
At this point, simple partitioning cannot meet the requirements , Need to put order by and Use in conjunction with window definitions .

select *,count()over(order by logday rows between unbounded preceding and current row)as total
from test_window
where score > 80;

 Insert picture description here
adopt over() Calculate the cumulative value by date .
In fact, I was wrong when I first calculated , I don't have to think about it , direct select *,count()over()as total from test_window where score > 80; Calculate like this , If calculated in this way, only all items greater than... In the table will be displayed 80 The number of people , If I want to see 20191021 Or look 20191022 The number of people can't be seen .

4、 Calculate that the score of each user to the current date is greater than 80 Days of

select *,
count()over(partition by userid order by logday rows between unbounded preceding and current row) as total
from test_window
where score > 80 order by logday, userid;

 Insert picture description here
The second set of exercises
There are the following data : Field called :

name、orderdate、cost
Jack,2017-01-01,10
Tony,2017-01-02,15
Jack,2017-02-03,23
Tony,2017-01-04,29
Jack,2017-01-05,46
Jack,2017-04-06,42
Tony,2017-01-07,50
Jack,2017-01-08,55
Mark,2017-04-08,62
Mart,2017-04-09,68
Meil,2017-05-10,12
Mart,2017-04-11,75
Meil,2017-06-12,80
Mart,2017-04-13,94

Need to be seek
1、 Inquire about 2017-04 Total number of customers who bought

2、 Customer purchase details and monthly total amount

3、 The above scenario , take cost Add up by date

4、 Query the customer's last purchase time

5、 Before query 20% Purchase order information
One 、 Create tables and import data :

– Build table

create table business(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ",";

– Import data

load data local inpath "/business.txt" into table business;

– Query table :

select * from business;

 Insert picture description here
Demand analysis
One 、 Inquire about 2017-04 Total number of customers who bought

a、 First think of using aggregate functions count()

– Find out first 2017-04 How many records are there this month

select count(*) from business where substr(orderdate,1,7) = "2017-04";

The results are as follows :
 Insert picture description here
b、 Now group by customer

select name,count(*) from business where substr(orderdate,1,7) = "2017-04" group by name;

The results are as follows :
 Insert picture description here
The data was divided into three groups :
 Insert picture description here
Use over() function :over Works only on aggregate functions ,count Count the above three groups respectively ,over Count the total number of groups ( There is one count Add up once )

select
name,count(*) over() total_num
from business where substr(orderdate,1,7) = "2017-04" group by name;

The results are shown below :
 Insert picture description here
Two 、 Query customer purchase details and monthly total amount

a、 First select all details :

select * from business;

 Insert picture description here
b、 Total amount :( This is the sum of all the data , Because there is no grouping (group by), therefore over() For every piece of data )

select *, sum(cost) over() from business;

 Insert picture description here
c、 For April data , We need to calculate the total amount ,

Ideas : Partition or grouping , But use group by date, Can only query date,(select date ,name group by date) Other fields cannot be queried

solve : Use window functions , And partition the window function over(distribute by()) perhaps over(partition by())

select *,sum(cost) over(distribute by month(orderdate)) from business;

The result is shown in the figure :
 Insert picture description here
3、 ... and 、 The above scenario , take cost Accumulate by time

a、 First sort by purchase time

select * from business sort by orderdate;

The result is shown in the figure :

 Insert picture description here
– Parameters on

-- sort by orderdate: Sort by purchase date
-- UNBOUNDED PRECEDING: From the beginning
-- CURRENT ROW: To the current line
-- Calculate the total cost from start to current time
select
*,
sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;

The results are shown in the following figure :

 Insert picture description here
row function :

current row: Current row
n PRECEDING: forward n That's ok
n FOLLOWING: Back up n That's ok
UNBOUNDED: The starting point
UNBOUNDED PRECEDING: Start from the front
UNBOUNDED FOLLOWING: To the back end
LAG(col,n): The first step forward n That's ok
LEAD(col,n): The next n That's ok

– Parameters on
– sort by orderdate: Sort by time
– 1 preceding: Before the current line 1 That's ok
– 1 following: The next line after the current line
– Calculate the values of three adjacent rows ( The first line calculates the current line + The next line ; The last line calculates the current line + The previous line )

select
*,
sum(cost) over(sort by orderdate rows between 1 preceding and 1 following)
from business;

The results are as follows :
 Insert picture description here
demo2:

– Parameters, :
– distribute by name: Partition by name
– sort by orderdate: Sort by time in each partition
– UNBOUNDED PRECEDING and current row: From the starting line to the current line
– Calculate the total cost of each person

select
*,
sum(cost) over(distribute by name
sort by orderdate
rows between UNBOUNDED PRECEDING and current row)
from business;

The results are shown in the following figure :

 Insert picture description here
demo3:

– Parameters on :
– sort by orderdate: Sort by time
– current row and unbounded following: Current line to end line

select
*,
sum(cost) over(sort by orderdate
rows between current row and unbounded following)
from business;

The results are shown in the following figure :
 Insert picture description here
Four 、 Query the customer's last purchase time , And the time of next purchase ( E-commerce websites are often used to find the time before and after page Jump )

analysis :lag(clo,n): Returns the first... Of the current line n That's ok

– Parameters, :
– distribute by name: Group by name
– sort by orderdate: Sort by time
– lag(orderdate,1): Returns the current orderdate The line before the line
– lead(orderdate,1): Returns the current orderdate The last line of the line

select *,
lag(orderdate,1) over(distribute by name sort by orderdate),
lead(orderdate,1) over(distribute by name sort by orderdate)
from business;

The results are shown in the following figure :

 Insert picture description here
5、 ... and 、 Before query 20% Purchase order information

analysis : It can be divided into five equal parts according to time , Then return to the first one

NTILE(n): Divide the data into n Share

select *, ntile(5) over(sort by orderdate) from business;

The results are shown in the following figure :
 Insert picture description here
– The following statement reports an error , because ntile、sum、agg Equal functions cannot be placed in where Later as query criteria

select
*, ntile(5) over(sort by orderdate) as sorted
from business
where sorted = 1;

– The following statement reports an error , because having Must be with the group by Statement behind

select
*, ntile(5) over(sort by orderdate) as sorted
from business
having sorted = 1;

– So sub query is used , Put the result of the previous query in the clause

select
name,orderdate,cost
from (
select *,ntile(5) over(order by orderdate) sorted from business
) t
where sorted = 1;

– Tips: Subquery cannot be used select *
 Insert picture description here
The third set of exercises
Student achievement ranking of each subject ( Whether to rank side by side 、 There are three implementations of vacancy ranking )
Ranking of each subject top n Of the students
Raw data ( Student achievement information )

name subject score
The Monkey King Chinese language and literature 87
The Monkey King mathematics 95
The Monkey King English 68
The sea Chinese language and literature 94
The sea mathematics 56
The sea English 84
Song song Chinese language and literature 64
Song song mathematics 86
Song song English 84
tingting Chinese language and literature 65
tingting mathematics 85
tingting English 78

Create tables and load data

create table score
(
name string,
subject string,
score int
) row format delimited fields terminated by "\t";

# Load data

load data local inpath '/home/fengGG/hive_test_data/score.txt' into table score;

View the data

select * from score;

 Insert picture description here
1、 Student achievement ranking of each subject ( Whether to rank side by side 、 There are three implementations of vacancy ranking )

select *,
row_number()over(partition by subject order by score desc),
rank()over(partition by subject order by score desc),
dense_rank()over(partition by subject order by score desc)
from score;

 Insert picture description here
2、 Ranking of each subject top n Of the students

select *
from(
select
*,row_number() over(partition by subject order by score desc) rmp
from score
) t
where t.rmp<=3;

 Insert picture description here

copyright:author[Big data learning monk],Please bring the original link to reprint, thank you. https://en.javamana.com/2022/02/202202130715465984.html