HIVE笔记

发布时间:2023年12月28日

表关联

内连接(INNER JOIN)

返回两个表中满足关联条件的记录。

SELECT * 
FROM t1 
INNER JOIN t2 
ON t1.col1 = t2.col2;

左连接(LEFT JOIN)

返回左表中的所有记录,以及右表中满足关联条件的记录。

SELECT * 
FROM t1 
LEFT JOIN t2 
ON t1.col1 = t2.col2;

右连接(RIGHT JOIN)

返回右表中的所有记录,以及左表中满足关联条件的记录。

SELECT * 
FROM t1 
RIGHT JOIN t2 
ON t1.col1 = t2.col2;

全连接(FULL OUTER JOIN)

返回左表和右表中的所有记录。

hive full join多表多关联键联合查询

SELECT * 
FROM t1 
FULL OUTER JOIN t2 
ON t1.col1 = t2.col2;

DDL

字段操作

--添加字段
alter table app.table_name add columns(bu_name STRING COMMENT "事业部名称") CASCADE;
--修改字段类型(修改为double)
Alter table tmp.tmp_zp_tablename column columnname  columnname  double;
--调整列位置
alter table app.table_name change bu_name bu_name STRING after col_a;

注意不要直接对有数据的表进行字段顺序调整,会导致历史分区数据错误。

分区操作

--删除分区
alter table tmp.tmp_zp_tablename drop if exists partition(dt='2020-10-24');

常用函数

官网文档

sort_array

sort_array(Array) 只有一个参数
根据自然顺序按升序对输入数组进行排序

SELECT sort_array(array(5, 2, 8, 1, 7)) AS sorted_array;

使用中常和collect函数使用 sort_array(collect_set())

concat_ws

concat_ws(separator, string1, string2, …)
用于将多个字符串连接在一起,中间使用指定的分隔符进行分隔。

SELECT concat_ws(',', 'Hello', 'World') AS result;

常和数组集合函数使用,collect_set collect_list 将数据内容转为字符串
concat_ws(‘,’,collect_set(col) )

collect_set collect_list

collect_set函数可以将指定字段的所有不重复的值,以Set的形式返回。Set是一种无序且不包含重复元素的数据结构。
collect_list函数可以将指定字段的所有值,以List的形式返回。List是一种有序且允许重复元素的数据结构。

SELECT collect_set(name) FROM student;

注意
collect_set和collect_list函数只能应用于对一个字段进行聚合操作,不能对多个字段同时聚合。
collect_set和collect_list函数的性能较差,当数据量较大时,可能会影响查询性能。
collect_set和collect_list函数都是在Reducer阶段进行聚合操作,因此在分布式环境下,需要确保数据被正确分组。

length size

length(string A) Returns the length of the string.
size(Map<K.V>) Returns the number of elements in the map type.
size(Array) Returns the number of elements in the array type.

TRUNC

TRUNC(number,num_digits)Number需要截尾取整的数字。Num_digits用于指定取整精度的数字,默认值为0。TRUNC()函数截取时不进行四舍五入。

select trunc(123.458) from dual --123
select trunc(123.458,0) from dual --123
select trunc(123.458,1) from dual --123.4
select trunc(123.458,-1) from dual --120
select trunc(123.458,-4) from dual --0
select trunc(123.458,4) from dual --123.458
select trunc(123) from dual --123
select trunc(123,1) from dual --123
select trunc(123,-1) from dual --120

lag/lead

查询每个顾客上次的购买时间

select *,lag(orderdate) over(partition by name order by orderdate) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | NULL          |
| jack           | 2017-01-05          | 46             | 2017-01-01    |
| jack           | 2017-01-08          | 55             | 2017-01-05    |
| jack           | 2017-02-03          | 23             | 2017-01-08    |
| jack           | 2017-04-06          | 42             | 2017-02-03    |
| mart           | 2017-04-08          | 62             | NULL          |
| mart           | 2017-04-09          | 68             | 2017-04-08    |
| mart           | 2017-04-11          | 75             | 2017-04-09    |
| mart           | 2017-04-13          | 94             | 2017-04-11    |
| neil           | 2017-05-10          | 12             | NULL          |
| neil           | 2017-06-12          | 80             | 2017-05-10    |
| tony           | 2017-01-02          | 15             | NULL          |
| tony           | 2017-01-04          | 29             | 2017-01-02    |
| tony           | 2017-01-07          | 50             | 2017-01-04    |
+----------------+---------------------+----------------+---------------+--+
select *,lag(orderdate,1,"1970-01-01") over(partition by name order by orderdate) from business;
--lag
--lag(col,n,DEFAULT) 第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
--与LAG相反
--LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  |
+----------------+---------------------+----------------+---------------+--+
| jack           | 2017-01-01          | 10             | 1970-01-01    |
| jack           | 2017-01-05          | 46             | 2017-01-01    |
| jack           | 2017-01-08          | 55             | 2017-01-05    |
| jack           | 2017-02-03          | 23             | 2017-01-08    |
| jack           | 2017-04-06          | 42             | 2017-02-03    |
| mart           | 2017-04-08          | 62             | 1970-01-01    |
| mart           | 2017-04-09          | 68             | 2017-04-08    |
| mart           | 2017-04-11          | 75             | 2017-04-09    |
| mart           | 2017-04-13          | 94             | 2017-04-11    |
| neil           | 2017-05-10          | 12             | 1970-01-01    |
| neil           | 2017-06-12          | 80             | 2017-05-10    |
| tony           | 2017-01-02          | 15             | 1970-01-01    |
| tony           | 2017-01-04          | 29             | 2017-01-02    |
| tony           | 2017-01-07          | 50             | 2017-01-04    |
+----------------+---------------------+----------------+---------------+--+

ntile

用于将分组数据按照顺序切分成n片(不是严格等分),返回当前记录所在的切片值。
--查询前20%时间的订单信息
select *,ntile(5) tgroup over(order by orderdate) from business;
+----------------+---------------------+----------------+-----------------+--+
| business.name  | business.orderdate  | business.cost  | ntile_window_0  |
+----------------+---------------------+----------------+-----------------+--+
| jack           | 2017-01-01          | 10             | 1               |
| tony           | 2017-01-02          | 15             | 1               |
| tony           | 2017-01-04          | 29             | 1               |
| jack           | 2017-01-05          | 46             | 2               |
| tony           | 2017-01-07          | 50             | 2               |
| jack           | 2017-01-08          | 55             | 2               |
| jack           | 2017-02-03          | 23             | 3               |
| jack           | 2017-04-06          | 42             | 3               |
| mart           | 2017-04-08          | 62             | 3               |
| mart           | 2017-04-09          | 68             | 4               |
| mart           | 2017-04-11          | 75             | 4               |
| mart           | 2017-04-13          | 94             | 4               |
| neil           | 2017-05-10          | 12             | 5               |
| neil           | 2017-06-12          | 80             | 5               |
+----------------+---------------------+----------------+-----------------+--+

select * from (select *,ntile(5) tgroup over(order by orderdate) from business) t1 where t1.tgroup=1;

persent_rank

分组内当前行的RANK值-1/分组内总行数-1
select *,percent_rank() over(order by orderdate) pr from business;
+----------------+---------------------+----------------+----------------------+--+
| business.name  | business.orderdate  | business.cost  |          pr          |
+----------------+---------------------+----------------+----------------------+--+
| jack           | 2017-01-01          | 10             | 0.0                  |
| tony           | 2017-01-02          | 15             | 0.07692307692307693  |
| tony           | 2017-01-04          | 29             | 0.15384615384615385  |
| jack           | 2017-01-05          | 46             | 0.23076923076923078  |
| tony           | 2017-01-07          | 50             | 0.3076923076923077   |
| jack           | 2017-01-08          | 55             | 0.38461538461538464  |
| jack           | 2017-02-03          | 23             | 0.46153846153846156  |
| jack           | 2017-04-06          | 42             | 0.5384615384615384   |
| mart           | 2017-04-08          | 62             | 0.6153846153846154   |
| mart           | 2017-04-09          | 68             | 0.6923076923076923   |
| mart           | 2017-04-11          | 75             | 0.7692307692307693   |
| mart           | 2017-04-13          | 94             | 0.8461538461538461   |
| neil           | 2017-05-10          | 12             | 0.9230769230769231   |
| neil           | 2017-06-12          | 80             | 1.0                  |
+----------------+---------------------+----------------+----------------------+--+

开窗函数

示例表:

+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| jack           | 2017-01-01          | 10             |
| tony           | 2017-01-02          | 15             |
| jack           | 2017-02-03          | 23             |
| tony           | 2017-01-04          | 29             |
| jack           | 2017-01-05          | 46             |
| jack           | 2017-04-06          | 42             |
| tony           | 2017-01-07          | 50             |
| jack           | 2017-01-08          | 55             |
| mart           | 2017-04-08          | 62             |
| mart           | 2017-04-09          | 68             |
| neil           | 2017-05-10          | 12             |
| mart           | 2017-04-11          | 75             |
| neil           | 2017-06-12          | 80             |
| mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+--+```

```sql
select name,orderdate,cost, 
sum(cost) over() as sample1,--所有行相加 
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加 
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加 
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合 
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行 
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行 
from business;
其中sample3和sample4是一样的,都是按name分组,组内数据累加。上面总共开了7个窗口函数,select执行完了之后(select不需要执行MapReduce程序),每多一个窗口,就多一个MapReduce执行函数,但是这个前提是窗口开的不一样,只有窗口开的不一样才有额外的MapReduce,sample3~sample7的窗口都是一样的,只不过他们各自加的行的范围不一样而已,所以窗口都是一个窗口。

排序函数

排序函数有rank()、dense_rank()、row_number(),下面对比差异。

给定下表:

+-------------+----------------+--------------+--+
| score.name  | score.subject  | score.score  |
+-------------+----------------+--------------+--+
| 孙悟空         | 语文             | 87           |
| 孙悟空         | 数学             | 95           |
| 孙悟空         | 英语             | 68           |
| 大海          | 语文             | 94           |
| 大海          | 数学             | 56           |
| 大海          | 英语             | 84           |
| 宋宋          | 语文             | 64           |
| 宋宋          | 数学             | 86           |
| 宋宋          | 英语             | 84           |
| 婷婷          | 语文             | 65           |
| 婷婷          | 数学             | 85           |
| 婷婷          | 英语             | 78           |
+-------------+----------------+--------------+--+
select *,rank() over(partition by subject order by score desc) r, 
dense_rank() over(partition by subject order by score desc) dr,
row_number() over(partition by subject order by score desc) rr
from score;
+-------------+----------------+--------------+----+-----+-----+--+
| score.name  | score.subject  | score.score  | r  | dr  | rr  |
+-------------+----------------+--------------+----+-----+-----+--+
| 孙悟空         | 数学             | 95           | 1  | 1   | 1   |
| 宋宋          | 数学             | 86           | 2  | 2   | 2   |
| 婷婷          | 数学             | 85           | 3  | 3   | 3   |
| 大海          | 数学             | 56           | 4  | 4   | 4   |
| 宋宋          | 英语             | 84           | 1  | 1   | 1   |
| 大海          | 英语             | 84           | 1  | 1   | 2   |
| 婷婷          | 英语             | 78           | 3  | 2   | 3   |
| 孙悟空         | 英语             | 68           | 4  | 3   | 4   |
| 大海          | 语文             | 94           | 1  | 1   | 1   |
| 孙悟空         | 语文             | 87           | 2  | 2   | 2   |
| 婷婷          | 语文             | 65           | 3  | 3   | 3   |
| 宋宋          | 语文             | 64           | 4  | 4   | 4   |
+-------------+----------------+--------------+----+-----+-----+--+
注:排序还可以用累加至当前行实现,效果和row_number()相同
count(1) over(partition by subject order by score desc rows between unbounded preceding and current row) as rank

时间函数

months_between

	MONTHS_BETWEEN (date1, date2)用于计算date1和date2之间有几个月。如果date1在日历中比date2晚,那么MONTHS_BETWEEN()就返回一个正数。如果date1在日历中比date2早,那么MONTHS_BETWEEN()就返回一个负数。如果date1和date2日期一样,那MONTHS_BETWEEN()就返回一个0。
hive> select months_between('2020-10-21','2020-08-20');
OK
2.03225806
Time taken: 0.995 seconds, Fetched: 1 row(s)
hive> select months_between('2020-08-20','2020-10-21');
OK
-2.03225806
Time taken: 0.076 seconds, Fetched: 1 row(s)
hive> select months_between('2020-08-20','2020-08-20');
OK
0.0
Time taken: 0.056 seconds, Fetched: 1 row(s)




# 行专列/列转行
https://zhuanlan.zhihu.com/p/115913870
https://blog.csdn.net/jiantianming2/article/details/79189672


## Hive Map Reduce个数如何设置? 来自面试官的10大连环拷问
https://zhuanlan.zhihu.com/p/270002498
文章来源:https://blog.csdn.net/weixin_41229271/article/details/135249634
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。