返回两个表中满足关联条件的记录。
SELECT *
FROM t1
INNER JOIN t2
ON t1.col1 = t2.col2;
返回左表中的所有记录,以及右表中满足关联条件的记录。
SELECT *
FROM t1
LEFT JOIN t2
ON t1.col1 = t2.col2;
返回右表中的所有记录,以及左表中满足关联条件的记录。
SELECT *
FROM t1
RIGHT JOIN t2
ON t1.col1 = t2.col2;
返回左表和右表中的所有记录。
SELECT *
FROM t1
FULL OUTER JOIN t2
ON t1.col1 = t2.col2;
--添加字段
alter table app.table_name add columns(bu_name STRING COMMENT "事业部名称") CASCADE;
--修改字段类型(修改为double)
Alter table tmp.tmp_zp_tablename column columnname columnname double;
--调整列位置
alter table app.table_name change bu_name bu_name STRING after col_a;
注意不要直接对有数据的表进行字段顺序调整,会导致历史分区数据错误。
--删除分区
alter table tmp.tmp_zp_tablename drop if exists partition(dt='2020-10-24');
sort_array(Array) 只有一个参数
根据自然顺序按升序对输入数组进行排序
SELECT sort_array(array(5, 2, 8, 1, 7)) AS sorted_array;
使用中常和collect函数使用 sort_array(collect_set())
concat_ws(separator, string1, string2, …)
用于将多个字符串连接在一起,中间使用指定的分隔符进行分隔。
SELECT concat_ws(',', 'Hello', 'World') AS result;
常和数组集合函数使用,collect_set collect_list 将数据内容转为字符串
concat_ws(‘,’,collect_set(col) )
collect_set函数可以将指定字段的所有不重复的值,以Set的形式返回。Set是一种无序且不包含重复元素的数据结构。
collect_list函数可以将指定字段的所有值,以List的形式返回。List是一种有序且允许重复元素的数据结构。
SELECT collect_set(name) FROM student;
注意
collect_set和collect_list函数只能应用于对一个字段进行聚合操作,不能对多个字段同时聚合。
collect_set和collect_list函数的性能较差,当数据量较大时,可能会影响查询性能。
collect_set和collect_list函数都是在Reducer阶段进行聚合操作,因此在分布式环境下,需要确保数据被正确分组。
length(string A) Returns the length of the string.
size(Map<K.V>) Returns the number of elements in the map type.
size(Array) Returns the number of elements in the array type.
TRUNC(number,num_digits)Number需要截尾取整的数字。Num_digits用于指定取整精度的数字,默认值为0。TRUNC()函数截取时不进行四舍五入。
select trunc(123.458) from dual --123
select trunc(123.458,0) from dual --123
select trunc(123.458,1) from dual --123.4
select trunc(123.458,-1) from dual --120
select trunc(123.458,-4) from dual --0
select trunc(123.458,4) from dual --123.458
select trunc(123) from dual --123
select trunc(123,1) from dual --123
select trunc(123,-1) from dual --120
查询每个顾客上次的购买时间
select *,lag(orderdate) over(partition by name order by orderdate) from business;
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | lag_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | NULL |
| jack | 2017-01-05 | 46 | 2017-01-01 |
| jack | 2017-01-08 | 55 | 2017-01-05 |
| jack | 2017-02-03 | 23 | 2017-01-08 |
| jack | 2017-04-06 | 42 | 2017-02-03 |
| mart | 2017-04-08 | 62 | NULL |
| mart | 2017-04-09 | 68 | 2017-04-08 |
| mart | 2017-04-11 | 75 | 2017-04-09 |
| mart | 2017-04-13 | 94 | 2017-04-11 |
| neil | 2017-05-10 | 12 | NULL |
| neil | 2017-06-12 | 80 | 2017-05-10 |
| tony | 2017-01-02 | 15 | NULL |
| tony | 2017-01-04 | 29 | 2017-01-02 |
| tony | 2017-01-07 | 50 | 2017-01-04 |
+----------------+---------------------+----------------+---------------+--+
select *,lag(orderdate,1,"1970-01-01") over(partition by name order by orderdate) from business;
--lag
--lag(col,n,DEFAULT) 第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
--与LAG相反
--LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | lag_window_0 |
+----------------+---------------------+----------------+---------------+--+
| jack | 2017-01-01 | 10 | 1970-01-01 |
| jack | 2017-01-05 | 46 | 2017-01-01 |
| jack | 2017-01-08 | 55 | 2017-01-05 |
| jack | 2017-02-03 | 23 | 2017-01-08 |
| jack | 2017-04-06 | 42 | 2017-02-03 |
| mart | 2017-04-08 | 62 | 1970-01-01 |
| mart | 2017-04-09 | 68 | 2017-04-08 |
| mart | 2017-04-11 | 75 | 2017-04-09 |
| mart | 2017-04-13 | 94 | 2017-04-11 |
| neil | 2017-05-10 | 12 | 1970-01-01 |
| neil | 2017-06-12 | 80 | 2017-05-10 |
| tony | 2017-01-02 | 15 | 1970-01-01 |
| tony | 2017-01-04 | 29 | 2017-01-02 |
| tony | 2017-01-07 | 50 | 2017-01-04 |
+----------------+---------------------+----------------+---------------+--+
用于将分组数据按照顺序切分成n片(不是严格等分),返回当前记录所在的切片值。
--查询前20%时间的订单信息
select *,ntile(5) tgroup over(order by orderdate) from business;
+----------------+---------------------+----------------+-----------------+--+
| business.name | business.orderdate | business.cost | ntile_window_0 |
+----------------+---------------------+----------------+-----------------+--+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-02 | 15 | 1 |
| tony | 2017-01-04 | 29 | 1 |
| jack | 2017-01-05 | 46 | 2 |
| tony | 2017-01-07 | 50 | 2 |
| jack | 2017-01-08 | 55 | 2 |
| jack | 2017-02-03 | 23 | 3 |
| jack | 2017-04-06 | 42 | 3 |
| mart | 2017-04-08 | 62 | 3 |
| mart | 2017-04-09 | 68 | 4 |
| mart | 2017-04-11 | 75 | 4 |
| mart | 2017-04-13 | 94 | 4 |
| neil | 2017-05-10 | 12 | 5 |
| neil | 2017-06-12 | 80 | 5 |
+----------------+---------------------+----------------+-----------------+--+
select * from (select *,ntile(5) tgroup over(order by orderdate) from business) t1 where t1.tgroup=1;
分组内当前行的RANK值-1/分组内总行数-1
select *,percent_rank() over(order by orderdate) pr from business;
+----------------+---------------------+----------------+----------------------+--+
| business.name | business.orderdate | business.cost | pr |
+----------------+---------------------+----------------+----------------------+--+
| jack | 2017-01-01 | 10 | 0.0 |
| tony | 2017-01-02 | 15 | 0.07692307692307693 |
| tony | 2017-01-04 | 29 | 0.15384615384615385 |
| jack | 2017-01-05 | 46 | 0.23076923076923078 |
| tony | 2017-01-07 | 50 | 0.3076923076923077 |
| jack | 2017-01-08 | 55 | 0.38461538461538464 |
| jack | 2017-02-03 | 23 | 0.46153846153846156 |
| jack | 2017-04-06 | 42 | 0.5384615384615384 |
| mart | 2017-04-08 | 62 | 0.6153846153846154 |
| mart | 2017-04-09 | 68 | 0.6923076923076923 |
| mart | 2017-04-11 | 75 | 0.7692307692307693 |
| mart | 2017-04-13 | 94 | 0.8461538461538461 |
| neil | 2017-05-10 | 12 | 0.9230769230769231 |
| neil | 2017-06-12 | 80 | 1.0 |
+----------------+---------------------+----------------+----------------------+--+
示例表:
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| jack | 2017-01-01 | 10 |
| tony | 2017-01-02 | 15 |
| jack | 2017-02-03 | 23 |
| tony | 2017-01-04 | 29 |
| jack | 2017-01-05 | 46 |
| jack | 2017-04-06 | 42 |
| tony | 2017-01-07 | 50 |
| jack | 2017-01-08 | 55 |
| mart | 2017-04-08 | 62 |
| mart | 2017-04-09 | 68 |
| neil | 2017-05-10 | 12 |
| mart | 2017-04-11 | 75 |
| neil | 2017-06-12 | 80 |
| mart | 2017-04-13 | 94 |
+----------------+---------------------+----------------+--+```
```sql
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
from business;
其中sample3和sample4是一样的,都是按name分组,组内数据累加。上面总共开了7个窗口函数,select执行完了之后(select不需要执行MapReduce程序),每多一个窗口,就多一个MapReduce执行函数,但是这个前提是窗口开的不一样,只有窗口开的不一样才有额外的MapReduce,sample3~sample7的窗口都是一样的,只不过他们各自加的行的范围不一样而已,所以窗口都是一个窗口。
排序函数有rank()、dense_rank()、row_number(),下面对比差异。
给定下表:
+-------------+----------------+--------------+--+
| score.name | score.subject | score.score |
+-------------+----------------+--------------+--+
| 孙悟空 | 语文 | 87 |
| 孙悟空 | 数学 | 95 |
| 孙悟空 | 英语 | 68 |
| 大海 | 语文 | 94 |
| 大海 | 数学 | 56 |
| 大海 | 英语 | 84 |
| 宋宋 | 语文 | 64 |
| 宋宋 | 数学 | 86 |
| 宋宋 | 英语 | 84 |
| 婷婷 | 语文 | 65 |
| 婷婷 | 数学 | 85 |
| 婷婷 | 英语 | 78 |
+-------------+----------------+--------------+--+
select *,rank() over(partition by subject order by score desc) r,
dense_rank() over(partition by subject order by score desc) dr,
row_number() over(partition by subject order by score desc) rr
from score;
+-------------+----------------+--------------+----+-----+-----+--+
| score.name | score.subject | score.score | r | dr | rr |
+-------------+----------------+--------------+----+-----+-----+--+
| 孙悟空 | 数学 | 95 | 1 | 1 | 1 |
| 宋宋 | 数学 | 86 | 2 | 2 | 2 |
| 婷婷 | 数学 | 85 | 3 | 3 | 3 |
| 大海 | 数学 | 56 | 4 | 4 | 4 |
| 宋宋 | 英语 | 84 | 1 | 1 | 1 |
| 大海 | 英语 | 84 | 1 | 1 | 2 |
| 婷婷 | 英语 | 78 | 3 | 2 | 3 |
| 孙悟空 | 英语 | 68 | 4 | 3 | 4 |
| 大海 | 语文 | 94 | 1 | 1 | 1 |
| 孙悟空 | 语文 | 87 | 2 | 2 | 2 |
| 婷婷 | 语文 | 65 | 3 | 3 | 3 |
| 宋宋 | 语文 | 64 | 4 | 4 | 4 |
+-------------+----------------+--------------+----+-----+-----+--+
注:排序还可以用累加至当前行实现,效果和row_number()相同
count(1) over(partition by subject order by score desc rows between unbounded preceding and current row) as rank
MONTHS_BETWEEN (date1, date2)用于计算date1和date2之间有几个月。如果date1在日历中比date2晚,那么MONTHS_BETWEEN()就返回一个正数。如果date1在日历中比date2早,那么MONTHS_BETWEEN()就返回一个负数。如果date1和date2日期一样,那MONTHS_BETWEEN()就返回一个0。
hive> select months_between('2020-10-21','2020-08-20');
OK
2.03225806
Time taken: 0.995 seconds, Fetched: 1 row(s)
hive> select months_between('2020-08-20','2020-10-21');
OK
-2.03225806
Time taken: 0.076 seconds, Fetched: 1 row(s)
hive> select months_between('2020-08-20','2020-08-20');
OK
0.0
Time taken: 0.056 seconds, Fetched: 1 row(s)
# 行专列/列转行
https://zhuanlan.zhihu.com/p/115913870
https://blog.csdn.net/jiantianming2/article/details/79189672
## Hive Map Reduce个数如何设置? 来自面试官的10大连环拷问
https://zhuanlan.zhihu.com/p/270002498