下面是某电商网站的订单数据,包括order_id,user_id,order_status和operate_time四个字段,我们需要找出所有恶意购买的用户。恶意购买的用户定义是:同一个用户,在任意半小时内(含),取消订单次数>=2次的就被视为恶意买家。比如该样例数据中c用户就是恶意买家。
order_id user_id order_status operate_time
1101 a 已支付 2023-01-01 10:00:00
1102 a 已取消 2023-01-01 10:10:00
1103 a 待支付 2023-01-01 10:20:00
1104 b 已取消 2023-01-01 10:30:00
1105 a 待确认 2023-01-01 10:50:00
1106 a 已取消 2023-01-01 11:00:00
1107 b 已取消 2023-01-01 11:40:00
1108 b 已取消 2023-01-01 11:50:00
1109 b 已支付 2023-01-01 12:00:00
1110 b 已取消 2023-01-01 12:11:00
1111 c 已取消 2023-01-01 12:20:00
1112 c 已取消 2023-01-01 12:30:00
1113 c 已取消 2023-01-01 12:55:00
1114 c 已取消 2023-01-01 13:00:00
典型的滑动窗口的场景。Hive中也是有滑动窗口的功能的(按数据范围开窗,range between and )。针对这个例子,窗口大小就是半小时,然后按每条数据进行滑动,在窗口内判断该条数据对应的用户是否是恶意用户。
with temp as (
select 1101 as order_id ,'a' as user_id ,"已支付" as order_status, "2023-01-01 10:00:00" as operate_time
union all
select 1102 as order_id ,'a' as user_id ,"已取消" as order_status, "2023-01-01 10:10:00" as operate_time
union all
select 1103 as order_id ,'a' as user_id ,"待支付" as order_status, "2023-01-01 10:20:00" as operate_time
union all
select 1104 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 10:30:00" as operate_time
union all
select 1105 as order_id ,'a' as user_id ,"待确认" as order_status, "2023-01-01 10:50:00" as operate_time
union all
select 1106 as order_id ,'a' as user_id ,"已取消" as order_status, "2023-01-01 11:00:00" as operate_time
union all
select 1107 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 11:40:00" as operate_time
union all
select 1108 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 11:50:00" as operate_time
union all
select 1109 as order_id ,'b' as user_id ,"已支付" as order_status, "2023-01-01 12:00:00" as operate_time
union all
select 1110 as order_id ,'b' as user_id ,"已取消" as order_status, "2023-01-01 12:11:00" as operate_time
union all
select 1111 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:20:00" as operate_time
union all
select 1112 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:30:00" as operate_time
union all
select 1113 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 12:55:00" as operate_time
union all
select 1114 as order_id ,'c' as user_id ,"已取消" as order_status, "2023-01-01 13:00:00" as operate_time
)
---这里设定tmp表中放的就是上面提供的样例数据
select
distinct user_id
from
(
select
order_id
,user_id
,order_status
--通过range between以当前行为锚点,圈定数据范围为operate_time为近30分钟内,然后算该范围内的取消订单数
,count(case when order_status='已取消' then order_id end) over(partition by user_id order by operate_time range between 1800 preceding and current row) as cancel_order_cnt
from
(
select
order_id
,user_id
,order_status
,unix_timestamp(operate_time) as operate_time ---由于range...between只能整数比较,这里先转换为秒
from temp
)t1
)t1
where cancel_order_cnt>=2
;
----结果
user_id
b
c
函数补充:
关于range between…and…函数,表示以当前行为锚点,根据order by排序的字段和between…and给定的值得到窗口的上下界,从而圈定好窗口范围。比如count(distinct stu_id) over(partition by class_id order by score range between 30 and current row)就表示按class_id分组,按照score升序,以当前行为锚点,在[score-30,score]的分数范围内计算stu_id的数量。与rows不同的是,这个每一行的开窗的范围是固定的,但行数是不固定的。