SQL面试题挑战05:找出恶意购买用户

发布时间:2023年12月22日

问题:

下面是某电商网站的订单数据,包括order_id,user_id,order_status和operate_time四个字段,我们需要找出所有恶意购买的用户。恶意购买的用户定义是:同一个用户,在任意半小时内(含),取消订单次数>=2次的就被视为恶意买家。比如该样例数据中c用户就是恶意买家。

order_id    user_id    order_status     operate_time
1101         a         已支付        2023-01-01 10:00:00
1102         a         已取消        2023-01-01 10:10:00
1103         a         待支付        2023-01-01 10:20:00
1104         b         已取消        2023-01-01 10:30:00
1105         a         待确认        2023-01-01 10:50:00
1106         a         已取消        2023-01-01 11:00:00
1107         b         已取消        2023-01-01 11:40:00
1108         b         已取消        2023-01-01 11:50:00
1109         b         已支付        2023-01-01 12:00:00
1110         b         已取消        2023-01-01 12:11:00
1111         c         已取消        2023-01-01 12:20:00
1112         c         已取消        2023-01-01 12:30:00
1113         c         已取消        2023-01-01 12:55:00
1114         c         已取消        2023-01-01 13:00:00

SQL解答:

典型的滑动窗口的场景。Hive中也是有滑动窗口的功能的(按数据范围开窗,range between and )。针对这个例子,窗口大小就是半小时,然后按每条数据进行滑动,在窗口内判断该条数据对应的用户是否是恶意用户。

with temp  as (
select  1101 as order_id  ,'a'  as user_id  ,"已支付" as order_status,  "2023-01-01 10:00:00"  as operate_time
union all 
select  1102 as order_id  ,'a'  as user_id  ,"已取消" as order_status,  "2023-01-01 10:10:00"  as operate_time
union all 
select  1103 as order_id  ,'a'  as user_id  ,"待支付" as order_status,  "2023-01-01 10:20:00"  as operate_time
union all 
select  1104 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 10:30:00"  as operate_time
union all 
select  1105 as order_id  ,'a'  as user_id  ,"待确认" as order_status,  "2023-01-01 10:50:00"  as operate_time
union all 
select  1106 as order_id  ,'a'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:00:00"  as operate_time
union all 
select  1107 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:40:00"  as operate_time
union all 
select  1108 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 11:50:00"  as operate_time
union all 
select  1109 as order_id  ,'b'  as user_id  ,"已支付" as order_status,  "2023-01-01 12:00:00"  as operate_time
union all 
select  1110 as order_id  ,'b'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:11:00"  as operate_time
union all 
select  1111 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:20:00"  as operate_time
union all 
select  1112 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:30:00"  as operate_time
union all 
select  1113 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 12:55:00"  as operate_time
union all 
select  1114 as order_id  ,'c'  as user_id  ,"已取消" as order_status,  "2023-01-01 13:00:00"  as operate_time
)
---这里设定tmp表中放的就是上面提供的样例数据
select
distinct user_id
from
(
    select
    order_id
    ,user_id
    ,order_status
    --通过range between以当前行为锚点,圈定数据范围为operate_time为近30分钟内,然后算该范围内的取消订单数
    ,count(case when order_status='已取消' then order_id end) over(partition by user_id order by operate_time range between 1800 preceding and current row) as cancel_order_cnt
    from
    (
        select
        order_id
        ,user_id
        ,order_status
        ,unix_timestamp(operate_time) as operate_time  ---由于range...between只能整数比较,这里先转换为秒
        from temp
    )t1
)t1
where cancel_order_cnt>=2
;
----结果
user_id
b
c

函数补充:
关于range between…and…函数,表示以当前行为锚点,根据order by排序的字段和between…and给定的值得到窗口的上下界,从而圈定好窗口范围。比如count(distinct stu_id) over(partition by class_id order by score range between 30 and current row)就表示按class_id分组,按照score升序,以当前行为锚点,在[score-30,score]的分数范围内计算stu_id的数量。与rows不同的是,这个每一行的开窗的范围是固定的,但行数是不固定的。

文章来源:https://blog.csdn.net/weixin_43597208/article/details/135159644
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。