1、视频表
字段 | 备注 | 详细描述 |
---|---|---|
videoId | 视频唯一id(String) | 11位字符串 |
uploader | 视频上传者(String) | 上传视频的用户名String |
age | 视频年龄(int) | 视频在平台上的整天数 |
category | 视频类别(Array) | 上传视频指定的视频分类 |
length | 视频长度(Int) | 整形数字标识的视频长度 |
views | 观看次数(Int) | 视频被浏览的次数 |
rate | 视频评分(Double) | 满分5分 |
Ratings | 流量(Int) | 视频的流量,整形数字 |
comments | 评论数(Int) | 一个视频的整数评论数 |
relatedId | 相关视频id(Array) | 相关视频的id,最多20个 |
2、用户表
字段 | 备注 | 字段类型 |
---|---|---|
uploader | 上传者用户名 | string |
videos | 上传视频数 | int |
friends | 朋友数量 | int |
1、需要准备的表
1)创建原始数据表:gulivideo_ori,gulivideo_user_ori,
2)创建最终表:gulivideo_orc,gulivideo_user_orc
2、创建原始数据表
1)创建原始数据表gulivideo_ori
create external table gulivideo_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>
)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile
location '/gulivideo/video';
2)创建原始数据表:gulivideo_user_ori
create external table gulivideo_user_ori(
uploader string,
videos int,
friends int
)
row format delimited
fields terminated by "\t"
stored as textfile
location '/gulivideo/user';
3)创建orc存储格式带snappy压缩的表gulivideo_orc
create table gulivideo_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>
)
stored as orc
tblproperties("orc.compress"="SNAPPY");
4)创建orc存储格式带snappy压缩的表gulivideo_user_orc
create table gulivideo_user_orc(
uploader string,
videos int,
friends int
)
row format delimited
fields terminated by "\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
5)向ori表插入数据
load data local inpath "/opt/module/hive/datas/video" into table gulivideo_ori;
load data local inpath "/opt/module/hive/datas/user.txt" into table gulivideo_user_ori;
6)向orc表插入数据
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
1、思路:
使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
2、代码
select
videoId,
`views`
from gulivideo_orc
order by `views` desc
limit 10;
OK
videoid views
dMH0bHeiRNg 42513417
0XxI-hvPRRA 20282464
1dmVU08zVpA 16087899
RB-wUgnyGv0 15712924
QjA5faZF1A8 15256922
-_CSo1gOd48 13199833
49IDp76kjPw 11970018
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
D2kJZOfq7zk 11184051
1、思路:
1)统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2)我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3)因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行行列转化(展开),然后再进行count即可。
4)最后按照热度排序,显示前10条。
2、代码
select
tmp01.category_col,
count(tmp01.videoId) num
from (
select
videoId,
category_col
from gulivideo_orc
lateral view
explode(category) t as category_col
) tmp01
group by tmp01.category_col
order by num desc
limit 10;
// 结果显示
OK
tmp01.category_col num
Music 179049
Entertainment 127674
Comedy 87818
Animation 73293
Film 73293
Sports 67329
Gadgets 59817
Games 59817
Blogs 48890
People 48890
1、思路
1)先找到观看书最高的20个视频所属条目的所有信息(主要是类目),降序排列
2)先把20条信息中的category分裂出来(列转行),形成新的字段category_name
3)在第二步的结果下,按照炸开的视频类别category_name分组,然后统计组内的个数category_count
2、最终代码
select
table02.categroy_name,
count(table02.videoId) num
from (
select
videoId,
categroy_name
from (
select
videoId,
`views`,
category
from gulivideo_orc
order by `views` desc
limit 20
) table01
lateral view
explode(category) tmp as categroy_name
) table02
group by table02.categroy_nam;
// 结果显示
OK
table02.categroy_name num
Blogs 2
UNA 1
Comedy 6
Entertainment 6
Music 5
People 2
1、思路
1)先找到观看数前50的视频信息(主要是求出关联视频)
2)炸开第一步求出的关联视频array,形成一个新字段new_relatedid
3)用new_relatedid和gulivideo_orc表进行join,求出new_relatedid的类别
4)炸开第三步结果的category,形成新字段category_name
5)按照catedory_name分组,然后求出每个分组的个数category_count
6)对category_count进行排序,利用开窗函数
2、代码
select
t5.category_name,
t5.num,
rank() over(order by t5.num desc ) rk
from (
select
t4.category_name,
count(t4.realte_id) num
from (
select
t3.realte_id,
category_name
from (
select
t2.realte_id,
g.category
from (
select
realte_id
from (
select
videoId,
relatedId,
`views`
from gulivideo_orc
order by `views` desc
limit 50
) t1
lateral view
explode(t1.relatedId) tmp as realte_id
) t2 join gulivideo_orc g on t2.realte_id = g.videoId
) t3
lateral view
explode(t3.category) tmp as category_name
) t4
group by t4.category_name
) t5 ;
// 结果显示OK
t5.category_name t5.num rk
Comedy 237 1
Entertainment 216 2
Music 195 3
People 51 4
Blogs 51 4
Animation 47 6
Film 47 6
News 24 8
Politics 24 8
Games 22 10
Gadgets 22 10
Sports 19 12
Howto 14 13
DIY 14 13
UNA 13 15
Travel 12 16
Places 12 16
Animals 11 18
Pets 11 18
Autos 4 20
Vehicles 4 20
1、思路
1)要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开成新的字段categary_name。
2)然后通过category_name过滤“Music”分类的所有视频信息,按照视频观看数倒序排序,取前10
3)统计对应类别(Music)中的视频热度
2、代码
select
videoId,
`views` hot
from (
select
videoId,
category_name,
`views`
from gulivideo_orc
lateral view
explode(category) tmp as category_name
) t1
where category_name = "Music"
order by hot desc
limit 10;
// 结果显示
OK
videoid hot
QjA5faZF1A8 15256922
tYnn51C3X_w 11823701
pv5zWaTEVkI 11672017
8bbTtPL1jRs 9579911
UMf40daefsI 7533070
-xEzGIuY7kw 6946033
d6C0bNDqf3Y 6935578
HSoVKUVOnfQ 6193057
3URfWTEPmtE 5581171
thtmaZnxk_0 5142238
1、思路
1)把每个原始表的类别炸开,形成新的字段category_name
2)按照炸裂开的类别字段category_name分区,按照视频观看数views倒叙排序进行开窗,求出每个类别下的所有视频的观看次数排序rk
3)按照rk字段对全表进行where过滤,求出每个类别观看书Top10
2、代码
select
t2.category_name,
t2.views,
t2.rk
from (
select
t1.category_name,
t1.views,
rank() over(partition by t1.category_name order by t1.views desc ) rk
from (
select
category_name,
`views`
from gulivideo_orc
lateral view
explode(category) tmp as category_name
) t1
) t2
where rk <= 10;
// 结果显示
OK
t2.category_name t2.views t2.rk
Comedy 42513417 1
Comedy 20282464 2
Comedy 11970018 3
Comedy 10107491 4
Comedy 9566609 5
Comedy 7066676 6
Comedy 6322117 7
Comedy 5826923 8
Comedy 5587299 9
Comedy 5508079 10
News 4706030 1
News 2899397 2
News 2817078 3
News 2803520 4
News 2348709 5
News 2335060 6
News 2326680 7
News 2318782 8
News 2310583 9
News 2291369 10
……
Time taken: 11.376 seconds, Fetched: 210 row(s)
有三种理解
理解一:取Top10中所有人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的十个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频,按照观看数取前20
2、代码
SELECT
t1.uploader,
t2.videoid,
t2.views
FROM
(
select
uploader,
videos
from gulivideo_user_orc
order by videos DESC
limit 10
) t1
JOIN
gulivideo_orc t2
on t1.uploader = t2.uploader
ORDER BY t2.views DESC
LIMIT 20;
// 结果显示
OK
t1.uploader t2.videoid t2.views
expertvillage -IxHBW0YpZw 39059
expertvillage BU-fT5XI_8I 29975
expertvillage ADOcaBYbMl0 26270
expertvillage yAqsULIDJFE 25511
expertvillage vcm-t0TJXNg 25366
expertvillage 0KYGFawp14c 24659
expertvillage j4DpuPvMLF4 22593
expertvillage Msu4lZb2oeQ 18822
expertvillage ZHZVj44rpjE 16304
expertvillage foATQY3wovI 13576
expertvillage -UnQ8rcBOQs 13450
expertvillage crtNd46CDks 11639
expertvillage D1leA0JKHhE 11553
expertvillage NJu2oG1Wm98 11452
expertvillage CapbXdyv4j4 10915
expertvillage epr5erraEp4 10817
expertvillage IyQoDgaLM7U 10597
expertvillage tbZibBnusLQ 10402
expertvillage _GnCHodc7mk 9422
expertvillage hvEYlSlRitU 7123
Time taken: 57.272 seconds, Fetched: 20 row(s)
理解二:取Top10中每个人上传的视频的观看次数前20
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有视频id,视频观看次数,还要按照uploader分区,views倒叙排序,求出每个uploder的上传的视频的观看排名
3)按照rk进行where过滤,求出rk<=20的数据
2、代码
select
t3.uploader,
t3.videoId,
t3.views,
t3.rk
from (
select
t2.uploader,
t2.videoId,
t2.views,
rank() over(partition by uploader order by t2.views desc ) rk
from (
select
t1.uploader,
g.videoId,
g.`views`
from (
select
uploader
from gulivideo_user_orc
order by videos desc
limit 10
) t1 join gulivideo_orc g on t1.uploader = g.uploader
) t2
) t3
where rk <=20;
// 结果显示
OK
t3.uploader t3.videoid t3.views t3.rk
expertvillage -IxHBW0YpZw 39059 1
expertvillage BU-fT5XI_8I 29975 2
expertvillage ADOcaBYbMl0 26270 3
expertvillage yAqsULIDJFE 25511 4
expertvillage vcm-t0TJXNg 25366 5
expertvillage 0KYGFawp14c 24659 6
expertvillage j4DpuPvMLF4 22593 7
expertvillage Msu4lZb2oeQ 18822 8
expertvillage ZHZVj44rpjE 16304 9
expertvillage foATQY3wovI 13576 10
expertvillage -UnQ8rcBOQs 13450 11
expertvillage crtNd46CDks 11639 12
expertvillage D1leA0JKHhE 11553 13
expertvillage NJu2oG1Wm98 11452 14
expertvillage CapbXdyv4j4 10915 15
expertvillage epr5erraEp4 10817 16
expertvillage IyQoDgaLM7U 10597 17
expertvillage tbZibBnusLQ 10402 18
expertvillage _GnCHodc7mk 9422 19
expertvillage hvEYlSlRitU 7123 20
Ruchaneewan 5_T5Inddsuo 3132 1
Ruchaneewan wje4lUtbYNU 1086 2
Ruchaneewan i8rLbOUhAlM 549 3
Ruchaneewan OwnEtde9_Co 453 4
Ruchaneewan 5Zf0lbAdJP0 441 5
Ruchaneewan wenI5MrYT20 426 6
Ruchaneewan Iq4e3SopjxQ 420 7
Ruchaneewan 3hzOiFP-5so 420 7
Ruchaneewan JgyOlXjjuw0 418 9
Ruchaneewan fGBVShTsuyo 395 10
Ruchaneewan O3aoL70DlVc 389 11
Ruchaneewan q4y2ZS5OQ88 344 12
Ruchaneewan lyUJB2eMVVg 271 13
Ruchaneewan _RF_3VhaQpw 242 14
Ruchaneewan DDl2cjI-aJs 231 15
Ruchaneewan xbYyjUdhtJw 227 16
Ruchaneewan 4dkKeIUkN7E 226 17
Ruchaneewan qCfuQA6N4K0 213 18
Ruchaneewan TmYbGQaRcNM 209 19
Ruchaneewan dOlfPsFSjw0 206 20
Time taken: 30.772 seconds, Fetched: 40 row(s
理解三:Top10用户上传的所有视频,有哪些视频是在视频观看次数前20的视频
1、思路
1)去用户表gulivideo_user_orc求出上传视频最多的10个用户
2)关联gulivideo_orc表,求出这10个用户上传的所有的视频id,视频观看次数
3)在第二步的结果上,与视频表观看次数前20的数据进行内连接,求出Top10用户上传的视频有哪些是观看次数前20的视频
2、代码
SELECT
t3.uploader,
t3.videoid,
t3.views
FROM
(
SELECT
t1.uploader,
t2.videoid,
t2.views
FROM
(
select
uploader,
videos
from gulivideo_user_orc
order by videos DESC
limit 10
) t1
JOIN
gulivideo_orc t2
on t1.uploader = t2.uploader
) t3
JOIN
(
select
videoid,
`views`
from gulivideo_orc
order by `views` desc
limit 20
) t4
on t3.videoid = t4.videoid;