Scraping Douban Movie Comments: Content, Star Rating, Comment Time, and Vote Count

Published: January 1, 2024

Hi everyone, I'm 带我去滑雪, and I share one small trick every day!

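In this post we scrape the commenter, comment time, star rating, vote count, and comment text from Douban movie comments. Without further ado, here is the code: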

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

items=[]

# 25 pages of 20 comments each; `start` is the offset of the first comment on the page
for i in range(0,25):
    url=f'https://movie.douban.com/subject/30334073/comments?start={20*i}&limit=20&status=P&sort=new_score'
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
             'Referer':'https://movie.douban.com/subject/30334073/comments?sort=time&status=P',
             'Cookie':'bid=4HaXgwTES9U; __gads=ID=85e62e18d05513eb-2291e0501ccb00d5:T=1629877067:RT=1629877067:S=ALNI_MZYsnYWOu5VfO1vceNcKg66gwaMZQ; ll="118209"; __yadk_uid=ccg5plgEoNnVKRg6YOB3aKAChcQneXdk; _vwo_uuid_v2=DD8C0C94BE8722E387E94ECAB6722025A|642230c75b7a8e04a58060320d542d9e; ct=y; push_doumail_num=0; push_noty_num=0; _ga=GA1.2.637371737.1629877067; UM_distinctid=17bd361c41028e-096ad5aa89803-a7d193d-1fa400-17bd361c411840; Hm_lvt_19fc7b106453f97b6a84d64302f21a04=1631339005; __utmv=30149280.6183; ap_v=0,6.0; __utmc=30149280; __utmz=30149280.1632719355.16.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmc=223695111; __utmz=223695111.1632719356.13.5.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=30149280.637371737.1629877067.1632719355.1632722102.17; __utma=223695111.1603523566.1629877067.1632719356.1632722102.14; __utmb=223695111.0.10.1632722102; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1632722102%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DubNOD-vH_WgE_3tx3fkI3PF0djcVWGVrXh1AaMJu2SH2-5ojOwvOmXLUmvW-Sk2R%26wd%3D%26eqid%3D97dfe06d000c888d00000003615151f6%22%5D; _pk_ses.100001.4cf6=*; __utmb=30149280.3.10.1632722102; dbcl2="150297594:qnZRek3HTwI"; ck=_D-k; _pk_id.100001.4cf6=6a177a97f3dfd6a4.1629877067.14.1632724817.1632719534.'}
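    r=requests.get(url,headers=headers)
    time.sleep(1)  # pause between requests so the server is not hit too quickly

    soup=BeautifulSoup(r.text,'html.parser')
    # each comment sits in its own <div class="comment-item">
    comments_list=soup.find_all('div',class_="comment-item")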
    for comment in comments_list:
        votes=comment.find('span',class_='votes vote-count').text
        content=comment.find('span',class_='short').text
        author=comment.find('span',class_="comment-info").find('a').text
        comment_time=comment.find('span',class_="comment-time").get('title')
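        # the rating span's first class is assumed to look like "allstar40" (4 stars);
        # when a commenter left no rating, this span is the comment time instead
        rating_class=comment.find('span',class_="comment-info").find_all('span')[1].get('class')[0]
        star=rating_class[-2] if rating_class.startswith('allstar') else ''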
        item=[author,comment_time,star,votes,content]
        items.append(item)

df=pd.DataFrame(items,columns=['评论人','评论时间','星级','支持人数','评论内容']) 
df.to_csv('调音师.csv',encoding='utf_8_sig')
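
The loop above pages through the comments by stepping `start` in multiples of 20. As a minimal sketch of that pagination, the page URLs could also be built with the standard library. The helper name `build_comment_urls` is hypothetical and not part of the original script, and the `status=P` parameter is an assumption taken from the Referer header used there:

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/30334073/comments'

def build_comment_urls(pages=25, page_size=20):
    """Return one comment-page URL per page (hypothetical helper)."""
    urls = []
    for i in range(pages):
        query = urlencode({
            'start': page_size * i,  # offset of the first comment on this page
            'limit': page_size,      # number of comments per page
            'status': 'P',           # assumed from the Referer in the script
            'sort': 'new_score',
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls

urls = build_comment_urls()
print(urls[0])
print(len(urls))
```

Building the query string from a dict keeps each parameter separate, which makes a malformed query harder to write by hand.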

Output:

[Image: 83b1e15c8fbf4f3b9a373047b2d5e143.png]
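
The 星级 (star rating) column is read from the CSS class of the rating span, where the script takes the second-to-last character. A minimal sketch of the same idea with a regular expression follows. It assumes the class always has the form `allstarN0` (for example `allstar40` for 4 stars), and the helper `star_from_class` is hypothetical. It returns `None` for anything else, such as the class of a comment that has no rating:

```python
import re

def star_from_class(css_class):
    """Map a class like 'allstar40' to 4; return None if it is not a rating class."""
    match = re.fullmatch(r'allstar([1-5])0', css_class)
    return int(match.group(1)) if match else None

print(star_from_class('allstar40'))
print(star_from_class('comment-time'))
```

Returning `None` for unrated comments keeps them from ending up in the column as a stray letter.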

If you need the dataset, you can get it from Baidu Netdisk (the link does not expire):

Link: https://pan.baidu.com/s/173deLlgLYUz789M3KHYw-Q?pwd=0ly6
Extraction code: 2138


More content is on the way; see my homepage for the rest.

If you have questions, contact me by email: 1736732074@qq.com

Author's WeChat: TCB1736732074

Like and follow so you can find me again next time!


Source: https://blog.csdn.net/qq_45856698/article/details/135327038