Python: Getting Started with the Scrapy Framework

Published: January 10, 2024

This project scrapes news items from the official Jiangnan University news site:
https://news.jiangnan.edu.cn/yw.htm

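Installing Scrapy

macOS | Linux: pip install scrapy
Windows:

pip install wheel
pip install pywin32
Install Twisted: pip install Twisted_iocpsupport-1.0.2-cp310-cp310-win_amd64.whl
(Download this file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted. The cp310 in the filename stands for Python 3.10, so pick the file that matches your Python version.)
pip install scrapy

After installation, run scrapy in the project terminal; if it prints no errors, the installation succeeded.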

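Creating a Scrapy project

Create the project: scrapy startproject projectName

Create a spider file: scrapy genspider spiderName www.xxx.com

scrapy genspider: the command
spiderName: the name of the spider file
www.xxx.com: the site to crawl; this can be changed later

After the command runs, a new file appears in the spiders directory; the crawling rules go in that file.
Run the project: scrapy crawl spiderName (the spider to run)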

Data parsing

Re

Regular expressions; the syntax is much the same across languages.
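
As a small illustration of regex-based parsing, the sketch below pulls a link, a title, and a date out of an HTML fragment with Python's standard re module. The fragment and the pattern are made up for this example and assume very regular markup; real pages usually need a more tolerant pattern or one of the parsers described next.

```python
import re

# Hypothetical list entry, similar in shape to a news index page
html = '<li><a href="info/1001.htm">Campus news</a><span>2024-01-08</span></li>'

# Named groups keep the extracted pieces self-describing
pattern = re.compile(
    r'<a href="(?P<href>[^"]+)">(?P<title>[^<]+)</a>'
    r'<span>(?P<date>\d{4}-\d{2}-\d{2})</span>'
)

# finditer yields one match object per entry found in the text
matches = [m.groupdict() for m in pattern.finditer(html)]
print(matches)
```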

bs4

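Installation

pip install bs4
pip install lxml

Instantiation

from bs4 import BeautifulSoup
From a local file

fp = open('./txt.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')

From a fetched web page

page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')

Parsing the data

soup.tagName
- soup.tagName: returns the first tag named tagName in the document
soup.find()
- soup.find('tagName'): equivalent to soup.tagName
- Attribute-based lookup: soup.find('div', class_='song'); id= or any other attribute works the same way
- soup.find_all('tagName'): returns all matching tags (a list)
select
- select('some selector (id, class, tag, ... selector)'): returns a list
- Hierarchical selectors: soup.select('.tang > ul > li > a'): > denotes a single level
- soup.select('.tang > ul a'): a space denotes any number of levels
Getting the text between tags
- soup.a.text/string/get_text()
  - text/get_text(): gets all text content inside a tag, including nested tags
  - string: gets only the tag's own direct text (None if the tag has more than one child)
Getting attributes
- soup.a['href']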
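
The calls listed above can be tried on a small in-memory document. This is only a sketch: the HTML is invented, and it uses the built-in "html.parser" backend instead of 'lxml' so that it runs with bs4 alone.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the .tang > ul > li > a examples
html = """
<div class="tang">
  <ul>
    <li><a href="/a.htm">First</a></li>
    <li><a href="/b.htm">Second <b>bold</b></a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                             # first <a> in the document
div = soup.find("div", class_="tang")        # attribute-based lookup
all_links = soup.find_all("a")               # list of every <a>
direct = soup.select(".tang > ul > li > a")  # ">" steps one level at a time
nested = soup.select(".tang a")              # a space spans any depth

second = all_links[1]
print(first_a.string)     # the single text child of the first link
print(second.get_text())  # text of the tag and of its nested <b>
print(second.string)      # None, because the tag has two children
print(second["href"])     # attribute access
```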

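XPath

Instantiating an etree object

pip install lxml

from lxml import etree

Local file

etree.parse(filepath)

Web page

etree.HTML(page_text)

Parsing

/ at the start of an expression locates from the root node; between tags it denotes a single level
// denotes any number of levels
Attribute-based lookup: //tagName[@attrName='attrValue'], e.g. //div[@class='song']
Index-based lookup: //div[@class="song"]/p[3] (indices start at 1)
Getting text:
- /text(): the tag's direct text content
- //text(): all text under the tag
Getting attributes:
- /@attrName, e.g. img/@src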
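
A short sketch of these expressions with lxml, run against an invented page; the markup and variable names are assumptions chosen to match the div[@class="song"] examples above. Note that xpath() returns a list even when only one node matches.

```python
from lxml import etree

# Made-up page text; in practice this would be the downloaded HTML
page_text = """
<html><body>
  <div class="song">
    <p>one</p><p>two</p><p>three <b>bold</b></p>
    <img src="/logo.png"/>
  </div>
</body></html>
"""
tree = etree.HTML(page_text)

# Attribute lookup plus index lookup (XPath indices start at 1)
third_p = tree.xpath('//div[@class="song"]/p[3]')[0]

# /text() keeps only the direct text, //text() also descends into <b>
direct_text = tree.xpath('//div[@class="song"]/p[3]/text()')
all_text = tree.xpath('//div[@class="song"]/p[3]//text()')

# /@attrName reads an attribute value
src = tree.xpath('//div[@class="song"]/img/@src')

print(third_p.tag, direct_text, all_text, src)
```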

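Persistent storage

Via a terminal command

Have the parse function return the data to be saved.
Add the -o option when running the spider: scrapy crawl spiderName -o filePath. The output format is taken from the file extension, and only certain formats are supported (for example json, csv, and xml).

Pipelines

Define the fields to save in items.py

import scrapy


class SchoolItem(scrapy.Item):

    school = scrapy.Field()
    Time = scrapy.Field()
    Col = scrapy.Field()
    Title = scrapy.Field()
    Text = scrapy.Field()
    Provenance = scrapy.Field()
    URL = scrapy.Field()
    FWLCount = scrapy.Field()
    Heat = scrapy.Field()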

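In the spider's parse function, fill in the item and yield it (the spider file needs import random and an import of SchoolItem at the top)

item = SchoolItem()
item["school"] = "江南大学"  # Jiangnan University
item["Col"] = "综合新闻"  # column: general news
# Heat and FWLCount are filled with random placeholder values
item["Heat"] = random.randint(500, 1000)
item["FWLCount"] = random.randint(100, 2000)

# data, handle_url, title, content and source are values extracted earlier in parse
item["Time"] = data
item["URL"] = handle_url
item["Title"] = title
item["Text"] = content
item["Provenance"] = source
yield item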

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    # The number is the pipeline's priority; lower values run first
    "school.pipelines.SchoolPipeline": 300,
}

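Writing the pipeline

import pymysql


class SchoolPipeline:

    conn = None
    cursor = None
    new_Num = 0

    def open_spider(self, spider):
        print("Starting to collect site data...")
        self.conn = pymysql.Connect(
            user='root',
            password='root',
            host='localhost',
            port=3306,
            database='yu'
        )
        # Create the cursor once so close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # Name the columns explicitly so the values cannot land in the
            # wrong column, and let the driver escape the values
            query = (
                "insert into app01_schoolnews "
                "(`school`,`Time`,`Col`,`Title`,`Text`,`URL`,`Provenance`,`Heat`,`FWLCount`) "
                "values (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            )
            values = (
                item['school'], item['Time'], item['Col'], item['Title'], item['Text'],
                item['URL'], item['Provenance'], item['Heat'], item['FWLCount'])
            self.cursor.execute(query, values)
            self.conn.commit()
            print("Inserted into the database...")
            self.new_Num += 1
        except Exception as e:
            # Undo the failed insert and keep crawling
            self.conn.rollback()
            print("MySQL error...", e)
        # Return the item so any later pipeline can still process it
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
        print("Database connection closed. %d rows stored in total..." % self.new_Num)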
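
To try the open_spider / process_item / close_spider life cycle without a MySQL server, here is a rough stand-in that uses Python's built-in sqlite3 module instead of pymysql. The class name SqliteNewsPipeline, the two-column news table, and the sample item are all assumptions for illustration. It also shows why values are best passed separately from the SQL text: a title containing a single quote would break an INSERT assembled by string formatting.

```python
import sqlite3


class SqliteNewsPipeline:
    """Hypothetical stand-in for a MySQL pipeline, backed by in-memory SQLite."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE news (Title TEXT, URL TEXT)")
        self.count = 0

    def process_item(self, item, spider):
        # "?" placeholders let the driver escape quotes inside the values
        self.conn.execute(
            "INSERT INTO news (Title, URL) VALUES (?, ?)",
            (item["Title"], item["URL"]),
        )
        self.conn.commit()
        self.count += 1
        # Hand the item on to the next pipeline, if any
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


# Drive the pipeline by hand, the way Scrapy would during a crawl
pipeline = SqliteNewsPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"Title": "It's a headline", "URL": "https://example.com/1"}, spider=None
)
stored = pipeline.conn.execute("SELECT Title, URL FROM news").fetchall()
pipeline.close_spider(spider=None)
print(stored)
```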

Table schema

CREATE TABLE `app01_schoolnews` (
  `school` varchar(255) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `Time` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `Col` longtext COLLATE utf8mb4_general_ci NOT NULL,
  `Title` longtext COLLATE utf8mb4_general_ci NOT NULL,
  `Text` longtext COLLATE utf8mb4_general_ci NOT NULL,
  `Provenance` longtext COLLATE utf8mb4_general_ci NOT NULL,
  `URL` varchar(255) COLLATE utf8mb4_general_ci NOT NULL,
  `FWLCount` int NOT NULL,
  `Heat` double NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

Gitee repository: https://gitee.com/xiongjinwang/python

Article source: https://blog.csdn.net/m0_49000161/article/details/135431708