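Following on from the previous note: Scrapy has no trouble crawling ordinary static pages that have no anti-scraping measures. In most cases, though, the data we want lives on pages that are loaded dynamically (rendered on the front end after an Ajax request returns, produced by JavaScript function calls, and so on). For those pages we need Selenium to simulate a person operating the browser, so that data from dynamic pages can be collected automatically.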
SPIDER_MIDDLEWARES = {
    'stock_spider.middlewares.StockSpiderSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'stock_spider.middlewares.StockSpiderDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'stock_spider.pipelines.StockSpiderPipeline': 300,
}
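The numbers in these dictionaries decide the order in which the components run. For downloader middlewares, `process_request` is called in increasing order of the number and `process_response` in decreasing order, and a value of `None` disables an entry. The snippet below is only a toy model of that rule, not Scrapy's own code; the two `scrapy.*` entries and their numbers are included purely for illustration.

```python
# Illustrative subset of DOWNLOADER_MIDDLEWARES. The two scrapy.* entries and
# their numbers are assumptions made for this example.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "stock_spider.middlewares.StockSpiderDownloaderMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
}


def call_order(middlewares):
    """Return (process_request order, process_response order) for a settings dict."""
    # Entries set to None are treated as disabled.
    enabled = {path: order for path, order in middlewares.items() if order is not None}
    request_order = sorted(enabled, key=enabled.get)   # increasing number
    response_order = list(reversed(request_order))     # decreasing number
    return request_order, response_order


request_order, response_order = call_order(DOWNLOADER_MIDDLEWARES)
print("process_request :", [p.rsplit(".", 1)[-1] for p in request_order])
print("process_response:", [p.rsplit(".", 1)[-1] for p in response_order])
```

With 543, the project's middleware sees each request after the lower-numbered entries and each response after the higher-numbered ones.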
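from selenium import webdriver
from selenium.webdriver.firefox.options import Options as firefox_options

# Run this inside a middleware method that receives the spider (for example spider_opened)
spider.driver = webdriver.Firefox(options=firefox_options())  # choose which browser to use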
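A browser that is opened should also be closed, otherwise Firefox processes can pile up after each crawl. Below is a minimal sketch of one way to manage that. `BrowserLifecycle` and its `driver_factory` parameter are hypothetical names, and the Selenium import is deferred so the class can be tried out with a stand-in driver.

```python
class BrowserLifecycle:
    """Hypothetical helper: attach a browser to the spider and release it again."""

    def __init__(self, driver_factory=None):
        # driver_factory is an assumption of this sketch: any callable that
        # returns an object with a quit() method.
        self._driver_factory = driver_factory or self._default_firefox

    @staticmethod
    def _default_firefox():
        # Imported lazily so Selenium is only needed when a real browser is used.
        from selenium import webdriver
        from selenium.webdriver.firefox.options import Options

        return webdriver.Firefox(options=Options())

    def spider_opened(self, spider):
        # Same idea as the line above: keep the driver on the spider object.
        spider.driver = self._driver_factory()

    def spider_closed(self, spider):
        driver = getattr(spider, "driver", None)
        if driver is not None:
            driver.quit()  # closes every window and ends the driver process
            spider.driver = None
```

In a Scrapy middleware, these two methods would typically be connected to the `spider_opened` and `spider_closed` signals inside `from_crawler`.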
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.
    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    # Fixed demo URL; pass request.url instead to load the page that was actually requested
    spider.driver.get("http://www.baidu.com")
    return None
from scrapy.http import HtmlResponse

def process_response(self, request, response, spider):
    # Called with the response returned from the downloader.
    # Must either:
    # - return a Response object
    # - return a Request object
    # - or raise IgnoreRequest
    # Replace the downloaded body with the HTML rendered by the browser
    response_body = spider.driver.page_source
    return HtmlResponse(url=request.url, body=response_body, encoding='utf-8', request=request)
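Because `process_request` returns None here, Scrapy still downloads the page with its own downloader, and `process_response` then swaps in the browser's HTML. One possible alternative is to return the response straight from `process_request`, in which case Scrapy skips its own download. The sketch below assumes that approach. `SeleniumRenderMiddleware` and `response_factory` are hypothetical names, and the factory exists only so the class can be exercised without Scrapy installed.

```python
class SeleniumRenderMiddleware:
    """Hypothetical downloader middleware that renders pages with spider.driver."""

    def __init__(self, response_factory=None):
        # Injectable factory (an assumption of this sketch); defaults to
        # scrapy.http.HtmlResponse when left as None.
        self._response_factory = response_factory

    def _make_response(self, **kwargs):
        if self._response_factory is None:
            # Deferred import: only needed when running inside Scrapy.
            from scrapy.http import HtmlResponse

            self._response_factory = HtmlResponse
        return self._response_factory(**kwargs)

    def process_request(self, request, spider):
        driver = getattr(spider, "driver", None)
        if driver is None:
            return None  # no browser attached: let Scrapy download as usual
        driver.get(request.url)  # load the page that was actually requested
        return self._make_response(
            url=driver.current_url,   # reflects any redirect the browser followed
            body=driver.page_source,  # HTML after JavaScript has run
            encoding="utf-8",
            request=request,
        )
```

Returning a response from `process_request` means the `process_response` override is no longer needed for swapping the body.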
Once the spider starts, you can see it launch the browser driver. From here you can implement all kinds of simulated user actions.
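As one example of such an action, pages that load more content as you scroll can be handled by scrolling until the page height stops growing. This is a rough sketch rather than a complete solution. `scroll_to_bottom`, `pause`, and `max_rounds` are names made up here, and the only driver method it relies on is `execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops changing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page, which may trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to fetch and render new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, assume the end was reached
        last_height = new_height
    return rounds
```

It could be called as `scroll_to_bottom(spider.driver)` after the page has loaded and before `page_source` is read. The right `pause` depends on how quickly the site responds.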