This project crawls Zhihu user profiles and the follower/followee topology between users. The crawler is built with Scrapy, and the data is stored in MongoDB. The scraped data itself is probably of little practical use; think of this project as a worked example of Scrapy.
## Usage
### Running locally
The crawler depends on MongoDB and RabbitMQ, so both services must be running and configured. To speed up crawling, image downloads are handled as asynchronous tasks, so the async worker has to be started before the crawler process. Change into the zhihu_spider/zhihu directory and run:
```
celery -A zhihu.tools.async worker --loglevel=info
```
### Docker deployment
Change into zhihu_spider and run `docker-compose up`. Once inside the container, the procedure is the same as running locally: start mongo, rabbitmq, the async worker, and the crawler process in that order. The Docker image used here comes from my other project [spider-docker](https://github.com/LiuRoy/spider_docker).
## Flow chart
![Flow chart](doc/流程图.png)
* Request [https://www.zhihu.com](https://www.zhihu.com) and extract the `_xsrf` value from the page. Zhihu enables cross-site request forgery protection, so every POST request must carry this parameter.
* Submit the username, the password, and the `_xsrf` value parsed in the first step to [https://www.zhihu.com/login/email](https://www.zhihu.com/login/email) to log in and obtain cookies.
* Visit a user profile page, using my own as an example [https://www.zhihu.com/people/weizhi-xiazhi](https://www.zhihu.com/people/weizhi-xiazhi), as shown below:
![](doc/主页.png)
The parsed user information includes the nickname, the avatar URL, basic profile fields, and the followee and follower counts. This page also links to the followee and follower list pages.
* Build the relationship graph from the followee and follower list pages obtained in the previous step. The two pages are nearly identical; the only complication is that the static page shows at most twenty users, so the full list must be fetched with paginated POST requests. Profile links parsed from these lists are fed back to the previous step for parsing.
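The token extraction in the first step can be sketched with a small helper. This is an illustration only: `extract_xsrf` is a hypothetical name, and the exact markup Zhihu serves may differ, but the token is embedded as a hidden form field.

```python
import re

def extract_xsrf(html):
    """Pull the _xsrf token out of the homepage HTML.

    Zhihu embeds it as a hidden <input>; every subsequent POST
    (login, paginated follower lists) must carry this value.
    Returns None if the token is not present.
    """
    match = re.search(r'name="_xsrf"\s+value="([^"]+)"', html)
    return match.group(1) if match else None

sample = '<input type="hidden" name="_xsrf" value="abc123"/>'
extract_xsrf(sample)  # 'abc123'
```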
## Code walkthrough
The Scrapy documentation is very thorough, so I won't go through it in detail here; any question you run into should have an answer there.
![Code](doc/代码.png)
* The framework starts from start\_requests, which submits a request for the Zhihu homepage to the engine with post\_login as the callback.
* post\_login parses the homepage, saves \_xsrf as a member variable, and submits the login POST request with after\_login as the callback.
* after\_login receives the post-login cookies and submits a GET request for start\_url to the engine with parse\_people as the callback.
* parse\_people parses a profile page, submits the followee and follower list pages to the engine with parse\_follow as the callback, and hands the parsed profile data to the engine to be written to MongoDB.
* parse\_follow parses a user list, sends the POST requests for the dynamically loaded entries to the engine with parse\_post\_follow as the callback, sends requests for the parsed profile URLs to the engine as well, and writes the relationships to MongoDB.
* parse\_post\_follow simply parses a user list and submits profile page requests to the engine.
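The callback chain above can be made explicit as data. This is only an illustration of the flow, not code from the project; `CALLBACK_CHAIN` and `reachable` are hypothetical names:

```python
# Each entry: callback -> (what it parses, which callbacks it schedules next)
CALLBACK_CHAIN = {
    'start_requests': ('seed request for the Zhihu homepage', ['post_login']),
    'post_login': ('_xsrf token from the homepage', ['after_login']),
    'after_login': ('post-login cookies', ['parse_people']),
    'parse_people': ('profile page', ['parse_follow']),
    'parse_follow': ('static user list', ['parse_post_follow', 'parse_people']),
    'parse_post_follow': ('paginated user list', ['parse_people']),
}

def reachable(start):
    """Return every callback reachable from `start`, in BFS order."""
    seen, queue = [], [start]
    while queue:
        name = queue.pop(0)
        if name in seen:
            continue
        seen.append(name)
        queue.extend(CALLBACK_CHAIN.get(name, ('', []))[1])
    return seen
```

Note the cycle: `parse_follow` and `parse_post_follow` both feed profile URLs back to `parse_people`, which is how the crawl expands outward through the social graph.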
## Screenshots
![people](doc/people.png)
![relation](doc/relation.png)
![image](doc/image.png)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os

from pymongo import MongoClient

from zhihu.settings import MONGO_URI, PROJECT_DIR
from zhihu.items import ZhihuPeopleItem, ZhihuRelationItem
from zhihu.tools.async import download_pic


class ZhihuPipeline(object):
    """
    Store crawled data in MongoDB.
    """
    def __init__(self, mongo_uri, mongo_db, image_dir):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.image_dir = image_dir
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=MONGO_URI,
            mongo_db='zhihu',
            image_dir=os.path.join(PROJECT_DIR, 'images')
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        if not os.path.exists(self.image_dir):
            os.mkdir(self.image_dir)

    def close_spider(self, spider):
        self.client.close()

    def _process_people(self, item):
        """
        Store user profile data, then queue the avatar download
        as an async Celery task.
        """
        collection = self.db['people']
        zhihu_id = item['zhihu_id']
        collection.update({'zhihu_id': zhihu_id},
                          dict(item), upsert=True)

        image_url = item['image_url']
        if image_url and zhihu_id:
            image_path = os.path.join(self.image_dir,
                                      '{}.jpg'.format(zhihu_id))
            download_pic.delay(image_url, image_path)

    def _process_relation(self, item):
        """
        Store follower/followee relationships, merging new user
        lists into any existing record.
        """
        collection = self.db['relation']
        data = collection.find_one({
            'zhihu_id': item['zhihu_id'],
            'user_type': item['user_type']})
        if not data:
            collection.insert(dict(item))
        else:
            origin_list = data['user_list']
            new_list = item['user_list']
            data['user_list'] = list(set(origin_list) | set(new_list))
            collection.update({'zhihu_id': item['zhihu_id'],
                               'user_type': item['user_type']}, data)

    def process_item(self, item, spider):
        """
        Dispatch each item to the matching handler.
        """
        if isinstance(item, ZhihuPeopleItem):
            self._process_people(item)
        elif isinstance(item, ZhihuRelationItem):
            self._process_relation(item)
        return item
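The part of the pipeline worth noting is the set-union merge in `_process_relation`, which deduplicates follower/followee lists across repeated crawls of the same user. A standalone sketch of that step (the helper name is illustrative; note also that `Collection.update`, used above, has since been replaced by `update_one`/`replace_one` in modern PyMongo):

```python
def merge_user_lists(origin_list, new_list):
    """Union two follower/followee lists, dropping duplicates.

    Mirrors the set-union step in ZhihuPipeline._process_relation;
    result order is not guaranteed, matching set semantics.
    """
    return list(set(origin_list) | set(new_list))

merged = merge_user_lists(['alice', 'bob'], ['bob', 'carol'])
# merged contains 'alice', 'bob', 'carol' exactly once each
```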
# -*- coding: utf-8 -*-

# Scrapy settings for zhihu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
import os

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) ' \
             'AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/49.0.2623.87 Safari/537.36'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Connection': 'keep-alive'
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zhihu.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zhihu.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'zhihu.pipelines.SomePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'

# Breadth-first crawl: positive DEPTH_PRIORITY plus FIFO scheduler queues
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Project root path
PROJECT_DIR = os.path.dirname(os.path.abspath(os.path.curdir))

# MongoDB configuration
MONGO_URI = 'mongodb://localhost:27017'

# Item pipeline settings
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 500,
}

# Celery async task queue (RabbitMQ broker)
BROKER_URL = 'amqp://guest:guest@localhost:5672//'
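The last block of settings switches Scrapy from its default depth-first (LIFO) ordering to breadth-first: a positive `DEPTH_PRIORITY` combined with FIFO queues. The effect of a FIFO versus LIFO frontier can be sketched with a toy link graph (illustrative only, no Scrapy required):

```python
from collections import deque

def crawl_order(graph, start, fifo=True):
    """Return the visit order over a link graph using a FIFO
    (breadth-first) or LIFO (depth-first) frontier, mimicking
    Scrapy's scheduler queue choice."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        node = frontier.popleft() if fifo else frontier.pop()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['E']}
crawl_order(graph, 'A', fifo=True)   # BFS: ['A', 'B', 'C', 'D', 'E']
crawl_order(graph, 'A', fifo=False)  # DFS: ['A', 'C', 'E', 'B', 'D']
```

Breadth-first order suits this crawler: it explores users close to the seed account before wandering deep into the social graph.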
# -*- coding: utf-8 -*-
"""
    Constant definitions.
"""
from zhihu.settings import USER_AGENT


class Gender(object):
    """
    Gender values.
    """
    MALE = 1
    FEMALE = 2


class People(object):
    """
    User-list type: people a user follows vs. the user's followers.
    """
    Followee = 1
    Follower = 2


HEADER = {
    'Host': 'www.zhihu.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': '*/*',
    'Origin': 'https://www.zhihu.com',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': USER_AGENT,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
}