

Scrapy in Practice: Two-Direction Crawling (Douban Movie Top 250)

What to crawl: the top 250 movies in the Douban movie ranking

Page to crawl: https://movie.douban.com/top250

Analysis:

[Screenshot: upper part of the Top 250 list page]

[Screenshot: lower part of the Top 250 list page]

Movie detail page:

[Screenshot: movie detail page]

The information we want to scrape: title, poster, rating, director, screenwriter, cast, genre, runtime, first release date, and synopsis.

Approach:

Each Top 250 page lists only 25 entries, and the details we want live on each movie's own detail page rather than on the list page, so the crawl has to run in two directions: following each entry on the index page into its detail page is the horizontal crawl, and following the "next page" link through the pagination is the vertical crawl.

Process:

The fields to scrape: items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DoubanItem(Item):
    # Douban movies
    name = Field()                 # movie title
    image_url = Field()            # poster
    average = Field()              # rating
    director = Field()             # director
    screenwriter = Field()         # screenwriter
    star = Field()                 # cast
    genre = Field()                # genre
    runtime = Field()              # runtime
    initialReleaseDate = Field()   # first release date
    summary = Field()              # synopsis
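
The Field declarations make DoubanItem behave like a dict that only accepts these fields. A quick standalone sketch (the values below are made up) of filling and reading one:

from douban.items import DoubanItem

# Fields can be set at construction time or by key, just like a dict,
# but only the fields declared in DoubanItem are accepted.
item = DoubanItem(name='肖申克的救赎 The Shawshank Redemption', average='9.7')
item['director'] = '弗兰克·德拉邦特'

print(dict(item))
# Assigning an undeclared field raises an error, e.g. item['year'] = 1994
# -> KeyError: 'DoubanItem does not support field: year'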

The spider itself: basic.py

import scrapy
from urllib.parse import urljoin
from scrapy.loader import ItemLoader
from douban.items import DoubanItem
from scrapy.loader.processors import MapCompose, Join
from scrapy.http import Request


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["movie.douban.com"]
    start_urls = (
        'https://movie.douban.com/top250',
    )

    def parse(self, response):
        # Vertical crawl: follow the "next page" link of the pagination
        next_selector = response.xpath('//span[@class="next"]/a/@href')
        for url in next_selector.extract():
            yield Request(urljoin(response.url, url), dont_filter=True)

        # Horizontal crawl: follow every entry on the current page into its detail page
        item_selector = response.xpath('//div[@class="hd"]/a/@href')
        for url in item_selector.extract():
            yield Request(urljoin(response.url, url), callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        """ This function parses a movie detail page
        @url http://movie.douban.com/subject/26683723
        @returns items 1
        @scrapes name average director screenwriter star genre runtime initialReleaseDate summary
        """

        item_loader = ItemLoader(item=DoubanItem(), response=response)

        item_loader.add_xpath('name', '//*[@property="v:itemreviewed"]/text()')       # movie title
        item_loader.add_xpath('image_url', '//*[@rel="v:image"]/@src')                 # poster
        item_loader.add_xpath('average', '//*[@property="v:average"]/text()')          # rating
        item_loader.add_xpath('director', '//div[@id="info"]/span[1]/span[@class="attrs"]/a/text()')  # director
        item_loader.add_xpath('screenwriter', '//div[@id="info"]/span[2]/span[@class="attrs"]/a/text()',
                              Join())  # screenwriter
        item_loader.add_xpath('star', '//div[@id="info"]/span[@class="actor"]/span[@class="attrs"]/a/text()',
                              Join())  # cast
        item_loader.add_xpath('genre', '//*[@property="v:genre"]/text()',
                              Join())  # genre
        item_loader.add_xpath('runtime', '//*[@property="v:runtime"]/text()')          # runtime
        item_loader.add_xpath('initialReleaseDate', '//*[@property="v:initialReleaseDate"]/text()')  # first release date
        item_loader.add_xpath('summary', '//*[@property="v:summary"]//text()',
                              MapCompose(lambda i: i.replace(' ', '')),
                              MapCompose(lambda i: i.replace('\n', '')),
                              MapCompose(lambda i: i.replace('\u3000', '')),
                              Join())  # synopsis

        return item_loader.load_item()

Here the two-direction crawl is easy to see: next_selector extracts the URL of the next page (the vertical direction), while item_selector extracts the URL of every movie entry on the current page, each of which is followed into its detail page and parsed by parse_item (the horizontal direction).
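
Incidentally, the docstring of parse_item is a Scrapy spider contract: running scrapy check basic fetches the sample @url and verifies that exactly one item with the listed fields is scraped. The crawl itself is started with scrapy crawl basic; as a small optional sketch (run.py is my own name, not part of the original project), the same thing can be launched from a Python script, which is handy for debugging in an IDE:

# run.py - place next to scrapy.cfg in the project root;
# equivalent to typing "scrapy crawl basic -o top250.json" in a terminal.
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'basic', '-o', 'top250.json'])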

Saving the data: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymysql


class DoubanPipeline(object):
    def process_item(self, item, spider):
        # Read the MySQL connection parameters defined in settings.py
        # (spider.settings replaces the deprecated "from scrapy.conf import settings")
        settings = spider.settings
        host = settings['MYSQL_HOST']
        user = settings['MYSQL_USER']
        psd = settings['MYSQL_PASSWORD']
        db = settings['MYSQL_DB']
        c = settings['CHARSET']
        con = pymysql.connect(host=host, user=user, passwd=psd, db=db, charset=c)
        cur = con.cursor()
        sql = ("insert into movie(name, image_url, average, director, screenwriter, star, genre, runtime, initialReleaseDate, summary) "
               "values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
        values = [item['name'], item['image_url'], item['average'], item['director'], item['screenwriter'],
                  item['star'], item['genre'], item['runtime'], item['initialReleaseDate'], item['summary']]
        cur.execute(sql, values)
        con.commit()
        cur.close()
        con.close()
        return item
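
The pipeline assumes a movie table already exists in the douban database. The post doesn't show the schema, so here is a rough sketch of one that matches the INSERT above (the column types and lengths are my own guesses, not the author's original DDL):

# create_table.py - one-off helper sketch; use the same credentials as settings.py
import pymysql

ddl = """
CREATE TABLE IF NOT EXISTS movie (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    image_url VARCHAR(512),
    average VARCHAR(16),
    director VARCHAR(255),
    screenwriter VARCHAR(512),
    star TEXT,
    genre VARCHAR(255),
    runtime VARCHAR(64),
    initialReleaseDate VARCHAR(128),
    summary TEXT
) DEFAULT CHARSET=utf8
"""

con = pymysql.connect(host='127.0.0.1', user='root', passwd='your database password', db='douban', charset='utf8')
with con.cursor() as cur:
    cur.execute(ddl)
con.commit()
con.close()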

Configuration: settings.py

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Encoding used for exported feed files
FEED_EXPORT_ENCODING = 'utf-8'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 10

# MySQL connection settings read by the pipeline
MYSQL_HOST = '127.0.0.1'
MYSQL_DB = 'douban'                          # database name
MYSQL_USER = 'root'                          # database user
MYSQL_PASSWORD = 'your database password'    # database password
MYSQL_PORT = 3306                            # database port
CHARSET = 'utf8'                             # connection charset


# Register the pipeline that stores the scraped items
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
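
As an optional sanity check (check_settings.py is my own name, not part of the project), the settings can be loaded outside a crawl to confirm that the MySQL values and the pipeline registration are picked up:

# check_settings.py - run from the project root, next to scrapy.cfg
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings['MYSQL_HOST'], settings['MYSQL_DB'], settings['CHARSET'])
print(settings['ITEM_PIPELINES'])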

This is my secondary blog. Please leave comments on the main blog, and cite the source when reposting: https://www.baby7blog.com/myBlog/20.html
