
Scrapy in Practice: Scraping a Single-Page List


This was an assignment for my Python class. It had been quite a while since I last used Scrapy, so it was a good chance to get familiar with it again.

Task:

Scrape the information from the page at: http://47.104.173.82/neu_emotion/Home/index/showPartData.html

Requirements: scrape the information shown inside the red boxes in the screenshot below, namely (index, audio ID, expert ID, expert name, feeling, audio URL); design a MySQL table for this data yourself and store the scraped records in the database.

Analysis:

Screenshot: the list

Screenshot: the HTML structure

Looking at the page, the list holds a few hundred entries, and each entry sits in its own <table> element with class "table table-striped". Each table contains seven cells, of which the first six carry the information we need. The first four cells share the same structure, so they can be read with nearly the same XPath expression, only the index changes; the fifth field is whichever <option> inside the <select> is marked as selected, and the sixth is the src attribute of an <audio> element. So we first pick out every table with //*[@class="table table-striped"], then ./tbody/tr/td[1]/b/text() (changing the index from 1 to 4) yields the first four fields, ./tbody/tr/td/select/option[@selected]/text() returns the selected option, and finally ./tbody/tr/td/audio/@src gives the audio path.
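Before writing the spider it is worth verifying these expressions, either in the Scrapy shell against the live page or against a hand-written fragment. The snippet below uses a standalone Selector; the HTML fragment is only an assumption about the markup (pieced together from the XPath paths above and the label text that the pipeline strips later), not the site's real source:

# -*- coding: utf-8 -*-
from scrapy.selector import Selector

# Assumed markup for a single entry -- the real page may differ in details.
html = """
<table class="table table-striped">
  <tbody><tr>
    <td><b>序号:1</b></td>
    <td><b>#1音频编号</b></td>
    <td><b>专家ID#7</b></td>
    <td><b>专家名字_张三</b></td>
    <td><select><option>平静</option><option selected>高兴</option></select></td>
    <td><audio src="/neu_emotion/audio/1.wav"></audio></td>
  </tr></tbody>
</table>
"""

sel = Selector(text=html)
for table in sel.xpath('//*[@class="table table-striped"]'):
    print(table.xpath('./tbody/tr/td[1]/b/text()').extract_first())                      # 序号:1
    print(table.xpath('./tbody/tr/td/select/option[@selected]/text()').extract_first())  # 高兴
    print(table.xpath('./tbody/tr/td/audio/@src').extract_first())                       # /neu_emotion/audio/1.wav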

Process:

Database design:

Screenshot: database design

The first four columns are integers and the remaining three are varchar.
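For reference, here is a minimal sketch of one way to create such a table with pymysql. It assumes a table named test (the name used by the INSERT statement in the pipeline below) with column names taken from the item fields; the exact names and types should follow the design in the screenshot, so adjust as needed:

# One-off helper to create the target table. The schema is a sketch, not the
# author's exact design -- adjust names and types to match the screenshot.
import pymysql

con = pymysql.connect(host='127.0.0.1', user='root', passwd='your own password',
                      db='pythonscrapy', charset='utf8')
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS test (
        Id         INT,           -- index
        AudioId    INT,           -- audio ID
        ExpertId   INT,           -- expert ID
        ExpertName VARCHAR(64),   -- expert name
        Feel       VARCHAR(64),   -- feeling
        AudioURL   VARCHAR(255)   -- audio URL
    ) DEFAULT CHARSET=utf8
""")
con.commit()
cur.close()
con.close()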

Create the Scrapy project:

$ scrapy startproject grade
$ cd grade
$ scrapy genspider basic web
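After these commands the project has the standard layout that Scrapy generates (a couple of auto-generated files omitted); the files edited below are items.py, pipelines.py, settings.py and the generated spider basic.py:

grade/
├── scrapy.cfg
└── grade/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── basic.py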

First, items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class GradeItem(Item):
    Id = Field()          # 序号 (index)
    AudioId = Field()     # 音频编号 (audio ID)
    ExpertId = Field()    # 专家ID (expert ID)
    ExpertName = Field()  # 专家名 (expert name)
    Feel = Field()        # 感受 (feeling)
    AudioURL = Field()    # 音频URL (audio URL)

These are the six fields we scrape.

Next comes the core of the spider, basic.py:

# -*- coding: utf-8 -*-
import scrapy
from grade.items import GradeItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["47.104.173.82"]
    start_urls = (
        'http://47.104.173.82/neu_emotion/Home/index/showPartData.html',
    )

    def parse(self, response):
        """ This function parses the grade list page
        @url http://47.104.173.82/neu_emotion/Home/index/showPartData.html
        @returns items 1
        @scrapes Id AudioId ExpertId ExpertName Feel AudioURL
        """

        # Every entry sits in its own <table class="table table-striped">
        for each in response.xpath('//*[@class="table table-striped"]'):
            item = GradeItem()
            item['Id'] = each.xpath("./tbody/tr/td[1]/b/text()").extract()[0]          # index
            item['AudioId'] = each.xpath("./tbody/tr/td[2]/b/text()").extract()[0]     # audio ID
            item['ExpertId'] = each.xpath("./tbody/tr/td[3]/b/text()").extract()[0]    # expert ID
            item['ExpertName'] = each.xpath("./tbody/tr/td[4]/b/text()").extract()[0]  # expert name
            item['Feel'] = each.xpath("./tbody/tr/td/select/option[@selected]/text()").extract()[0]  # feeling
            item['AudioURL'] = each.xpath("./tbody/tr/td/audio/@src").extract()[0]     # audio URL
            yield item
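One caveat about the extraction: .extract()[0] raises an IndexError as soon as a table is missing one of these elements, which aborts the crawl. The page scraped here is regular enough that this never bites, but if it ever did, a more forgiving sketch of the same method (a drop-in replacement inside BasicSpider, not something the assignment asks for) swaps in extract_first() with a default:

    # Drop-in replacement for BasicSpider.parse: extract_first() returns the
    # given default instead of raising when an element is missing.
    def parse(self, response):
        for each in response.xpath('//*[@class="table table-striped"]'):
            item = GradeItem()
            item['Id'] = each.xpath("./tbody/tr/td[1]/b/text()").extract_first(default="")
            item['AudioId'] = each.xpath("./tbody/tr/td[2]/b/text()").extract_first(default="")
            item['ExpertId'] = each.xpath("./tbody/tr/td[3]/b/text()").extract_first(default="")
            item['ExpertName'] = each.xpath("./tbody/tr/td[4]/b/text()").extract_first(default="")
            item['Feel'] = each.xpath("./tbody/tr/td/select/option[@selected]/text()").extract_first(default="")
            item['AudioURL'] = each.xpath("./tbody/tr/td/audio/@src").extract_first(default="")
            yield item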

Next, saving the scraped data into MySQL: pipelines.py

# -*- coding: utf-8 -*-
import pymysql
from scrapy.conf import settings


class GradePipeline(object):
    def process_item(self, item, spider):
        # Read the connection parameters from settings.py
        host = settings['MYSQL_HOST']
        user = settings['MYSQL_USER']
        psd = settings['MYSQL_PASSWORD']
        db = settings['MYSQL_DB']
        c = settings['CHARSET']
        con = pymysql.connect(host=host, user=user, passwd=psd, db=db, charset=c)
        cur = con.cursor()
        sql = "INSERT INTO test VALUES(%s,%s,%s,%s,%s,%s)"
        # Strip the on-page labels before inserting
        item_list = [item['Id'].replace("序号:", ""),
                     item['AudioId'].replace("#", "").replace("音频编号", ""),
                     item['ExpertId'].replace("专家ID#", ""),
                     item['ExpertName'].replace("专家名字_", ""),
                     item['Feel'],
                     item['AudioURL']]
        cur.execute(sql, item_list)
        con.commit()
        cur.close()
        con.close()
        return item

The database connection parameters live in settings.py; the rest is just an INSERT statement, with the extra label text stripped from each field before inserting.
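Opening and closing a MySQL connection for every item works for this assignment, but with a few hundred rows it is wasteful. Here is a sketch of an alternative (the class name GradeMySQLPipeline is mine; it reuses the same settings keys and the same test table) that opens the connection once per run via open_spider/close_spider:

# -*- coding: utf-8 -*-
# Sketch of an alternative pipeline: one connection per spider run instead of
# one per item. Reuses the MYSQL_* / CHARSET keys from settings.py.
import pymysql


class GradeMySQLPipeline(object):
    def open_spider(self, spider):
        s = spider.settings
        self.con = pymysql.connect(host=s['MYSQL_HOST'], user=s['MYSQL_USER'],
                                   passwd=s['MYSQL_PASSWORD'], db=s['MYSQL_DB'],
                                   charset=s['CHARSET'])
        self.cur = self.con.cursor()

    def process_item(self, item, spider):
        sql = "INSERT INTO test VALUES(%s,%s,%s,%s,%s,%s)"
        self.cur.execute(sql, [item['Id'].replace("序号:", ""),
                               item['AudioId'].replace("#", "").replace("音频编号", ""),
                               item['ExpertId'].replace("专家ID#", ""),
                               item['ExpertName'].replace("专家名字_", ""),
                               item['Feel'],
                               item['AudioURL']])
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.con.close()

To switch to it, point the ITEM_PIPELINES entry in settings.py at 'grade.pipelines.GradeMySQLPipeline' instead.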

Finally, the configuration file settings.py:

BOT_NAME = 'grade'

SPIDER_MODULES = ['grade.spiders']
NEWSPIDER_MODULE = 'grade.spiders'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# MySQL connection settings
MYSQL_HOST = '127.0.0.1'
MYSQL_DB = 'pythonscrapy'                # database name
MYSQL_USER = 'root'                      # database user
MYSQL_PASSWORD = 'your own password'     # database password
MYSQL_PORT = 3306                        # database port
CHARSET = 'utf8'                         # character set

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'grade.pipelines.GradePipeline': 300,
}

ROBOTSTXT_OBEY = False simply tells Scrapy not to honour robots.txt. The 'grade.pipelines.GradePipeline' entry refers to the GradePipeline class in pipelines.py: once the spider is started with scrapy crawl basic, every item it yields is passed to GradePipeline.process_item, which writes it to the database.
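Once the crawl has finished, it is easy to confirm the rows actually reached MySQL. A quick check, assuming the same connection settings and the test table used above:

# Quick sanity check: count the stored rows and peek at the first few.
import pymysql

con = pymysql.connect(host='127.0.0.1', user='root', passwd='your own password',
                      db='pythonscrapy', charset='utf8')
cur = con.cursor()
cur.execute("SELECT COUNT(*) FROM test")
print(cur.fetchone()[0])
cur.execute("SELECT * FROM test LIMIT 3")
for row in cur.fetchall():
    print(row)
cur.close()
con.close()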

This is my secondary blog; please leave comments on my main blog. If you repost, please credit the source: https://www.baby7blog.com/myBlog/19.html
