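Take this page, for example:
http://www.cs.com.cn/xwzx/hg/201409/t20140924_4521344.html
The article body is split across several pages (a shameless way to inflate page views).
I want to use Scrapy to merge the body back into a single piece, but I couldn't find anything in the documentation about merging pages.
So I came up with this approach: if the page is paginated, concatenate the HTML of all the pages, then extract the body with lxml and XPath.
I tested it and the approach works. My question is: does Scrapy have a built-in way to do this more elegantly?
Core code snippet: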
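# Python 2 code. `response` is the Scrapy response passed to the callback;
# `headers` is a dict of request headers defined elsewhere in the spider.
from lxml import html
import HTMLParser
import requests
import re


def innerHTML(nodes):
    # Serialize every matched node and concatenate the markup.
    buildString = ''
    for child in nodes:
        buildString += html.tostring(child)
    # Turn HTML entities back into the characters they stand for.
    return HTMLParser.HTMLParser().unescape(buildString)


encoding = 'gbk'
source = response.body.decode(encoding, 'ignore')

# The first page carries the total page count in a "countPage = N/" fragment.
p = re.search(r'countPage = (.*?)/', response.body)
if p:
    # The remaining pages are <name>_1.html ... <name>_(N-1).html.
    for i in xrange(1, int(p.group(1))):
        url = '%s_%d.html' % (response.url.replace('.html', ''), i)
        source = source + requests.get(url, headers=headers).content.decode(encoding, 'ignore')

# Run the XPath over the concatenated HTML of all pages, then serialize the matches.
content = html.fromstring(source).xpath('//div[@class="Dtext z_content"]')
content = innerHTML(content)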
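The URL-building step can be pulled out into a small helper that is easy to check on its own. This is only a rough sketch in Python 3 using the standard library: `page_urls` and `COUNT_PAGE_RE` are names made up for illustration, and it assumes the page count appears as "countPage = N/" exactly as the regex in my snippet expects. Unlike the snippet, it strips only a trailing `.html` instead of every occurrence.

```python
import re

# Assumption: the first page embeds its page count as "countPage = N/",
# which is the pattern the regex in the snippet above searches for.
COUNT_PAGE_RE = re.compile(r'countPage = (.*?)/')


def page_urls(first_url, first_page_html):
    """Return the URLs of every page of the article, first page included."""
    match = COUNT_PAGE_RE.search(first_page_html)
    if not match:
        # No page count found: treat the article as a single page.
        return [first_url]
    count = int(match.group(1))
    # Strip only a trailing ".html" so the page index can be appended.
    suffix = '.html'
    base = first_url[:-len(suffix)] if first_url.endswith(suffix) else first_url
    # Page 0 is the original URL; pages 1..count-1 are <base>_<i>.html.
    return [first_url] + ['%s_%d.html' % (base, i) for i in range(1, count)]
```

With a page count of 3, this yields the original URL followed by its `_1.html` and `_2.html` variants, which could then be fetched one by one and merged.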