Scrapy redirects to homepage for some URLs
I am new to the Scrapy framework and am using it to extract articles from multiple 'health & wellness' websites. For some of the requests, Scrapy is redirecting to the homepage (this behavior is not observed in a browser). Below is an example:
Command:

    scrapy shell "http://www.bornfitness.com/blog/page/10/"

Result:

    2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-06-19 21:32:15+0530 [default] INFO: Spider opened
    2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to <GET http://www.bornfitness.com/> from <GET http://www.bornfitness.com/blog/page/10/>
    2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/> (referer: None)
Note that the page number in the URL (10) is a two-digit number. I don't see this issue with URLs that have a single-digit page number (8, for example).

Result:

    2015-06-19 21:43:15+0530 [default] INFO: Spider opened
    2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/blog/page/8/> (referer: None)
When you have trouble replicating browser behavior using Scrapy, you generally want to look at what is being communicated differently when your browser talks to the website compared with when your spider talks to it. Remember that a website is (almost always) not designed to be nice to web crawlers, but to interact with web browsers.
For your situation, if you look at the headers being sent with your Scrapy request, you should see something like:
    In [1]: request.headers
    Out[1]:
    {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Encoding': 'gzip,deflate',
     'Accept-Language': 'en',
     'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}
If you examine the headers sent by a request for the same page from your web browser, you might see something like:
**Request Headers**

    GET /blog/page/10/ HTTP/1.1
    Host: www.bornfitness.com
    Connection: keep-alive
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
    DNT: 1
    Referer: http://www.bornfitness.com/blog/page/11/
    Accept-Encoding: gzip, deflate, sdch
    Accept-Language: en-US,en;q=0.8
    Cookie: fealty_segment_registeronce=1; ... ... ...
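To make the comparison concrete, the two header sets can be diffed with a few lines of plain Python. This is only a sketch: the values are copied from the two captures in this answer, the `Host`, `Connection` and `Cookie` lines are left out (the cookie is elided in the capture), and `diff_headers` is a hypothetical helper name, not part of Scrapy.

```python
# Headers reported by `request.headers` in the Scrapy shell.
scrapy_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "User-Agent": "Scrapy/0.24.6 (+http://scrapy.org)",
}

# Headers captured from the browser request for the same page
# (Host, Connection and Cookie omitted from this comparison).
browser_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
    ),
    "DNT": "1",
    "Referer": "http://www.bornfitness.com/blog/page/11/",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "en-US,en;q=0.8",
}


def diff_headers(first, second):
    """Return {name: (first_value, second_value)} for headers that are
    missing on one side or have different values. Header names are
    compared case-insensitively; a missing header is reported as None."""
    first_norm = {name.lower(): value for name, value in first.items()}
    second_norm = {name.lower(): value for name, value in second.items()}
    return {
        name: (first_norm.get(name), second_norm.get(name))
        for name in sorted(set(first_norm) | set(second_norm))
        if first_norm.get(name) != second_norm.get(name)
    }


differences = diff_headers(scrapy_headers, browser_headers)
for name, (scrapy_value, browser_value) in differences.items():
    print(f"{name}:\n  scrapy : {scrapy_value}\n  browser: {browser_value}")
```

The `user-agent` entry stands out in the result: Scrapy announces itself as a crawler, and a site can use that to decide how to respond.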
Try changing the User-Agent in your request. This should allow you to get around the redirect.
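One way to change the User-Agent for a whole project is Scrapy's `USER_AGENT` setting. The snippet below is a minimal sketch of a `settings.py` entry; the value is simply the browser string from the capture above, used here as an assumption, and whether it actually avoids the redirect depends on how the site treats that string.

```python
# settings.py of the Scrapy project.
# USER_AGENT is a standard Scrapy setting; the value below is the browser
# string from the capture above and is only an illustrative choice.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
)
```

For a one-off check in the shell, the same setting can be passed on the command line with the `-s` option: `scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "http://www.bornfitness.com/blog/page/10/"`.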