Scrapy redirects to homepage for some URLs


I am new to the Scrapy framework and am using it to extract articles from multiple 'health & wellness' websites. For some of the requests, Scrapy is redirecting to the homepage (this behavior is not observed in a browser). Below is an example:

Command:

    scrapy shell "http://www.bornfitness.com/blog/page/10/"

Result:

    2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-06-19 21:32:15+0530 [default] INFO: Spider opened
    2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to <GET http://www.bornfitness.com/> from <GET http://www.bornfitness.com/blog/page/10/>
    2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/> (referer: None)

Note that the page number in the URL (10) is a two-digit number. I don't see this issue with URLs that have a single-digit page number (8, for example).

Result:

    2015-06-19 21:43:15+0530 [default] INFO: Spider opened
    2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) <GET http://www.bornfitness.com/blog/page/8/> (referer: None)

When you have trouble replicating browser behavior using Scrapy, you generally want to look at what is being communicated differently when your browser is talking to the website compared with when your spider is talking to it. Remember that a website is (almost always) not designed to be nice to web crawlers, but to interact with web browsers.

For your situation, if you look at the headers being sent with your Scrapy request, you should see something like:

    In [1]: request.headers
    Out[1]:
    {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Encoding': 'gzip,deflate',
     'Accept-Language': 'en',
     'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}

If you examine the headers sent by a request for the same page from your web browser, you might see something like:

**Request Headers**

    GET /blog/page/10/ HTTP/1.1
    Host: www.bornfitness.com
    Connection: keep-alive
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
    User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
    DNT: 1
    Referer: http://www.bornfitness.com/blog/page/11/
    Accept-Encoding: gzip, deflate, sdch
    Accept-Language: en-US,en;q=0.8
    Cookie: fealty_segment_registeronce=1; ... ... ...
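To make the comparison concrete, the two header sets above can be diffed with a few lines of plain Python. This is only an illustrative sketch: `diff_headers` is a hypothetical helper, not part of Scrapy, and the truncated Cookie header is left out. Headers such as Host are normally added by the HTTP client itself, so their absence from `request.headers` does not mean Scrapy never sends them.

```python
# Headers Scrapy reported for its request (copied from the shell output above).
scrapy_headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "User-Agent": "Scrapy/0.24.6 (+http://scrapy.org)",
}

# Headers the browser sent for the same page (Cookie omitted: it is truncated above).
browser_headers = {
    "Host": "www.bornfitness.com",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
    ),
    "DNT": "1",
    "Referer": "http://www.bornfitness.com/blog/page/11/",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "en-US,en;q=0.8",
}


def diff_headers(spider, browser):
    """Hypothetical helper: compare two header dicts case-insensitively.

    Returns (only_in_browser, changed): header names the spider did not list,
    and names present in both whose values differ.
    """
    spider_lc = {name.lower(): value for name, value in spider.items()}
    browser_lc = {name.lower(): value for name, value in browser.items()}
    only_in_browser = sorted(set(browser_lc) - set(spider_lc))
    changed = sorted(
        name
        for name in set(spider_lc) & set(browser_lc)
        if spider_lc[name] != browser_lc[name]
    )
    return only_in_browser, changed


only_in_browser, changed = diff_headers(scrapy_headers, browser_headers)
print("Only listed for the browser:", only_in_browser)
print("Sent by both, different value:", changed)
```

Any header that shows up in either list is a candidate for what the site is reacting to; the User-Agent is usually the first one worth testing.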

Try changing the User-Agent in your request. That should allow you to get around the redirect.
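One way to do that, as a minimal sketch: Scrapy takes its default User-Agent from the `USER_AGENT` setting, so it can be overridden in the project's `settings.py`. The value below simply reuses the Chrome string from the browser request shown earlier. That choice is an assumption, and whether this particular site then stops redirecting still has to be checked.

```python
# settings.py (the Scrapy project's settings module)

# Override Scrapy's default "Scrapy/<version> (+http://scrapy.org)" User-Agent
# with a browser-like one. The exact string is an assumption: it is the Chrome
# User-Agent from the browser request above, not a value the site requires.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
)
```

For a quick check without touching the project, the same setting can be passed to the shell with the `-s` option, for example `scrapy shell -s USER_AGENT="<browser user-agent string>" "http://www.bornfitness.com/blog/page/10/"`, and the log should then show whether the 301 redirect still happens.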

