python - Link Harvesting in Scrapy
I am both amazed and frustrated by Scrapy. It seems there is a lot of power under the hood, which makes for a steep learning curve. Apparently Scrapy can do a lot of what I used to program myself; the problem is figuring out how to make it do what I want.
For now, I am writing a simple link harvester. I want to export two files: one with internal links and their link text, and one with external links and their link text.
I have been trying the -o file.csv command, but it lumps each page's URLs into a single cell as a list, and it includes duplicates.
The alternative I have is to write my own code in 'parse': manually build a list of links, check whether each one already exists in the list before adding it, and manually parse each URL to see whether its domain is internal or external.
It seems like Scrapy should be able to do this with a few commands. Is there a built-in method for this?
Here's the code I'm working with. I commented out the title part because I think I need to make a different item object for those; I've abandoned that part for now.
    def parse_items(self, response):
        item = WebconnectItem()
        sel = Selector(response)
        items = []
        # item["title"] = sel.xpath('//title/text()').extract()
        # item["current_url"] = response.url
        item["link_url"] = sel.xpath('//a/@href').extract()
        item["link_text"] = sel.xpath('//a/text()').extract()
        items.append(item)
        return items
Scrapy has extensive documentation, and the tutorial is a good introduction.
It's built on top of Twisted, so you have to think in terms of asynchronous requests and responses, which is quite different from python-requests and BS4. python-requests blocks the thread when issuing HTTP requests. Scrapy does not: it lets you process responses while other requests may still be over the wire.
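A minimal sketch of the difference (the URLs and spider name here are placeholders, not anything from your project):

    import requests
    import scrapy

    # Blocking: each request must fully complete before the next one starts.
    for url in ["http://example.com/a", "http://example.com/b"]:
        html = requests.get(url).text

    class DemoSpider(scrapy.Spider):
        name = "demo"
        start_urls = ["http://example.com/a", "http://example.com/b"]

        def parse(self, response):
            # Called as each response arrives; other requests
            # may still be in flight at this point.
            self.log("got %s" % response.url)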
You can use BS4 within Scrapy callbacks (e.g. in the parse_items method).
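For example, a sketch that hands the raw response body to BeautifulSoup inside a callback:

    from bs4 import BeautifulSoup

    def parse_items(self, response):
        # response.body is the raw HTML; BS4 can parse it directly
        soup = BeautifulSoup(response.body)
        self.log("page title: %s" % soup.title.string)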
You're right that Scrapy will output one item per line. It does no deduplication of URLs, because to Scrapy items are just items; they only happen to contain URLs in your case. Scrapy does no deduplication of items based on what they contain; you'd have to instruct it to do so (with an item pipeline, for example).
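A minimal sketch of such a pipeline (the class name is made up here; you'd enable it through the ITEM_PIPELINES setting):

    from scrapy.exceptions import DropItem

    class DuplicateLinkPipeline(object):
        def __init__(self):
            self.seen = set()

        def process_item(self, item, spider):
            # Drop any item whose link_url was already seen in this crawl;
            # link_url is a list, so convert it to a hashable tuple first.
            key = tuple(item["link_url"])
            if key in self.seen:
                raise DropItem("duplicate link: %r" % (key,))
            self.seen.add(key)
            return item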
As for URLs being represented as lists in the link_url and link_text fields, that's because sel.xpath('//a/@href').extract() returns lists.
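To get one link per item instead, you can iterate over the <a> nodes and yield an item per link. A sketch (is_internal is a hypothetical extra field you'd add to your item to split internal from external links):

    from urlparse import urlparse  # urllib.parse on Python 3

    from scrapy.selector import Selector

    def parse_items(self, response):
        base_domain = urlparse(response.url).netloc
        for anchor in Selector(response).xpath('//a'):
            item = WebconnectItem()  # your item class from items.py
            item["link_url"] = anchor.xpath('@href').extract()
            item["link_text"] = anchor.xpath('text()').extract()
            # Relative URLs have an empty netloc, so count them as internal
            # (is_internal is a hypothetical field, not in your current item)
            href = item["link_url"][0] if item["link_url"] else ""
            item["is_internal"] = urlparse(href).netloc in ("", base_domain)
            yield item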
Scrapy 1.0 (to be released soon) adds an .extract_first() method for that case.
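With it, the per-field extraction in the sketch above becomes, for example:

    # Returns the first match as a string (or None), not a one-element list
    item["link_url"] = anchor.xpath('@href').extract_first()
    item["link_text"] = anchor.xpath('text()').extract_first()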