python - Link Harvesting in Scrapy -

August 15, 2011

I am both amazed and frustrated by Scrapy. It seems there is a lot of power under the hood, which makes for a steep learning curve. Apparently Scrapy can do everything I used to program myself, but the problem is figuring out how to make it do what I want.

For now, I am writing a simple link harvester. I want to export two files: one with internal links and their link text, and one with external links and their link text.

I have been trying the -o file.csv command, but it lumps each page's URLs into a single cell as a list, and it includes duplicates.

The alternative I have is to write my own code in 'parse': manually build a list of links, check whether each one already exists in the list before adding it, and manually parse each URL to see whether its domain is internal or external.
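The manual approach described above might look roughly like the following. This is only a sketch using the standard library: the function name `split_links` and the sample URLs are made up for illustration, and "internal" is assumed to mean "same host as the page the link was found on".

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, links):
    """Split (href, text) pairs into de-duplicated internal and external lists."""
    site = urlparse(page_url).netloc.lower()
    seen = set()
    internal, external = [], []
    for href, text in links:
        absolute = urljoin(page_url, href)  # resolve relative hrefs
        if absolute in seen:
            continue  # skip URLs that were already recorded
        seen.add(absolute)
        host = urlparse(absolute).netloc.lower()
        target = internal if host == site else external
        target.append((absolute, text.strip()))
    return internal, external


# Hypothetical sample data: the first two entries point at the same URL.
links = [
    ("/about", "About"),
    ("http://example.com/about", "About us"),
    ("http://other.org/", "Other"),
]
internal, external = split_links("http://example.com/", links)
```

Each of the two resulting lists could then be written to its own CSV file with the `csv` module.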

It seems Scrapy should be able to do this with a few commands. Is there a built-in method for this?

Here's the code I am working with. I commented out the title part because I think I need to make a separate item object for those. I've abandoned that part for now.

    def parse_items(self, response):
        item = WebconnectItem()
        sel = Selector(response)
        items = []
        # item["title"] = sel.xpath('//title/text()').extract()
        # item["current_url"] = response.url
        item["link_url"] = sel.xpath('//a/@href').extract()
        item["link_text"] = sel.xpath('//a/text()').extract()
        items.append(item)
        return items

Scrapy has extensive documentation, and the tutorial is a good introduction.

It's built on top of Twisted, so you have to think in terms of asynchronous requests and responses, which is quite different from what you usually do with python-requests and bs4. python-requests blocks your thread when issuing HTTP requests. Scrapy does not; it lets you process responses while other requests may be on the wire.

You can use bs4 in Scrapy callbacks (e.g. in your parse_items method).

You're right that Scrapy will output one item per line in its output. It will not deduplicate URLs, because items are just items to Scrapy; they happen to contain URLs in your case. Scrapy does no deduplication of items based on what they contain. You'd have to instruct it to do so (with an item pipeline, for example).
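As an illustration of the item pipeline idea, here is a minimal sketch of a pipeline that drops items whose URL was already seen. It assumes each item carries a single URL string in `link_url` (one item per link, not one list per page); the class name `DuplicateLinkPipeline` is made up for this example, and the import fallback only exists so the snippet can be run outside a Scrapy project.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicateLinkPipeline:
    """Drop any item whose link_url was already seen during this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item["link_url"]  # assumed to be a single string
        if url in self.seen_urls:
            raise DropItem("duplicate link: %s" % url)
        self.seen_urls.add(url)
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting of the project, for example with a hypothetical path such as `"myproject.pipelines.DuplicateLinkPipeline": 300`.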

As for the URLs being represented as lists in your link_url and link_text fields, that's because sel.xpath('//a/@href').extract() returns a list.
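One way to get one record per link instead of one list per page is to loop over the anchor elements and extract from each one separately. The helper below is a sketch, not part of Scrapy: the name `iter_links` is made up, and it only assumes the object passed in has a `.url` attribute and an `.xpath()` method whose results support `.extract()`, as Scrapy responses do.

```python
from urllib.parse import urljoin


def iter_links(response):
    """Yield one dict per <a href> element instead of one list per page."""
    for anchor in response.xpath("//a[@href]"):
        href = anchor.xpath("@href").extract()[0]
        # An anchor may contain several text nodes (or none), so join them.
        pieces = anchor.xpath(".//text()").extract()
        text = " ".join(p.strip() for p in pieces if p.strip())
        yield {
            "link_url": urljoin(response.url, href),  # make the URL absolute
            "link_text": text,
        }
```

In a spider callback this could be used as `for link in iter_links(response): yield link`, provided the Scrapy version in use accepts plain dicts as items; otherwise the values would need to be copied into an item object first.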

Scrapy 1.0 (soon to be released) adds an .extract_first() method that helps in this case.
