http - Download only the text from a webpage content in Python

July 15, 2014

How can I download only the text/HTML/JavaScript of a webpage in Python?

I'm trying to gather statistics on the text written by the authors of blogs. Since I need only the text, I want to increase the program's speed by avoiding the download of images, etc.

I'm able to separate the text from the HTML markup. My intention is to avoid downloading additional content in the webpage (like images, .swf files, or the like).

So far I use:

    # Python 2 (urllib2)
    import urllib2

    def download_html(url):
        user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req, timeout=60)
        # Default to '' so a missing Content-Type header does not raise a TypeError
        content_type = response.info().getheader('Content-Type', '')
        if 'text/html' in content_type:
            return response.read()

But I'm not sure if I'm doing the right thing (i.e. downloading the text only).
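
For what it's worth, urllib2.urlopen (and urllib.request.urlopen in Python 3) issues a single request for the given URL and returns that one response. It does not parse the HTML or follow references to images, stylesheets, or Flash files, so those are never downloaded unless the program requests them itself. The Content-Type check mainly guards against a URL that points directly at a non-HTML resource. A minimal Python 3 sketch of the same idea follows; the function name fetch_html and the User-Agent value are placeholders chosen for illustration, not part of the original code.

```python
import urllib.request

# Placeholder User-Agent string (an assumption for this sketch).
USER_AGENT = "Mozilla/5.0 (compatible; blog-text-stats)"


def fetch_html(url, timeout=60):
    """Return the decoded body if the response is HTML, otherwise None."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # Only this one resource is requested. Images, scripts and Flash
        # files linked from the page are not fetched by urlopen.
        if response.headers.get_content_type() != "text/html":
            return None  # skip reading the body of non-HTML responses
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html(url) on a blog post URL would return the page markup as a string, and None for a URL that serves an image or other non-HTML content.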

Python's BeautifulSoup is one of the best libraries for parsing webpages.

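    import urllib.request

    import bs4

    link = 'http://example.com/'  # placeholder URL; set this to the page to fetch

    # Pass the raw bytes to BeautifulSoup so it can detect the encoding;
    # wrapping them in str() would produce the "b'...'" representation instead.
    webpage = urllib.request.urlopen(link).read()
    soup = bs4.BeautifulSoup(webpage, 'html.parser')

    print(soup.get_text())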
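
If installing a third-party package is not an option, a rough equivalent can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than the BeautifulSoup approach above, and it is far less forgiving of malformed markup. The names TextExtractor and html_to_text are hypothetical, and skipping the contents of script and style elements is an assumption about what counts as text for the statistics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

The string passed to html_to_text would be the decoded page body, for example the result of response.read().decode(...) from a urlopen call.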
