http - Download only the text from a webpage's content in Python
How can I download only the text/HTML/JavaScript of a webpage in Python?
I'm trying to compute statistics on the text written by the authors of blogs. Since I only need the text, I want to increase the program's speed by avoiding the download of images, etc.
I'm able to separate the text from the HTML markup. My intention is to avoid downloading additional content in the webpage (like images, .swf files, or the like).
So far I use:
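import urllib2

def get_html(url):  # enclosing function implied by the return below
    # Send a browser-like User-Agent, since some sites reject the default one
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    # Read the body only if the server reports an HTML document
    content_type = response.info().getheader('Content-Type')
    if content_type and 'text/html' in content_type:
        return response.read()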
But I'm not sure if I'm doing the right thing (i.e. downloading the text only).
Python's BeautifulSoup is one of the best tools for parsing webpages.
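import bs4
import urllib.request

# link holds the URL of the page to fetch
webpage = urllib.request.urlopen(link).read()
# Pass the raw bytes so BeautifulSoup can detect the encoding;
# wrapping them in str() would produce a "b'...'" literal instead of the markup
soup = bs4.BeautifulSoup(webpage, 'html.parser')
print(soup.get_text())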
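As for whether only the text gets downloaded: urlopen retrieves just the single resource at the given URL. Images, .swf files, and other assets referenced by the page are separate requests that only a browser would make, so nothing extra is transferred. Below is a minimal Python 3 sketch of the whole flow under a few assumptions: it swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra installs, and the names fetch_html, extract_text, and TextExtractor are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url, timeout=60):
    """Return the decoded HTML at url, or None if it is not text/html.

    Hypothetical helper: only the document at url is requested; assets
    that the page references are never fetched.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        if response.headers.get_content_type() != "text/html":
            return None
        # Assumption: fall back to UTF-8 when the server names no charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as extract_text(fetch_html(link)) would then yield the page text when the response is HTML. The Content-Type check happens before read(), so the body of a non-HTML response is not read by this code.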