http - Download only the text from a webpage content in Python


How can I download only the text/HTML/JavaScript of a webpage in Python?

I'm trying to compute statistics on text written by the authors of blogs. Since I only need the text, I want to speed up my program by avoiding the download of images, etc.

I'm able to separate the text from the HTML markup. My intention is to avoid downloading additional content in the webpage (like images, .swf files or the like).

So far I use:

    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    content_type = response.info().getheader('Content-Type')
    # Read the body only when the server reports an HTML page
    if 'text/html' in content_type:
        return response.read()

But I'm not sure if I'm doing the right thing (i.e. downloading the text only).
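For what it's worth, `urlopen` requests only the single resource at the given URL. It does not parse the HTML or follow `<img>` and `<embed>` references, so images and .swf files are not downloaded unless you request them yourself. Below is a minimal Python 3 sketch of the same Content-Type check. The names `is_html`, `fetch_html` and the `USER_AGENT` value are hypothetical, chosen only for illustration.

```python
import urllib.request

# Assumed placeholder value; substitute whatever User-Agent string you need.
USER_AGENT = "Mozilla/5.0 (compatible; text-stats-bot)"


def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_html(url, timeout=60):
    """Return the decoded page body, or None if the resource is not HTML."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        if not is_html(response.headers.get("Content-Type")):
            return None  # read() is never called for non-HTML resources
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html(url)` on a blog post URL would return the page markup as a string, and `None` for a URL that points straight at an image or other non-HTML resource.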

Python's BeautifulSoup is one of the best libraries for parsing webpages.

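    import urllib.request

    import bs4

    # link is the URL of the page to parse
    with urllib.request.urlopen(link) as response:
        webpage = response.read()  # raw bytes; BeautifulSoup detects the encoding

    # Name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = bs4.BeautifulSoup(webpage, 'html.parser')

    print(soup.get_text())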
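If you would rather not install a third-party package, roughly the same text extraction can be sketched with the standard library's `html.parser`. This is a swapped-in alternative, not BeautifulSoup itself, and the names `TextExtractor` and `html_to_text` are made up for this example. It assumes you already have the page as a decoded string and that skipping the contents of `<script>` and `<style>` is what you want for text statistics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `html_to_text("<p>Hello <b>world</b></p>")` gives `"Hello world"`. A real parser library will cope better with badly broken markup, so treat this as a starting point only.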
