http - Download only the text from a webpage's content in Python


How can I download only the text/HTML/JavaScript of a webpage in Python?

I'm trying to gather statistics on the text written by the authors of blogs. I only need the text, and I want to increase my program's speed by avoiding the download of images, etc.

I'm able to separate the text from the HTML markup. My intention is to avoid downloading additional content in the webpage (like images, .swf files, or the like).

So far I use:

    import urllib2

    def fetch_html(url):
        # Browser-like User-Agent string
        user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req, timeout=60)
        content_type = response.info().getheader('Content-Type')
        # Read the body only when the server reports an HTML page
        if 'text/html' in content_type:
            return response.read()

But I'm not sure if I'm doing the right thing (i.e. downloading the text only).
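
For what it's worth, urllib2.urlopen (and urllib.request.urlopen in Python 3) requests only the single URL it is given. It does not parse the HTML or fetch the images, scripts, or Flash files that the page refers to, so the check above is mainly a guard against URLs that point directly at non-text resources. A minimal Python 3 sketch of the same check follows. The names fetch_if_text and TEXT_TYPES are hypothetical, chosen for illustration, and the list of accepted types is an assumption to adjust as needed.

```python
import urllib.request

# Assumed set of "textual" media types; extend or trim as needed.
TEXT_TYPES = ("text/html", "text/plain", "application/xhtml+xml")


def fetch_if_text(url, timeout=60):
    """Return the decoded body of `url` if it is textual, else None."""
    # Some sites reject the default urllib User-Agent, so send a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # get_content_type() drops parameters such as "; charset=utf-8".
        # If the header is missing it falls back to "text/plain".
        if response.headers.get_content_type() not in TEXT_TYPES:
            return None  # the body is never read
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_if_text on a blog post URL should return the page's HTML as a string, while a URL that serves an image returns None without reading the body.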

In Python, BeautifulSoup is one of the best libraries for parsing webpages:

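    import urllib.request

    import bs4

    # `link` is the URL of the page to fetch
    webpage = urllib.request.urlopen(link).read()
    # Pass the raw bytes so BeautifulSoup can detect the encoding itself
    soup = bs4.BeautifulSoup(webpage, 'html.parser')

    print(soup.get_text())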
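
If installing BeautifulSoup is not an option, roughly the same text extraction can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answer above uses. The names TextExtractor and visible_text are hypothetical, and skipping the contents of script and style elements is an assumption about what counts as "text" here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Compared with get_text(), this sketch is less forgiving of badly broken markup, so BeautifulSoup is likely the safer choice for arbitrary blogs.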
