Defining specific PHP cURL options for a specific web page retrieval
I'm trying to scrape a web page: the SiriusXMU "now playing" information. Here's the code I've got so far:
    $timeout = 60;
    $url     = 'http://www.siriusxm.com/siriusxmu';
    $agent   = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0';
    $referer = 'http://www.siriusxm.com/channellineup/';

    $header   = array();
    $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    //$header[] = "Keep-Alive: 300";
    //$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-US,en;q=0.5";

    $ch = curl_init();
    // The URL to fetch. Can also be set when initializing a session with curl_init().
    curl_setopt($ch, CURLOPT_URL, $url);
    // The contents of the "User-Agent: " header to be used in an HTTP request.
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    // An array of HTTP header fields to set, in the format
    // array('Content-type: text/plain', 'Content-length: 100').
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    // TRUE to include the header in the output.
    curl_setopt($ch, CURLOPT_HEADER, true);
    // The contents of the "Referer: " header to be used in an HTTP request.
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    // The contents of the "Accept-Encoding: " header. This enables decoding of the response.
    // Supported encodings are "identity", "deflate", and "gzip". If an empty string, "",
    // is set, a header containing all supported encoding types is sent.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    // TRUE to automatically set the Referer: field in requests where it follows a Location: redirect.
    //curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    // TRUE to follow any "Location: " header that the server sends as part of the HTTP header
    // (note this is recursive; PHP will follow as many "Location: " headers as it is sent,
    // unless CURLOPT_MAXREDIRS is set).
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // The maximum number of seconds to allow cURL functions to execute.
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    // FALSE to stop cURL from verifying the peer's certificate. Alternate certificates to verify
    // against can be specified with the CURLOPT_CAINFO option, or a certificate directory can be
    // specified with the CURLOPT_CAPATH option.
    //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    // 1 to check the existence of a common name in the SSL peer certificate. 2 to check the
    // existence of a common name and also verify that it matches the hostname provided. In
    // production environments the value of this option should be kept at 2 (default value).
    //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    // TRUE to output verbose information. Writes output to STDERR, or the file specified
    // using CURLOPT_STDERR.
    //curl_setopt($ch, CURLOPT_VERBOSE, true);
    // If the CURLOPT_RETURNTRANSFER option is set, curl_exec() will return the result on
    // success, FALSE on failure.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    //
    // Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER
    // option is set, it will return the result on success, FALSE on failure.
    $result = curl_exec($ch);
    curl_close($ch);
I've been studying the HTTP headers the browser sends that enable the web page's "on air" section to show what's playing. However, when I simulate the headers with cURL, the "on air" section of the web page returns "Sorry, program information is not available for the selected platform."
The Firefox add-on HttpFox shows the following for the main page:
    00:00:03.904  0.163  1524  209  200  text/html  http://www.siriusxm.com/siriusxmu
    (Request-Line)   GET /siriusxmu HTTP/1.1
    Host             www.siriusxm.com
    User-Agent       Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
    Accept           text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language  en-US,en;q=0.5
    Accept-Encoding  gzip, deflate
    Referer          http://www.siriusxm.com/channellineup/
    Cookie           mmcore.tst=0.557; mmid=-318486443%7cbqaaaao2jyezegwaaa%3d%3d; mmcore.pd=111492824%7cbqaaaaobqjylgtmsdpt9evucaj3zfneyenjidwaaaiq4rsgcenjiaaaaap//////////abb3d3cuc2lyaxvzeg0uy29tahimagaaaaaaaaaaaad///////////////8aaaaaaaff; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-repeat; s_vnum=1435723200051%26vn%3d2; s_lastvisit=1434723660883; s_vi=[cs]v1|2ac1956485078c76-6000010e20030c67[ce]; mm_pc=%7b%22vehiclenewness%22%3a%22new%22%2c%22pc2%22%3a%22%22%7d; sxm_platform=xm; __utmv=1.|5=servicetype=xm=1; _hjuserid=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=ahr0cdovl3d3dy5zaxjpdxn4bs5jb20vc3ryzwftaw5n; __insp_norec_sess=true; _hjincludedinsample=1; __utmc=1; s_cc=true; sc_links=%5b%5bb%5d%5d; s_sq=%5b%5bb%5d%5d; s_sv_sid=797366592635; qsi_historysession=http%3a%2f%2fwww.siriusxm.com%2fstreaming~1434659533837%7chttp%3a%2f%2fwww.siriusxm.com%2fchannellineup%2f%23~1434659556190%7chttp%3a%2f%2fwww.siriusxm.com%2fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
    Connection       keep-alive
And the following when requesting the JavaScript for the "on air" part:
    00:00:05.293  1.186  1609  (137)  304  text/javascript  http://www.siriusxm.com/static/app/js/sxm-channel-ontheair.js
    (Request-Line)     GET /static/app/js/sxm-channel-ontheair.js HTTP/1.1
    Host               www.siriusxm.com
    User-Agent         Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
    Accept             */*
    Accept-Language    en-US,en;q=0.5
    Accept-Encoding    gzip, deflate
    Referer            http://www.siriusxm.com/siriusxmu
    Cookie             mmcore.tst=0.557; mmid=-318486443%7cbqaaaao2jyezegwaaa%3d%3d; mmcore.pd=111492824%7cbqaaaaobqjylgtmsdpt9evucaj3zfneyenjidwaaaiq4rsgcenjiaaaaap//////////abb3d3cuc2lyaxvzeg0uy29tahimagaaaaaaaaaaaad///////////////8aaaaaaaff; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-repeat; s_vnum=1435723200051%26vn%3d2; s_lastvisit=1434723660883; s_vi=[cs]v1|2ac1956485078c76-6000010e20030c67[ce]; mm_pc=%7b%22vehiclenewness%22%3a%22new%22%2c%22pc2%22%3a%22%22%7d; sxm_platform=xm; __utmv=1.|5=servicetype=xm=1; _hjuserid=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=ahr0cdovl3d3dy5zaxjpdxn4bs5jb20vc3ryzwftaw5n; __insp_norec_sess=true; _hjincludedinsample=1; __utmc=1; s_cc=true; sc_links=%5b%5bb%5d%5d; s_sq=%5b%5bb%5d%5d; s_sv_sid=797366592635; qsi_historysession=http%3a%2f%2fwww.siriusxm.com%2fstreaming~1434659533837%7chttp%3a%2f%2fwww.siriusxm.com%2fchannellineup%2f%23~1434659556190%7chttp%3a%2f%2fwww.siriusxm.com%2fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
    Connection         keep-alive
    If-Modified-Since  Fri, 22 May 2015 02:06:57 GMT
    If-None-Match      "ab841364-8501-516a21d70499b"
    Cache-Control      max-age=0
The web server seems to be determining that this is an invalid cURL request; it is not enabling the "on air" JavaScript content and says "Sorry, program information is not available for the selected platform."
How can I get cURL to work, emulate the browser, and get valid web page results back from the web server?
It appears you'll need to run a client that has a JavaScript interpreter.
The HTML includes the following:
    <div id="on-the-air-unavailable"><p>Sorry, program information is not available for the selected platform.</p></div>
The JS includes the following (not together):
    $("#on-the-air-unavailable").hide();
    $("#on-the-air-unavailable").show();
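To have the JavaScript interact with the HTML, you need to run them together. cURL only downloads the markup and never executes the page's scripts, so whatever the script would show or hide stays in its initial state.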
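As a rough illustration of that point, the PHP sketch below parses a small HTML string standing in for what `curl_exec()` returns and checks that the fallback `div` is already part of the static markup. The sample `$html` and the helper name `textOfElementById` are assumptions made up for this example; only the `on-the-air-unavailable` id comes from the page. It makes no network request and says nothing about how the site decides which state to show.

```php
<?php
// Hypothetical stand-in for the body returned by curl_exec(); only the
// div id is taken from the real page, the rest is made up for the sketch.
$html = '<html><body>'
      . '<div id="on-the-air-unavailable"><p>Sorry, program information '
      . 'is not available for the selected platform.</p></div>'
      . '</body></html>';

/**
 * Return the trimmed text of the element with the given id,
 * or null when no such element exists in the markup.
 */
function textOfElementById(string $html, string $id): ?string
{
    // Collect libxml parse warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$message = textOfElementById($html, 'on-the-air-unavailable');
if ($message === null) {
    echo "Fallback div not found in the static HTML\n";
} else {
    // No script has run, so nothing has called .hide() on this element.
    echo "Static HTML already contains: $message\n";
}
```

If the same check against the real `$result` finds the message, that would be consistent with the text being present in every response and merely hidden by the script in a browser, rather than the server rejecting the cURL request.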
There are headless HTTP clients that have JS interpreters, or browser automation tools such as Selenium, that you may be able to use.