Defining specific PHP curl options for specific web page retrieval


I'm trying to scrape a web page: the SiriusXMU "now playing" information. Here's the code I've got so far:

    $timeout = 60;
    $url     = 'http://www.siriusxm.com/siriusxmu';
    $agent   = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0';
    $referer = 'http://www.siriusxm.com/channellineup/';

    $header   = array();
    $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    //$header[] = "Keep-Alive: 300";
    //$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-US,en;q=0.5";

    $ch = curl_init();
    // The URL to fetch. This can also be set when initializing a session with curl_init().
    curl_setopt($ch, CURLOPT_URL, $url);
    // The contents of the "User-Agent: " header to be used in an HTTP request.
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    // An array of HTTP header fields to set, in the format
    // array('Content-type: text/plain', 'Content-length: 100').
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    // TRUE to include the header in the output.
    curl_setopt($ch, CURLOPT_HEADER, true);
    // The contents of the "Referer: " header to be used in an HTTP request.
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    // The contents of the "Accept-Encoding: " header. This enables decoding of the response.
    // Supported encodings are "identity", "deflate", and "gzip". If an empty string, "",
    // is set, a header containing all supported encoding types is sent.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    // TRUE to automatically set the Referer: field in requests that follow a Location: redirect.
    //curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    // TRUE to follow any "Location: " header that the server sends as part of the HTTP header
    // (note this is recursive; PHP will follow as many "Location: " headers as it is sent,
    // unless CURLOPT_MAXREDIRS is set).
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // The maximum number of seconds to allow curl functions to execute.
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    // FALSE to stop curl from verifying the peer's certificate. Alternate certificates to
    // verify against can be specified with the CURLOPT_CAINFO option, or a certificate
    // directory can be specified with the CURLOPT_CAPATH option.
    //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    // 1 to check the existence of a common name in the SSL peer certificate. 2 to check the
    // existence of a common name and also verify that it matches the hostname provided.
    // In production environments the value of this option should be kept at 2 (default value).
    //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    // TRUE to output verbose information. Writes output to STDERR, or to the file specified
    // using CURLOPT_STDERR.
    //curl_setopt($ch, CURLOPT_VERBOSE, true);
    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns the result on success, FALSE on failure.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER
    // option is set, it returns the result on success, FALSE on failure.
    $result = curl_exec($ch);
    curl_close($ch);

I've been studying the HTTP headers the browser sends that enable the web page's "on air" section to show what's playing. However, when I simulate those headers with curl, the "on air" section of the web page returns "Sorry, program information is not available for the selected platform."

The Firefox add-on HttpFox shows the following for the main page:

    00:00:03.904    0.163   1524    209 200 text/html   http://www.siriusxm.com/siriusxmu

    (Request-Line)  GET /siriusxmu HTTP/1.1
    Host            www.siriusxm.com
    User-Agent      Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
    Accept          text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language en-US,en;q=0.5
    Accept-Encoding gzip, deflate
    Referer         http://www.siriusxm.com/channellineup/
    Cookie          mmcore.tst=0.557; mmid=-318486443%7cbqaaaao2jyezegwaaa%3d%3d; mmcore.pd=111492824%7cbqaaaaobqjylgtmsdpt9evucaj3zfneyenjidwaaaiq4rsgcenjiaaaaap//////////abb3d3cuc2lyaxvzeg0uy29tahimagaaaaaaaaaaaad///////////////8aaaaaaaff; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-repeat; s_vnum=1435723200051%26vn%3d2; s_lastvisit=1434723660883; s_vi=[cs]v1|2ac1956485078c76-6000010e20030c67[ce]; mm_pc=%7b%22vehiclenewness%22%3a%22new%22%2c%22pc2%22%3a%22%22%7d; sxm_platform=xm; __utmv=1.|5=servicetype=xm=1; _hjuserid=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=ahr0cdovl3d3dy5zaxjpdxn4bs5jb20vc3ryzwftaw5n; __insp_norec_sess=true; _hjincludedinsample=1; __utmc=1; s_cc=true; sc_links=%5b%5bb%5d%5d; s_sq=%5b%5bb%5d%5d; s_sv_sid=797366592635; qsi_historysession=http%3a%2f%2fwww.siriusxm.com%2fstreaming~1434659533837%7chttp%3a%2f%2fwww.siriusxm.com%2fchannellineup%2f%23~1434659556190%7chttp%3a%2f%2fwww.siriusxm.com%2fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
    Connection      keep-alive

And the following when requesting the JavaScript for the "on air" part:

    00:00:05.293    1.186   1609    (137)   304 text/javascript http://www.siriusxm.com/static/app/js/sxm-channel-ontheair.js

    (Request-Line)    GET /static/app/js/sxm-channel-ontheair.js HTTP/1.1
    Host              www.siriusxm.com
    User-Agent        Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
    Accept            */*
    Accept-Language   en-US,en;q=0.5
    Accept-Encoding   gzip, deflate
    Referer           http://www.siriusxm.com/siriusxmu
    Cookie            mmcore.tst=0.557; mmid=-318486443%7cbqaaaao2jyezegwaaa%3d%3d; mmcore.pd=111492824%7cbqaaaaobqjylgtmsdpt9evucaj3zfneyenjidwaaaiq4rsgcenjiaaaaap//////////abb3d3cuc2lyaxvzeg0uy29tahimagaaaaaaaaaaaad///////////////8aaaaaaaff; mmcore.srv=cg5.usw; __utma=1.1327546933.1434659528.1434659528.1434723665.2; __utmz=1.1434659528.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); s_nr=1434723821271-repeat; s_vnum=1435723200051%26vn%3d2; s_lastvisit=1434723660883; s_vi=[cs]v1|2ac1956485078c76-6000010e20030c67[ce]; mm_pc=%7b%22vehiclenewness%22%3a%22new%22%2c%22pc2%22%3a%22%22%7d; sxm_platform=xm; __utmv=1.|5=servicetype=xm=1; _hjuserid=86ab277e-6c63-4dd1-975c-3424e32502e6; __insp_slim=1434659556045; __insp_wid=800165747; __insp_nv=true; __insp_ref=ahr0cdovl3d3dy5zaxjpdxn4bs5jb20vc3ryzwftaw5n; __insp_norec_sess=true; _hjincludedinsample=1; __utmc=1; s_cc=true; sc_links=%5b%5bb%5d%5d; s_sq=%5b%5bb%5d%5d; s_sv_sid=797366592635; qsi_historysession=http%3a%2f%2fwww.siriusxm.com%2fstreaming~1434659533837%7chttp%3a%2f%2fwww.siriusxm.com%2fchannellineup%2f%23~1434659556190%7chttp%3a%2f%2fwww.siriusxm.com%2fsiriusxmu~1434659575429; s_invisit=true; __utmb=1.8.10.1434723665
    Connection        keep-alive
    If-Modified-Since Fri, 22 May 2015 02:06:57 GMT
    If-None-Match     "ab841364-8501-516a21d70499b"
    Cache-Control     max-age=0

The web server seems to be determining that the curl request is invalid: it does not enable the "on air" JavaScript stuff and says "Sorry, program information is not available for the selected platform."

How can I get curl to work, emulate the browser, and have the web server return valid web page results?

It appears you'll need to run a client that has a JavaScript interpreter.

The HTML includes the following:

    <div id="on-the-air-unavailable"><p>Sorry, program information is not available for the selected platform.</p></div>

The JS includes the following (not together):

    $("#on-the-air-unavailable").hide();
    $("#on-the-air-unavailable").show();

To have the JavaScript interact with the HTML, you need to run them together.
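To see why curl alone keeps returning the notice, it can help to check that the placeholder element is already part of the static markup. The following is a minimal sketch, not the site's or the asker's code: `hasUnavailableNotice()` is a hypothetical helper, and the sample `$html` string is modeled on the snippet quoted above, standing in for the string `curl_exec()` would return.

```php
<?php
// Hypothetical helper (not from the original post): reports whether the
// placeholder <div> is present in the HTML exactly as the server delivered it.
function hasUnavailableNotice(string $html): bool
{
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely strict.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Look the element up by the id seen in the page source.
    return $xpath->query('//div[@id="on-the-air-unavailable"]')->length > 0;
}

// Sample markup modeled on the page source quoted above. In practice this
// would be the string returned by curl_exec().
$html = '<html><body><div id="on-the-air-unavailable">'
      . '<p>Sorry, program information is not available for the selected platform.</p>'
      . '</div></body></html>';

var_dump(hasUnavailableNotice($html));
```

If the helper reports `true` for the raw curl response, the notice is simply the page's default state. Only a client that actually executes the script can hide it, which no combination of request headers will do.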

There are headless HTTP clients that have JS interpreters, or browser automation tools such as Selenium, that you may be able to use.

