Post response embeds weird stuff in html code (PHP Language forum, PHP Programming Forums)
Hello there,
I'm really stumped... I'm fetching a web page with a script and parsing it. The problem is that the response has '8' and '1ff8' inserted in random places.

For example, I get things like

    8< tr1ff8>

or

    class=mytabl8 rowclass1ff8

Obviously my parsing doesn't work. I'm able to remove the 1ff8 with a regex but not the first '8'. The following is never true:

    preg_match("/.*8.*1ff8.*/",$page)

The response also prints this at the top of the page:

    HTTP/1.1 200
    Date: Thu, 24 Feb 2005 19:20:26 GMT
    Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
    Set-Cookie: JSESSIONID=9BAB933BDC5C23784D65084CF9967645; Path=/portal
    Connection: close
    Transfer-Encoding: chunked
    Content-Type: text/html;charset=ISO-8859-11ff8

This is my post:

    $mainpage = getpost(80,"english.montrealplus.ca","/portal/exploreSearch.do",
        "siteId=6&section=79&pageIndex=0&maxLinkPerPage=1000&maxPagePerSection=15&category=sportEventByType&subCategory=");

    function getpost($portnb,$host,$path,$data)
    {
        $fp = fsockopen ($host,$portnb);
        if (!$fp)
        {   return false;
        }
        else
        {   $response="";
            fputs($fp, "POST $path HTTP/1.1\r\n");
            fputs($fp, "Host: $host\r\n");
            fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
            fputs($fp, "Content-length: ".strlen($data)."\r\n");
            fputs($fp, "Connection: close\r\n\r\n");
            fputs($fp, $data);

            while(!feof($fp))
                $response.=fgets($fp, 1024);

            fclose($fp);
            return $response;
        }
    }

Another page I fetched had no such problem, and the response header displayed at the top of the page had a different charset:

    HTTP/1.1 200
    Date: Thu, 24 Feb 2005 19:22:21 GMT
    Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
    Set-Cookie: JSESSIONID=5C311056FED1528E46126B87D7425533; Path=/portal
    Connection: close
    Content-Type: text/html;charset=ISO-8859-1

So I tried adding that charset in my post (";charset=ISO-8859-1" after "Content-type: application/x-www-form-urlencoded") but with no success.
On 24 Feb 2005 11:28:09 -0800, myahact@yahoo.ca (zorro) wrote:
>I'm fetching a web page with a script and parsing it.
>There is a problem because the response inserts '8 1ff8' in random
>places.
>
>For example, I get things like
>8< tr1ff8>
>or
>class=mytabl8 rowclass1ff8
>
>Obviously my parsing doesn't work. I'm able to remove 1ff8 with regex
>but not the first '8'. This following is never true:
>preg_match("/.*8.*1ff8.*/",$page)
>
>
>the response also prints this at the top of the page:
>HTTP/1.1 200

OK, clue #1 - this is an HTTP/1.1 response.

>Date: Thu, 24 Feb 2005 19:20:26 GMTServer: Apache/1.3.23
>(Unix) (Red-Hat/Linux) mod_jk/1.2.4Set-Cookie:
>JSESSIONID=9BAB933BDC5C23784D65084CF9967645; Path=/portalConnection:
>closeTransfer-Encoding: chunked

Clue #2 - this is chunked encoded. HTTP/1.1 clients MUST be able to accept chunked encoding.

RFC2616 HTTP/1.1 sec 4.4 "Message Length", a few paragraphs down:

"
   All HTTP/1.1 applications that receive entities MUST accept the
   "chunked" transfer-coding (section 3.6), thus allowing this mechanism
   to be used for messages when the message length cannot be determined
   in advance.
"

>function getpost($portnb,$host,$path,$data)
>{
>  $fp = fsockopen ($host,$portnb);
>  if (!$fp)
>  { return false;
>  }
>  else
>  { $response="";
>    fputs($fp, "POST $path HTTP/1.1\r\n");

You're claiming you're an HTTP/1.1 client... but you're not...

>    while(!feof($fp))
>      $response.=fgets($fp, 1024);
>
>    fclose($fp);
>    return $response;

... because you're not handling Chunked encoding.

>  }
>}
>another page i fetched had no such problem and the response header
>displayed at the top of the page had a different charset:
>HTTP/1.1 200 Date: Thu, 24 Feb 2005 19:22:21 GMT Server: Apache/1.3.23
>(Unix) (Red-Hat/Linux) mod_jk/1.2.4 Set-Cookie:
>JSESSIONID=5C311056FED1528E46126B87D7425533; Path=/portal Connection:
>close Content-Type: text/html;charset=ISO-8859-1

The charset is a red herring; it's the transfer encoding that's tripping you up.
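To make the chunked point concrete: in chunked transfer-encoding, each piece of the body is preceded by a line giving its size in hexadecimal, so the stray '1ff8' (hex for 8184) and '8' in the page look like chunk-size lines rather than page content. The following is only a minimal sketch of a decoder, assuming the headers have already been split off the response and the whole body is in memory. The function name decode_chunked is made up for illustration, and trailers and error reporting are ignored.

```php
<?php
// Hypothetical helper (not from the thread): decode a chunked HTTP body.
// Assumes $body holds everything after the blank line that ends the headers.
function decode_chunked(string $body): string
{
    $decoded = '';
    $pos = 0;
    $len = strlen($body);

    while ($pos < $len) {
        // Each chunk starts with its size in hex, terminated by CRLF.
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            break; // truncated or malformed input: stop with what we have
        }

        // Drop any optional chunk extension after ';'.
        $sizeLine = substr($body, $pos, $eol - $pos);
        $hex = trim(explode(';', $sizeLine, 2)[0]);
        if ($hex === '' || !ctype_xdigit($hex)) {
            break; // not a valid chunk-size line
        }

        $size = (int) hexdec($hex);
        if ($size === 0) {
            break; // zero-size chunk marks the end; trailers are ignored here
        }

        // Copy the chunk data, then skip the CRLF that follows it.
        $decoded .= substr($body, $eol + 2, $size);
        $pos = $eol + 2 + $size + 2;
    }

    return $decoded;
}

// Two chunks ("Wiki" and "pedia") followed by the terminating zero chunk.
echo decode_chunked("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"), "\n"; // prints "Wikipedia"
```

In getpost() this would only apply when the response headers actually contain "Transfer-Encoding: chunked"; a response without that header should be left as it is.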
I think your options are:

(a) Don't claim you're an HTTP/1.1 client - use HTTP/1.0.
(b) Be an HTTP/1.1 client - implement Chunked transfer-encoding decoding.
(c) Use an HTTP/1.1 client library - cURL is a good bet as PHP has native support for it.

-- 
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
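As a rough illustration of options (a) and (c), here is a sketch under a few assumptions. The helper names build_http10_post and post_with_curl are invented for this example, and the cURL part assumes the curl extension is installed. For (a), the only change from the original getpost() is the protocol version in the request line, since a server is not supposed to send chunked responses to an HTTP/1.0 client.

```php
<?php
// Option (a): build the same POST request but declare HTTP/1.0.
// Hypothetical helper; the string it returns would be written to the socket.
function build_http10_post(string $host, string $path, string $data): string
{
    return "POST $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "Content-Type: application/x-www-form-urlencoded\r\n"
         . "Content-Length: " . strlen($data) . "\r\n"
         . "Connection: close\r\n\r\n"
         . $data;
}

// Option (c): hand the request to cURL, which decodes chunked responses itself.
// Hypothetical wrapper; requires ext/curl. Returns the body, or false on failure.
function post_with_curl(string $url, string $data)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);            // send a POST request
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);     // url-encoded form body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body instead of printing it
    return curl_exec($ch);
}

// Usage with the URL from the original post (not executed here):
// $page = post_with_curl("http://english.montrealplus.ca/portal/exploreSearch.do",
//                        "siteId=6&section=79&pageIndex=0");
```

With cURL the returned string is just the page body, so the status line and headers no longer need to be stripped before parsing.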
"Andy Hassall" <andy@andyh.co.uk> wrote in message news:pqes119itt7ibtvg7lpttpbu87ofq9ji1g@4ax.com...
> The charset is a red herring; it's the transfer encoding that's tripping you
> up. I think your options are:
>
> (a) Don't claim you're an HTTP/1.1 client - use HTTP/1.0.
> (b) Be an HTTP/1.1 client - implement Chunked transfer-encoding decoding.
> (c) Use an HTTP/1.1 client library - cURL is a good bet as PHP has native
> support for it.

I'm on a mission to convert people to using stream context, hence:
http://www.php.net/stream_context_create/.
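A minimal sketch of the stream-context approach, for comparison. The helper name post_context_options is made up for this example, and only a few of the original form fields are shown. The assumption here is that PHP's http:// stream wrapper takes care of the response framing, so the caller just receives the page body.

```php
<?php
// Hypothetical helper: options for a form POST through the http:// wrapper.
function post_context_options(string $data): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => $data,
        ],
    ];
}

// http_build_query() url-encodes the fields and joins them with '&'.
$data = http_build_query(['siteId' => 6, 'section' => 79, 'pageIndex' => 0]);
$context = stream_context_create(post_context_options($data));

// The actual fetch (not executed here) would then be a single call:
// $page = file_get_contents(
//     'http://english.montrealplus.ca/portal/exploreSearch.do', false, $context);
```

file_get_contents() returns false on failure, so the result should be checked before it is parsed.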