Post response embeds weird stuff in html code

#1  zorro  (02-24-2005)

Hello there,
I'm really stumped...

I'm fetching a web page with a script and parsing it.
There is a problem because the response inserts '8 1ff8' in random
places.

For example, I get things like
8< tr1ff8>
or
class=mytabl8 rowclass1ff8

Obviously my parsing doesn't work. I'm able to remove 1ff8 with a regex
but not the first '8'. The following never matches:
preg_match("/.*8.*1ff8.*/",$page)


The response also prints this at the top of the page:
HTTP/1.1 200
Date: Thu, 24 Feb 2005 19:20:26 GMT
Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
Set-Cookie: JSESSIONID=9BAB933BDC5C23784D65084CF9967645; Path=/portal
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html;charset=ISO-8859-11ff8

This is my post:

$mainpage = getpost(80, "english.montrealplus.ca", "/portal/exploreSearch.do", "siteId=6&section=79&pageIndex=0&maxLinkPerPage=1000&maxPagePerSection=15&category=sportEventByType&subCategory=");

function getpost($portnb, $host, $path, $data)
{
    $fp = fsockopen($host, $portnb);
    if (!$fp) {
        return false;
    }

    $response = "";
    fputs($fp, "POST $path HTTP/1.1\r\n");
    fputs($fp, "Host: $host\r\n");
    fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
    fputs($fp, "Content-length: " . strlen($data) . "\r\n");
    fputs($fp, "Connection: close\r\n\r\n");
    fputs($fp, $data);

    // Read the raw response (headers and body) until the server closes the connection.
    while (!feof($fp)) {
        $response .= fgets($fp, 1024);
    }

    fclose($fp);
    return $response;
}

Another page I fetched had no such problem, and the response header
displayed at the top of the page had a different charset:
HTTP/1.1 200
Date: Thu, 24 Feb 2005 19:22:21 GMT
Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
Set-Cookie: JSESSIONID=5C311056FED1528E46126B87D7425533; Path=/portal
Connection: close
Content-Type: text/html;charset=ISO-8859-1



So I tried adding that charset to my POST (";charset=ISO-8859-1"
after "Content-type: application/x-www-form-urlencoded"), but with no
success.

#2  Andy Hassall  (02-24-2005)

On 24 Feb 2005 11:28:09 -0800, myahact@yahoo.ca (zorro) wrote:

>I'm fetching a web page with a script and parsing it.
>There is a problem because the response inserts '8 1ff8' in random
>places.
>
>For example, I get things like
>8< tr1ff8>
>or
>class=mytabl8 rowclass1ff8
>
>Obviously my parsing doesn't work. I'm able to remove 1ff8 with regex
>but not the first '8'. This following is never true:
>preg_match("/.*8.*1ff8.*/",$page)
>
>
>the response also prints this at the top of the page:
>HTTP/1.1 200


OK, clue #1 - this is an HTTP/1.1 response.

>Date: Thu, 24 Feb 2005 19:20:26 GMT
>Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
>Set-Cookie: JSESSIONID=9BAB933BDC5C23784D65084CF9967645; Path=/portal
>Connection: close
>Transfer-Encoding: chunked


Clue #2 - the response uses chunked transfer-encoding. HTTP/1.1 clients MUST be
able to accept chunked encoding.

RFC 2616 (HTTP/1.1), section 4.4 "Message Length", a few paragraphs down:
"
All HTTP/1.1 applications that receive entities MUST accept the
"chunked" transfer-coding (section 3.6), thus allowing this mechanism
to be used for messages when the message length cannot be determined
in advance.
"

>function getpost($portnb,$host,$path,$data)
>{
> $fp = fsockopen ($host,$portnb);
> if (!$fp)
> { return false;
> }
> else
> { $response="";
> fputs($fp, "POST $path HTTP/1.1\r\n");


You're claiming you're an HTTP/1.1 client... but you're not...

> while(!feof($fp))
> $response.=fgets($fp, 1024);
>
> fclose($fp);
> return $response;


... because you're not handling Chunked encoding.
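
As an aside on where the stray characters come from: in chunked transfer-coding (RFC 2616 section 3.6.1) each chunk of the body is preceded by a line giving its size in hexadecimal, so values like '1ff8' and '8' turning up in the middle of the HTML are consistent with chunk-size lines rather than page content. A minimal sketch of how those values read as byte counts (the list of sizes is simply taken from the examples in the question):

```php
<?php
// Chunk sizes in a chunked response are hexadecimal byte counts.
$sizes = array('1ff8', '8');
foreach ($sizes as $hex) {
    printf("%s hex = %d bytes\n", $hex, hexdec($hex));
}
// prints:
// 1ff8 hex = 8184 bytes
// 8 hex = 8 bytes
```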

> }
>}
>another page i fetched had no such problem and the response header
>displayed at the top of the page had a different charset:
>HTTP/1.1 200
>Date: Thu, 24 Feb 2005 19:22:21 GMT
>Server: Apache/1.3.23 (Unix) (Red-Hat/Linux) mod_jk/1.2.4
>Set-Cookie: JSESSIONID=5C311056FED1528E46126B87D7425533; Path=/portal
>Connection: close
>Content-Type: text/html;charset=ISO-8859-1


The charset is a red herring; it's the transfer encoding that's tripping you
up. I think your options are:

(a) Don't claim you're an HTTP/1.1 client - use HTTP/1.0.
(b) Be an HTTP/1.1 client - implement Chunked transfer-encoding decoding.
(c) Use an HTTP/1.1 client library - cURL is a good bet as PHP has native
support for it.
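
For option (c), a rough sketch of what the request might look like with PHP's cURL extension, which decodes chunked responses itself so the body comes back without the chunk-size lines. The function name getpost_curl is made up for illustration, and the sketch assumes the cURL extension is installed:

```php
<?php
// Hypothetical cURL-based replacement for getpost(); the name is illustrative only.
function getpost_curl($url, $data)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);     // $data is an already URL-encoded string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    return curl_exec($ch);                           // response body, or false on failure
}

// Usage, with the host, path and data from the original post:
// $mainpage = getpost_curl(
//     "http://english.montrealplus.ca/portal/exploreSearch.do",
//     "siteId=6&section=79&pageIndex=0&maxLinkPerPage=1000&maxPagePerSection=15&category=sportEventByType&subCategory="
// );
```

Note that this returns only the body; the response headers are not included unless CURLOPT_HEADER is also set.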

--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

#3  John Dunlop  (02-24-2005)

Andy Hassall wrote:

> (b) Be an HTTP/1.1 client - implement Chunked transfer-encoding decoding.


Pseudo-code for which is given in appendix 19.4.6.
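
A minimal PHP sketch along the lines of that pseudo-code, for anyone who wants to go with option (b). It is not a complete implementation: the function name decode_chunked is made up here, it expects only the message body (everything after the blank line that ends the headers), and it ignores chunk extensions and any trailer headers:

```php
<?php
// Sketch of a chunked transfer-coding decoder (hypothetical helper name).
// $body is the raw message body, without the response headers.
function decode_chunked($body)
{
    $decoded = '';
    $pos = 0;
    $len = strlen($body);

    while ($pos < $len) {
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            break; // malformed or truncated input
        }

        // Chunk-size line: hex length, optionally followed by ";extension".
        $line  = substr($body, $pos, $eol - $pos);
        $parts = explode(';', $line, 2);
        $size  = hexdec(trim($parts[0]));
        if ($size == 0) {
            break; // last chunk reached; trailer headers are ignored
        }

        $decoded .= substr($body, $eol + 2, $size);
        $pos = $eol + 2 + $size + 2; // skip the chunk data and its trailing CRLF
    }

    return $decoded;
}

// Two chunks of 8 and 5 bytes, then the terminating zero-size chunk.
echo decode_chunked("8\r\n<tr><td>\r\n5\r\nhello\r\n0\r\n\r\n"); // prints <tr><td>hello
```

This should only be applied when the response headers actually contain "Transfer-Encoding: chunked"; an unchunked body would be mangled by it.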

--
Jock

#4  Chung Leong  (02-26-2005)

"Andy Hassall" <andy@andyh.co.uk> wrote in message
news:pqes119itt7ibtvg7lpttpbu87ofq9ji1g@4ax.com...
> The charset is a red herring; it's the transfer encoding that's tripping you
> up. I think your options are:
>
> (a) Don't claim you're an HTTP/1.1 client - use HTTP/1.0.
> (b) Be an HTTP/1.1 client - implement Chunked transfer-encoding decoding.
> (c) Use an HTTP/1.1 client library - cURL is a good bet as PHP has native
> support for it.


I'm on a mission to convert people to using stream context, hence:

http://www.php.net/stream_context_create/.
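
A rough sketch of what the stream context approach could look like for this POST. The helper names build_post_context and getpost_stream are made up for illustration; the idea is that the http:// stream wrapper handles the HTTP exchange, so the script never touches the raw socket data:

```php
<?php
// Hypothetical helper: build a stream context describing a form POST.
function build_post_context($data)
{
    $options = array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => $data, // already URL-encoded request body
        ),
    );
    return stream_context_create($options);
}

// Hypothetical helper: send the POST and return the response body (false on failure).
function getpost_stream($url, $data)
{
    return file_get_contents($url, false, build_post_context($data));
}

// Usage, with the host, path and data from the original post:
// $mainpage = getpost_stream(
//     "http://english.montrealplus.ca/portal/exploreSearch.do",
//     "siteId=6&section=79&pageIndex=0&maxLinkPerPage=1000&maxPagePerSection=15&category=sportEventByType&subCategory="
// );
```

This needs allow_url_fopen to be enabled, and how the wrapper deals with HTTP versions and chunked responses depends on the PHP version, so it is worth checking the result against the manual for the version in use.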


