web page download question

This is a discussion on web page download question within the PHP General forums, part of the PHP Programming Forums category.


  #1
11-12-2007
David Calkins
web page download question

I'm attempting to "scrape" a web page to pull out some pertinent info.
The URL looks similar to the below.

http://www.someserver.com/user-info.xml?user=myusername

If I paste the above into my web browser, the page comes up and
displays the information. If I try "view source", I get an XML
document. Not an XHTML document, but a plain XML document with just
the data fields (no formatting info).

However, if I use PHP fopen() to read this same URL, I get the XHTML
file with all the formatting info, etc.

So, somehow the web browser (this happens in Firefox and IE7) is
showing something other than what I get with the plain fopen().

I'd like to get at the plain XML file with just the data fields as is
shown in the browser.

Any ideas how to do this?

Thanks!
  #2
11-13-2007
Chris
Re: [PHP] web page download question

David Calkins wrote:
> I'm attempting to "scrape" a web page to pull out some pertinent info.
> The URL looks similar to the below.
>
> http://www.someserver.com/user-info.xml?user=myusername
>
> If I paste the above into my web browser, the page comes up and
> displays the information. If I try "view source", I get an XML
> document. Not an XHTML document, but a plain XML document with just
> the data fields (no formatting info).
>
> However, if I use PHP fopen() to read this same URL, I get the XHTML
> file with all the formatting info, etc.
>
> So, somehow the web browser (this happens in Firefox and IE7) is
> showing something other than what I get with the plain fopen().
>
> I'd like to get at the plain XML file with just the data fields as is
> shown in the browser.


Ask the owner of the website. They may have some sort of browser
detection in there to show different things to search engines vs. IE/Firefox.
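
If the server does vary its response by client, one thing to try is sending browser-like request headers from PHP. The following is a minimal sketch, not a known fix for this site: the helper names are hypothetical, and the User-Agent and Accept values are illustrative assumptions that may or may not change what the server returns.

```php
<?php
// Hypothetical helper: stream-context options that imitate a browser request.
// The header values below are assumptions chosen for illustration only.
function browser_like_context_options(): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0',
            'header'     => "Accept: text/xml,application/xml;q=0.9,*/*;q=0.5\r\n",
            'timeout'    => 10,
        ],
    ];
}

// Fetch a URL using those options; returns the body, or false on failure.
function fetch_like_browser(string $url)
{
    $context = stream_context_create(browser_like_context_options());
    return @file_get_contents($url, false, $context);
}

// Usage with the placeholder URL from the original post:
// $body = fetch_like_browser('http://www.someserver.com/user-info.xml?user=myusername');
```

The same context can be passed as the fourth argument to fopen() if a stream handle is preferred over a string.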

--
Postgresql & php tutorials
http://www.designmagick.com/
  #3
11-13-2007
David Calkins
Re: [PHP] web page download question

It's a big company. They provide the page to help people look up the
info, but I don't think I could expect support from them on it. I was
thinking it was some standard sort of thing, like maybe the way you
issue the HTTP request or some way of decoding an alternate document
from the source, etc.

Also, IE is rendering the page graphically, i.e. when you view the
page in IE (or Firefox for that matter) it shows up with all the
graphics and formatting. But when you do view source, all you get is
the plain XML. Not XHTML with CSS references or something like that,
just plain, raw XML with the relevant fields and their data. So the
page IE is rendering does not match what it shows you when you do view
source. So somehow, there are 2 documents and I'm just missing how to
retrieve the other one.

I don't think it's a browser detection thing since IE shows both: the
graphically rendered, formatted page and the raw data XML.

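
Once the raw XML is in hand, pulling out the data fields is straightforward with SimpleXML. This is a sketch under stated assumptions: the thread never shows the real document, so the element names in the sample (user-info, name, score) and the extract_fields() helper are hypothetical.

```php
<?php
// Hypothetical sample document; the real field names are not given in the thread.
$raw = '<user-info>'
     . '<name>myusername</name>'
     . '<score>42</score>'
     . '</user-info>';

// Return the top-level child elements as a name => text array.
// Returns an empty array if the input is not well-formed XML.
function extract_fields(string $xml): array
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    $fields = [];
    foreach ($doc->children() as $child) {
        $fields[$child->getName()] = trim((string) $child);
    }
    return $fields;
}

print_r(extract_fields($raw));
```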

  #4
11-13-2007
Chris
Re: [PHP] web page download question

David Calkins wrote:
> It's a big company. They provide the page to help people look up the
> info, but I don't think I could expect support from them on it. I was
> thinking it was some standard sort of thing, like maybe the way you
> issue the HTTP request or some way of decoding an alternate document
> from the source, etc.
>
> Also, IE is rendering the page graphically, i.e. when you view the
> page in IE (or Firefox for that matter) it shows up with all the
> graphics and formatting. But when you do view source, all you get is
> the plain XML. Not XHTML with CSS references or something like that,
> just plain, raw XML with the relevant fields and their data. So the
> page IE is rendering does not match what it shows you when you do view
> source. So somehow, there are 2 documents and I'm just missing how to
> retrieve the other one.
>
> I don't think it's a browser detection thing since IE shows both: the
> graphically rendered, formatted page and the raw data XML.


Is there a DTD at the top?
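
The top of the document can also be checked programmatically. Besides a DOCTYPE, it may be worth looking for an xml-stylesheet processing instruction, since browsers use that to render plain XML through an XSLT or CSS stylesheet; whether this page has one is not known from the thread. The sketch below assumes the DOM extension is available; describe_prolog() and the sample document are hypothetical.

```php
<?php
// Report the DOCTYPE name (if any) and any xml-stylesheet processing
// instructions found at the top level of an XML document.
function describe_prolog(string $xml): array
{
    $result = ['doctype' => null, 'stylesheets' => []];
    libxml_use_internal_errors(true); // suppress parse warnings
    $doc = new DOMDocument();
    if ($xml === '' || !$doc->loadXML($xml)) {
        return $result;
    }
    if ($doc->doctype !== null) {
        $result['doctype'] = $doc->doctype->name;
    }
    foreach ($doc->childNodes as $node) {
        if ($node instanceof DOMProcessingInstruction && $node->target === 'xml-stylesheet') {
            $result['stylesheets'][] = $node->data;
        }
    }
    return $result;
}

// Hypothetical sample: an XML file that points the browser at an XSLT stylesheet.
$sample = '<?xml version="1.0"?>'
        . '<?xml-stylesheet type="text/xsl" href="user-info.xsl"?>'
        . '<user-info><name>myusername</name></user-info>';

print_r(describe_prolog($sample));
```

If a stylesheet reference does turn up, that would account for the browser showing a formatted page while view source shows only the raw data.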

--
Postgresql & php tutorials
http://www.designmagick.com/
  #5
11-13-2007
admin@buskirkgraphics.com
RE: [PHP] web page download question

http://www.catavitch.com

The script I have written does exactly that for predefined websites.
The content is pulled live, directly from the website listed.
The most important thing to remember: GET PERMISSION to "scrape", as you call it.

I can store the data if I want, or serve it up as in the example. I can even
display the entire website.

I do not use fopen().
Some may suggest wget.
I strongly disagree:
why waste space on stored files?

Try cURL.
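
A minimal cURL sketch along those lines, assuming the curl extension is installed. The function names are hypothetical, and the User-Agent and Accept values are illustrative assumptions rather than settings known to work for this site. The response is returned as a string, so nothing is written to disk.

```php
<?php
// Hypothetical helper: cURL options for a browser-like GET request.
function curl_options_for(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0',
        CURLOPT_HTTPHEADER     => ['Accept: text/xml,application/xml;q=0.9,*/*;q=0.5'],
    ];
}

// Fetch the URL; returns the body on HTTP 200, or false otherwise.
function fetch_with_curl(string $url)
{
    $ch = curl_init();
    curl_setopt_array($ch, curl_options_for($url));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $status === 200) ? $body : false;
}

// Usage with the placeholder URL from the original post:
// $xml = fetch_with_curl('http://www.someserver.com/user-info.xml?user=myusername');
```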



