This is a discussion on web page download question within the PHP General forums, part of the PHP Programming Forums category.
I'm attempting to "scrape" a web page to pull out some pertinent info.
The URL looks similar to the below.

http://www.someserver.com/user-info.xml?user=myusername

If I paste the above into my web browser, the page comes up and displays the information. If I try "view source", I get an XML document. Not an XHTML document, but a plain XML document with just the data fields (no formatting info).

However, if I use PHP fopen() to read this same URL, I get the XHTML file with all the formatting info, etc.

So, somehow the web browser (this happens in Firefox and IE7) is showing something other than what I get with the plain fopen().

I'd like to get at the plain XML file with just the data fields as is shown in the browser.

Any ideas how to do this?

Thanks!
David Calkins wrote:
> [...]
> I'd like to get at the plain XML file with just the data fields as is
> shown in the browser.

Ask the owner of the website. They may have some sort of browser detection in there to show different things to search engines vs. IE/Firefox.

--
Postgresql & php tutorials
http://www.designmagick.com/
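
One way to test the browser-detection theory is to send the same request with different User-Agent strings and compare the responses. The following is only a minimal sketch: the function names and the User-Agent value are made up for illustration, and the URL is the placeholder from the question, not a real endpoint.

```php
<?php
// Sketch: fetch a URL while presenting a chosen User-Agent string.
// buildBrowserLikeOptions() and fetchAs() are hypothetical helper names.

function buildBrowserLikeOptions(string $userAgent): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds
        ],
    ];
}

function fetchAs(string $url, string $userAgent): string
{
    $context = stream_context_create(buildBrowserLikeOptions($userAgent));
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $body;
}

// Usage (placeholder URL and illustrative User-Agent values):
// $url       = 'http://www.someserver.com/user-info.xml?user=myusername';
// $asBrowser = fetchAs($url, 'Mozilla/5.0 (compatible; ExampleBrowser/1.0)');
// $asScript  = fetchAs($url, 'PHP');
// var_dump($asBrowser === $asScript); // false would hint at User-Agent sniffing
```

If both responses are identical, the difference probably comes from something other than the User-Agent header.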
It's a big company. They provide the page to help people look up the info, but I don't think I could expect support from them on it. I was thinking it was some standard sort of thing, like maybe the way you issue the HTTP request or some way of decoding an alternate document from the source, etc.

Also, IE is rendering the page graphically. That is, when you view the page in IE (or Firefox for that matter) it shows up with all the graphics and formatting. But when you do view source, all you get is the plain XML. Not XHTML with CSS references or something like that, just plain, raw XML with the relevant fields and their data. So the page IE is rendering does not match what it shows you when you do view source. So somehow there are two documents, and I'm just missing how to retrieve the other one.

I don't think it's a browser detection thing since IE shows both: the graphically rendered, formatted page and the raw data XML.

On Nov 12, 2007 8:23 PM, Chris <dmagick@gmail.com> wrote:
> Ask the owner of the website. They may have some sort of browser
> detection in there to show different things to search engines vs ie/firefox.
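
To check whether "the way you issue the HTTP request" matters, one option is to vary the Accept header and look at the Content-Type the server answers with. This is a rough sketch, not a confirmed fix: probe() and contentTypeOf() are hypothetical names, and whether this particular server negotiates on Accept is an assumption.

```php
<?php
// Sketch: open a URL with fopen() and a custom Accept header, then report
// the Content-Type of the response alongside the body.

function contentTypeOf(array $headerLines): ?string
{
    $found = null;
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so compare with stripos().
        if (stripos($line, 'Content-Type:') === 0) {
            $found = trim(substr($line, strlen('Content-Type:')));
        }
    }
    return $found; // the last match wins, e.g. after a redirect
}

function probe(string $url, string $accept): array
{
    $context = stream_context_create([
        'http' => ['header' => "Accept: $accept\r\n"],
    ]);
    $fp = fopen($url, 'r', false, $context);
    if ($fp === false) {
        throw new RuntimeException("Could not open $url");
    }
    // For the http wrapper, wrapper_data holds the raw response header lines.
    $meta = stream_get_meta_data($fp);
    $body = stream_get_contents($fp);
    fclose($fp);

    return [
        'type' => contentTypeOf($meta['wrapper_data'] ?? []),
        'body' => $body,
    ];
}

// Usage (placeholder URL from the thread):
// $url = 'http://www.someserver.com/user-info.xml?user=myusername';
// print_r(probe($url, 'text/xml, application/xml')['type']);
// print_r(probe($url, 'text/html')['type']);
```

Different Content-Type values for the two calls would suggest the server picks the document based on the request headers.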
David Calkins wrote:
> [...]
> I don't think it's a browser detection thing since IE shows both. The
> graphically rendered, formatted page, and the raw data XML.

Is there a DTD at the top?

--
Postgresql & php tutorials
http://www.designmagick.com/
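
Answering the DTD question means looking at the first few lines of the document. The sketch below checks the prolog for a DOCTYPE declaration and, as an extra guess not raised in the thread, for an xml-stylesheet processing instruction, since browsers can apply a referenced stylesheet to raw XML before displaying it. inspectProlog() is a hypothetical helper name.

```php
<?php
// Sketch: report what the start of a document declares about itself.

function inspectProlog(string $document): array
{
    // Only the beginning of the document matters for the prolog.
    $head = substr($document, 0, 2048);

    // Root element name from a DOCTYPE declaration, if any.
    $hasDoctype = preg_match('/<!DOCTYPE\s+([^\s>\[]+)/i', $head, $d) === 1;

    // href of an xml-stylesheet processing instruction, if any.
    $hasStylesheet = preg_match(
        '/<\?xml-stylesheet\b[^?]*href\s*=\s*["\']([^"\']+)["\']/i',
        $head,
        $s
    ) === 1;

    return [
        'doctype_root' => $hasDoctype ? $d[1] : null,
        'stylesheet'   => $hasStylesheet ? $s[1] : null,
    ];
}

// Usage: paste the "view source" text into a file and inspect it.
// print_r(inspectProlog(file_get_contents('view-source.xml')));
```

A non-null stylesheet entry would explain why the browser renders a formatted page while "view source" shows only the data fields.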
http://www.catavitch.com

The script I have written actually does that for predefined websites. The content is pulled live, directly from the website listed.

The MOST important thing to remember: "GET PERMISSION" to scrape, as you call it.

I can store the data if I want, or dish it up as in the example. I can even display the entire website.

I do not use fopen(). Some may suggest wget; I strongly disagree. Why waste the space on stored files?

Try curl.

-----Original Message-----
From: David Calkins [mailto:coder1024@gmail.com]
Sent: Monday, November 12, 2007 9:40 AM
To: php-general@lists.php.net
Subject: [php] web page download question

I'm attempting to "scrape" a web page to pull out some pertinent info.
[...]

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
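
The "try curl" suggestion could look roughly like the following with PHP's cURL extension. This is a sketch under assumptions: the helper names, the User-Agent value and the Accept header are illustrative, and nothing in the thread confirms which headers this server reacts to.

```php
<?php
// Sketch: fetch a page into a string with cURL, without storing a file.
// Requires the cURL extension; buildCurlOptions() and curlFetch() are
// hypothetical helper names.

function buildCurlOptions(string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_HTTPHEADER     => ['Accept: text/xml, application/xml'],
        CURLOPT_TIMEOUT        => 10,    // seconds
    ];
}

function curlFetch(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildCurlOptions($userAgent));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}

// Usage (placeholder URL from the thread):
// echo curlFetch(
//     'http://www.someserver.com/user-info.xml?user=myusername',
//     'Mozilla/5.0 (compatible; ExampleBrowser/1.0)'
// );
```

Keeping the options in one array makes it easy to change a single header at a time and see which one alters the response.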