search engine challenge

  #1 (permalink)  
Old 01-26-2004
Frank
 
Posts: n/a
Default search engine challenge

Hello,

I'm running a site with more than 20,000 articles. The articles (HTML files)
are saved on the server as txt files. All other data (author, date, category
and so on) is in a MySQL db. Previously we stored the articles in the db as
well and ran SQL queries against them for the search engine. But this is no
longer feasible since there are too many articles and the db has gotten too
big. The search engine goes through the whole db on every query and the server
CPU maxes out.
I'm looking for a PHP search engine that automatically indexes the txt files
and produces one index file containing all indexed words plus the ids of the
articles that contain those words. That way the search script no longer has to
query all the articles (the whole db), just this one index file. It would also
be nice to have a blacklist of words (the, a, ...) and other admin features.
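
A minimal sketch of the kind of index described above, assuming the articles are stored as <id>.txt in a single directory (the file naming, the blacklist contents and the JSON index format are all assumptions for illustration). It is written in Python for brevity; the same word-to-ids structure maps directly onto a PHP associative array.

```python
import json
import re
from pathlib import Path

# Hypothetical blacklist of words that should never be indexed.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag stripper, not a real parser
WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Strip tags, lowercase, split into words, drop blacklisted words."""
    plain = TAG_RE.sub(" ", text).lower()
    return [w for w in WORD_RE.findall(plain) if w not in STOPWORDS]


def build_index(article_dir):
    """Map each word to the sorted list of article ids that contain it.

    Assumes every file is named <id>.txt with a numeric id, e.g. 123.txt.
    """
    index = {}
    for path in Path(article_dir).glob("*.txt"):
        article_id = int(path.stem)
        text = path.read_text(encoding="utf-8", errors="replace")
        for word in set(tokenize(text)):  # count each word once per article
            index.setdefault(word, []).append(article_id)
    return {word: sorted(ids) for word, ids in index.items()}


def save_index(index, index_path):
    """Write the whole index to one file so searches never touch the articles."""
    Path(index_path).write_text(json.dumps(index), encoding="utf-8")
```

The index would be rebuilt (or updated) whenever articles are added, for example from a cron job, so that the search script only ever reads the one index file.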

Does anyone have experience with this?

Greetz,
Frank.



  #2 (permalink)  
Old 01-26-2004
Philipp Lenssen
 
Posts: n/a
Default Re: search engine challenge

Frank wrote:

>
> I'm running a site with more than 20,000 articles. The articles (HTML
> files) are saved on the server as txt files. All other data (author,
> date, category and so on) is in a MySQL db. Previously we stored the
> articles in the db as well and ran SQL queries against them for the
> search engine. But this is no longer feasible since there are too many
> articles and the db has gotten too big. The search engine goes through
> the whole db on every query and the server CPU maxes out. I'm looking
> for a PHP search engine that automatically indexes the txt files and
> produces one index file containing all indexed words plus the ids of
> the articles that contain those words. That way the search script no
> longer has to query all the articles (the whole db), just this one
> index file. It would also be nice to have a blacklist of words (the,
> a, ...) and other admin features.
>


If the site is public, have you thought about letting Google do the
hard work, and then using either the Google site search or the Google
Web API to display results? Google is getting _very_ fast at indexing
large amounts of data on a site. They picked up thousands of my pages
recently while I was playing around with the .htaccess file... too fast
for my taste, even, since I changed it again the next day...
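
As a rough illustration of the site-search option, a search form could simply redirect to a Google query restricted with the site: operator. The domain example.com and the function name below are placeholders, not anything from the actual site:

```python
from urllib.parse import urlencode


def site_search_url(domain, query):
    """Build a Google search URL limited to one domain via the site: operator."""
    return "https://www.google.com/search?" + urlencode(
        {"q": f"site:{domain} {query}"}
    )


# Placeholder domain; prints a URL the search form could redirect to.
print(site_search_url("example.com", "search engine"))
```

This only works for pages Google has already crawled, so it fits public sites and nothing behind a login.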

--
Google Blogoscoped
http://blog.outer-court.com
  #3 (permalink)  
Old 01-26-2004
Frank
 
Posts: n/a
Default Re: search engine challenge

I don't think it's possible to have Google index a MySQL db, is it? And the
HTML files on the server don't have the .html extension.
  #4 (permalink)  
Old 01-27-2004
Manuel Lemos
 
Posts: n/a
Default Re: search engine challenge

Hello,

On 01/26/2004 10:26 AM, Frank wrote:
> I'm running a site with more than 20,000 articles. The articles (HTML files)
> are saved on the server as txt files. All other data (author, date, category
> and so on) is in a MySQL db. Previously we stored the articles in the db as
> well and ran SQL queries against them for the search engine. But this is no
> longer feasible since there are too many articles and the db has gotten too
> big. The search engine goes through the whole db on every query and the
> server CPU maxes out.
> I'm looking for a PHP search engine that automatically indexes the txt files
> and produces one index file containing all indexed words plus the ids of the
> articles that contain those words. That way the search script no longer has
> to query all the articles (the whole db), just this one index file. It would
> also be nice to have a blacklist of words (the, a, ...) and other admin
> features.
>
> Does anyone have experience with this?


Real search engines do not use SQL. SQL may be usable for small sites, but
for large sites like yours it is very slow and eats up server resources
(disk space, memory, overall speed), as you have already noticed.

A better solution is to use a dedicated crawler that stores its data in
flat files optimized for full text search operations. I use and recommend
ht://Dig, which I run on the phpclasses.org site. It is also what the
php.net site and its mirrors use.
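
To show why a flat-file index is cheap to query, here is a minimal sketch of a lookup against a prebuilt index. It assumes a hypothetical JSON file that maps each word to a list of article ids; this is not ht://Dig's actual file format or algorithm, only the general idea, written in Python for brevity.

```python
import json


def load_index(index_path):
    """Read the whole word -> article ids mapping from one flat file."""
    with open(index_path, encoding="utf-8") as fh:
        return json.load(fh)


def search(index, query):
    """Return the ids of articles that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Start with the first word's articles, then narrow down word by word.
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)
```

Each query costs a few dictionary lookups and set intersections instead of a scan over every article, which is the main reason dedicated indexers scale better than a LIKE query over the article text.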

ht://Dig is available at www.htdig.org. You may also want to take a look
at this class for interfacing with ht://Dig from PHP. It will save you a
lot of time and patience when configuring, indexing and searching your
site with ht://Dig:

http://www.phpclasses.org/htdiginterface


--

Regards,
Manuel Lemos

Free ready to use OOP components written in PHP
http://www.phpclasses.org/

MetaL - XML based meta-programming language
http://www.meta-language.net/

  #5 (permalink)  
Old 01-27-2004
Philipp Lenssen
 
Posts: n/a
Default Re: search engine challenge

Frank wrote:

> I don't think it's possible to have Google index a MySQL db, is it? And
> the HTML files on the server don't have the .html extension.
>


The HTML files may not have the extension "html", but extensions do not
matter to most search engines these days (certainly not to the most
important one, Google). So serve the pages as text/html and you're fine.
As long as you don't expose session IDs as URL parameters and you don't
use a dozen parameters, the pages get indexed fine. You can still use
.htaccess to display the URLs as "....html", by the way (which might be
nicer for users and for PageRank etc.)

--
Google Blogoscoped
http://blog.outer-court.com