This is a discussion on search engine challenge within the PHP Language forums, part of the PHP Programming Forums category.
Hello,

I'm running a site with more than 20,000 articles. The articles (HTML) are saved on the server as txt files. All other data (author, date, category and so on) is in a MySQL db. Previously we stored the articles in the db as well and ran SQL queries against them for the search engine. That is no longer feasible: there are too many articles and the db has gotten too big. Every search scans the whole db and the server CPU maxes out.

I'm looking for a PHP-based search engine that automatically indexes the txt files and produces a single index file containing all indexed words plus the IDs of the articles that contain them. That way the search script no longer has to query all the articles (the whole db), just this one index file. It would also be nice to have a blacklist of words (the, a, ...) and other admin features.

Does anyone have experience with this?

Greetz,
Frank.
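A minimal sketch of the kind of index described above: one word-to-article-IDs map built from the txt files, with a stop-word blacklist. It is written in Python for brevity (the same structure maps onto PHP associative arrays). The file naming scheme (`<id>.txt`), the stop-word list, and the function names are assumptions for illustration, not something from the post.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical blacklist; a real one would be longer and editable by an admin.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag stripper, good enough for a sketch
WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(html):
    """Strip tags, lowercase, and return the set of indexable words."""
    text = TAG_RE.sub(" ", html).lower()
    return {w for w in WORD_RE.findall(text) if w not in STOP_WORDS}


def build_index(article_dir):
    """Map each word to the sorted list of article IDs that contain it.

    Assumes articles are stored as <id>.txt (e.g. 1042.txt); other files are skipped.
    """
    index = defaultdict(set)
    for path in Path(article_dir).glob("*.txt"):
        if not path.stem.isdigit():
            continue
        article_id = int(path.stem)
        content = path.read_text(encoding="utf-8", errors="ignore")
        for word in tokenize(content):
            index[word].add(article_id)
    return {word: sorted(ids) for word, ids in index.items()}
```

The index only needs rebuilding when articles are added or changed, so the expensive pass over all 20,000 files moves out of the search request and into a scheduled job.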
|
|||
|
Frank wrote:
> I'm running a site with more than 20,000 articles. The articles (HTML)
> are saved on the server as txt files. All other data (author, date,
> category and so on) is in a MySQL db. Previously we stored the articles
> in the db as well and ran SQL queries against them for the search engine.
> That is no longer feasible: there are too many articles and the db has
> gotten too big. Every search scans the whole db and the server CPU
> maxes out. I'm looking for a PHP-based search engine that automatically
> indexes the txt files and produces a single index file containing all
> indexed words plus the IDs of the articles that contain them. That way
> the search script no longer has to query all the articles (the whole db),
> just this one index file. It would also be nice to have a blacklist of
> words (the, a, ...) and other admin features.

If the site is public, have you thought about letting Google do the hard work, and then using either the Google site search or the Google Web API to display results? Google is getting _very_ fast at indexing large amounts of data on one's site. They picked up thousands of my pages recently while I was playing around with the htaccess... even too fast for my taste, since I changed it again the next day...

--
Google Blogoscoped
http://blog.outer-court.com
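As a rough illustration of the site-search suggestion: Google's `site:` operator restricts results to one domain, so a search form can simply forward the user's terms. The sketch below (Python, standard library only) builds such a URL; the function name is hypothetical and `example.com` is a placeholder domain. It does not cover the Google Web API.

```python
from urllib.parse import urlencode


def site_search_url(domain, terms):
    """Build a Google results URL restricted to one domain via the site: operator."""
    query = f"site:{domain} {terms}".strip()
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # example.com stands in for the real site's domain.
    print(site_search_url("example.com", "php search"))
```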
|
|||
|
I don't think it's possible to have Google index a MySQL db. And the HTML files on the server don't have an .html extension.

"Philipp Lenssen" <info@outer-court.com> wrote in message news:bv3682$m6l26$1@ID-203055.news.uni-berlin.de...
> Frank wrote:
> > I'm running a site with more than 20,000 articles. The articles (HTML)
> > are saved on the server as txt files. All other data (author, date,
> > category and so on) is in a MySQL db. Previously we stored the articles
> > in the db as well and ran SQL queries against them for the search engine.
> > That is no longer feasible: there are too many articles and the db has
> > gotten too big. Every search scans the whole db and the server CPU
> > maxes out. I'm looking for a PHP-based search engine that automatically
> > indexes the txt files and produces a single index file containing all
> > indexed words plus the IDs of the articles that contain them. That way
> > the search script no longer has to query all the articles (the whole db),
> > just this one index file. It would also be nice to have a blacklist of
> > words (the, a, ...) and other admin features.
>
> If the site is public, have you thought about letting Google do the
> hard work, and then using either the Google site search or the Google
> Web API to display results? Google is getting _very_ fast at indexing
> large amounts of data on one's site. They picked up thousands of my
> pages recently while I was playing around with the htaccess... even too
> fast for my taste, since I changed it again the next day...
>
> --
> Google Blogoscoped
> http://blog.outer-court.com
|
|||
|
Hello,
On 01/26/2004 10:26 AM, Frank wrote:
> I'm running a site with more than 20,000 articles. The articles (HTML)
> are saved on the server as txt files. All other data (author, date,
> category and so on) is in a MySQL db. Previously we stored the articles
> in the db as well and ran SQL queries against them for the search engine.
> That is no longer feasible: there are too many articles and the db has
> gotten too big. Every search scans the whole db and the server CPU
> maxes out.
> I'm looking for a PHP-based search engine that automatically indexes the
> txt files and produces a single index file containing all indexed words
> plus the IDs of the articles that contain them. That way the search
> script no longer has to query all the articles (the whole db), just this
> one index file. It would also be nice to have a blacklist of words
> (the, a, ...) and other admin features.
>
> Does anyone have experience with this?

Real search engines do not use SQL. It may be usable for small sites, but for a large site like yours it is very slow and will eat up your server resources (disk space, memory, overall speed), as you have already noticed.

A better solution is to use a dedicated crawler that stores its data in flat-file databases optimized for full-text search operations.

I use and recommend Ht://Dig on the phpclasses.org site. It is also what the php.net site and its mirrors use. Htdig is available at www.htdig.org.

You may also want to take a look at this class for interfacing with HtDig from PHP. It will save you a lot of time and patience when configuring, indexing and searching your site with htdig:

http://www.phpclasses.org/htdiginterface

--
Regards,
Manuel Lemos

Free ready to use OOP components written in PHP
http://www.phpclasses.org/

MetaL - XML based meta-programming language
http://www.meta-language.net/
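To make the flat-file idea concrete, here is a small sketch of storing a word-to-IDs index in a single file and answering a query from it without touching the database. This is not how ht://Dig stores or queries its data; the JSON format, the function names, and the all-words-must-match (AND) behaviour are assumptions chosen for illustration, shown in Python.

```python
import json


def save_index(index, path):
    """Write the word -> article-ID-list map to one flat file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)


def load_index(path):
    """Read the flat index file back into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def search(index, query):
    """Return sorted IDs of articles containing every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    result = None
    for word in words:
        ids = set(index.get(word, []))
        result = ids if result is None else result & ids
        if not result:
            return []  # one word matched nothing, so the AND query is empty
    return sorted(result)
```

The returned IDs can then be used for one small primary-key lookup in MySQL to fetch author, date and category for the result page.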
|
|||
|
Frank wrote:
> I don't think it's possible to have Google index a MySQL db. And the
> HTML files on the server don't have an .html extension.

The HTML files may not have the extension "html", but extensions do not matter to most search engines these days (and certainly not to the most important one, Google). As long as you serve the pages as text/html, that's fine. If you don't expose session IDs as parameters, and you don't use a dozen parameters, the pages get indexed fine. You can still use htaccess to display the URLs as "....html", by the way (which might be nicer for users and for PageRank etc.).

--
Google Blogoscoped
http://blog.outer-court.com