This is a discussion on Searching Google n-gram corpus? within the MySQL Database forums, part of the Database Forums category.
Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2...ng-to-you.html

It contains all 1- to 5-grams that occur more than 40 times in their web crawl. It comes as 5 folders, each containing around 120 files. Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase. The 3-grams alone total 19GB unzipped, with each individual file around 200MB. All the unzipped data is around 100GB.

I would like to be able to search through all this and return all lines that contain a particular word or phrase. I have no idea where to start with this, but I was wondering whether an SQL database would be feasible. For the 5-grams I would need a billion rows of 6 columns. What sort of hard disk space would I need, and what kind of time would I be looking at per search on an ordinary machine?

I would like to be able to find every line where a particular word occurs, no matter which position it occurs in, and ideally I would like to be able to find particular bigrams as well.

thanks.
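For scale, the simplest baseline for the search described above is a streaming scan with no database at all. The following is only a sketch under assumptions: `parse_line` and `search` are hypothetical helper names, the sample lines are made up, and the count is assumed to be the last whitespace-separated field, as in the example line quoted in the post.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_line(line: str) -> Tuple[List[str], int]:
    """Split a corpus line into its words and its trailing frequency."""
    # The count is assumed to be the last whitespace-separated field.
    text, count = line.rstrip("\n").rsplit(None, 1)
    return text.split(), int(count)


def search(lines: Iterable[str], phrase: str) -> Iterator[Tuple[str, int]]:
    """Yield (ngram, count) for lines containing phrase as whole words,
    at any position in the n-gram."""
    want = phrase.split()
    n = len(want)
    for line in lines:
        words, count = parse_line(line)
        # Slide a window of len(want) words across the n-gram.
        if any(words[i:i + n] == want for i in range(len(words) - n + 1)):
            yield " ".join(words), count


# Made-up sample lines in the format quoted in the post.
sample = [
    "this is a four gram 65",
    "is a four gram too 41",
    "a fourth gram here now 50",
]
hits = list(search(sample, "four gram"))
```

The same function handles single words and bigrams, because it compares whole words rather than substrings ("four" does not match "fourth"). Passing an open file object as `lines` would stream one corpus file at a time, but every query then reads the whole file, which is the cost an index is meant to avoid.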
On Fri, 07 Sep 2007 15:55:17 -0000, bobterwillinger@gmail.com wrote:

>Hi,
>
>Google released a corpus of n-grams collected from the Web.
>
>http://googleresearch.blogspot.com/2...ng-to-you.html
>
>It contains all 1..5grams that occur more than 40 times in their web
>crawl. It comes as 5 folders, each folder containing around 120 files.
>Each file contains 10,000,000 (10^7) lines. A line looks like:
>
>"this is a four gram 65"
>
>where the last number is the frequency of that exact phrase.
>The total unzipped size of the 3 grams alone is 19GB, each individual
>file around 200MB.
>All the unzipped data is around 100GB.
>
>I would like to be able to search through all this and return all
>lines that contain a particular word or phrase.
>I have no idea where to start with this, but I was wondering would an
>SQL database be feasible. For the 5-grams i would need a billion rows
>and of 6 columns. What sort of hard disk space would I need, and what
>kind of time would i be looking at per search on on ordinary mahcine?,
>
>I would like to be able to find every line where a particular word
>occurs, no matter which position it occurs in, and ideally I would
>like to be able to find particular bigrams as well.
>
>thanks.

I risk getting off topic, but in my humble opinion you'd be much better off by using the Google search engine, which is probably fed by those same n-grams, by enclosing your google query in double quotes, like "this is a five gram".

What you get back is a series of response pages with hyperlinks to the original documents, which you can subsequently access and analyse.

Google Advanced Search gives some options to refine the query to your needs, like 100 results per page. All of that is encoded in the URL, so you can manipulate it quite easily.

http://www.google.com/intl/en/help/refinesearch.html

--
( Kees )
c[_] Invalid thought detected. Close all mental processes and restart body. (#409)
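To illustrate the point that the query is encoded in the URL, here is a small sketch that only builds such a URL and sends no request. The `/search` path and the `q` and `num` parameter names are assumptions not stated in this thread, and `phrase_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def phrase_query_url(phrase: str, per_page: int = 100) -> str:
    """Build a search URL for an exact-phrase query.

    Assumptions: the '/search' path, 'q' (query text) and 'num'
    (results per page) are not taken from the thread.
    """
    # Double quotes around the phrase request an exact match.
    params = {"q": f'"{phrase}"', "num": per_page}
    return "http://www.google.com/search?" + urlencode(params)


url = phrase_query_url("this is a five gram")
```

`urlencode` percent-encodes the quotes and turns spaces into `+`, so the phrase survives the trip through the query string intact.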
> I risk getting off topic, but in my humble opinion you'd be much
> better off by using the Google search engine, which is probably
> fed by those same n-grams, by enclosing your google query in
> double quotes, like "this is a five gram".
>
> What you get back is a series of response pages with hyperlinks
> to the original documents, which you can subsequently access and
> analyse.
>
> Google Advanced Search gives some options to refine the query to
> your needs, like 100 results per page. All of that is encoded in
> the URL, so you can manipulate it quite easily.
> http://www.google.com/intl/en/help/refinesearch.html

thanks, I want to do a lot of searches and if you do a lot of automated queries they can block your IP, which would get my whole college blocked. They have a SOAP API too, but it's sort of decommissioned and was pretty rubbish anyway.
On Sat, 08 Sep 2007 12:55:44 -0000, Paul Nulty <paul.nulty@gmail.com> wrote:

>> I risk getting off topic, but in my humble opinion you'd be much
>> better off by using the Google search engine, which is probably
>> fed by those same n-grams, by enclosing your google query in
>> double quotes, like "this is a five gram".
>>
>> What you get back is a series of response pages with hyperlinks
>> to the original documents, which you can subsequently access and
>> analyse.
>>
>> Google Advanced Search gives some options to refine the query to
>> your needs, like 100 results per page. All of that is encoded in
>> the URL, so you can manipulate it quite easily.
>> http://www.google.com/intl/en/help/refinesearch.html
>
>thanks, I want to do a lot of searches and if you do a lot of
>automated queries they can block your IP, which would get my whole
>college blocked.

Oops, I don't want to seduce you into that ;)

>They have a SOAP API too, but it's sort of
>decommissioned and was pretty rubbish anyway.

Ok, I'll leave it for the real MySQL gurus then ;)

--
( Kees )
c[_] The secret of being miserable is to have leisure to bother about whether you are happy or not. The cure for it is occupation. (George Bernard Shaw 1856-1950) (#472)
I have a similar system using a different database engine.

Why not put each file in a simple table, 1 file per table, each table with three columns: phrase, grams, count. Example:

"this is a phrase" 4 10000

You can create a large view over all the tables via a UNION ALL clause:

create view alldata as
select * from file1
union all
select * from file2
union all
select * from file3
union all
select * from file4
union all
select * from file5
union all ....

Try to spread your tables amongst many disks. Disk striping could also help.

Having smaller tables to work with will be a lot easier than having a 40 trillion record table :-)

If your database supports word indexing, you could use a word index on phrase, and search via the CONTAINS predicate.

What do you plan to do with the dataset anyway?
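The one-table-per-file layout with a UNION ALL view can be tried at small scale. This is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in for whichever engine the poster uses; the rows are made up, `find_word` is a hypothetical helper, and a plain LIKE scan stands in for the word index, which SQLite does not provide by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per corpus file, each with the three columns from the post.
for name in ("file1", "file2"):
    cur.execute(f"CREATE TABLE {name} (phrase TEXT, grams INTEGER, count INTEGER)")

# Made-up rows; real tables would be bulk-loaded from the corpus files.
cur.executemany("INSERT INTO file1 VALUES (?, ?, ?)", [
    ("this is a phrase", 4, 10000),
    ("a phrase is", 3, 120),
])
cur.executemany("INSERT INTO file2 VALUES (?, ?, ?)", [
    ("rephrase this", 2, 75),
    ("phrase book", 2, 300),
])

# The view from the post, here over two tables instead of hundreds.
cur.execute("""
    CREATE VIEW alldata AS
    SELECT * FROM file1
    UNION ALL
    SELECT * FROM file2
""")


def find_word(word):
    """Return (phrase, count) rows where word occurs as a whole word."""
    # Padding both sides with spaces makes ' word ' match at any position,
    # including the first and last word of the phrase.
    pattern = f"% {word} %"
    cur.execute(
        "SELECT phrase, count FROM alldata "
        "WHERE ' ' || phrase || ' ' LIKE ? ORDER BY count DESC",
        (pattern,),
    )
    return cur.fetchall()


rows = find_word("phrase")
```

The LIKE query reads every row of every table behind the view, so it shows the layout rather than the performance; a word index, where the engine supports one, exists to avoid exactly that scan. Words containing `%` or `_` would need escaping, since LIKE treats them as wildcards.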