Searching Google n-gram corpus?



#1
09-07-2007
bobterwillinger@gmail.com
Searching Google n-gram corpus?

Hi,

Google released a corpus of n-grams collected from the Web.

http://googleresearch.blogspot.com/2...ng-to-you.html

It contains all 1..5grams that occur more than 40 times in their web
crawl. It comes as 5 folders, each folder containing around 120 files.
Each file contains 10,000,000 (10^7) lines. A line looks like:

"this is a four gram 65"

where the last number is the frequency of that exact phrase.
The total unzipped size of the 3 grams alone is 19GB, each individual
file around 200MB.
All the unzipped data is around 100GB.
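
For reference, a line in that format could be split into its words and its count roughly like this. It is only a minimal sketch: it assumes the count is always the last whitespace-separated field, and the function name parse_ngram_line is made up for illustration.

```python
def parse_ngram_line(line):
    """Split one corpus line into (tuple_of_words, count)."""
    # Assumption: the frequency is the last whitespace-separated field,
    # and everything before it is the n-gram itself.
    words_part, count_part = line.rstrip("\n").rsplit(None, 1)
    return tuple(words_part.split()), int(count_part)


words, count = parse_ngram_line("this is a four gram 65")
print(words, count)
```

Keeping the words as a tuple makes it easy to load them into separate columns later, or to join them back into one phrase column.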

I would like to be able to search through all this and return all
lines that contain a particular word or phrase.
I have no idea where to start with this, but I was wondering would an
SQL database be feasible. For the 5-grams I would need a billion rows
of 6 columns each. What sort of hard disk space would I need, and what
kind of time would I be looking at per search on an ordinary machine?
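
A rough back-of-envelope for the row count and raw table size might look like the following. The file and line counts are the ones given above; the storage layout (each word kept as a 4-byte integer ID plus an 8-byte count) is purely an assumption, and the result ignores per-row overhead, indexes, and the vocabulary table.

```python
# Figures from the post: around 120 files per folder, 10^7 lines per file.
FILES_PER_FOLDER = 120
LINES_PER_FILE = 10_000_000

rows = FILES_PER_FOLDER * LINES_PER_FILE

# Assumption (not from the corpus documentation): each of the 5 words is
# stored as a 4-byte integer ID and the frequency as an 8-byte integer.
BYTES_PER_ROW = 5 * 4 + 8

raw_bytes = rows * BYTES_PER_ROW
raw_gb = raw_bytes / 1e9

print("rows:", rows)
print("raw size in GB, before indexes and overhead:", raw_gb)
```

Under those assumptions the 5-gram table alone comes to about 1.2 billion rows and roughly 34 GB before any index is added, and indexing each word column would add substantially to that.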

I would like to be able to find every line where a particular word
occurs, no matter which position it occurs in, and ideally I would
like to be able to find particular bigrams as well.
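
To make the requirement concrete, here is a small in-memory sketch of the kind of lookup I mean: an inverted index from each word to the lines containing it, plus a bigram check on top. The sample lines and their counts are invented, and a real solution would need the index on disk rather than in a Python dict.

```python
from collections import defaultdict


def build_index(lines):
    """Map each word to the set of line numbers that contain it."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        # Assumption: the last field is the count, so it is dropped.
        for word in line.split()[:-1]:
            index[word].add(lineno)
    return index


def find_word(index, lines, word):
    """Return every line containing the word, in any position."""
    return [lines[i] for i in sorted(index.get(word, ()))]


def find_bigram(index, lines, first, second):
    """Return every line where `first` is immediately followed by `second`."""
    candidates = index.get(first, set()) & index.get(second, set())
    hits = []
    for i in sorted(candidates):
        words = lines[i].split()[:-1]
        if any(a == first and b == second for a, b in zip(words, words[1:])):
            hits.append(lines[i])
    return hits


# Hypothetical sample data in the corpus line format.
sample = [
    "this is a four gram 65",
    "a four gram is short 41",
    "gram four is a word 50",
]
index = build_index(sample)
print(find_word(index, sample, "four"))
print(find_bigram(index, sample, "four", "gram"))
```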

thanks.

#2
09-07-2007
Kees Nuyt
Re: Searching Google n-gram corpus?

On Fri, 07 Sep 2007 15:55:17 -0000, bobterwillinger@gmail.com
wrote:

>Hi,
>
>Google released a corpus of n-grams collected from the Web.
>
>http://googleresearch.blogspot.com/2...ng-to-you.html
>
>It contains all 1..5grams that occur more than 40 times in their web
>crawl. It comes as 5 folders, each folder containing around 120 files.
>Each file contains 10,000,000 (10^7) lines. A line looks like:
>
>"this is a four gram 65"
>
>where the last number is the frequency of that exact phrase.
>The total unzipped size of the 3 grams alone is 19GB, each individual
>file around 200MB.
>All the unzipped data is around 100GB.
>
>I would like to be able to search through all this and return all
>lines that contain a particular word or phrase.
>I have no idea where to start with this, but I was wondering would an
>SQL database be feasible. For the 5-grams I would need a billion rows
>of 6 columns each. What sort of hard disk space would I need, and what
>kind of time would I be looking at per search on an ordinary machine?
>
>I would like to be able to find every line where a particular word
>occurs, no matter which position it occurs in, and ideally I would
>like to be able to find particular bigrams as well.
>
>thanks.


I risk getting off topic, but in my humble opinion you'd be much
better off by using the Google search engine, which is probably
fed by those same n-grams, by enclosing your google query in
double quotes, like "this is a five gram".

What you get back is a series of response pages with hyperlinks
to the original documents, which you can subsequently access and
analyse.

Google Advanced Search gives some options to refine the query to
your needs, like 100 results per page. All of that is encoded in
the URL, so you can manipulate it quite easily.
http://www.google.com/intl/en/help/refinesearch.html
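
As an illustration of what "encoded in the URL" means, a quoted-phrase query URL could be assembled like this. This is a sketch only: the parameter names q and num and the /search path are my assumptions about the URL layout, the helper name phrase_search_url is made up, and the code only builds the string without sending any request.

```python
from urllib.parse import urlencode


def phrase_search_url(phrase, results_per_page=100):
    """Build a search URL for an exact phrase (no request is sent)."""
    # Assumption: 'q' carries the query and 'num' the results per page.
    # The double quotes around the phrase ask for an exact match.
    params = {"q": '"%s"' % phrase, "num": results_per_page}
    return "http://www.google.com/search?" + urlencode(params)


print(phrase_search_url("this is a five gram"))
```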
--
( Kees
)
c[_] Invalid thought detected. Close all
mental processes and restart body. (#409)

#3
09-08-2007
Paul Nulty
Re: Searching Google n-gram corpus?

> I risk getting off topic, but in my humble opinion you'd be much
> better off by using the Google search engine, which is probably
> fed by those same n-grams, by enclosing your google query in
> double quotes, like "this is a five gram".
>
> What you get back is a series of response pages with hyperlinks
> to the original documents, which you can subsequently access and
> analyse.
>
> Google Advanced Search gives some options to refine the query to
> your needs, like 100 results per page. All of that is encoded in
> the URL, so you can manipulate it quite easily.
> http://www.google.com/intl/en/help/refinesearch.html



thanks, I want to do a lot of searches and if you do a lot of
automated queries they can block your IP, which would get my whole
college blocked. They have a SOAP API too, but it's sort of
decommissioned and was pretty rubbish anyway.


#4
09-08-2007
Kees Nuyt
Re: Searching Google n-gram corpus?

On Sat, 08 Sep 2007 12:55:44 -0000, Paul Nulty
<paul.nulty@gmail.com> wrote:

>> I risk getting off topic, but in my humble opinion you'd be much
>> better off by using the Google search engine, which is probably
>> fed by those same n-grams, by enclosing your google query in
>> double quotes, like "this is a five gram".
>>
>> What you get back is a series of response pages with hyperlinks
>> to the original documents, which you can subsequently access and
>> analyse.
>>
>> Google Advanced Search gives some options to refine the query to
>> your needs, like 100 results per page. All of that is encoded in
>> the URL, so you can manipulate it quite easily.
>> http://www.google.com/intl/en/help/refinesearch.html

>
>
>thanks, I want to do a lot of searches and if you do a lot of
>automated queries they can block your IP, which would get my whole
>college blocked.


Oops, I don't want to seduce you into that ;)

>They have a SOAP API too, but it's sort of
>decommissioned and was pretty rubbish anyway.


Ok, I'll leave it for the real MySQL gurus then ;)
--
( Kees
)
c[_] The secret of being miserable is to have leisure to
bother about whether you are happy or not. The cure
for it is occupation. (George Bernard Shaw 1856-1950) (#472)

#5
09-16-2007
Shield
Re: Searching Google n-gram corpus?


I have a similar system using a different database engine.

Why not put each file in a simple table, 1 file per table?

each table with three columns:

phrase
grams
count

example

"this is a phrase"
4
10000

you can create a large view with all the tables via a union all clause

create view alldata
as
select * from file1
union all
select * from file2
union all
select * from file3
union all
select * from file4
union all
select * from file5
union all
....
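
As a quick way to try the idea out, here is a minimal sketch of the same view using Python's built-in sqlite3 module instead of a real server. It assumes only two tiny tables, and the second sample row is made up; the table and column names follow the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample rows standing in for two corpus files.
samples = {
    "file1": [("this is a phrase", 4, 10000)],
    "file2": [("another short phrase", 3, 52)],
}

for table, sample_rows in samples.items():
    # One table per file, with the three columns described above.
    cur.execute(
        "create table %s (phrase text, grams integer, count integer)" % table
    )
    cur.executemany("insert into %s values (?, ?, ?)" % table, sample_rows)

# The view stitches the per-file tables together.
cur.execute("""
    create view alldata as
    select * from file1
    union all
    select * from file2
""")

matches = cur.execute(
    "select * from alldata where phrase like ?", ("%phrase%",)
).fetchall()
print(matches)
```

Note that a like '%word%' search cannot use an ordinary index, so on the full data it would scan every table behind the view.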
try to spread your tables amongst many disks. disk striping could also
help

having smaller tables to work with will be a lot easier than having a
40 trillion record table :-)

if your database supports word indexing, you could use a word index on
phrase, and search via the contains predicate.
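
In MySQL the closest built-in feature is a FULLTEXT index queried with MATCH ... AGAINST. If that is not an option, a word index can be built by hand. The sketch below is that hand-rolled substitute, not the contains predicate itself: it uses Python's sqlite3 module, and the table names ngram and word_pos, the column freq, and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ngram (id integer primary key, phrase text, freq integer);
    -- one row per word occurrence: which word, at which position, in which n-gram
    create table word_pos (word text, pos integer, ngram_id integer);
    create index word_pos_word on word_pos (word);
""")


def add_ngram(conn, phrase, freq):
    """Store the phrase and index each of its words by position."""
    cur = conn.execute(
        "insert into ngram (phrase, freq) values (?, ?)", (phrase, freq)
    )
    conn.executemany(
        "insert into word_pos values (?, ?, ?)",
        [(word, pos, cur.lastrowid) for pos, word in enumerate(phrase.split())],
    )


def search_word(conn, word):
    """All n-grams containing the word, whatever its position."""
    return conn.execute(
        "select distinct n.phrase, n.freq from ngram n "
        "join word_pos w on w.ngram_id = n.id "
        "where w.word = ? order by n.phrase",
        (word,),
    ).fetchall()


def search_bigram(conn, first, second):
    """All n-grams where `first` is immediately followed by `second`."""
    return conn.execute(
        "select distinct n.phrase, n.freq from ngram n "
        "join word_pos a on a.ngram_id = n.id "
        "join word_pos b on b.ngram_id = n.id and b.pos = a.pos + 1 "
        "where a.word = ? and b.word = ? order by n.phrase",
        (first, second),
    ).fetchall()


# Hypothetical sample rows.
add_ngram(conn, "this is a phrase", 10000)
add_ngram(conn, "a phrase is this", 77)
print(search_word(conn, "phrase"))
print(search_bigram(conn, "is", "a"))
```

The index on word turns a single-word lookup into an index search rather than a scan, at the cost of one extra row per word occurrence.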

what do you plan to do with the dataset anyway?

