Inspired by the recent googlewhacking fad, I thought that by using the overlap of two searches, you should be able to estimate the size of the thing you are searching in. This assumes lots of unreasonable things about distribution, both in what you search for and in what exists on the web.

Here is the technical part. The results are at the end.

Let T = the total number of pages indexed by Google.

Define g(x) = the number of hits Google finds for x.

Define g(x,y) = the number of hits Google finds for x y.

Define g(x,y,z) = the number of hits Google finds for x y z.

etc.

Let x_f be the frequency with which the term x appears on any page indexed by Google (it ranges from 0 to 1). I.e., if the term "x" appears on 50% of all pages, then x_f = 0.5.

So,

g(x) = x_f * T

g(x,y) = x_f * y_f * T

g(x,y,z) = x_f * y_f * z_f * T

Solve for T in the second equation:

T = g(x,y) / (x_f * y_f)

T = g(x,y) / ((g(x)/T) * (g(y)/T)) (since x_f = g(x)/T)

T = g(x,y) / ((g(x) * g(y)) / T^2) (simplifying)

T = g(x,y) * T^2 / (g(x) * g(y)) (simplifying more)

Solve for T and you get

**Result 1:** T = g(x) * g(y) / g(x,y).

Since you get a value of g by using google, you can get trial values for T whenever you want!
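Result 1 is easy to try out. Here is a quick Python sketch of the calculation, plugged with the Test 1 hit counts reported below:

```python
def estimate_index_size(g_x, g_y, g_xy):
    """Result 1: T = g(x) * g(y) / g(x,y).

    g_x, g_y: hit counts for each term alone.
    g_xy: hit count for both terms together.
    """
    return g_x * g_y / g_xy

# Using the Test 1 numbers reported below (house / Estrangle):
print(f"{estimate_index_size(46_100_000, 94, 24):.1e}")  # about 1.8e8
```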

You can do a nastier version of the above to find that with n terms x_1, x_2, ..., x_n,

**Result 2:** in that case,

T = ((g(x_1) * g(x_2) * ... * g(x_n)) / g(x_1, x_2, ..., x_n))^(1/(n-1))
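Result 2 generalizes the same sketch to n terms (with n = 2 it reduces to Result 1). A minimal Python version:

```python
from math import prod

def estimate_index_size_n(hits, joint_hits):
    """Result 2: T = ((g(x_1) * ... * g(x_n)) / g(x_1, ..., x_n))^(1/(n-1)).

    hits: list of hit counts for each term alone.
    joint_hits: hit count for all n terms together.
    """
    n = len(hits)
    return (prod(hits) / joint_hits) ** (1 / (n - 1))

# Using the Test 5 numbers reported below (condom / open / Failing):
print(f"{estimate_index_size_n([689_000, 60_200_000, 2_200_000], 5410):.1e}")  # about 1.3e8
```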

Test Results!

Ok, now the tests! I will do 3 tests with pairs, and 3 tests with triplets. I will get the test words by doing "random node" and taking the first word besides "the" in the node title.

T is the number of pages indexed by Google.

g(x) is the number of hits for "x" on Google.

Test 1

g(house) = 46,100,000
g(Estrangle) = 94
g(house, Estrangle) = 24

T = 1.8e8

Test 2

g(Hostess) = 450,000
g(gene) = 6,240,000
g(Hostess, gene) = 7,310

T = 3.8e8

Test 3

g(Toronto) = 7,050,000
g(Brassclaw) = 836
g(Toronto, Brassclaw) = 2

T = 2.9e9

Test 4

g(Unlikeness) = 3,890
g(Ophidiophobia) = 956
g(Adobe) = 7,160,000
g(Unlikeness, Ophidiophobia, Adobe) = 0

T = ?

Test 5

g(condom) = 689,000
g(open) = 60,200,000
g(Failing) = 2,200,000
g(condom, open, Failing) = 5,410

T = 1.3e8

Test 6

g(Two) = 112,000,000
g(Genesis) = 2,880,000
g(brilliant) = 2,560,000
g(Two, Genesis, brilliant) = 43,000

T = 1.4e8

Results
I got 5 results for T: 1.8e8 3.8e8 2.9e9 1.3e8 and 1.4e8. The average is 7.6e8 = 760,000,000.
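The average can be re-checked directly from the raw hit counts above (Test 4 has to be skipped, since its joint count of 0 makes the formula blow up):

```python
from math import prod

def estimate(hits, joint_hits):
    # Result 2: T = ((product of single-term hits) / joint hits)^(1/(n-1))
    return (prod(hits) / joint_hits) ** (1 / (len(hits) - 1))

# Raw hit counts from Tests 1-6 above (Test 4 omitted: joint count was 0).
tests = [
    ([46_100_000, 94], 24),                         # Test 1
    ([450_000, 6_240_000], 7_310),                  # Test 2
    ([7_050_000, 836], 2),                          # Test 3
    ([689_000, 60_200_000, 2_200_000], 5_410),      # Test 5
    ([112_000_000, 2_880_000, 2_560_000], 43_000),  # Test 6
]

estimates = [estimate(h, j) for h, j in tests]
print(f"average T = {sum(estimates) / len(estimates):.1e}")  # about 7.6e8
```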

After doing all this, someone told me that Google actually claims to index 2e9 = 2,000,000,000 pages. So, the accuracy is not too horrible.

I am sure this has already been done in real math, and has a real name. Does anyone know it?

I know that Google publicizes the number of pages it claims to index. However, this technique can also be used on other search engines, whose claims may be in doubt.