Inspired by the recent googlewhack, I thought that by using the overlap of two searches, you should be able to estimate the size of the thing you are searching in. This assumes lots of unreasonable things about distribution, both in what you search for, and what exists on the web.
Here is the technical part. The results are at the end.
let T = the total number of pages indexed by Google.
define g(x) = number of hits google finds for x.
define g(x,y) = number of hits google finds for x y.
define g(x,y,z) = number of hits google finds for x y z.
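let x_f be the frequency with which the term x appears on any page indexed by google (it's from 0 to 1), i.e. if the term "x" appears on 50% of all pages, then x_f = 0.5. Treating the terms as independent of each other:
g(x) = x_f * T
g(x,y) = x_f * y_f * T
g(x,y,z) = x_f * y_f * z_f * T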
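The independence model above can be checked on a toy "web". This is only a minimal sketch with made-up numbers: the page count, the two frequencies, and the helper name simulate_counts are all assumptions for illustration, and nothing here talks to a real search engine.

```python
import random

def simulate_counts(total_pages, x_f, y_f, seed=0):
    """Count hits on a synthetic web where each page contains term x with
    probability x_f and term y with probability y_f, independently."""
    rng = random.Random(seed)
    g_x = g_y = g_xy = 0
    for _ in range(total_pages):
        has_x = rng.random() < x_f
        has_y = rng.random() < y_f
        g_x += has_x
        g_y += has_y
        g_xy += has_x and has_y
    return g_x, g_y, g_xy

# Hypothetical toy values: T = 100,000 pages, x_f = 0.3, y_f = 0.1.
T, x_f, y_f = 100_000, 0.3, 0.1
g_x, g_y, g_xy = simulate_counts(T, x_f, y_f)
print("g(x)   =", g_x, " model:", x_f * T)
print("g(x,y) =", g_xy, " model:", x_f * y_f * T)
```

The simulated counts should land close to the model values, but only because the toy pages really are independent; real pages are not.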
Solve for T in the second equation:
T = g(x,y) / (x_f * y_f)
T = g(x,y) / ((g(x)/T) * (g(y)/T)) (since x_f = g(x)/T and y_f = g(y)/T)
T = g(x,y) / ((g(x) * g(y)) / T^2) (simplifying)
T = g(x,y) * T^2 / (g(x) * g(y)) (simplifying more)
Solve for T and you get
T = g(x) * g(y) / g(x,y).
Since you get a value of g by using google, you can get trial values for T whenever you want!
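As a rough sketch, the pair formula fits in a few lines of Python. The function name estimate_total_pair is made up for this example, and the hit counts in the usage line are illustrative numbers, not real search results.

```python
from typing import Optional

def estimate_total_pair(g_x: int, g_y: int, g_xy: int) -> Optional[float]:
    """Estimate T = g(x) * g(y) / g(x,y).

    Returns None when the two searches do not overlap at all, since the
    formula would then divide by zero.
    """
    if g_xy == 0:
        return None
    return g_x * g_y / g_xy

# Made-up counts: 50,000 and 20,000 single hits, 1,000 hits for both terms.
print(estimate_total_pair(50_000, 20_000, 1_000))  # 1000000.0
```

Returning None for a zero overlap is just one possible design choice; it keeps a failed test visibly different from a numeric estimate.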
You can do a nastier version of the above for n terms x_1, ... , x_n. In that case,
T = ((g(x_1) * g(x_2) * ... * g(x_n)) / g(x_1, ... , x_n)) ^ (1/(n-1))
For n = 2 this reduces to the pair formula above.
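The n-term version follows from the same model: g(x_1, ... , x_n) is the product of the g(x_i)/T times T, so the product of the single counts divided by the joint count equals T^(n-1). Here is a hedged sketch of that estimate; the name estimate_total_n and the numbers in the usage line are assumptions for illustration only.

```python
import math
from typing import Optional, Sequence

def estimate_total_n(single_counts: Sequence[int], joint_count: int) -> Optional[float]:
    """Estimate T from n single-term hit counts and the hit count for all
    n terms together: T = (product of singles / joint) ** (1 / (n - 1))."""
    n = len(single_counts)
    if n < 2:
        raise ValueError("need at least two search terms")
    if joint_count == 0:
        return None  # no page matches every term, so no estimate
    ratio = math.prod(single_counts) / joint_count
    return ratio ** (1.0 / (n - 1))

# Toy check: a 1,000-page web with frequencies 0.5, 0.2 and 0.1 would give
# single counts 500, 200, 100 and a joint count of 0.5 * 0.2 * 0.1 * 1000 = 10.
print(estimate_total_n([500, 200, 100], 10))
```

math.prod needs Python 3.8 or later.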
Ok, now the tests! I will do 3 tests with pairs, and 3 tests with triplets. I will get the test words by doing "random node" and taking the first word besides "the" in the node title.
T is the number of pages indexed by Google.
g(x) is the number of hits for "x" on google.
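Pair 1:
g(x) = 46,100,000
g(y) = 94
g(x,y) = 24
T = 1.8e8

Pair 2:
g(?) = 7,310
T = 3.8e8

Pair 3:
g(x) = 7,050,000
g(y) = 836
g(x,y) = 2
T = 2.9e9

Triplet 1:
g(?) = 956
g(?) = 7,160,000
g(?) = 0
T = ?

Triplet 2:
g(?) = 60,200,000
g(?) = 2,200,000
T = 1.3e8

Triplet 3:
g(?) = 112,000,000
g(?) = 2,880,000
g(?) = 2,560,000
T = 1.4e8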
I got 5 results for T: 1.8e8, 3.8e8, 2.9e9, 1.3e8, and 1.4e8. The average is 7.6e8 = 760,000,000.
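To make the arithmetic easy to re-check, here is a small sketch that averages the five quoted estimates and recomputes the two pair tests for which all three hit counts appear above. The variable names are made up for this example. Averaging the two-digit figures as quoted gives roughly 7.5e8, slightly under the 7.6e8 stated, which may just reflect rounding in the individual estimates.

```python
from statistics import mean

# Estimates of T exactly as quoted in the text (already rounded to two digits).
estimates = [1.8e8, 3.8e8, 2.9e9, 1.3e8, 1.4e8]
print(f"mean of quoted estimates: {mean(estimates):.2e}")

# Pair tests whose counts g(x), g(y), g(x,y) are all given in the text.
pair_tests = [(46_100_000, 94, 24), (7_050_000, 836, 2)]
recomputed = [g_x * g_y / g_xy for g_x, g_y, g_xy in pair_tests]
for value in recomputed:
    print(f"recomputed pair estimate: {value:.2e}")
```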
After doing all this, someone told me that google actually claims to index 2e9 = 2,000,000,000 pages. So, the accuracy is not too horrible.
I am sure this has already been done in real math, and has a real name. Does anyone know it?
I know that google publicizes the number of pages they claim to index. However, this technique can be used on other search engines as well, whose credibility is in doubt.