Recall analyzer
The recall analyzer estimates the relative index sizes of Internet search engines (SEs).
The total number of indexed documents reported by an SE itself cannot be used for comparison, because different SEs count documents in different ways. For example, some include duplicate documents in the count while others do not. Counting duplicates can double the reported index size or inflate it even further.
Additionally, index size is a very PR-sensitive issue, as it is one of the very few simple notions in the SE area that journalists readily understand. The bigger the index size you report, the better the press you get.
The number of documents found for a particular query does not always reflect the real number of documents indexed by an SE. Almost every frequent query returns tens of thousands of results in every search engine, but the user is never allowed to see them all: the search session is cut off after browsing through the first few hundred pages. The exact number of web pages found can therefore be verified only when the number of possible results is very small, that is, for queries containing very rare words.
For multiple-word queries, certain SEs show in the search results not only the documents that contain all the words of the query, but also documents that contain only some of them. These "tail" documents are usually irrelevant to the query, but counting them increases the total number of pages found.
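To illustrate how "tail" documents can be kept out of a count, here is a minimal sketch that keeps only the documents containing every word of the query. It assumes the text of each result is available for checking; the function names `contains_all_terms` and `strict_matches` are hypothetical and are not part of the analyzer described here.

```python
import re


def contains_all_terms(query, text):
    """Return True if every word of the query occurs in the text."""
    terms = set(re.findall(r"\w+", query.lower()))
    words = set(re.findall(r"\w+", text.lower()))
    return terms <= words


def strict_matches(query, documents):
    """Drop 'tail' documents that contain only some of the query words."""
    return [doc for doc in documents if contains_all_terms(query, doc)]
```

With such a filter, a result that mentions only one word of a two-word query would not be counted.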
To obtain independent and reliable data on the relative index sizes of the popular SEs, we developed a simple automatic method based on a set of sample queries. We gathered a set of very rare words, each of which occurs several tens of times on the web. Once a day, we count how many of these occurrences each search engine finds.
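The counting step can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say how the reference set of occurrences is built, so the sketch takes it to be the union of the pages found by any engine, and the function name `relative_recall` and the input layout are hypothetical.

```python
def relative_recall(found_by_engine):
    """Estimate each engine's share of the known occurrences.

    found_by_engine maps engine name -> {rare word -> set of page URLs}.
    Assumption: the known occurrences of a word are the union of the
    pages that any engine returned for it.
    """
    # Build the reference pool of known occurrences per word.
    pool = {}
    for results in found_by_engine.values():
        for word, urls in results.items():
            pool.setdefault(word, set()).update(urls)

    total = sum(len(urls) for urls in pool.values())

    recall = {}
    for engine, results in found_by_engine.items():
        found = sum(len(results.get(word, set())) for word in pool)
        recall[engine] = found / total if total else 0.0
    return recall
```

An engine that finds 3 of 4 known occurrences would score 0.75 under this definition. The figure is relative to the pooled sample, not an absolute index size.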
To make the data more stable, each day we use a different set of sample queries drawn from the whole query pool.
The set of sample queries is constantly replenished by our linguists. If you have some rare words and want to help us cover the 'faraway' areas of the Net, please send us these words, and we will consider including them in the sample query list.
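One possible way to draw a different query set each day is shown below. It is a sketch, not the selection procedure actually used: seeding the random generator with the date, the sample size `k`, and the function name `daily_sample` are all assumptions made for illustration.

```python
import random
from datetime import date


def daily_sample(pool, day, k):
    """Pick k queries from the pool, reproducibly for a given day."""
    # Seeding with the day's ordinal makes the choice repeatable for
    # that date while varying from one day to the next.
    rng = random.Random(day.toordinal())
    # Sort first so the result does not depend on the pool's ordering.
    return rng.sample(sorted(pool), min(k, len(pool)))
```

Calling `daily_sample(words, date.today(), 50)` twice on the same day returns the same list, so a run can be repeated and checked.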
- 90−100%
- 80−90%
- 60−80%
- 40−60%
- 20−40%
- 0−20%