Analyzer of navigational search

A search query with a purpose of finding a certain website is called a navigational query. Such queries include "sberbank", "komsomolskaya pravda", "rambler", "gazeta ru", etc.

The best result for a navigational query is the required site in the first position of the search results. To evaluate navigational search, the search engines are tested with 200 queries randomly selected from the array of navigational queries. Each query is assigned one or more sites (markers). The top 10 search results are checked for these markers. When several markers are assigned to a query, any one of them appearing in the top 10 counts as a hit. The percentage of queries that yield a marker on the first page is then calculated. This number is the aggregate indicator of the quality of navigational search.
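
The calculation described above can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the function names, the data layout (a dictionary of result URLs per query) and the host-name comparison that ignores a leading "www." are all assumptions.

```python
from urllib.parse import urlparse


def host(url: str) -> str:
    """Return the lower-cased host name without a leading 'www.'."""
    # Bare domains such as "sberbank.ru" need a "//" prefix to be parsed as hosts.
    netloc = urlparse(url if "://" in url else "//" + url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def navigational_score(results: dict, markers: dict) -> float:
    """Percentage of queries with at least one marker site in the top 10.

    results: query -> ordered list of result URLs (hypothetical layout)
    markers: query -> set of expected sites for that query
    """
    hits = 0
    for query, sites in markers.items():
        top10 = {host(u) for u in results.get(query, [])[:10]}
        if top10 & {host(s) for s in sites}:
            hits += 1  # any one marker on the first page is a hit
    return 100.0 * hits / len(markers)
```

With two queries, one of which returns its marker site on the first page, the function returns 50.0.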

The best search engine is the one with the highest aggregate indicator for this analyzer. In the informer, the search engines are sorted by the aggregate indicator.

Analyzer of subject search

A human being is often able to interpret a search query, determine what the user wants, evaluate the information on the Web and form the search results better than a machine. For this reason, the results formed by an expert are always better than those of an algorithm. This analyzer monitors daily the search results for a set of queries, for which the corresponding links have been selected by experts. It calculates the number of sites found by each search engine that were on the experts' list. For each query, the percentage of similarity between algorithmic and expert results is calculated.

As an expert opinion, the output of the expert system Neuron is used. The aggregate indicator of this analyzer is the proportion of results matching the expert opinion, regardless of the position of the matching site(s).
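
A minimal sketch of this indicator is given below. It assumes that the denominator is the total number of results examined for the engine; the text does not state this explicitly. The function name and data layout are hypothetical.

```python
def expert_match_share(engine_results: dict, expert_sites: dict) -> float:
    """Percentage of an engine's results that appear on the experts' list.

    engine_results: query -> list of sites returned by the engine
    expert_sites:   query -> set of sites selected by the experts
    Position within the results is deliberately ignored.
    """
    matched = 0
    total = 0
    for query, sites in engine_results.items():
        expected = expert_sites.get(query, set())
        matched += sum(1 for site in sites if site in expected)
        total += len(sites)
    return 100.0 * matched / total if total else 0.0
```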

The best search engine is the one with the highest value of the aggregate indicator. In the informer, the search engines are sorted by the aggregate indicator.

Currently, 18 queries are evaluated. The number of the queries will be increased.

Analyzer of correct hints

Most of the search engines attempt to suggest a correct spelling for a query in case a typo is suspected. The quality of such hints is an important addition to the overall quality of the search. This analyzer looks for the correct hint in the search results for a query with a deliberate typo and estimates the number of occurrences of a 'correct' query contained in the hint. The evaluation is based on the same set of queries containing typos that is used for the typo resistance analyzer. The more correct hints have been given, the higher is the search engine's index for this analyzer. This determines the order of search engines in the informer of the analyzer.
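
As a rough sketch, the check can be expressed as a containment test: does the hint shown by the engine contain the correct query? The function name, the input layout and the case and whitespace normalization are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def hint_score(cases: List[Tuple[str, Optional[str]]]) -> float:
    """Percentage of mistyped queries whose hint contains the correct query.

    cases: (correct_query, hint_text or None when no hint was shown)
    """
    def norm(text: str) -> str:
        # Compare case-insensitively and ignore repeated whitespace.
        return " ".join(text.lower().split())

    good = sum(
        1
        for correct, hint in cases
        if hint is not None and norm(correct) in norm(hint)
    )
    return 100.0 * good / len(cases) if cases else 0.0
```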

Typo resistance analyzer

Humans are not machines: they make mistakes, including mistakes in typing a search query. A user may hit a neighboring key by accident ("quety" instead of "query"), double a character or miss one ("queery" or "qury"), or type a word 'by ear' without knowing the correct spelling ("yandax" instead of "yandex"). In these cases, the search engine can follow one of the following strategies:
1) no processing: search with exact spelling only
2) recognize the typo, but still search for the entered query with an additional hint: "perhaps you were looking for [correct spelling]?"
3) recognize the typo and search for the correct spelling immediately

Depending on the chosen strategy, the user either remains unaware of the fact that (s)he is mistaken, or notices it and makes an extra click (up to the user), or gets the correct results without ever noticing his own mistake.

This analyzer compares the search results of the "correct query" and several forms of its possible mistypings. The similarity of results to those of a "correct" query is evaluated.


Apart from deliberate typo correction, matches can arise in four cases:
1) accidentally
2) the page contains both the correct and mistyped spelling
3) incorrect reaction of the engine's morphology (e.g., the unknown word "mushroomz" which is a typo of "mushrooms" is corrected to "mushroom")
4) promotion of the same websites both for correct and incorrect spelling of queries

All of these cases produce noise in this analyzer: an accidental match of results.
The similarity is evaluated in the same way as for the update analyzer but with a different set of queries.

The more matching results are registered, the higher is the index of the search engine for this analyzer. This determines the order of search engines in the informer of the analyzer.
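
One possible reading of "the same way as for the update analyzer" is that the position changes between the two result lists are summed as in the Update analyzer section, and the similarity is the complement of that sum. The sketch below follows this assumption; the function name is hypothetical and both lists are expected to hold 10 results.

```python
def typo_similarity(correct_top10: list, typo_top10: list) -> float:
    """Similarity (0..1) between results for the correct and mistyped query.

    Assumption: similarity = 1 - sum(Di)/100, where Di is the absolute
    change in position, or 10 if the page is missing from the typo results.
    """
    pos = {url: i for i, url in enumerate(typo_top10[:10], start=1)}
    change = sum(
        abs(pos[url] - i) if url in pos else 10
        for i, url in enumerate(correct_top10[:10], start=1)
    )
    return 1 - change / 100
```

Identical lists give 1.0, and two lists of 10 results with no pages in common give 0.0.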

In the future, the query sets with typos will be rotated, drawing on a larger array of queries.

Quotation search quality analyzer

Sometimes people search for a certain text using its known fragment – they do “quotation search”. This method is frequently used to find original literary work. In response to quotation queries, a quality search engine should return a link to a web page containing the text of the work that the quotation was taken from. Ideally, the relevant link should appear first.

For example, a user submitting the query "To be or not to be, that is the question" is most certainly looking for the text of Shakespeare's Hamlet. A link to this text should appear first in the search results. Quotation queries are usually longer than other queries and are rarely repeated verbatim. Overall, however, quotation queries make up a considerable share of all queries.

To estimate the quality of quotation search, the search engines are tested using 50 queries, randomly selected from the whole array of quotation queries. Each query has one or more text fragments that must be included in the text of the pages listed in the search results. The Top 10 results are checked for these text fragments (called ‘marker fragments’).
For each search engine, we calculate the percentage of queries, for which the marker fragment was found in Top 10. This number is used as the aggregate indicator of the quality of quotation search.
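
The marker-fragment check can be sketched as below. The sketch assumes that the text of each top 10 page has already been downloaded, and it compares text case-insensitively with collapsed whitespace; both the names and the normalization are illustrative assumptions.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lower-case the text before comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quotation_score(pages_by_query: dict, fragments_by_query: dict) -> float:
    """Percentage of queries with a marker fragment in a top 10 page.

    pages_by_query:     query -> list of page texts, in result order
    fragments_by_query: query -> list of marker fragments
    """
    hits = 0
    for query, fragments in fragments_by_query.items():
        pages = [normalize(p) for p in pages_by_query.get(query, [])[:10]]
        wanted = [normalize(f) for f in fragments]
        if any(f in page for page in pages for f in wanted):
            hits += 1
    return 100.0 * hits / len(fragments_by_query)
```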

Analyzer of search spam level

At "Ashmanov and Partners" we study the phenomenon of search spam – the methods and technologies reducing the quality of search results and interfering with the operation of search engines.

Search spam is a text, URL, technology, program code or other web elements created by the web-master for the sole purpose of promoting the site in search engines' results, and not for a fast and reliable search based on complete and authentic information. The experts check Top 10 results of search queries on a regular basis, marking the sites which, in their opinion, contain elements of search spam. The collected data is entered into the informer. It shows the percentage of sites marked as spam in the overall number of sites that appeared in Top 10 of analyzed queries.

The source of information on the spam status of a given URL is the data of the anti-spam lab of the company "Ashmanov and Partners". The following categories of search spam are used:

* doorway – definite spam: doorways, leading the user to other pages,
* spamcatalog – definite spam: spammer catalogues,
* spamcontent – definite spam: spammers' stolen content,
* pseudosite – definite spam: site disguised as corporate (pseudo-company),
* catalog – catalogues,
* board – bulletin boards,
* domainsale – domains for sale,
* secondary – secondary, stolen content,
* partner – any partner programs,
* linksite – link support site,
* spamforum – forum containing spam,
* techspam – technical spam,
* searchres – search results.

The aggregate indicator is the share of spam sites in the search results. The best search engine has the lowest indicator. This determines the order of search engines in the informer of the analyzer.
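
A sketch of the share calculation follows. The text does not say which of the categories above count toward the indicator, so the set of counted categories is a parameter; defaulting to the four "definite spam" categories is an assumption, as are the names used.

```python
# Assumption: by default only the categories described as "definite spam" count.
DEFINITE_SPAM = frozenset({"doorway", "spamcatalog", "spamcontent", "pseudosite"})


def spam_share(labels: list, spam_categories=DEFINITE_SPAM) -> float:
    """Percentage of top 10 results marked with a counted spam category.

    labels: one entry per result, a category name or None if unmarked
    """
    if not labels:
        return 0.0
    spam = sum(1 for label in labels if label in spam_categories)
    return 100.0 * spam / len(labels)
```

Passing a wider set, for example `DEFINITE_SPAM | {"catalog"}`, shows how the indicator changes when more categories are counted.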

SEO-pressing analyzer

Many queries are ambiguous, for instance: ‘design’, ‘cars’, ‘sports’, etc. These queries are called ‘informational’. The best result for such a query would be a selection of links to the resources representing different meanings of the query. Thus, the output for the query "design" should contain links to the websites on web-design, landscape design, interior design, etc. It is not easy to compile a high-quality multi-thematic set of links, especially considering the fact that site optimizers abuse popular informational queries to promote their customers' sites. Due to such SEO-"pressing", the top is taken over by resources whose promotion is most profitable, so the results become monotonous, consisting of websites with the same kind of commercial offers.
The analyzer searches the title phrases and snippets of the top 10 search results for similar lines. The summarizing index is the percentage of similar lines in the overall number of sites found in the top 10 results for the analyzed queries. The higher this index is, the higher is the SEO-pressing on the given search engine.
Typical words and phrases in the title or quotation are considered an indication of monothematicity. The percentage of search results that include "marker phrases" is the aggregate indicator.
The best search engine has the lowest aggregate indicator for this analyzer. This determines the order of search engines in the informer of the given analyzer.
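
The marker-phrase indicator can be sketched as follows. The function name, the (title, snippet) input layout and the case-insensitive substring match are assumptions; the real analyzer may detect similar lines differently.

```python
def seo_pressing(results: list, marker_phrases: list) -> float:
    """Percentage of results whose title or snippet contains a marker phrase.

    results: list of (title, snippet) pairs from the top 10 of all queries
    """
    phrases = [p.lower() for p in marker_phrases]
    flagged = sum(
        1
        for title, snippet in results
        if any(p in f"{title} {snippet}".lower() for p in phrases)
    )
    return 100.0 * flagged / len(results) if results else 0.0
```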

Analyzer of 'adult sites' presence in the search results

This analyzer is currently running in test mode while the pornography detection for text documents is being fine-tuned. The results may be incorrect.

This analyzer collects search results for ambiguous queries which may be interpreted as targeting a certain category of pornography, but also admit other interpretations. No queries which unambiguously indicate that the user is searching for porn are included.

For instance, a query "stockings" could come from a user looking for a stockings shop or for the corresponding category of pornography. In order to detect pornography, the search results are processed using the technology "Semantic Mirror" developed by our company. Web content that is assigned to the category /Dosug/Adult or any of its subcategories is considered "adult content".

For every search engine, the percentage of documents in its top 10 results belonging to these categories is calculated.
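
Assuming each document has already been assigned a category path by the classifier, the percentage can be sketched like this. The subcategory names in the usage example and the function names are hypothetical; only the root /Dosug/Adult comes from the text.

```python
ADULT_ROOT = "/Dosug/Adult"


def is_adult(category: str) -> bool:
    """True for /Dosug/Adult itself or any of its subcategories."""
    return category == ADULT_ROOT or category.startswith(ADULT_ROOT + "/")


def adult_share(categories: list) -> float:
    """Percentage of top 10 documents classified as adult content."""
    if not categories:
        return 0.0
    return 100.0 * sum(is_adult(c) for c in categories) / len(categories)
```

Comparing against the root plus a trailing slash avoids counting an unrelated category that merely starts with the same characters.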

The "adult sites" presence analyzer implies no value judgment. This means that we do not affirm that a search engine with a high percentage of porn in its search results is "bad" or "immoral".

Recall analyzer

The recall analyzer estimates the relative size of the indices of Internet search engines (SEs). The total number of documents indexed, as reported by the SE itself, cannot be used for comparison because different SEs count documents in different ways. For example, some of them include duplicate documents in the count while others do not. Counting the duplicate documents may double the reported index size or increase it even more.
Additionally, the size of the index is a very PR-sensitive issue, as it is one of the very few simple notions in the SE area that journalists easily understand. This means that the bigger the index size you report, the better the press you get.

The number of documents found for a particular query does not always reflect the real number of documents indexed by an SE. Almost every frequent query will return tens of thousands of results in all search engines. But the user will never be allowed to see them all: the search session will be interrupted after browsing through the first few hundred pages. Thus the exact number of web pages found can be verified only when the number of possible results is very small, that is, for queries containing very rare words.

For multiple-word queries, certain SEs show in the search results not only the documents where all the words comprising the query are found, but also the documents containing single words from the query. These "tail" documents are usually irrelevant to the query, but counting them can increase the total number of pages found.

In order to obtain independent and reliable data on the relative index size of the popular SEs, we developed a simple automatic method based on a set of sample queries. We gathered a set of very rare words, each of which occurs several tens of times on the web. Once a day, we count how many of these occurrences are found by each search engine.
To make the data steadier, we use a different set of sample queries from the whole query pool every day.
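
The daily procedure can be sketched as below. Seeding the sample by the date and expressing each engine's total relative to the largest total are assumptions made for illustration, as are the function names; the text only says that a different sample is used every day and that occurrences are counted.

```python
import random
from datetime import date


def daily_sample(pool: set, day: date, size: int) -> list:
    """Pick the day's sample of rare words, reproducibly for a given date."""
    rng = random.Random(day.toordinal())  # assumption: seed by calendar day
    return rng.sample(sorted(pool), min(size, len(pool)))


def relative_recall(found: dict) -> dict:
    """Relative index size per engine, scaled so the largest engine is 1.0.

    found: engine -> {rare word -> number of documents found for it}
    """
    totals = {engine: sum(counts.values()) for engine, counts in found.items()}
    best = max(totals.values(), default=0)
    return {e: (t / best if best else 0.0) for e, t in totals.items()}
```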

The set of sample queries is constantly replenished by our linguists. If you have some rare words and want to help us cover the 'faraway' areas of the Net, please send us these words, and we will consider including them into the sample queries list.

Update analyzer

‘Update’ refers to the process of renewing search results. When the results are updated, some sites may make it into the top 10 while others "sink". Every search engine has its own update style, which becomes clear in this analyzer. Every day the update analyzer monitors the top 10 results for 140 queries in order to assess how many sites changed their positions and by how much. Let Di be the absolute change in position for the page that appeared i-th in the top 10 search results on day 1. For example, if the fifth page from the first day's top 10 appeared third or seventh on the second day, D5=2. If the second day's top 10 does not contain a page that was present on the first day, we assume that Di=10 for that page.

The update indicator is calculated using the formula:

$$ \frac{1}{100} \sum_{i=1}^{10} D_i $$

Consider a couple of examples:
Example 1
On Day 1, a certain query has the following Top 10:
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10.
On Day 2, the same query has this Top 10:
Cn, C1, C2, C3, C4, C5, C6, C7, C8, C9.

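In this case each of the nine pages C1 to C9 moved down by one position and C10 dropped out of the top 10 (D10=10), so the update indicator is calculated as follows:
((2-1)+(3-2)+(4-3)+...+(10-9)+10)/100 = (9+10)/100 = 0.19 (19%)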

Example 2
For Day 1, a certain query has the following Top 10:
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10.
For Day 2, the same query has this Top 10:
Cn1, Cn2, Cn3, Cn4, Cn5, Cn6, Cn7, Cn8, Cn9, Cn10.

In this case the update indicator equals:
10*10/100 = 1.00 (100%)
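
The formula and both examples can be reproduced with a short sketch. The function name is hypothetical, and the code assumes each list holds the 10 results of one query in ranked order.

```python
def update_indicator(day1: list, day2: list) -> float:
    """Sum of position changes Di over the day 1 top 10, divided by 100.

    Di is the absolute change in position, or 10 if the page is no
    longer present in the day 2 top 10.
    """
    pos2 = {url: i for i, url in enumerate(day2[:10], start=1)}
    total = 0
    for i, url in enumerate(day1[:10], start=1):
        total += abs(pos2[url] - i) if url in pos2 else 10
    return total / 100


day1 = [f"C{i}" for i in range(1, 11)]
example1 = update_indicator(day1, ["Cn"] + day1[:9])               # 0.19
example2 = update_indicator(day1, [f"Cn{i}" for i in range(1, 11)])  # 1.0
```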

The analyzer also calculates the additional parameters: the number of sites which disappeared from the search results and the number of sites which changed their positions.

This analyzer implies no value judgment. The results can be interpreted in two ways: a search engine that has frequent large updates could be considered more up-to-date; a search engine with rare updates can be considered more stable and predictable. The informer of this analyzer sorts the search engines in ascending order of update level.

Click analyzer

This analyzer shows what percentage of clicks leading to Russian web pages comes from each search engine. Unlike the other analyzers, this one does not directly assess the search quality. Rather it reflects the popularity and usage of the search engines. The analyzer utilizes the data from Liveinternet.ru. We only take into account the clicks on sites that have a Liveinternet.ru counter installed. Out of all the data of the LiveInternet counter, we only take into account the data on Russian users (Russian IP addresses). This is done to filter out the noise produced by the so-called "idiot clicks", i.e. random clicks of non-Russian-speaking users of "big" search engines such as Google, MSN Live Search, and Yahoo. These are not really Russian search engine users, but they can significantly distort the statistics (since the Internet outside Russia is vast, and the number of such random users is high).

The numbers cited in this analyzer are usually taken to be the search engines' market shares, but this is not quite correct. Here is why:
a) The LiveInternet counter only shows clicks on the sites where it is installed. Some big websites do not install it. Thus the statistics are not, strictly speaking, representative of the whole Russian Internet.

b) It is unclear how exactly the percentage of clicks from a search engine correlates with its true popularity. What if, using a "bad" search engine, the user has to click on multiple search results before (s)he finds the right site, while using a "good" one (s)he finds what (s)he needs at the first click? The "bad" search engine would in this case generate many clicks per user while the "good" one would generate only one. In general, the exact connection between popularity and clicks is unknown. A huge change in the percentage of clicks (say, 5 points or more) would probably reflect a real change in attendance of a search engine. Smaller fluctuations (1-2%) are probably less informative.

It is important to keep in mind that these figures represent percentages, not the absolute attendance or the absolute number of clicks. Thus the small dips clearly visible on the monthly graph of Yandex are mirrored by small increases on the part of Google. The attendance of Yandex decreases on weekends while that of Google suffers less (the reason is unknown to us). Since the share of Yandex is high, its decrease results in a corresponding growth of the share of Google on weekends (the sum of all search engines' shares remains constant). For Rambler, the weekend decrease is just as pronounced as it is for Yandex, so its share does not rise in the way that of Google does.
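
The effect is easy to see in a small sketch. The click counts below are invented purely for illustration and are not LiveInternet data; the function name is also hypothetical.

```python
def click_shares(clicks: dict) -> dict:
    """Convert absolute click counts per engine into percentage shares."""
    total = sum(clicks.values())
    return {e: 100.0 * n / total for e, n in clicks.items()} if total else {}


# Hypothetical numbers: every engine loses clicks on the weekend,
# but Google loses proportionally less than Yandex and Rambler.
weekday = click_shares({"Yandex": 600, "Google": 300, "Rambler": 100})
weekend = click_shares({"Yandex": 480, "Google": 290, "Rambler": 80})
```

Here Google's absolute clicks fall from 300 to 290, yet its share rises from 30% to about 34%, while the shares of Yandex and Rambler both fall.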

In the informer of this analyzer, the search engines are arranged in the descending order of the share of clicks.
