
How search engines carry out searches. How search engines work: snippets, the reverse index algorithm, page indexing, and the particular features of Yandex

Hello, dear readers!

There are quite a few search engines on the global Internet today. Each has its own algorithms for indexing and ranking sites, but in general the principles by which search engines operate are quite similar.

Knowing how a search engine works is a significant advantage in the face of rapidly growing competition when promoting not only commercial but also informational sites and blogs. This knowledge helps you build an effective website optimization strategy and reach the TOP of the search results for the promoted query groups with less effort.

Search engine principles

The optimizer's goal is to "adjust" the promoted pages to the search algorithms and thereby help those pages reach high positions for certain queries. But before starting to optimize a site or blog, you need at least a superficial understanding of how search engines work, so you can anticipate how they may react to the optimizer's actions.

Of course, the exact details of how search results are formed are information that search engines do not disclose. However, to direct your efforts correctly, it is enough to understand the main principles by which search engines work.

Information search methods

The two main methods used by search engines today differ in their approach to finding information.

  1. The direct search algorithm, which matches each document stored in the search engine's database against the key phrase (the user's query), is a fairly reliable method that finds all the relevant information. Its disadvantage is that searching across large data sets takes a long time.
  2. The reverse index algorithm, in which the key phrase is compared against a list of documents that contain it, is convenient when working with databases holding tens or hundreds of millions of pages. With this approach the search is performed not over all documents, but only over special files containing lists of the words found on site pages. Each word in such a list is accompanied by the coordinates of the positions where it occurs and by other parameters. It is this method that is used today by such well-known search engines as Yandex and Google.

It should be noted that when a user types a query into the browser's search bar, the search is not performed directly across the Internet, but in pre-collected, saved, and currently up-to-date databases containing blocks of information (site pages) processed by the search engines. Fast generation of search results is possible thanks to working with reverse indexes.

The text content of pages (the direct index) is also stored by search engines and is used to automatically generate snippets from the text fragments that best match the query.
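
To make the idea of a reverse (inverted) index and a direct index more concrete, here is a minimal sketch in Python. It only illustrates the principle described above, not how Yandex or Google actually store their data; the sample documents and function names are invented for the example.

```python
from collections import defaultdict

# A toy "direct index": document id -> full page text.
documents = {
    1: "How search engines work: crawling, indexing and ranking pages",
    2: "Buy banner fabric: specifications, photos and price list",
    3: "Search engines use an inverted index to answer queries quickly",
}

# A toy "reverse index": word -> {document id: [positions of the word]}.
inverted_index = defaultdict(lambda: defaultdict(list))
for doc_id, text in documents.items():
    for position, word in enumerate(text.lower().split()):
        inverted_index[word][doc_id].append(position)

def search(query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    found = set(inverted_index[words[0]].keys())
    for word in words[1:]:
        found &= set(inverted_index[word].keys())
    return found

def snippet(doc_id, query, width=4):
    """Build a crude snippet from the direct index around the first query word found."""
    words = documents[doc_id].split()
    query_words = set(query.lower().split())
    for i, word in enumerate(words):
        if word.lower().strip(",.:") in query_words:
            return " ".join(words[max(0, i - width): i + width + 1])
    return " ".join(words[: 2 * width])

for doc_id in search("inverted index"):
    print(doc_id, "->", snippet(doc_id, "inverted index"))
```

A real index stores far more per word (word forms, weights, document zones), but the lookup principle, going from a word to the list of documents and positions where it occurs, is the same.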

Mathematical model of ranking

To speed up the search and simplify the formation of results that best match the user's query, a certain mathematical model is used. The task of this mathematical model is to find the relevant pages in the current reverse-index database, assess how well they match the query, and arrange them in descending order of relevance.

Simply finding the right phrase on a page is not enough. When search engines rank documents, they calculate the weight of the document relative to the user's query. For each query this parameter is computed from two values: the frequency with which the word is used on the analyzed page, and a coefficient reflecting how rarely the same word occurs in other documents in the search engine's database. The product of these two values is the weight of the document.

Of course, the algorithm presented here is greatly simplified, since search engines use a number of additional coefficients in their calculations, but the idea stays the same: the more often a single word from the user's query occurs in a document, the higher that document's weight. At the same time, the text content of a page is treated as spam if certain limits, which differ for each query, are exceeded.
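
The weight described above, the frequency of a word on the page multiplied by a coefficient for how rarely the word occurs in the rest of the database, is essentially the classic TF-IDF scheme. Below is a minimal sketch in Python assuming naive whitespace tokenization; the documents and the absence of any additional coefficients are simplifications made for the illustration.

```python
import math

# A toy document database.
documents = {
    "page_a": "banner fabric price banner fabric photo",
    "page_b": "banner printing services and design",
    "page_c": "weather forecast for tomorrow",
}

def document_weight(doc_text, query_word, corpus):
    """Toy document weight: term frequency multiplied by inverse document frequency."""
    word = query_word.lower()
    words = doc_text.lower().split()
    tf = words.count(word) / len(words)                      # how often the word is used on this page
    containing = sum(word in text.lower().split() for text in corpus.values())
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)                 # the rarer the word across the database, the higher
    return tf * idf

query = "banner"
ranked = sorted(documents, key=lambda d: document_weight(documents[d], query, documents), reverse=True)
for doc_id in ranked:
    print(doc_id, round(document_weight(documents[doc_id], query, documents), 4))
```

Running this ranks page_a above page_b (the word occurs there more often) and leaves page_c with zero weight, which matches the logic described in the text.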

The main functions of the search engine

All existing search engines are designed to perform several important functions: searching for information, indexing it, assessing its quality, ranking it correctly, and forming the search results. The primary task of any search engine is to provide the user with the information he is looking for, the most accurate possible answer to a specific query.

Since most users have no idea how Internet search engines work, and the opportunities to teach users to search "correctly" are very limited (for example, search hints), developers are forced to improve the search itself. That means creating algorithms and principles of operation that allow the required information to be found regardless of how "correctly" the search query is formulated.

Crawling

Crawling is the tracking of changes in already indexed documents and the search for new pages that can be shown in the results for user queries. Search engines crawl resources on the Internet using specialized programs called spiders or search robots.

Crawling Internet resources and collecting data is performed by search bots automatically. After the first visit to a site and its inclusion in the search database, the robots begin to visit it periodically in order to track and record changes in its content.

Since the number of evolving resources on the Internet is huge and new sites appear daily, this process never stops for a minute. This way of working allows search engines to always have up-to-date information about the sites available on the network and their content.

The main task of the search robot is to find new data and pass it to the search engine for further processing.
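
As a rough illustration of what such a robot does, here is a minimal breadth-first crawl loop in Python (it assumes the third-party requests and beautifulsoup4 packages). Politeness delays, robots.txt, duplicate detection, and scheduling of re-visits, which real spiders need, are deliberately left out, and process_page is just a stand-in for handing the page to the indexing stage described below.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def process_page(url, html):
    # Stand-in for passing the page on to the indexing stage.
    print("fetched", url, len(html), "bytes")

def crawl(start_url, max_pages=20):
    """Breadth-first crawl: fetch a page, hand it over for processing, queue its links."""
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        process_page(url, response.text)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))     # newly discovered URLs go back into the queue

crawl("https://example.com")
```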

Indexing

A search engine can find data only on sites that are in its database, in other words, on indexed sites. At this stage the search engine must decide whether the information found should be entered into the database and, if so, into which section. This process is also performed automatically.

It is believed that Google indexes almost all information available on the network, while Yandex approaches indexing more selectively and not as quickly. Both search giants of the Runet work for the benefit of the user, but the general principles of Google and Yandex differ somewhat, since each is based on its own unique software solutions.

What search engines have in common is that indexing an entirely new resource takes longer than indexing new content on sites already known to the system. Information appearing on sites that search engines trust highly gets into the index almost instantly.

Ranking

Ranking is the assessment by the search engine's algorithms of the significance of the indexed data and its arrangement according to factors specific to that search engine. The collected information is processed in order to generate search results for the entire range of user queries. Which information appears higher in the results and which lower is determined entirely by how the chosen search engine and its algorithms work.

The sites in the search engine's database are distributed by topic and by query group. For each query group a preliminary results list is formed, which is then further adjusted. The positions of most sites change after each update of the results, that is, after each ranking update, which happens daily in Google and once every few days in Yandex.

Humans as helpers in the fight for the quality of search results

The reality is that even the most advanced search engines, such as Yandex and Google, currently still need human help to generate results that meet the accepted quality standards. Wherever the search algorithm does not work well enough, its results are adjusted manually, by evaluating page content against a variety of criteria.

A numerous army of specially trained people from different countries, the moderators (assessors) of the search engines, does a huge amount of work every day checking how well site pages match user queries and filtering out spam and prohibited content (texts, images, video). The assessors' work makes the results cleaner and contributes to the further development of self-learning search algorithms.

Conclusion

As the Internet develops and the standards and formats of content presentation gradually change, the approach to search changes too: the processes of indexing and ranking information and the algorithms used are improved, and new ranking factors appear. All this allows search engines to generate the highest-quality, most adequate results for user queries, but at the same time it complicates life for webmasters and website promotion specialists.

In the comments under the article, I invite you to say which of the main search engines of the Runet, Yandex or Google, in your opinion works better, providing the user with better search results, and why.

Many users need the Internet in order to get answers to the queries (questions) they enter.

If there were no search engines, users would have to search for the necessary sites on their own, remember them, and write them down. In many cases, finding something suitable "manually" would be very difficult, and often simply impossible.

Search engines do all this routine work of finding, storing, and sorting information on sites for us.

Let's start with the well-known Runet search engines.

Internet search engines in Russian

1) Let's start with a domestic search engine. Yandex works not only in Russia but also in Belarus, Kazakhstan, Ukraine, and Turkey. There is also an English-language version of Yandex.

2) The Google search engine came to us from America and has a Russian-language localization.

3) The domestic search engine Mail.ru, which also represents the social networks VKontakte and Odnoklassniki, as well as My World, the famous Answers Mail.ru, and other projects.

4) The intelligent search engine Nigma (http://www.nigma.ru/).

Since September 19, 2017, the "intelligent" Nigma has not been working. It ceased to be of financial interest to its creators, and they switched to another search engine called CocCoc.

5) The well-known company Rostelecom created the Sputnik search engine.

There is also a Sputnik search engine designed specifically for children, which I have written about.

6) Rambler was one of the first domestic search engines:

There are other famous search engines in the world:

  • Bing,
  • Yahoo!,
  • Baidu,
  • Ecosia.

Let's try to figure out how a search engine works: how sites are indexed, how the indexing results are analyzed, and how the search results are formed. The principles of operation of search engines are approximately the same: finding information on the Internet, storing it, and sorting it for output in response to user queries. But the algorithms the search engines use can differ greatly. These algorithms are kept secret, and their disclosure is prohibited.

By entering the same query into the search bars of different search engines, you can get different answers. The reason is that each search engine uses its own algorithms.

Purpose of search engines

First of all, you need to know that search engines are commercial organizations. Their goal is to make a profit. Profit can come from contextual advertising, other types of advertising, and from promoting selected sites to the top lines of the results. In general, there are many ways.

Advertising revenue depends on the size of the search engine's audience, that is, on how many people use it. The larger the audience, the more people the ads will be shown to, and the more that advertising will cost. A search engine can grow its audience through advertising of its own, and also by attracting users through improving the quality of its services, its algorithms, and the convenience of its search.

The most important and most difficult thing here is developing a fully functioning search algorithm that provides relevant results for most user queries.

The work of the search engine and the actions of webmasters

Each search engine has its own algorithm, which must take into account a huge number of different factors when analyzing information and compiling the results in response to a user's query:

  • the age of the site,
  • the characteristics of its domain,
  • the quality and types of content on the site,
  • the site's navigation and structure,
  • usability (user-friendliness),
  • behavioral factors (the search engine can determine whether the user found what he was looking for on the site, or whether he went back to the search results and is again looking for an answer to the same query),
  • and so on.

All this is needed precisely so that the results for a user's query are as relevant as possible and satisfy the user's needs. At the same time, search engine algorithms are constantly changing and improving. As they say, there is no limit to perfection.

On the other hand, webmasters and SEOs are constantly inventing new ways to promote their sites, and these are not always fair. The task of the search algorithm developers is to make changes that keep the "bad" sites of dishonest optimizers out of the TOP.

How does a search engine work?

Now let's look at how the search engine itself operates. Its work consists of at least three stages:

  • crawling,
  • indexing,
  • ranking.

The number of sites on the Internet is simply astronomical. And every site is information, content created for readers (real people).

Crawling

Crawling is the search engine's traversal of the Internet to collect new information, analyze links, and find new content that can be shown to the user in response to his queries. For crawling, search engines have special robots, called search robots or spiders.

Search robots are programs that automatically visit websites and collect information from them. Crawling can be primary (the robot visits a new site for the first time). After the initial collection of information from the site and its entry into the search engine's database, the robot begins to visit its pages with a certain regularity. If anything has changed (new content added, old content removed), all of these changes will be recorded by the search engine.

The main task of the search spider is to find new information and pass it to the search engine for the next stage of processing, that is, for indexing.

Indexing

A search engine can search for information only among those sites that are already in its database (indexed by it). If crawling is the process of finding and collecting the information available on a particular site, then indexing is the process of entering that information into the search engine's database. At this stage, the search engine automatically decides whether a given piece of information should be entered into its database and where, in which section of the database, it belongs. For example, Google indexes almost all the information its robots find on the Internet, while Yandex is more picky and does not index everything.

For new sites, the indexing phase can take a long time, so visitors from search engines may also be a long time coming. But new information appearing on old, well-established sites can be indexed almost instantly and gets into the "index", that is, into the search engines' database, almost immediately.

Ranking

Ranking is the ordering of information that was previously indexed and entered into the database of a particular search engine: which information the search engine will show its users first, and which will be placed lower in "rank". Ranking can be regarded as the stage at which the search engine serves its client, the user.

On the search engine's servers, the collected information is processed and results are generated for a huge range of all kinds of queries. This is where the search engine's algorithms come into play. All the sites in the database are classified by topic, and the topics are divided into query groups. For each query group a preliminary results list can be compiled, which will subsequently be adjusted.
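
As a toy illustration of precomputing a results list per query group and adjusting it later, here is a sketch in Python. The topics, query groups, and scores are invented for the example; real engines use far more elaborate classification and ranking formulas.

```python
# Toy "database": documents with a topic and precomputed relevance scores per query group.
index = [
    {"url": "site-a/banner-fabric", "topic": "advertising materials",
     "scores": {"banner fabric": 0.92, "buy banner": 0.55}},
    {"url": "site-b/banner-shop", "topic": "advertising materials",
     "scores": {"banner fabric": 0.71, "buy banner": 0.88}},
    {"url": "site-c/weather", "topic": "weather",
     "scores": {"weather tomorrow": 0.95}},
]

def build_preliminary_results(index):
    """Group documents by query group and sort each group by score, best first."""
    results = {}
    for doc in index:
        for query_group, score in doc["scores"].items():
            results.setdefault(query_group, []).append((score, doc["url"]))
    for query_group in results:
        results[query_group].sort(reverse=True)
    return results

preliminary = build_preliminary_results(index)
print(preliminary["banner fabric"])

# A later "ranking update" adjusts the cached list, e.g. after new signals arrive.
score, url = preliminary["buy banner"][0]
preliminary["buy banner"][0] = (0.60, url)
preliminary["buy banner"].sort(reverse=True)
print(preliminary["buy banner"])
```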

Why should a marketer know the basic principles of search engine optimization? It's simple: organic traffic is a great source of incoming target audience for your corporate website and even for landing pages.

Here begins a series of educational posts on the topic of SEO.

What is a search engine?

A search engine is a large database of documents (content). Search robots crawl resources and index different types of content; it is these saved documents that are ranked in the search.

In effect, Yandex is a "snapshot" of the Runet (plus Turkey and a few English-language sites), and Google of the global Internet.

A search index is a data structure containing information about documents and the location of keywords in them.

In terms of how they operate, search engines are similar to one another; the differences lie in the ranking formulas (the ordering of sites in the search results), which are based on machine learning.

Every day, millions of users submit queries to search engines.

For example, "write an abstract" or "buy".

But what interests us most is the following.

How is a search engine organized?

To provide users with quick answers, the search architecture is divided into two parts:

  • basic search,
  • metasearch.

Basic Search

Basic search is a program that searches its own part of the index and returns all the documents that match the query.

Metasearch is a program that processes the search query, determines the user's region, and, if the query is popular, returns a ready-made results page; if the query is new, it selects the basic searches, issues the command to select documents, then ranks the found documents using machine learning and returns them to the user.
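
A highly simplified sketch of this two-level arrangement in Python: the index is split into parts served by separate "basic search" workers, and a "metasearch" layer caches popular queries and merges the partial results. All names, the sharding scheme, and the scoring are invented for the illustration; in particular, simple sorting by score stands in for the machine-learning ranking step.

```python
# Each "basic search" owns its shard of the index and returns matching documents with scores.
shards = [
    {"doc1": "how search engines rank pages", "doc2": "banner fabric price list"},
    {"doc3": "search query classification in yandex", "doc4": "weather in moscow"},
]

cache = {}  # metasearch keeps ready-made answers for queries it has already served

def basic_search(shard, query):
    """Return (doc_id, score) pairs from one shard; the score is the number of query-word hits."""
    words = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        score = sum(text.count(word) for word in words)
        if score:
            hits.append((doc_id, score))
    return hits

def metasearch(query):
    """Serve repeated queries from the cache; otherwise ask every shard and merge the answers."""
    if query in cache:
        return cache[query]
    merged = []
    for shard in shards:
        merged.extend(basic_search(shard, query))
    merged.sort(key=lambda hit: hit[1], reverse=True)   # stand-in for the ranking formula
    cache[query] = merged
    return merged

print(metasearch("search query"))
print(metasearch("search query"))   # the second call is answered from the cache
```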

Search query classification

To give the user a relevant answer, the search engine first tries to understand what exactly he needs. The search query is analyzed, and the user is analyzed in parallel.

Search queries are analyzed by parameters:

  • length;
  • definiteness;
  • popularity;
  • competitiveness;
  • syntax;
  • geography.

Query types:

  • navigational;
  • informational;
  • transactional;
  • multimedia;
  • general;
  • official.

After parsing and classifying the query, the ranking function is selected.

The designation of query types is confidential information, and the options proposed here are the guesswork of search engine promotion specialists.
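
In that spirit, the following sketch in Python shows only the general idea of classifying a query by type; the keyword lists and thresholds are pure guesswork, invented for the example, and not any engine's real rules.

```python
def classify_query(query):
    """Guess the query type with naive heuristics; real engines use far richer signals."""
    q = query.lower()
    if any(marker in q for marker in ("buy", "price", "order", "delivery")):
        return "transactional"
    if any(marker in q for marker in ("official site", "login", ".ru", ".com")):
        return "navigational"
    if any(marker in q for marker in ("video", "photo", "mp3", "watch")):
        return "multimedia"
    if any(marker in q for marker in ("how", "why", "what is", "instructions")):
        return "informational"
    if len(q.split()) <= 2:
        return "general"
    return "informational"

for q in ("buy banner fabric", "what is banner fabric", "guardians of the galaxy", "vk.com login"):
    print(q, "->", classify_query(q))
```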

If the user enters a general query, the search engine returns different types of documents. And it should be understood that by promoting a commercial page of a site into the TOP-10 for a general query, you are competing not for one of all 10 places, but only for the number of places that the ranking formula allocates to commercial pages. The probability of ranking in the top for such queries is therefore lower.

MatrixNet is a machine-learning algorithm introduced by Yandex in 2009 that selects the document-ranking function for particular queries.

MatrixNet is used not only in Yandex search but also for scientific purposes. For example, at the European Center for Nuclear Research it is used to find rare events in large amounts of data (they are looking for the Higgs boson).

The primary data for evaluating the effectiveness of a ranking formula is collected by the assessors' department. These are specially trained people who evaluate a sample of sites ranked by an experimental formula according to the following criteria.

Site quality assessment

Vital - the official site (Sberbank, LPgenerator). The search query corresponds to the official website, its groups on social networks, and information on authoritative resources.

Useful (score 5) - a site that provides extended information on the query.

Example query: banner fabric.

A site corresponding to the "useful" rating should contain the following information:

  • what banner fabric is;
  • its specifications;
  • photos;
  • types;
  • a price list;
  • something else.

Examples from the top of the search results:

Relevant+ (Score 4) - This score means that the page matches the search query.

Relevant- (Score 3) - The page does not exactly match the search query.

For instance, the search for "Guardians of the Galaxy showtimes" might display a page about the movie without showtimes, a page for a past showing, or a trailer page on YouTube.

Irrelevant (Score 2) - The page does not match the query.
Example: a query for one hotel's name returns a page about a different hotel.

To promote a resource for a general or informational query, you need to create a page that corresponds to the "useful" rating.

For clear queries, it is enough to meet the "relevant+" score.

Relevance is achieved through textual and link matching of the page with search queries.

Conclusions

  1. Not every query is suitable for promoting a commercial landing page;
  2. Not every informational query is suitable for promoting a commercial site;
  3. When promoting a general query, create a "useful" page.

A common reason why a site does not reach the top is that the content of the promoted page does not match the search query.

We will talk about this in the next article “Checklist for basic website optimization”.

By definition, an Internet search engine is an information retrieval system that helps us find information on the World Wide Web, facilitating the global exchange of information. But the Internet is an unstructured database. It is growing exponentially and has become a huge repository of information. Finding information on the Internet is a difficult task, so there is a need for a tool to manage, filter, and extract this ocean of information. The search engine serves this purpose.

How does a search engine work?

Internet search engines are systems that search for and retrieve information on the Internet. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also called spiders, are small programs that crawl web pages.

Crawlers visit an initial set of URLs. They extract the URLs that appear on the crawled pages and send this information to the crawler control module, which decides which pages to visit next and gives those URLs back to the crawlers.

The topics covered by different search engines vary depending on the algorithms they use. Some search engines are programmed to crawl sites on a specific topic, while the crawlers of others may visit as many places as possible.

The indexing module extracts information from each page it visits and adds the URLs to a database. The result is a huge lookup table of URLs pointing to pages of information; the table covers the pages visited during crawling.

The analysis module is another important part of the search engine architecture. It creates a utility index. The utility index can, for example, provide access to pages of a given length or to pages containing a certain number of pictures.
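
A minimal sketch of what such a utility index might look like, assuming it simply buckets page URLs by coarse attributes such as text length and image count; the attributes and thresholds are invented for the example.

```python
from collections import defaultdict

# Page records as the indexing module might have produced them.
pages = [
    {"url": "site-a/article", "length": 5200, "images": 3},
    {"url": "site-b/gallery", "length": 800, "images": 40},
    {"url": "site-c/note", "length": 300, "images": 0},
]

# Utility index: (attribute, bucket) -> list of URLs.
utility_index = defaultdict(list)
for page in pages:
    size_bucket = "long" if page["length"] > 1000 else "short"
    utility_index[("length", size_bucket)].append(page["url"])
    utility_index[("images", "many" if page["images"] > 10 else "few")].append(page["url"])

print(utility_index[("length", "long")])    # pages of a given length
print(utility_index[("images", "many")])    # pages containing many pictures
```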

During crawling and indexing, the search engine saves the pages it retrieves. They are temporarily stored in the page repository. Search engines maintain a cache of the pages they visit to speed up the retrieval of already visited pages.

The search engine query module receives search queries from users in the form of keywords. The ranking module sorts the results.

The crawler-indexer architecture has many variations. One of them is the distributed search engine architecture, which consists of collectors and brokers. Collectors gather indexing information from web servers, while brokers provide the indexing mechanism and the query interface. Brokers update their indexes based on information received from collectors and from other brokers, and they can filter information. Many search engines today use this type of architecture.

Search engines and page rankings

When we submit a query to a search engine, the results are displayed in a specific order. Most of us tend to visit the pages at the top of the list and ignore those at the end, because we assume the top few pages are more relevant to our query. That is why everyone is interested in getting their pages into the top ten search results.

The words entered in the search engine's query interface are the keywords that are searched for, and the result is a list of pages related to those keywords. During this process, search engines retrieve the pages in which those keywords occur frequently. They look for relationships between keywords. The location of the keywords is also considered, as is the rank of the page containing them. Keywords that appear in page titles or URLs are given more weight. Links pointing to a page make it more popular still: if many other sites link to a page, it is seen as valuable and more relevant.

Every search engine uses a ranking algorithm: a computerized formula designed to return relevant pages for the user's query. Each search engine may have a different ranking algorithm, which analyzes the pages in the engine's database to determine appropriate responses to search queries. Search engines index different information in different ways, so a particular query submitted to two different search engines may return pages in a different order or retrieve different pages altogether. The popularity of a website is one of the determinants of relevance. The click-through popularity of a site, a measure of how often it is visited, is another factor that determines its rank.
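
As a rough sketch of how keyword location and inbound links can both feed into a page's score, here is a toy example in Python; the particular weights are invented and do not correspond to any engine's real formula.

```python
# Toy page records: title, body text, and which other pages link to them.
pages = {
    "page_a": {"title": "banner fabric price", "body": "buy banner fabric in our shop",
               "links_in": ["page_b", "page_c"]},
    "page_b": {"title": "printing services", "body": "banner fabric and other materials",
               "links_in": []},
    "page_c": {"title": "our partners", "body": "links to suppliers of banner fabric",
               "links_in": ["page_a"]},
}

def score(page, query, title_weight=3.0, body_weight=1.0, link_weight=0.5):
    """Keywords in the title count more than in the body; inbound links add popularity."""
    words = query.lower().split()
    title_hits = sum(page["title"].lower().count(word) for word in words)
    body_hits = sum(page["body"].lower().count(word) for word in words)
    return title_weight * title_hits + body_weight * body_hits + link_weight * len(page["links_in"])

query = "banner fabric"
for name, page in sorted(pages.items(), key=lambda item: score(item[1], query), reverse=True):
    print(name, score(page, query))
```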

Webmasters try to trick search engine algorithms in order to boost their site's position in the SERPs, stuffing site pages with keywords or using meta tags to fool the ranking strategies. But search engines are smart enough: they keep improving their algorithms so that the machinations of webmasters do not affect search results.

Keep in mind that pages beyond the first few in the list may also contain exactly the information you were looking for. But rest assured that good search engines will always bring highly relevant pages to the top!

1. DuckDuckGo

What is it

DuckDuckGo is a fairly well-known open-source search engine. Its servers are located in the USA. In addition to its own robot, the search engine uses results from other sources: Yahoo, Bing, Wikipedia.

What makes it better

DuckDuckGo positions itself as a search engine with maximum privacy and confidentiality. The system does not collect any data about the user, does not store logs (there is no search history), and the use of cookies is limited as much as possible.

DuckDuckGo does not collect or share personal information from users. This is our privacy policy.

Gabriel Weinberg, founder of DuckDuckGo

Why do you need this

All major search engines try to personalize search results based on data about the person in front of the monitor. This phenomenon is called the "filter bubble": the user sees only those results that are consistent with his preferences or that the system considers to be such.

DuckDuckGo forms an objective picture that does not depend on your past behavior on the Web, and it frees you from the thematic advertising Google and Yandex build from your queries. DuckDuckGo makes it easy to find information in foreign languages, while Google and Yandex prefer Russian-language sites by default, even if the query is entered in another language.


2. not Evil

What is it

not Evil is a system that searches the anonymous Tor network. To use it, you need to enter this network, for example by launching a specialized browser.

not Evil is not the only search engine of its kind. There is also LOOK (the default search in the Tor browser, accessible from the regular Internet), TORCH (one of the oldest search engines on the Tor network), and others. We settled on not Evil because of the unmistakable allusion to Google (just look at its start page).

What makes it better

It searches where Google, Yandex, and other search engines are closed off in principle.

Why do you need this

There are many resources on the Tor network that cannot be found on the law-abiding Internet, and their number will grow as the authorities tighten control over the contents of the Web. Tor is a kind of network within the Web, with its own social networks, torrent trackers, media, marketplaces, blogs, libraries, and so on.

3. YaCy

What is it

YaCy is a decentralized search engine that works on the principle of P2P networks. Each computer on which the main software module is installed crawls the Internet on its own; in other words, it is an analogue of a search robot. The results obtained are collected in a common database used by all YaCy participants.

What makes it better

It is hard to say whether this is better or worse, since YaCy is a completely different approach to organizing search. The absence of a single server and owner company makes the results completely independent of anyone's preferences. The autonomy of each node rules out censorship. YaCy is capable of searching the deep web and non-indexed public networks.

Why do you need this

If you are a supporter of open source software and a free Internet that is not influenced by government agencies and large corporations, then YaCy is your choice. It can also be used to organize searches within a corporate or other autonomous network. And although YaCy is not very useful in everyday life, it is a worthy alternative to Google in terms of the search process.

4. Pipl

What is it

Pipl is a system designed to search for information about a specific person.

What makes it better

The authors of Pipl claim that their specialized algorithms search more efficiently than "regular" search engines. In particular, they prioritize social network profiles, comments, member lists, and various databases where information about people is published, such as databases of court decisions. Pipl's leadership in this area is confirmed by Lifehacker.com, TechCrunch, and other publications.

Why do you need this

If you need to find information about a person living in the US, Pipl will be much more efficient than Google. Databases of Russian courts are apparently inaccessible to the search engine, so it does not cope as well with citizens of Russia.

5. FindSounds

What is it

FindSounds is another specialized search engine. It searches open sources for various sounds: the home, nature, cars, people, and so on. The service does not support queries in Russian, but there is an impressive list of Russian-language tags you can search by.

What makes it better

The results contain only sounds and nothing more. In the settings you can set the desired format and sound quality. All the sounds found are available for download. There is also search by sample.

Why do you need this

If you need to quickly find the sound of a musket shot, the drumming of a sapsucker, or the cry of Homer Simpson, this service is for you. And we picked these only from the available Russian-language queries. In English, the range is even wider.

Seriously, a specialized service implies a specialized audience. But will it come in handy for you too?

6. Wolfram|Alpha

What is it

Wolfram|Alpha is a computational search engine. Instead of links to articles containing keywords, it gives a ready answer to the user's query. For example, if you enter "compare the population of New York and San Francisco" in English into the search form, Wolfram|Alpha will immediately display tables and graphs with the comparison.

What makes it better

This service is better than others at finding facts and calculating data. Wolfram|Alpha collects and organizes knowledge available on the Web from various fields, including science, culture, and entertainment. If this database contains a ready answer to a search query, the system shows it; if not, it calculates and displays the result. In either case, the user sees only the answer and nothing more.

Why do you need this

If you are, for example, a student, analyst, journalist, or researcher, you can use Wolfram|Alpha to find and calculate data related to your activities. The service does not understand all requests, but is constantly evolving and becoming smarter.

7. Dogpile

What is it

The Dogpile metasearch engine displays a combined list of results drawn from the search results of Google, Yahoo, and other popular systems.

What makes it better

First, Dogpile displays fewer ads. Second, the service uses a special algorithm to find and display the best results from different search engines. According to the developers of Dogpile, their system generates the most complete results on the entire Internet.

Why do you need this

If you can't find information on Google or another standard search engine, look it up in several search engines at once using Dogpile.

8. BoardReader

What is it

BoardReader is a text search system for forums, Q&A services and other communities.

What makes it better

The service allows you to narrow the search field to social sites. Thanks to special filters, you can quickly find posts and comments that match your criteria: language, publication date, and site name.

Why do you need this

BoardReader can be useful for PR specialists and other media professionals who are interested in the opinion of the mass media on certain issues.

Finally

The life of alternative search engines is often fleeting. Lifehacker asked Sergey Petrenko, the former CEO of the Ukrainian branch of Yandex, about the long-term prospects of such projects.


Sergei Petrenko

Former CEO of Yandex.Ukraine.

As for the fate of alternative search engines, it is simple: to be very niche projects with a small audience and, consequently, without clear commercial prospects, or, conversely, with complete clarity that there are none.

If you look at the examples in the article, you can see that such search engines either specialize in a narrow but in-demand niche that, perhaps only for now, has not grown enough to show up on the radars of Google or Yandex, or they are testing an original ranking hypothesis that is not yet applicable in ordinary search.

For example, if search over Tor suddenly turns out to be in demand, that is, if at least a percent of Google's audience needs results from there, then ordinary search engines will of course start solving the problem of how to find that content and show it to the user. If audience behavior shows that for a significant share of users, across a significant number of queries, results that do not take user-dependent factors into account seem more relevant, then Yandex or Google will start returning such results.

"To be better" in the context of this article does not mean "better at everything". Yes, in many respects our heroes fall short of Yandex (and even of Bing). But each of these services gives the user something that the giants of the search industry cannot offer. Surely you know of similar projects too. Share them with us and let's discuss.
