- February 7, 2020 at 5:22 pm #85384Participant@chinomnso
Search engines provide an efficient way to search through the billions of web pages on the internet. They organize the information on the world wide web and present the most relevant results when people search the internet, making them an important part of our day-to-day lives. What would life be like without search engines?
For results from your website to show up when users search, your content needs to be visible to search engines. In this article, we will discuss how search engines crawl, index and rank websites. First, let us talk about crawling.
What Is Search Engine Crawling?
Crawling is the process by which search engines search for and discover content on the world wide web. This is done by means of robots known as spiders or crawlers, and the content could be text, a PDF file, a video, an image, or even an icon. Whatever format the content may take, the paths taken by the crawlers are always links.
Googlebot, for example, is Google’s web crawler. It starts out by getting a few web pages on a web site. Then it follows the links on the pages it gets to find new URLs. This now brings us to the idea of an index.
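To make the link-following idea concrete, here is a minimal sketch in Python. It is not how Googlebot is actually implemented. The three-page site in `FAKE_SITE` and the names `LinkExtractor` and `crawl` are made up for illustration, and the pages are kept in memory so the sketch runs without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site, held in memory as URL -> HTML (an assumption for the demo).
FAKE_SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about">About</a> <a href="/blog/post-1">Post</a>',
    "https://example.com/blog/post-1": "<p>No links here.</p>",
}


def crawl(seed):
    """Breadth-first crawl: read a page, extract its links, queue unseen URLs."""
    discovered = [seed]
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = FAKE_SITE.get(url, "")  # stand-in for a real HTTP fetch
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in discovered:
                discovered.append(absolute)
                queue.append(absolute)
    return discovered


print(crawl("https://example.com/"))
```

Starting from the home page alone, the crawl reaches `/blog/post-1` only because `/blog` links to it. A page that nothing links to would never be discovered this way.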
What Is A Search Engine Index?
As a search engine crawler moves around, it adds new content it finds to a massive database of discovered URLs. These URLs are later retrieved when a user searches Google for information that matches the search query. That “store” or database to which the URLs are added is known as an index.
When a user runs a search, the search engine looks through its index for the most relevant content and arranges what it finds in descending order of relevance. This arrangement of search results is known as ranking. As a general rule, the higher a page is ranked in search results, the more relevant the search engine considers its content to be to the search query. This explains why people joke that the best place to hide something is on page 2 of Google Search results. In fact, when I’m searching for something, I hardly ever go beyond the first page. I’d rather change my search terms than go checking page 2 for more relevant results.
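As a rough illustration of an index and of ranking, the Python sketch below builds a tiny inverted index (word to URLs) and orders results by how often the query words appear. Real search engines use far more signals than word counts. The pages in `PAGES` and the functions `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> extracted text.
PAGES = {
    "https://example.com/shoes": "green shoes and red shoes for women",
    "https://example.com/bags": "green bags for women",
    "https://example.com/about": "about our shop",
}


def build_index(pages):
    """Map each word to the URLs containing it, with a per-URL word count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Score each URL by how often the query words occur, highest first."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Sort by descending score; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


index = build_index(PAGES)
print(search(index, "green shoes"))
```

For the query "green shoes", the shoes page is listed before the bags page because it matches both words, while the about page does not appear at all.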
A webmaster can make their site inaccessible to search engine spiders. They could block the bots from accessing all or just part of their website, thereby excluding it from search results. Although you may have reasons to do this, generally, you want to make your content available to those who search the internet.
Search Engines – Google And Others
While there are lots of search engines on the world wide web, SEO experts mainly focus on Google, Bing, and maybe Yahoo. Google receives the most attention because it is by far the largest, and the overwhelming majority of internet users use it as their favourite search engine. In fact, I can’t remember when I last used Bing, let alone Yahoo. It is reported that if you include Google Maps and YouTube, more than 90% of web searches occur on Google, so Bing and Yahoo pale into insignificance: their combined search share is about a twentieth of Google’s. Therefore, this article talks more about Google.
Is Your Website Indexed?
Remember that for your website to show up on search results pages, it must be crawled and indexed. The first thing to do is to see how many of your website’s pages are indexed on a search engine. This way, you will know if your site is being crawled and indexed properly. If it is, you will know which pages are being picked up and can hide any you don’t want to appear.
One way to do this is to use a special search operator. Head to Google Search and search for your website using the following format: “site:yourdomain.com”. In the screenshot below, I used this website as an example.
The result count displayed on a search results page is not exact, but it gives you a good idea of how many of your website’s pages are indexed on Google. And remember, search results are ranked in descending order of relevance, with the most relevant results on the first page.
To get deeper insights into your website’s indexing and ranking, you can use the Index Coverage report available in Google Search Console. Gaining access to this powerful tool is free and requires you only have a Google account. So you can sign up for a Google account right away if you don’t have one. Once you’re inside Google Search Console, you can submit a sitemap for your website and monitor how many pages have been indexed by Google. This is just one of the many things you can do in Google Search Console.
If your website is not showing up in search results, it could be as a result of any of the following or a combination of several of them:
- Your site is still new and has not been crawled
- Your website’s navigation makes it difficult for your website to be properly crawled
- Your site has been penalized by Google for black hat SEO practices
- Your site contains some crawler directives that block search engines
- There are no links to your site from any external websites
To an extent, you can decide what shows up in search results for your website. Therefore, if you used the search operator we used above or checked Google Search Console and didn’t find some important pages from your website, or you found some pages you don’t want to show up in search results, you can optimize your website and use Google Search Console to instruct Googlebot on how to crawl your website.
As an example, you may have some staging or test pages on your website. Or you may have pages with duplicate content, or ones with very little content and do not want them to show up in search results. You could use a special kind of file called robots.txt to instruct crawlers to stay away from those pages when they come around.
This file is located in the root directory of a website and contains rules that tell crawlers which parts of the site they may or may not crawl. These instructions are made using robots.txt directives, special “commands” that follow a certain syntax for this kind of file.
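Here is a small sketch of what such directives look like and how a well-behaved crawler might interpret them, using the `urllib.robotparser` module from Python’s standard library. The rules in `ROBOTS_TXT`, the `www.example.com` domain and the helper `allowed` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of /staging/ and /tmp/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access


def allowed(path, agent="Googlebot"):
    """Return True if the rules above let the given crawler fetch the path."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(allowed("/products/women"))    # not covered by any Disallow rule
print(allowed("/staging/new-home"))  # falls under Disallow: /staging/
```

A check like this is a quick way to confirm that a rule blocks what you intended before you upload the file.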
How Does Googlebot Treat robots.txt Files?
If Googlebot comes visiting and can’t find a robots.txt file, it goes ahead to crawl the entire site. If, however, it finds a robots.txt file, it tries to obey the instructions found there and crawls only the pages it is allowed to crawl. If Googlebot encounters an error while trying to access a site’s robots.txt file and cannot determine whether one exists, it will not crawl the site. Mistakes in the file’s syntax can also lead to rules being ignored or applied in ways you did not intend. Therefore, you may not want to write your own robots.txt file by hand. Use a robots.txt generator like this one to avoid syntax errors.
Not all web robots follow the directives in a robots.txt file. People with nefarious intentions build bad bots, such as email, phone number and address scrapers, that ignore the directives therein. In fact, some of these bots use your robots.txt file to find out what you have marked as private content and go ahead to scrape it for information. Although you may choose to use a robots.txt file to block crawlers from accessing private pages like administrative dashboards, preventing them from popping up in search results, adding the URLs of such pages to a publicly accessible robots.txt file makes them easier for bad bots to find. It is better to noindex such pages and lock them behind login forms. After all, bots don’t log in.
How To Define URL Parameters In Google Search Console
On some sites, the same content can be accessed by means of several different URLs. As an example, when shopping on an e-commerce site, you may use filters to narrow your search down to what you want based on size, colour, or price. As you refine your search, the URL changes slightly, and some items still show up despite your adjustments. The examples below can yield some common results on a real website:
https://www.example.com/products/women?category=shoes&color=green
https://www.example.com/products/women?category=shoes&color=green&price=lt5000
https://www.example.com/products/women?category=shoes&color=green&price=lt5000&sizelt=40
Each of the URLs above would produce some green shoes. The question now is, how does Google know which of these three URLs to serve up when something related is searched for? Google can intelligently figure out on its own which URL to serve up, but you can take advantage of the URL Parameters feature in Google Search Console to tell Google just how you want your pages to be treated. This feature allows you to tell Google to either ignore or crawl URLs with certain parameters. That would be a good thing to do if some of your URL parameters create duplicate content.
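To see why filter parameters create duplicates, the Python sketch below strips them so that the URL variants collapse into one address. This only illustrates the idea and is not what Google Search Console does internally. The set `FILTER_PARAMS` and the function `canonical_url` are assumptions for this example, and which parameters are safe to ignore depends on your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only narrow a listing and can be ignored.
FILTER_PARAMS = {"color", "price", "sizelt"}


def canonical_url(url):
    """Drop filter parameters so URL variants collapse to one address."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FILTER_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


variants = [
    "https://www.example.com/products/women?category=shoes&color=green",
    "https://www.example.com/products/women?category=shoes&color=green&price=lt5000",
    "https://www.example.com/products/women?category=shoes&color=green&price=lt5000&sizelt=40",
]

# A set keeps only distinct values, so duplicates collapse to a single entry.
print({canonical_url(u) for u in variants})
```

All three variants reduce to the same `category=shoes` URL, which is the kind of grouping you are asking a search engine to make when you tell it to ignore certain parameters.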
Can Crawlers Find All Your Important Stuff?
We have identified what you can do to keep search engines away from unimportant stuff on your website. Let us now talk a little about how to help Google find the important content on your website. At times, a search engine is able to find parts of your website by crawling it, but other parts may not be picked up for one reason or another. It is important that all the important information on your website shows up in search. Therefore, you need to be sure that a bot can actually crawl through your website, and not just to your website. After all, you could stop by a friend’s place on your way from work and find his door locked. You went to his house, you stopped at the door, but you did not go into his house.
If important content is hidden behind login forms, bots will be unable to access the material there. And if users have to answer survey questions or fill out forms of some sort before accessing some content, bots are unlikely to gain access to it. After all, crawlers don’t have usernames and passwords and can’t log in. And they do not have opinions to answer survey questions with either.
Some webmasters think that placing a search form on their website would make everything searchable on the website accessible to crawlers. The truth, however, is that web crawlers cannot use search forms. So, do not rely on search forms to make your content accessible to robots.
Another mistake some webmasters make is to put text inside non-text content and expect it to be picked up by crawlers. An image containing text, or a video with text, should not be used to display text intended for indexing. Although search engines are improving at recognizing images using artificial intelligence, you should not count on them to read and understand text in images and serve it up in text search results. If you want text to be crawled, put it directly in your markup rather than embedding it in multimedia.
Just as a user needs a navigation bar to move around your website, crawlers need links that serve as a sort of directory of the pages on your website, guiding them from page to page. If you have a page on your website that is not linked to from any other page on your website, you might as well have left that page on your personal computer, keeping it offline. Your website needs a clear navigation structure that is accessible to search engines. This makes it easier for users and bots alike to find content, and easier for your pages to be listed in search results.
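One way to check for pages that nothing links to is to list your internal links and look for pages that never appear as a link target. The Python sketch below does this for a made-up link graph. `LINKS` and `orphan_pages` are hypothetical names, and in practice you would build the graph from a crawl of your own site.

```python
# Hypothetical internal link graph: page -> pages it links to.
LINKS = {
    "/": ["/products", "/blog"],
    "/products": ["/", "/products/shoes"],
    "/products/shoes": ["/products"],
    "/blog": ["/"],
    "/old-landing-page": ["/"],
}


def orphan_pages(links, home="/"):
    """Return pages that no other page links to (the home page is exempt)."""
    linked = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself does not count
    }
    return sorted(page for page in links if page not in linked and page != home)


print(orphan_pages(LINKS))
```

Here `/old-landing-page` links out to the home page, but nothing links to it, so a crawler that starts from the home page would never reach it.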
Mistakes That Can Make Crawlers Skip Content On Your Site
- Personalizing the navigation, or showing unique navigation to specific users
- Not linking an important page from your navigation. Links are the roads through which crawlers access your content
- Having different navigations for mobile and for desktop
Using Sitemaps Properly
A sitemap is just what the name suggests. It contains a list of URLs that crawlers can use as a map to guide them through the pages on your website. To create a sitemap, simply make use of a sitemap generator like XML Sitemaps Generator. When you are done creating it, download it and edit it to take out any links that you do not want to be on the sitemap, then submit it in Google Search Console. Although submitting a sitemap does not remove the need for clean navigation, it can serve as a map, guiding bots through the terrain of your website.
When creating your sitemap, be sure that all important URLs are included and ensure that your directions are consistent. For example, you may have blocked certain URLs in your robots.txt file. Do not include such URLs in your sitemap, as it would cause confusion for crawlers.
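The sketch below shows roughly what a sitemap file contains and how you might leave out URLs that robots.txt already blocks. It uses Python’s standard `xml.etree.ElementTree` module. The function `build_sitemap`, the example URLs and the blocked prefix are assumptions for illustration, and a sitemap generator will normally do this work for you.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, blocked_prefixes=()):
    """Return sitemap XML for the URLs, skipping any under a blocked prefix."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if any(url.startswith(prefix) for prefix in blocked_prefixes):
            continue  # keep the sitemap consistent with robots.txt
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap(
    [
        "https://www.example.com/",
        "https://www.example.com/products/women",
        "https://www.example.com/staging/new-home",
    ],
    blocked_prefixes=("https://www.example.com/staging/",),
)
print(sitemap)
```

The staging URL is dropped, so the sitemap and the robots.txt rules no longer give crawlers conflicting directions.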
Sometimes, web pages may return 4xx errors. An example is a 404 Not Found Error. This may be caused by a typo. The same could happen to a crawler if you mistype a URL in your sitemap. When search engines encounter a 404, for example, they would not be able to access the page.
Other times, 5xx codes may be encountered too. These are server errors, and the most common is perhaps the 500 Internal Server Error. A 5xx error indicates that the server on which the web page exists failed to fulfil the request made by the user, or the search engine as the case may be. In Google Search Console, the Crawl Error report is dedicated to such errors as these. These errors occur on search engines when a request times out and the bot abandons the request. This article provides more information about the 500 Internal Error. You may wish to read it to learn more.
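As a quick reference, the first digit of an HTTP status code tells you which family it belongs to. The Python sketch below sorts codes into those families and flags the ones worth fixing in a sitemap. The function `classify_status` and the sample results in `CHECKED` are made up for this example.

```python
def classify_status(code):
    """Map an HTTP status code to its broad family."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"  # e.g. 404 Not Found, often a mistyped URL
    if 500 <= code < 600:
        return "server error"  # e.g. 500 Internal Server Error
    return "unknown"


# Hypothetical results of requesting each sitemap URL.
CHECKED = {
    "https://www.example.com/": 200,
    "https://www.example.com/prodcts/women": 404,
    "https://www.example.com/blog": 500,
}

# Collect only the URLs whose status falls in an error family.
problems = {
    url: classify_status(code)
    for url, code in CHECKED.items()
    if classify_status(code) in ("client error", "server error")
}
print(problems)
```

In this made-up data, the mistyped `/prodcts/women` URL produces a client error, while `/blog` points to a problem on the server rather than in the link.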
How Does Indexing Work?
After your site has been crawled, the next step is to get it indexed. The fact that your site has been discovered and crawled does not mean that it will be automatically indexed. We’ve already talked about how search engines discover your web pages. And we know that the index is the database where links to the discovered pages are stored. When a crawler finds a web page, it virtually renders it just like a web browser would. It then analyzes the contents on the page and all of that information, not just the links to the pages, is stored in the index.
To see how Googlebot sees your pages, search for anything on Google, then click on the dropdown beside any of the URLs in the search results beneath the search result headings. From the dropdown, click Cached. When you look at the cached version of the site, check if it is properly crawled, or if any important page element is missing.
Can Pages Be Removed From The Index?
Pages can surely be removed from the index for any of the following reasons:
- The URL had a noindex tag. This tag can be added by webmasters to instruct search engines to leave the page out of their index.
- Crawlers were blocked from the web page, for example after a login form or password was added to restrict access to it.
- The URL returns a 4XX error or a 5XX error
If you suspect that a web page on your web site is no longer indexed, you can use the URL Inspection tool to find out the status of the page.
Using Meta Directives
Meta directives, or meta tags as they are otherwise known, are instructions to search engines detailing how you want your pages to be treated by them. For example, you can use meta tags to prevent search engines from indexing certain pages. These instructions are placed in the <head> section of your HTML pages, or sent by means of the X-Robots-Tag HTTP header.
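A robots meta tag looks like `<meta name="robots" content="noindex, nofollow">`. The Python sketch below shows one way a crawler might read such a tag, along with an `X-Robots-Tag` header, to decide whether a page may be indexed. The class `RobotsMetaParser`, the function `is_indexable` and the sample page are assumptions for illustration, and real crawlers handle more directives and edge cases than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def is_indexable(html, headers=None):
    """False if the meta tag or the X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header = (headers or {}).get("X-Robots-Tag", "")
    header_directives = {d.strip().lower() for d in header.split(",") if d.strip()}
    return "noindex" not in (parser.directives | header_directives)


# Hypothetical page that asks not to be indexed.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_indexable(PAGE))
print(is_indexable("<html><head></head></html>"))
```

The header route is useful for files such as PDFs, which have no `<head>` section in which to place a meta tag.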
How Do Search Engines Rank URLs?
I have mentioned before that search engines present search results in decreasing order of relevance. To rank pages by relevance, search engines make use of algorithms. These algorithms get revised from time to time, and they keep getting smarter. Google, for example, regularly makes adjustments to their search algorithms. Sometimes, these adjustments are minor tweaks, while others are core updates with the purpose of solving a particular problem. Penguin, for example, was designed to fight link spam.
You may wonder why algorithm changes occur so often. Google doesn’t always explain why they make the changes, but we can be sure that the primary purpose is to improve users’ experience when they search the web. We will not go into the details of search engine algorithms here, but the point to remember is that ranking on search engines is done by algorithms.
Links And SEO
Throughout the history of SEO, links have been a big factor. Links could be inbound or internal. Inbound links or backlinks are links from other websites that point to your website, while internal links are links on your website that point to other pages within your website.
Links are like referrals. If someone speaks convincingly to you about his company’s services, you may view it as biased. This is because that’s where he works. Talking about links in SEO, internal links are not a sign of good authority.
Referrals from other websites are signs of better authority. Think of it this way: someone has used a service and enjoyed it. He tries to convince you to use the service because he personally enjoyed it. Although he may be trying to sell something to you, you can trust him because he doesn’t work there.
When your website has referrals from high-quality websites, it is a sign of good authority. It’s just like having a renowned expert in your field recommend a textbook you have written. Because he is respected in that field, the reputation of your book increases.
It’s true we have a good foundation on crawling, indexing and ranking. However, you need to keep up with SEO as it keeps changing regularly. Learn as much as you can about search engine algorithms when they roll out updates. And be sure to avoid black hat SEO practices.