How search engines see

How and what can search engines see? 👀

The amount and type of information a search engine can make sense of determines whether a website appears in search results at all, how it ranks, and how the information is displayed.

You need to understand this if you want people to find your website and achieve your goals. Professionally speaking, a web developer should understand Search Engine Optimization (SEO), and it is also helpful knowledge for writers and designers.

For the purposes of this post we will look at search engines and a few other crawlers, including those of Twitter and Facebook. We will learn what they see and review some tools and methods for seeing websites the way they do.

The static web

In the early days of the world wide web most websites contained only plain text in HTML. Building a program to crawl every website and make sense of it was challenging, but the fact that the data was almost always available as plain HTML made it possible to scrape and interpret it.

To see your website from the perspective of most search engines, simply open a terminal and use curl like this

curl https://www.google.com/

This is the start of the output

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="no"><head><meta content="Velkommen til Google Søk. Finn det du leter etter på nettet på et blunk." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>

This is the information available to Google, Bing, Baidu, Yandex, Facebook and Twitter. Plain and simple.

If your website can be scraped like this and you get human-readable text as a result, you’re most likely going to benefit from all the free things search engines do for you. If instead you only get code in javascript returned, you’re in for a surprise.

Google can brag that Googlebot reads and parses javascript. This is sometimes referred to as Googlebot being “evergreen”, because it renders pages with an up-to-date version of Chromium.

This means that Google can make sense of websites that require javascript to render the data on the page, including text, images, styling, video and sound.

As of today, June 19th 2020, no other search engine does this, and it really matters. It means all the other search engines cannot make sense of pages rendered with javascript.

Two of the most popular frameworks for building websites are React and Vue, both of which were designed to render pages in the browser with javascript. For over 6 years this has created a lot of trouble for the people who build these sites and, of course, for the people trying to find them.

Being the only search engine that offers javascript parsing, Google has another advantage on top of its existing near-monopoly on search engine usage in the western world.

It’s not likely that other search engines and social media site crawlers will adopt javascript rendering soon, so web developers need to understand the importance of making their data static.

In my next post on making React pages static I’ll explain your options to fix this with server-side rendering and static site generation.

How the crawler visits your site

So what does the search engine crawler do when it visits your site?

I don’t work at Google but I read their guide for Search Engine Optimization, and I think you should too.

This is their introduction to SEO and these are their webmaster guidelines

So how does this work?

In brief, the search engine crawler first tries to find instructions to understand what it is allowed to do and what is disallowed. These instructions are made for robots.

Then the crawler will follow all the links on the pages it can find to discover all the pages it can on its own, as well as following any list of pages you have submitted. This list is called a sitemap.

Note that bots and search engine crawlers will not treat a javascript link or button as a link. It is much better to just use plain HTML anchor tags for links

<a href="https://tobiasmcvey.com/">English 🇬🇧</a>

After gathering your site’s data, the search engine attempts to understand it and interpret it.

Robots.txt

Googlebot will discover your site first at its root, which is where it will look for a robots.txt file. The root is the top of the domain. You can see my file here.

The robots.txt file contains instructions for what crawlers are allowed to view and what they are forbidden from seeing. You can mention specific pages or folders of multiple pages. There is also an option to pass the same instructions per page, either with a robots meta tag in the page’s HTML or with a robots tag in the HTTP response header, but be advised that this requires more effort on your part as a developer.
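
A page-level instruction can look like this; noindex is just an illustration, telling crawlers not to add the page to their index.

In the HTML head of a page

<meta name="robots" content="noindex">

Or as an HTTP response header

X-Robots-Tag: noindex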

You can also tweak how frequently the website should be scanned.

This is worth noting since some bots are very greedy, so sometimes you might want to limit how often they can visit the site. Maybe you don’t need to be scanned every week, or you have a limited bandwidth budget for your hosting. Note that you can set instructions for all crawlers or just for specific crawlers by referring to them by name.

User-agent: Baiduspider 

Disallow: /example-folder/

In this example we tell the Baidu crawler not to crawl any pages under /example-folder/.
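
To throttle how often a crawler visits, some crawlers, Bing among them, honour a Crawl-delay directive in robots.txt; Googlebot ignores it, and you adjust its crawl rate in Google Search Console instead. A sketch:

User-agent: bingbot
Crawl-delay: 10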

The robots file is also the best place to link to your sitemap.
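
The link is a single Sitemap directive pointing at the full URL of the file, for example

Sitemap: https://www.domain.com/sitemap.xml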

After it has finished reading the robots.txt file, the crawler will begin to visit the pages it can find on the domain as well as those in your sitemap.

If you really don’t want search engines to index your site I recommend blocking them with a robots instruction that matches all user-agents.
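
That instruction is short:

User-agent: *
Disallow: /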

One final thing: not all bots and crawlers behave politely. Obeying robots.txt is a convention, and not every crawler follows the instructions to the letter.

For this reason if someone scrapes a website too much they might find themselves being IP-banned.

Sitemap

The sitemap is, simply put, a map of your website, containing a list of all the webpages you want search engines to list in their search results. Typically it’s a single XML file, but for larger sites you can have an index file as well as multiple XML files to list all the pages.
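
A minimal sitemap file, following the sitemaps.org protocol, could look like this; the URLs and dates are placeholders

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.domain.com/</loc>
    <lastmod>2020-06-19</lastmod>
  </url>
  <url>
    <loc>https://www.domain.com/books</loc>
    <lastmod>2020-06-01</lastmod>
  </url>
</urlset>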

The search engine will add these pages to the index, the list of all pages in its database for potential search results.

Note that you won’t get any benefit from listing pages that require a login here, so don’t do that.

If your site has thousands of pages you might find it useful to organise them by topics or language, for example

By categories as folders

www.domain.com/robots.txt -> www.domain.com/sitemap-index.xml

contents of sitemap-index.xml
www.domain.com/sitemap-books.xml
www.domain.com/sitemap-movies.xml

the content folders and their sitemap files
www.domain.com/books -> www.domain.com/sitemap-books.xml
www.domain.com/movies -> www.domain.com/sitemap-movies.xml
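
As a sketch, the index file itself lists the other sitemap files, again following the sitemaps.org protocol

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.domain.com/sitemap-books.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.domain.com/sitemap-movies.xml</loc>
  </sitemap>
</sitemapindex>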

By languages in folders, for example where English is at the root and Norwegian is kept in its own folder

www.domain.com/robots.txt -> www.domain.com/sitemap.xml

www.domain.com/ -> www.domain.com/sitemap-en.xml
www.domain.com/norsk -> www.domain.com/sitemap-nb.xml

Note that each domain and subdomain needs its own sitemap, so if you have anything on both domain.com and sub.domain.com you will need one for each. This is simply how the sitemap convention works.

Finally, you need to know that the bots are very, very busy and will only crawl your site for a limited time. If they find nothing new on a visit, they won’t add anything new to the index.

That means if you are adding new content to your site or deleting old pages, you should update the sitemap. Afterwards the search engines will update their search results to reflect the changes on your site.

You can speed this up by submitting your sitemap through the tools mentioned below: Google Search Console and Bing Webmaster Tools.

The search engines can also discover your site and webpages when other sites link to them.

Tools for developers

There are also free tools offered by search engines and social media sites for previewing and live fetching of your webpages. I highly recommend using them.

First of all, you can and should simply look up your pages on Google and Bing to see what they know about your site and which words they associate with it.

To see a specific page you can simply search for its exact address in Google and Bing, and it should always appear on top of the search results. This is especially useful for debugging a new page you just published.

These tools are for digging deeper into what is known about your site:

- For Google, no signup required: the Mobile-Friendly Test and the Structured Data Testing Tool
- For Google, Bing and Yahoo: Google Search Console and Bing Webmaster Tools, signup required
- For Facebook and Twitter: the Facebook Sharing Debugger, signup required, and Twitter’s Card validator

Amongst these tools I swear by using Google Search Console, Bing Webmaster Tools and the Google Structured Data Testing Tool.

If you verify site ownership with Google Search Console you can also immediately sign up for the same information with Bing Webmaster Tools.

Google Search Console and Bing Webmaster Tools

Google Search Console and Bing Webmaster Tools tell you what information their respective search engines know about your websites. You can add multiple sites and see specific data for each domain and subdomain.

They tell you which pages have been added to their index (their list of webpages) and which ones they show for different search queries (the words people use to find your website). They also tell you which pages are considered canonical: when they consider 2 or more pages similar or even identical, they will show one of the pages instead of the others.

You can get most of this information from Google without signing up for Google Search Console using only the Mobile Friendly Test, but it doesn’t tell you which pages it has selected as canonical versions.

It’s up to you to pick your tool, but anyone running a professional website ought to know if a specific page about a product or government service is being concealed because the search engine picked the wrong page as canonical.

More importantly, they tell you which pages aren’t in their index, some hints about why they aren’t, and the errors they encounter when trying to view your site. For this reason I use them as a debugging tool as well as for web analytics.

Using this information you can improve what is shown to other people using search engines, with some limitations.

If you look at search engine results for a query you will see something like this: A list of webpages with links and a short text snippet describing the page.

search engine results for tobiasmcvey.com

The page title is more or less always what you supply in the title tag

<title>Tobias McVey | Site and Blog</title>

However, the page description can be influenced both by your use of the meta description tag and by what Google thinks people are looking for on the page.

<meta name="description" content="Personal blog about analytics and product development">

The tools can also inform you when there is missing HTML or there is a problem parsing your sitemap.

This is also very important for accessibility. All the text data about your site can be used by screen readers to improve the experience on the page and in search results. In fact, most search engines rely on web developers following the W3C web standards and reward them for it.

For example, using breadcrumbs helps search engines understand your website’s information architecture, identify the top pages visitors are looking for, and showcase them in search results. Bing and Google have offered this feature for several years now.
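
As a sketch, breadcrumbs can be described with schema.org’s BreadcrumbList type in JSON-LD (covered further down); the page names and URLs here are made up for illustration

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Books",
      "item": "https://www.domain.com/books"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "Science fiction",
      "item": "https://www.domain.com/books/science-fiction"
    }
  ]
}
</script>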

Seeing like a search engine with Bing and Google

In addition to this, both Google Search Console and Bing Webmaster Tools let you preview your webpages live using their fetch tools. This is useful to make sure the search engine actually finds the data you wanted to appear on the page. That is required for the page to be indexed, for the search engine to understand what the website is about, and ultimately for it to rank for the words and topics people search for.

For example, here is what Bing can preview on my site using Fetch as Bingbot

<!DOCTYPE html>
<html lang="en">
  <head>
    
    <meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="generator" content="Hugo 0.58.2 with theme Tranquilpeak 0.4.7-BETA">
<meta name="author" content="Tobias McVey &lt;/br&gt;**[English 🇬🇧](/)** **[Norsk 🇳🇴](/nb)**">
<meta name="keywords" content="">
<meta name="description" content="Personal blog about analytics and product development">


<meta property="og:description" content="Personal blog about analytics and product development">
<meta property="og:type" content="website">
<meta property="og:title" content="Tobias McVey | Site and Blog">
<meta name="twitter:title" content="Tobias McVey | Site and Blog">
<meta property="og:url" content="https://tobiasmcvey.com/">
<meta property="twitter:url" content="https://tobiasmcvey.com/">
<meta property="og:site_name" content="Tobias McVey | Site and Blog">
<meta property="og:description" content="Personal blog about analytics and product development">
<meta name="twitter:description" content="Personal blog about analytics and product development">
<meta property="og:locale" content="en">


<meta name="twitter:card" content="summary">

  <meta name="twitter:site" content="@https://twitter.com/tobiasmcvey">


  <meta name="twitter:creator" content="@https://twitter.com/tobiasmcvey">

And here is Google using the URL inspection tool

<!DOCTYPE html>
<html lang="en" class=""><head>
    
    
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="generator" content="Hugo 0.58.2 with theme Tranquilpeak 0.4.7-BETA" />
<meta name="author" content="Tobias McVey &lt;/br&gt;**[English 🇬🇧](/)** **[Norsk 🇳🇴](/nb)**" />
<meta name="keywords" content="" />
<meta name="description" content="Personal blog about analytics and product development" />


<meta property="og:description" content="Personal blog about analytics and product development" />
<meta property="og:type" content="website" />
<meta property="og:title" content="Tobias McVey | Site and Blog" />
<meta name="twitter:title" content="Tobias McVey | Site and Blog" />
<meta property="og:url" content="https://tobiasmcvey.com/" />
<meta property="twitter:url" content="https://tobiasmcvey.com/" />
<meta property="og:site_name" content="Tobias McVey | Site and Blog" />
<meta property="og:description" content="Personal blog about analytics and product development" />
<meta name="twitter:description" content="Personal blog about analytics and product development" />
<meta property="og:locale" content="en" />


<meta name="twitter:card" content="summary" />

  <meta name="twitter:site" content="@https://twitter.com/tobiasmcvey" />


  <meta name="twitter:creator" content="@https://twitter.com/tobiasmcvey" />

Google is able to show you a screenshot, the page code and errors. This is because Googlebot runs a headless browser version of Chrome so it can take screenshots and render javascript.

The errors are particularly useful because you will sometimes run into these problems:

A page, folder or asset is located on a URL that is blocked by the robots.txt file on your site.

You fix this problem by changing your robots.txt file or page-level robots instruction to permit Google and other crawlers to access the page.

Code or assets on your page cannot load because they are too slow, so the crawler gives up and moves on.

To fix this problem you simply need to find a way to make these assets load faster, and ideally you look into the performance of the webpage as a whole.

Note there is also an industry that sells SEO tools. I don’t recommend using these. The only useful feature they offer is to scan your own website and look for errors in your site’s code, errors in the sitemap and the like.

You might want to just scrape your own website and look at the code for your templates and scan for errors.

Facebook Sharing Debugger

The Facebook Sharing Debugger lets you see what a page will look like when shared on Facebook.

Facebook uses its own convention of HTML tags called Open Graph. Without these, it cannot really show the page properly.

For example, I get these errors

Warnings That Should Be Fixed
Inferred Property
The provided 'og:image' properties are not yet available because new images are processed asynchronously. To ensure shares of new URLs include an image, specify the dimensions using 'og:image:width' and 'og:image:height' tags. Learn More

It then suggests what I can do about it

Based on the raw tags, we constructed the following Open Graph properties
og:url	https://www.tobiasmcvey.com/
og:type	website
og:title	Tobias McVey | Site and Blog
og:description	Personal blog about analytics and product development
og:updated_time	1592578233
ia:markup_url	
ia:markup_url_dev	
ia:rules_url	
ia:rules_url_dev	

Eventually I will get around to fixing this for my blog.

For a company, however, this can make a website look like a scam site or just unprofessional, so it’s worth fixing. You can for example just write a script that duplicates your existing HTML tags and prefixes them with og:, as sketched below.
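
The end result in the HTML head would look something like this; the image URL and dimensions are placeholders for illustration

<title>Tobias McVey | Site and Blog</title>
<meta name="description" content="Personal blog about analytics and product development">

<!-- duplicated for Open Graph -->
<meta property="og:title" content="Tobias McVey | Site and Blog">
<meta property="og:description" content="Personal blog about analytics and product development">
<meta property="og:image" content="https://tobiasmcvey.com/images/cover.png">
<meta property="og:image:width" content="1200">
<meta property="og:image:height" content="630">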

Google Structured Data Testing Tool

The Structured Data Testing Tool provides a simple way to check if the linked data on your page has valid syntax. Currently this is only used by a few services, including search engines, but the intended application of Linked Data goes beyond search engines.

This won’t help your site rank higher than others, but it will help you display your content in a professional manner that can be very helpful for people.

You can use many types of schema to mark up your websites. Google recommends using JSON-LD, but Microdata is also an option.

The Schema.org site is a project sponsored by Microsoft, Yahoo, Google and Yandex that lists the entities you can construct Linked Data with.
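
To give a flavour of it, here is a sketch of JSON-LD for a site like this one using the schema.org WebSite type; the values are taken from the meta tags shown earlier

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Tobias McVey | Site and Blog",
  "url": "https://tobiasmcvey.com/",
  "description": "Personal blog about analytics and product development"
}
</script>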