Crawler Traps: Causes, Solutions and Prevention – A Developer's Deep Dive by @hamletbatista

In previous articles, I explained how programming skills can help you diagnose and solve complex problems, blend data from different sources, and even automate your SEO work.

In this article, we will put the programming skills we've developed to use and learn by doing (coding).

Specifically, let's take a close look at one of the most impactful technical SEO problems you can solve: identifying and removing crawler traps.

We will explore a number of examples – their causes and their solutions – via HTML and Python code snippets.

Moreover, we will do something even more interesting: write a simple crawler that can avoid traps and that only takes 10 lines of Python code!

My goal with this column is that once you deeply understand what causes crawler traps, you can not only fix them after the fact, but also help developers prevent them from occurring in the first place.

An Introductory Guide to Crawler Traps

A crawler trap occurs when a search engine crawler or SEO spider starts grabbing a large number of URLs that don't result in new unique content or new links.

The problem with crawler traps is that they eat up the crawl budget that search engines allocate to each site.

Once that budget is exhausted, the search engine won't have time to crawl the site's actual valuable pages. This can result in a significant loss of traffic.

This is a common problem on database-driven sites because most developers don't even know it is a serious problem.

When they evaluate a site from an end user's perspective, it works fine and they see no issue. That's because end users are selective when clicking on links; they don't follow every link on a page.

How a Crawler Works

Let's see how a crawler navigates a site by finding and following links in the HTML code.

Below is a simple example of a Scrapy-based crawler. I adapted it from the code on their home page. Feel free to follow their tutorial to learn more about building custom crawlers.

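Here is a minimal sketch of what such a selective spider can look like; the class name, start URL, and CSS selectors are illustrative assumptions and will need adjusting to the actual markup.

import scrapy


class SejSpider(scrapy.Spider):
    name = "sejspider"
    start_urls = ["https://www.searchenginejournal.com/"]

    def parse(self, response):
        # First loop: grab the item blocks in the recent posts listing.
        for post in response.css("article"):
            yield {
                "title": post.css("a::text").get(),
                "url": post.css("a::attr(href)").get(),
            }

        # Second loop: only follow the "Next" pagination link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)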

The first loop grabs all the item blocks in the Recent Posts section, and the second loop only follows the Next link.

When you are writing a selective crawler like this, you can easily avoid most crawler traps!

You can save the code in a local file and run the spider from the command line, as follows:

$ scrapy runspider sejspider.py

Or from a script or a Jupyter notebook.

When you run it, Scrapy logs every request it makes and every item it scrapes.

Traditional crawlers extract and follow all the links on a page. Some links will be relative, some absolute, some will lead to other sites, and most will lead to other pages on the same site.

The crawler must resolve relative URLs to absolute ones before crawling them, and keep track of which URLs have already been visited so it doesn't visit them again.

A search engine crawler is a bit more complicated than that. It is designed as a distributed crawler, which means the crawls of your site don't come from a single machine/IP but from several.

This topic is beyond the scope of this article, but you can read the Scrapy documentation to learn how to implement such a system and gain an even deeper perspective.

Now that you've seen crawler code and understand how it works, let's explore some common crawler traps and see why a crawler would fall for them.

How a Crawler Falls Into Traps

I've compiled a list of common (and not so common) cases from my own experience, Google's documentation, and some articles from the community that I link to in the resources section. Feel free to check them out to get a bigger picture.

A common but incorrect solution to crawler traps is adding meta robots noindex or canonical tags to the duplicate pages. This won't work because it doesn't reduce the crawl space; the pages still have to be crawled. This is one example of why it is important to understand how things work at a fundamental level.

Session Identifiers

Nowadays, most websites use HTTP cookies to identify users, and if users disable cookies, they are prevented from using the site.

However, many sites still use an alternative approach to identify users: the session ID. This ID is unique per website visitor and is automatically embedded in every URL of the page.


When a search engine crawler crawls the page, all of the URLs carry a session ID, which makes them unique and seemingly full of new content.

But remember that search engine crawlers are distributed, so the requests come from different IP addresses. This creates even more unique session IDs.

We want search engine crawlers to crawl the clean URLs (for example, a hypothetical https://example.com/category/widgets), but what they actually crawl are the session ID variants (for example, https://example.com/category/widgets?sessionid=8a7f3b2c) – a different "unique" URL for every visit.

When the session ID is a URL parameter, this is an easy problem to solve because you can block it in the URL parameters settings.

But what happens if the session ID is embedded in the actual URL path? Yes, that is possible and valid.


Web servers based on the Enterprise JavaBeans specification used to append the session ID to the path like this: ;jsessionid. You can easily find sites that still have it indexed in their URLs.


It is not possible to block this parameter when it is included in the path. You need to fix it at the source.

Now, if you are writing your own crawler, you can easily skip it with code like the following.
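Here is a minimal sketch of that idea; the helper name and example URL are just placeholders.

import re

def strip_jsessionid(url):
    # Remove a ;jsessionid=... segment embedded in the URL path.
    return re.sub(r";jsessionid=[^?#/]*", "", url, flags=re.IGNORECASE)

url = "https://example.com/products/widget;jsessionid=1A2B3C4D?color=blue"
print(strip_jsessionid(url))
# https://example.com/products/widget?color=blue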

Faceted Navigation

Faceted or guided navigation, which is very common on e-commerce sites, is probably the most common source of crawler traps on modern sites.


The problem is that a regular user only makes a few selections, but when we tell our crawler to grab these links and follow them, it will try every possible permutation. The number of URLs to crawl becomes a combinatorial problem, and the number of possible permutations explodes quickly, as the sketch below illustrates.
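To get a feel for the explosion, here is a small back-of-the-envelope calculation with made-up facets and values.

from itertools import combinations

# Hypothetical facets and values; real category pages often have many more.
facets = {
    "color": ["red", "blue", "green", "black"],
    "size": ["s", "m", "l", "xl"],
    "brand": ["acme", "globex", "initech"],
}

# Count every way of picking a subset of facets and one value for each.
total = 0
for r in range(1, len(facets) + 1):
    for chosen in combinations(facets, r):
        count = 1
        for facet in chosen:
            count *= len(facets[facet])
        total += count

print(total)  # 99 filter URLs from just 11 facet values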

These links are generally generated with JavaScript, but Google can execute and crawl them, so that alone is not enough to avoid the trap.

A better approach is to add the parameters as URL fragments, because search engine crawlers ignore URL fragments. So the filter links would be rewritten to put the filter parameters after a # instead of a ?.


Here is the code to convert specific parameters into fragments:
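Here is a minimal sketch of that conversion; the set of facet parameters is an assumption you would adjust for your site.

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

FRAGMENT_PARAMS = {"color", "size", "brand"}  # assumed facet parameters

def params_to_fragment(url):
    parts = urlparse(url)
    keep, move = [], []
    for key, value in parse_qsl(parts.query):
        (move if key in FRAGMENT_PARAMS else keep).append((key, value))
    return urlunparse(parts._replace(query=urlencode(keep),
                                     fragment=urlencode(move)))

print(params_to_fragment("https://example.com/category?page=2&color=blue&size=m"))
# https://example.com/category?page=2#color=blue&size=m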

A terrible implementation of faceted navigation that we often see converts filter URL parameters into paths, which makes filtering by query string almost impossible.

For example, instead of /category?color=blue, you get /category/color=blue/.

Faulty Relative Links

I used to see so many problems caused by relative URLs that I recommended clients always make every URL absolute. I later realized that was an extreme measure, but let me show you with code why relative links can cause so many crawler traps.

As I mentioned earlier, when a crawler finds relative links, it must convert them to absolute URLs. To do that, it uses the source URL as a reference.

Here is the code to convert a relative link to absolute.
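A minimal sketch using Python's standard library (the page URL is hypothetical):

from urllib.parse import urljoin

page_url = "https://example.com/blog/crawler-traps/"

print(urljoin(page_url, "../about/"))   # https://example.com/blog/about/
print(urljoin(page_url, "/contact"))    # https://example.com/contact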

Now let's see what happens when the relative link is formatted incorrectly.


Here is code showing the absolute URL that results when the crawler keeps resolving the faulty link.
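This is a sketch with a made-up malformed link (an href missing its scheme and leading slash); each resolution appends the faulty path to the current page's path.

from urllib.parse import urljoin

faulty_href = "example.com/about/"   # should have been "https://example.com/about/" or "/about/"
url = "https://example.com/"

for _ in range(3):
    url = urljoin(url, faulty_href)
    print(url)

# https://example.com/example.com/about/
# https://example.com/example.com/about/example.com/about/
# https://example.com/example.com/about/example.com/about/example.com/about/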

This is where the crawler trap gets set. When I open this fake URL in a browser, I don't get a 404, which would tell the crawler to drop the page and not follow any of its links. Instead, I get a soft 404, which sets the trap in motion.


Our faulty footer link will grow again every time the crawler tries to build an absolute URL from it.

The crawler will keep repeating this process, and the fake URL will keep growing until it hits the maximum URL length supported by the web server software or CDN. This limit varies by system.

For example, IIS and Internet Explorer do not support URLs longer than 2,048 to 2,083 characters.

There is a quick and easy way, and a long and painful way, to catch this type of crawler trap.

You probably already know the long and arduous approach: run an SEO spider for hours until it falls into the trap.

You usually know it found one because it runs out of memory if you run it on your desktop, or it reports millions of URLs on a small site if you use a cloud-based crawler.

The easy way is to look for 414 status code errors in the server logs. Most W3C-compliant web servers return a 414 when the requested URL is longer than they can handle.

If the web server doesn't report 414s, you can alternatively measure the length of the requested URLs in the log and filter out any longer than 2,000 characters.

Here is the code to do one or the other.
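Here is a minimal sketch of both checks against a combined-format access log; the log file name and field positions are assumptions you would adapt to your own log format.

LOG_FILE = "access.log"

with open(LOG_FILE) as log:
    for line in log:
        fields = line.split('"')
        if len(fields) < 3:
            continue
        request = fields[1]                    # e.g. GET /some/path HTTP/1.1
        status = fields[2].split()[0]          # status code follows the request
        parts = request.split()
        url = parts[1] if len(parts) > 1 else ""

        # Option 1: the server already flags over-long URLs with a 414.
        if status == "414":
            print("414:", url[:100])

        # Option 2: the server doesn't, so flag very long URLs ourselves.
        elif len(url) > 2000:
            print("Long URL (%d chars): %s" % (len(url), url[:100]))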

Here is a variant of the missing-slash problem that is particularly difficult to detect. It happens when you copy and paste code into a word processor and it replaces the quote character with a curly "smart" quote.


To the human eye, the quotes look the same unless you pay close attention. Let's see what happens when the crawler converts the seemingly correct relative URL to an absolute one.
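A sketch of the idea: because the HTML parser doesn't recognize curly quotes as attribute delimiters, they end up inside the extracted URL, which turns a root-relative path into a plain relative one (the page URL and href are made up).

from urllib.parse import urljoin

page_url = "https://example.com/blog/post/"
href = "\u201c/about/\u201d"   # what gets extracted when the href uses curly quotes

print(urljoin(page_url, href))
# https://example.com/blog/post/“/about/”
# The path no longer starts with "/", so each resolution appends to the current
# page's path instead of replacing it, and the URL keeps growing just like the
# faulty footer link above.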

Cache Busting

Cache busting is a technique developers use to force content delivery networks (CDNs) to serve the latest version of their hosted files.

The technique requires adding a unique identifier to the pages or page resources you want to "bust" out of the CDN cache.

When developers use one or a handful of unique identifiers, it creates a few extra URLs to crawl (typically images, CSS, and JavaScript), and that usually doesn't matter much.

The big problem arises when they decide to use random unique identifiers, update resources and pages frequently, and let search engines crawl all the file variants.

Here's what it looks like: the same resource gets requested under an endless stream of URLs, for example (hypothetical) /assets/style.css?v=8f3a2c1, /assets/style.css?v=b19d4e7, and so on.

You can detect these problems in your server logs, and I will cover the code to do that in the next section.

Caching Versioned Pages with Image Resizing

Similar to cache busting, there is a curious problem with static page caching plugins, such as the one developed by a company called MageWorx.

For one of our clients, their Magento plugin was saving different versions of page resources for every change the client made.

This problem was compounded when the plugin automatically resized images into different sizes for each supported device.

This probably wasn't a problem when the plugin was originally developed, because Google wasn't trying to aggressively crawl page resources.

The problem is that search engine crawlers now also crawl page resources, and they will crawl every version created by the caching plugin.

We had a client whose crawl was 100 times the size of the site, and 70 percent of the crawl requests were hitting images. You can only detect a problem like this by reading the logs.

We will generate fake Googlebot requests for randomly cached images to better illustrate the problem and learn how to identify it.

Here is the initialization code:
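A minimal sketch of the setup; the domain, image names, and log structure are all made up for illustration.

import random
import datetime

SITE = "https://example.com"
IMAGES = ["hero.jpg", "banner.png", "product.jpg"]
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")
START = datetime.datetime(2019, 5, 1)
log_entries = []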

Here is the loop that generates the fake log entries.
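Continuing the sketch above, each request hits an image behind a random cache-busting token, so almost every URL is unique, and the second half of the month simulates a crawl spike.

for day in range(30):
    date = START + datetime.timedelta(days=day)
    hits = random.randint(50, 100) if day < 15 else random.randint(500, 1000)
    for _ in range(hits):
        image = random.choice(IMAGES)
        token = random.randint(0, 10**8)   # random cache-busting identifier
        log_entries.append({
            "date": date,
            "url": "%s/media/cache/%d/%s" % (SITE, token, image),
            "status": 200,
            "user_agent": GOOGLEBOT_UA,
        })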

Next, let's use pandas and matplotlib to identify the problem.
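Still continuing the same sketch, we load the entries into a DataFrame and plot the daily Googlebot request counts.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(log_entries)
googlebot = df[df["user_agent"].str.contains("Googlebot")]

daily = googlebot.groupby("date").size()
daily.plot(title="Googlebot requests per day")
plt.show()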

The resulting chart shows daily Googlebot requests, similar to the Crawl Stats feature of the old Search Console. It was a report like this that prompted us to dig deeper into the logs.

Once you have the Googlebot requests in a pandas DataFrame, it's pretty easy to pin down the problem.

Here's how we can filter down to one of the peak crawl days and break the requests down by page type using the file extension.
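Continuing from the snippets above, one way to do it (the extension parsing is simplistic and just for illustration):

peak_day = daily.idxmax()   # the busiest day in the simulated log
peak = googlebot[googlebot["date"] == peak_day].copy()

peak["extension"] = peak["url"].str.rsplit(".", n=1).str[-1]
print(peak.groupby("extension").size().sort_values(ascending=False))
# In this simulated log, image requests dominate the peak day.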

Long Redirect Chains and Loops

A simple way to waste crawl budget is to have really long redirect chains, or even loops. They usually happen because of coding mistakes.

Let's code an example of a redirect chain that creates a loop to understand them better.
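Here is a minimal sketch using Flask; the routes are made up, and opening /page-1 locally walks you through the chain and back to the start.

from flask import Flask, redirect

app = Flask(__name__)

@app.route("/page-1")
def page_1():
    return redirect("/page-2", code=301)

@app.route("/page-2")
def page_2():
    return redirect("/page-3", code=301)

@app.route("/page-3")
def page_3():
    return redirect("/page-1", code=301)   # back to the start: a loop

if __name__ == "__main__":
    app.run(debug=True)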

This is what happens when you open the first URL in Chrome: the browser follows the chain until it detects the loop and gives up with a "too many redirects" error.


You can also see the chain in the web application log.


Redirect chains often appear when you ask developers to implement rewrite rules such as switching from HTTP to HTTPS, lowercasing mixed-case URLs, or making URLs search engine friendly, and each rule is implemented as a separate redirect instead of a single redirect from the source to the final destination.

Redirect chains are easy to detect with code like the one below.
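A minimal sketch with the requests library; the starting URL is hypothetical, and a genuine loop will surface as a TooManyRedirects exception.

import requests

try:
    response = requests.get("https://example.com/page-1", allow_redirects=True)
except requests.exceptions.TooManyRedirects:
    print("Redirect loop detected")
else:
    # response.history holds every intermediate hop in the chain.
    for hop in response.history:
        print(hop.status_code, hop.url)
    print(response.status_code, response.url)   # the final destination
    if len(response.history) > 1:
        print("Redirect chain of %d hops" % len(response.history))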

They are also relatively easy to fix once you identify the problematic code. Always redirect from the source straight to the final destination.

Mobile/Desktop Redirect Links

An interesting type of redirect is the one some sites use to let users force the mobile or desktop version of the site. Sometimes it uses a URL parameter to indicate which version of the site is requested, and that is generally a safe approach.

However, cookie and user-agent detection are also popular, and that is where loops can happen, because search engine crawlers don't set cookies.

This code shows how it should work correctly.
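Below is a minimal Flask sketch of the safe version; the hostnames and cookie name are made up. The user agent decides the default, and the cookie only overrides it when the visitor explicitly chose a version, so cookie-less crawlers always get a stable answer.

from flask import Flask, request, redirect

app = Flask(__name__)

def wants_mobile():
    # An explicit user choice stored in a cookie wins...
    choice = request.cookies.get("site_version")
    if choice in ("mobile", "desktop"):
        return choice == "mobile"
    # ...otherwise fall back to the user agent.
    return "Mobile" in request.headers.get("User-Agent", "")

@app.route("/")
def home():
    if wants_mobile():
        return redirect("https://m.example.com/", code=302)
    return "Desktop home page"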

And this shows how it can go wrong when the defaults are changed to reflect a faulty assumption: that HTTP cookies are always present.
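A sketch of that broken assumption, continuing the same Flask app: if "no cookie" is treated as "wrong version requested", a cookie-less crawler bounces between the two versions forever.

@app.route("/desktop")
def desktop():
    if request.cookies.get("site_version") != "desktop":
        return redirect("/mobile", code=302)    # crawler has no cookie, so it bounces
    return "Desktop page"

@app.route("/mobile")
def mobile():
    if request.cookies.get("site_version") != "mobile":
        return redirect("/desktop", code=302)   # still no cookie, so it bounces back
    return "Mobile page"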


Circular Proxied URLs

This one happened to us recently. It is an unusual case, but I expect it to happen more often as more services move behind proxy services such as Cloudflare.

You could have URLs proxied repeatedly in a way that creates a chain, similar to what happens with redirects.

You can think of proxied URLs as server-side redirects: the URL doesn't change in the browser, but the content does. In order to track proxied URL loops, you need to check your server logs.

We have a Cloudflare app that makes API calls to our backend to get the SEO changes to apply. Our team recently introduced a bug that caused API calls to be proxied, which created a nasty and hard-to-detect loop.

We used the very handy Logflare app by @chasers to view our API call logs in real time. This is what regular calls look like.


Here is an example of a circular/recursive one. It's a massive request. I found hundreds of chained requests when I decoded the text.


We can use the same trick we used to detect faulty relative links: filter by the 414 status code or by request length.

Most requests should not exceed around 2,000 characters. You can refer to the code we used earlier to check for 414s and over-long URLs.

Magic URLs + Random text

Another example is URLs that include optional text and only require an ID to serve the content.

Generally, this is not a problem, except when the URLs can be linked with random and inconsistent text from within the site.

For example, when the product name in the URL changes often, search engines have to crawl all the URL variations.

Here is an example.


If I follow the link to product 1137649-4 with a short text as the product description, the product page loads.


But you can see that the canonical is different from the page I requested.


In principle, you can type any text between the product path and the product ID and the same page will load.

Canonicals solve the duplicate content problem, but the crawl space can get large depending on how many times the product name is updated.

To track the impact of this problem, you need to split the URL paths into directories and group the URLs by product ID. Here is the code to do that.
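A minimal sketch with pandas, assuming URLs of the form /p/<free-text>/<product-id> where only the final segment is the stable ID (the sample URLs are made up around the product ID mentioned above).

import pandas as pd

urls = [
    "https://example.com/p/blue-widget/1137649-4",
    "https://example.com/p/widget-blue-sale/1137649-4",
    "https://example.com/p/red-gadget/2200311-7",
]

df = pd.DataFrame({"url": urls})
parts = df["url"].str.split("/")
df["product_id"] = parts.str[-1]   # the stable identifier
df["slug"] = parts.str[-2]         # the free text that keeps changing

print(df.groupby("product_id")["slug"].nunique().sort_values(ascending=False))
# 1137649-4 has 2 different slugs, i.e. 2 crawlable URLs for one product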

The output groups the crawled URLs by product ID, so you can see how many text variations exist for each product.

Links to Dynamically Generated Internal Searches

Some on-site search vendors help create "new" keyword-based content simply by performing searches for a large number of keywords and formatting the search URLs like regular URLs.

A small number of such URLs is generally not a problem, but if you combine this with a massive keyword list, you end up with a situation similar to the one I mentioned for faceted navigation.

Too many URLs leading to essentially the same content.


One way to detect these is to look for the class IDs of the listings and see whether they match the class IDs of the listings on a regular search.

In the example above, I see the class ID "sli_phrase", which hints that the site is using SLI Systems to power their search.

I'll leave the code to detect this one as an exercise for the reader.

Calendar/Event Links

This is probably the easiest trap to understand.

If you place a calendar on a page, even as a JavaScript widget, and you let search engines crawl the "next month" links, it will never end, for obvious reasons.

Writing generalized code to detect this one automatically is particularly challenging. I'm open to any ideas from the community.


How to Catch Crawler Traps Before Code Goes to Production

Most modern development teams use a technique called continuous integration to automate the delivery of high-quality code to production.

Automated tests are a key component of continuous integration workflows, and they are the best place to plug in the scripts we've put together in this article to catch traps.

The idea is that once a crawler trap is detected, the production deployment would be halted. You can use the same approach and write tests for many other critical SEO problems.

CircleCI is one of the vendors in this space, and below you can see example output produced by one of our builds.
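As a sketch of the idea, here is what such an automated test could look like (for example, run with pytest in a CI job); the URL list and thresholds are assumptions, not a prescribed setup.

import requests

URLS_TO_CHECK = [
    "https://example.com/",
    "https://example.com/category/widgets",
]

MAX_URL_LENGTH = 2000
MAX_REDIRECT_HOPS = 1

def test_no_long_urls_or_redirect_chains():
    for url in URLS_TO_CHECK:
        assert len(url) < MAX_URL_LENGTH, "URL too long: %s" % url[:80]
        response = requests.get(url, allow_redirects=True)
        assert len(response.history) <= MAX_REDIRECT_HOPS, (
            "Redirect chain of %d hops for %s" % (len(response.history), url)
        )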


How to Diagnose Traps After the Fact

At the moment, the most common approach is to catch crawler traps after the damage is done. Typically, you run an SEO spider crawl, and if it never finishes, you probably have a trap.

Search Google using operators like site: and if there are far too many pages indexed, you have a trap.

You can also check the URL Parameters tool in Google Search Console for parameters with an excessive number of monitored URLs.

Many of the traps mentioned here will only show up in the server logs, by looking for repetitive patterns.

You also find traps when you see a large number of duplicate titles or meta descriptions. Another thing to check is a larger number of internal links than pages that should exist on the site.

Resources to Learn More

Here are some resources I used while researching this article:

More Resources:

Image Credits

All screenshots taken by author, May 2019