Official Google Webmaster Central Blog

What Crawl Budget Means for Googlebot

Posted: 16 Jan 2017 08:28 AM PST

Recently, we've heard a number of definitions for "crawl budget"; however, we don't have a single term that would describe everything that "crawl budget" stands for externally. With this post we'll clarify what we actually have and what it means for Googlebot.

First, we'd like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.

Prioritizing what to crawl, when, and how much resource the server hosting the site can allocate to crawling is more important for bigger sites, or those that auto-generate pages based on URL parameters, for example.

Crawl rate limit

Googlebot is designed to be a good citizen of the web. Crawling is its main priority, but it also has to make sure it doesn't degrade the experience of users visiting the site. We call this the "crawl rate limit," which limits the maximum fetching rate for a given site.


Simply put, this represents the number of parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between fetches. The crawl rate can go up and down based on a couple of factors:
  • Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
  • Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.
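
To make the "crawl health" feedback loop concrete, here is a minimal sketch in Python. It is only an illustration of the behaviour described above, not Googlebot's actual algorithm: the class name, the thresholds, and the halve-on-error, add-one-on-success policy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CrawlRateLimiter:
    """Toy model of a per-site crawl rate limit (all numbers are hypothetical)."""

    connections: int = 2           # parallel connections currently allowed
    min_connections: int = 1
    max_connections: int = 10      # stands in for a cap set by the site owner
    fast_threshold_s: float = 0.5  # faster responses count as "healthy"

    def record_fetch(self, status: int, seconds: float) -> int:
        """Adjust the limit after one fetch and return the new value."""
        if status >= 500 or seconds > 4 * self.fast_threshold_s:
            # Server errors or very slow responses: back off quickly.
            self.connections = max(self.min_connections, self.connections // 2)
        elif seconds < self.fast_threshold_s:
            # Fast, successful response: allow one more connection, up to the cap.
            self.connections = min(self.max_connections, self.connections + 1)
        return self.connections


limiter = CrawlRateLimiter()
for status, seconds in [(200, 0.1), (200, 0.2), (503, 0.1)]:
    print(status, seconds, "->", limiter.record_fetch(status, seconds))
```

Note that the cap only bounds the limit from above. Raising it does nothing unless the site keeps responding quickly, which mirrors the point that higher limits don't automatically increase crawling.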


Crawl demand

Even if the crawl rate limit isn't reached, if there's no demand from indexing, there will be low activity from Googlebot. The two factors that play a significant role in determining crawl demand are:
  • Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
  • Staleness: our systems attempt to prevent URLs from becoming stale in the index.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.

Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl.


Factors affecting crawl budget

According to our analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing. We found that the low-value-add URLs fall into these categories, in order of significance:
  • Faceted navigation and session identifiers
  • On-site duplicate content
  • Soft error pages
  • Hacked pages
  • Infinite spaces and proxies
  • Low quality and spam content
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering great content on a site.


Top questions

Crawling is the entry point for sites into Google's search results. Efficient crawling of a website helps with its indexing in Google Search.

Q: Does site speed affect my crawl budget? How about errors?
A: Making a site faster improves the users' experience while also increasing crawl rate. For Googlebot a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx errors or connection timeouts signals the opposite, and crawling slows down.
We recommend paying attention to the Crawl Errors report in Search Console and keeping the number of server errors low.
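
Alongside the Search Console report, server errors served to Googlebot can also be estimated from a site's own access logs. The sketch below is one possible approach, assuming Apache/nginx "combined" log lines. The function names are hypothetical, and matching on the user-agent string alone is an assumption that can be fooled by spoofed agents.

```python
import re
from collections import Counter

# Assumed format: "combined" access log lines. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_summary(lines):
    """Count responses by status class (2xx, 4xx, 5xx, ...) for requests
    whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")[0] + "xx"] += 1
    return counts


def server_error_share(counts):
    """Fraction of Googlebot requests answered with a 5xx status."""
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0
```

A typical use would be to pass an open log file to `googlebot_status_summary` and watch whether the 5xx share rises over time.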

Q: Is crawling a ranking factor?
A: An increased crawl rate will not necessarily lead to better positions in Search results. Google uses hundreds of signals to rank the results, and while crawling is necessary for being in the results, it's not a ranking signal.

Q: Do alternate URLs and embedded content count in the crawl budget?
A: Generally, any URL that Googlebot crawls will count towards a site's crawl budget. Alternate URLs, like AMP or hreflang, as well as embedded content, such as CSS and JavaScript, may have to be crawled and will consume a site's crawl budget. Similarly, long redirect chains may have a negative effect on crawling.

Q: Can I control Googlebot with the "crawl-delay" directive?
A: The non-standard "crawl-delay" robots.txt directive is not processed by Googlebot.
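
For context, this is what the directive looks like and how a crawler that does choose to honor it might read it. The sketch uses Python's standard `urllib.robotparser`. The robots.txt contents and the "ExampleBot" user agent are made up for illustration, and none of this changes how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with the non-standard Crawl-delay directive.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Crawlers that support the directive can read the requested delay in seconds.
delay = parser.crawl_delay("ExampleBot")

# Standard rules such as Disallow are evaluated separately from Crawl-delay.
allowed = parser.can_fetch("ExampleBot", "/search/results")

print("crawl delay:", delay, "| may fetch /search/results:", allowed)
```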

Q: Does the nofollow directive affect crawl budget?
A: It depends. Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow it can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.
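
The logic can be sketched as a small check over a list of links. The `(source, target, nofollow)` tuple shape, the example URLs, and the function name are assumptions for this illustration. The point is only that a single link without nofollow, anywhere, is enough for the target to remain crawlable.

```python
def followable_targets(links):
    """Return the target URLs that have at least one link NOT marked nofollow.

    `links` is an iterable of (source_url, target_url, nofollow) tuples.
    """
    return {target for _source, target, nofollow in links if not nofollow}


links = [
    ("https://example.com/", "https://example.com/sale", True),
    ("https://example.com/blog", "https://example.com/sale", False),
    ("https://example.com/", "https://example.com/login", True),
]

# /sale is still reachable through the link on /blog; /login is only
# linked with nofollow in this hypothetical data set.
print(sorted(followable_targets(links)))
```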

For information on how to optimize crawling of your site, take a look at our blog post on optimizing crawling from 2009, which is still applicable. If you have questions, ask in the forums!

Posted by Gary, Crawling and Indexing teams
















