Universal Web Crawler Blocking Report

Introduction

This research analyzes the prevalence of full-site web crawler blocking among a diverse range of websites and answers, among other questions, “which web crawlers (robots) are most frequently blocked by top-traffic sites?” and “which categories of sites block each of the common web crawlers?”. The study measures how often specific web crawlers (robots) are blocked across a dataset of 86,151 websites analyzed so far, with more being added continuously from our initial backlog of 1 million high-traffic websites. The analysis encompasses both the gradual automated analysis of those 1 million high-traffic sites and the websites manually analyzed by users through our robots.txt checking tool.

Data Collection

  • Website Selection: The dataset includes a comprehensive selection of websites drawn from two primary sources: an automated process that analyzes 1 million top-traffic websites and a manual analysis of websites by users of our free robots.txt testing tool.
  • Web Crawlers Blocking Analysis: The analysis focuses on the robots.txt files of the selected sites, identifying instances where specific web crawlers are blocked. The evaluation ensures each website is represented only once in the dataset, utilizing the most recent analysis of the robots.txt file of each website to avoid duplication.
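The full-site blocking check described above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not the study's actual pipeline; the robots.txt content and crawler names below are made-up examples.

```python
# Minimal sketch of the full-site blocking check, using Python's standard
# urllib.robotparser. The robots.txt content below is illustrative only.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: SemrushBot
Disallow: /private/
"""

def fully_blocked(robots_txt: str, user_agent: str) -> bool:
    """A crawler counts as fully blocked only if it may not fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")

print(fully_blocked(ROBOTS_TXT, "GPTBot"))      # blocked from the whole site -> True
print(fully_blocked(ROBOTS_TXT, "SemrushBot"))  # only partially blocked -> False
```

Treating “may not fetch `/`” as the criterion matches the occurrence-counting rule used in this study: a crawler disallowed only from specific paths is not counted.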

Data Sources

  • Automated Analysis: Websites are selected for automated analysis based on their Open Page Rank. The list of sites is drawn from DomCop’s ranking of the top 10 million websites, which is itself derived from Common Crawl data.
  • Manual Analysis: Users contribute to the dataset by manually submitting domains for analysis via our robots.txt testing tool. These submissions broaden the scope of the dataset and incorporate a wider variety of websites.

Data Processing

  • Crawler Grouping: To streamline the analysis, crawlers from the same organization are grouped together. For example, various user agents associated with Semrush are consolidated into a single category, enhancing the clarity and interpretability of the results.
  • Crawler Name Normalization: The crawler names extracted from robots.txt files undergo normalization to account for variations in formatting and casing. This ensures accurate data categorization and aggregation.
  • Occurrence Counting: A blocking occurrence is counted only when a bot is entirely blocked from a site. If a bot is partially blocked (blocked from certain paths but not the entire site), it is not counted. This criterion ensures that the analysis reflects only instances where bots are effectively excluded from accessing the site’s content.
  • Website Categories: Each website is categorized based on a curated list of 185 categories. AI was used to draft the list, which then underwent a manual human review to minimize overlap between categories. The categories are listed below:

  1. News
  2. Technology
  3. Education
  4. Reference
  5. Lifestyle
  6. Business & Finance
  7. Travel
  8. Arts and Culture
  9. Entertainment
  10. Government & Public Services
  11. Science
  12. Shopping
  13. Language & Communication
  14. Retail
  15. Sports
  16. Health & Wellness
  17. Food & Beverage
  18. Photography
  19. Personal Blog
  20. Music
  21. Outdoor activities
  22. Community platform
  23. Automotive
  24. Weather
  25. Hobbies & Collectibles
  26. Agriculture and Farming
  27. Quotes & Inspiration
  28. Religion
  29. Meeting Management
  30. Security
  31. Real Estate
  32. Law & Legal
  33. Parenting
  34. Social Media
  35. History and Geography
  36. Home and Garden
  37. Blogs
  38. Events & Conferences
  39. Job search
  40. Wedding Planning
  41. Non-profit/Activism
  42. Online pharmacy
  43. Tools and Equipment
  44. File Sharing
  45. Standardization Organizations
  46. Proxy Services
  47. Postal Services
  48. Tech & Computer Help
  49. Sustainability
  50. Reviews
  51. Online Dating
  52. Charity / Nonprofit organization
  53. Gaming & Gambling
  54. Outdoor & Recreation
  55. Spirituality
  56. Search Engine Optimization
  57. Human Resources
  58. Personal branding/profile-building
  59. Community forums
  60. Conference Website
  61. Transportation
  62. Email services

  63. Outdoor and Adventure
  64. Aviation
  65. Mathematics
  66. Museum
  67. Military
  68. Fundraising Platforms
  69. Wildlife
  70. Animal Welfare
  71. Online Publishing
  72. SEO Tools
  73. History
  74. Online Auctions
  75. Screen sharing
  76. Conference
  77. Social Networking
  78. Digital Repository
  79. Career and Job Search
  80. Television
  81. Forum
  82. Streaming Service
  83. Blogging
  84. Professional Organization
  85. Online Classifieds
  86. Gambling & Casino
  87. Antiques and Collectibles
  88. Online dictionaries
  89. Consumer Information
  90. Consumer Protection
  91. Space
  92. Puzzle and games
  93. Writing
  94. Psychology
  95. Robotics
  96. Open Data
  97. Uncategorized
  98. Biology/Ecology
  99. History & Conspiracy Theory
  100. Video
  101. Linguistics
  102. Project Management
  103. Online Libraries
  104. How-to Guides
  105. Login Portal
  106. Media Sharing Platform
  107. Philanthropy/Donations
  108. Regional and Ethnic Dialects
  109. Utilities
  110. Digital Publishing
  111. Website Localization
  112. Online eBook platform
  113. Mapping
  114. Print-on-Demand Services
  115. Landscaping
  116. Human Rights
  117. Energy & Utilities
  118. Home Goods/Home Decor
  119. Document Management
  120. Textiles
  121. Wallpapers
  122. Genealogy
  123. Charity / Nonprofit
  124. Statistics

  125. Editorial Blogs
  126. Environment
  127. Movies
  128. Telecom
  129. Opinion Blogs
  130. Toys
  131. Hobby/Collectibles
  132. Tech/Tutorials
  133. Politics
  134. Vintage Postcards
  135. Sporting Goods
  136. Storage
  137. Open-source and Civic Tech
  138. Podcast directory
  139. Directory
  140. Crafts & Hobbies
  141. Book Publishing
  142. Home Appliances
  143. Energy and Environment
  144. Publishing
  145. Bookselling
  146. Streaming Platform
  147. Job search platform
  148. Crowdfunding platform
  149. Question and Answer platform
  150. Home and Kitchen
  151. Pets & Animals
  152. Podcasts
  153. Archaeology
  154. Family & Genealogy
  155. Online community
  156. Soccer
  157. VPN Services
  158. Data Collection and Analysis Platforms
  159. File hosting and downloading
  160. B2B Tech Community
  161. online forum
  162. Blogging platform
  163. Cryptocurrency Tracking
  164. Search Engine
  165. Dictionary
  166. Documentation / Knowledge Base
  167. Ticketing and Event Services
  168. Tea & Coffee
  169. Sourcing
  170. Arts and Crafts
  171. Publishing & Printing
  172. URL shortening service
  173. Health and wellness
  174. Online Fundraising
  175. Cloud Services
  176. History & Culture
  177. Privacy and Data Protection
  178. Coupon and Deals
  179. Celebrity official website
  180. Home Improvement
  181. Job and Career Services
  182. Dating & Relationships
  183. Review Website
  184. Professional Networking
  185. Encyclopedia
Analysis

  • Visualization: The research employs a bar chart to visualize the frequency of crawler exclusions across different bot categories. The Y-axis represents the number of websites blocking a specific robot, while the X-axis delineates the bot names. A stacked bar chart is also presented to show the prevalence of crawler exclusion among different categories of websites.
  • Statistical Analysis: Simple quantitative analysis is conducted to identify trends and patterns in robot exclusions. The frequency of robot blocking is examined to discern prevalent practices among website owners.
  • Interpretation: The findings can be interpreted to provide insights into the prevalence and significance of web crawler exclusions in the online ecosystem, especially among top-traffic websites and among different categories of sites. Implications for website owners, technical SEO strategies, and robot behavior can be derived from this data by other interested researchers.
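The quantitative step above reduces to a simple tally: count how many sites fully block each crawler group, then rank the results. A minimal sketch with illustrative sample data:

```python
# Sketch of the counting and ranking step. The sample data is illustrative:
# one entry per (site, blocked crawler) pair after deduplication.
from collections import Counter

blocked = ["GPTBot", "SemrushBot", "GPTBot", "CCBot", "GPTBot", "SemrushBot"]

counts = Counter(blocked)
top = counts.most_common(2)
print(top)  # [('GPTBot', 3), ('SemrushBot', 2)]
```

The same counts feed both the bar chart (per-crawler totals) and, when keyed additionally by site category, the stacked bar chart.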

Results

The results of the ongoing analysis are published on Nexunom’s Robots.txt Checker page. Here is a snapshot of the results at 86,151 analyzed websites. The bar chart shows the number of times a crawler name was disallowed from accessing an entire site in a website’s robots.txt file, and the following table shows the exact number of times each of the top 15 crawlers has been blocked.

15 Top Blocked Web Crawlers – Bar Chart
Number of Times Each Web Crawler Has Been Blocked

As the table above shows, GPTBot, SemrushBot, CCBot, and TeleportBot are the four most blocked crawlers in the robots.txt files of the top-traffic sites. The table below gives a more comprehensive list of the 55 most blocked crawlers, along with each crawler’s rank.

Rank | Crawler | Occurrences
1 | GPTBot | 3911
2 | SemrushBot | 2330
3 | CCBot | 2267
4 | TeleportBot | 2124
5 | Google-Extended | 1958
6 | MJ12bot | 1716
7 | ChatGPT-User | 1688
8 | AhrefsBot | 1488
9 | WebCopier | 1035
10 | WebStripper | 1020
11 | Offline Explorer | 1005
12 | SiteSnagger | 1004
13 | WebZIP | 1003
14 | larbin | 953
15 | MSIECrawler | 947
16 | HTTrack | 888
17 | anthropic-ai | 868
18 | wget | 868
19 | dotbot | 854
20 | ZyBORG | 853
21 | NPBot | 817
22 | Xenu | 816
23 | WebReaper | 815
24 | sitecheck.internetseer.com | 777
25 | grub-client | 775
26 | Fetch | 768
27 | Zealbot | 765
28 | Download Ninja | 758
29 | linko | 755
30 | libwww | 751
31 | Microsoft.URL.Control | 731
32 | Zao | 729
33 | UbiCrawler | 722
34 | DOC | 716
35 | k2spider | 695
36 | PetalBot | 679
37 | BLEXBot | 658
38 | Amazonbot | 643
39 | Baiduspider | 637
40 | FacebookBot | 634
41 | omgilibot | 633
42 | ia_archiver | 609
43 | fast | 604
44 | Mediapartners-Google* | 585
45 | Yandex | 549
46 | Bytespider | 525
47 | Claude-Web | 515
48 | omgili | 514
49 | cohere-ai | 507
50 | ClaudeBot | 492
51 | TurnitinBot | 467
52 | PerplexityBot | 434
53 | 008 | 397
54 | magpie-crawler | 366
55 | psbot | 342

The following stacked bar chart shows which “categories of sites” have blocked a specific crawler in their robots.txt files. For example, the stacked bar for GPTBot reveals that “News,” “Technology,” “Entertainment,” and “Business & Finance” are among the top categories of sites blocking the GPTBot crawler. The same trend applies to the rest of the AI crawlers.

Note 1: Unlike the bar chart above, this stacked chart represents only the websites we analyzed from our dataset of 1 million top-traffic sites; it excludes the sites analyzed by users via Nexunom’s robots.txt checker. It is therefore a better representation of the blocking behavior of top-traffic sites.

Note 2: The “Other” category represents, for each bot group, the sum of occurrences from all categories with 10 or fewer occurrences. Grouping them avoids cluttering the stacked chart with many small categories.
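The bucketing rule in Note 2 can be sketched as follows; the threshold and sample counts are illustrative.

```python
# Sketch of the "Other" bucketing: per-bot category counts at or below the
# threshold are summed into a single "Other" slice. Sample counts are made up.
def bucket_other(category_counts: dict[str, int], threshold: int = 10) -> dict[str, int]:
    out, other = {}, 0
    for category, n in category_counts.items():
        if n <= threshold:
            other += n          # small categories are pooled together
        else:
            out[category] = n   # large categories keep their own slice
    if other:
        out["Other"] = other
    return out

print(bucket_other({"News": 209, "Technology": 122, "Weather": 4, "Toys": 7}))
# -> {'News': 209, 'Technology': 122, 'Other': 11}
```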

Categories of Sites Blocking Each Web Crawler

The following table shows the top four “site categories” blocking each of the top 15 crawlers (excluding the “Other” category).

Bot Group | Category | Occurrences
GPTBot | News | 209
GPTBot | Technology | 122
GPTBot | Entertainment | 92
GPTBot | Business & Finance | 78
SemrushBot | News | 72
SemrushBot | Education | 59
SemrushBot | Government & Public Services | 57
SemrushBot | Reference | 41
CCBot | News | 183
CCBot | Technology | 78
CCBot | Entertainment | 60
CCBot | Business & Finance | 47
TeleportBot | Reference | 78
TeleportBot | News | 67
TeleportBot | Education | 44
TeleportBot | Online Libraries | 42
Google-Extended | News | 185
Google-Extended | Entertainment | 62
Google-Extended | Technology | 60
Google-Extended | Education | 36
MJ12bot | Reference | 79
MJ12bot | News | 79
MJ12bot | Education | 47
MJ12bot | Technology | 55
ChatGPT-User | News | 158
ChatGPT-User | Technology | 67
ChatGPT-User | Education | 40
ChatGPT-User | Business & Finance | 35
AhrefsBot | News | 90
AhrefsBot | Technology | 63
AhrefsBot | Government & Public Services | 53
AhrefsBot | Education | 49
WebCopier | Reference | 79
WebCopier | News | 62
WebCopier | Online Libraries | 42
WebCopier | Education | 42
WebStripper | Reference | 76
WebStripper | News | 62
WebStripper | Online Libraries | 42
WebStripper | Education | 42
Offline Explorer | Reference | 76
Offline Explorer | News | 62
Offline Explorer | Online Libraries | 42
Offline Explorer | Education | 42
SiteSnagger | Reference | 74
SiteSnagger | News | 61
SiteSnagger | Online Libraries | 42
SiteSnagger | Education | 42
WebZIP | Reference | 77
WebZIP | News | 60
WebZIP | Online Libraries | 42
WebZIP | Education | 44
larbin | Reference | 74
larbin | News | 55
larbin | Education | 41
larbin | Online Libraries | 41
MSIECrawler | Reference | 75
MSIECrawler | News | 60
MSIECrawler | Education | 43
MSIECrawler | Online Libraries | 41

Limitations and Considerations

  • User Contributions: While the dataset may include domains searched by users in our robots.txt tester, the proportion of such data is considered negligible and does not significantly influence the results.
  • Single Domain Representation: Each domain is counted only once in the analysis, with the latest crawl of its robots.txt file contributing to the dataset. This approach ensures fair representation and avoids skewing the results based on multiple entries for the same domain.
  • Incomplete Data: While efforts are made to continuously update the dataset, it may not capture all domains or reflect instantaneous changes in robot exclusion practices across the web.
  • Potential Sampling Bias: The dataset’s composition may be influenced by sampling bias inherent in selecting domains from the top 1 million high-traffic websites, potentially limiting the generalizability of the findings to the web as a whole.
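The single-domain representation rule described above can be sketched as follows; the record format, timestamps, and domains are illustrative assumptions.

```python
# Sketch of per-domain deduplication: when a domain appears more than once,
# only its most recent robots.txt analysis is kept. Records are illustrative:
# (domain, crawl_timestamp, set of fully blocked crawlers).
def latest_per_domain(records: list[tuple[str, int, set[str]]]) -> dict[str, set[str]]:
    """Keep only the newest record per domain."""
    newest: dict[str, tuple[int, set[str]]] = {}
    for domain, ts, blocked in records:
        if domain not in newest or ts > newest[domain][0]:
            newest[domain] = (ts, blocked)
    return {domain: blocked for domain, (_, blocked) in newest.items()}

records = [
    ("example.com", 1, {"GPTBot"}),
    ("example.com", 2, {"GPTBot", "CCBot"}),  # newer crawl supersedes the older one
    ("example.org", 1, set()),
]
result = latest_per_domain(records)
print(result)
```

This is the step that keeps each domain from being counted more than once in the occurrence totals.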

Analysis of the Results

The results indicate that certain categories of web crawlers are among the most frequently blocked across the analyzed websites. Notably, crawlers related to artificial intelligence tools (AI crawlers), such as GPTBot, Google-Extended, and ChatGPT-User, feature prominently among the top blocked crawlers. CCBot (the Common Crawl bot), which historically provides data for training AI tools, is also among the top five. However, the latest results show that anthropic-ai has dropped out of the top 15: this AI crawler ranked among the top 15 when 50,000 sites had been analyzed, but fell to position 18 after 30,000 more sites were analyzed. Another AI crawler, PerplexityBot, sits at position 52 of the top blocked crawlers, probably because it is a newer or less prominent crawler.

Additionally, web crawlers associated with search engine optimization (SEO) tools, such as SemrushBot, AhrefsBot, and MJ12bot (the Majestic SEO web crawler), are prevalent in the list of frequently blocked crawlers. However, after 30,000 more websites were analyzed, dotbot (the Moz web crawler), which was initially among the 15 most blocked crawlers, dropped to position 19.

While these findings offer insight into common robot-exclusion practices, the results are presented for informational purposes only. Some web crawlers, particularly those related to AI tools such as GPTBot or Google-Extended, may contribute valuable traffic to the sites they crawl. Website owners are therefore advised against hasty blocking, to ensure that legitimate bot traffic is not inadvertently excluded.

The findings contribute to our understanding of robot crawling management practices and can help form strategies for handling web crawler traffic effectively. If you want to contribute to this research or have any suggestions or recommendations for us, please feel free to leave a comment below.


Author

  • Saeed Khosravi

    Saeed Khosravi is an SEO Strategist, Digital Marketer, and WordPress Expert with over 15 years of experience, starting his career in 2008. He graduated with a degree in MIB Marketing from HEC Montreal. As the Founder and CEO of Nexunom, Saeed, alongside his dedicated team, provides comprehensive digital marketing solutions to local businesses. He is also the founder and the main brain behind several successful marketing SaaS platforms, including Allintitle.co, ReviewTool.com, and Tavata.com.

    https://www.linkedin.com/in/saeedkhosravi/