How disabling JavaScript rendering makes Google go to war with LLMs (ChatGPT, Llama)

Disabling JavaScript rendering for bots that rely on Google's engine for parsing makes it much harder for that data to reach the output of LLM models. When web browsers render JavaScript, they can create dynamic, interactive content that changes based on user actions. When JavaScript is disabled, however, websites typically fall back to their basic HTML structure. This seemingly simple technical detail has important implications for how AI models interact with web content.

Language models like ChatGPT and Llama typically access web content in a way that's similar to having JavaScript disabled - they see the basic HTML structure rather than the full interactive experience. This creates an interesting situation with Google's search results pages.

Google has historically used JavaScript to enhance their search results, making them more interactive and dynamic. However, they've noticed that AI models can scrape and potentially reuse their search results when accessing the basic HTML version. This creates a dilemma for Google - they want their content to be accessible to human users but may want to limit how easily AI models can systematically extract and repurpose their data.

First signals: Google is making a move to change the JavaScript basics documentation.

Take Google's "Understand the JavaScript SEO basics" page and compare it to the older version. The old text read:

"Googlebot queues pages for both crawling and rendering. It is not immediately obvious when a page is waiting for crawling and when it is waiting for rendering.

When Googlebot fetches a URL from the crawling queue by making an HTTP request, it first checks if you allow crawling. Googlebot reads the robots.txt file. If it marks the URL as disallowed, then Googlebot skips making an HTTP request to this URL and skips the URL."

Changed to:
"Googlebot queues pages for both crawling and rendering. It is not immediately obvious when a page is waiting for crawling and when it is waiting for rendering. When Googlebot fetches a URL from the crawling queue by making an HTTP request, it first checks if you allow crawling. Googlebot reads the robots.txt file. If it marks the URL as disallowed, then Googlebot skips making an HTTP request to this URL and skips the URL. Google Search won't render JavaScript from blocked files or on blocked pages."

We always want to link to valuable sources: more explanation is on Seroundtable. Thanks for this analysis.
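The robots.txt check described in that passage is the same one any polite crawler has to perform before fetching a page. Below is a minimal sketch of it in Python; the domain and user agent are placeholders, not Google's actual values.

```python
# Minimal sketch of the robots.txt check described above: before requesting a URL,
# a well-behaved crawler reads robots.txt and skips disallowed URLs entirely.
# The URL and user-agent below are placeholders, not Google's actual values.
from urllib import robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_if_allowed(url: str, user_agent: str = "example-crawler") -> bytes | None:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # download and parse robots.txt

    if not rp.can_fetch(user_agent, url):
        # Disallowed: skip the HTTP request entirely, as the docs describe.
        return None

    with urlopen(url) as response:
        return response.read()

if __name__ == "__main__":
    body = fetch_if_allowed("https://example.com/some-page")
    print("skipped (disallowed)" if body is None else f"fetched {len(body)} bytes")
```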

Let's understand the data collection aspect: When LLMs like Claude, Perplexity, or ChatGPT are trained, they often use web content that has been crawled and processed. If JavaScript content is blocked from rendering in Google's system, this could create a cascade effect:

  1. The JavaScript content never gets rendered in Google's index
  2. Engines that rely on that index don't have enough data to answer correctly
  3. The result is bad answers, because there isn't enough data to produce the right one

On January 16, the first reports appeared about the discontinuation of JavaScript-free page parsing (scraping). The first post appeared on Hacker News.

If you don't enable JavaScript and try to parse the content, for example with Firefox 128, a message appears requiring you to turn it on.
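If you want to detect that fallback programmatically, a rough sketch looks like the one below. The exact wording of Google's notice is an assumption here and varies by region and version, so adjust the phrase being searched for.

```python
# Rough illustration: fetch a search results page without executing JavaScript
# and look for the fallback notice. The notice wording is an assumption.
import urllib.request

def looks_js_gated(url: str) -> bool:
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace").lower()
    # A plain-HTML client only sees the static fallback, so a phrase like this
    # (hypothetical wording) signals that the real results require JavaScript.
    return "enable javascript" in html

if __name__ == "__main__":
    print(looks_js_gated("https://www.google.com/search?q=test"))
```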

The LLM problem with scraping

This is one indication that Google wants to fight for leadership in AI and language models, or at least not lose its position as the leading search engine for web content. Let's break it down a little.

What is an LLM in AI?

An LLM, or Large Language Model, is a type of artificial intelligence system designed to process, understand, and generate human-like text. These models are built using deep learning techniques, particularly neural networks with many layers, allowing them to handle vast amounts of textual data and learn complex language patterns. OpenAI describes it as a system trained on lots of text to create a model that can generate new text. LLMs can perform various language tasks like writing, answering questions, translation, and summarization without being specifically programmed for each task. They recognize patterns and relationships in language, which enables them to understand context and generate coherent responses.

However, it's important to understand that despite their impressive capabilities, LLMs don't truly "understand" language the way humans do. They're pattern matching systems that can sometimes make mistakes or generate plausible-sounding but incorrect information.

How much data on average does an LLM need to train?

To train a modern LLM, you need datasets ranging from hundreds of gigabytes to several terabytes of text. For example, GPT-3, with its 175 billion parameters, was trained on approximately 570 GB of text data, which is roughly equivalent to 400 billion tokens. A token represents about 4 characters on average, so this translates to around 1.6 trillion characters of text.

Now, let's convert this to paper pages. For this calculation, we'll need to know how many characters typically fit on a standard page. A standard page using Times New Roman 12-point font with standard margins typically contains:

  • About 50 lines per page
  • Approximately 75 characters per line
  • This gives us 3,750 characters per page (50 × 75)

To find the number of pages, we divide the total number of characters by characters per page: 1.6 trillion characters ÷ 3,750 characters per page = 426.67 million pages.

To put this in perspective, let's consider what this means physically:

  • A standard ream of paper (500 sheets) is about 2 inches (5.08 cm) thick
  • So 426.67 million pages would be equivalent to 853,340 reams
  • This stack would be approximately 1,706,680 inches or 142,223 feet tall
  • Converting to more relatable terms, this stack would be about 27 miles (43 kilometers) high.

If we instead laid the pages end to end (each standard page is about 11 inches long), they would stretch roughly 74,000 miles and wrap around the Earth about three times, since the Earth's circumference is approximately 24,901 miles.
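For anyone who wants to check the arithmetic, here is the same back-of-the-envelope calculation as a short script, using the rough assumptions from the text (400 billion tokens, 4 characters per token, 3,750 characters per page, 11-inch pages):

```python
# Back-of-the-envelope check of the numbers above. All inputs are the same rough
# assumptions used in the text, not exact figures.
TOKENS = 400e9                  # ~400 billion tokens (GPT-3 training data estimate)
CHARS_PER_TOKEN = 4             # rough average of 4 characters per token
CHARS_PER_PAGE = 50 * 75        # 50 lines x 75 characters = 3,750 per page

characters = TOKENS * CHARS_PER_TOKEN      # ~1.6 trillion characters
pages = characters / CHARS_PER_PAGE        # ~426.7 million pages

reams = pages / 500                        # 500 sheets per ream
stack_miles = reams * 2 / 12 / 5280        # a ream is ~2 inches thick -> ~27 miles

PAGE_LENGTH_INCHES = 11                    # US letter page length
laid_out_miles = pages * PAGE_LENGTH_INCHES / 12 / 5280
earth_wraps = laid_out_miles / 24_901      # Earth's circumference in miles

print(f"{pages / 1e6:.1f} million pages")
print(f"stack height: about {stack_miles:.0f} miles")
print(f"laid end to end: about {laid_out_miles:,.0f} miles, "
      f"or roughly {earth_wraps:.1f} trips around the Earth")
```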

Why is Google pushing the JavaScript renderer for retrieving search results?

Because Google is way better at this than ChatGPT or Claude. Google knows its position when it comes to search engines: it is number one and would like to remain so for years to come.

In a recent survey of 1,000 people (December 2024, Evercore ISI), OpenAI's chatbot was the top search provider for 5% of respondents, up from 1% in June.

Google's global search market share fell below 90% in late 2024 for the first time since 2015, dropping from 91.62% to 89.73% year over year. Source: https://gs.statcounter.com/search-engine-market-share

Two percent is just a small difference, but not for a company worth many billions.

This leads to what some have called an "arms race" between Google and AI companies. As Google implements new protections, AI companies work on more sophisticated ways to access web content. This creates interesting technical and ethical questions about data access, fair use, and the boundaries between human and AI interaction with web services.

The broader context here is about control over valuable training data. Search results represent a massive, curated dataset that could be valuable for training AI models. Google, understandably, wants to maintain control over how their data is used, while AI companies need access to high-quality data to improve their models.

What makes this particularly interesting is that Google isn't just defending against competitors - they're also developing their own AI models. This puts them in a complex position where they need to balance protecting their data while also advancing their own AI capabilities.

Why is the JavaScript rendering issue such a big topic, and can it affect your site too?

Yes, JavaScript policy changes can cost you more. Each page request requires the server to dynamically render HTML, which consumes more resources. The median page serves 591.2 KB of JavaScript (564.2 KB on mobile), making it a significant contributor to page weight (source: https://web.dev/articles/rendering-on-the-web?hl=en). More data served means increased costs.
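As a purely hypothetical illustration of what that means for a budget, the sketch below multiplies an assumed volume of bot traffic by the median JavaScript payload and a placeholder egress price; swap in your own analytics and your provider's real rates:

```python
# Hypothetical cost estimate, not real pricing: roughly how much extra transfer
# bot traffic adds when every hit also pulls the page's JavaScript payload.
# Replace these assumptions with your own analytics and your provider's rates.
BOT_REQUESTS_PER_MONTH = 500_000   # assumed number of bot hits per month
JS_KB_PER_PAGE = 591.2             # median JavaScript payload cited above (desktop)
EGRESS_USD_PER_GB = 0.09           # placeholder egress price per gigabyte

extra_gb = BOT_REQUESTS_PER_MONTH * JS_KB_PER_PAGE / 1024 / 1024
extra_cost = extra_gb * EGRESS_USD_PER_GB

print(f"~{extra_gb:.0f} GB of JavaScript served to bots, roughly ${extra_cost:.2f}/month")
```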

On 10 January, TechCrunch described the case of Triplegangers. Triplegangers, a boutique e-commerce platform offering 3D image files, experienced a severe service disruption when OpenAI's web crawler inadvertently overwhelmed its servers. Operating from 600 distinct IP addresses, the crawler deluged the system with tens of thousands of requests while attempting to index an extensive catalog of over 65,000 products and hundreds of thousands of images.

The intensive data harvesting operation effectively paralyzed the website's operations, triggering system failures and blocking legitimate customer access to the platform. Now imagine every crawler accessing your site and downloading more data - way more data - and taking more time to process the information. Here's an example of how much more time SerpApi needed to scrape data from Google.

What is the data difference between HTML and JavaScript crawl output?

When we crawl HTML directly, we get a static snapshot of the page's initial structure - imagine taking a photograph of a building's framework before any movement or activity happens inside. This HTML output contains the base structure and content sent from the server, including all the static text, images, and markup that make up the page's skeleton.

JavaScript crawling, on the other hand, is more like recording a video of the building coming to life. When JavaScript executes, it can dynamically modify the Document Object Model (DOM), make API calls to fetch additional data, and create new content that wasn't present in the initial HTML. This means JavaScript crawling can capture:

  1. Dynamically loaded content that appears after API calls
  2. Elements that are created or modified by JavaScript after the page loads
  3. Content that's revealed through user interactions like clicking or scrolling
  4. Data that's pulled from external sources and injected into the page

For example, imagine crawling a modern social media feed. An HTML crawl would only capture the initial few posts that were included in the server's response. But a JavaScript crawl would be able to capture all the additional posts that load as you scroll down, since those are typically fetched dynamically through JavaScript.

This difference becomes particularly important when dealing with Single Page Applications (SPAs) or any modern web application that heavily relies on JavaScript for content rendering. In these cases, an HTML crawl might return very minimal content, while a JavaScript crawl would capture the full, interactive experience of the site.
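To see the difference yourself, you can compare a plain HTTP fetch with a headless-browser render. The sketch below uses the standard library for the static fetch and assumes Playwright is installed (`pip install playwright` plus `playwright install chromium`) for the rendered one; any headless browser would do.

```python
# Rough comparison of the two approaches: static HTML vs. rendered DOM.
import urllib.request

def static_html(url: str) -> str:
    """Plain HTTP fetch: only the initial server response, no JavaScript executed."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def rendered_html(url: str) -> str:
    """Headless-browser fetch: the DOM after scripts run and dynamic content loads."""
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

if __name__ == "__main__":
    url = "https://example.com"  # placeholder; try a JavaScript-heavy page instead
    print("static  :", len(static_html(url)), "bytes of HTML")
    print("rendered:", len(rendered_html(url)), "bytes of HTML")
```

On a JavaScript-heavy site the rendered output is usually substantially larger, which is exactly the gap an HTML-only crawler misses.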

Our knowledge and content are the models' retraining base, and there's not much we can do about it.

When we interact online, we create various types of data: social media posts, emails, chat messages, documents, code, articles, videos, and more. This data becomes part of the ever-growing digital ecosystem. Companies and organizations might collect and use this data, following their terms of service and privacy policies, to train or fine-tune AI models.

The cycle works something like this: We create content → This content becomes potentially available as training data → New AI models might be trained on this data → These models help create more content → And the cycle continues. It's similar to how human knowledge has traditionally grown and evolved, but at a much faster pace and larger scale.

There is not that much control - or lack thereof. Once we put content online, it can be difficult to fully control how it might be used in AI training. While there are some technical approaches like robots.txt files or specific licenses that attempt to restrict data usage, these aren't always foolproof solutions.

This raises important questions about data ownership, privacy, and consent in the AI age. For instance, when someone posts a blog article or shares code on GitHub, should they have a say in whether their content is used to train AI models? How can we balance the benefits of having AI systems trained on diverse, real-world data with individuals' rights to control their digital creations?

How can you block bots or lower website costs?

To lower website costs, consider cloud cost optimization. To block bots, it's good to know that Cloudflare offers some control through Bot Fight Mode.

Bot Fight Mode:

  • Identifies traffic matching patterns of known bots
  • Issues computationally expensive challenges to suspected bots
  • Notifies Bandwidth Alliance partners to disable detected bots
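If you prefer (or need) to handle this at the application level rather than at the CDN, a complementary approach is to filter requests by user agent. This is only a sketch, not Cloudflare's feature; the bot names are examples of crawlers that publish their user agents, and determined scrapers can spoof them.

```python
# A complementary, application-level sketch - not Cloudflare's feature. It rejects
# requests whose User-Agent matches a blocklist. The names below are examples of
# crawlers that publish their user agents; determined scrapers can spoof them.
BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider")

class BotBlockMiddleware:
    """WSGI middleware that returns 403 for blocklisted user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name.lower() in user_agent for name in BLOCKED_AGENT_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Bots are not allowed here.\n"]
        return self.app(environ, start_response)

# Usage: wrap any WSGI application, e.g. app = BotBlockMiddleware(app)
```

Pair it with a robots.txt policy so well-behaved crawlers skip your pages before ever making a request.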

We're not going to win this battle, but at least let's not be victims of it - let's not pay more for JavaScript rendering than we have to. It's not that simple, but finding the balance between what bots can access and what they can't is the key to keeping costs under control while staying visible in Google.
