Disabling JavaScript-free access for bots that parse Google's search results makes it much harder to get that data into the output of LLM models. When web browsers render JavaScript, they can create dynamic, interactive content that changes based on user actions. However, when JavaScript is disabled, websites typically fall back to their basic HTML structure. This seemingly simple technical detail has important implications for how AI models interact with web content.
Language models like ChatGPT and Llama typically access web content in a way that's similar to having JavaScript disabled - they see the basic HTML structure rather than the full interactive experience. This creates an interesting situation with Google's search results pages.
Google has historically used JavaScript to enhance their search results, making them more interactive and dynamic. However, they've noticed that AI models can scrape and potentially reuse their search results when accessing the basic HTML version. This creates a dilemma for Google - they want their content to be accessible to human users but may want to limit how easily AI models can systematically extract and repurpose their data.
Look at Google's "Understand the JavaScript SEO basics" documentation - if you compare the current version to the older one:
"Googlebot queues pages for both crawling and rendering. It is not immediately obvious when a page is waiting for crawling and when it is waiting for rendering.
When Googlebot fetches a URL from the crawling queue by making an HTTP request, it first checks if you allow crawling. Googlebot reads the robots.txt file. If it marks the URL as disallowed, then Googlebot skips making an HTTP request to this URL and skips the URL."
Changed to:
"Googlebot queues pages for both crawling and rendering. It is not immediately obvious when a page is waiting for crawling and when it is waiting for rendering. When Googlebot fetches a URL from the crawling queue by making an HTTP request, it first checks if you allow crawling. Googlebot reads the robots.txt file. If it marks the URL as disallowed, then Googlebot skips making an HTTP request to this URL and skips the URL. Google Search won't render JavaScript from blocked files or on blocked pages."
We always want to link to credible sources: more explanation is available on Search Engine Roundtable (seroundtable). Thanks for this analysis.
Let's understand the data collection aspect: when LLMs like Claude, Perplexity, or ChatGPT are trained, they often use web content that has been crawled and processed. If JavaScript content is blocked from rendering in Google's system, this could create a cascade effect downstream.
On January 16, the first reports appeared about the end of JavaScript-free parsing (scraping) of Google's results pages. The first post appeared on Hacker News.
If you keep JavaScript disabled and try to parse the content - for example, in Firefox 128 - a message appears telling you to turn it on.
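You can see this fallback from a script as well. The sketch below fetches a results page without executing any JavaScript and looks for an "enable JavaScript" hint in the raw HTML. The query URL and the exact wording of the notice are assumptions here, so treat the string check as a rough heuristic, not an official API.

```python
import requests

# Assumed URL format for a search results page; used only as an illustration.
url = "https://www.google.com/search?q=javascript+seo"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
}

html = requests.get(url, headers=headers, timeout=10).text

# No JavaScript runs here, so we only ever see the static fallback markup.
# The phrases below are a guess at the "turn on JavaScript" notice -
# adjust them to whatever the page actually returns.
if "enable javascript" in html.lower() or "noscript" in html.lower():
    print("Got the JavaScript-required fallback, not real results.")
else:
    print(f"Received {len(html)} bytes of static HTML.")
```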
This is one of the signs that Google wants to fight for leadership in AI and language models - or at least not lose its position as the leading search engine for web content. But let's break it down a little.
An LLM, or Large Language Model, is a type of artificial intelligence system designed to process, understand, and generate human-like text. These models are built using deep learning techniques, particularly neural networks with many layers, which allows them to handle vast amounts of textual data and learn complex language patterns. OpenAI describes it as a system trained on lots of text to create a model that can generate new text. What makes LLMs stand out is their ability to perform various language tasks - writing, answering questions, translation, summarization - without being specifically programmed for each task, and their ability to recognize patterns and relationships in language, which enables them to understand context and generate coherent responses.
However, it's important to understand that despite their impressive capabilities, LLMs don't truly "understand" language the way humans do. They're pattern matching systems that can sometimes make mistakes or generate plausible-sounding but incorrect information.
To train a modern LLM, you need datasets ranging from hundreds of gigabytes to several terabytes of text. For example, GPT-3, with its 175 billion parameters, was trained on approximately 570 GB of text data, which is roughly equivalent to 400 billion tokens. A token represents about 4 characters on average, so this translates to around 1.6 trillion characters of text.
Now, let's convert this to paper pages. For this calculation, we'll need to know how many characters typically fit on a standard page. A standard page using Times New Roman 12-point font with standard margins holds roughly 3,750 characters.
To find the number of pages, we divide the total number of characters by characters per page: 1.6 trillion characters ÷ 3,750 characters per page = 426.67 million pages.
To put this in perspective, let's consider what it means physically: if we laid the pages end to end, they would wrap around the Earth about three times, as the Earth's circumference is approximately 24,901 miles.
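The back-of-the-envelope numbers above are easy to check. The sketch below redoes the arithmetic, assuming 4 characters per token, 3,750 characters per page, and an 11-inch (US letter) page laid lengthwise - all of those are rough assumptions, not exact figures.

```python
# Rough arithmetic behind the "three times around the Earth" claim.
tokens = 400e9                 # ~400 billion tokens (GPT-3 training data)
chars_per_token = 4            # average, assumption
chars_per_page = 3750          # Times New Roman 12 pt, standard margins
page_length_in = 11            # US letter page height in inches, assumption
earth_circumference_mi = 24901

characters = tokens * chars_per_token       # ~1.6 trillion characters
pages = characters / chars_per_page         # ~426.67 million pages
miles = pages * page_length_in / 63360      # 63,360 inches in a mile

print(f"{pages / 1e6:.2f} million pages")                               # ~426.67
print(f"{miles:,.0f} miles of pages laid end to end")                   # ~74,000
print(f"{miles / earth_circumference_mi:.1f} times around the Earth")   # ~3.0
```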
Why does Google care? Because it is still way better at search than ChatGPT or Claude. Google knows its position when it comes to search engines: it is number one and would like to remain so for years to come.
In a recent survey of 1,000 people (Evercore ISI, December 2024), OpenAI's chatbot was the top search provider for 5% of respondents, up from 1% in June.
Google's global search market share fell below 90% in late 2024 for the first time since 2015, dropping from 91.62% to 89.73% year-over-year. Source: https://gs.statcounter.com/search-engine-market-share
Two percentage points is a small difference, but not for a company worth many billions.
This leads to what some have called an "arms race" between Google and AI companies. As Google implements new protections, AI companies work on more sophisticated ways to access web content. This creates interesting technical and ethical questions about data access, fair use, and the boundaries between human and AI interaction with web services.
The broader context here is about control over valuable training data. Search results represent a massive, curated dataset that could be valuable for training AI models. Google, understandably, wants to maintain control over how their data is used, while AI companies need access to high-quality data to improve their models.
What makes this particularly interesting is that Google isn't just defending against competitors - they're also developing their own AI models. This puts them in a complex position where they need to balance protecting their data while also advancing their own AI capabilities.
Yes, JavaScript policy changes can cost you more. Each page request requires the server to dynamically render HTML, which consumes more resources. The median page serves 591.2 KB of JavaScript (564.2 KB on mobile), making it a significant contributor to page weight (source: https://web.dev/articles/rendering-on-the-web?hl=en). More data transferred means higher costs.
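To get a feel for the cost side, here is a hedged back-of-the-envelope sketch: it multiplies the median JavaScript payload quoted above by a hypothetical bot traffic volume and an assumed $0.09/GB egress price. Both the request count and the price are placeholders, not your real bill, and rendering CPU time would come on top.

```python
# Very rough bandwidth cost estimate for bot traffic hitting JS-heavy pages.
js_kb_per_page = 591.2               # median JS payload per page (desktop)
bot_requests_per_month = 5_000_000   # hypothetical crawl volume, assumption
egress_usd_per_gb = 0.09             # assumed cloud egress price, varies by provider

gb_transferred = js_kb_per_page * bot_requests_per_month / 1024 / 1024
monthly_cost = gb_transferred * egress_usd_per_gb

print(f"~{gb_transferred:,.0f} GB of JavaScript served to bots per month")
print(f"~${monthly_cost:,.0f} in egress alone, before any rendering CPU cost")
```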
On January 10, TechCrunch described the case of Triplegangers. Triplegangers, a boutique e-commerce platform offering 3D image files, experienced a severe service disruption when OpenAI's web crawler inadvertently overwhelmed their servers. Operating from 600 distinct IP addresses, the crawler deluged their system with tens of thousands of requests while attempting to index their extensive catalog of over 65,000 products and hundreds of thousands of images.
So what does a crawler actually see? When we crawl HTML directly, we get a static snapshot of the page's initial structure - imagine taking a photograph of a building's framework before any movement or activity happens inside. This HTML output contains the base structure and content sent from the server, including all the static text, images, and markup that make up the page's skeleton.
JavaScript crawling, on the other hand, is more like recording a video of the building coming to life. When JavaScript executes, it can dynamically modify the Document Object Model (DOM), make API calls to fetch additional data, and create new content that wasn't present in the initial HTML. This means JavaScript crawling can capture:
- content injected into the DOM after the page loads
- data fetched from APIs after the initial server response
- elements that only appear after scrolling or other user interactions
For example, imagine crawling a modern social media feed. An HTML crawl would only capture the initial few posts that were included in the server's response. But a JavaScript crawl would be able to capture all the additional posts that load as you scroll down, since those are typically fetched dynamically through JavaScript.
This difference becomes particularly important when dealing with Single Page Applications (SPAs) or any modern web application that heavily relies on JavaScript for content rendering. In these cases, an HTML crawl might return very minimal content, while a JavaScript crawl would capture the full, interactive experience of the site.
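The difference is easy to demonstrate. The sketch below fetches the same hypothetical page twice - once as a plain HTML snapshot with requests, once through a headless browser with Playwright that executes the page's JavaScript - and compares how much markup each approach ends up with. The URL is a placeholder, and you'd need the requests and playwright packages installed (plus a `playwright install chromium` run) for this to work.

```python
import requests
from playwright.sync_api import sync_playwright

URL = "https://example.com/feed"   # placeholder for a JS-heavy page

# 1) HTML crawl: the static "photograph" sent by the server.
static_html = requests.get(URL, timeout=10).text

# 2) JavaScript crawl: the "video" - a headless browser runs the page's
#    scripts, lets the DOM update, then we read the final markup.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()

print(f"Static HTML:   {len(static_html):>10,} bytes")
print(f"Rendered HTML: {len(rendered_html):>10,} bytes")
# On SPAs the rendered version is usually much larger, because it includes
# content fetched and injected by JavaScript after the initial response.
```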
When we interact online, we create various types of data: social media posts, emails, chat messages, documents, code, articles, videos, and more. This data becomes part of the ever-growing digital ecosystem. Companies and organizations might collect and use this data, following their terms of service and privacy policies, to train or fine-tune AI models.
The cycle works something like this: We create content → This content becomes potentially available as training data → New AI models might be trained on this data → These models help create more content → And the cycle continues. It's similar to how human knowledge has traditionally grown and evolved, but at a much faster pace and larger scale.
There is not that much control - or rather, there's a real lack of it. Once we put content online, it can be difficult to fully control how it might be used in AI training. While there are some technical approaches, like robots.txt files or specific licenses, that attempt to restrict data usage, these aren't always foolproof solutions.
This raises important questions about data ownership, privacy, and consent in the AI age. For instance, when someone posts a blog article or shares code on GitHub, should they have a say in whether their content is used to train AI models? How can we balance the benefits of having AI systems trained on diverse, real-world data with individuals' rights to control their digital creations?
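As one concrete (if imperfect) illustration of those technical approaches, a site can declare in robots.txt which crawlers may use its content. The user agents below are the ones the respective companies document - GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training - but compliance is still voluntary on the crawler's side, so this is a signal, not an enforcement mechanism.

```
# robots.txt - opt out of AI training crawlers while keeping normal search bots
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Regular search indexing stays allowed
User-agent: *
Allow: /
```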
To lower website costs, consider cloud cost optimization. To block bots, it's good to know that Cloudflare offers some level of control through Bot Fight Mode, which, among other things, notifies Bandwidth Alliance partners so they can disable detected bots on their end.
We're not going to win this battle, but at least let's not be its victims - let's not pay more for JS rendering than we have to. It's not that simple, of course, but finding the balance between what bots can access and what they can't is the key to keeping costs under control while staying visible in Google.