The Choice
While lawsuits wind their way through the courts around the globe, there is little to stop AI companies from taking almost anything that is in digital form. Google (GOOG) and the other search-engine bots that spend their days and nights crawling sites and information to index respect a simple text file (robots.txt) that sits in a site's root directory and tells them which parts of the website they may access and which they may not. AI content bots often do not. They scrape sites for data and frequently pay no attention to whether that data is protected, licensed, or copyrighted. They have little care for who owns the material, as long as it seems valid and can be used to maintain the data 'freshness' that models require.
If you have a site with protected content on it, it is quite difficult to stop AI data crawlers from extracting data and images from your web pages. This data is taken, processed, and organized, and can be added to existing datasets or used to create a proprietary dataset. The robots.txt file mentioned above can name specific crawlers the site does not want to admit (User-agent: GPTBot / Disallow: /), but because robots.txt is basically an honor system, crawlers that are told to ignore the file do what they want and copy what they want.
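To make the honor system concrete, here is a minimal sketch of what a well-behaved crawler does before fetching a page, using Python's standard-library robots.txt parser. The directives mirror the GPTBot example above; the URLs are placeholders. A rogue crawler simply skips this check entirely.

```python
# A well-behaved crawler consults robots.txt before fetching anything.
# Python's standard library ships a parser for exactly this purpose.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot is shut out entirely; an ordinary crawler is welcome.
print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))         # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))   # True
```

The whole protection amounts to these two lines of lookup — which is exactly why it fails against a bot whose operator never calls `can_fetch` in the first place.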
Many content creators are not willing to leave the decision about what is 'fair use' and what is not up to the courts, and as of July 1, Cloudflare (NET) blocks AI crawlers by default on any new site that joins its network, with a simple opt-in available to any existing customer. This feature blocks all verified 'AI-related' bots, as well as any unverified bots that exhibit similar behavior, but does not block traditional search engine crawlers. Further, users can tell the bots which parts of the site (if any) may be accessed for training data via a simple dashboard. But it gets better…
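The rule being described — block AI-training crawlers, let traditional search crawlers through — can be sketched as a few lines of request filtering. The bot names below are illustrative stand-ins, not Cloudflare's actual verified-bot lists, and real detection also uses behavioral signals, not just the User-Agent string.

```python
# Toy version of the default rule: block AI crawlers, allow search crawlers.
# These lists are assumed examples for illustration only.
AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}
SEARCH_CRAWLERS = {"Googlebot", "Bingbot"}

def allow_request(user_agent, block_ai=True):
    """Return True if the request should be served."""
    bot = next((b for b in AI_CRAWLERS | SEARCH_CRAWLERS if b in user_agent), None)
    if bot in AI_CRAWLERS and block_ai:
        return False          # verified AI crawler: blocked by default
    return True               # search crawlers and ordinary browsers pass

print(allow_request("Mozilla/5.0 (compatible; GPTBot/1.0)"))     # False
print(allow_request("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
```

The `block_ai` flag plays the role of the dashboard toggle: flip it off and the same AI crawler is served normally.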
Publishers can set one of three access levels for their sites. They can allow free access, they can charge a 'per-crawl' fee, or they can block the site completely. AI crawlers must register with Cloudflare, receive a cryptographic key for identification, and configure the crawler to include 'payment intent'. Cloudflare acts as the 'merchant', essentially the middleman in the transaction, and handles the infrastructure and billing. The goal (other than to make money for Cloudflare) is to make sure content creators are compensated for their work, whether the courts see AI 'strip-mining' as legal or not. If you can't beat them, block them (or charge them).
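Cloudflare has described this pay-per-crawl flow in terms of the long-dormant HTTP 402 "Payment Required" status code. The sketch below simulates the negotiation from the publisher's side; the function name, price, and the idea of passing a maximum price are simplified stand-ins for illustration, not the real API.

```python
# Simplified pay-per-crawl negotiation, simulated in-process.
# An unregistered bot is refused; a registered bot with no payment
# intent gets a 402 price quote; one willing to pay gets the content.
PRICE_PER_CRAWL = 0.01  # assumed publisher-set price, in USD

def handle_crawl(crawler_id, registered, max_price):
    """Return (status, message) for one crawl attempt."""
    if crawler_id not in registered:
        return 403, "unregistered crawler blocked"
    if max_price is None or max_price < PRICE_PER_CRAWL:
        # Quote the price; the crawler may retry with payment intent.
        return 402, "payment required: %.2f per crawl" % PRICE_PER_CRAWL
    return 200, "content served; crawl billed"

registered = {"gptbot-verified"}
print(handle_crawl("mystery-bot", registered, None))       # status 403
print(handle_crawl("gptbot-verified", registered, None))   # status 402
print(handle_crawl("gptbot-verified", registered, 0.05))   # status 200
```

Free access and full blocking are just the degenerate cases of this same check: a price of zero, or a refusal regardless of payment.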
Cloudflare has gone as far as to develop software that can sense when an unregistered crawler is at work. It serves the unauthorized bot AI-generated decoy pages containing hidden links that only bots would follow. This causes the bots to waste time and resources and helps Cloudflare learn more about each new generation of AI bots, all of which works in conjunction with Cloudflare's existing security and detection techniques. The aim is to return content control to the creators who developed it and to offset the data-grabbing AI companies looking to justify almost any content use for training purposes.
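The decoy trick is an old honeypot idea: put a link in the HTML that no human will ever see or click, and anything that requests it has outed itself as a bot. A minimal sketch of the mechanism, with an invented trap URL and markup:

```python
# Honeypot sketch: a link hidden from humans (via CSS) but visible to
# any bot that naively parses the HTML. Requesting the trap URL flags
# the client as a bot. The URL and markup here are invented.
TRAP_PATH = "/archive/full-text-dump"  # hypothetical trap URL

def decoy_page():
    return (
        "<html><body>"
        "<p>Welcome to our article archive.</p>"
        '<a href="%s" style="display:none">complete archive</a>' % TRAP_PATH
        + "</body></html>"
    )

flagged_bots = set()

def handle_request(client_id, path):
    if path == TRAP_PATH:
        flagged_bots.add(client_id)   # only a link-following bot lands here
        return "decoy content"        # feed it generated filler, waste its time
    return decoy_page()

handle_request("crawler-77", "/")        # fetches the page, sees the hidden link
handle_request("crawler-77", TRAP_PATH)  # follows it and gets flagged
print(flagged_bots)                      # {'crawler-77'}
```

In Cloudflare's version the filler pages are themselves AI-generated and interlinked, so a greedy crawler can burn resources wandering a maze of worthless text.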
Side note: What is Cloudflare? Cloudflare runs a reverse proxy edge network for individuals and businesses. It has a large footprint (more than 330 server locations in cities around the world) and acts as an intermediary between user web requests and site servers. This allows the site server to remain behind a firewall, with only the proxy server (Cloudflare) exposed to the internet. The proxy servers also maintain a cache of static content that allows requests to be processed quickly, even if the site server is busy. The risk is that the Cloudflare proxy becomes the front for many domains and thus a single point of failure for many sites should it go down or be compromised.
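The reverse-proxy-with-cache arrangement described above can be sketched in a few lines: the proxy sits in front of the origin server, answers repeat requests for static content from its own cache, and only forwards cache misses. The class and names are illustrative; a real edge network adds TLS termination, DDoS filtering, cache expiry, and much more.

```python
# Minimal reverse-proxy cache: the origin is only contacted on a miss.
class ReverseProxy:
    def __init__(self, origin_fetch):
        self.origin_fetch = origin_fetch  # callable that reaches the origin server
        self.cache = {}
        self.origin_hits = 0

    def get(self, path):
        if path not in self.cache:        # cache miss: forward to the origin
            self.origin_hits += 1
            self.cache[path] = self.origin_fetch(path)
        return self.cache[path]           # cache hit: origin never sees the request

proxy = ReverseProxy(lambda path: "content of " + path)
proxy.get("/index.html")
proxy.get("/index.html")   # second request served from cache
print(proxy.origin_hits)   # 1
```

This is also where the stated risk comes from: every request in this model flows through `proxy`, so if the proxy layer fails, every site behind it goes dark at once.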