Many consumers are uncomfortable with the thought that their data is being used to train artificial intelligence.
As companies race to develop and train increasingly large generative AI models, developers have been using as much of the searchable internet as they can.
This includes some of the public data people have shared online, and possibly some private data as well.
According to a recent interview published in Scientific American, training a generative AI model requires a massive amount of human-created text and images. As companies compete to build the largest models available at any given time, it is becoming increasingly clear that while much of that data is public, some of it comes from copyrighted sources. This has led to a wave of lawsuits from artists challenging the way artificial intelligence developers use their work.
That said, it isn’t only visual artists and published authors who are affected by this trend in artificial intelligence training. Developers rely on their own web crawlers and web scrapers to obtain the data needed to train their models, because the internet as a whole cannot simply be downloaded. It must be crawled or scraped instead.
These tools are readily available to any generative AI developer.
The crawlers and scrapers move through a massive number of URLs and download what they find, as the sketch below illustrates. In OpenAI’s case, Common Crawl, an openly accessible repository of web crawl data, supplied training data for at least one version of the large language model powering the ChatGPT chatbot.
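To make the idea concrete, here is a toy sketch in Python of the loop a crawler performs: fetch a page, save its text, collect its links, repeat. It is a minimal illustration under stated assumptions, not how production crawlers work; politeness delays, robots.txt checks, and large-scale deduplication are all omitted, and the third-party requests and beautifulsoup4 packages, along with the seed URL, are placeholders chosen for this example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: download a page, harvest its links, repeat."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # dead links and timeouts are routine at crawl scale
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # only HTML pages yield text and further links
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text()  # the scraped text a model might train on
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


print(f"downloaded {len(crawl('https://example.com'))} pages")
```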
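Common Crawl’s archives are openly downloadable, which is part of why they are so widely used. As a minimal sketch, again assuming the requests package and using “CC-MAIN-2024-10” as a stand-in snapshot ID (Common Crawl publishes a new snapshot, and a new index, every few weeks), the following queries the public index for captures of a page and fetches just the matching byte range from the archive:

```python
import json

import requests

# One crawl snapshot's index endpoint; the snapshot ID here is an example.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

# Ask the index where captures of a URL are stored inside the crawl archives.
resp = requests.get(
    INDEX, params={"url": "example.com", "output": "json"}, timeout=30
)
records = [json.loads(line) for line in resp.text.strip().splitlines()]

# Each record points at a byte range inside a WARC archive file; fetching just
# that range returns the gzip-compressed record for the captured page.
rec = records[0]
start = int(rec["offset"])
end = start + int(rec["length"]) - 1
warc = requests.get(
    "https://data.commoncrawl.org/" + rec["filename"],
    headers={"Range": f"bytes={start}-{end}"},
    timeout=30,
)
print(rec["url"], rec["status"], f"{len(warc.content)} bytes of WARC data")
```

The record that comes back is the raw page exactly as the crawler saw it, which is the form in which such data typically enters a training pipeline.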
Even so, these crawlers, and the generative AI model trainers behind them, don’t typically disclose details about their training processes or the data used in them. OpenAI’s own paper on GPT-4 states that “given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture, hardware, training compute, dataset construction, training method, or similar.”