Google and Meta trained AI with porn sites, says newspaper – 04/25/2023 – Tech

An investigation by The Washington Post showed that a Google dataset used to train artificial intelligence (AI) models, both at the company itself and at other tech giants such as Meta, contained content from pirate sites, pornographic websites and extremist forums.

The American newspaper analyzed the dataset, called C4, which contains content from 15 million websites on the open internet.

With help from the Allen Institute for AI, the Post cross-referenced the data with information from the internet monitoring platform Similarweb to sort the websites it found into categories such as business, press and culture. About 5 million addresses were discarded from the dataset because they were no longer listed on the internet.

The investigation found expected sources such as Wikipedia and the online editions of some of the world's major news outlets.

However, it also identified at least 28 sites taken down by the US Department of Justice for infringing intellectual property laws. The pirate library b-ok.org ranked 190th by importance among the 10 million sites analyzed.

In addition to sites that hosted pirated material, pornographic sites and extremist forums were also used to build C4, although its developers say they applied filters to remove offensive content.

The reporting team has been contacting Google by email and telephone since Tuesday (18), but the company had not responded to the request for information by the time this text was published. Meta was also contacted on the same date and did not comment.

The Washington Post report even found voter data from Colorado (40th place) and Florida (73rd place). This data is public, but if misused it may put the people it describes at risk and violate personal data protection laws in the US and Brazil.

Artificial intelligence training draws on several sources like C4. To develop GPT-3, a foundational technology behind the text-generating AI ChatGPT, the startup OpenAI used 40 times more data than the Google dataset contains.

OpenAI, which is backed by Microsoft, has not disclosed the amount of data used to train GPT-4, its latest artificial intelligence model. The public remains in the dark about the sources used to train the most successful generative AI technology.

Newspapers, artists and writers have objected to the unauthorized use of their works to train artificial intelligence models. News broadcaster CNN and The Wall Street Journal have published articles arguing that copyright holders should be paid when their work is used to develop this technology.

The main source for C4 is Google Patents, Google's repository of patents filed around the world.

The dataset also includes data from 500,000 personal blogs and from funding campaigns published on crowdfunding sites such as Kickstarter and Patreon. This material can make AI more effective at writing advertising copy, an area in which it is already being applied.
