Economy

Google and Meta trained AI with porn sites, says newspaper – 04/25/2023 – Tech

1 year ago Brazil

An investigation by The Washington Post showed that a Google dataset used to train artificial intelligence (AI) models, both at the company itself and at other tech giants such as Meta, contained pirated files, pornographic websites and extremist forums.

The American newspaper analyzed the dataset, called C4, which contains content from 15 million websites on the open internet.

With help from the Allen Institute for AI, the Post cross-referenced the data with information from the internet monitoring platform Similarweb to sort the links found into categories such as business, press and culture. About 5 million addresses were discarded from the dataset because they could no longer be found on the internet.

The investigation found obvious sources such as Wikipedia and the online versions of some of the world's main news outlets.

However, it also identified at least 28 sites taken down by the US Department of Justice for infringing intellectual property laws. The pirate library b-ok.org ranked 190th in importance among the 10 million references.

In addition to addresses that stored pirated material, pornographic sites and extremist forums were also used to build C4, although its developers say they applied filters to remove offensive content.

The report contacted Google by email and telephone starting on Tuesday (18), but the company had not responded to the request for information by the time this text was published. Meta was questioned on the same date and also did not comment.

The Washington Post report also found voter data from Colorado (40th place) and Florida (73rd place). These data are public, but if misused they may pose a risk to the people they describe and violate personal data protection laws in the US and Brazil.

Artificial intelligence training draws on several sources, of which C4 is one. To develop GPT-3, the technology underlying the text-generating AI ChatGPT, the startup OpenAI used 40 times more data than is available in the Google dataset.

OpenAI did not disclose the amount of data used to train GPT-4, the startup’s latest artificial intelligence model, which is backed by Microsoft. The public is in the dark about the sources used to train the most successful technology among generative AIs.

Newspapers, artists and writers have objected to the unauthorized use of their works to train artificial intelligence models. The news broadcaster CNN and The Wall Street Journal published articles arguing that copyright holders should be paid when their work is used to develop this technology.

The main source for C4 is Google Patents, Google's repository of patents filed around the world.

The dataset also stores data from 500,000 personal blogs and from funding campaigns published on crowdfunding sites such as Kickstarter and Patreon. These materials can make AI more efficient at writing advertising copy, an area in which it is already being applied.


