Friday, February 7, 2025

Meta trained their AI on Pirate Shadow Libraries



According to unsealed emails released on Thursday, Meta trained its AI on pirated e-books. Last month, Meta admitted to torrenting a controversial large dataset from LibGen, which includes tens of millions of pirated books. However, details around the torrenting were murky until yesterday, when Meta’s unredacted emails were made public for the first time. The new evidence showed that Meta torrented “at least 81.7 terabytes of data across multiple shadow libraries such as Anna’s Archive, Z-Library, and LibGen.”

The emails were unsealed as part of US federal class-action lawsuits that the Joseph Saveri Law Firm filed on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

“The magnitude of Meta’s unlawful torrenting scheme is astonishing,” the authors’ filing alleged, insisting that “vastly smaller acts of data piracy—just .008 percent of the amount of copyrighted works Meta pirated—have resulted in Judges referring the conduct to the US Attorneys’ office for criminal investigation.”

Here is basically what Meta did. Staff pirated books using work laptops that were not connected to company servers. They accessed a huge trove of BitTorrent files and thought that as long as they did not seed the files, nothing was wrong with it.

Here are some of the key findings of the emails and documents.

  • This document contains admissions that Meta knew that LibGen was pirated (i.e., illegal) and expresses concern over what will happen if regulators learn that Meta is training Llama on pirated copyrighted data.
  • This document suggests Meta in-house counsel advised Meta to stop its efforts to license copyrighted works and instead utilize pirated works exclusively.
  • On a message chain, Erin Murray explains that OpenAI’s model is likely trained on Smashwords and LibGen.
  • This document shows Meta employees deciding not to use “FB [Facebook] infra[structure]” for its “data downloading” from pirated databases in order to “avoid the risk of tracing back the seeder/downloader from FB servers.”

Wrap Up

According to Google, the average Kindle e-book is about 2.6 MB in size. Meta torrented at least 81.7 terabytes of data from these shadow libraries, so that comes to roughly 31,423,076 books downloaded from torrents, assuming every file was an average-sized e-book. I find it reprehensible that Meta did not even find it viable to pay for an expanded license to officially use the books and train their AI on those. Instead, they took the easy route and did illegal things.
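
For anyone who wants to check the math, here is a rough sketch of the estimate in Python. It uses the 81.7 terabyte figure quoted from the filing and the 2.6 MB average e-book size. The assumptions are mine, not from the court documents: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and every downloaded byte belonging to an average-sized e-book, which is certainly not exactly true. The function name is just an illustrative choice.

```python
from decimal import Decimal

# Assumption: decimal (SI) units, not binary tebibytes/mebibytes.
BYTES_PER_TB = 10**12
BYTES_PER_MB = 10**6


def estimate_book_count(total_terabytes: str, avg_book_megabytes: str) -> int:
    """Estimate how many average-sized e-books fit in a given amount of data.

    Sizes are passed as strings so Decimal keeps them exact (no float rounding).
    Assumption: all of the data consists of e-books of the average size.
    """
    total_bytes = Decimal(total_terabytes) * BYTES_PER_TB
    avg_book_bytes = Decimal(avg_book_megabytes) * BYTES_PER_MB
    # Floor division: only count whole books.
    return int(total_bytes // avg_book_bytes)


if __name__ == "__main__":
    books = estimate_book_count("81.7", "2.6")
    print(f"Estimated books: {books:,}")
```

The result is only as good as the inputs. A different average file size, or a dataset that also contains papers and duplicates, would change the count considerably.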

I find it highly likely that they will not get anything more than a slap on the wrist and maybe a small fine. It all depends on how the class action goes and whether they have to pay a whole lot more. What I find interesting is that the same companies that used pirated content to train their AI are also the ones trying to shut the shadow libraries down, so nobody else can use the same data.


Michael Kozlowski is the editor-in-chief at Good e-Reader and has written about audiobooks and e-readers for the past fifteen years. Newspapers and websites such as the CBC, CNET, Engadget, Huffington Post and the New York Times have picked up his articles. He lives in Vancouver, British Columbia, Canada.
