The Singapore Authors Whose Pirated Books Were Used to Train AI
The list includes Singapore figures like Lee Kuan Yew, Catherine Lim, Balli Kaur Jaswal, and many others.
Note to current subscribers: I’m moving my content to Substack, which will be my platform going forward!
The full list of 191,000 pirated ebooks used to train AI models like ChatGPT has been made public, and it includes Singapore figures like Lee Kuan Yew, Catherine Lim, Balli Kaur Jaswal, and many others.
These books were part of an enormous and popular dataset — named Books3 — which was used to train generative AI. Meta used Books3 for its Llama2 AI, which will power the company’s AI assistants.
With AI’s surge in popularity, the data practices of AI companies have come under increasing scrutiny and criticism. A string of lawsuits has been filed by authors against ChatGPT creator OpenAI, alleging copyright infringement.
And over 10,000 authors, including James Patterson and Margaret Atwood, have signed an open letter to AI companies saying that they are using authors’ work without consent, credit, or compensation.
Here is an (incomplete) list of Singapore authors whose works were used to train AI as part of the Books3 dataset, in no particular order:
Lee Kuan Yew
Memoirs of Lee Kuan Yew Slipcase Edition
Catherine Lim
Miss Seetoh in the World
Roll Out the Champagne, Singapore!
Balli Kaur Jaswal
Erotic Stories for Punjabi Widows: A Novel
The Unlikely Adventures of the Shergill Sisters: A Novel
Neon Yang
The Ascent to Godhood (The Tensorate Series Book 4)
The Black Tides of Heaven (Kindle Single) (The Tensorate Series Book 1)
The Descent of Monsters (The Tensorate Series, 3)
The Red Threads of Fortune (Kindle Single) (The Tensorate Series Book 2)
Amanda Lee Koe
Delayed Rays of a Star: A Novel
Ovidia Yu
Aunty Lee's Delights: A Singaporean Mystery (The Aunty Lee Series)
Meira Chand
The Gossamer Fly
Kevin Kwan
China Rich Girlfriend: A Novel
Crazy Rich Asians (Crazy Rich Asians Trilogy Book 1)
Locos, ricos y asiáticos (Spanish Edition)
Rich People Problems (Crazy Rich Asians Trilogy)
The Crazy Rich Asians Trilogy Box Set: Crazy Rich Asians; China Rich Girlfriend; Rich People Problems
Rachel Heng
Suicide Club: A Novel About Living
Clarissa Goenawan
Rainbirds
Jing-Jing Lee
How We Disappeared: A Novel
Cherian George
Hate Spin: The Manufacture of Religious Offense and Its Threat to Democracy (Information Policy)
Lesley-Anne Tan and Monica Lim
Danger Dan And Gadget Girl
Jeremy Tiang
It Never Rains On National Day: Stories
Cheryl Lu-Lien Tan
A Tiger in the Kitchen: A Memoir of Food and Family
Sarong Party Girls: A Novel
Suchen Christine Lim
A Bit of Earth: An Exciting Saga from the First Singapore Literature Prize Winner
Rice Bowl
Shamini Flint
Game Changer! (2) (The Susie K Files)
Note: This is not a comprehensive list. You can use this tool to search for authors whose works are in the Books3 dataset.
For context, Books3 follows a tradition of AI companies collecting (often without permission) large amounts of data necessary to train their AI. Its name was inspired by ChatGPT creator OpenAI’s Books1 and Books2 datasets — the contents of the latter remain a mystery.
And Books3 hasn’t only been used by Meta. It is a popular training dataset for AI models and has been downloaded over 500 times in the last month alone. Plus, because OpenAI no longer discloses what data it uses to train its AI, it is unclear whether it also uses Books3.
I, for one, am curious about what these authors think about their works being used to train AI.
In August, author Margaret Atwood responded in an op-ed for The Atlantic to the revelation that her books were part of the dataset, writing, “They intend to make a lot of money off the entities they have reared and fattened on my words, so they could at least buy me a coffee.”