- Jan 8, 2011
HaveIBeenTrained uses CLIP retrieval to search the LAION-5B and LAION-400M image datasets. These are currently the largest public text-to-image datasets, and they are used to train models such as Stable Diffusion and Imagen, among many others.
Text-to-image datasets are typically shared as files that resemble enormous spreadsheets. Their main columns are:
- a link to an image on the internet
- a caption that describes that image, like:
"Platform mp3 Album by Holly Herndon"

When it's time to train a generative AI system, organizations like Stability use those datasets to download the images from their links and present them to the model alongside their captions.
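Concretely, a few rows of such a dataset can be pictured like this. This is a minimal sketch using invented URLs; real LAION rows also carry extra columns (e.g. a similarity score and image dimensions):

```python
import csv
import io

# A toy slice of a URL/caption dataset, in the spreadsheet-like layout
# described above. The URLs are made up for illustration; the first
# caption is the example from this post.
raw = """url\tcaption
https://example.com/album-cover.jpg\tPlatform mp3 Album by Holly Herndon
https://example.com/cat.png\tA tabby cat sleeping on a windowsill
"""

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
for row in rows:
    print(row["url"], "->", row["caption"])
```

A trainer walks rows like these, downloads each `url`, and feeds the image to the model together with its `caption`.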
With HaveIBeenTrained, artists can search these databases for links to their work and flag them for removal.
We're incorporating new datasets as they are released, and we're also partnering with other organizations that collect and use image links, so HaveIBeenTrained can serve as a one-time opt-out tool that applies to every dataset used to train generative AI art tools.
Our solution builds upon retrieval tools [1,2,3] created by the LAION community that enable efficient search through very large collections of image-text pairs, based on kNN indices precomputed using CLIP models pretrained by OpenAI and LAION.
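The idea behind that retrieval can be sketched in a few lines. In production the linear scan below is replaced by a precomputed approximate kNN index (e.g. built with faiss), and the random vectors stand in for actual CLIP embeddings of the dataset's images and of the query:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings: a real system embeds every image with a
# CLIP image encoder and the query (text or image) with the matching
# encoder. Random vectors are used here purely for illustration.
image_embeddings = rng.normal(size=(10_000, 512)).astype(np.float32)
query = rng.normal(size=512).astype(np.float32)

# Normalize so a dot product equals cosine similarity, the metric
# CLIP retrieval typically ranks by.
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute-force kNN: score every image against the query, keep the k best.
k = 5
scores = image_embeddings @ query
top_k = np.argsort(-scores)[:k]
print(top_k)
```

Each returned index maps back to a row of the URL/caption table, which is how a search result can be traced to the original link and flagged.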
Related thread: AI training site stole his photos, then sued when he complained: Robert Kneschke's story