Free Open Source Dataset of 400M Image-Text Pairs for ML: LAION-400M

LAION-400M is a huge, free dataset of 400 million image-text pairs for machine learning, and you can use it to train your own AI models. The image-text pairs in this open dataset were extracted from Common Crawl web pages crawled between 2014 and 2021.

The image-text pairs in this dataset were filtered using OpenAI’s CLIP: the cosine similarity between the image embedding and the text embedding was calculated for each pair, and pairs with a similarity below 0.3 were dropped. The 0.3 threshold was determined through human evaluation, and it works as a fairly good heuristic for estimating how well an image and its text match semantically.
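To make the filtering step concrete, here is a minimal sketch of a cosine-similarity filter with a 0.3 threshold. It is not the actual LAION pipeline: the function names (`cosine_similarity`, `filter_pairs`) are hypothetical, and the tiny 3-dimensional vectors are stand-ins for real CLIP embeddings, which are much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.3):
    """Return the ids of pairs whose similarity reaches the threshold.

    `pairs` is an iterable of (pair_id, image_embedding, text_embedding).
    Pairs scoring below the threshold are dropped.
    """
    return [
        pair_id
        for pair_id, image_emb, text_emb in pairs
        if cosine_similarity(image_emb, text_emb) >= threshold
    ]


# Toy vectors standing in for real CLIP embeddings (an assumption for illustration).
toy_pairs = [
    ("match", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    ("mismatch", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
kept = filter_pairs(toy_pairs)
print(kept)
```

In this toy run only the first pair survives, because the second pair's vectors are orthogonal and score 0.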

The downloadable files for this dataset include NumPy files, which contain the CLIP embeddings. Along with those, there are Parquet files and a KNN index of the image embeddings.
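If you are wondering what a KNN index over embeddings is for, the sketch below shows the question it answers: given a query vector, which stored embeddings are the most similar? This version is a brute-force scan in plain Python, which is only an illustration. A real index over 400M embeddings uses an approximate method instead, and the function names and toy 2-dimensional vectors here are assumptions, not part of the dataset's tooling.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to `query`.

    Brute force: scores every embedding, then keeps the k best.
    """
    q = normalize(query)
    scored = []
    for index, embedding in enumerate(embeddings):
        e = normalize(embedding)
        score = sum(a * b for a, b in zip(q, e))
        scored.append((score, index))
    # nlargest returns the best-scoring entries first.
    return [index for _, index in heapq.nlargest(k, scored)]


# Toy 2-dimensional "image embeddings" (hypothetical values).
toy_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
]
nearest = top_k([1.0, 0.1], toy_embeddings, k=2)
print(nearest)
```

The query points almost along the first axis, so the first embedding ranks highest and the diagonal one comes second.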

A word of caution: even though this data has been filtered, it still contains NSFW images. Only illegal content was removed from it. The NSFW content is marked in the metadata, so if you don’t want those images in your work, don’t forget to exclude them.
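Excluding flagged entries can be as simple as filtering the metadata rows before you download or train on anything. The sketch below works on plain dictionaries so it runs without extra libraries. The column name `NSFW` and the label values `UNLIKELY`, `UNSURE` and `NSFW` are assumptions made for illustration, so check the actual metadata schema before relying on them. The helper name `drop_nsfw` is also hypothetical.

```python
def drop_nsfw(rows, column="NSFW", keep_values=("UNLIKELY",)):
    """Keep only rows whose NSFW label is in `keep_values`.

    Rows with a missing or unexpected label are dropped as well,
    which is the conservative choice.
    """
    return [row for row in rows if row.get(column) in keep_values]


# Toy metadata rows; URLs, captions and labels are made up for this example.
toy_rows = [
    {"URL": "https://example.com/a.jpg", "TEXT": "a red bicycle", "NSFW": "UNLIKELY"},
    {"URL": "https://example.com/b.jpg", "TEXT": "placeholder caption", "NSFW": "NSFW"},
    {"URL": "https://example.com/c.jpg", "TEXT": "placeholder caption", "NSFW": "UNSURE"},
    {"URL": "https://example.com/d.jpg", "TEXT": "row with no label"},
]
safe_rows = drop_nsfw(toy_rows)
print(len(safe_rows), "of", len(toy_rows), "rows kept")
```

Using an allow-list (`keep_values`) rather than a block-list means that an unlabeled or oddly labeled row is excluded by default.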

Example LAION-400M dataset

LAION-400M includes a 10 TB web dataset with images and captions, plus an additional 1 TB of CLIP embeddings. Along with these two, there are 2 small 6 GB KNN indices. There are some other technical details as well, and you can read more about them on the main website.

If you’ve worked with AI datasets before, you can use this one in the same way. First, you need a good internet connection to download the files, as they are very large. You can either download the files directly using wget, or you can download them over torrent. You can find the links on the main website.

wget -m -np -c -U "eye02" -w 2 -R "index.html*" ""
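Before starting a download of this size, it helps to estimate how long it will take. The sketch below does that arithmetic for the sizes mentioned above. The 100 Mbit/s link speed is a hypothetical figure, and the calculation assumes decimal terabytes and a connection that sustains its full speed the whole time, which real downloads rarely do.

```python
def download_days(size_tb, mbit_per_s):
    """Estimate download time in days for `size_tb` terabytes.

    Assumes 1 TB = 10**12 bytes and a constant transfer rate.
    """
    size_bits = size_tb * 1e12 * 8
    seconds = size_bits / (mbit_per_s * 1e6)
    return seconds / 86400  # seconds per day


# Sizes from the article; the link speed is an assumed example value.
image_days = download_days(10, 100)      # 10 TB of images and captions
embedding_days = download_days(1, 100)   # 1 TB of CLIP embeddings
print(f"images: {image_days:.1f} days, embeddings: {embedding_days:.1f} days")
```

At that assumed speed the image data alone takes a little over nine days, so a resumable download (the `-c` flag in the wget command) or the torrent is worth using.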

laion400m dataset Download

In the screenshot I added at the beginning, you can see some example images. There is also a simple web app that you can use to visualize this dataset. It is basically a search tool: you can search by entering some text or an image, and it will show you matching results for whichever input you use. Below, you can see both text and image search in action.

LAION-400M dataset visualization

In this way, you can download and explore this huge image-text dataset. It is really a gem of a resource for ML and AI programmers. They can leverage its power to come up with cool programs and software that deal with computer vision.

Closing thoughts:

If you are into data science, you can download the LAION-400M dataset right now and start playing with it. The data is huge, but as long as you know what you are doing, you’ll be fine. This dataset can be used to build intelligent text-to-image search engines, or even to generate images from natural-language text. The possibilities are endless.

Free/Paid: Free

