Google develops automated dataset describing Wikipedia images


Google Research has created an automatically constructed dataset that describes images in many languages in relative detail. To build it, Google combined images and articles from Wikipedia with machine learning. The dataset should be especially useful for research.

People who research how images and text relate to each other across languages use datasets consisting of images paired with descriptions, according to Google Research. Such datasets can be captioned manually, which produces high-quality descriptions but takes a long time.

The descriptions can also be generated automatically, but current techniques require heuristics and a lot of filtering to guarantee data quality. In addition, such datasets are scarce for languages other than English. Google Research therefore asked whether an automated process could create datasets in many languages in which the descriptions are of high quality, the descriptions are numerous, and many different types of images are covered.

The result is WIT, short for Wikipedia-based Image Text dataset. It is built from Wikipedia pages and Wikimedia images using machine learning. To produce a description of an image, the pipeline draws on a page's description, title, image caption, and metadata.
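To make the idea concrete, a single WIT entry might bundle the textual fields mentioned above. This is only an illustrative sketch; the field names are assumptions for this example, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class WITRecord:
    """One image/text pair, sketched from the article; field names are assumed."""
    image_url: str
    language: str
    page_title: str
    page_description: str
    caption: str
    metadata: dict = field(default_factory=dict)

    def candidate_texts(self) -> list:
        """All non-empty text fields that could help describe the image."""
        return [t for t in (self.page_title, self.page_description, self.caption) if t]

# Illustrative example record (contents invented for the sketch)
record = WITRecord(
    image_url="https://upload.wikimedia.org/example.jpg",
    language="en",
    page_title="Half Dome",
    page_description="A granite dome in Yosemite National Park.",
    caption="Half Dome seen from Glacier Point.",
)
print(record.candidate_texts())
```

The point of grouping the fields this way is that a model can fall back on the page title or description when an image has no caption in a given language.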

In addition, Google Research applies filters to improve the quality of the descriptions. For example, the pipeline removes generic filler text to prevent descriptions from becoming unnecessarily long. The filters also check the licenses of the images used and exclude hateful images, so that the dataset is suitable for research.
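A minimal sketch of such a filtering step might look as follows. The filler phrases and the license allow-list here are invented for illustration and do not reflect Google's actual filters:

```python
# Hypothetical filler phrases and allowed licenses; Google's real filters differ.
FILLER_PREFIXES = ("This is an image of", "Photograph of")
ALLOWED_LICENSES = {"CC0", "CC BY", "CC BY-SA"}

def clean_caption(caption: str) -> str:
    """Strip generic filler prefixes so captions stay concise."""
    for prefix in FILLER_PREFIXES:
        if caption.startswith(prefix):
            caption = caption[len(prefix):].lstrip(" :,")
    return caption

def keep_example(caption: str, license_tag: str) -> bool:
    """Keep only examples with a non-empty caption and an allow-listed license."""
    return license_tag in ALLOWED_LICENSES and bool(clean_caption(caption))

print(clean_caption("This is an image of Half Dome at sunset."))
print(keep_example("Half Dome at sunset.", "CC BY-SA"))
```

Running the cleaner before the license check keeps captions that are *only* filler from slipping through as empty strings.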

Ultimately, the process produced 37.5 million captions for 11.5 million unique images across 108 languages; the number of languages with a description varies per image. Over a million images have descriptions in at least six languages. Human editors reportedly judged that the text matched the image well in 98 percent of the samples.

Google Research hopes that the dataset will enable better research into multimodal, multilingual models and lead to better learning and representation techniques.

The first image is an example of descriptions written by WIT; the three remaining images show the process WIT uses.
