Facebook is improving its Automatic Alternative Text (AAT) technology to make better use of object recognition to generate descriptions of photos on demand, enabling blind and visually impaired individuals to better understand what’s on their News Feed. For context, AAT was introduced back in 2016; the new version recognizes over 1,200 concepts, more than ten times as many as before.
Each photo you post on Facebook and Instagram is evaluated by an image-analysis AI (the AAT technology) to create a caption. The result is stored as alt text, a field in an image’s metadata that describes its contents: “A dog standing in a field” or “A person playing football.” This allows visually impaired people, via screen readers, to understand the images in their News Feed. However, most people don’t bother adding these descriptions themselves, so Facebook is working on making its platforms more accessible by training its AI to do it.
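In outline, the pipeline turns a recognition model's detected concepts into an alt-text string. The sketch below is a minimal illustration of that idea; the function name, the confidence threshold, and the input format are assumptions for demonstration, not Facebook's actual API.

```python
def build_alt_text(concepts, confidence_threshold=0.8):
    """Join high-confidence detected concepts into a short alt-text phrase.

    `concepts` is a list of (label, confidence) pairs, e.g. from an
    object-recognition model. All names and thresholds here are
    illustrative assumptions, not Facebook's published method.
    """
    labels = [label for label, conf in concepts if conf >= confidence_threshold]
    if not labels:
        return ""  # leave the alt text empty rather than guess
    # Descriptions are hedged with "May be" because recognition can err.
    return "May be " + ", ".join(labels)


print(build_alt_text([("a dog standing in a field", 0.92), ("outdoors", 0.85)]))
# prints: May be a dog standing in a field, outdoors
```

Low-confidence detections are dropped rather than spoken, which matches the article's point that the system prefers saying less over saying something wrong.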
The latest iteration of AAT can detect and identify more than ten times as many concepts in a photo, which in turn means fewer photos without a description. It can now identify activities, landmarks, types of animals, and so forth. For example, a description might read, “May be a selfie of 2 people, outdoors, the Leaning Tower of Pisa.”
Facebook says it is the first in the industry to include information about the positional location and relative size of elements in a photo. For instance, instead of saying “May be a photo of 5 people,” the AI can analyze and specify that there are two people in the center of the photo and three others scattered toward the fringes, implying that the two in the center are the focus. Facebook also added that it trained the models to predict the locations and semantic labels of the objects within an image.
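Positional location and relative size can both be derived from a detector's bounding boxes. The sketch below shows one plausible way to do that; the thresholds, function names, and box format are illustrative assumptions, not Facebook's published method.

```python
def describe_position(box, image_w, image_h):
    """Classify a bounding box's horizontal position as left/center/right.

    `box` is (x, y, w, h) in pixels. The one-third split is an
    illustrative assumption, not Facebook's actual rule.
    """
    center_x = (box[0] + box[2] / 2) / image_w  # normalized horizontal center
    if center_x < 1 / 3:
        return "left"
    if center_x > 2 / 3:
        return "right"
    return "center"


def relative_size(box, image_w, image_h):
    """Fraction of the image area the box covers; a proxy for prominence."""
    return (box[2] * box[3]) / (image_w * image_h)


# Two large, centered boxes would be described as the focus of the photo;
# small boxes near the edges as scattered toward the fringes.
print(describe_position((400, 100, 200, 300), 1000, 600))  # prints: center
print(relative_size((400, 100, 200, 300), 1000, 600))      # prints: 0.1
```

Combining position buckets with area fractions is enough to produce phrases like “two people in the center of the photo,” without needing any extra model output beyond the boxes themselves.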
The company leveraged a model trained on weakly supervised data, in the form of billions of public Instagram images and their hashtags, for its latest iteration of AAT. It fine-tuned the models across all geographies and evaluated concepts along gender, skin tone, and age axes. As a result, AAT is now more accurate and more culturally and demographically inclusive. For example, it can now identify weddings around the world based (in part) on traditional apparel.
Facebook asked users who depend on screen readers how much information they wanted to hear and when they wanted to hear it. It concluded that people want more information when an image is from friends or family, and less when it isn’t.
AAT uses simple phrasing for its default description rather than a long, flowery sentence. Every description begins with “May be” because there is a margin for error, though “we’ve set the bar very high,” says the company. AAT alt-text descriptions are available in 45 languages and can be used by people around the world.