A collaboration between MIT, Microsoft, and Adobe has resulted in an algorithm that reconstructs an audio signal using the subtle vibrations of various objects depicted in video. Notably, in one experiment, the team could reproduce recognizable speech from the vibrations of a potato chip bag, an impressive feat accomplished even from 15 feet away through soundproof glass.
Further experiments extracted usable audio from videos of other objects, including aluminum foil, the surface of a glass of water, and plant leaves. The team’s findings are due to be presented at SIGGRAPH, a leading computer graphics conference.
Abe Davis, a graduate student in electrical engineering and computer science at MIT and first author of the paper, explains that sound hitting an object makes it vibrate, producing a very subtle visual signal that is usually invisible to the naked eye. Davis’s co-authors span academia and industry: MIT professors Frédo Durand and Bill Freeman, graduate student Neal Wadhwa, Michael Rubinstein of Microsoft Research, and Gautham Mysore of Adobe Research.
Reconstructing audio requires a video frame rate higher than the frequency of the sound being recovered, so most of the experiments used a high-speed camera running well above the 60 fps that some smartphones can capture, though still far below the rates of the fastest commercial high-speed cameras. Even so, the team was able to extract information about high-frequency vibrations from ordinary 60 fps video. The resulting audio is less faithful than what the high-speed footage yields, but it was still good enough to identify a speaker’s gender within a room, or the number of people speaking.
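To make the frame-rate constraint concrete, here is a minimal sketch (illustrative only, not the team’s code) applying the Nyquist criterion: frame-by-frame sampling can directly capture vibrations only up to half the capture rate, which is why plain 60 fps footage covers just a sliver of the speech band and the rolling-shutter trick described below becomes useful. The frame rates in the loop are example values, not figures from the paper.

```python
# Minimal illustration of the sampling limit behind the frame-rate requirement.
# Not the researchers' code; the frame rates below are example values only.

def max_direct_frequency_hz(frame_rate_fps: float) -> float:
    """Highest vibration frequency that frame-by-frame sampling can capture
    directly (the Nyquist limit: half the sampling rate)."""
    return frame_rate_fps / 2.0

for fps in (60, 2400, 6000):  # a smartphone rate vs. example high-speed rates
    print(f"{fps:>5} fps -> vibrations up to ~{max_direct_frequency_hz(fps):,.0f} Hz")

# 60 fps alone tops out around 30 Hz, far below most speech energy, which is
# why exploiting rolling-shutter distortions (described below) matters.
```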
Davis is particularly excited about the potential of this “new kind of imaging” that recovers sound from objects, offering insight not only into the surrounding sounds but also into the objects themselves. His team is now exploring whether an object’s material and structural properties can be inferred from how it responds to short bursts of sound.
The team’s algorithm combines the outputs of many image filters to infer an object’s overall motion as sound waves strike it, even when different edges of the object move in different directions. The researchers also designed a variant of the algorithm for conventional video: it exploits the rolling-shutter distortions of inexpensive sensors, which record each frame one row at a time rather than all at once, to recover information about vibrations faster than the frame rate. That data, too, can be turned into a usable audio signal.
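As a rough illustration of the combination step, the sketch below (a simplified stand-in for the published method, which operates on the outputs of a steerable filter bank) averages many noisy per-patch motion signals, weighted by how reliably each patch can be tracked, into a single one-dimensional signal, then strips slow drift. The recover_audio function and its toy inputs are hypothetical.

```python
# Conceptual sketch only -- not the published pipeline. It shows the core idea:
# many tiny, noisy local motion measurements, averaged together, yield one
# usable global vibration (audio-like) signal.

import numpy as np

def recover_audio(local_motion: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """local_motion: (num_patches, num_frames) displacement signals, one per
    tracked image patch; weights: per-patch reliability (e.g., local contrast).
    Returns a single 1-D global motion signal."""
    # Center each patch signal so static offsets cancel out.
    centered = local_motion - local_motion.mean(axis=1, keepdims=True)
    # Weighted average: edges moving in different directions contribute
    # according to how reliably they can be tracked.
    global_motion = (weights[:, None] * centered).sum(axis=0) / weights.sum()
    # Remove slow drift (camera shake, lighting changes) with a crude
    # moving-average high-pass step.
    return global_motion - np.convolve(global_motion, np.ones(15) / 15, mode="same")

# Toy usage: 200 patches observing a 440 Hz vibration, sampled at 2,400 fps.
fps, n_frames = 2400, 4800
t = np.arange(n_frames) / fps
patches = 0.01 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(200, n_frames)
audio = recover_audio(patches, weights=np.ones(200))
print(audio.shape)  # (4800,) -- two seconds of recovered "audio" at 2,400 Hz
```

In the actual system the weighting and filtering are considerably more sophisticated, but the averaging intuition is the same: individually, each patch’s motion is far too noisy to hear anything; combined, they reveal the vibration.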
Alexei Efros, an associate professor at the University of California at Berkeley, praises the work, likening it to something out of a Hollywood thriller. He also suggests that applications of the technology may not yet have been imagined, noting that this kind of breakthrough often sets off a chain of further scientific discoveries.