
Researchers Create A Brain Implant For Near-Real-Time Speech Synthesis

Brain-to-speech interfaces have been promising to help paralyzed individuals communicate for years. Unfortunately, many systems have had significant latency that has left them lacking somewhat in the practicality stakes.

A team of researchers across UC Berkeley and UC San Francisco has been working on the problem and made significant strides forward in capability. A new system developed by the team offers near-real-time speech—capturing brain signals and synthesizing intelligible audio faster than ever before.

New Capability

The aim of the work was to create more naturalistic speech using a brain implant and voice synthesizer. While this technology has been pursued previously, it faced serious issues around latency, with delays of around eight seconds to decode signals and produce an audible sentence. New techniques had to be developed to speed up the process and slash the delay between a user trying to “speak” and the hardware outputting the synthesized voice.

The implant developed by the researchers is used to sample data from the speech sensorimotor cortex of the brain—the area that controls the mechanical hardware that makes speech: the face, vocal cords, and all the other associated body parts that help us vocalize. The implant captures signals via an electrode array surgically implanted into the brain itself. The data captured by the implant is then passed to an AI model which figures out how to turn that signal into the right audio output to create speech. “We are essentially intercepting signals where the thought is translated into articulation and in the middle of that motor control,” said Cheol Jun Cho, a Ph.D. student at UC Berkeley. “So what we’re decoding is after a thought has happened, after we’ve decided what to say, after we’ve decided what words to use, and how to move our vocal-tract muscles.”

The AI model had to be trained to perform this role. This was achieved by having a subject, Ann, look at prompts and attempt to “speak” the phrases. Ann suffered paralysis after a stroke, which left her unable to speak. However, when she attempts to speak, the relevant regions of her brain still light up with activity, and sampling this enabled the AI to correlate certain brain activity with intended speech. Unfortunately, since Ann could no longer vocalize herself, there was no target audio for the AI to correlate the brain data with. Instead, researchers used a text-to-speech system to generate simulated target audio for the AI to match with the brain data during training. “We also used Ann’s pre-injury voice, so when we decode the output, it sounds more like her,” explains Cho. A recording of Ann speaking at her wedding provided source material to help personalize the speech synthesis to sound more like her original speaking voice.
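
The article doesn’t detail the model itself, but the training idea is compact: pair the neural recordings from each silent speaking attempt with the synthetic target audio and minimize the difference. Here’s a minimal sketch of that loop; the recurrent decoder, the 253-channel input, and the pre-aligned 80-bin mel-spectrogram targets are all assumptions for illustration, not the study’s actual choices.

```python
# Sketch only: learn to map silent-attempt neural signals to the
# TTS-generated target features. All shapes and models are assumptions.
import torch
import torch.nn as nn

decoder = nn.GRU(input_size=253, hidden_size=256, batch_first=True)  # assumed channel count
head = nn.Linear(256, 80)          # assumed 80-bin mel-spectrogram target
opt = torch.optim.Adam(list(decoder.parameters()) + list(head.parameters()))

def train_step(neural, tts_mel):
    """neural: (B, T, 253) brain signals; tts_mel: (B, T, 80) TTS target,
    assumed already time-aligned with the neural frames."""
    out, _ = decoder(neural)
    loss = nn.functional.l1_loss(head(out), tts_mel)  # match the synthetic audio features
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy batch: 4 attempts, 100 frames each
print(train_step(torch.randn(4, 100, 253), torch.randn(4, 100, 80)))
```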

To measure performance of the new system, the team compared the time it took the system to generate speech to the first indications of speech intent in Ann’s brain signals. “We can see relative to that intent signal, within one second, we are getting the first sound out,” said Gopala Anumanchipalli, one of the researchers involved in the study. “And the device can continuously decode speech, so Ann can keep speaking without interruption.” Crucially, too, this speedier method didn’t compromise accuracy—in this regard, it decoded just as well as previous slower systems.

Pictured is Ann using the system to speak in near-real-time. The system also features a video avatar. Credit: UC Berkeley

The decoding system works in a continuous fashion—rather than waiting for a whole sentence, it processes in small 80-millisecond chunks and synthesizes on the fly. The algorithms used to decode the signals were not dissimilar from those used by smart assistants like Siri and Alexa, Anumanchipalli explains. “Using a similar type of algorithm, we found that we could decode neural data and, for the first time, enable near-synchronous voice streaming,” he says. “The result is more naturalistic, fluent speech synthesis.”
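
To make the contrast with sentence-at-a-time decoding concrete, here is a minimal sketch of that chunked loop in Python. Only the 80-millisecond window comes from the article; the channel count, sample rate, and the stand-in linear “decoder” are placeholder assumptions.

```python
# Streaming sketch: emit audio per 80 ms window instead of buffering a sentence.
import numpy as np

SAMPLE_RATE = 16_000                      # assumed synthesizer rate
CHUNK = int(SAMPLE_RATE * 0.080)          # samples per 80 ms window (from the article)
N_CHANNELS = 253                          # electrode count: an assumption

rng = np.random.default_rng(0)
W = rng.standard_normal((N_CHANNELS, CHUNK)) * 0.01   # stand-in "decoder" weights

def stream_decode(neural_chunks, play):
    for feats in neural_chunks:           # feats: (N_CHANNELS,) per 80 ms window
        audio = np.tanh(feats @ W)        # stand-in decode: features -> waveform
        play(audio)                       # hand audio off immediately

# demo: five windows of fake neural data, "played" by printing stats
fake = (rng.standard_normal(N_CHANNELS) for _ in range(5))
stream_decode(fake, lambda a: print(f"emitted {a.size} samples, rms={a.std():.3f}"))
```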

It was also key to determine whether the AI model was genuinely communicating what Ann was trying to say. To investigate this, Ann was asked to try to vocalize words outside the original training data set—things like the NATO phonetic alphabet, for example. “We wanted to see if we could generalize to the unseen words and really decode Ann’s patterns of speaking,” said Anumanchipalli. “We found that our model does this well, which shows that it is indeed learning the building blocks of sound or voice.”

For now, this is still groundbreaking research—it’s at the cutting edge of machine learning and brain-computer interfaces. Indeed, it’s the former that seems to be making a huge difference to the latter, with neural networks seemingly the perfect solution for decoding the minute details of what’s happening with our brainwaves. Still, it shows us just what could be possible down the line as the distance between us and our computers continues to get ever smaller.

Featured image: A researcher connects the brain implant to the supporting hardware of the voice synthesis system. Credit: UC Berkeley

Hackaday Links: April 27, 2025


Looks like The Simpsons had it right again, now that an Australian radio station has been caught using an AI-generated DJ for its midday slot. Station CADA, a Sydney-based broadcaster that’s part of the Australian Radio Network, revealed that “Workdays with Thy” isn’t actually hosted by a person; rather, “Thy” is a generative AI text-to-speech system that has been on the air since November. An actual employee of the ARN finance department was used for Thy’s voice model and her headshot, which adds a bit to the creepy factor.

The discovery that they’ve been listening to a bot for months apparently has Thy’s fans in an uproar, although we suspect that the media doing the reporting is probably more exercised about this than the general public. Radio stations have used robo-jocks for the midday slot for ages, albeit using actual human DJs to record patter to play between tunes and commercials. Anyone paying attention over the last few years probably shouldn’t be surprised by this development, and we suspect similar disclosures will be forthcoming across the industry now that the cat’s out of the bag.

Also from the world of robotics, albeit the hardware kind, is this excellent essay from Brian Potter over at Construction Physics about the sad state of manual dexterity in humanoid robots. The whole article is worth reading, not least for the link to a rogue’s gallery of the current crop of humanoid robots, but briefly, the essay contends that while humanoid robots do a pretty good job of navigating in the world, their ability to do even the simplest tasks is somewhat wanting.

Brian’s example is apt: unwrapping and applying a Band-Aid, a task that any toddler can handle, is unimaginably difficult for any current robot. He attributes the gap in abilities between gross movements and fine motor control partly to hardware and partly to software. We think the blame skews more to the hardware side; while the legs and torso of the typical humanoid robot offer a lot of real estate for powerful actuators, squeezing that much equipment into a hand approximately the size of a human’s is a tall order. These problems will likely be overcome, of course, and when they are, Brian’s helpful list of “Dexterity Evals” or something similar will act as a sort of Turing test for robot dexterity. Although the day a humanoid robot can start a new roll of toilet paper without tearing the first sheet is the day we head for the woods.

We recently did a story on the use of nitrogen-vacancy diamonds as magnetic sensors, which we found really exciting because it’s about the simplest way we’ve seen to play with quantum physics at home. After that story ran, eagle-eyed reader Kealan noticed that Brian over at the “Real Engineering” channel on YouTube had recently run a video on anti-submarine warfare, which includes the use of similar quantum magnetometers to detect submarines. The magnetometers in the video are based on the Zeeman effect and use laser-pumped helium atoms to detect tiny variations in the Earth’s magnetic field due to large ferrous objects like submarines. Pretty cool video; check it out.

And finally, if you have the slightest interest in civil engineering you’ve got to check out Animagraffs’ recent 3D tour of the insides of Hoover Dam. If you thought a dam was just a big, boring block of concrete dumped in the middle of a river, think again. The video is incredibly detailed and starts with accurate 3D models of Black Canyon before the dam was built. Every single detail of the dam is shown, with the “X-ray views” of the dam with the surrounding rock taken away being our favorite bit — reminds us a bit of the book Underground by David Macaulay. But at the end of the day, it’s the enormity of Hoover Dam that really comes across in this video. The way that the structure dwarfs the human-for-scale included in almost every sequence is hard to express — megalophobics, beware. We were also floored by just how much machinery is buried in all that concrete. Sure, we knew about the generators, but the gates on the intake towers and the way the spillways work were news to us. Highly recommended.

“Glasses” That Transcribe Text To Audio

Glasses for the blind might sound like an odd idea, given the traditional purpose of glasses and the issue of vision impairment. However, eighth-grade student [Akhil Nagori] built these glasses with an alternate purpose in mind. They’re not really for seeing. Instead, they’re outfitted with hardware to capture text and read it aloud.

Yes, we’re talking about real-time text-to-audio transcription, built into a head-worn format. The hardware is pretty straightforward: a Raspberry Pi Zero 2W runs off a battery and is outfitted with the usual first-party camera. The camera is mounted on a set of eyeglass frames so that it points at whatever the wearer might be “looking” at. At the push of a button, the camera captures an image, and then passes it to an API which does the optical character recognition. The text can then be passed to a speech synthesizer so it can be read aloud to the wearer.
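
As a sketch of how those pieces might hang together on the Pi: the article doesn’t name the OCR API, so pytesseract stands in for it below, espeak stands in for the speech synthesizer, and the button wiring on GPIO 17 is an arbitrary choice.

```python
# Capture -> OCR -> speech loop, with stand-ins for the project's unnamed OCR API.
import subprocess

import pytesseract                    # local OCR stand-in for the project's API
from PIL import Image
from gpiozero import Button           # push button on a GPIO pin (pin is arbitrary)
from picamera2 import Picamera2       # first-party Raspberry Pi camera stack

button = Button(17)
cam = Picamera2()
cam.start()

while True:
    button.wait_for_press()           # block until the wearer taps the button
    cam.capture_file("frame.jpg")     # grab whatever the glasses are pointed at
    text = pytesseract.image_to_string(Image.open("frame.jpg")).strip()
    if text:
        subprocess.run(["espeak", text])  # read the recognized text aloud
```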

It’s funny to think about how advanced this project really is. Jump back to the dawn of the microcomputer era, and such a device would have been a total flight of fancy—something a researcher might make a PhD and career out of. Indeed, OCR and speech synthesis alone were challenge enough. Today, you can stand on the shoulders of giants and include such mighty capability in a homebrewed device that cost less than $50 to assemble. It’s a neat project, too, and one that we’re sure taught [Akhil] many valuable skills along the way.

Pipio

Pipio is an AI video production tool that aims to simplify the creation of professional-quality videos through a stylish and easy-to-use interface. The tool offers a diverse selection of over 100 realistic virtual spokespeople that can be customized to suit your needs, speaking in more than 40 languages with different accents. Pipio’s key features include […]

Source

AI Celebrity Voice Generator

Arting.AI’s Celebrity Voice Changer is a free online voice generation tool that allows you to create high-quality voice clips mimicking various celebrities, characters, and public figures. Simply select from a wide range of popular voice models, input your desired text (or upload audio or a video link), and the AI will generate the new voice […]

Source

Wondershare Virbo

Wondershare Virbo’s advanced AI technology enables you to create the most realistic and personalized AI avatar video content with diverse nationalities and languages. You can create professional AI spokesperson videos just by typing and clicking with Wondershare Virbo. Virbo’s 150+ realistic AI avatars can be your engaging spokesperson, talking in 120+ languages with diverse accents […]

Source

Play.ht

Play.ht is an AI text-to-speech tool with some advanced features for those looking for high-quality voice files in both MP3 and WAV format. They’ve recently added a voice cloning feature which lets you record your own voice, input it into the AI, and generate text-to-speech with your own voice! Play.ht runs on their own […]

Source

MyVocal.ai

MyVocal.ai is an AI tool that offers voice cloning, text-to-speech, and custom music creation. You can upload your own voice data for training, without limitations on dialogue content. Then you can use your cloned voice to sing songs with the custom music feature! MyVocal.ai uses emotion recognition technology that can automatically detect the emotional content of […]

Source

Speechify AI Studio

Speechify’s AI Studio is a powerful suite of generative AI tools designed to create quality AI voice overs and videos. One of its most impressive features is the text-to-speech generator that can convert any text into natural-sounding speech across over 50 languages and accents. The voices sound incredibly human-like and can also be customized for […]

Source

Bark

Bark is an open-source text-to-audio generator that can create realistic-sounding speech, music, and sound effects from text prompts. It supports multiple languages and can match different voices and accents. Bark is built using transformer models and generates audio directly from text; it can even generate different pronunciations and accents. There is also a Hugging […]
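
Since Bark is open source, trying it takes only a few lines. This follows the quick-start shape from the suno-ai/bark README; the prompt text is just an example.

```python
# Bark quick-start: generate speech from a text prompt and save it as WAV.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()                               # downloads and caches the models
audio = generate_audio("Hello! [laughs] This is Bark speaking.")
write_wav("bark_out.wav", SAMPLE_RATE, audio)  # mono waveform at Bark's sample rate
```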

Source

Unreal Speech

Unreal Speech is a text-to-speech API that offers natural-sounding voices, low latency, high uptime, and pricing that scales as your usage grows. You can start with 1 million free characters per month, then take advantage of volume discounts at higher tiers. Unreal Speech has an easy-to-integrate API with client libraries for many languages. […]
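
As an illustration of what “easy to integrate” usually amounts to for a hosted TTS API, here is a generic HTTP sketch. The endpoint URL, field names, and voice id are placeholders, not Unreal Speech’s documented parameters; check their docs for the real ones.

```python
# Hypothetical TTS API call: every endpoint and field below is a placeholder.
import requests

resp = requests.post(
    "https://api.example-tts.invalid/v1/speech",      # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"}, # typical bearer-token auth
    json={"text": "Hello from a hosted TTS API.", "voice": "some-voice-id"},
    timeout=30,
)
resp.raise_for_status()
with open("out.mp3", "wb") as f:
    f.write(resp.content)                             # save the returned audio
```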

Source

LOVO

LOVO is an award-winning AI voice generator and text-to-speech tool that features more than 500 voices in over 100 languages. LOVO enables creators to produce captivating marketing and training videos quickly and effortlessly. When using the text-to-speech features, you can set the tone and emotion of the speaker to fully match your intended audience.

Source

FreeTTS

FreeTTS is a powerful and versatile Text-to-Speech tool that offers developers an easy and efficient way to incorporate speech synthesis into their applications. Its open-source nature also ensures that it constantly evolves and improves, making it an excellent choice for those looking to add high-quality speech via text. The tool currently offers TTS voice from […]

Source

MusicStar AI

MusicStar.AI is designed for anyone, regardless of musical talent, who wants to make professional-sounding music. MusicStar.AI provides the tools you need, whether you’re a music professional working on your next hit or a music fan wishing to create music like your favorite artist. The tool features a lyrics editor, which lets you easily write and […]

Source

Deepbrain AI

Deepbrain AI offers a solution to create realistic AI-generated videos using only text. The Text-to-Speech feature allows for quick and easy video creation with support for 80+ languages; the tool also converts PPT to MP4. Some of Deepbrain’s features, such as AI Humans and AI Kiosk, are virtual employees that use natural language processing to […]

Source

Audiosonic

Audiosonic is an AI-powered text-to-speech tool that converts text into realistic, human-like audio. It uses advanced voice synthesis technology to generate high-quality voices with natural inflections and intonations. The platform offers quite a large library of human-like voices with different accents, and you’re also able to customize things like the voice speed and tone.

Source

Coqui

Coqui is an AI text-to-speech tool that allows you to quickly create professional-quality voiceovers using pre-made voices. You can also clone a voice to perfectly match your own tone and style. Coqui gives you control over the enunciation, emotion, pitch, and other aspects of your voice, making it even easier to bring your scripts […]

Source
