
Human Brains Can Tell Deepfake Voices from Real Ones

By: Maya Posch
19 June 2024 at 05:00

Although it’s generally accepted that synthesized voices which mimic real people (so-called ‘deepfakes’) can be pretty convincing, what does our brain really think of these mimicry attempts? To answer this question, researchers at the University of Zurich put a number of volunteers into fMRI scanners, allowing them to observe how their brains would react to real and synthesized voices. The somewhat surprising finding is that the human brain responds differently in two regions depending on whether it’s hearing a real or a fake voice, meaning that on some level we are aware that we are listening to a deepfake.

The detailed findings by [Claudia Roswandowitz] and colleagues are published in Communications Biology. For the study, 25 volunteers were asked to accept or reject the voice samples they heard as being natural or synthesized, as well as to match each sample to the supposed speaker’s identity. The natural voices came from four male (German) speakers, whose voices were also used to train the synthesis model. Not only did identity-matching performance crater with the synthesized voices, but the resulting fMRI scans also showed very different brain activity depending on whether the voice was natural or synthesized.

One of these regions was the auditory cortex, which indicates that there were acoustic differences between the natural and fake voices; the other was the nucleus accumbens (NAcc). This part of the basal forebrain is involved in the cognitive processing of motivation, reward, and reinforcement learning, and plays a key role in social, maternal, and addictive behavior. Overall, the deepfake voices are characterized by acoustic imperfections and do not elicit the same sense of recognition (and thus reward sensation) that natural voices do.
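Since one of the telltale regions was the auditory cortex, the difference ultimately comes down to measurable acoustic imperfections. As a purely illustrative aside (not something from the paper), here is a minimal Python sketch of how one might quantify the spectral gap between a natural recording and a cloned version of it using time-averaged MFCC fingerprints; the file names are hypothetical, and the librosa dependency is an assumption.

```python
# Illustrative sketch only: compare a natural clip against a synthesized
# clone by reducing each to a time-averaged MFCC vector and measuring the
# distance between them. File paths are placeholders.
import numpy as np
import librosa

def mfcc_fingerprint(path, sr=16000, n_mfcc=13):
    """Load a clip and reduce it to a time-averaged MFCC vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)

def spectral_distance(path_a, path_b):
    """Euclidean distance between two fingerprints; larger means more dissimilar."""
    a = mfcc_fingerprint(path_a)
    b = mfcc_fingerprint(path_b)
    return float(np.linalg.norm(a - b))

# Hypothetical file names: the same sentence spoken naturally and cloned.
print(spectral_distance("speaker1_natural.wav", "speaker1_deepfake.wav"))
```

A crude measure like this obviously says nothing about what the NAcc is doing, but it gives a feel for the kind of acoustic gap that the auditory cortex appears to pick up on.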

Until deepfake voices can be made much better, it would seem that we are still safe.

EMO: Alibaba’s Diffusion Model-Based Talking Portrait Generator

By: Maya Posch
10 June 2024 at 23:00

Alibaba’s EMO (or Emote Portrait Alive) framework is a recent entry in a series of attempts to generate a talking head using existing audio (spoken word or vocal audio) and a reference portrait image as inputs. At its core is a diffusion model trained on 250 hours of video footage and over 150 million images. Unlike previous attempts, it adds what the researchers call a speed controller and a face region controller, which serve to stabilize the generated frames, along with an additional module that keeps the diffusion model from producing frames that drift too far from the reference image used as input.
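To make the general idea concrete, here is a heavily simplified, purely illustrative Python sketch of how an audio-conditioned denoising loop with an identity-preserving pull toward the reference portrait might be structured. The toy denoiser, shapes, and weights below are all invented for illustration and are not EMO’s actual architecture.

```python
# Toy sketch of a diffusion-style generation loop for one video frame:
# start from noise, repeatedly "denoise" under audio conditioning, and blend
# a little of the reference portrait back in each step so identity doesn't
# drift. Everything here is a stand-in, not a real trained model.
import numpy as np

rng = np.random.default_rng(0)

H, W = 64, 64          # toy frame resolution
STEPS = 50             # denoising steps per frame
IDENTITY_WEIGHT = 0.1  # strength of the pull toward the reference portrait

reference = rng.random((H, W))     # stand-in for the reference portrait
audio_embedding = rng.random(128)  # stand-in for a per-frame audio feature

def toy_denoiser(frame, audio_emb, t):
    """Placeholder for a learned denoising network.

    A real model would predict the noise to strip from the frame given the
    audio conditioning and the timestep; here we just nudge the frame toward
    its own mean, modulated slightly by the audio embedding.
    """
    blend = 0.5 + 0.5 * (t / STEPS)
    target = blend * frame + (1 - blend) * frame.mean()
    return frame - target + 1e-3 * audio_emb.mean()

def generate_frame(reference, audio_emb):
    frame = rng.standard_normal((H, W))  # start from pure noise
    for t in reversed(range(STEPS)):
        predicted_noise = toy_denoiser(frame, audio_emb, t)
        frame = frame - predicted_noise  # one denoising step
        # Identity-preservation step: blend a little of the reference back in
        # so the generated face never strays too far from the input portrait.
        frame = (1 - IDENTITY_WEIGHT) * frame + IDENTITY_WEIGHT * reference
    return frame

frame = generate_frame(reference, audio_embedding)
print("generated frame:", frame.shape, "mean:", round(float(frame.mean()), 3))
```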

In the related paper by [Linrui Tian] and colleagues, a number of comparisons are drawn between EMO and other frameworks, claiming significant improvements over them. The researchers also provide a number of examples of talking and singing heads generated with the framework, which gives some idea of what are probably the ‘best case’ outputs. In some of these, like [Leslie Cheung Kwok Wing] singing ‘Unconditional’, big glitches are obvious and there’s a definite mismatch between the vocal track and the facial motions. Despite this, the results are quite impressive, especially the fairly realistic movement of the head, including blinking of the eyes.

Meanwhile, some seem extremely impressed, such as [Matthew Berman] in a recent video on EMO, where he states that Alibaba releasing this framework to the public might be ‘too dangerous’. The more level-headed folks over at PetaPixel, however, also note the obvious visual imperfections that are a dead giveaway for this kind of generative technology. Much like other diffusion model-based generators, EMO would seem to be still very much stuck in the uncanny valley, with no clear path to becoming a real human yet.

Thanks to [Daniel Starr] for the tip.
