The Origins of Video-Chat Voice | The New Yorker



One night, three of my roommates and I were scattered around our common area when a FaceTime call came in from a friend of ours. I wanted to hear how he was doing. My roommate put him on speakerphone, but as his voice came through, its murkiness irritated me. The voice on the phone did not have the normal shape or elasticity of his voice. I felt deprived. We accept the strange failures that come with technology; I would never expect real-time video to overcome the inherent awkwardness of converting three dimensions into two. But since our voices are invisible, just air, some part of me thinks they should travel better.

A few days later, I overheard a roommate talking with a colleague on a Google Hangouts call, and I noticed the sound of video chat again: the nasal voice we all take on when we are filtered through videoconferencing technology. It is the voice of our Internet interlocutors, whether they are across the world or, these days, across town. What causes its peculiar traits?

To find out, I arranged a FaceTime call with Chris Kyriakakis, a professor of electrical and computer engineering at the University of Southern California and the chief audio scientist at the speaker company Syng. Kyriakakis is an expert in how sound is reproduced and perceived; he worked on a multi-university research team that digitally recreated the acoustics of Byzantine-era churches. When we turned to the topic of video-chat voice, he explained that, to maximize the clarity of the sound my computer sends, I should pay attention to two factors. The first is the distance between my mouth and the microphone. He said that if we met in person and I stood six feet away from him, his brain would focus on my voice and filter out the background noise. But microphones cannot listen the way human ears do. They simply pick up the loudest sound, and when the speaker is far away, other sounds compete with that person's voice. The key is to shorten the distance. Kyriakakis did this with headphones and a microphone. As a result, the sound he transmitted was a little warmer than mine, because I was using my laptop's microphone and, not wanting to crowd the camera, was sitting back in my chair.

The second factor is the reverberation of the room, which depends on the room's physical volume and the absorbency of its contents. Your voice is never just your voice; even in a face-to-face conversation, a person's words carry the characteristics of that person's environment. "When you talk to me, the sound coming into my ears arrives from thousands of other directions, because it bounces around the room," Kyriakakis explained. "Our brains are constantly analyzing that to understand, O.K., whether we are in my son's bedroom or on a squash court." He said that if a room has a lot of reflective surfaces, I will sound as though I were in a shower. Carpets, curtains, blankets, sweatshirts (anything plush) help keep my voice from bouncing around and improve the fidelity of the transmission. Even human bodies, being mostly water, absorb reverberation. Kyriakakis said that this is why certain orchestras offer cheap tickets to rehearsals: by filling the hall, they give the musicians a more accurate sense of how the auditorium will sound on opening night.

Many concert halls are designed on the assumption that an audience will be present, although their upholstered seats are often absorbent enough that the acoustic difference between a full house and a sparsely attended performance is negligible.

Still, living in a padded room, or gathering your roommates in a circle around you to improve the sound of a Skype date, may not be ideal. Kyriakakis has instead turned his living room into an environment where audio sounds its best by installing perforated, multilayered panels (they resemble a cluster of skyscrapers seen from above) that absorb and diffuse sound precisely. "I have an understanding wife," he said. "I can put things on the walls that look like art but have the property of killing all reflections and reverberation." I would have loved to see this arrangement for myself; unfortunately, on the day of our FaceTime call, Kyriakakis's dog and dishwasher sent him in search of quiet in his son's room. But it turns out that this may have made our conversation sound more realistic. The size and reverberation of his son's room were probably closer to those of the bedroom I was calling from. According to Kyriakakis, creating a sense of vocal intimacy online depends not only on clarity but also on similarity. To maximize the feeling of being in the same room, callers should speak from spaces with similar reverberation.

To reach the microphone, your voice has to travel only a few feet (or, preferably, inches). A much longer and more transformative journey across the Internet still lies ahead, a journey over constantly changing terrain that can be smooth or rough depending on network bandwidth. Stephen Casner, one of the early pioneers of audio and video transmission over Internet-like networks, told me that, to make the trip, a voice must be compressed and cut into small packets by what is called a codec. Each packet contains about twenty milliseconds of compressed audio: an "oh," an "ah," an "s" sound. It is as if, instead of sending someone a written letter, you sent a series of sequential postcards, each bearing a single syllable. The packets are then decompressed on your conversation partner's computer, where another codec reconstructs the sound before it leaves the speaker.
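For readers who think in code, the cutting step can be sketched roughly as follows. This is only an illustration under stated assumptions: the 48 kHz sample rate and the `packetize` helper are hypothetical, and the compression that a real codec applies to each frame is left out entirely.

```python
# Hypothetical sketch: cut a stream of audio samples into ~20 ms frames.
# A real codec would also compress each frame; that step is omitted here.
SAMPLE_RATE_HZ = 48_000  # assumed sampling rate, not taken from the article
FRAME_MS = 20            # approximate packet duration mentioned in the text


def packetize(samples, sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Split samples into (sequence_number, frame) pairs of frame_ms each."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    packets = []
    for seq, start in enumerate(range(0, len(samples), frame_len)):
        packets.append((seq, samples[start:start + frame_len]))
    return packets


one_second = [0] * SAMPLE_RATE_HZ  # one second of silence as stand-in audio
packets = packetize(one_second)
print(len(packets), len(packets[0][1]))
```

The sequence number plays the role of the order of the postcards: it lets the receiving side put frames back in order and notice when one is missing.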

Sometimes, packets get lost. If you are streaming a movie, the software can prepare for this possibility by "buffering," building up a cushion of a few seconds in which missing pieces can be retransmitted. The movie may not resume until every necessary drop in the stream has arrived. But, to avoid inserting awkward delays into our conversations, real-time audio software must tick along, pausing only minimally to check for straggling packets. If they do not arrive, so be it.
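The difference between waiting and moving on can be sketched as a playout schedule with a fixed deadline for each frame. This is a simplified model, not a description of any particular application: the `playout` function, the 40 ms waiting budget, and the arrival times below are all made up for illustration.

```python
# Hypothetical sketch of real-time playout: each 20 ms frame has a deadline,
# and a frame that has not arrived by then is treated as a gap.
FRAME_MS = 20


def playout(arrivals_ms, n_frames, wait_budget_ms=40):
    """arrivals_ms maps sequence number -> arrival time in ms.

    A sequence number missing from the mapping never arrived.
    wait_budget_ms is the assumed short pause allowed for stragglers.
    """
    schedule = []
    for seq in range(n_frames):
        deadline = wait_budget_ms + seq * FRAME_MS
        arrived = arrivals_ms.get(seq)
        if arrived is not None and arrived <= deadline:
            schedule.append((seq, "play"))
        else:
            schedule.append((seq, "gap"))  # too late or lost: move on
    return schedule


# Frame 2 arrives far too late and frame 3 never arrives at all.
arrivals = {0: 5, 1: 30, 2: 200, 4: 95}
result = playout(arrivals, 5)
print(result)
```

A movie player could afford a waiting budget of several seconds; a conversation cannot, which is why the late frame is simply skipped.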

So the question is how to fill the gaps. Videoconferencing technology uses speech codecs, which, the Mozilla engineer Timothy Terriberry told me, are designed specifically to model the human vocal tract. (These human-centered algorithms are why instrumental music played over video chat sounds so awful, Terriberry said.) We spoke over an audio-only chat on Zoom, which sometimes uses an audio codec that Terriberry helped create. If a speech codec encounters what it judges to be a gap in a vowel, it may look at the audio just before and after for clues. It then extends or inserts a flat tone, its best guess as to how a human voice would fill the space. This can produce an effect similar to Auto-Tune, if only for a moment. Fricatives (breathy consonants such as the "f" in "fridge") are especially challenging for speech codecs. They are shorter and less repetitive, so they are more likely to be lost. They are also difficult for computers to imitate, which is one reason that a person's video-chat voice is often punctuated by short hisses or "ch" sounds.
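A deliberately naive version of gap filling, shown only to make the idea concrete, is to stretch the last frame that did arrive into the hole while fading it out. This is not the method of the codec Terriberry helped create, which models the voice far more carefully; the `conceal` function and its `decay` parameter are hypothetical.

```python
# Hypothetical sketch of very simple loss concealment: a lost frame (None)
# is replaced by a quieter copy of the previous output frame.
def conceal(frames, frame_len, decay=0.5):
    """Return frames with each None replaced by a faded copy of the last frame."""
    output = []
    last = [0.0] * frame_len  # silence if the very first frame is lost
    for frame in frames:
        if frame is None:
            frame = [sample * decay for sample in last]
        output.append(frame)
        last = frame
    return output


received = [[1.0, 1.0], None, None, [0.5, 0.5]]
filled = conceal(received, frame_len=2)
print(filled)
```

Repeating a steady sound works tolerably for a held vowel, which changes little across twenty milliseconds, and poorly for a brief, noisy consonant, which is consistent with the trouble that fricatives cause.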

After talking with Terriberry, I began listening for these quirks. When I caught up with my two brothers and their voices occasionally dropped out or sounded robotic, I admired the system's attempts. A listening device that does not hear as my ears do had picked up a sound, and then part of it had got lost. The patchwork of technologies working to fill the gap produced something like an impression of my brother. In a sense, knowing about all the hidden work made the imperfections more meaningful than frustrating. The software's small mistakes were almost like the kind that, if I made them, my brothers would tease me about. As they came through into my room, their video-chat voices sounded more human.
