BERYL – new breakthrough Acoustic Echo Cancellation by Meta

I attended Meta’s RTC@Scale 2024 conference, where Meta talked about two major changes it made while revamping its core audio processing stack: BERYL, a new Acoustic Echo Cancellation (AEC) implementation, and MLOW, a new low-bitrate audio codec written fully in software. This blog contains my notes on Beryl. A PDF of the handwritten notes can be found here.

BERYL - full software AEC (by Sriram Srinivasan & Hoang Do)

  • Meta achieved a 20% reduction in “No Audio” or “Audio device reliability” issues on iOS & Android
  • 15% reduction in P50 mouth to ear latency on Android
  • Revamp of the core audio processing stack for WhatsApp, Instagram, Messenger
    • Very diverse user base
    • Different kinds of handsets
    • Different Geography
    • Noisy conditions
    • Both high end & low end phones (more than 20% are low end ARMv7)
  • Based on telemetry and user feedback, Meta decided to tackle 1. echo and 2. audio quality on low-bitrate networks
  • High end devices use ML to suppress echo
  • To accommodate low end devices which cannot run ML, a baseline solution for echo cancellation is needed
  • Welcome BERYL
  • Beryl replaces WebRTC’s AEC3 and AECM on all devices
  • Interestingly users experiencing echo issues are also on low end devices which cannot run ML
  • Meta’s scale is too large
    • High end phones have hardware AEC
    • Low end phones do not
    • Stereo / spatial audio only possible in s/w
    • H/w only does mono AEC
  • Beryl was needed because AECM either leaves a lot of residual echo or degrades the quality of double-talk
  • AECM – Not scalable for millions of users & Quality not best
  • Beryl AEC = Low compute – DSP based s/w AEC
    • Lite mode for low end devices
    • Full mode for high end devices
    • Both modes are adaptive, vs. AECM being a simple echo suppressor
    • Near instant adaptation to changes
    • Better double talk performance
    • Multi-channel capture & render at 16 kHz & 48 kHz
    • Tuned using 3000 music + speech (mono + stereo) on 20+ devices
    • CPU usage increase of less than 7% compared to WebRTC AEC

Beryl Components

1. Delay Estimator

  • Clock drift when using external mic & speaker as they do not share common clock
  • The delay estimator estimates the delay between the far-end reference signal (speaker) & the near-end capture signal (mic)
  • Beryl full mode can handle non-causal delays (negative delay)
  • Can handle delay up to 1 sec
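
To make the idea of a delay estimate concrete, here is a minimal sketch that picks the lag with the highest cross-correlation between reference and capture. This is not Beryl's estimator (the talk did not go into that detail); the function names `estimate_delay`, `shift` and `noise` are made up for illustration. Searching negative lags as well is what lets it report a non-causal delay.

```python
import random


def noise(length, seed=0):
    """Deterministic white noise, used as a stand-in for a far-end signal."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def shift(signal, delay):
    """Delay a signal by `delay` samples (negative = advance), zero-padded."""
    n = len(signal)
    return [signal[i - delay] if 0 <= i - delay < n else 0.0 for i in range(n)]


def estimate_delay(reference, capture, max_lag):
    """Return the lag (in samples) where capture best matches reference.

    Positive lag: capture trails the reference (the usual causal case).
    Negative lag: capture appears to lead the reference (non-causal case).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, ref_sample in enumerate(reference):
            m = n + lag
            if 0 <= m < len(capture):
                score += ref_sample * capture[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For example, `estimate_delay(ref, shift(ref, 5), 20)` recovers a delay of 5 samples for a noise-like `ref`. A production estimator would work on blocks in the frequency domain and track clock drift over time; this brute-force loop is only meant to show the principle.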

2. Linear AEC

  • Estimate echo & subtract from capture signal
  • Beryl AEC is a frequency-domain, dual-filter algorithm based on normalized least mean squares (NLMS)
  • One fixed & one adaptive filter
  • Coefficients can be copied between filters, based on:
    • Relative difference in the powers of the error signals of the two filters and the input mic signal
    • Coupling factor between echo estimate & error signal
  • Adaptation step size is configurable / depends on coherence between mic & reference signals, power and SIR
  • Great double talk performance compared to WebRTC AEC
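
The dual-filter idea is easier to see in code. Below is a minimal sketch, assuming a time-domain NLMS (Beryl works in the frequency domain) and a much simpler copy rule than the one in my notes: the adaptive filter's coefficients are copied to the fixed filter only when its smoothed error power is clearly lower. The class name `DualFilterNlms`, the 0.5 threshold and the smoothing constant are my own illustrative choices.

```python
import random


def convolve(signal, impulse_response):
    """Simulate an echo path: mic[n] = sum_k h[k] * ref[n - k]."""
    return [
        sum(h * signal[n - k] for k, h in enumerate(impulse_response) if n - k >= 0)
        for n in range(len(signal))
    ]


class DualFilterNlms:
    def __init__(self, taps, step=0.5, eps=1e-8, smooth=0.9):
        self.adaptive = [0.0] * taps  # updated on every sample
        self.fixed = [0.0] * taps     # only changes when coefficients are copied
        self.history = [0.0] * taps   # recent reference samples, newest first
        self.step, self.eps, self.smooth = step, eps, smooth
        self.err_adaptive = 0.0       # smoothed error power per filter
        self.err_fixed = 0.0

    def process(self, ref_sample, mic_sample):
        """Consume one reference/mic sample pair, return the echo-reduced sample."""
        self.history = [ref_sample] + self.history[:-1]
        x = self.history
        e_a = mic_sample - sum(w * xi for w, xi in zip(self.adaptive, x))
        e_f = mic_sample - sum(w * xi for w, xi in zip(self.fixed, x))

        # NLMS update: step size normalized by the reference energy.
        gain = self.step * e_a / (sum(xi * xi for xi in x) + self.eps)
        self.adaptive = [w + gain * xi for w, xi in zip(self.adaptive, x)]

        s = self.smooth
        self.err_adaptive = s * self.err_adaptive + (1 - s) * e_a * e_a
        self.err_fixed = s * self.err_fixed + (1 - s) * e_f * e_f

        # Simplified copy rule (an assumption, not Beryl's actual criteria).
        if self.err_adaptive < 0.5 * self.err_fixed:
            self.fixed = list(self.adaptive)
            self.err_fixed = self.err_adaptive
        return e_f  # output always comes from the fixed filter
```

The point of keeping two filters is that the output comes from the fixed one, so if near-end speech during double talk pushes the adaptive filter off course, the damage is not heard until (and unless) the adaptive filter proves itself better again.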

3. Acoustic Echo Suppressor (AES)

  • Non linear distortions are introduced by amplifiers before speaker and after microphone
  • AES removes this non-linear echo (residual echo)
  • AES removes stationary echo noise, distortion, applies perceptual filtering & ambient noise matching

Implementation

  • Reduce memory, CPU & latency
  • Synchronization is needed because audio from input & output devices is processed on different threads
    • Mutex in functions (good safety but worse real-time performance)
    • Low level locks on shared data structures
    • Thread safe low level data structures (ok safety, great realtime Performance)
  • NEON on ARMv7 & ARM64
  • AVX on Intel
  • CPU < 110% of WebRTC AEC
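
As an illustration of the "thread safe low level data structure" option, here is a minimal sketch of a single-producer / single-consumer ring buffer, a common way to pass audio between a capture thread and a render thread without a mutex. I don't know what structures Beryl actually uses; the class name `SpscRingBuffer` is hypothetical. In C or C++ the two indices would be atomics with acquire/release ordering; in CPython plain integer assignment is enough for a sketch.

```python
class SpscRingBuffer:
    """Single-producer / single-consumer ring buffer of audio samples.

    Only the producer writes `_write` and only the consumer writes `_read`,
    so no mutex is needed as long as there is exactly one of each.
    """

    def __init__(self, capacity):
        # One slot stays empty so that "full" and "empty" can be told apart.
        self._buf = [0.0] * (capacity + 1)
        self._read = 0
        self._write = 0

    def push(self, sample):
        """Producer side. Returns False when full instead of blocking."""
        nxt = (self._write + 1) % len(self._buf)
        if nxt == self._read:
            return False
        self._buf[self._write] = sample
        self._write = nxt  # publish the index only after the data is in place
        return True

    def pop(self):
        """Consumer side. Returns None when empty instead of blocking."""
        if self._read == self._write:
            return None
        sample = self._buf[self._read]
        self._read = (self._read + 1) % len(self._buf)
        return sample
```

Neither call ever waits, which is the property a real-time audio thread cares about: a full or empty buffer is reported to the caller rather than stalling it.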

Demystifying WebRTC

WebRTC (Web Real-Time Communication) has revolutionized the way web applications handle communication. It empowers developers to embed real-time audio, video, and data exchange functionalities directly within web pages and apps, eliminating the need for plugins or additional downloads. This blog’s attempt at demystifying WebRTC is a first step in learning the basics of this technology.

Signaling: The Orchestrator of Connections

WebRTC itself doesn’t define how peers find each other or exchange session details before media can flow. Signaling, the first act in the WebRTC play, takes center stage. It involves exchanging information about the communication session between peers. This information typically includes:

  • Session Description Protocol (SDP): An SDP carries details about the media streams (audio/video) each peer intends to send or receive, along with the codecs they support.
  • ICE Candidates: These describe the network addresses and ports a peer can use for communication.
  • Offer/Answer Model: The initiating peer sends an SDP (offer) outlining its capabilities. The receiving peer responds with an SDP (answer) indicating its acceptance and potentially modifying the offer.

Several signaling mechanisms can be employed, including WebSockets, Server-Sent Events (SSE), or even custom solutions. The choice depends on the application’s specific needs and desired level of real-time interaction.
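
Since WebRTC leaves the signaling transport up to the application, here is a minimal sketch of what a signaling exchange boils down to: small JSON messages of type offer, answer or candidate routed between two peers. `SignalingRelay` is a hypothetical in-memory stand-in for a real server (WebSocket, SSE, or anything else), and the message fields are my own choice, not a standard.

```python
import json


class SignalingRelay:
    """Hypothetical in-memory stand-in for a signaling server."""

    ALLOWED_TYPES = ("offer", "answer", "candidate")

    def __init__(self):
        self.inboxes = {}

    def register(self, peer_id):
        self.inboxes[peer_id] = []

    def send(self, sender, recipient, kind, payload):
        """Queue a JSON-encoded signaling message for `recipient`."""
        if kind not in self.ALLOWED_TYPES:
            raise ValueError(f"unknown signaling message type: {kind}")
        message = json.dumps({"from": sender, "type": kind, "payload": payload})
        self.inboxes[recipient].append(message)

    def receive(self, peer_id):
        """Return the next decoded message for `peer_id`, or None if empty."""
        inbox = self.inboxes[peer_id]
        return json.loads(inbox.pop(0)) if inbox else None


# Typical order: offer, then answer, with candidates trickling in either way.
relay = SignalingRelay()
relay.register("alice")
relay.register("bob")
relay.send("alice", "bob", "offer", {"sdp": "v=0 ..."})
relay.send("bob", "alice", "answer", {"sdp": "v=0 ..."})
```

The relay never looks inside the payload, which mirrors real deployments: the signaling server only needs to deliver messages, while the browsers interpret the SDP and candidates.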

NAT Traversal: Hurdles and Leapfrogs

WebRTC connections often face the obstacle of Network Address Translation (NAT). NAT devices on home networks hide private IP addresses behind a single public address. Direct communication between peers behind NATs becomes a challenge. WebRTC employs a combination of techniques to overcome this hurdle:

  • STUN (Session Traversal Utilities for NAT): A peer sends a STUN request to a public server, which replies with the public IP address and port the NAT mapped the request to. This lets a peer learn its own public-facing address.
  • TURN (Traversal Using Relays around NAT): When a direct connection isn’t feasible, for example behind symmetric NATs or restrictive firewalls, TURN servers act as relays. Peers send their media streams to the TURN server, which then forwards them to the destination peer. While TURN provides a reliable fallback, it adds latency and consumes relay-server bandwidth, so it is usually kept as a last resort.

NAT Traversal in WebRTC

Image Credit : García, Boni & Gallego, Micael & Gortázar, Francisco & Bertolino, Antonia. (2019). Understanding and estimating quality of experience in WebRTC applications. Computing. 101. 10.1007/s00607-018-0669-7.
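
To show what a STUN exchange looks like on the wire, here is a minimal sketch that builds a Binding Request and decodes the XOR-MAPPED-ADDRESS attribute from a Binding success response, following the message layout in RFC 5389. It handles IPv4 only and skips everything else a real client needs (retransmission, authentication, other attributes). `build_binding_success` is just a stand-in for a server reply so the parser can be exercised without a network.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """20-byte STUN header: type, length (no attributes), cookie, 12-byte id."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def build_binding_success(transaction_id, ip, port):
    """Stand-in for a server reply carrying one XOR-MAPPED-ADDRESS (IPv4)."""
    (addr,) = struct.unpack("!I", socket.inet_aton(ip))
    value = struct.pack("!BBHI", 0, 0x01, port ^ (MAGIC_COOKIE >> 16), addr ^ MAGIC_COOKIE)
    attr = struct.pack("!HH", ATTR_XOR_MAPPED_ADDRESS, len(value)) + value
    header = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE)
    return header + transaction_id + attr


def parse_xor_mapped_address(message):
    """Return (ip, port) as seen by the server, or None if not present."""
    msg_type, length, cookie = struct.unpack("!HHI", message[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == ATTR_XOR_MAPPED_ADDRESS:
            _, family, xport = struct.unpack("!BBH", value[:4])
            if family != 0x01:
                raise ValueError("this sketch only handles IPv4")
            (xaddr,) = struct.unpack("!I", value[4:8])
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, xport ^ (MAGIC_COOKIE >> 16)
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 4-byte aligned
    return None
```

The address is XOR-ed with the magic cookie so that NATs which rewrite IP addresses found inside packet payloads do not corrupt it on the way back.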

ICE: The Candidate for Connectivity

The Interactive Connectivity Establishment (ICE) framework plays a pivotal role in NAT traversal. Here’s how it works:

  1. Gathering Candidates: Each peer gathers the potential connection points (IP address and port pairs) it can use for communication. These include host candidates from local network interfaces, server-reflexive (public) candidates learned via STUN, and relay candidates allocated on a TURN server.
  2. Candidate Exchange: Peers exchange their gathered candidates with each other through the signaling channel.
  3. Connectivity Checks: Each peer attempts to establish a connection with the other using the received candidates. This might involve trying different combinations of local and remote candidates.
  4. Best Path Selection: Among the candidate pairs that pass the connectivity checks, ICE selects one to carry the media. Pairs are ranked by candidate priority, which typically prefers direct paths (host, then server-reflexive) over relayed ones.
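
One concrete input to path selection is the candidate priority defined in RFC 8445. The sketch below computes it using the type preferences the RFC recommends (host 126, peer-reflexive 110, server-reflexive 100, relay 0) and then the priority of a candidate pair. The function names are mine; the default `local_preference=65535` assumes a single network interface.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (
        (TYPE_PREFERENCE[cand_type] << 24)
        + (local_preference << 8)
        + (256 - component_id)
    )


def pair_priority(controlling, controlled):
    """Pair priority from the controlling (G) and controlled (D) priorities."""
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


host = candidate_priority("host")
srflx = candidate_priority("srflx")
relay = candidate_priority("relay")

# A direct host-to-host pair outranks anything that involves a relay.
assert pair_priority(host, host) > pair_priority(host, relay)
```

These are the numbers you see in the `priority` field of `a=candidate` lines; for instance a host candidate on component 1 with the default local preference comes out as 2130706431.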

SDP: The Session Description

The Session Description Protocol (SDP) acts as a blueprint for the WebRTC session. It’s a text-based format that conveys essential information about the media streams involved:

  • Media types: Whether it’s audio, video, or data communication.
  • Codecs: The specific compression formats used for encoding and decoding media.
  • Transport protocols: The underlying protocols used for media transport (e.g., RTP for real-time data).
  • ICE candidates: The potential connection points offered by each peer.

The SDP is exchanged during the signaling phase, allowing peers to negotiate and agree upon a mutually supported configuration for the communication session.

v=0
o=- 487255629242026503 2 IN IP4 127.0.0.1
s=-
t=0 0
a=group:BUNDLE audio video
a=msid-semantic: WMS 6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
m=audio 9 RTP/SAVPF 111 103 104 9 0 8 106 105 13 126
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:8a1/LJqQMzBmYtes
a=ice-pwd:sbfskHYHACygyHW1wVi8GZM+
a=ice-options:google-ice
a=fingerprint:sha-256 28:4C:19:10:97:56:FB:22:57:9E:5A:88:28:F3:04:DF:37:D0:7D:55:C3:D1:59:B0:B2:81:FB:9D:DF:CB:15:A8
a=setup:actpass
a=mid:audio
a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=sendrecv
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10
a=rtpmap:103 ISAC/16000
a=rtpmap:104 ISAC/32000
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:106 CN/32000
a=rtpmap:105 CN/16000
a=rtpmap:13 CN/8000
a=rtpmap:126 telephone-event/8000
a=maxptime:60
a=ssrc:3607952327 cname:v1SBHP7c76XqYcWx
a=ssrc:3607952327 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 9eb1f6d5-c3b2-46fe-b46b-63ea11c46c74
a=ssrc:3607952327 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
a=ssrc:3607952327 label:9eb1f6d5-c3b2-46fe-b46b-63ea11c46c74
m=video 9 RTP/SAVPF 100 116 117 96
c=IN IP4 0.0.0.0
a=rtcp:9 IN IP4 0.0.0.0
a=ice-ufrag:8a1/LJqQMzBmYtes
a=ice-pwd:sbfskHYHACygyHW1wVi8GZM+
a=ice-options:google-ice
a=fingerprint:sha-256 28:4C:19:10:97:56:FB:22:57:9E:5A:88:28:F3:04:DF:37:D0:7D:55:C3:D1:59:B0:B2:81:FB:9D:DF:CB:15:A8
a=setup:actpass
a=mid:video
a=extmap:2 urn:ietf:params:rtp-hdrext:toffset
a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
a=sendrecv
a=rtcp-mux
a=rtpmap:100 VP8/90000
a=rtcp-fb:100 ccm fir
a=rtcp-fb:100 nack
a=rtcp-fb:100 nack pli
a=rtcp-fb:100 goog-remb
a=rtpmap:116 red/90000
a=rtpmap:117 ulpfec/90000
a=rtpmap:96 rtx/90000
a=fmtp:96 apt=100
a=ssrc-group:FID 1175220440 3592114481
a=ssrc:1175220440 cname:v1SBHP7c76XqYcWx
a=ssrc:1175220440 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 43d2eec3-7116-4b29-ad33-466c9358bfb3
a=ssrc:1175220440 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
a=ssrc:1175220440 label:43d2eec3-7116-4b29-ad33-466c9358bfb3
a=ssrc:3592114481 cname:v1SBHP7c76XqYcWx
a=ssrc:3592114481 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 43d2eec3-7116-4b29-ad33-466c9358bfb3
a=ssrc:3592114481 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
a=ssrc:3592114481 label:43d2eec3-7116-4b29-ad33-466c9358bfb3

SDP Example
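
Because SDP is just lines of `key=value` text, it is easy to pull the interesting parts out of an example like the one above. Here is a minimal sketch of a parser that splits an SDP into media sections and collects the codecs from the `a=rtpmap` lines. The function name `parse_sdp` and the shape of the returned dictionary are my own; this is nowhere near a complete SDP parser.

```python
def parse_sdp(sdp_text):
    """Split an SDP into session-level attributes and per-media sections."""
    session = {"attributes": [], "media": []}
    current = session  # attributes before the first m= line are session-level
    for raw in sdp_text.splitlines():
        line = raw.strip()
        if len(line) < 2 or line[1] != "=":
            continue  # skip blank lines and anything that is not key=value
        key, value = line[0], line[2:]
        if key == "m":
            kind, port, protocol, *formats = value.split()
            current = {
                "kind": kind,
                "port": int(port),
                "protocol": protocol,
                "payload_types": formats,
                "codecs": {},
                "attributes": [],
            }
            session["media"].append(current)
        elif key == "a":
            if value.startswith("rtpmap:") and current is not session:
                payload_type, encoding = value[len("rtpmap:"):].split(" ", 1)
                current["codecs"][payload_type] = encoding
            else:
                current["attributes"].append(value)
    return session


sample = """v=0
m=audio 9 RTP/SAVPF 111 0
a=mid:audio
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
m=video 9 RTP/SAVPF 100
a=mid:video
a=rtpmap:100 VP8/90000
"""
parsed = parse_sdp(sample)
```

Run against the shortened `sample`, `parsed["media"][0]["codecs"]` maps payload type `"111"` to `"opus/48000/2"`, which is exactly the link between the numbers on the `m=` line and the codec names that offer/answer negotiation relies on.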

Security: Guarding the Communication Channel

WebRTC prioritizes secure communication. Two key protocols ensure data integrity and confidentiality:

  • Secure Real-time Transport Protocol (SRTP): SRTP encrypts the media content (audio/video) being transmitted between peers. This safeguards the content from eavesdroppers on the network.
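  • Datagram Transport Layer Security (DTLS): DTLS runs directly between the peers once ICE has found a path. Its handshake, authenticated against the certificate fingerprints carried in the SDP, produces the keys that SRTP uses (DTLS-SRTP), and it also encrypts data channel traffic. The signaling channel itself is secured separately by the application, typically with TLS (HTTPS or secure WebSockets).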
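
The `a=fingerprint` lines in the SDP example tie into this: they are a hash of each peer's DTLS certificate, so a peer can check that the certificate presented in the DTLS handshake is the one announced over signaling. Here is a minimal sketch of that check, assuming `cert_der` holds the DER-encoded certificate bytes. Real WebRTC stacks do this internally; the function names are hypothetical.

```python
import hashlib
import hmac


def sdp_fingerprint(cert_der):
    """Format the SHA-256 hash of a certificate the way SDP writes it."""
    digest = hashlib.sha256(cert_der).hexdigest().upper()
    pairs = (digest[i:i + 2] for i in range(0, len(digest), 2))
    return "sha-256 " + ":".join(pairs)


def fingerprint_matches(cert_der, sdp_value):
    """Compare a certificate against an a=fingerprint value from the SDP."""
    def normalize(text):
        return "".join(text.split()).lower()  # ignore case and whitespace

    expected = normalize(sdp_value)
    actual = normalize(sdp_fingerprint(cert_der))
    return hmac.compare_digest(actual, expected)
```

This is also why the signaling channel needs to be protected: anyone who can rewrite the fingerprint in transit can insert themselves between the peers.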

SCTP: Streamlining Data Delivery

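While WebRTC relies on RTP for media transport, it uses the Stream Control Transmission Protocol (SCTP), running on top of DTLS, to carry data channels. SCTP brings several useful properties to data delivery:

  • Ordered Delivery: SCTP can guarantee the order in which messages are delivered, which is crucial for reliable data communication. Ordering can be switched off per data channel when latency matters more.
  • Multihoming: SCTP as a protocol lets a peer use multiple network interfaces, improving reliability and redundancy. WebRTC, however, runs SCTP over DTLS/UDP and leaves path selection to ICE.
  • Partial Reliability: SCTP allows retransmission of lost messages to be limited (by retry count or lifetime), so stale data is not resent, improving efficiency.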

WebRTC might look complex to a beginner; however, it is not an entirely new technology. It is in fact a combination of existing protocols, codecs, networking mechanisms and transports that lets two clients behind firewalls start a P2P session to exchange media and data. The beauty of WebRTC is on display when two people can share the bond of love despite being continents apart. Look out for future blogs for more on this amazing technology.

Bibliography: