Timeline from Transformers to LLM and Agentic AI

Since the groundbreaking 2017 paper “Attention is All You Need” introduced the Transformer architecture, the field of artificial intelligence has undergone a rapid and transformative evolution. This blog post explores the chronology of important events that have shaped the AI landscape, leading up to the current era of Large Language Models (LLMs) and agentic AI. Let us walk through the timeline from Transformers to LLMs and Agentic AI.

2017: The Transformer Revolution

The journey begins with the publication of “Attention is All You Need” by Google scientists in 2017. This paper introduced the Transformer architecture, which relied solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The new model demonstrated superior translation quality and efficiency in machine translation tasks, setting the stage for a paradigm shift in natural language processing.

2018: BERT and the Rise of Bidirectional Models

Building on the success of Transformers, 2018 saw the introduction of BERT (Bidirectional Encoder Representations from Transformers) by Google researchers. BERT’s innovation lay in its bidirectional nature, allowing it to capture context from both directions in text data. This breakthrough significantly improved performance across various language tasks, from question-answering to sentiment analysis.

2019-2020: The GPT Era Begins

OpenAI’s release of GPT-2 in 2019, followed by GPT-3 (Generative Pre-trained Transformer 3) in 2020, marked significant milestones in the development of large language models. With 175 billion parameters, GPT-3 demonstrated unprecedented capabilities in natural language understanding and generation, capturing the imagination of researchers and the public alike.

2021-2022: AI Goes Mainstream

During this period, AI technologies began to permeate various industries and applications:

  • AI in Healthcare: The healthcare and pharmaceutical sectors emerged as early adopters of AI, leveraging it for tasks such as appointment scheduling, patient care, and personalized treatment.
  • Self-Driving Vehicles: AI agents moved beyond software into the physical world, making real-time, high-stakes decisions in autonomous vehicles.
  • Code Generation: AI systems like GitHub Copilot began assisting developers in writing code, hinting at the potential for AI to transform software development.

2023: The Year of Generative AI

2023 saw an explosion in generative AI applications, with tools like DALL-E, Midjourney, and ChatGPT capturing public attention. These models demonstrated the ability to generate high-quality text, images, and even code, sparking discussions about the future of creative work and knowledge work.

2024: The Dawn of Agentic AI

As we moved into 2024, the concept of Agentic AI began to take shape. This new paradigm represented a shift from isolated AI tasks to specialized, interconnected agents capable of more autonomous operation. Key developments included:

  • Multi-Agent Systems: AI agents began working collaboratively to solve complex problems, simulating human teamwork in digital environments.
  • Small Language Models (SLMs): The adoption of SLMs alongside LLMs offered new possibilities for efficient, task-specific AI solutions.
  • AI Orchestration: Frameworks for coordinating multiple AI agents emerged, allowing for more complex problem-solving approaches.

2025: The Year of Agentic AI

As we stand in 2025, Agentic AI has become the new frontier in artificial intelligence. This evolution is characterized by several key trends:

  1. Autonomous Decision-Making: AI agents now operate with greater independence, capable of long-term planning and adapting to changing conditions without constant human oversight.
  2. AI Engineers: Systems like Devin AI are now capable of debugging and writing code on their own, pushing the boundaries of what AI can achieve in software development.
  3. Industry Transformation: Agentic AI is revolutionizing various sectors, with the potential to take over entire departments in organizations. For example:
    • In healthcare, AI agents manage tasks from appointment scheduling to personalized treatment plans.
    • In customer service, AI-driven virtual assistants provide increasingly sophisticated and personalized support.
  4. Multi-Agent Collaboration: OpenAI’s introduction of “Swarm,” an experimental framework for coordinating networks of AI agents, has opened new possibilities for complex problem-solving.
  5. Enhanced Personalization: Advanced learning algorithms enable AI agents to tailor services and products to individual needs, creating highly personalized experiences across industries.
  6. Scalable Automation: AI agents are driving automation at an unprecedented scale, from small businesses to large enterprises, significantly reducing costs and operational inefficiencies.
  7. Continuous Learning and Adaptation: Agentic AI systems demonstrate the ability to learn autonomously and adapt to dynamic environments, enabling faster growth and efficiency across sectors.

As we look to the future, the potential of Agentic AI seems boundless. From enhancing decision-making processes to revolutionizing entire industries, these intelligent agents are poised to transform the way we work, create, and solve problems. However, this rapid advancement also brings new challenges in ethics, privacy, and workforce adaptation that society must address.

We have actually lived through this timeline from Transformers to LLMs and Agentic AI, and the journey has been remarkably swift, showcasing the exponential pace of innovation in artificial intelligence. As we explore the vast potential of Agentic AI, the broader quest for Artificial General Intelligence (AGI) remains a captivating goal. AGI, which aims to create intelligent systems capable of performing any intellectual task that humans can, represents the ultimate frontier in artificial intelligence. For a deeper dive into the most basic concepts of AI and Machine Learning, please visit my other blog pages.

A Deep Dive into PyTorch’s GPU Memory Management

Here is an error I got when using an image generation deep learning model. It is a common error engineers encounter when using PyTorch on a GPU. To solve it, a deep dive into PyTorch’s GPU memory management is needed. So fasten your seat belts 🙂

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB. GPU 0 has a total capacity of 3.71 GiB of which 57.00 MiB is free. Including non-PyTorch memory, this process has 3.64 GiB memory in use. Of the allocated memory 3.51 GiB is allocated by PyTorch, and 74.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This error message provides valuable insights:

  • Memory Exhaustion: The GPU’s available memory (3.71 GiB) has been depleted.
  • Allocation Attempt: PyTorch attempted to allocate 58.00 MiB, but only 57.00 MiB was free.
  • Memory Usage: The process has 3.64 GiB in use, with 3.51 GiB allocated by PyTorch and 74.06 MiB reserved by PyTorch but unallocated.
  • Fragmentation Hint: The message suggests that memory fragmentation might be contributing to the issue, and setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True might help.

PyTorch’s Memory Management Strategies

PyTorch employs a sophisticated memory management system to optimize GPU resource utilization. Here’s a detailed breakdown:

  • Caching Allocator: PyTorch uses a caching allocator to reduce the overhead of frequent memory allocations and deallocations. This improves performance but can also contribute to memory fragmentation if not managed effectively.
  • Memory Pooling: PyTorch pools memory into larger blocks to reduce fragmentation and improve allocation efficiency.
  • Automatic Deallocation: PyTorch automatically deallocates memory for tensors that are no longer needed, reducing the risk of memory leaks.
  • torch.cuda.empty_cache(): This function manually clears the cached memory, potentially freeing up unused resources.
  • PYTORCH_CUDA_ALLOC_CONF: This environment variable allows you to fine-tune memory allocation behavior. Experimenting with different configurations can help address fragmentation issues (a short sketch of these knobs follows below).
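
To make these knobs concrete, here is a minimal sketch, assuming a CUDA-capable machine and a recent PyTorch build. It prints the allocator's allocated vs. reserved counters and shows torch.cuda.empty_cache() returning cached blocks to the driver; the helper name print_gpu_memory is purely illustrative.

import os

# Illustrative: must be set before the first CUDA allocation in this process
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def print_gpu_memory(tag=""):
    # allocated = memory backing live tensors; reserved = memory held by the caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag} allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print_gpu_memory("after allocation:")
    del x                     # drop the last reference so the tensor's memory can be reused
    torch.cuda.empty_cache()  # return reserved-but-unallocated blocks to the driver
    print_gpu_memory("after empty_cache:")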

Profiling Tools for Deep Insights

To gain a granular understanding of memory usage and identify bottlenecks, profiling tools are indispensable:

NVIDIA System Management Interface (NVIDIA-smi):

  • Real-time monitoring of GPU utilization, temperature, and memory usage.
  • Provides detailed information about processes and applications consuming GPU resources.
  • Example usage in Bash:

    nvidia-smi
    watch -n0.1 nvidia-smi

    PyTorch Memory Profiler:

    • Records memory allocations and deallocations during program execution.
    • Visualizes memory usage patterns over time.
    • Example usage:

    # Enable memory history, which will
    # add tracebacks and event history to snapshots
    torch.cuda.memory._record_memory_history()

    run_your_code()
    torch.cuda.memory._dump_snapshot("my_snapshot.pickle")

    Open pytorch.org/memory_viz and drag/drop the pickled snapshot file into the visualizer. The visualizer is a JavaScript application that runs locally on your computer. It does not upload any snapshot data.

    Active Memory Timeline in PyTorch Memory visualizer

    Allocator State History in PyTorch Memory visualizer

    • Integrates seamlessly with PyTorch models and training scripts.
    • Example usage:

    import torch.profiler as profiler

    with profiler.profile(profile_memory=True) as prof:
        # Your PyTorch code here
        ...

    # Print the profiling results
    print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))

    Nsight Systems:

    • A powerful profiling tool that provides detailed insights into GPU utilization, memory usage, and performance bottlenecks.
    • Offers visualizations for performance analysis.
    • Example usage in Bash:

    nsys profile -o report python your_script.py

    Debugging and Optimization Strategies

    1. Reduce Model Size: If possible, use a smaller or optimized version of the Stable Diffusion model to reduce memory requirements.
    2. Adjust Batch Size: Experiment with different batch sizes to find the optimal balance between performance and memory usage.
    3. Optimize Data Loading: Ensure your data loading pipeline is efficient and avoids unnecessary memory copies.
    4. Monitor Memory Usage: Use profiling tools to track memory consumption and identify areas for optimization.
    5. Consider Memory-Efficient Techniques: Explore techniques like gradient checkpointing or quantization to reduce memory usage (see the sketch after this list).
    6. Leverage Cloud-Based GPUs: If your local hardware is constrained, consider using cloud-based GPU instances with larger memory capacities.
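
As an illustration of point 5, here is a minimal sketch, not tied to any particular model, of gradient checkpointing with torch.utils.checkpoint: activations inside each block are recomputed during the backward pass instead of being kept in GPU memory, trading extra compute for a smaller peak footprint. The Block and Model classes are illustrative stand-ins.

import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
            torch.nn.Linear(dim, dim), torch.nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class Model(torch.nn.Module):
    def __init__(self, num_blocks=8):
        super().__init__()
        self.blocks = torch.nn.ModuleList([Block() for _ in range(num_blocks)])

    def forward(self, x):
        for block in self.blocks:
            # Recompute each block's activations during backward instead of storing them
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = Model().cuda() if torch.cuda.is_available() else Model()
out = model(torch.randn(4, 1024, device=next(model.parameters()).device))
out.sum().backward()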

    Additional Considerations:

    • GPU Driver Updates: Ensure you have the latest GPU drivers installed to avoid performance issues or memory leaks.
    • Operating System Configuration: Check your operating system’s memory management settings to see if they can be optimized for better GPU performance.
    • TensorFlow vs. PyTorch: If you’re using TensorFlow, explore its memory management features and best practices.

    Advanced Memory Optimization Techniques

    For more advanced scenarios, consider the following techniques:

    • Memory Pooling: Manually create memory pools to allocate and reuse memory blocks efficiently. This can be helpful for specific use cases where memory allocation is frequent.
    • Custom Memory Allocators: If you have deep knowledge of CUDA and memory management, you can create custom memory allocators to address specific memory usage patterns.
    • Profiling and Benchmarking: Use profiling tools to identify performance bottlenecks and benchmark different memory optimization strategies to measure their effectiveness.

    Beyond the Code: A Deeper Dive into Memory Management

    While we’ve covered the essential aspects of PyTorch’s memory management, it’s worth exploring the underlying mechanisms in more detail.

    • CUDA Memory Allocator: CUDA, the underlying framework for NVIDIA GPUs, provides its own memory allocator. PyTorch interacts with this allocator to allocate and manage memory on the device.
    • Memory Fragmentation: When memory is allocated and deallocated frequently, it can lead to fragmentation, where small, unused memory blocks are scattered throughout the memory space. This can make it difficult for PyTorch to allocate larger contiguous blocks of memory.
    • Memory Pooling: PyTorch’s memory pooling strategy involves creating larger memory pools and allocating memory from these pools. This can help reduce fragmentation and improve memory allocation efficiency.
    • Automatic Deallocation: PyTorch uses reference counting to track memory usage and automatically deallocates memory for tensors that are no longer needed. However, it’s important to be aware of potential memory leaks if tensors are not properly managed.
    • Profiling Tools: Profiling tools like Nsight Systems can provide detailed insights into memory usage patterns, including memory allocations, deallocations, and access patterns. This information can be invaluable for identifying memory-related bottlenecks and optimizing your code.

    Conclusion

    Overcoming the “CUDA out of memory” error requires a deep understanding of PyTorch’s memory management strategies and the ability to leverage profiling tools effectively. By following the techniques outlined in this blog post, you can optimize your PyTorch applications for efficient GPU memory usage and unlock the full potential of your NVIDIA GPU.

    🚀 The Evolution of YOLO 🚀

    The YOLO (You Only Look Once) series is a family of real-time object detection algorithms built on convolutional neural networks (CNNs). It has dramatically shaped the landscape of real-time computer vision. Each iteration of YOLO brings something unique to the table, enhancing the capabilities and applications of object detection. Let’s dive into the evolution of YOLO: the details of each major YOLO model, the companies and organizations behind them, and how they contribute to the evolution of AI.

    🌐 Timeline of Key YOLO Models and Their Innovators:

    1️⃣ Joseph Redmon 🧠:

    • YOLO V1 (2016): Redmon introduced the first YOLO model, which revolutionized object detection by framing it as a single regression problem instead of a classification task. This approach allowed YOLO to detect objects in images at unprecedented speeds, making it suitable for real-time applications.
    • YOLO V2 & YOLO V3: These versions refined the detection process, improving accuracy with techniques like anchor boxes and multi-scale predictions. YOLO V3 was especially known for its balance between speed and accuracy, making it a benchmark for real-time object detection.

    2️⃣ AlexeyAB & WongKinYiu 🔧:

    • YOLO V4: Building on Redmon’s work, AlexeyAB introduced YOLO V4, which incorporated advanced techniques like CSPDarknet53 as the backbone, PANet for path aggregation, and various other improvements that significantly boosted detection accuracy while maintaining speed.
    • Scaled YOLO V4: WongKinYiu extended YOLO V4 by introducing scaling capabilities, allowing the model to adapt to different sizes depending on the computational resources available.
    • YOLO-R & YOLO V7: These versions continued to refine the architecture, focusing on edge-device efficiency without compromising accuracy, and further solidifying YOLO’s role in lightweight, real-time applications.

    3️⃣ Ultralytics (Glenn Jocher) 💻:

    • YOLO V5: Ultralytics’ YOLO V5 made a significant impact by offering an easier-to-use version of YOLO with extensive support for PyTorch. It’s known for its ease of training, deployment, and integration into various projects. YOLO V5 became the go-to model for many practitioners due to its flexibility and performance.
    • YOLO V8: The latest from Ultralytics, YOLO V8, pushes the envelope with state-of-the-art performance, integrating the latest research advancements, and focusing on deployment efficiency in various environments, from cloud to edge devices.

    4️⃣ Meituan Technical Team 🌟:

    • YOLO V6: Aimed at balancing speed and accuracy, YOLO V6 from Meituan was developed with a focus on real-world applications where inference speed on edge devices is critical. It leverages modern techniques like EfficientNet as the backbone to optimize performance.
    • YOLO V6 3.0: This update introduced further refinements in the model architecture, allowing it to perform even better on resource-constrained devices, making it ideal for mobile and embedded applications.

    5️⃣ Baidu 🧬:

    • PP-YOLO Series: Baidu’s PP-YOLO and its successors (V2 and beyond) are optimized for PaddlePaddle, an AI framework developed by Baidu. PP-YOLO models integrate many of the latest research advancements in object detection, providing a powerful tool for various commercial applications. Baidu’s focus on enhancing speed and accuracy makes PP-YOLO particularly well-suited for industrial AI applications where efficiency is key.

    6️⃣ Megvii Technology 🏢:

    • YOLOX: Megvii’s YOLOX introduces a new paradigm by decoupling the head of the network into classification and regression branches, improving performance and making it easier to adapt to different tasks. It’s optimized for versatility, robustness, and ease of deployment, particularly in scenarios requiring high accuracy and low latency. YOLOX’s advancements make it a strong contender in the commercial AI space.

    7️⃣ Alibaba DAMO Academy 🏛️:

    • DAMO YOLO: Alibaba’s DAMO Academy has taken YOLO to new heights with its DAMO YOLO models, which focus on specialized applications requiring high precision. These models leverage PyTorch and Apache licensing and are designed for integration into Alibaba’s vast ecosystem, ensuring scalability and robustness in production environments.

    8️⃣ Deci AI 🛠️:

    • YOLO-NAS: Deci AI introduces a novel approach with YOLO-NAS, utilizing Neural Architecture Search (NAS) to automatically optimize YOLO models for specific tasks. This results in highly efficient, custom-tailored models that excel in specific applications, providing a significant edge in performance and deployment flexibility.
    History of YOLO computer vision model development

    Image Credit : OpenCV.ai

    📜 Licensing Overview:

    • MIT License: Used in the early YOLO versions, allowing for broad use and modification with few restrictions.
    • GPL License: Encourages collaboration while ensuring that derivative work remains open source.
    • Apache License: Offers a balance between open-source freedom and commercial use, widely adopted in enterprise environments.

    💡 Frameworks:

    • Darknet: The original framework used in early YOLO versions, known for its speed and efficiency.
    • PyTorch: Dominates recent YOLO models, providing flexibility and a rich ecosystem for development and deployment.
    • PaddlePaddle: Baidu’s in-house framework, optimized for PP-YOLO models, ensuring tight integration with Baidu’s AI infrastructure.

    As we continue to push the boundaries of AI, the evolution of YOLO to what it is today has been of immense help to computer vision developers. Each new iteration not only refines performance but also expands the possibilities of what AI can achieve in real-world applications. Whether you’re developing on the edge or deploying in the cloud, the YOLO family offers a model for every need. 🌍


    Basic Machine Learning Optimization Algorithms

    Keeping up with my tradition of posting my old handwritten notes, here are my notes on basic Machine Learning optimization algorithms. Optimization algorithms in ML help minimize the cost function, thereby reducing the error between the predicted value and the actual value.

    1) Most popular – Gradient Descent

    Gradient descent equation

    Gradient Descent is used in linear regression, logistic regression, and early implementations of neural networks (a small sketch follows below).

    Gradient Descent
    Gradient Descent ML algorithm
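
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression with a mean-squared-error cost; the learning rate and epoch count are illustrative.

import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=100):
    """Fit w, b for linear regression y ~ X @ w + b by minimizing mean squared error."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        error = X @ w + b - y               # prediction error for every sample
        grad_w = (2.0 / n) * (X.T @ error)  # gradient of MSE w.r.t. w
        grad_b = (2.0 / n) * error.sum()    # gradient of MSE w.r.t. b
        w -= lr * grad_w                    # step opposite the gradient
        b -= lr * grad_b
    return w, b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(X, y, lr=0.05, epochs=500))   # approaches w ~ [2.0], b ~ 0.0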

    2) ADAM = ADAptive Moment estimation

    The Adam optimization algorithm does not use a single global learning rate, but a different effective learning rate for every single parameter (see the sketch below).

    Adam Optimization ML algorithm
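
Here is a minimal NumPy sketch of a single Adam update with the commonly used default hyperparameters (beta1 = 0.9, beta2 = 0.999); the running moment estimates m and v give each parameter its own effective step size. The function name adam_update is just for illustration.

import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter vector w at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients (1st moment)
    v = beta2 * v + (1 - beta2) * grad**2     # running mean of squared gradients (2nd moment)
    m_hat = m / (1 - beta1**t)                # bias correction for early iterations
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective step size
    return w, m, v

# usage: keep m and v as running state, increment t each step
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.1, -0.2, 0.3])
w, m, v = adam_update(w, grad, m, v, t=1)
print(w)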

    I hope you liked my notes on basic machine learning optimization algorithms. Feel free to comment here or on my Linkedin post.

    Hand written notes on Neural Networks and ML course by Andrew Ng

    Around 2018, when I started working on Machine Learning, I took many courses. Here are my handwritten notes on the Neural Networks and ML course by Andrew Ng. They focus on the fundamental concepts covered in the course, including Logistic Regression, Neural Networks, and Softmax Regression. Buckle up for some equations and diagrams!

    Part 1: Logistic Regression – The Binary Classification Workhorse

    Logistic regression reigns supreme for tasks where the target variable (y) can only take on two distinct values, typically denoted as 0 or 1. It essentially calculates the probability (a) of y belonging to class 1, given a set of input features (x). Here’s a breakdown of the process:

    1. Linear Combination: The model calculates a linear score (z) by taking a weighted sum of the input features (x) and their corresponding weights (w). We can represent this mathematically as:

      z = w_1x_1 + w_2x_2 + … + w_nx_n

      (where n is the number of features)
    2. Sigmoid Function: This linear score (z) doesn’t directly translate to a probability. The sigmoid function (σ) steps in to transform this score into a value between 0 and 1, representing the probability (a) of y belonging to class 1. The sigmoid function is typically defined as:

    a = \sigma(z) = \frac{1}{1 + e^{-z}}

    Sigmoid Function Plot / Logistic Curve


    Key takeaway: 1 – a represents the probability of y belonging to class 0. This is because the sum of probabilities for both classes must always equal 1.
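
A tiny NumPy sketch of this computation (with an explicit bias term b added, which the notation above folds into the weighted sum):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b=0.0):
    """Probability that y = 1 given features x, weights w, and bias b."""
    z = np.dot(w, x) + b          # linear combination of the input features
    return sigmoid(z)             # squashed into the (0, 1) range

a = predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]))
print(a, 1 - a)                   # P(y = 1) and P(y = 0) always sum to 1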

    Part 2: Demystifying Neural Networks – Building Blocks and Forward Propagation

    1. Perceptrons – The Basic Unit: Neural networks are built using perceptrons, the fundamental unit inspired by biological neurons. A perceptron takes weighted inputs (just like logistic regression), performs a linear transformation, and applies an activation function to generate an output.
    2. Activation Functions: While sigmoid functions are common in logistic regression and the initial layers of neural networks, other activation functions like ReLU (Rectified Linear Unit) can also be employed. These functions introduce non-linearity, allowing the network to learn more complex patterns in the data.
    3. Layering Perceptrons: Neural networks are not limited to single perceptrons. We can stack multiple perceptrons into layers, where each neuron in a layer receives outputs from all the neurons in the previous layer. This creates a complex network of interconnected units.
    4. Forward Propagation: Information flows through the network in a forward direction, layer by layer. In each layer, the weighted sum of the previous layer’s outputs is calculated and passed through an activation function. This process continues until the final output layer produces the network’s prediction.
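
A compact NumPy sketch of forward propagation through a small fully connected network; the layer sizes, ReLU hidden activations, and sigmoid output are illustrative choices.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers is a list of (W, b) pairs; each layer computes an activation of W @ a + b."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)       # hidden layers
    W, b = layers[-1]
    return sigmoid(W @ a + b)     # output layer for binary classification

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden units
          (rng.standard_normal((1, 4)), np.zeros(1))]   # 4 hidden units -> 1 output
print(forward(np.array([0.5, -1.0, 2.0]), layers))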

    Part 3: Unveiling Backpropagation – The Learning Algorithm

    But how do these neural networks actually learn? Backpropagation is the hero behind the scenes! It allows the network to adjust its weights and biases in an iterative manner to minimize the error between the predicted and actual outputs.

    1. Cost Function: We define a cost function that measures how well the network’s predictions align with the actual labels. A common cost function for classification problems is the cross-entropy loss.
    2. Error Calculation: Backpropagation calculates the error (difference between prediction and actual value) at the output layer and propagates it backward through the network.
    3. Weight and Bias Updates: Based on the calculated errors, the weights and biases of each neuron are adjusted in a way that minimizes the overall cost function. This process is repeated iteratively over multiple training epochs until the network converges to a minimum error state.
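
A minimal sketch of this loop for the simplest possible "network", a single sigmoid neuron trained with the cross-entropy loss; deeper networks apply exactly the same chain-rule idea layer by layer. The data here is illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=200):
    """Minimize cross-entropy by propagating the output error back to w and b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        a = sigmoid(X @ w + b)         # forward pass: predicted probabilities
        error = a - y                  # dLoss/dz for sigmoid + cross-entropy
        w -= lr * (X.T @ error) / n    # backward pass: gradient w.r.t. weights
        b -= lr * error.mean()         # gradient w.r.t. bias
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(train_logistic(X, y))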

    Part 4: Softmax Regression – Expanding Logistic Regression for Multi-Class Classification

    Logistic regression excels in binary classification, but what happens when we have more than two possible class labels for the target variable (y)? Softmax regression emerges as a powerful solution!

    1. Generalizing Logistic Regression: Softmax regression can be viewed as an extension of logistic regression for multi-class problems. It calculates a set of class scores (z_i) for each possible class (i).
    2. The Softmax Function: Similar to the sigmoid function, softmax takes these class scores (z_i) and transforms them into class probabilities (a_i) using the following formula:

    a_i = \frac{e^{z_i}}{\sum\limits_{j=1}^{C} e^{z_j}}

    (where Σ represents the sum over all possible classes j)
    Key takeaway: This function ensures that all the class probabilities (a_i) sum up to 1, which is a crucial requirement for a valid probability distribution. Intuitively, for a given input (x), only one class can be true, and the softmax function effectively distributes the probability mass across all classes based on their corresponding z_i scores.

    Softmax Function Curve

    1. Interpretation of Class Probabilities: Each class probability (a_i) represents the model’s estimated probability of the target variable (y) belonging to class i, given the input features (x). This probabilistic interpretation empowers us to not only predict the most likely class but also gauge the model’s confidence in that prediction.
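
A small NumPy sketch of the softmax function (the max subtraction is a standard numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    """Turn raw class scores z into probabilities that sum to 1."""
    z = z - z.max()            # subtract the max score for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())      # roughly [0.66 0.24 0.10], summing to 1.0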

    Part 5: Putting It All Together – Training and Cost Function for Softmax Regression


    While we’ve focused on the mechanics of softmax, training a softmax regression model involves a cost function. Here’s a brief overview:

    1. Negative Log-Likelihood Cost Function: Softmax regression typically employs the negative log-likelihood cost function. This function penalizes the model for assigning low probabilities to the correct class and vice versa. Mathematically, the cost function can be represented as:

      Cost = -\sum\limits_{i=1}^{C} y_i \log(a_i)

      (where y_i is 1 for the correct class and 0 otherwise)
    2. Model Optimization: During training, the model aims to minimize this cost function by adjusting its weights and biases through backpropagation. As the cost function decreases, the model learns to produce class probabilities that better reflect the underlying data distribution.
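
A tiny sketch of this cost for a single example, reusing the probabilities from the softmax sketch above (the class indices are illustrative):

import numpy as np

def cross_entropy(probs, true_class):
    """Negative log-likelihood: only the probability assigned to the correct class matters."""
    return -np.log(probs[true_class])

probs = np.array([0.66, 0.24, 0.10])        # softmax output for three classes
print(cross_entropy(probs, true_class=0))   # low cost: correct class got high probability
print(cross_entropy(probs, true_class=2))   # high cost: correct class got low probability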

    Conclusion: A Stepping Stone to Deep Learning

    This blog and these handwritten notes on Neural Networks and ML have provided a condensed yet detailed exploration of logistic regression, neural networks, and softmax regression, concepts covered in Andrew Ng’s Advanced Learning Algorithms course. Understanding these fundamental building blocks equips you to delve deeper into the fascinating world of Deep Learning and explore more advanced architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Remember, this is just the beginning of your Deep Learning journey!

    I hope these detailed hand written notes on Neural Networks and ML with diagrams prove helpful for your Deep Learning studies!

    How to Compile Linux kernel

    There were 22 billion internet-connected devices in the world at the end of 2018, and the total number of computing devices, including those that are not connected, is far higher still; by 2024 the number must be substantially higher. These devices run various programs written in different programming languages. Some run boring mainframe code, while others run trendy AI and ML models. Something really fundamental to all these devices is that they run an OS, an Operating System, and the majority of them run Linux. Let’s go back to basics today and do something really fundamental. Let’s learn how to compile the Linux kernel.

    Step 1: Install Dependencies Before compiling the kernel, you’ll need to install some dependencies. These may include development tools, compilers, and libraries. The required packages vary depending on your distribution. For example, on Debian-based systems, you can install the necessary packages with the following command:

    sudo apt-get install build-essential libncurses-dev bison flex libssl-dev libelf-dev

    Step 2: Download the Kernel Source Code You can download the kernel source code from the official Linux kernel website (https://www.kernel.org/). Choose a long-term supported (LTS) version, e.g. linux-6.6.24.tar.xz, and download the corresponding tarball.

    Step 3: Extract the Source Code Navigate to the directory where you downloaded the tarball and extract it using the following command:

    tar xvf linux-6.6.24.tar.xz

    Step 4: Configure the Kernel Change into the kernel source directory:

    cd linux-6.6.24
    Steps to compile Linux kernel

    Run the following command to start the kernel configuration:

    make menuconfig

    This command opens a text-based menu where you can configure various kernel options. You can navigate through the menu using the arrow keys and select options using the spacebar. Once you’re done configuring, save your changes and exit the menu.

    Compile the Kernel Once you’ve configured the kernel, you’re ready to compile it. Run the following commands:

    make -j$(nproc)

    This command starts the compilation process. The “-j$(nproc)” option tells make to use as many parallel processes as there are CPU cores, which can speed up the compilation process significantly.

    Install the Kernel Modules After the compilation is complete, you can install the kernel modules using the following command:

    sudo make modules_install

    Install the Kernel To install the newly compiled kernel, run the following command:

    sudo make install

    This command installs the kernel image, kernel modules, and other necessary files.

    Step 5: Update Boot Loader Configuration Finally, you need to update your boot loader configuration to include the new kernel. The procedure for doing this varies depending on your boot loader (e.g., GRUB, LILO).

    Reboot Once you’ve updated the boot loader configuration, reboot your system to boot into the newly compiled kernel.

    That’s it! You’ve successfully compiled and installed the Linux kernel.

    Rust Programming Language learning roadmap

    Rust is a multi-paradigm, general-purpose programming language exploding in popularity. But what makes it special? Rust offers a unique blend of blazing speed, unparalleled memory safety, and powerful abstractions, making it ideal for building high-performance, reliable systems. This blog delves into the Rust Programming Language learning roadmap.

    Why Embrace Rust?

    • Unmatched Performance: Rust eliminates the need for a garbage collector, resulting in lightning-fast execution and minimal memory overhead. This makes it perfect for resource-constrained environments and applications demanding real-time responsiveness.
    • Rock-Solid Memory Safety: Rust enforces memory safety at compile time through its ownership system. This eliminates entire classes of memory-related bugs like dangling pointers and use-after-free errors, leading to more stable and secure software.
    • Zero-Cost Abstractions: Unlike some languages where abstractions incur performance penalties, Rust achieves powerful abstractions without sacrificing speed. This allows you to write expressive, concise code while maintaining peak performance.

    Language Fundamentals: Understanding the Building Blocks

    Syntax and Semantics: Rust borrows inspiration from C-like languages in its syntax, making it familiar to programmers from that background. However, Rust’s semantics are distinct, emphasizing memory safety through ownership and immutability by default.

    Constructs and Data Structures: Rust offers a rich set of control flow constructs like if, else, loop, and while for building program logic. Data structures encompass primitive types like integers, booleans, and floating-point numbers, along with powerful composite types like arrays, vectors, structs, and enums.

    Ownership System: The Heart of Rust

    The ownership system is the cornerstone of Rust’s memory safety. Let’s delve deeper:

    • Ownership Rules: Every value in Rust has a single owner – the variable that binds it. When the variable goes out of scope, the value is automatically dropped, freeing the associated memory. This ensures memory is never left dangling or leaked.
    • Borrowing: Borrowing allows temporary access to a value without taking ownership. References (&) and mutable references (&mut) are used for borrowing. The borrow checker, a powerful Rust feature, enforces strict rules to prevent data races and ensure references always point to valid data.
    • Stack vs. Heap: Understanding these memory regions is crucial in Rust. The stack is a fixed-size memory area used for local variables and function calls. It’s fast but short-lived. The heap is a dynamically allocated memory region for larger data structures. Ownership dictates where data resides: stack for small, short-lived data, and heap for larger, long-lived data.

    Rust programming language learning roadmap

    Beyond the Basics: Advanced Features

    • Error Handling: Rust adopts a Result type for error handling. It represents either a successful computation with a value or a failure with an error value. This promotes explicit error handling, leading to more robust code.
    • Modules and Crates: Rust promotes code organization through modules and crates. Modules group related code within a source file, while crates are reusable libraries published on https://crates.io/.
    • Concurrency and Parallelism: Rust provides mechanisms for writing concurrent and parallel programs. Channels and mutexes enable safe communication and synchronization between threads, allowing efficient utilization of multi-core processors.
    • Traits and Generics: Traits define shared behaviors for different types, promoting code reusability. Generics allow writing functions and data structures that work with various types, enhancing code flexibility.
    • Lifetimes and Borrow Checker: Lifetimes specify the lifetime of references in Rust. The borrow checker enforces rules ensuring references are valid for their intended usage duration. This prevents data races and memory unsafety issues.

    Rust’s Reach: Applications Across Domains

    • Web Development: Frameworks like Rocket and Actix utilize Rust’s speed and safety for building high-performance web services and APIs.
    • Asynchronous Programming: Async/await syntax allows writing non-blocking, concurrent code, making Rust perfect for building scalable network applications.
    • Networking: Libraries like Tokio provide efficient tools for building networking applications requiring low latency and high throughput.
    • Serialization and Deserialization: Rust’s data structures map well to various data formats like JSON and CBOR, making it suitable for data exchange tasks.
    • Databases: Several database libraries like Diesel offer safe and performant database access from Rust applications.
    • Cryptography: Rust’s strong typing and memory safety make it ideal for building secure cryptographic systems.
    • Game Development: Game engines like Amethyst leverage Rust’s performance and safety for creating high-fidelity games.
    • Embedded Systems: Rust’s resource-efficiency and deterministic memory management make it a compelling choice for resource-constrained embedded systems.

    Image Credit : roadmap.sh

    BERYL – new breakthrough Acoustic Echo Cancellation by Meta

    I attended Meta’s RTC@Scale 2024 Conference, where Meta talked about two major changes it accomplished while revamping the audio processing core stack: BERYL, a new breakthrough Acoustic Echo Cancellation by Meta, and MLOW, a new low-bitrate audio codec fully written in software. This blog contains notes on Beryl. A PDF of the handwritten notes can be found here.

    BERYL - full software AEC (by Sriram Srinivasan & Hoang Do)

    • Meta achieved a 20% reduction in “No Audio” / “Audio device reliability” issues on iOS & Android
    • 15% reduction in P50 mouth-to-ear latency on Android
    • Revamp of the audio processing core stack for WhatsApp, Instagram, and Messenger
      • Very diverse user base
      • Different kinds of handsets
      • Different Geography
      • Noisy conditions
      • Both high end & Low end phones (more than 20% low end ARMV7)
    • Based on telemetry and user feedback Meta decided to tackle 1. ECHO and 2. Audio Quality under low bit rate network
    • High end devices use ML to suppress echo
    • To accommodate low end devices which cannot run ML, a baseline solution for echo cancellation is needed
    • Welcome BERYL
    • Beryl replaces WebRTC‘s AEC3 and AECM on all devices
    • Interestingly users experiencing echo issues are also on low end devices which cannot run ML
    • Meta’s scale is very large
      • High end phones have hardware AEC
      • Low end phones do not
      • Stereo / spatial audio only possible in s/w
      • H/w only does mono AEC
    • Beryl was needed because AECM either leaves a lot of residual echo or degrades the quality of double-talk
    • AECM – Not scalable for millions of users & Quality not best
    • Beryl AEC = Low compute – DSP based s/w AEC
      • Lite mode for low end devices
      • Full mode for high end
      • Both modes adaptive, vs. AECM being a simple echo suppressor
      • Near instant adaptation to changes
      • Better double talk performance
      • Multi-channel capture & render at 16 kHz & 48 kHz
      • Tuned using 3000 music & speech samples (mono & stereo) on 20+ devices
      • CPU usage increase of less than 7% compared to WebRTC AEC

    Beryl Components

    1. Delay Estimator

    • Clock drift when using external mic & speaker as they do not share common clock
    • Delay estimator estimates delay between the far-end reference signal (speaker) & the near-end capture signal (mic)
    • Beryl full mode can handle non-causal delays (negative delay)
    • Can handle delay up to 1 sec

    2. Linear AEC

    • Estimate echo & subtract from capture signal
    • Beryl AEC is normalized least mean squared (NLMS) frequency domain dual filter algo
    • One fixed & one adaptive filter
    • Coefficients can be copied between filters
      • Relative difference in the powers of the error signal between the two filters and the input mic signal
      • Coupling factor between echo estimate & error signal
    • Adaptation step size is configurable & depends on coherence between mic & reference signals, power, and SIR
    • Great double talk performance compared to WebRTC AEC
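
For intuition only, here is a heavily simplified time-domain NLMS sketch of the core idea (estimate the echo from the far-end reference and subtract it from the mic signal). Beryl's actual implementation is a frequency-domain, dual-filter design, so treat this as a conceptual illustration rather than Meta's algorithm; the function name and parameters are illustrative.

import numpy as np

def nlms_aec(far_end, mic, taps=256, mu=0.5, eps=1e-6):
    """Single adaptive FIR filter estimating the echo path; returns the echo-cancelled signal."""
    w = np.zeros(taps)                        # adaptive filter coefficients
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]         # most recent far-end (reference) samples
        echo_estimate = w @ x                 # predicted echo at the microphone
        e = mic[n] - echo_estimate            # error = mic signal minus estimated echo
        w += (mu / (x @ x + eps)) * e * x     # normalized update keeps adaptation stable
        out[n] = e
    return out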

    3. Acoustic Echo Suppressor (AES)

    • Non linear distortions are introduced by amplifiers before speaker and after microphone
    • AES removes this non-linear echo (residual echo)
    • AES removes stationary echo noise, distortion, applies perceptual filtering & ambient noise matching

    Implementation

    • Reduce memory, CPU & latency
    • Synchronization needed due to work on audio from input & output devices from different threads
      • mutex in functions (Good safety but worse real time performance)
      • Low level locks on shared data structures
      • Thread safe low level data structures (ok safety, great realtime Performance)
    • Neon on ARMv7 & ARM64
    • AVX on Intel
    • CPU usage < 110% of WebRTC AEC

    Demystifying WebRTC

    WebRTC (Web Real-Time Communication) has revolutionized the way web applications handle communication. It empowers developers to embed real-time audio, video, and data exchange functionalities directly within web pages and apps, eliminating the need for plugins or additional downloads. This blog’s attempt at demystifying WebRTC is a first step in learning the basics of this technology.

    Signaling: The Orchestrator of Connections

    WebRTC itself doesn’t establish direct connections between browsers. Signaling, the first act in the WebRTC play, takes center stage. It involves exchanging information about the communication session between peers. This information typically includes:

    • Session Description Protocol (SDP): An SDP carries details about the media streams (audio/video) each peer intends to send or receive, along with the codecs they support.
    • ICE Candidates: These describe the network addresses and ports a peer can use for communication.
    • Offer/Answer Model: The initiating peer sends an SDP (offer) outlining its capabilities. The receiving peer responds with an SDP (answer) indicating its acceptance and potentially modifying the offer.

    Several signaling mechanisms can be employed, including WebSockets, Server-Sent Events (SSE), or even custom solutions. The choice depends on the application’s specific needs and desired level of real-time interaction.
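
As a rough illustration of the offer/answer exchange, here is a sketch using the Python aiortc library, which mirrors the browser's RTCPeerConnection API. In a real application the offer and answer would travel over your signaling channel (WebSocket, HTTP, etc.) instead of staying in one process.

import asyncio
from aiortc import RTCPeerConnection

async def offer_answer_demo():
    caller, callee = RTCPeerConnection(), RTCPeerConnection()
    caller.createDataChannel("chat")                 # something to negotiate

    offer = await caller.createOffer()               # SDP describing the caller's capabilities
    await caller.setLocalDescription(offer)

    # Normally the offer is sent to the remote peer via the signaling channel
    await callee.setRemoteDescription(caller.localDescription)
    answer = await callee.createAnswer()             # SDP accepting (and possibly adjusting) the offer
    await callee.setLocalDescription(answer)
    await caller.setRemoteDescription(callee.localDescription)

    print(caller.localDescription.sdp[:200])         # peek at the generated SDP
    await caller.close()
    await callee.close()

asyncio.run(offer_answer_demo())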

    NAT Traversal: Hurdles and Leapfrogs

    WebRTC connections often face the obstacle of Network Address Translation (NAT). NAT devices on home networks hide private IP addresses behind a single public address. Direct communication between peers behind NATs becomes a challenge. WebRTC employs a combination of techniques to overcome this hurdle:

    • STUN (Session Traversal Utilities for NAT): A peer sends a STUN request to a public server, which reveals the public IP and port the NAT maps the request to. This helps a peer learn its own public facing address.
    • TURN (Traversal Using Relays around NAT): When a direct connection isn’t feasible due to restrictive firewalls, TURN servers act as relays. Peers send their media streams to the TURN server, which then forwards them to the destination peer. While TURN provides a reliable fallback, it introduces latency and may not be suitable for bandwidth-intensive applications.

    NAT Traversal in WebRTC

    Image Credit : García, Boni & Gallego, Micael & Gortázar, Francisco & Bertolino, Antonia. (2019). Understanding and estimating quality of experience in WebRTC applications. Computing. 101. 10.1007/s00607-018-0669-7.

    ICE: The Candidate for Connectivity

    The Interactive Connectivity Establishment (ICE) framework plays a pivotal role in NAT traversal. Here’s how it works:

    1. Gathering Candidates: Each peer gathers potential connection points (local IP addresses and ports) it can use for communication. These include public addresses obtained via STUN and local network interfaces.
    2. Candidate Exchange: Peers exchange their gathered candidates with each other through the signaling channel.
    3. Connectivity Checks: Each peer attempts to establish a connection with the other using the received candidates. This might involve trying different combinations of local and remote candidates.
    4. Best Path Selection: Once a successful connection is established, the peers determine the optimal path based on factors like latency and bandwidth.

    SDP: The Session Description

    The Session Description Protocol (SDP) acts as a blueprint for the WebRTC session. It’s a text-based format that conveys essential information about the media streams involved:

    • Media types: Whether it’s audio, video, or data communication.
    • Codecs: The specific compression formats used for encoding and decoding media.
    • Transport protocols: The underlying protocols used for media transport (e.g., RTP for real-time data).
    • ICE candidates: The potential connection points offered by each peer.

    The SDP is exchanged during the signaling phase, allowing peers to negotiate and agree upon a mutually supported configuration for the communication session.

    v=0 
    o=- 487255629242026503 2 IN IP4 127.0.0.1 
    s=- 
    t=0 0 
    
    a=group:BUNDLE audio video 
    a=msid-semantic: WMS 6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 
    m=audio 9 RTP/SAVPF 111 103 104 9 0 8 106 105 13 126 
    c=IN IP4 0.0.0.0
    
    a=rtcp:9 IN IP4 0.0.0.0 
    a=ice-ufrag:8a1/LJqQMzBmYtes 
    a=ice-pwd:sbfskHYHACygyHW1wVi8GZM+ 
    a=ice-options:google-ice 
    a=fingerprint:sha-256 28:4C:19:10:97:56:FB:22:57:9E:5A:88:28:F3:04:
       DF:37:D0:7D:55:C3:D1:59:B0:B2:81 :FB:9D:DF:CB:15:A8 
    a=setup:actpass 
    a=mid:audio 
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level 
    a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time 
    
    a=sendrecv 
    a=rtcp-mux 
    a=rtpmap:111 opus/48000/2 
    a=fmtp:111 minptime=10 
    a=rtpmap:103 ISAC/16000 
    a=rtpmap:104 ISAC/32000 
    a=rtpmap:9 G722/8000 
    a=rtpmap:0 PCMU/8000 
    a=rtpmap:8 PCMA/8000 
    a=rtpmap:106 CN/32000 
    a=rtpmap:105 CN/16000 
    a=rtpmap:13 CN/8000 
    a=rtpmap:126 telephone-event/8000 
    
    a=maxptime:60 
    a=ssrc:3607952327 cname:v1SBHP7c76XqYcWx 
    a=ssrc:3607952327 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 9eb1f6d5-c3b246fe
       -b46b-63ea11c46c74 
    a=ssrc:3607952327 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 
    a=ssrc:3607952327 label:9eb1f6d5-c3b2-46fe-b46b-63ea11c46c74 
    m=video 9 RTP/SAVPF 100 116 117 96 
    
    c=IN IP4 0.0.0.0 
    a=rtcp:9 IN IP4 0.0.0.0 
    a=ice-ufrag:8a1/LJqQMzBmYtes
    a=ice-pwd:sbfskHYHACygyHW1wVi8GZM+ 
    a=ice-options:google-ice 
    
    a=fingerprint:sha-256 28:4C:19:10:97:56:FB:22:57:9E:5A:88:28:F3:04:
       DF:37:D0:7D:55:C3:D1:59:B0:B2:81 :FB:9D:DF:CB:15:A8 
    a=setup:actpass 
    a=mid:video 
    a=extmap:2 urn:ietf:params:rtp-hdrext:toffset 
    a=extmap:3 http://www.webrtc.org/experiments/rtp-hdrext/abs-send-time
    
    a=sendrecv 
    a=rtcp-mux 
    a=rtpmap:100 VP8/90000 
    a=rtcp-fb:100 ccm fir 
    a=rtcp-fb:100 nack 
    a=rtcp-fb:100 nack pli 
    a=rtcp-fb:100 goog-remb 
    a=rtpmap:116 red/90000 
    a=rtpmap:117 ulpfec/90000 
    a=rtpmap:96 rtx/90000 
    
    a=fmtp:96 apt=100 
    a=ssrc-group:FID 1175220440 3592114481 
    a=ssrc:1175220440 cname:v1SBHP7c76XqYcWx 
    a=ssrc:1175220440 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
       43d2eec3-7116-4b29-ad33-466c9358bfb3 
    a=ssrc:1175220440 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 
    a=ssrc:1175220440 label:43d2eec3-7116-4b29-ad33-466c9358bfb3 
    a=ssrc:3592114481 cname:v1SBHP7c76XqYcWx 
    a=ssrc:3592114481 msid:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG
       43d2eec3-7116-4b29-ad33-466c9358bfb3 
    a=ssrc:3592114481 mslabel:6x9ZxQZqpo19FRr3Q0xsWC2JJ1lVsk2JE0sG 
    a=ssrc:3592114481 label:43d2eec3-7116-4b29-ad33-466c9358bfb3

    SDP Example

    Security: Guarding the Communication Channel

    WebRTC prioritizes secure communication. Two key protocols ensure data integrity and confidentiality:

    • Secure Real-time Transport Protocol (SRTP): SRTP encrypts the media content (audio/video) being transmitted between peers. This safeguards the content from eavesdroppers on the network.
    • Datagram Transport Layer Security (DTLS): DTLS secures the signaling channel, protecting the SDP and ICE candidates exchanged during session establishment. It establishes a secure connection using digital certificates and encryption.

    SCTP: Streamlining Data Delivery

    While WebRTC primarily relies on RTP for media transport, it also supports the Stream Control Transmission Protocol (SCTP). SCTP offers several advantages over RTP:

    • Ordered Delivery: SCTP guarantees the order in which data packets are delivered, which is crucial for reliable data communication.
    • Multihoming: A peer can use multiple network interfaces with SCTP, improving reliability and redundancy.
    • Partial Reliability: SCTP allows selective retransmission of lost packets, improving efficiency.

    WebRTC might look complex to a beginner; however, it is not a new technology. It is in fact a combination of existing protocols, codecs, networking mechanisms, and transports that enable two clients behind firewalls to start a P2P session and exchange media and data. The beauty of WebRTC is displayed when two humans can share a bond of love despite being continents apart. Look out for future blogs for more on this amazing technology.


    How to succeed at work

    Lenny Rachitsky recently interviewed Elizabeth Stone, CTO of Netflix. It was a great discussion about how very high-performing teams at Netflix work, with an eye-opening level of detail about how Netflix’s unique culture is operationalized. It is also a great lesson about “How to succeed at work”. Here are some of the tidbits:

    • On Elizabeth’s background in economics and datascience and how it helped her professionally.
      • Understanding incentives: How to clarify priorities, motivate the company, or define the problems leadership wants to solve. Externally, understanding how consumers perceive Netflix and what kind of competition Netflix is up against. Comparison between the behaviors of rational, intelligent people vs. the same people when provided certain incentives.
      • Thinking about unintended consequences – cause and effect
    • Secret behind Elizabeth’s meteoric rise in corporate ladder and advice to people on how to be successful.
      • Dedication to work – not about long working hours. More about excellence and giving your best. 
      • Enjoy the work
      • Do the best work – not about ambition but more so for betterment of team and company
      • Build partnerships inside and across teams
      • Set others up for success
      • Communicate well – especially to both technical as well as non-technical audience and help bridge the gap of communication between both.
      • Observe and learn from others.
    • How can managers help people who report to them level up to the high bar?
      • Example setting
      • Set expectations about high bar
      • Provide specific feedback on the gap (in private)
      • Help fill the gap
    • How can people avoid long hours and burnout but still meet the high bar?
      • Make sure the objective for a deliverable is clear. Manager should be able to help set clear expectations around the objective of deliverables.
      • According to Lenny, 3 important elements of netflix culture
        • Very high talent density with focus on high performers
        • Radical candor – being very direct
        • Giving people freedom and responsibility

      • How is the mental model about the keeper’s test operationalized?
        • Managers at Netflix regularly ask themselves about people on their team – “Am I willing to do everything in my power to keep my employees”
        • Employees in a team should be able to approach managers and ask: “Am I meeting your expectations, or am I meeting your keeper’s test? What’s going well and what’s not going well?”
        • A formal keepers test helps create a lightness around a very heavy and loaded conversation regarding employee performance and expectations.

      • How can companies hire best talent as Netflix does
        • Pay top dollar
        • However, do not bind employees in golden handcuffs. 
        • Figure out if the person is going to help company identify how to solve a problem or solve current problems more efficiently.
        • Hire people with newer perspective
        • Hire people who uplevel the current employees. Raise the bar for whole team.

      • Things Netflix is able to do because of its unique culture and high talent density that other companies should not try to copy
        • Freedom and responsibility aspect of netflix culture. Netflix has very high level of freedom for employees to figure out a way to solve a particular problem. Other companies may not have that luxury and may want to be more prescriptive about method and processes. “Lack of process and prescriptiveness at netflix hinges on great people at netflix who are smart but have even better judgement”