Timeline from Transformers to LLM and Agentic AI

Since the groundbreaking 2017 paper “Attention is All You Need” introduced the Transformer architecture, the field of artificial intelligence has undergone a rapid and transformative evolution. This blog post explores the chronology of important events that have shaped the AI landscape, leading up to the current era of Large Language Models (LLMs) and agentic AI. Let us walk through the timeline from Transformers to LLMs and Agentic AI.

2017: The Transformer Revolution

The journey begins with the publication of “Attention is All You Need” by Google scientists in 2017. This paper introduced the Transformer architecture, which relied solely on attention mechanisms, dispensing with recurrence and convolutions entirely [1]. The new model demonstrated superior translation quality and efficiency in machine translation tasks, setting the stage for a paradigm shift in natural language processing.

2018: BERT and the Rise of Bidirectional Models

Building on the success of Transformers, 2018 saw the introduction of BERT (Bidirectional Encoder Representations from Transformers) by Google researchers. BERT’s innovation lay in its bidirectional nature, allowing it to capture context from both directions in text data [2]. This breakthrough significantly improved performance across various language tasks, from question-answering to sentiment analysis.

2019-2020: The GPT Era Begins

The era began with OpenAI’s GPT-2 in 2019, whose 1.5 billion parameters already produced strikingly coherent text. Its successor, GPT-3 (Generative Pre-trained Transformer 3), released in 2020 with 175 billion parameters, marked a major milestone for large language models, demonstrating unprecedented capabilities in natural language understanding and generation and capturing the imagination of researchers and the public alike [4].

2021-2022: AI Goes Mainstream

During this period, AI technologies began to permeate various industries and applications:

  • AI in Healthcare: The healthcare and pharmaceutical sectors emerged as early adopters of AI, leveraging it for tasks such as appointment scheduling, patient care, and personalized treatment [3].
  • Self-Driving Vehicles: AI agents moved beyond software into the physical world, making real-time, high-stakes decisions in autonomous vehicles [4].
  • Code Generation: AI systems like GitHub Copilot began assisting developers in writing code, hinting at the potential for AI to transform software development.

2023: The Year of Generative AI

2023 saw an explosion in generative AI applications, with tools like DALL-E, Midjourney, and ChatGPT capturing public attention. These models demonstrated the ability to generate high-quality text, images, and even code, sparking discussions about the future of creative work and knowledge work.

2024: The Dawn of Agentic AI

As we moved into 2024, the concept of Agentic AI began to take shape. This new paradigm represented a shift from isolated AI tasks to specialized, interconnected agents capable of more autonomous operation [3]. Key developments included:

  • Multi-Agent Systems: AI agents began working collaboratively to solve complex problems, simulating human teamwork in digital environments [4].
  • Small Language Models (SLMs): The adoption of SLMs alongside LLMs offered new possibilities for efficient, task-specific AI solutions [3].
  • AI Orchestration: Frameworks for coordinating multiple AI agents emerged, allowing for more complex problem-solving approaches [3].

2025: The Year of Agentic AI

As we stand in 2025, Agentic AI has become the new frontier in artificial intelligence. This evolution is characterized by several key trends:

  1. Autonomous Decision-Making: AI agents now operate with greater independence, capable of long-term planning and adapting to changing conditions without constant human oversight [4].
  2. AI Engineers: Systems like Devin AI are now capable of debugging and writing code on their own, pushing the boundaries of what AI can achieve in software development [4].
  3. Industry Transformation: Agentic AI is revolutionizing various sectors, with the potential to take over entire departments in organizations [5]. For example:
    • In healthcare, AI agents manage tasks from appointment scheduling to personalized treatment plans [3].
    • In customer service, AI-driven virtual assistants provide increasingly sophisticated and personalized support.
  4. Multi-Agent Collaboration: OpenAI’s introduction of “Swarm,” an experimental framework for coordinating networks of AI agents, has opened new possibilities for complex problem-solving [5].
  5. Enhanced Personalization: Advanced learning algorithms enable AI agents to tailor services and products to individual needs, creating highly personalized experiences across industries [7].
  6. Scalable Automation: AI agents are driving automation at an unprecedented scale, from small businesses to large enterprises, significantly reducing costs and operational inefficiencies [7].
  7. Continuous Learning and Adaptation: Agentic AI systems demonstrate the ability to learn autonomously and adapt to dynamic environments, enabling faster growth and efficiency across sectors [7].

As we look to the future, the potential of Agentic AI seems boundless. From enhancing decision-making processes to revolutionizing entire industries, these intelligent agents are poised to transform the way we work, create, and solve problems. However, this rapid advancement also brings new challenges in ethics, privacy, and workforce adaptation that society must address.

We have lived through this timeline from Transformers to LLMs and Agentic AI, and the journey has been remarkably swift, showcasing the exponential pace of innovation in artificial intelligence. As we explore the vast potential of Agentic AI, the broader quest for Artificial General Intelligence (AGI) remains a captivating goal. AGI, which aims to create intelligent systems capable of performing any intellectual task that humans can, represents the ultimate frontier in artificial intelligence. For a deeper dive into the most basic concepts of AI and Machine Learning, please visit my other blog pages.

A Deep Dive into PyTorch’s GPU Memory Management

Here is an error I got when using an image generation deep learning model. It is a common error engineers run into when using PyTorch on a GPU. To solve it, a deep dive into PyTorch’s GPU memory management is needed. So fasten your seat belts 🙂

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB. GPU 0 has a total capacity of 3.71 GiB of which 57.00 MiB is free. Including non-PyTorch memory, this process has 3.64 GiB memory in use. Of the allocated memory 3.51 GiB is allocated by PyTorch, and 74.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This error message provides valuable insights:

  • Memory Exhaustion: The GPU’s total capacity of 3.71 GiB is nearly exhausted, with only 57.00 MiB free.
  • Allocation Attempt: PyTorch attempted to allocate 58.00 MiB, but there wasn’t enough free space.
  • Memory Usage: The process is using 3.64 GiB, of which 3.51 GiB is allocated by PyTorch and 74.06 MiB is reserved by PyTorch but unallocated.
  • Fragmentation Hint: The message suggests that memory fragmentation might be contributing to the issue, and that setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True might help.

PyTorch’s Memory Management Strategies

PyTorch employs a sophisticated memory management system to optimize GPU resource utilization. Here’s a detailed breakdown:

  • Caching Allocator: PyTorch uses a caching allocator to reduce the overhead of frequent memory allocations and deallocations. This improves performance but can also contribute to memory fragmentation if not managed effectively.
  • Memory Pooling: PyTorch pools memory into larger blocks to reduce fragmentation and improve allocation efficiency.
  • Automatic Deallocation: PyTorch automatically deallocates memory for tensors that are no longer needed, reducing the risk of memory leaks.
  • torch.cuda.empty_cache(): This function releases cached, unused memory blocks back to the GPU driver. Note that it does not free tensors that are still referenced.
  • PYTORCH_CUDA_ALLOC_CONF: This environment variable allows you to fine-tune memory allocation behavior. Experimenting with different configurations can help address fragmentation issues. A short sketch of both knobs follows below.
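
Here is a minimal sketch of those last two knobs, assuming a CUDA-capable machine and a recent PyTorch build; the tensor shape is arbitrary and only serves to show the allocator counters moving:

    import os

    # Must be set before PyTorch initializes CUDA in order to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    x = torch.randn(1024, 1024, device="cuda")  # allocate ~4 MiB on the GPU
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")

    del x                     # drop the last reference; the block returns to the cache
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")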

Profiling Tools for Deep Insights

To gain a granular understanding of memory usage and identify bottlenecks, profiling tools are indispensable:

NVIDIA System Management Interface (NVIDIA-smi):

  • Real-time monitoring of GPU utilization, temperature, and memory usage.
  • Provides detailed information about processes and applications consuming GPU resources.
  • Example usage in Bash:

    nvidia-smi
    watch -n 0.1 nvidia-smi

PyTorch Memory Profiler:

  • Records memory allocations and deallocations during program execution.
  • Visualizes memory usage patterns over time.
  • Example usage:

    import torch

    # Enable memory history, which adds tracebacks and event history to snapshots.
    torch.cuda.memory._record_memory_history()

    run_your_code()  # your training or inference code here

    # Dump the recorded history to a file that the visualizer can load.
    torch.cuda.memory._dump_snapshot("my_snapshot.pickle")

Open pytorch.org/memory_viz and drag and drop the pickled snapshot file into the visualizer. The visualizer is a JavaScript application that runs locally on your computer; it does not upload any snapshot data.

Active Memory Timeline in PyTorch Memory visualizer

Allocator State History in PyTorch Memory visualizer

  • Integrates seamlessly with PyTorch models and training scripts.
  • Example usage:

    import torch.profiler as profiler

    with profiler.profile(profile_memory=True) as prof:
        # Your PyTorch code here
        ...

    # Print the profiling results, sorted by GPU memory usage
    print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))

Nsight Systems:

  • A powerful profiling tool that provides detailed insights into GPU utilization, memory usage, and performance bottlenecks.
  • Offers visualizations for performance analysis.
  • Example usage in Bash (the command-line front end is nsys):

    nsys profile --trace=cuda python your_script.py

Debugging and Optimization Strategies

  1. Reduce Model Size: If possible, use a smaller or optimized version of the Stable Diffusion model to reduce memory requirements.
  2. Adjust Batch Size: Experiment with different batch sizes to find the optimal balance between performance and memory usage.
  3. Optimize Data Loading: Ensure your data loading pipeline is efficient and avoids unnecessary memory copies.
  4. Monitor Memory Usage: Use profiling tools to track memory consumption and identify areas for optimization.
  5. Consider Memory-Efficient Techniques: Explore techniques like gradient checkpointing or quantization to reduce memory usage (see the sketch after this list).
  6. Leverage Cloud-Based GPUs: If your local hardware is constrained, consider using cloud-based GPU instances with larger memory capacities.
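
As an illustration of point 5, here is a minimal gradient-checkpointing sketch using torch.utils.checkpoint. The model, layer sizes, and batch size below are made up; the idea is that checkpointed blocks recompute their activations during the backward pass instead of storing them, trading compute for GPU memory.

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedMLP(nn.Module):
        """Toy model whose blocks are checkpointed to save activation memory."""
        def __init__(self, dim=1024, depth=8):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)
            )

        def forward(self, x):
            for block in self.blocks:
                # Recompute this block's activations in backward instead of caching them.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedMLP().cuda()
    x = torch.randn(32, 1024, device="cuda", requires_grad=True)
    model(x).sum().backward()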

Additional Considerations:

  • GPU Driver Updates: Ensure you have the latest GPU drivers installed to avoid performance issues or memory leaks.
  • Operating System Configuration: Check your operating system’s memory management settings to see if they can be optimized for better GPU performance.
  • TensorFlow vs. PyTorch: If you’re using TensorFlow, explore its memory management features and best practices.

Advanced Memory Optimization Techniques

For more advanced scenarios, consider the following techniques:

  • Memory Pooling: Manually create memory pools to allocate and reuse memory blocks efficiently. This can be helpful for specific use cases where memory allocation is frequent (a simple example follows this list).
  • Custom Memory Allocators: If you have deep knowledge of CUDA and memory management, you can create custom memory allocators to address specific memory usage patterns.
  • Profiling and Benchmarking: Use profiling tools to identify performance bottlenecks and benchmark different memory optimization strategies to measure their effectiveness.
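
One simple, hand-rolled flavor of memory pooling is to pre-allocate a buffer once and write results into it with out= arguments, so the hot loop performs no new GPU allocations. The shapes and the operation below are placeholders:

    import torch

    # Pre-allocate one reusable output buffer instead of a fresh tensor per step.
    buffer = torch.empty(4096, 4096, device="cuda")

    for step in range(10):
        data = torch.randn(4096, 4096, device="cuda")
        # Write the result into the pre-allocated buffer; no new allocation here.
        torch.mul(data, 2.0, out=buffer)
        # ... consume `buffer` before the next iteration overwrites it ...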

Beyond the Code: A Deeper Dive into Memory Management

While we’ve covered the essential aspects of PyTorch’s memory management, it’s worth exploring the underlying mechanisms in more detail.

  • CUDA Memory Allocator: CUDA, the underlying framework for NVIDIA GPUs, provides its own memory allocator. PyTorch interacts with this allocator to allocate and manage memory on the device.
  • Memory Fragmentation: When memory is allocated and deallocated frequently, it can lead to fragmentation, where small, unused memory blocks are scattered throughout the memory space. This can make it difficult for PyTorch to allocate larger contiguous blocks of memory.
  • Memory Pooling: PyTorch’s memory pooling strategy involves creating larger memory pools and allocating memory from these pools. This can help reduce fragmentation and improve memory allocation efficiency.
  • Automatic Deallocation: PyTorch uses reference counting to track memory usage and automatically deallocates memory for tensors that are no longer needed. However, it’s important to be aware of potential memory leaks if tensors are not properly managed.
  • Profiling Tools: Profiling tools like Nsight Systems can provide detailed insights into memory usage patterns, including memory allocations, deallocations, and access patterns. This information can be invaluable for identifying memory-related bottlenecks and optimizing your code.

Conclusion

Overcoming the “CUDA out of memory” error requires a deep understanding of PyTorch’s memory management strategies and the ability to leverage profiling tools effectively. By following the techniques outlined in this blog post, you can optimize your PyTorch applications for efficient GPU memory usage and unlock the full potential of your NVIDIA GPU.

Handwritten Notes on Neural Networks and ML Course by Andrew Ng

Around 2018, when I started working on machine learning, I took many courses. Here are my handwritten notes on the Neural Networks and ML course by Andrew Ng. They focus on the fundamental concepts covered in the course, including Logistic Regression, Neural Networks, and Softmax Regression. Buckle up for some equations and diagrams!

Part 1: Logistic Regression – The Binary Classification Workhorse

Logistic regression reigns supreme for tasks where the target variable (y) can only take on two distinct values, typically denoted as 0 or 1. It essentially calculates the probability (a) of y belonging to class 1, given a set of input features (x). Here’s a breakdown of the process:

  1. Linear Combination: The model calculates a linear score (z) by taking a weighted sum of the input features (x) and their corresponding weights (w), plus a bias term (b). We can represent this mathematically as:

    z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b

    (where n is the number of features)
  2. Sigmoid Function: This linear score (z) doesn’t directly translate to a probability. The sigmoid function (σ) steps in to transform this score into a value between 0 and 1, representing the probability (a) of y belonging to class 1. The sigmoid function is typically defined as:

    a = \sigma(z) = \frac{1}{1 + e^{-z}}

Sigmoid Function Plot / Logistic Curve


Key takeaway: 1 - a represents the probability of y belonging to class 0. This is because the sum of probabilities for both classes must always equal 1.
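
To make the two steps concrete, here is a tiny numerical sketch in PyTorch; the feature values, weights, and bias are made up:

    import torch

    x = torch.tensor([2.0, -1.0, 0.5])   # input features (made-up values)
    w = torch.tensor([0.8, 0.3, -1.2])   # learned weights (made-up values)
    b = 0.1                              # bias term

    z = torch.dot(w, x) + b              # step 1: linear score
    a = torch.sigmoid(z)                 # step 2: probability that y = 1

    print(f"z = {z.item():.3f}")
    print(f"P(y=1|x) = {a.item():.3f}, P(y=0|x) = {1 - a.item():.3f}")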

Part 2: Demystifying Neural Networks – Building Blocks and Forward Propagation

  1. Perceptrons – The Basic Unit: Neural networks are built using perceptrons, the fundamental unit inspired by biological neurons. A perceptron takes weighted inputs (just like logistic regression), performs a linear transformation, and applies an activation function to generate an output.
  2. Activation Functions: While sigmoid functions are common in logistic regression and the initial layers of neural networks, other activation functions like ReLU (Rectified Linear Unit) can also be employed. These functions introduce non-linearity, allowing the network to learn more complex patterns in the data.
  3. Layering Perceptrons: Neural networks are not limited to single perceptrons. We can stack multiple perceptrons into layers, where each neuron in a layer receives outputs from all the neurons in the previous layer. This creates a complex network of interconnected units.
  4. Forward Propagation: Information flows through the network in a forward direction, layer by layer. In each layer, the weighted sum of the previous layer’s outputs is calculated and passed through an activation function. This process continues until the final output layer produces the network’s prediction.
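
A minimal forward-propagation sketch in PyTorch; the layer sizes and batch size are arbitrary:

    import torch
    import torch.nn as nn

    # 4 input features -> hidden layer of 8 ReLU units -> 1 sigmoid output
    model = nn.Sequential(
        nn.Linear(4, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid(),
    )

    x = torch.randn(3, 4)   # a batch of 3 examples with 4 features each
    probs = model(x)        # forward propagation, layer by layer
    print(probs.shape)      # torch.Size([3, 1])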

Part 3: Unveiling Backpropagation – The Learning Algorithm

But how do these neural networks actually learn? Backpropagation is the hero behind the scenes! It allows the network to adjust its weights and biases in an iterative manner to minimize the error between the predicted and actual outputs.

  1. Cost Function: We define a cost function that measures how well the network’s predictions align with the actual labels. A common cost function for classification problems is the cross-entropy loss.
  2. Error Calculation: Backpropagation calculates the error (difference between prediction and actual value) at the output layer and propagates it backward through the network.
  3. Weight and Bias Updates: Based on the calculated errors, the weights and biases of each neuron are adjusted in a way that minimizes the overall cost function. This process is repeated iteratively over multiple training epochs until the network converges to a minimum error state.
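
A compact training-loop sketch that ties these three steps together; the network, data, learning rate, and epoch count are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    loss_fn = nn.BCEWithLogitsLoss()  # cross-entropy loss for binary labels
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    X = torch.randn(64, 4)                    # placeholder features
    y = torch.randint(0, 2, (64, 1)).float()  # placeholder 0/1 labels

    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)  # 1. measure the cost
        loss.backward()              # 2. backpropagate the error
        optimizer.step()             # 3. update weights and biases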

Part 4: Softmax Regression – Expanding Logistic Regression for Multi-Class Classification

Logistic regression excels in binary classification, but what happens when we have more than two possible class labels for the target variable (y)? Softmax regression emerges as a powerful solution!

  1. Generalizing Logistic Regression: Softmax regression can be viewed as an extension of logistic regression for multi-class problems. It calculates a set of class scores (z_i) for each possible class (i).
  2. The Softmax Function: Similar to the sigmoid function, softmax takes these class scores (z_i) and transforms them into class probabilities (a_i) using the following formula:

    a_i = \frac{e^{z_i}}{\sum\limits_{j=1}^{C} e^{z_j}}

    (where Σ represents the sum over all possible classes j)

Key takeaway: This function ensures that all the class probabilities (a_i) sum up to 1, which is a crucial requirement for a valid probability distribution. Intuitively, for a given input (x), only one class can be true, and the softmax function effectively distributes the probability mass across all classes based on their corresponding z_i scores.

Softmax Function Curve

  3. Interpretation of Class Probabilities: Each class probability (a_i) represents the model’s estimated probability of the target variable (y) belonging to class i, given the input features (x). This probabilistic interpretation empowers us to not only predict the most likely class but also gauge the model’s confidence in that prediction.
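
A quick numerical sketch of the softmax transformation; the class scores are made up:

    import torch

    z = torch.tensor([2.0, 0.5, -1.0])  # class scores z_i for C = 3 classes (made-up)
    a = torch.softmax(z, dim=0)         # class probabilities a_i, roughly [0.79, 0.18, 0.04]

    print(a)
    print(a.sum())  # the probabilities sum to 1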

Part 5: Putting It All Together – Training and Cost Function for Softmax Regression

While we’ve focused on the mechanics of softmax, training a softmax regression model involves a cost function. Here’s a brief overview:

  1. Negative Log-Likelihood Cost Function: Softmax regression typically employs the negative log-likelihood cost function. This function penalizes the model for assigning low probabilities to the correct class and vice versa. Mathematically, the cost function can be represented as:

    \text{Cost} = -\sum\limits_{i=1}^{C} y_i \log(a_i)

    (where y_i is 1 for the correct class and 0 otherwise; a short numerical sketch follows this list)
  2. Model Optimization: During training, the model aims to minimize this cost function by adjusting its weights and biases through backpropagation. As the cost function decreases, the model learns to produce class probabilities that better reflect the underlying data distribution.
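
A small numerical sketch of this cost for a single example, using PyTorch’s functional API; the scores and label are made up:

    import torch
    import torch.nn.functional as F

    z = torch.tensor([[2.0, 0.5, -1.0]])  # class scores for one example (made-up)
    y = torch.tensor([0])                 # index of the correct class

    # cross_entropy applies softmax internally, then the negative log-likelihood.
    loss = F.cross_entropy(z, y)

    # Equivalent computation by hand:
    a = torch.softmax(z, dim=1)
    manual = -torch.log(a[0, y[0]])
    print(loss.item(), manual.item())     # the two values match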

Conclusion: A Stepping Stone to Deep Learning

This blog post and my handwritten notes on Neural Networks and ML have provided a condensed yet detailed exploration of logistic regression, neural networks, and softmax regression, concepts covered in Andrew Ng’s Advanced Learning Algorithms course. Understanding these fundamental building blocks equips you to delve deeper into the fascinating world of Deep Learning and explore more advanced architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Remember, this is just the beginning of your Deep Learning journey!

I hope these detailed handwritten notes on Neural Networks and ML, with their diagrams, prove helpful for your Deep Learning studies!