Understanding GPT 1, 2 and 3
Machine Learning

Introduction The goal of this series of posts is to build the foundational knowledge that helps us understand modern state-of-the-art LLMs, and to gain a comprehensive understanding of GPT by reading the seminal papers themselves. In my previous post, I covered transformers via the original paper “Attention Is All You Need”, which introduced the innovation that made all this progress possible. This post will focus on GPT-3 and its predecessors, GPT-1 and GPT-2....

October 1, 2023

Understanding GPT - Transformers
Machine Learning

Introduction The goal of this series of posts is to build the foundational knowledge that helps us understand modern state-of-the-art LLMs, and to gain a comprehensive understanding of GPT by reading the seminal papers themselves. In my previous post, I covered some of the seminal papers that took sequence-based models from RNNs to the attention mechanism in encoder-decoder architectures. If you don’t know about them, or would like a quick refresher, I recommend reading through the previous post before continuing here....

July 7, 2023

Understanding GPT - A Journey from RNNs to Attention
Machine Learning

Introduction ChatGPT has rightly taken the world by storm, and has possibly started the 6th wave. Given its importance, the rush to build new products and research on top of it is understandable. But I’ve always liked to ground myself with foundational knowledge of how things work before exploring anything additive. To gain such foundational knowledge, I believe understanding the progression of techniques and models is crucial to comprehending how these LLMs work under the hood....

June 18, 2023

Loss Functions in ML
Machine Learning

Introduction Loss functions tell the algorithm how far we are from the actual truth, and their gradients/derivatives help us understand how to reduce the overall loss (by changing the parameters being trained). All losses in Keras are defined here. But why is the loss function expressed as a negative log? Plot: since probabilities only lie in [0, 1], the plot is only relevant for x between 0 and 1. This means that it penalises a low probability of success exponentially more....
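As a quick illustration of that last point (a minimal sketch, not taken from the post itself): the negative log-loss -log(p) stays small when the predicted probability p of the correct class is high, but grows without bound as p approaches 0.

import math

# Negative log-likelihood for the true class: loss = -log(p).
# As p (the predicted probability of the correct class) drops,
# the loss explodes, so confident mistakes are penalised heavily.
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p={p:>5}  -log(p)={-math.log(p):.3f}")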

February 18, 2023