Build your own LLM
- pytorch
- torchtune
- torchchat
LLM101n (Karpathy)
github.com/karpathy/LLM101n
What I cannot create, I do not understand. - Richard Feynman
In this course we will build a Storyteller AI Large Language Model (LLM). Hand in hand, you'll be able to create, refine and illustrate little stories with the AI. We are going to build everything end-to-end from basics to a functioning web app similar to ChatGPT, from scratch in Python, C and CUDA, and with minimal computer science prerequisites. By the end you should have a relatively deep understanding of AI, LLMs, and deep learning more generally.
Syllabus
- Chapter 01 Bigram Language Model (language modeling) (see the short sketch after this syllabus)
- Chapter 02 Micrograd (machine learning, backpropagation)
- Chapter 03 N-gram model (multi-layer perceptron, matmul, gelu)
- Chapter 04 Attention (attention, softmax, positional encoder)
- Chapter 05 Transformer (transformer, residual, layernorm, GPT-2)
- Chapter 06 Tokenization (minBPE, byte pair encoding)
- Chapter 07 Optimization (initialization, optimization, AdamW)
- Chapter 08 Need for Speed I: Device (device, CPU, GPU, ...)
- Chapter 09 Need for Speed II: Precision (mixed precision training, fp16, bf16, fp8, ...)
- Chapter 10 Need for Speed III: Distributed (distributed optimization, DDP, ZeRO)
- Chapter 11 Datasets (datasets, data loading, synthetic data generation)
- Chapter 12 Inference I: kv-cache (kv-cache)
- Chapter 13 Inference II: Quantization (quantization)
- Chapter 14 Finetuning I: SFT (supervised finetuning SFT, PEFT, LoRA, chat)
- Chapter 15 Finetuning II: RL (reinforcement learning, RLHF, PPO, DPO)
- Chapter 16 Deployment (API, web app)
- Chapter 17 Multimodal (VQVAE, diffusion transformer)
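To make Chapter 01 above concrete: a character-level bigram language model can be sketched in a few lines of plain Python. This is a minimal illustration, not course code; the toy corpus and variable names are made up.

```python
# Minimal character-level bigram language model (Chapter 01 flavour).
# Toy corpus and names are invented for illustration.
import random
from collections import defaultdict

corpus = "once upon a time there was a tiny model that told tiny stories"

# Count how often each character follows each other character.
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def sample_next(ch):
    # Sample the next character proportionally to the observed bigram counts.
    nxt, weights = zip(*counts[ch].items())
    return random.choices(nxt, weights=weights)[0]

ch, out = "o", ["o"]
for _ in range(40):
    ch = sample_next(ch)
    out.append(ch)
print("".join(out))
```

Everything later in the syllabus (MLPs, attention, transformers) replaces this raw count table with a learned, parameterized model of the same next-token distribution.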
Appendix
Further topics to work into the progression above:
- Programming languages: Assembly, C, Python
- Data types: Integer, Float, String (ASCII, Unicode, UTF-8)
- Tensor: shapes, views, strides, contiguous, ... (see the short example after this list)
- Deep Learning frameworks: PyTorch, JAX
- Neural Net Architecture: GPT (1,2,3,4), Llama (RoPE, RMSNorm, GQA), MoE, ...
- Multimodal: Images, Audio, Video, VQVAE, VQGAN, diffusion
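A quick PyTorch sketch of the tensor item above (shapes, views, strides, contiguity); this is generic PyTorch usage, not course material.

```python
# Shapes, views, strides and contiguity in PyTorch.
import torch

x = torch.arange(12).reshape(3, 4)    # shape (3, 4), row-major storage
print(x.shape, x.stride())            # torch.Size([3, 4]) (4, 1)

v = x.view(4, 3)                      # a view: same storage, new shape
print(v.stride(), v.is_contiguous())  # (3, 1) True

t = x.t()                             # transpose: strides swapped, no copy
print(t.stride(), t.is_contiguous())  # (1, 4) False

# .view() on a non-contiguous tensor raises; .contiguous() copies first.
c = t.contiguous().view(12)
print(c.is_contiguous())              # True
```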
Building an LLM from scratch:
Follow "How I Studied LLMs in Two Weeks" to eventually be able to reproduce GPT-2 with ease (Andrej Karpathy: Let's reproduce GPT-2 (124M); playlist: Neural Networks: Zero to Hero).
- Hesam Sheikh: ai/ml | rigorously overfitting on a learning curve
- How I Studied LLMs in Two Weeks: A Comprehensive Roadmap
- ML retreat: hesamsheikh/ml-retreat, a day-by-day detailed LLM roadmap from beginner to advanced, plus some study tips
- aman.ai: exploring the art of artificial intelligence one concept at a time
- Hesamation: x.com, github.com/hesamsheikh
Other
...
LinkedIn post: scikit-llm
https://github.com/BeastByteAI
- scikit-learn
- scikit-llm (see the sketch after this list)
  - Seamlessly integrate powerful language models into scikit-learn for enhanced text analysis tasks.
  - https://beastbyte.ai/scikit-llm.html
  - https://skllm.beastbyte.ai/
- falcon
  - Train production-ready machine learning models in a single line of code.
  - https://beastbyte.ai/falcon.html
  - https://beastbyteai.github.io/falcon/intro.html
- agent_dingo
  - Develop production-ready LLM-powered applications in a simple and efficient way.
  - https://beastbyte.ai/agent-dingo.html
  - https://dingo.beastbyte.ai/
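A sketch of the scikit-llm zero-shot classification flow mentioned above. Import paths and the model parameter name have changed between scikit-llm releases, so treat this as an illustration of its sklearn-style API, not pinned code.

```python
# scikit-llm zero-shot text classification, sklearn-style (sketch).
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier  # older releases: from skllm import ZeroShotGPTClassifier

SKLLMConfig.set_openai_key("<OPENAI_API_KEY>")  # placeholder; manage keys properly

X = [
    "The delivery was late and the box was damaged.",
    "Great battery life, would buy again.",
]
y = ["negative", "positive"]  # candidate label set; zero-shot, so no real training happens

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")  # "openai_model=" in older releases
clf.fit(X, y)       # fit() only records the candidate labels
print(clf.predict(X))
```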
Is building your own LLM necessary for your own product? For example, to get an LLM that works not on a probability distribution over tokens but with certainty?
Or would enhancing an open-source LLM (e.g. Llama 3.1 405B) be enough?
What is the vision here?
Enhance an LLM with your own data
...
LM capabilities
Reasoning
In terms of reasoning, it would be nice to train an LLM on all of the best practices in reasoning that come naturally to me but are absent in the LLMs I use. Make a list of reasoning best practices or nice-to-haves, and add things to it as they pop up in the dev & testing process. The downsides of LM reasoning became obvious once I tried generating SAFe entities with Llama 3.1 8B (it did well for Strategic Themes & OKRs, not so well for the Portfolio Canvas). There was a lot of repetition.
- Try generating again with Llama 3.2
- Define the in-context prompt that gives the best results
- Define reasoning downsides
- Define possible reasoning improvements
There are a lot of flaws that I see in the current model's reasoning, and also a lot of possible improvements, which it makes sense to try to implement step by step.
It might be a good idea to play with prompts a little, to see if and how I can get the best possible results without changing the model itself, and then verify what those results could be. E.g., if results with in-context prompts are satisfactory, maybe I can build something right away.
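A minimal sketch of that prompt-first experiment, assuming the Hugging Face transformers library and the gated meta-llama/Llama-3.1-8B-Instruct checkpoint; the SAFe example content is invented for illustration.

```python
# In-context example plus a repetition penalty: try to curb the repetition
# observed with Llama 3.1 8B before touching the model itself.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated repo, requires HF access
    device_map="auto",
)

messages = [
    {"role": "system",
     "content": "You are a SAFe coach. Produce concise, non-repetitive entities."},
    {"role": "user",
     "content": ("Example strategic theme: 'Shift 30% of revenue to digital "
                 "channels by 2026'.\nNow write three distinct strategic themes "
                 "for a logistics company.")},
]

out = generator(
    messages,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,  # directly targets the repetition issue noted above
)
print(out[0]["generated_text"][-1]["content"])
```

If the outputs stay repetitive across Strategic Themes, OKRs and the Portfolio Canvas even with prompts like this, that is a stronger signal that fine-tuning (Chapter 14 territory) is actually needed.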
Random ideas
Current SOTA coding models, like DeepSeek-V2: what data are they trained on? Is the dataset well annotated? Is it of the highest quality? Imagine a dataset of all GitHub open-source codebases, perfectly annotated, where every LOC (line of code) is properly labelled and where certain blocks of functionality are labelled as well, e.g. implemented entities (drawer, appbar, etc.). Would that help to build better fine-tuned models, with a higher accuracy rate for code generation? Also, imagine generating synthetic data of the highest quality for such training. What would it be, what would it cost in time & effort, and what would it allow us to achieve?
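To make the idea tangible, here is one hypothetical record format such a dataset could use; all field names and labels below are invented for illustration, not taken from any existing dataset.

```python
# Hypothetical schema for a "perfectly annotated codebase" sample.
from dataclasses import dataclass, field

@dataclass
class LineAnnotation:
    line_no: int
    code: str
    label: str            # e.g. "class-declaration", "layout", "event-handler"

@dataclass
class BlockAnnotation:
    start_line: int
    end_line: int
    entity: str           # e.g. "drawer", "appbar"
    description: str

@dataclass
class AnnotatedFile:
    repo: str
    path: str
    language: str
    lines: list[LineAnnotation] = field(default_factory=list)
    blocks: list[BlockAnnotation] = field(default_factory=list)

# One made-up sample record.
sample = AnnotatedFile(
    repo="example/flutter-app",
    path="lib/widgets/app_drawer.dart",
    language="dart",
    lines=[LineAnnotation(1, "class AppDrawer extends StatelessWidget {", "class-declaration")],
    blocks=[BlockAnnotation(1, 42, "drawer", "Navigation drawer with profile header")],
)
```

Pricing the annotation or synthetic-generation effort then becomes a question of records per repo times cost per record, which is easier to estimate once a schema like this is fixed.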