Generative AI
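- A Large Language Model (LLM) is a statistical model of language: it assigns probabilities to sequences of tokens.
- An LLM processes natural language; the broader field is Natural Language Processing (NLP).
- Earlier LLMs typically handled a single language; most current ones are multilingual.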
- A foundation model is a large model trained on broad data that can be adapted to many tasks; many foundation models support multiple modalities.
- Model vs. modality
- Model: the LLM itself
- Modality: the type of data the model accepts or produces:
- text
- image
- audio
- video
- LLMs can be classified into
- Masked Models
- Autoregressive models
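Masked models predict a missing token inside the sentence, using the context on both sides:
I have gone to Goa. The .... are beautiful and breezy
=> beaches 80%
=> farms 10%
=> .... 7%
Autoregressive models predict the next token, using only the context to its left:
I have gone to Goa. The beaches are ......
=> beautiful 80%
=> ugly 10%
...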
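The difference between the two prediction styles can be sketched with simple word counts. This is only a toy illustration: the corpus and the function names `predict_next` and `predict_masked` are made up for this example, and a real LLM uses a neural network trained on far more text, so the probabilities here will not match the ones in the notes.

```python
from collections import Counter

# Hypothetical toy corpus; a real LLM learns from vastly more text.
CORPUS = [
    "the beaches are beautiful",
    "the beaches are breezy",
    "the farms are green",
    "the beaches are beautiful",
]

def to_probabilities(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def predict_next(left_word):
    """Autoregressive style: use only the word to the left of the gap."""
    counts = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        for prev, cur in zip(words, words[1:]):
            if prev == left_word:
                counts[cur] += 1
    return to_probabilities(counts)

def predict_masked(left_word, right_word):
    """Masked style: use the words on both sides of the gap."""
    counts = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        for prev, cur, nxt in zip(words, words[1:], words[2:]):
            if prev == left_word and nxt == right_word:
                counts[cur] += 1
    return to_probabilities(counts)

# "the beaches are ____"  -> only the left context is used
next_word = predict_next("are")
# "the ____ are ..."      -> left and right context are both used
masked_word = predict_masked("the", "are")
print(next_word)
print(masked_word)
```

Both functions return a probability for every candidate word, which mirrors the "beaches 80%, farms 10%" style of output in the examples.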
- Most LLMs are autoregressive.
- Let's look at GPT-2.
- At its core, an LLM predicts the next token.
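Text generation is next-token prediction applied repeatedly: predict a token, append it, and predict again. The sketch below assumes a toy count-based model in place of a neural network; the training text and the helper names `train_bigrams` and `generate` are hypothetical, and it always picks the most frequent continuation (greedy decoding), which is only one of several possible strategies.

```python
from collections import Counter, defaultdict

# Hypothetical training text, already split into tokens by spaces.
TEXT = "the beaches are beautiful . the beaches are breezy . the beaches are beautiful ."

def train_bigrams(text):
    """Count how often each token follows each other token."""
    follows = defaultdict(Counter)
    tokens = text.split()
    for prev, cur in zip(tokens, tokens[1:]):
        follows[prev][cur] += 1
    return follows

def generate(follows, start, max_tokens=10):
    """Greedy decoding: repeatedly append the most frequent next token."""
    output = [start]
    for _ in range(max_tokens):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = candidates.most_common(1)[0][0]
        output.append(next_token)
        if next_token == ".":
            break  # stop at the end of the sentence
    return output

model = train_bigrams(TEXT)
sentence = generate(model, "the")
print(" ".join(sentence))
```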
- Token: a sentence is broken into smaller units called tokens (words or pieces of words).
- Each token is converted into a vector, which is a point in a high-dimensional space.
- These token vectors are called embeddings; the full set, one vector per vocabulary entry, forms the embedding table.
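The path from sentence to tokens to vectors can be sketched as follows. This is a minimal illustration under stated assumptions: the whitespace tokenizer, the 4-dimensional vectors, and the helper names are all hypothetical. Real tokenizers usually split text into subword pieces, and real embedding values are learned during training rather than drawn at random.

```python
import random

def tokenize(sentence):
    """Split on whitespace; real tokenizers usually produce subword pieces."""
    return sentence.lower().split()

def build_vocab(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def build_embedding_table(vocab_size, dim, seed=0):
    """One vector per token id. Random here; learned in a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

tokens = tokenize("The beaches are beautiful and the beaches are breezy")
vocab = build_vocab(tokens)
table = build_embedding_table(len(vocab), dim=4)

token_ids = [vocab[t] for t in tokens]   # tokens -> integer ids
vectors = [table[i] for i in token_ids]  # ids -> embedding vectors
print(token_ids)
```

Repeated tokens such as "the" map to the same id and therefore to the same vector, which is why the table needs only one row per vocabulary entry.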
- An LLM is trained on very large volumes of data.
- Llama 3.1 is trained on the following data:
- Major dataset categories used in Llama 3:
- Web data (filtered)
- Coding data (GitHub and other public code)
- Mathematics datasets (proofs, reasoning, math Q&A)
- Wikipedia
- Scientific literature (arXiv)
- Books corpora
- Conversational data
- Synthetic data generated by Meta models
