Gen-AI Developer Classroom notes 07/Apr/2025

Problem

We have a textbook that is primarily text but also contains some images.
We need to parse both the text and the images from the PDF file.

Relation between chunk and token

A document is broken down into chunks.
Each chunk contains multiple tokens.
Note: When using an LLM, ensure that a chunk does not contain more tokens than the model accepts.
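
As a rough illustration of the note above, the sketch below estimates a chunk's token count and compares it with a model limit. The 4-characters-per-token figure is only a common rule of thumb for English text, and `estimate_tokens` and `fits_in_context` are hypothetical helper names; a real tokenizer for the chosen model will give different counts.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(chunk: str, token_limit: int) -> bool:
    """Return True if the estimated token count is within the model limit."""
    return estimate_tokens(chunk) <= token_limit


if __name__ == "__main__":
    chunk = "A document is broken down into chunks. " * 50
    print(estimate_tokens(chunk), fits_in_context(chunk, 4096))
```

For production use, the estimate should be replaced with the tokenizer that belongs to the model being called.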

Popular models and their token limit

| Model | Token Limit | Estimated Word Count |
|-------|-------------|----------------------|
| GPT-3.5 | 4,096 | Approximately 3,083 words[3] |
| GPT-3.5 Turbo | 4,000 | Not specified[1] |
| GPT-3.5 Turbo-16k | 16,000 | Not specified[1] |
| GPT-4 | 32,768 | Approximately 25,000 words[3] |
| Llama2 | 2,048 | Approximately 1,563 words[3] |
| Claude 2 | 100,000 | Approximately 60,000 words[3] |
| PaLM | 8,000 | Approximately 6,200 words[3] |
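
The word counts in the table imply a words-per-token ratio for each model. The small sketch below computes that ratio from the rows that list a word estimate. The numbers are copied from the table as given, not independently verified, and `words_per_token` is a hypothetical helper name.

```python
# (model, token limit, estimated words) copied from the table above;
# rows without a word estimate are left out.
ROWS = [
    ("GPT-3.5", 4096, 3083),
    ("GPT-4", 32768, 25000),
    ("Llama2", 2048, 1563),
    ("Claude 2", 100000, 60000),
    ("PaLM", 8000, 6200),
]


def words_per_token(token_limit: int, word_count: int) -> float:
    """Ratio of estimated words to tokens for one table row."""
    return word_count / token_limit


ratios = {model: round(words_per_token(t, w), 2) for model, t, w in ROWS}

if __name__ == "__main__":
    for model, ratio in ratios.items():
        print(f"{model}: ~{ratio} words per token")
```

Most rows land near three quarters of a word per token, while the Claude 2 row is noticeably lower at 0.6, so the word counts are best read as estimates only.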

Different Types of Chunking

Azure AI Docs

Basic Problem

Parsing Content from PDF:
- Let's parse the text page by page.
- Let's also extract the images, if any, on each page.
The page information will be metadata, stored alongside the vectors in the vector database.
Now let's find a Python library that can parse text page by page.
Refer Here for the code written.
We now have a parser which can parse text as well as images.
Note: Image extraction still needs a fix.
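
The class code is linked rather than shown here, so the following is only a stand-in sketch of page-by-page parsing, assuming the `pypdf` library (the notes do not say which library was used in class). `PageRecord` and `parse_pdf` are hypothetical names. Image handling is deliberately minimal and only collects image names.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    """Text and image names for one PDF page."""
    source: str
    page_number: int
    text: str
    image_names: list = field(default_factory=list)

    def metadata(self) -> dict:
        """Metadata to store next to the vector in the vector database."""
        return {
            "source": self.source,
            "page": self.page_number,
            "images": list(self.image_names),
        }


def parse_pdf(path: str) -> list:
    """Parse a PDF page by page (assumes pypdf is installed)."""
    # Imported here so the rest of the module loads without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    records = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        # Assumption: reading page.images may need Pillow installed
        # and can fail on unusual image encodings.
        image_names = [image.name for image in page.images]
        records.append(PageRecord(path, number, text, image_names))
    return records
```

Calling `parse_pdf("book.pdf")` (a placeholder path) would return one record per page, and `record.metadata()` gives the page information to store with each vector.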
Design:
- Will our system perform similarity search based on images?
  - If yes, we need to create image embeddings as well.
  - If no, perform text embedding only and add the images as metadata.
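
The two design options can be sketched as one function where the image embedder is optional. Everything here is illustrative: `build_records`, the record layout, and the `embed_text` / `embed_image` callables are assumptions, and no particular embedding model or vector database is implied.

```python
def build_records(pages, embed_text, embed_image=None):
    """Build vector-store records from parsed pages.

    pages: list of dicts with 'page', 'text' and 'images' keys (assumed layout).
    embed_text / embed_image: callables returning a vector (hypothetical).
    """
    records = []
    for page in pages:
        # Text is always embedded; image names travel along as metadata.
        records.append({
            "kind": "text",
            "vector": embed_text(page["text"]),
            "metadata": {"page": page["page"], "images": list(page["images"])},
        })
        # Only when image similarity search is wanted do images get vectors.
        if embed_image is not None:
            for image in page["images"]:
                records.append({
                    "kind": "image",
                    "vector": embed_image(image),
                    "metadata": {"page": page["page"], "image": image},
                })
    return records


if __name__ == "__main__":
    fake_embed = lambda value: [float(len(value))]  # stand-in embedder
    pages = [{"page": 1, "text": "hello", "images": ["a.png", "b.png"]}]
    print(len(build_records(pages, fake_embed)))
    print(len(build_records(pages, fake_embed, fake_embed)))
```

Leaving `embed_image` unset corresponds to the "no" branch: images are still discoverable through the page metadata, but they cannot be searched by similarity.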
Next Steps:
- Chunking
- Look for libraries that are model-aware when chunking.
Concepts required for this:
- Transformer
- LangChain
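
To make "model-aware chunking" concrete before looking at libraries, here is a minimal sketch of a greedy chunker that keeps each chunk within a token budget. The `chunk_by_tokens` name is hypothetical, and the default token counter simply counts whitespace-separated words; in practice a model-specific tokenizer would be passed in as `count_tokens`.

```python
def chunk_by_tokens(text, max_tokens, count_tokens=None):
    """Greedily pack words into chunks of at most max_tokens tokens.

    count_tokens: callable mapping a string to its token count.
    The default (word count) is only a placeholder for a real tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())

    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    print(chunk_by_tokens("one two three four five", max_tokens=2))
```

A single word that exceeds the budget on its own is kept as its own chunk rather than split. Library splitters usually also handle overlap and sentence boundaries, which this sketch leaves out.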

By continuous learner
