Problem
- We have a textbook that is primarily text, with some images.
- We need to parse both the text and the images from the PDF file.

Relation between chunk and token
- A document is broken down into chunks.
- Each chunk contains multiple tokens.
- Note: When using an LLM, ensure a chunk does not contain more tokens than the model accepts.
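As a rough illustration of the note above, the check could be sketched as follows. The function names and the ratio of 0.75 words per token are assumptions for illustration only (a common rule of thumb for English text); a real pipeline would count tokens with the target model's own tokenizer.

```python
import math

# Assumption: roughly 0.75 English words per token (rule of thumb, not exact).
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the whitespace-separated word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_token_limit(chunk: str, token_limit: int) -> bool:
    """Return True if the estimated token count is within the model's limit."""
    return estimate_tokens(chunk) <= token_limit


chunk = "A document is broken down into chunks."
print(estimate_tokens(chunk), fits_token_limit(chunk, token_limit=4096))
```

Because the estimate is approximate, it is safer to leave some headroom below the model's limit rather than filling a chunk right up to it.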
Popular models and their token limits
| Model | Token Limit | Estimated Word Count |
|-------|-------------|----------------------|
| GPT-3.5 | 4,096 | Approximately 3,083 words[3] |
| GPT-3.5 Turbo | 4,000 | Not specified[1] |
| GPT-3.5 Turbo-16k | 16,000 | Not specified[1] |
| GPT-4 | 32,768 | Approximately 25,000 words[3] |
| Llama2 | 2,048 | Approximately 1,563 words[3] |
| Claude 2 | 100,000 | Approximately 60,000 words[3] |
| PaLM | 8,000 | Approximately 6,200 words[3] |
Different Types of Chunking
Basic Problem
- Parsing content from the PDF:
- Let's parse the text page by page.
- Let's also extract any images on each page.
- The page information will be metadata, stored alongside the vectors in the vector database.
Now let's find a Python library that can parse text page by page.
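A minimal sketch of page-by-page parsing is shown here. It assumes the `pypdf` library, which is one possible choice and not necessarily the one used in the referenced code; the helper names `extract_pages` and `to_records` are hypothetical. Image names are kept as metadata next to the page number.

```python
from typing import Dict, List


def extract_pages(pdf_path: str) -> List[Dict]:
    """Read a PDF page by page, returning text and image names per page.

    Assumption: pypdf is installed (pip install pypdf). It is imported here
    so that the rest of the sketch can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        pages.append({
            "page": number,
            "text": page.extract_text() or "",
            "images": [image.name for image in page.images],
        })
    return pages


def to_records(pages: List[Dict], source: str) -> List[Dict]:
    """Attach page-level metadata to each page's text, skipping empty pages."""
    records = []
    for page in pages:
        text = page["text"].strip()
        if not text:
            continue
        records.append({
            "text": text,
            "metadata": {
                "source": source,
                "page": page["page"],
                "images": page["images"],
            },
        })
    return records
```

With a hypothetical file named `textbook.pdf`, the two steps would be combined as `to_records(extract_pages("textbook.pdf"), source="textbook.pdf")`. Each record's `metadata` is what would be stored next to the vector in the vector database.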
Refer Here for the code written
- We have a parser that can parse text as well as images.
- Note: Image extraction still needs a fix.
- Design:
- Will our system perform similarity search based on images?
- If yes, we need to create image embeddings as well.
- If no, perform text embedding and add the images as metadata.
- Next Steps:
- Chunking
- Look for libraries that are model-aware when chunking.
- Concepts required for this:
- Transformer
- LangChain
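To make the chunking step concrete, here is a hedged sketch of token-budgeted chunking with overlap. The function `chunk_by_tokens` is hypothetical, and whitespace splitting stands in for a real model tokenizer; a model-aware library would use the model's tokenizer for both splitting and rejoining.

```python
from typing import Callable, List


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap: int = 0,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens. The default tokenizer is a
    whitespace split, used only as a stand-in for a model tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        # Joining with spaces is only valid for whitespace tokens.
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_by_tokens("a b c d e", max_tokens=3, overlap=1))
```

The overlap keeps some context shared between neighbouring chunks, at the cost of storing a few tokens twice.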

