Retrieval-augmented generation starts before any model call: first you turn documents into searchable chunks.
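# In recent LangChain releases the same class is also importable from langchain_text_splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # maximum characters per chunk
    chunk_overlap=120,  # maximum characters shared between neighboring chunks
)
# document_text is the full document as a single string, loaded beforehand.
chunks = splitter.split_text(document_text)  # returns a list of strings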
Chunk size is a trade-off: too small and each chunk loses the context it needs to make sense on its own; too large and irrelevant text dilutes the part that actually answers the question. 500–1000 characters with a modest overlap, such as the 800 and 120 used above, is a reasonable starting point.
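To get a feel for the trade-off without installing anything, here is a minimal sketch of a fixed-width chunker. It is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses (which prefers to break on separators such as paragraph and line breaks). The names sliding_chunks and sample_text are made up for this illustration.

```python
# Simplified fixed-width chunker, for illustration only. It is NOT the
# algorithm RecursiveCharacterTextSplitter uses; it just cuts every
# `chunk_size - chunk_overlap` characters.
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text standing in for a real document (2,000 characters).
sample_text = "0123456789" * 200

# Compare a small, a medium, and a large chunk size on the same text.
for size, overlap in [(200, 30), (800, 120), (2000, 300)]:
    pieces = sliding_chunks(sample_text, size, overlap)
    print(f"chunk_size={size:>4} overlap={overlap:>3} -> {len(pieces)} chunks")
```

Smaller sizes produce many short chunks, each carrying little context; the largest size here swallows the whole text in one chunk, which is the dilution case. The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still intact in at least one of them.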
Try it
Run the splitter on one real document and print the chunk count and a sample chunk. That's the input every later step in this series builds on.
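One possible way to report the result is a small helper like the sketch below. The name summarize_chunks and the demo_chunks list are hypothetical; in practice you would pass in the list returned by splitter.split_text(document_text).

```python
def summarize_chunks(chunks: list[str], sample_index: int = 0) -> str:
    """Return a short report: chunk count, length range, and one sample chunk."""
    if not chunks:
        return "0 chunks"
    lengths = [len(c) for c in chunks]
    # Clamp the index so an out-of-range request falls back to the last chunk.
    sample = chunks[min(sample_index, len(chunks) - 1)]
    return (
        f"{len(chunks)} chunks, "
        f"{min(lengths)}-{max(lengths)} characters each\n"
        f"--- sample chunk ---\n{sample}"
    )


# Hypothetical stand-in chunks; replace with the splitter's real output.
demo_chunks = ["First chunk of text.", "Second chunk, a bit longer."]
print(summarize_chunks(demo_chunks))
```

Reading one sample chunk by eye is worth the few seconds: if it starts or ends mid-thought, that is a hint to adjust the chunk size or overlap before moving on.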