Document Formatting

Best practices for preparing documents before uploading them to your bot's knowledge base.

How you format your documents directly affects how well the bot retrieves and answers from them. Well-structured documents lead to more accurate, relevant responses.

Quick checklist

Before uploading any document, verify:

  • Headings use actual heading styles (not just bold or large text)
  • Content is organized with a clear hierarchy
  • One main topic per document
  • No critical information trapped inside images
  • File is in a supported format (PDF, DOCX, TXT, MD, CSV)
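
The format check in the last item can be automated before a bulk upload. The following is a minimal sketch, not part of the product: the function name `unsupported_files` and the sample file names are hypothetical, and the extension list is taken directly from the checklist above.

```python
from pathlib import Path

# Formats listed in the checklist above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".csv"}


def unsupported_files(paths):
    """Return the paths whose extension is not in the supported list."""
    return [p for p in paths if Path(p).suffix.lower() not in SUPPORTED_EXTENSIONS]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    candidates = ["returns-policy.docx", "shipping.PDF", "price-list.xlsx", "notes"]
    print(unsupported_files(candidates))
```

The comparison is case-insensitive, so `shipping.PDF` passes. A file with no extension is reported as unsupported. The check looks only at the file name, not at the file's actual contents.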

Word documents

Word documents work best when you use the built-in heading styles.

The critical difference

  • Correct: Use Word's Heading 1, Heading 2, Heading 3 styles from the Styles panel
  • Incorrect: Making text bold and larger manually to look like a heading

The bot relies on heading styles to understand document structure. Manually formatted "headings" look the same to humans, but the bot cannot distinguish them from regular text.

Use each style as follows:

  • Title: The document title (use Title style)
  • Heading 1: Major sections
  • Heading 2: Subsections
  • Heading 3: Detailed topics within subsections
  • Body text: Regular paragraphs

Use Word's built-in list features (numbered and bullet lists) rather than typing numbers manually.
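
To find manually formatted "headings" in an existing file, you can inspect the DOCX directly, since a DOCX file is a ZIP archive containing `word/document.xml`. The sketch below uses only the Python standard library and flags short paragraphs in which every run is bold but no heading style is applied. It rests on several assumptions: the function name `find_fake_headings` and the `max_len` threshold are hypothetical, the style IDs `Heading1`, `Heading2`, and `Title` are the ones English-language Word normally writes (other language versions may differ), and only direct bold formatting is detected, not bold inherited from a custom style. It says nothing about how the bot itself parses documents.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_bold(run):
    """True if the run has direct bold formatting (<w:b/> not switched off)."""
    b = run.find(f"{W}rPr/{W}b")
    return b is not None and b.get(W + "val", "true") not in ("0", "false")


def run_text(run):
    return "".join(t.text or "" for t in run.iter(W + "t"))


def find_fake_headings(docx_file, max_len=80):
    """Return the text of short, fully bold paragraphs that lack a heading style."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    suspects = []
    for paragraph in root.iter(W + "p"):
        style = paragraph.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        if style_id.startswith(("Heading", "Title")):
            continue  # a real heading style is applied

        # Ignore runs that contain only whitespace.
        runs = [r for r in paragraph.iter(W + "r") if run_text(r).strip()]
        text = "".join(run_text(r) for r in runs).strip()
        if runs and len(text) <= max_len and all(is_bold(r) for r in runs):
            suspects.append(text)
    return suspects
```

Calling `find_fake_headings` with the path to a DOCX file returns a list of candidate lines to review. Treat the result as a list of suspects rather than a verdict, because a short bold paragraph can also be a legitimate emphasis or a label.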

PDF files

Native PDFs (recommended)

Native PDFs are created from digital documents (exported from Word, Google Docs, etc.). You can select and copy text from them. They are processed quickly and accurately.

Scanned PDFs (limited support)

Scanned PDFs are images of paper documents. The system processes them using OCR (optical character recognition), but accuracy varies:

  • Handwritten text is poorly recognized
  • Low-resolution scans produce errors
  • Complex layouts (multi-column, tables with borders) may be misread

When possible, use the original digital document instead of a scan.

Plain text and Markdown

Both formats work well. For best results:

  • Use clear section headers
  • Separate topics with blank lines
  • Use consistent formatting for lists
  • Markdown headers (#, ##, ###) are recognized and used for document structure
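
Before uploading a Markdown file, it can help to print its heading outline and confirm that the hierarchy is what you intended. The sketch below is one way to do that with the standard library. The function names `outline` and `skipped_levels` are hypothetical, and treating a skipped level (for example, `#` followed directly by `###`) as a problem is an assumption made here for illustration, not a documented requirement of the bot. The sketch also does not skip code blocks, so a line starting with `#` inside example code would be counted as a heading.

```python
import re

# One to six "#" characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")


def outline(markdown_text):
    """Return a list of (level, title) for each '#'-style heading."""
    headings = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def skipped_levels(headings):
    """Return headings that sit more than one level below the previous heading."""
    problems = []
    previous_level = 0
    for level, title in headings:
        if level > previous_level + 1:
            problems.append((level, title))
        previous_level = level
    return problems


if __name__ == "__main__":
    sample = "# Returns\n\nText.\n\n### Refund timing\n\n## Exchanges\n"
    for level, title in outline(sample):
        print("  " * (level - 1) + title)
    print(skipped_levels(outline(sample)))
```

In the sample, "Refund timing" is reported because it jumps from level 1 to level 3.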

Q&A pairs (CSV format)

For structured question-and-answer content:

  • Use exactly two columns: question and answer
  • Include the column headers in the first row
  • One Q&A pair per row
  • Keep answers concise and complete

Example:

question,answer
What is the return policy?,Items can be returned within 30 days of purchase with the original receipt.
Do you offer international shipping?,Yes. We ship to all GCC countries. Delivery takes 3-7 business days.
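
Hand-written CSV breaks easily when an answer contains a comma, because the extra comma is read as a third column. Generating the file with a CSV library avoids this by quoting such fields automatically. The sketch below uses Python's standard `csv` module. The function names `write_qa_csv` and `check_qa_csv` are hypothetical, and it assumes the upload accepts standard double-quoted CSV fields, which this page does not state explicitly.

```python
import csv
import io


def write_qa_csv(pairs, out):
    """Write (question, answer) pairs with the required header row."""
    writer = csv.writer(out)  # quotes fields containing commas or quotes
    writer.writerow(["question", "answer"])
    writer.writerows(pairs)


def check_qa_csv(text):
    """Return a list of problems found in Q&A CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    problems = []
    if not rows or [cell.strip().lower() for cell in rows[0]] != ["question", "answer"]:
        problems.append("first row must be the headers: question,answer")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"row {number}: expected 2 columns, found {len(row)}")
        elif not row[0].strip() or not row[1].strip():
            problems.append(f"row {number}: empty question or answer")
    return problems


if __name__ == "__main__":
    buffer = io.StringIO()
    write_qa_csv([("Do you ship abroad?", "Yes, to all GCC countries.")], buffer)
    print(buffer.getvalue())
    print(check_qa_csv(buffer.getvalue()))
```

To write to a file instead of a buffer, open it with `open(path, "w", newline="", encoding="utf-8")`, as the `csv` module documentation recommends.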

Images in documents

The bot cannot read text inside images. This includes:

  • Screenshots of text or tables
  • Infographics with text labels
  • Scanned handwritten notes
  • Diagrams with text annotations

If important information is in an image, add the same information as regular text in the document.

Images also significantly slow down processing. Remove decorative images (logos, backgrounds, stock photos) before uploading.
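
To see which images a Word document carries before you upload it, you can list the contents of its media folder. A DOCX file is a ZIP archive, and Word normally stores embedded pictures under `word/media/`. The sketch below is a rough audit using only the standard library. The function name `embedded_images` and the extension list are assumptions for illustration, and the script cannot tell a decorative image from an informative one, so that judgment stays with you.

```python
import zipfile
from pathlib import PurePosixPath

# Common image types found in DOCX files (an illustrative, non-exhaustive list).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff", ".emf", ".wmf", ".svg"}


def embedded_images(docx_file):
    """Return (name, size_in_bytes) for each image stored inside a DOCX file."""
    with zipfile.ZipFile(docx_file) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.startswith("word/media/")
            and PurePosixPath(info.filename).suffix.lower() in IMAGE_EXTENSIONS
        ]
```

An empty list means no embedded images were found in the media folder. Large entries are usually the first candidates to remove or to replace with a text description.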

Common mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Bold text instead of heading styles | Bot cannot identify sections | Apply Heading styles in Word |
| Multiple topics in one large document | Irrelevant content retrieved alongside answers | Split into separate, focused documents |
| Critical info only in images | Bot cannot access the information | Add text version alongside images |
| Scanned PDF when digital exists | Lower accuracy, slower processing | Upload the original digital file |
| Outdated documents not removed | Bot gives incorrect answers | Remove or replace outdated files |
| No clear document structure | Poor retrieval accuracy | Add headings and organize content |