Supported Document Formats
Joinable’s RAG-in-a-Box supports a wide range of commonly used document formats, so you can ingest and operationalize your existing data without manual reformatting. Whether your knowledge lives in presentations, spreadsheets, or rich text files, Joinable processes those files and prepares them for AI use.

Currently Supported Formats
PDF (.pdf)
Portable Document Format, ideal for reports, manuals, and scanned content.
Microsoft Word (.doc, .docx)
Used for text-heavy documents like contracts, policies, and instructions.
Microsoft PowerPoint (.ppt, .pptx)
Supports slide-based documents, presentations, and visual content.
Microsoft Excel (.xls, .xlsx)
Used for structured tabular data like spreadsheets, reports, and metrics.
Plain Text (.txt)
Basic unformatted text files used for simple note-taking or data storage.
Markdown (.md)
Lightweight markup language for formatting text, commonly used in documentation.
CSV (.csv)
Comma-Separated Values, commonly used for simple tabular data exchange.
💡 Note: Additional file types (such as HTML and JSON) may be supported in future releases. Contact support if you have specific format needs.
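If you want to screen files before uploading them, the extension list above can be turned into a simple local check. The sketch below is only an illustration: the names SUPPORTED_EXTENSIONS, format_of, and split_by_support are hypothetical helpers, not part of any Joinable API, and the check looks at the file extension only, not the file contents.

```python
from __future__ import annotations

from pathlib import Path

# Extensions copied from the supported-formats list above.
# The mapping and helper names are hypothetical, not a Joinable API.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".ppt": "Microsoft PowerPoint",
    ".pptx": "Microsoft PowerPoint",
    ".xls": "Microsoft Excel",
    ".xlsx": "Microsoft Excel",
    ".txt": "Plain Text",
    ".md": "Markdown",
    ".csv": "CSV",
}


def format_of(filename: str) -> str | None:
    """Return the format name for a supported extension, or None otherwise."""
    # Compare case-insensitively so "Report.PDF" matches ".pdf".
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())


def split_by_support(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Partition filenames into (supported, unsupported), preserving order."""
    supported = [name for name in filenames if format_of(name) is not None]
    unsupported = [name for name in filenames if format_of(name) is None]
    return supported, unsupported
```

Calling split_by_support on a list of candidate files gives you one list to upload and one list of files, such as HTML or JSON, that would need converting or a support request first.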
How Joinable Handles Documents
All uploaded files go through a unified ingestion pipeline that includes:
Format parsing and cleanup: Structured extraction from Word, PowerPoint, Excel, PDF, and other supported formats.
Structure and section detection: Automatic identification of headings, paragraphs, tables, and bullet points for clean segmentation.
Automatic chunking: Content is split into semantically meaningful chunks optimized for embedding and retrieval.
Page or slide number preservation: Metadata like page numbers and slide references are retained to support accurate source attribution.
OCR and Vision-Language Models (VLM): For image-based PDFs or scanned documents, Joinable uses Optical Character Recognition (OCR) and VLMs to accurately extract embedded text and visual context.
Together, these steps ensure that your data is not only readable but also optimized for fast and accurate retrieval by language models.
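To make the chunking and page-number preservation steps more concrete, here is a minimal sketch of the general idea. It is not Joinable's implementation, whose internals are not described here: it assumes the text has already been extracted into one string per page or slide, and it uses simple paragraph packing up to a character limit as a stand-in for semantic chunking. The Chunk class, chunk_pages function, and max_chars parameter are all hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    page: int    # 1-based page or slide number, kept for source attribution


def chunk_pages(pages: list[str], source: str, max_chars: int = 500) -> list[Chunk]:
    """Pack paragraphs into chunks of at most max_chars, never crossing a page.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[Chunk] = []
    for page_number, page_text in enumerate(pages, start=1):
        buffer = ""
        for paragraph in page_text.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if buffer and len(candidate) > max_chars:
                # Adding this paragraph would overflow: close the current chunk.
                chunks.append(Chunk(buffer, source, page_number))
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            # Flush at the page boundary so each chunk maps to exactly one page.
            chunks.append(Chunk(buffer, source, page_number))
    return chunks
```

Because every chunk carries its source file and page number, a retrieval result built on chunks like these can cite where its text came from, which is the purpose of the metadata preservation step described above.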