Supported Document Formats

Joinable’s RAG-in-a-Box supports a range of commonly used document formats, so you can ingest and operationalize your existing data without manual reformatting. Whether your knowledge lives in presentations, spreadsheets, or text documents, Joinable processes it and prepares it for AI use.

Currently Supported Formats

| Format | Extensions | Description |
|---|---|---|
| PDF | .pdf | Portable Document Format. Ideal for reports, manuals, and scanned content. |
| Microsoft Word | .doc, .docx | Used for text-heavy documents like contracts, policies, and instructions. |
| Microsoft PowerPoint | .ppt, .pptx | Supports slide-based documents, presentations, and visual content. |
| Microsoft Excel | .xls, .xlsx | Used for structured tabular data like spreadsheets, reports, and metrics. |
| Plain Text | .txt | Basic unformatted text files used for simple note-taking or data storage. |
| Markdown | .md | Lightweight markup language for formatting text, commonly used in documentation. |
| CSV | .csv | Comma-Separated Values. Commonly used for simple tabular data exchange. |

💡 Note: Additional file types (such as HTML and JSON) may be supported in future releases. Contact support if you have specific format needs.
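
Before uploading a folder of files, it can help to filter it down to the extensions listed above. The following is a minimal client-side sketch, not part of Joinable's API: `SUPPORTED_EXTENSIONS`, `is_supported`, and `partition_files` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Extensions copied from the list of currently supported formats.
# Hypothetical helper names; this is not a Joinable API.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx",
    ".xls", ".xlsx", ".txt", ".md", ".csv",
}


def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the supported set (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_files(filenames):
    """Split filenames into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for name in filenames:
        (supported if is_supported(name) else unsupported).append(name)
    return supported, unsupported
```

For example, `partition_files(["handbook.pdf", "index.html"])` places the PDF in the first list and the HTML file in the second, so unsupported files can be converted or reported before upload.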

How Joinable Handles Documents

All uploaded files go through a unified ingestion pipeline that includes:

Format parsing and cleanup: Structured extraction from Word, PowerPoint, Excel, PDF, and other supported formats.
Structure and section detection: Automatic identification of headings, paragraphs, tables, and bullet points for clean segmentation.
Automatic chunking: Content is split into semantically meaningful chunks optimized for embedding and retrieval.
Page or slide number preservation: Metadata like page numbers and slide references are retained to support accurate source attribution.
OCR and Vision-Language Models (VLM): For image-based PDFs or scanned documents, Joinable uses Optical Character Recognition (OCR) and VLMs to accurately extract embedded text and visual context.

This ensures that your data is not only readable but optimized for fast and accurate retrieval by language models.
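
To make the chunking and page-number steps concrete, here is a minimal sketch of the general idea. It is not Joinable's implementation: Joinable's chunking is described as semantic, while this example swaps in a simple greedy paragraph packer with a character budget. The `Chunk` class, the `chunk_pages` function, and the `max_chars` parameter are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # page or slide number the chunk was taken from


def chunk_pages(pages, max_chars=500):
    """Pack paragraphs from (page_number, text) pairs into chunks.

    Paragraphs (separated by blank lines) are kept whole and packed greedily
    up to max_chars. A chunk never spans two pages, so each chunk carries
    exactly one page number. A single paragraph longer than max_chars is
    kept as one oversized chunk rather than being cut mid-sentence.
    """
    chunks = []
    for page_number, text in pages:
        current = ""
        for paragraph in (p.strip() for p in text.split("\n\n")):
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                # Adding this paragraph would exceed the budget: close the chunk.
                chunks.append(Chunk(current, page_number))
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(Chunk(current, page_number))
    return chunks
```

Because every chunk keeps its `page` value, a retrieved passage can be cited back to the page or slide it came from, which is the purpose of the metadata preservation step described above.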


Last updated 14 days ago