Split large JSONL files into smaller chunks. Part of the DevTools Surf developer suite. Browse more tools in the Data / SQL collection.
Use Cases
Split a large JSONL training dataset into equal chunks for distributed model training.
Divide a large API export into smaller batches for size-limited bulk upload endpoints.
Split a year's worth of log data into monthly files for archiving.
Create test subsets from large JSONL datasets by extracting the first N lines.
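The line-count use cases above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the tool's actual implementation — the function name `split_jsonl_by_lines` and the output naming scheme are assumptions:

```python
def split_jsonl_by_lines(src: str, lines_per_file: int, prefix: str = "part") -> list[str]:
    """Split a JSONL file into chunks of at most `lines_per_file` lines each.

    Every chunk has exactly `lines_per_file` lines except possibly the last.
    """
    outputs: list[str] = []
    out = None
    count = 0
    with open(src, encoding="utf-8") as f:
        for line in f:
            # Start a new output file at the beginning and whenever the
            # current file reaches the line limit.
            if out is None or count == lines_per_file:
                if out:
                    out.close()
                name = f"{prefix}_{len(outputs):04d}.jsonl"
                outputs.append(name)
                out = open(name, "w", encoding="utf-8")
                count = 0
            out.write(line)
            count += 1
    if out:
        out.close()
    return outputs
```

Extracting a test subset of the first N lines is the degenerate case: take only the first output file.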
Tips
Split by line count when feeding a fixed-batch-size processor — each output file will have exactly N lines (except possibly the last file).
Use the file-size split mode for output-size-constrained systems — useful when each output file must fit within a fixed memory or upload size limit.
Preserve header metadata: if your JSONL has a metadata record on line 1, use the 'preserve header' option to include it in every split file.
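The preserve-header behavior can be sketched as follows — `split_with_header` is a hypothetical helper, not the tool's API, shown only to make the semantics concrete:

```python
def split_with_header(lines: list[str], chunk_size: int) -> list[list[str]]:
    """Split JSONL lines into chunks of `chunk_size` records,
    copying the first (header) line into every chunk."""
    header, records = lines[0], lines[1:]
    return [
        [header] + records[i:i + chunk_size]
        for i in range(0, len(records), chunk_size)
    ]
```

Each output chunk therefore has `chunk_size + 1` lines: the replicated header plus its share of the records.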
Fun Facts
The UNIX command 'split' (early 1970s) was the original file splitter, but it has no JSONL awareness — its byte mode ('split -b') can cut a file mid-line, which breaks JSON objects, and its line mode can't target an output size. JSONL-aware splitters ensure every split boundary falls between complete JSON objects.
Large language model training datasets like The Pile (2021, ~825GB) and Common Crawl derivatives are distributed as JSONL files pre-split into chunks, so that downloads can resume per-chunk and training jobs can read many shards in parallel.
Kafka topics solve the same partitioning problem for streams — instead of splitting a finished file into chunks, producers append records to a partitioned topic and consumers read from their assigned partitions. Splitting JSONL files remains the batch-processing equivalent.
FAQ
Can I split by both line count and file size?
Yes — choose a primary split mode (lines or bytes) and optionally set a maximum for the other dimension. The file ends at whichever limit is reached first.
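The "whichever limit is reached first" rule can be sketched like this (a minimal illustration under assumed semantics, not the tool's code):

```python
def chunk_boundaries(lines: list[str], max_lines: int, max_bytes: int) -> list[list[str]]:
    """Group JSONL lines into chunks, closing the current chunk when
    adding the next line would exceed either the line or the byte limit."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for line in lines:
        n = len(line.encode("utf-8"))
        # Close the chunk if either limit would be violated by this line.
        if current and (len(current) == max_lines or size + n > max_bytes):
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

Note that a single line larger than `max_bytes` still gets a chunk of its own, because splits never occur mid-line.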
Does it guarantee valid JSON in each output line?
Yes — splits always occur at line boundaries, never mid-line. Each output file is a valid JSONL file. No JSON object is ever split across files.
Can I split on a field value (e.g., by date)?
Yes — group-by mode creates one output file per unique value of a specified field. Splitting a year's logs by 'date' field creates one JSONL file per day automatically.
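Group-by mode amounts to bucketing lines by a field value before writing each bucket to its own file. A minimal sketch, assuming one JSON object per line and a top-level field (`group_jsonl_by_field` is an illustrative name):

```python
import json
from collections import defaultdict

def group_jsonl_by_field(lines: list[str], field: str) -> dict[str, list[str]]:
    """Group JSONL lines into one bucket per unique value of `field`."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        key = str(json.loads(line)[field])
        buckets[key].append(line)
    return dict(buckets)
```

Each bucket then becomes one output file, named after the field value.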