split

The split command is useful to divide the input into smaller parts based on the number of lines, bytes, file size, etc. You can also execute another command on the divided parts before saving the results. An example use case is sending a large file as multiple parts as a workaround for online transfer size limits.

Since a lot of output files will be generated in this chapter (often with the same filenames), remove these files after every illustration.

Default split

By default, the split command divides the input 1000 lines at a time. The newline character is the default line separator. You can pass a single file or stdin data as the input. Use cat if you need to concatenate multiple input sources.

By default, the output files will be named xaa, xab, xac and so on (where x is the prefix). If the filenames are exhausted, two more letters will be appended and the pattern will continue as needed. If the number of input lines is not evenly divisible, the last file will contain fewer than 1000 lines.


# divide input 1000 lines at a time
$ seq 10000 | split

# output filenames
$ ls x*
xaa  xab  xac  xad  xae  xaf  xag  xah  xai  xaj

# preview of some of the output files
$ head -n1 xaa xab xae xaj
==> xaa <==
1

==> xab <==
1001

==> xae <==
4001

==> xaj <==
9001

$ rm x*

As mentioned earlier, remove the output files after every illustration.
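
Here's a minimal sketch of the cat suggestion mentioned above, written as a script instead of an interactive session. The filenames part1.txt and part2.txt are made up for this illustration, and a scratch directory is used so that existing files aren't affected.

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# two hypothetical input files with 600 lines each
seq 600 > part1.txt
seq 601 1200 > part2.txt

# concatenate the inputs and split 1000 lines at a time (the default)
cat part1.txt part2.txt | split

# xaa gets the first 1000 lines, xab gets the remaining 200 lines
line_counts=$(wc -l < xaa; wc -l < xab)
echo "$line_counts"

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```

Since split sees a single stream here, the boundary between the two input files has no effect on where the output files are divided.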

Change number of lines

You can use the -l option to change the number of lines to be saved in each output file.


# maximum of 3 lines at a time
$ split -l3 purchases.txt

$ head x*
==> xaa <==
coffee
tea
washing powder

==> xab <==
coffee
toothpaste
tea

==> xac <==
soap
tea

Split by byte count

The -b option allows you to split the input by the number of bytes. Similar to line based splitting, you can always reconstruct the input by concatenating the output files. This option also accepts suffixes such as K for 1024 bytes, KB for 1000 bytes, M for 1024 * 1024 bytes and so on.


# maximum of 15 bytes at a time
$ split -b15 greeting.txt

$ head x*
==> xaa <==
Hi there
Have a
==> xab <==
 nice day

# when you concatenate the output files, you'll get the original input
$ cat x*
Hi there
Have a nice day
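
The size suffixes and the reconstruction claim can be checked with a small script. This is only a sketch: data.bin is a made-up 2500-byte file and cmp is used here to compare the concatenated parts against the original.

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# hypothetical 2500-byte input file filled with NUL bytes
head -c 2500 /dev/zero > data.bin

# K means 1024 bytes, so the parts are 1024 + 1024 + 452 bytes
split -b1K data.bin
sizes_k=$(wc -c < xaa; wc -c < xab; wc -c < xac)
echo "$sizes_k"

# concatenating the parts in filename order reproduces the input
# cmp is silent and returns success when both inputs are identical
cat x* | cmp - data.bin && roundtrip=same
echo "$roundtrip"
rm x*

# KB means 1000 bytes, so the parts are 1000 + 1000 + 500 bytes
split -b1KB data.bin
sizes_kb=$(wc -c < xaa; wc -c < xab; wc -c < xac)
echo "$sizes_kb"

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```

The same cat based reassembly applies to the transfer size use case mentioned at the start of this chapter, provided the parts are concatenated in filename order.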

The -C option is similar to the -b option, but it will try to break on line boundaries if possible. The break will happen before the given byte limit. Here's an example where input lines do not exceed the given byte limit:


$ split -C20 purchases.txt

$ head x*
==> xaa <==
coffee
tea

==> xab <==
washing powder

==> xac <==
coffee
toothpaste

==> xad <==
tea
soap
tea

$ wc -c x*
11 xaa
15 xab
18 xac
13 xad
57 total

If a line exceeds the given limit, it will be broken down into multiple parts:


$ printf 'apple\nbanana\n' | split -C4

$ head x*
==> xaa <==
appl
==> xab <==
e

==> xac <==
bana
==> xad <==
na

$ cat x*
apple
banana

Divide based on file size

The -n option has several features. If you pass only a numeric argument N, the given input file will be divided into N chunks. The output files will be roughly the same size.


# divide the file into 2 parts
$ split -n2 purchases.txt
$ head x*
==> xaa <==
coffee
tea
washing powder
co
==> xab <==
ffee
toothpaste
tea
soap
tea

# the two output files are roughly the same size
$ wc x*
 3  5 28 xaa
 5  5 29 xab
 8 10 57 total

Since the division is based on file size, stdin data cannot be used. Newer versions of the coreutils package support this use case by creating a temporary file before splitting.


$ seq 6 | split -n2
split: -: cannot determine file size
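
If your version reports the error shown above, one possible workaround is to save the stdin data to a regular file yourself and then split that file. Here's a sketch of this idea (stdin_copy.txt is an arbitrary name chosen for this illustration):

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# save the stdin data to a regular file first
seq 6 > stdin_copy.txt

# the file size is now known, so -n can be used
# 12 bytes of input results in two chunks of 6 bytes each
split -n2 stdin_copy.txt
first=$(cat xaa)
second=$(cat xab)
head x*

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```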

By using K/N as the argument, you can view the Kth chunk of N parts on stdout. No output file will be created in this scenario.


# divide the input into 2 parts
# view only the 1st chunk on stdout
$ split -n1/2 greeting.txt
Hi there
Hav

To avoid splitting a line, use l/ as a prefix. Quoting from the manual:

For l mode, chunks are approximately input size / N. The input is partitioned into N equal sized portions, with the last assigned any excess. If a line starts within a partition it is written completely to the corresponding file. Since lines or records are not split even if they overlap a partition, the files written can be larger or smaller than the partition size, and even empty if a line/record is so long as to completely overlap the partition.


# divide input into 2 parts, but don't split lines
$ split -nl/2 purchases.txt
$ head x*
==> xaa <==
coffee
tea
washing powder
coffee

==> xab <==
toothpaste
tea
soap
tea

Here's an example to view the Kth chunk without splitting lines:


# 2nd chunk of 3 parts without splitting lines
$ split -nl/2/3 sample.txt
 7) Believe it
 8) 
 9) banana
10) papaya
11) mango

Interleaved lines

The -n option will also help you create output files with interleaved lines. Since this is based on the line separator and not file size, stdin data can also be used. Use the r/ prefix to enable this feature.


# two parts, lines distributed in round robin fashion
$ seq 5 | split -nr/2

$ head x*
==> xaa <==
1
3
5

==> xab <==
2
4

Here's an example to view the Kth chunk:


$ split -nr/1/3 sample.txt
 1) Hello World
 4) How are you
 7) Believe it
10) papaya
13) Much ado about nothing

Custom line separator

You can use the -t option to specify a single byte character as the line separator. Use \0 to specify NUL as the separator. Depending on your shell you can use ANSI-C quoting to use escapes like \t instead of a literal tab character.


$ printf 'apple\nbanana\n;mango\npapaya\n' | split -t';' -l1

$ head x*
==> xaa <==
apple
banana
;
==> xab <==
mango
papaya
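
Here's a sketch of using NUL as the separator. The input records are made up for this illustration, and tr is used only to make the NUL characters visible as newlines when inspecting the results.

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# three NUL terminated records, split two records at a time
# '\0' is passed literally, split interprets it as the NUL character
printf 'apple\0banana\0cherry\0' | split -t'\0' -l2

# translate NUL to newline for display purposes
first=$(tr '\0' '\n' < xaa)
second=$(tr '\0' '\n' < xab)
printf '%s\n' 'xaa:' "$first" 'xab:' "$second"

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```

This kind of input is typically produced by commands that emit NUL separated filenames, so that names containing newline characters are handled safely.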

Customize filenames

As seen earlier, x is the default prefix for output filenames. To change this prefix, pass an argument after the input source.


# choose prefix as 'op_' instead of 'x'
$ split -l1 greeting.txt op_

$ head op_*
==> op_aa <==
Hi there

==> op_ab <==
Have a nice day

The -a option controls the length of the suffix. You'll get an error if this length isn't enough to cover all the output files. In such a case, you'll still get output files that can fit within the given length.


$ seq 10 | split -l1 -a1
$ ls x*
xa  xb  xc  xd  xe  xf  xg  xh  xi  xj
$ rm x*

$ seq 10 | split -l1 -a3
$ ls x*
xaaa  xaab  xaac  xaad  xaae  xaaf  xaag  xaah  xaai  xaaj
$ rm x*

$ seq 100 | split -l1 -a1
split: output file suffixes exhausted
$ ls x*
xa  xc  xe  xg  xi  xk  xm  xo  xq  xs  xu  xw  xy
xb  xd  xf  xh  xj  xl  xn  xp  xr  xt  xv  xx  xz
$ rm x*

You can use the -d option to use numeric suffixes, starting from 00 (length can be changed using the -a option). You can use the long option --numeric-suffixes to specify a different starting number.


$ seq 10 | split -l1 -d
$ ls x*
x00  x01  x02  x03  x04  x05  x06  x07  x08  x09
$ rm x*

$ seq 10 | split -l2 --numeric-suffixes=10
$ ls x*
x10  x11  x12  x13  x14
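
The -d and -a options can be combined. Here's a small sketch that uses three digit suffixes (written as a script that works in a scratch directory):

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# numeric suffixes with a length of 3
seq 10 | split -l5 -d -a3

# two output files are created: x000 and x001
names=$(ls)
echo "$names"

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```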

Use the -x and --hex-suffixes options for hexadecimal numbering.


$ seq 10 | split -l1 --hex-suffixes=8
$ ls x*
x08  x09  x0a  x0b  x0c  x0d  x0e  x0f  x10  x11

You can use the --additional-suffix option to add a constant string at the end of filenames.


$ seq 10 | split -l2 -a1 --additional-suffix='.log'
$ ls x*
xa.log  xb.log  xc.log  xd.log  xe.log
$ rm x*

$ seq 10 | split -l2 -a1 -d --additional-suffix='.txt' - num_
$ ls num_*
num_0.txt  num_1.txt  num_2.txt  num_3.txt  num_4.txt

Exclude empty files

You can sometimes end up with empty files, for example when you try to split the input into more parts than possible with the given criteria. In such cases, you can use the -e option to prevent empty files in the output. The split command will ensure that the filenames are sequential even if files in the middle are empty.


# 'xac' is empty in this example
$ split -nl/3 greeting.txt
$ head x*
==> xaa <==
Hi there

==> xab <==
Have a nice day

==> xac <==

$ rm x*

# prevent empty files
$ split -e -nl/3 greeting.txt
$ head x*
==> xaa <==
Hi there

==> xab <==
Have a nice day

Process parts through another command

The --filter option will allow you to apply another command on the intermediate split results before saving the output files. Use $FILE to refer to the output filename of the intermediate parts. Here's an example of compressing the results:


$ split -l1 --filter='gzip > $FILE.gz' greeting.txt

$ ls x*
xaa.gz  xab.gz

$ zcat xaa.gz
Hi there
$ zcat xab.gz
Have a nice day

Here's an example of ignoring the first line of the results:


$ cat body_sep.txt
%=%=
apple
banana
%=%=
red
green

$ split -l3 --filter='tail -n +2 > $FILE' body_sep.txt

$ head x*
==> xaa <==
apple
banana

==> xab <==
red
green
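
The filter command isn't required to use $FILE. As a sketch of this, the example below assumes that a filter writing to stdout simply passes its output through, so the results are displayed and no output files are created. The paste command is chosen only for illustration.

```shell
# work in a scratch directory (hypothetical setup for this sketch)
workdir=$(mktemp -d) && cd "$workdir"

# join the lines of each part with a comma
# the filter doesn't redirect to $FILE, so the results go to stdout
joined=$(seq 6 | split -l2 --filter='paste -sd,')
echo "$joined"

# no output files were created
file_count=$(ls | wc -l)
echo "$file_count"

# remove the scratch directory
cd - > /dev/null && rm -r "$workdir"
```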

Exercises

The exercises directory has all the files used in this section.

Remove the output files after every exercise.

1) Split the s1.txt file 3 lines at a time.


##### add your solution here

$ head xa?
==> xaa <==
apple
coffee
fig

==> xab <==
honey
mango
pasta

==> xac <==
sugar
tea

$ rm xa?

2) Use appropriate options to get the output shown below.


$ echo 'apple,banana,cherry,dates' | ##### add your solution here

$ head xa?
==> xaa <==
apple,
==> xab <==
banana,
==> xac <==
cherry,
==> xad <==
dates

$ rm xa?

3) What do the -b and -C options do?

4) Display the 2nd chunk of the ip.txt file after splitting it into 4 parts as shown below.


##### add your solution here
come back before the sky turns dark

There are so many delights to cherish

5) What does the r prefix do when used with the -n option?

6) Split the ip.txt file 2 lines at a time. Customize the output filenames as shown below.


##### add your solution here

$ head ip_*
==> ip_0.txt <==
it is a warm and cozy day
listen to what I say

==> ip_1.txt <==
go play in the park
come back before the sky turns dark

==> ip_2.txt <==

There are so many delights to cherish

==> ip_3.txt <==
Apple, Banana and Cherry
Bread, Butter and Jelly

==> ip_4.txt <==
Try them all before you perish

$ rm ip_*

7) Which option would you use to prevent empty files in the output?

8) Split the items.txt file 5 lines at a time. Additionally, remove lines starting with a digit character as shown below.


$ cat items.txt
1) fruits
apple 5
banana 10
2) colors
green
sky blue
3) magical beasts
dragon 3
unicorn 42

##### add your solution here

$ head xa?
==> xaa <==
apple 5
banana 10
green

==> xab <==
sky blue
dragon 3
unicorn 42

$ rm xa?

CLI text processing with GNU Coreutils