What should a function splitting an empty string return? What about splitting a string on a separator that does not exist within the string? This was the topic of some debate within my team recently, after one of my teammates assumed that splitting an empty string on a separator would return an empty list. It does not[1].
Splitting on a substring that doesn't appear in the string
When you split a string an arbitrary number of times, it's reasonable to expect to find yourself with a list of all the pieces. If you were to join all these pieces back together using the same separator that you used to split the original string, you should arrive back at where you started. In other words, if sep is an arbitrary separator, then the following bit of pseudocode should be true: originalString == originalString.split(sep).join(sep).
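The round-trip property is easy to check in Python, where join is called on the separator rather than on the list. This is only a small sketch using the built-in str.split and str.join; the helper name round_trips and the sample inputs are made up for illustration, and it assumes a non-empty separator.

```python
def round_trips(original: str, sep: str) -> bool:
    """Return True if splitting on sep and joining with sep restores the input."""
    # Assumes sep is non-empty; Python's str.split rejects an empty separator.
    return sep.join(original.split(sep)) == original


# Hypothetical samples: separator in the middle, at both ends, absent, and an empty string.
samples = ["a,b,c", ",a,", "no separator here", ""]
results = [round_trips(sample, ",") for sample in samples]
print(results)  # [True, True, True, True]
```

The property only holds because empty pieces and the separator-free case are kept in the list; drop either and the join can no longer rebuild the original.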
The algorithmic approach
Given a string and a separator, we can describe a string splitting algorithm in two simple steps:
- Take everything up until the separator and append it to the list of substrings. If you don't find the separator, add the whole string to the list.
- Take what's left of the string after the separator, and go back to step one, using the remaining part of the string as input.
When there is no more string to split, you've completed the substring list and can return that to the caller.
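The two steps above could be sketched in Python roughly like this. It is one possible reading of the description, not a reference implementation: the name split_string is made up, and rejecting an empty separator is an assumption borrowed from Python's own str.split.

```python
def split_string(text: str, sep: str) -> list[str]:
    """Split text on every occurrence of sep, keeping empty pieces."""
    if sep == "":
        # Assumption: treat an empty separator as an error, like str.split does.
        raise ValueError("empty separator")

    pieces = []
    remaining = text
    while True:
        index = remaining.find(sep)
        if index == -1:
            # Step one, fallback: no separator found, so keep the whole remainder.
            pieces.append(remaining)
            return pieces
        # Step one: everything up to the separator becomes a piece.
        pieces.append(remaining[:index])
        # Step two: carry on with whatever follows the separator.
        remaining = remaining[index + len(sep):]


print(split_string("I fell asleep", " "))  # ['I', 'fell', 'asleep']
print(split_string("asleep", " "))         # ['asleep']
```

Because the fallback branch is the only way out of the loop, every call returns a list with at least one element.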
Adding the whole string to the list if you don't find the separator is a crucial part of step one. Without it, you would never get the final piece of a string. If you split "I fell asleep" on spaces, you'd expect to get ["I", "fell", "asleep"] in return. But if you discard any string that doesn't contain the separator, you'd end up with only ["I", "fell"], which is a very different outcome.
If the separator doesn't appear in the string at all, we simply add the whole string as the first element of the list and consider our work done. This way, the function is internally consistent. As you go through the string, you will always eventually reach a substring that doesn't contain the separator. That substring is also part of the result.
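To make the contrast concrete, here is a deliberately flawed Python variant that skips the "add the whole string" fallback. The function name split_dropping_tail is hypothetical, and the sketch assumes a non-empty separator.

```python
def split_dropping_tail(text: str, sep: str) -> list[str]:
    """Deliberately flawed: discards any piece that is not followed by sep."""
    pieces = []
    remaining = text
    while sep in remaining:
        # partition returns (before, sep, after) for the first occurrence of sep.
        before, _, remaining = remaining.partition(sep)
        pieces.append(before)
    # Bug on purpose: whatever is still in `remaining` is thrown away here.
    return pieces


print(split_dropping_tail("I fell asleep", " "))  # ['I', 'fell']
print(split_dropping_tail("asleep", " "))         # []
print("I fell asleep".split(" "))                 # ['I', 'fell', 'asleep']
```

The second call shows the same flaw from the other side: a string without the separator vanishes entirely instead of coming back as a one-element list.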
The user experience approach
There's another angle to come at this from: that of the end user. What would you, the developer using the function, expect it to return? This is a subjective thing, so I can only really speak for myself, but if I know that a function always returns data of the same shape (such as a list), I find it much easier to work with.
What if the string doesn't contain the separator at all? Well, then it can't be split. Imagine we're cooking together. I give you some carrots and ask you to cut the tops off. What do you do if one of the carrots has already had the top cut off? You'd probably realize it doesn't need anything done to it, and put it in the done-pile with the others.
Similarly, a string that has no separator (no top) need not be split, and you can just return it as-is.
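As a small illustration of why a consistent shape is pleasant to work with, the following Python sketch consumes the result of str.split without special-casing the "no separator" input. The helper name longest_piece and the vegetable strings are invented for the example.

```python
def longest_piece(text: str, sep: str) -> str:
    """Return the longest piece of text after splitting on sep."""
    # For a non-empty sep, str.split always returns a list with at least
    # one element, so max() never receives an empty sequence.
    return max(text.split(sep), key=len)


print(longest_piece("carrot,parsnip,leek", ","))  # parsnip
print(longest_piece("carrot", ","))               # carrot
```

If split returned an empty list or a bare string for the second input, the caller would need an extra branch before calling max.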
Splitting an empty string
Splitting an empty string isn't any different from what we've already explored. In fact, it's just a more specialized version of the above problem: the empty string doesn't contain the separator, so it comes back as the only element of the list. Empty strings also show up if the string you're splitting ends with the separator. For instance, "juice," split on commas would be ["juice", ""]. An empty string is still a string.
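Python's built-in str.split behaves this way when given an explicit separator, as this short sketch shows. The variable names are only for illustration. Note that calling split with no argument at all is a different, whitespace-based mode that does drop empty pieces.

```python
empty_input = "".split(",")         # the empty string is the one and only piece
trailing_sep = "juice,".split(",")  # an empty piece follows the final comma
only_sep = ",".split(",")           # empty pieces on both sides of the comma

print(empty_input)   # ['']
print(trailing_sep)  # ['juice', '']
print(only_sep)      # ['', '']

# Without an explicit separator, Python splits on runs of whitespace
# and discards empty strings, so the result here is an empty list.
whitespace_mode = "".split()
print(whitespace_mode)  # []
```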
The empty string as separator
What if the separator is an empty string? This varies from language to language. For instance, in JavaScript, an empty string as separator means 'split the string into a list of characters', while Python raises an exception. Rust's split splits the string into a list of one-character strings (like JavaScript does), but inserts an empty string at the start and at the end (playground link).
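The Python side of that comparison can be sketched as follows. The variable names are made up, and using list() to get individual characters is simply the usual Python idiom, not something str.split offers.

```python
# Python refuses to split on an empty separator.
try:
    "abc".split("")
    raised_value_error = False
except ValueError:
    raised_value_error = True

print(raised_value_error)  # True

# To get the 'list of characters' behaviour, convert the string directly.
characters = list("abc")
print(characters)  # ['a', 'b', 'c']

# An empty string has no characters, so this yields an empty list.
no_characters = list("")
print(no_characters)  # []
```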
What does it even mean to use an empty string as a separator? There's no clear answer here, so it's at your discretion as the function author. For instance, a Haskell implementation I quickly threw together while writing this post[2] returns an infinite list of empty strings, because the empty separator matches at the start of every string without ever consuming any of it. I'm inclined to say that that's reasonable, but it's probably not very useful.
Summary
There's any number of ways to slice a string. Most languages seem to follow the idea that splitting a string gives you back a list of substrings, even if the string is empty or the separator doesn't appear in the string at all. The disagreement is about what to do with an empty separator, but at least they all agree on it being a special case.
If you're feeling the urge come over you after all this, then how about taking a stab at implementing a string splitting algorithm yourself? It's quite the fun little exercise if you're looking for something to occupy your mind for a bit.
Footnotes
[1] This particular case was in JavaScript, where splitting using an empty string as the separator will actually return an empty list given an empty string (''.split('') returns []), but this is a special case. Most languages (and their standard functions) that I'm aware of require you to opt in to removing empty strings from the output, though there are outliers.
[2] The implementation is included here for the particularly interested reader:
split :: String -> String -> [String]
split sep input =
  go "" (Just input)
  where
    go before remaining =
      case remaining of
        Nothing -> [before]
        Just(s) ->
          if take (length sep) s == sep
            then before : (go "" $ Just $ drop (length sep) s)
            else case s of
              "" -> go before Nothing
              (c:cs) -> go (before ++ [c]) $ Just cs