jugad2 - Vasudev Ram on software innovation: substrings

Sunday, September 23, 2018

How many ways can you substring a string? Part 2

Twine image attribution

Hi readers,

In my last post, i.e.:

How many ways can you substring a string? Part 1, I had said that there can be other ways of doing it, and that some enhancements were possible. This post (Part 2) is about that.

Here is another algorithm to find all substrings of a given string:

Let s be the input string.
Let n be the length of s.
Find and yield all substrings of s of length 1.
Find and yield all substrings of s of length 2.
...
Find and yield all substrings of s of length n.

Even without doing any formal analysis of the algorithm, we can intuitively see that it is correct, because it accounts for all possible cases (except for the empty string, but adding that is trivial).

[ BTW, what about uses for this sort of program? Although I just wrote it for fun, one possible use could be in word games like Scrabble. ]

The code for this new algorithm is in program all_substrings2.py below.

"""
all_substrings2.py
Function and program to find all substrings of a given string.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Twitter: https://mobile.twitter.com/vasudevram
Product store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
from error_exit import error_exit
from debug1 import debug1

def usage():
    message_lines = [\
        "Usage: python {} a_string".format(sa[0]),
        "Print all substrings of a_string.",
        "",
    ]
    sys.stderr.write("\n".join(message_lines))

def all_substrings2(s):
    """
    Generator function that yields all the substrings 
    of a given string s.
    Algorithm used:
    1. len_s = len(s)
    2. if len_s == 0, return ""
    3. (else len_s is > 0):
       for substr_len in 1 to len_s:
           find all substrings of s that are of length substr_len
           yield each such substring 
    Expected output for some strings:
    For "a":
        "a"
    For "ab":
        "a"
        "b"
        "ab"
    For "abc:
        "a"
        "b"
        "c"
        "ab"
        "bc"
        "abc"
    For "abcd:
        "a"
        "b"
        "c"
        "d"
        "ab"
        "bc"
        "cd"
        "abc"
        "bcd"
        "abcd"
    """

    len_s = len(s)
    substr_len = 1
    while substr_len <= len_s:
        start = 0
        end = start + substr_len
        while end <= len_s:
            debug1("s[{}:{}] = {}".format(\
                start, end, s[start:end]))
            yield s[start:end]
            start += 1
            end = start + substr_len
        substr_len += 1

def main():
    if lsa != 2:
        usage()
        error_exit("\nError: Exactly one argument must be given.\n")

    if sa[1] == "":
        print("")
        sys.exit(0)

    for substring in all_substrings2(sa[1]):
        print(substring)

sa = sys.argv
lsa = len(sa)

if __name__ == "__main__":
    main()

BTW, I added the empty string as the last item in the message_lines list (in the usage() function), as a small trick, to avoid having to explicitly add an extra newline after the joined string in the write() call.

Here are some runs of the program, with outputs, using Python 2.7 on Linux:

(pyo, in the commands below, is a shell alias I created for 'python -O', to disable debugging output. And a*2*y expands to all_substrings2.py, since there are no other filenames matching that wildcard pattern in my current directory. It's a common Unix shortcut to save typing. In fact, the bash shell expands that shortcut to the full filename when you type the pattern and then press Tab. But the expansion happens without pressing Tab too, if you just type that command and hit Enter. But you have to know for sure, up front, that the wildcard expands to only one filename (if you want that), or you can get wrong results, e.g. if such a wildcard expands to 3 filenames, and your program expects command-line arguments, the 2nd and 3rd filenames will be treated as command-line arguments for the program represented by the 1st filename. This will likely not be what you want, and may create problems.)

Run it without any arguments:

$ pyo a*2*y
Usage: python all_substrings2.py a_string
Print all substrings of a_string.

Error: Exactly one argument must be given.

Run a few times with some input strings of incrementally longer lengths:

$ pyo a*2*y a
a
$ pyo a*2*y ab
a
b
ab
$ pyo a*2*y abc
a
b
c
ab
bc
abc
$ pyo a*2*y abcd
a
b
c
d
ab
bc
cd
abc
bcd
abcd

Count the number of substrings in the above run for string abcd:

$ pyo a*2*y abcd | wc -l
10
$ pyo a*2*y abcde
a
b
c
d
e
ab
bc
cd
de
abc
bcd
cde
abcd
bcde
abcde

Count the number of substrings in the above run for string abcde:

$ pyo a*2*y abcde | wc -l
15
$ pyo a*2*y abcdef
a
b
c
d
e
f
ab
bc
cd
de
ef
abc
bcd
cde
def
abcd
bcde
cdef
abcde
bcdef
abcdef

Count the number of substrings in the above run for string abcdef:

$ pyo a*2*y abcdef | wc
     21      21      77

Now a few more with only the count:

$ pyo a*2*y abcdefg | wc
     28      28     112
$ pyo a*2*y abcdefgh | wc
     36      36     156
$ pyo a*2*y abcdefghi | wc
     45      45     210

Notice a pattern?

The count of substrings for each succeeding run (which has one more character in the input string than the preceding run has), is equal to the sum of the count for the preceding run and the length of the input string for the succeeding run; e.g. 10 + 5 = 15, 15 + 6 = 21, 21 + 7 = 28, etc. This is the same as the sum of the first n natural numbers.

There is a well-known formula for that sum: n * (n + 1) / 2.

There is a story (maybe apocryphal) that the famous mathematician Gauss was posed this problem - to find the sum of the numbers from 1 to 100 - by his teacher, after he misbehaved in class. To the surprise of the teacher, he gave the answer in seconds. From the Wikipedia article about Gauss:

[ Gauss's presumed method was to realize that pairwise addition of terms from opposite ends of the list yielded identical intermediate sums: 1 + 100 = 101, 2 + 99 = 101, 3 + 98 = 101, and so on, for a total sum of 50 × 101 = 5050. ]

From this we can see that the sum of this sequence satisfies the formula n * (n + 1) / 2, where n = 100, i.e. 100 * (100 + 1) / 2 = 50 * 101.

(Wikipedia says that Gauss "is ranked among history's most influential mathematicians".)

We can also run the all_substrings2.py program multiple times with different inputs, using a for loop in the shell:

$ for s in a ab abc abcd
> do
>   echo All substrings of $s:
>   pyo al*2*py $s
> done

All substrings of a:
a
All substrings of ab:
a
b
ab
All substrings of abc:
a
b
c
ab
bc
abc
All substrings of abcd:
a
b
c
d
ab
bc
cd
abc
bcd
abcd

Some remarks on the program versions shown (i.e. all_substrings.py and all_substrings2.py, in Parts 1 and 2 respectively):

Both versions use a generator function, to lazily yield each substring on demand. Either version can easily be changed to use a list instead of a generator (and the basic algorithm used will not need to change, in either case.) To do that, we have to delete the yield statement, collect all the generated substrings in a new list, and at the end, return that list to the caller. The caller's code will not need to change, although we will now be iterating over the list returned from the function, not over the values yielded by the generator. Some of the pros and cons of the two approaches (generator vs. list) are:

- the list approach has to create and store all the substrings first, before it can return them. So it uses memory proportional to the sum of the sizes of all the substrings generated, with some overhead due to Python's dynamic nature (but that per-string overhead exists for the generator approach too). (See this post: Exploring sizes of data types in Python.) The list approach will need a bit of overhead for the list too. But the generator approach needs to handle only one substring at a time, before yielding it to the caller, and does not use a list. So it will potentially use much less memory, particularly for larger input strings. The generator approach may even be faster than the list version, since repeated memory (re)allocation for the list (as it expands) has some overhead. But that is speculation on my part as of now. To be sure of it, one would have to do some analysis and/or some speed measurements of relevant test programs.

- the list approach gives you the complete list of substrings (after the function that generates them returns). So, in the caller, if you want to do multiple processing passes over them, you can. But the generator approach gives you each substring immediately as it is generated, you have to process it, and then it is gone. So you can only do one processing pass over the substrings generated. In other words, the generator's output is sequential-access, forward-only, one-item-at-a-time-only, and single-pass-only. (Unless you store all the yielded substrings, but then that becomes the same as the list approach.)

Another enhancement that can be useful is to output only the unique substrings. As I showed in Part 1, if there are any repeated characters in the input string, there can be duplicate substrings in the output. There are two obvious ways of getting only unique substrings:

1) By doing it internal to the program, using a Python dict. All we have to do is add each substring (as a key, with the corresponding value being anything, say None), to a dict, as and when the substring is generated. Then the substrings in the dict are guaranteed to be unique. Then at the end, we just print the substrings from the dict instead of from the list. If we want to print the substrings in the same order they were generated, we can use an OrderedDict.

See: Python 2 OrderedDict
and: Python 3 OrderedDict

(Note: In Python 3.7, OrderedDict may no longer be needed, because dicts are defined as keeping insertion order.)

2) By piping the output of the program (which is all the generated substrings, one per line) to the Unix uniq command, whose purpose is to select only unique items from its input. But for that, we have to sort the list first, since uniq requires sorted input to work properly. We can do that with pipelines like the following:

First, without sort and uniq; there are duplicates:

$ pyo all_substrings2.py aabbb | nl -ba
1 a
2 a
3 b
4 b
5 b
6 aa
7 ab
8 bb
9 bb
10 aab
11 abb
12 bbb
13 aabb
14 abbb
15 aabbb

Then with sort and uniq; now there are no duplicates:

$ pyo all_substrings2.py aabbb | sort | uniq | nl -ba
1 a
2 aa
3 aab
4 aabb
5 aabbb
6 ab
7 abb
8 abbb
9 b
10 bb
11 bbb

The man pages for sort and uniq are here:

sort
uniq

That's it for now. I have a few more points which I may want to add; if I decide to do so, I'll do them in a Part 3 post.

The image at the top of the post is of spools of twine (a kind of string) from Wikipedia.

- Enjoy.

- Vasudev Ram - Online Python training and consulting

I conduct online courses on Python programming, Unix/Linux (commands and shell scripting) and SQL programming and database design, with personal coaching sessions.

Contact me for details of course content, terms and schedule.

DPD: Digital Publishing for Ebooks and Downloads.

Hit the ground running with my vi quickstart tutorial. I wrote it at the request of two Windows system administrator friends who were given additional charge of some Unix systems. They later told me that it helped them to quickly start using vi to edit text files on Unix.

Check out WP Engine, powerful WordPress hosting.

Creating online products for sale? Check out ConvertKit, email marketing for online creators.

Teachable: feature-packed course creation platform, with unlimited video, courses and students.

Track Conversions and Monitor Click Fraud with Improvely.

Posts about: Python * DLang * xtopdf

My ActiveState Code recipes

Follow me on:

Wednesday, September 12, 2018

How many ways can you substring a string? Part 1

By Vasudev Ram

String image attribution

Recently, something I read made me think of writing a simple program to generate all substrings of a given string.
(To be precise, excluding the null string.)

Here is an initial version I came up with, all_substrings.py:

"""
all_substrings.py
Function and program to find all the substrings of a given string.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Twitter: https://mobile.twitter.com/vasudevram
Product Store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
from error_exit import error_exit
from debug1 import debug1

def usage():
    message_lines = [\
        "Usage: python {} a_string".format(sa[0]),
        "Print all substrings of a_string.",
    ]
    sys.stderr.write("\n".join(message_lines))

def all_substrings(s):
    """
    Generator function that yields all the substrings of a given string.
    """

    ls = len(s)
    if ls == 0:
        usage()
        error_exit("\nError: String argument must be non-empty.")

    start = 0
    while start < ls:
        end = start + 1
        while end <= ls:
            debug1("s[{}:{}] = {}".format(start, end, s[start:end]))
            yield s[start:end]
            end += 1
        start += 1

def main():
    if lsa != 2:
        usage()
        error_exit("\nError: Exactly one argument must be given.")

    for substring in all_substrings(sa[1]):
        print(substring)

sa = sys.argv
lsa = len(sa)

if __name__ == "__main__":
    main()

Some runs and output of the program:

With no command-line arguments:

$ python all_substrings.py
Usage: python all_substrings.py a_string
Print all substrings of a_string.
Error: Exactly one argument must be given.

With one command-line argument, an empty string:

$ python all_substrings.py ""
Usage: python all_substrings.py a_string
Print all substrings of a_string.
Error: String argument must be non-empty.

Now with a 3-character string, with debugging enabled, via the use of my debug1 debugging function [1] (and Python's __debug__ built-in variable, which is set to True by default):

$ python all_substrings.py abc
s[0:1] = a
a
s[0:2] = ab
ab
s[0:3] = abc
abc
s[1:2] = b
b
s[1:3] = bc
bc
s[2:3] = c
c

[1] You can read about and get the code for that debugging function here:

Improved simple Python debugging function

The remaining runs are with debugging turned off via Python's -O flag:

With a 4-character string:

$ python -O all_substrings.py abcd
a
ab
abc
abcd
b
bc
bcd
c
cd
d

With a 4-character string, not all characters unique:

$ python -O all_substrings.py FEED
F
FE
FEE
FEED
E
EE
EED
E
ED
D

Note that when there are duplicated characters in the input, we can get duplicate substrings in the output; in this case, E appears twice.

With a string of length 6, again with some characters repeated (E and D):

$ python -O all_substrings.py FEEDED
F
FE
FEE
FEED
FEEDE
FEEDED
E
EE
EED
EEDE
EEDED
E
ED
EDE
EDED
D
DE
DED
E
ED
D

Again, we get duplicate substrings in the output.

With a 6-character string, no duplicate characters:

$ python -O all_substrings.py 123456
1
12
123
1234
12345
123456
2
23
234
2345
23456
3
34
345
3456
4
45
456
5
56
6

Is there any other way of doing it?
Any interesting enhancements possible?

Yes to both questions.
I will cover some of those points in a follow-up post.

Actually, I already did one thing in the current version, which is of interest: I used a generator to yield the substrings lazily, instead of creating them all upfront, and then returning them all in a list. I'll show and discuss a few pros and cons of some other approaches later.

Meanwhile, want to have a bit of fun with visual effects?

Try some variations of runs of the program like these:

python -O all_substrings.py +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

python -O all_substrings.py /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\

python -O all_substrings.py "%% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $$ @@ && %% $"

$ python -O all_substrings.py 10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010

python -O all_substrings.py ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"

You can change the characters used in the string argument to any combination of any punctuation characters, or even letters or digits - anything you like. You can also vary the number of characters used in the string. Longer ones (though not too long) tend to give better visual effects and the display also lasts for longer. Note that all the characters of the string you use, should come on the same single line, the same line as the python command you use. Also, if using a pipe character (|) (or any other characters that are special to your OS shell), enclose the whole string in quotes as I have done in an example above. I ran this on Windows and so used double quotes for such cases. Single quotes give errors. On Unix-like systems, either may work, but some characters may get interpreted inside double quotes. Experiment :)

You can also add an import time statement in the imports section of the program, and then use a time.sleep(number) inside the for loop, say, just above the print(substring) statement. I used values like:

time.sleep(0.002)

which works well for my display. You can tweak that number for your hardware.

- Have fun.

Did you know that there are a large number of meanings and contexts for the word string? Here are some of them:

String (Wikipedia).

This Wikipedia article about strings in computer science is interesting, and has a lot more points than one might imagine at first:

(computer) strings

- Vasudev Ram - Online Python training and consulting

Hit the ground running with my vi quickstart tutorial, vetted by two Windows system administrator friends.

Jump to posts: Python * DLang * xtopdf

Interested in a Python, SQL or Linux course?

Get WP Engine, powerful managed WordPress hosting.

Subscribe to my blog (jugad2.blogspot.com) by email

My ActiveState Code recipes

Follow me on:

Gumroad * LinkedIn * Twitter

Do you create online products? Get Convertkit:

Email marketing for digital product creators