
Saturday, November 5, 2016

[xtopdf] Batch convert text files to PDF (with xtopdf and fileinput)

By Vasudev Ram


file1.txt + file2.txt + file3.txt => file123.pdf

I created this new xtopdf app recently. (For those unfamiliar with it, xtopdf (source here) is my open source Python project for PDF generation from other formats and sources. Here is a good high-level overview of xtopdf, describing what it is and can do, its supported input formats, platforms (Windows, Linux, Mac OS X, Unix) and environments (CLI, GUI, Web). The core of the xtopdf project is a library; what I call xtopdf apps are applications built using that library.)

This particular app lets you batch-convert multiple text files into a single PDF file. The content of each text file starts on a new page in the PDF file. The program uses xtopdf (which in turn uses ReportLab) and the fileinput module from Python's standard library. The program could be written without the fileinput module (I wrote a variant that way earlier), but I used fileinput this time for convenience, and to show a use of it.

(BTW, fileinput is a pretty useful module in its own right for this sort of work - applying the same process (any process, not just PDF generation) to a bunch of input files. fileinput can also read from standard input if no input filenames are specified, but I don't use that feature here. Also, I used 4 functions from the fileinput module, on 4 consecutive lines, in this short program :) - not just for the sake of it, though; it made sense to do so.)

Here is the code, in file BatchTextToPDF.py:
from __future__ import print_function

# BatchTextToPDF.py
# Convert a batch of text files to a single PDF.
# Each text file's content starts on a new page in the PDF file.
# Requires:
# - xtopdf: https://bitbucket.org/vasudevram/xtopdf
# - ReportLab: https://www.reportlab.com/ftp/reportlab-1.21.1.tar.gz
# Author: Vasudev Ram
# Copyright 2016 Vasudev Ram
# Product store: https://gumroad.com/vasudevram
# Web site: https://vasudevram.github.io
# Blog: http://jugad2.blogspot.com

import sys
import fileinput
from PDFWriter import PDFWriter

def usage(prog_name):
    sys.stderr.write("Usage: {} outfile.pdf infile1.txt ...\n".format(prog_name))

def main():

    if len(sys.argv) < 3:
        usage(sys.argv[0])
        sys.exit(1)

    try:
        pw = PDFWriter(sys.argv[1])
        pw.setFont('Courier', 12)
        pw.setFooter('xtopdf: https://google.com/search?q=xtopdf')

        # fileinput.input() chains all the given input files into one stream.
        # filelineno() is the line number within the current file, filename()
        # is the current file's name, and lineno() is the cumulative line
        # number across all files; so at the first line of each new file we
        # set the page header, and (unless it is the very first line overall)
        # save the previous file's page before starting the new one.
        for line in fileinput.input(sys.argv[2:]):
            if fileinput.filelineno() == 1:
                pw.setHeader(fileinput.filename())
                if fileinput.lineno() != 1:
                    pw.savePage()
            pw.writeLine(line.strip('\n'))

        pw.savePage()
        pw.close()
    except Exception as e:
        print("Caught Exception: type: {}, message: {}".format(
            e.__class__, str(e)))

if __name__ == '__main__':
    main()
Here is a sample run of the program. I created 3 text files, text1.txt through text3.txt, with a few lines of text in each, and then ran the command:
python BatchTextToPDF.py BTTP123.pdf text1.txt text2.txt text3.txt
This created the PDF file BTTP123.pdf. Cropped screenshots of the 1st and 3rd (last) page of the PDF are below:

1st page: [cropped screenshot of the 1st page of the PDF]

3rd page: [cropped screenshot of the 3rd page of the PDF]
In this example I've closed the PDFWriter instance manually, using pw.close(), but PDFWriter can also be used with the Python with statement, since I had added context manager support to PDFWriter earlier. I use the with statement in some of my xtopdf app examples, and not in others, to show that both possibilities exist.
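
For illustration, here is a minimal sketch of the with-statement form (the output filename and text here are just examples; I'm assuming the context manager support mentioned above closes the PDFWriter on block exit):

from PDFWriter import PDFWriter

with PDFWriter('output.pdf') as pw:
    pw.setFont('Courier', 12)
    pw.setHeader('Header text')
    pw.setFooter('Footer text')
    pw.writeLine('Some text on the page')
    pw.savePage()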

Here is a Guide to installing and using xtopdf, including creating simple PDF e-books with it.

- Enjoy.

- Vasudev Ram - Online Python training and consulting




Thursday, May 19, 2016

i18nify any word with this Python utility

By Vasudev Ram

I18Nify

While I was browsing some web pages, a word I read triggered a chain of thoughts. The word had to do with internationalization (often shortened to i18n by developers, because there are 18 letters between the first i and the last n). That gave me the idea of writing this small program that "i18nifies" a given word - not in the sense of internationalizing it, but in the way shown below - making a numeronym out of the word.

Here is i18nify.py:
from __future__ import print_function
'''
Utility to "i18nify" any word given as argument.

You Heard It Here First (TM):
"i18nify" signifies making a numeronym of the given word, in the 
same manner that "i18n" is a numeronym for "internationalization" 
- because there are 18 letters between the starting "i" and the 
ending "n". Another example is "l10n" for "localization".
Also see a16z.

Author: Vasudev Ram
Copyright 2016 Vasudev Ram - https://vasudevram.github.io
'''

def i18nify(word):
    # If word is too short, don't bother, return as is.
    if len(word) < 4:
        return word
    # Return (the first letter) plus (the string form of the 
    # number of intervening letters) plus (the last letter).
    return word[0] + str(len(word) - 2) + word[-1]

def get_words():
    for words in [
        ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz'],
        ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the',
         'lazy', 'dog'],
        ['all', 'that', 'glitters', 'is', 'not', 'gold'],
        ['often', 'have', 'you', 'heard', 'that', 'told'],
        ['jack', 'and', 'jill', 'went', 'up', 'the', 'hill',
         'to', 'fetch', 'a', 'pail', 'of', 'water'],
    ]:
        yield words

def test_i18nify(words):
    print("\n")
    print(' '.join(words))
    print(' '.join([i18nify(word) for word in words]))

def main():
    for words in get_words():
        test_i18nify(words)
        print()

if __name__ == "__main__":
    main()
Running it with:
$ python i18nify.py
gives this output:
a bc def ghij klmno pqrstu vwxyz
a bc def g2j k3o p4u v3z

the quick brown fox jumped over the lazy dog
the q3k b3n fox j4d o2r the l2y dog

all that glitters is not gold
all t2t g6s is not g2d

often have you heard that told
o3n h2e you h3d t2t t2d

jack and jill went up the hill to fetch a pail of water
j2k and j2l w2t up the h2l to f3h a p2l of w3r

Notes:

- The use of yield makes function get_words a generator function. A generator is not strictly needed here; I could instead have made get_words return the whole list of word lists in one shot, but I left the yield in.
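
For comparison, here is a minimal sketch of that non-generator variant (shortened to two of the word lists for brevity):

def get_words():
    # Return the whole list of word lists at once,
    # instead of yielding them one at a time.
    return [
        ['a', 'bc', 'def', 'ghij', 'klmno', 'pqrstu', 'vwxyz'],
        ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'],
    ]

main() works unchanged with either version, since a for loop can iterate over both a list and a generator.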

- Speaking of generators, also see this post: Python generators are pluggable.

- The article on numeronyms (link near top of post) reminded me of run-length encoding.

Anyway, e3y :)

- Vasudev Ram - Online Python training and consulting


Sunday, July 12, 2015

Cut the crap! An absolutely essential tool for writers

By Vasudev Ram


KEEP CALM

AND

BE CONCISE




Like anyone else, every now and then I hear people use redundant words or phrases. Hey, I do it myself sometimes, but am trying to do less of it.

So one day recently, thinking about ways to help with this issue, I came up with the idea for this program, called "Cut the crap!".

You feed it a redundant word or phrase, and if it "knows" it, it spits out the concise (unredundant? dundant? :-) version. Think of it as a Strunk-and-White-like Python bot.

So, here's the Python code for cut_the_crap.py, an absolutely essential tool for writers. The phrases and words are hardcoded in this first version, but you can easily modify the program to read them, along with their concise substitutes, from a persistent store such as a file or database (a minimal sketch of that idea appears after the sample run below):
# cut_the_crap.py
# Author: Vasudev Ram
# Purpose: Given a redundant word or phrase, emits a concise synonym.
# See Strunk and White, et al.

from random import randint

d = {
    'at this point in time':
    ['now', 'at present', 'at the moment', 'at this moment',
    'currently', 'presently', 'right now'],
    'absolutely complete': ['complete'],
    'absolutely essential': ['essential', 'indispensable'],
    'actual experience': ['past experience', 'experience'],
    'as to whether': ['whether'],
    'try out': ['try']
}

def cut_the_crap(word):
    if word in d:
        words = d[word]
        i = randint(0, len(words) - 1)
        return words[i]
    else:
        return ""
        
def get_the_word():
    crap_word = raw_input("Enter your word (or type 'exit'): ")
    return crap_word

def main():
    word = get_the_word()
    while word.lower() != 'exit':
        right_word = cut_the_crap(word)
        if right_word != "":
            print "Cut the crap! Say:", right_word
        print
        word = get_the_word()
    print "Bye."

if __name__ == '__main__':
    main()
And here is a sample run, entering a few redundant words and phrases:
$ py cut_the_crap.py
Enter your word (or type 'exit'): at this point in time
Cut the crap! Say: at the moment

Enter your word (or type 'exit'): at this point in time
Cut the crap! Say: right now

Enter your word (or type 'exit'): at this point in time
Cut the crap! Say: now

Enter your word (or type 'exit'): as to whether
Cut the crap! Say: whether

Enter your word (or type 'exit'): absolutely essential
Cut the crap! Say: indispensable

Enter your word (or type 'exit'): absolutely essential
Cut the crap! Say: essential

Enter your word (or type 'exit'): try out
Cut the crap! Say: try

Enter your word (or type 'exit'): exit
Bye.
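
As mentioned above, the phrase dictionary could be read from a persistent store instead of being hardcoded. Here is a minimal sketch of that idea, assuming a JSON file whose structure mirrors the dict d above (the filename phrases.json is just an illustrative choice):

import json

def load_phrases(filename='phrases.json'):
    # The JSON file maps each redundant phrase to a list of
    # concise synonyms, just like the hardcoded dict d above.
    with open(filename) as f:
        return json.load(f)

d = load_phrases()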
- Enjoy.

- Vasudev Ram - Online Python training and programming


Friday, February 13, 2015

Splitting a string on multiple different delimiters

By Vasudev Ram

Just recently I was working on some ideas related to my text file indexing program - which I had blogged about earlier, here:

A simple text file indexing program in Python

As part of that work, I was using Python's string split() method, and found that it has a limitation: the separator can only be a single string (though that string can comprise more than one character).
Trying to work out a solution (i.e. a way to split a string on any one of a set of separator / delimiter characters), I gave these commands interactively in the Python interpreter:

>>> print "".split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

>>> s = "abc.def;ghi"
>>> s.split(".;")
['abc.def;ghi']
>>> s.split(".")
['abc', 'def;ghi']
>>> '---'.join(s.split("."))
'abc---def;ghi'
>>> '---'.join(s.split(".")).split(";")
['abc---def', 'ghi']
>>> "---".join('---'.join(s.split(".")).split(";"))
'abc---def---ghi'
>>> "---".join('---'.join(s.split(".")).split(";")).split('---')
['abc', 'def', 'ghi']
>>>

So you can see that by doing repeated manual split()'s and join()'s, I was able to split the original string the way I wanted, i.e. on both the period and semicolon as delimiters. I'll work out a function or class to do it and then blog it in a sequel to this post.

(Using regular expressions to match the delimiters, and extracting all but the matched parts, may be one way to do it, but I'll try another approach. There are probably many ways to go about it.)
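
For reference, here is a minimal sketch of the regular-expression approach just mentioned, using re.split from the standard library with a character class that matches either delimiter:

>>> import re
>>> re.split(r"[.;]", "abc.def;ghi")
['abc', 'def', 'ghi']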

- Vasudev Ram - Dancing Bison Enterprises


Sunday, February 9, 2014

A simple text file indexing program in Python

By Vasudev Ram



Recently, something I was working on made me think of creating a program to index text files - that is, to create an index file for a text file, something like the index of a book (*), in which, for each word in the book, there is a list of page numbers where that word occurs. The difference here is that this program creates, for each word, a list of line numbers where the word occurs in the text file being processed.

(*) To be more specific, what I created was something like a back-of-the-book index, but for text files. I mention that because there are many types of index (Wikipedia), and not just for books. In fact, I was surprised to see the number of meanings or uses of the word index :-) Check the Wikipedia link in the previous sentence to see them. One type of index familiar to programmers, of course, is an array index (or list index, for Python).

Here is the program, called text_file_indexer.py, with a sample input, run and output shown below it. Comments in the code explain the key parts of the logic. Some improvements to the program are possible, of course. I may work on some of them over time. You can already customize the delimiter characters string that is used to remove those characters from around words.

"""
text_file_indexer.py
A program to index a text file.
Author: Vasudev Ram - www.dancingbison.com
Copyright 2014 Vasudev Ram
Given a text file somefile.txt, the program will read it completely, 
and while doing so, record the occurrences of each unique word, 
and the line numbers on which they occur. This information is 
then written to an index file somefile.idx, which is also a text 
file.
"""

import sys

# debug1 is the author's small debug-print helper module; its debug1()
# function prints a label and, optionally, a value.
from debug1 import debug1

def index_text_file(txt_filename, idx_filename, 
    delimiter_chars=",.;:!?"):
    """
    Function to read txt_filename and create an index of the 
    occurrences of words in it. The index is written to idx_filename.
    There is one index entry per line in the index file. An index entry 
    is of the form: word line_num line_num line_num ...
    where "word" is a word occurring in the text file, and the instances 
    of "line_num" are the line numbers on which that word occurs in the 
    text file. The lines in the index file are sorted by the leading word 
    on the line. The line numbers in an index entry are sorted in 
    ascending order. The argument delimiter_chars is a string of one or 
    more characters that may adjoin words in the input and are not 
    wanted as part of the word. The function will remove those 
    delimiter characters from the edges of the words before the rest 
    of the processing.
    """
    try:
        txt_fil = open(txt_filename, "r")
        """
        Dictionary to hold words and the line numbers on which 
        they occur. Each key in the dictionary is a word and the 
        value corresponding to that key is a list of line numbers 
        on which that word occurs in txt_filename.
        """

        word_occurrences = {}
        line_num = 0

        for lin in txt_fil:
            line_num += 1
            debug1("line_num", line_num)
            # Split the line into words delimited by whitespace.
            words = lin.split()
            debug1("words", words)
            # Remove unwanted delimiter characters adjoining words.
            words2 = [ word.strip(delimiter_chars) for word in words ]
            debug1("words2", words2)
            # Find and save the occurrences of each word in the line.
            for word in words2:
                if word in word_occurrences:
                    word_occurrences[word].append(line_num)
                else:
                    word_occurrences[word] = [ line_num ]

        debug1("Processed {} lines".format(line_num))

        if line_num < 1:
            print "No lines found in text file, no index file created."
            txt_fil.close()
            sys.exit(0)

        # Display results.
        word_keys = word_occurrences.keys()
        print "{} unique words found.".format(len(word_keys))
        debug1("Word_occurrences", word_occurrences)
        debug1("word_keys", word_keys)

        # Sort the words in the word_keys list.
        word_keys.sort()
        debug1("after sort, word_keys", word_keys)

        # Create the index file.
        idx_fil = open(idx_filename, "w")

        # Write the words and their line numbers to the index file.
        # Since we read the text file sequentially, there is no need 
        # to sort the line numbers associated with each word; they are 
        # already in sorted order.
        for word in word_keys:
            line_nums = word_occurrences[word]
            idx_fil.write(word + " ")
            for line_num in line_nums:
                idx_fil.write(str(line_num) + " ")
            idx_fil.write("\n")

        txt_fil.close()
        idx_fil.close()
    except IOError as ioe:
        sys.stderr.write("Caught IOError: " + repr(ioe) + "\n")
        sys.exit(1)
    except Exception as e:
        sys.stderr.write("Caught Exception: " + repr(e) + "\n")
        sys.exit(1)

def usage(sys_argv):
    sys.stderr.write("Usage: {} text_file.txt index_file.txt\n".format(
        sys_argv[0]))

def main():
    if len(sys.argv) != 3:
        usage(sys.argv)
        sys.exit(1)
    index_text_file(sys.argv[1], sys.argv[2])

if __name__ == "__main__":
    main()

# EOF
Here is a sample input text file, file01.txt, that I tested the program with:
This file is a test of the text_file_indexer.py program.
The program indexes a text file.
The output of the program is another file called an index file.
The index file is like the index of a book.
For each word that occurs in the text file, there will be a line 
in the index file, starting with that word, and followed by all 
the line numbers in the text file on which that word occurs.
I ran the text file indexer program with the command:
python text_file_indexer.py file01.txt file01.idx
And here is the output of running the program on that text file, that is, the contents of the file file01.idx:
For 5 
The 2 3 4 
This 1 
a 1 2 4 5 
all 6 
an 3 
and 6 
another 3 
be 5 
book 4 
by 6 
called 3 
each 5 
file 1 2 3 3 4 5 6 7 
followed 6 
in 5 6 7 
index 3 4 4 6 
indexes 2 
is 1 3 4 
like 4 
line 5 7 
numbers 7 
occurs 5 7 
of 1 3 4 
on 7 
output 3 
program 1 2 3 
starting 6 
test 1 
text 2 5 7 
text_file_indexer.py 1 
that 5 6 7 
the 1 3 4 5 6 7 7 
there 5 
which 7 
will 5 
with 6 
word 5 6 7 
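
As noted earlier, the delimiter characters can be customized via the delimiter_chars keyword argument of index_text_file; for example (the extended delimiter set here is just an illustrative choice):

index_text_file("file01.txt", "file01.idx", delimiter_chars=",.;:!?'\"()")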
- Vasudev Ram - Python training and consulting


Tuesday, May 21, 2013

A partial crossword solver in Python

A Cryptic Crossword Clue Solver

Saw this via Twitter.

It is a partial crossword solver, because it only helps solve a particular category of crossword clues - those in which the clue (which is usually a sentence or phrase) contains both a "definition" of the answer as well as a hint of some kind that leads to the same answer. This solver tries to compute the answer using both the definition and the hint, and checks whether the results match. Ingenious.

I found it interesting because this is a somewhat difficult problem, and yet the author managed to create a solution (involving NLTK and parsing) that works in many, if not all cases.

Also, long ago, in college days, I had written another kind of partial crossword solver (in BASIC); it was much simpler, using a brute force method. It helped solve the kind of crossword clue in which the answer is a permutation of a substring of the characters comprising the clue sentence or phrase. The program would generate and display on the screen all possible permutations of all possible substrings of the sentence that were of the same length as the answer. Then you had to view those permutations and guess whether any of them was the right answer, based on the clue.

I wrote the permutation-generation code by hand, but saw recently that the Python itertools module has functions to generate permutations (as well as combinations) from sequences:

http://docs.python.org/2/library/itertools.html

http://en.m.wikipedia.org/wiki/Permutation

http://en.wikipedia.org/wiki/Crossword
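
For illustration, here is a minimal sketch of that brute-force idea using itertools.permutations (the function and variable names here are my own, not from the original BASIC program):

from itertools import permutations

def candidate_answers(clue, answer_len):
    # Generate all permutations of all substrings of the clue
    # (ignoring spaces) that have the same length as the answer.
    text = clue.replace(' ', '')
    candidates = set()
    for start in range(len(text) - answer_len + 1):
        substring = text[start:start + answer_len]
        for perm in permutations(substring):
            candidates.add(''.join(perm))
    return candidates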

- Vasudev Ram
dancingbison.com

Thursday, January 24, 2013

Open Text Summarizer

Linux.com :: Condensing with Open Text Summarizer

Open Text Summarizer is a tool for automating summarization of non-fiction text.

It could be used to (partially) automate the process of creating abstracts or executive summaries.

Interesting idea.

I had worked on a project with roughly similar goals, but in it the automation was not for summarizing the text, but for related aspects.

- Vasudev Ram
www.dancingbison.com
Software training and consulting

Sunday, July 15, 2012

The Bentley-Knuth problem and solutions

By Vasudev Ram


I recently saw this post about an interesting programming problem on the Web (apparently initially posed by Jon Bentley to Donald Knuth).

For lack of a better term (and also because the name is somewhat memorable), I'm calling it the Bentley-Knuth problem: More shell, less egg

The problem description, from the above post:

[
The program Bentley asked Knuth to write is one that’s become familiar to people who use languages with serious text-handling capabilities: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
]

The post is interesting in itself - read it. For fun, I decided to write solutions to the problem in Python and also in UNIX shell.

My initial Python solution is below. The code is not very Pythonic / refactored / tested, but it works, and does have some minimal error checking. See this Python sorting HOWTO page for some ways it could be improved. UNIX shell solution coming in a while.

UPDATE: Unix shell solution added below the Python one.

Note: I should mention that neither my Python solution nor my UNIX shell solution works exactly the same as McIlroy's shell solution, since his converts upper case letters to lower case, and also uses a strict "English dictionary"-style definition of a "word" (only alphabetic characters), whereas my two solutions define a word as "a sequence of non-blank characters", as is more common when parsing computer programs. But I could add both of his tr invocations to the front of my shell pipeline and get the same result as McIlroy.

# bentley_knuth.py
# Author: Vasudev Ram - http://www.dancingbison.com
# Version: 0.1

# The problem this program tries to solve is from the page:
# http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

# Description: The program Bentley asked Knuth to write:

# Read a file of text, determine the n most frequently 
# used words, and print out a sorted list of those words 
# along with their frequencies.

import sys

sys_argv = sys.argv

def usage():
    sys.stderr.write("Usage: %s n file\n" % sys_argv[0])
    sys.stderr.write("where n is the number of most frequently\n")
    sys.stderr.write("used words you want to find, and\n")
    sys.stderr.write("file is the name of the file in which to look.\n")

if len(sys_argv) < 3:
    usage()
    sys.exit(1)

try:
    n = int(sys_argv[1])
except ValueError:
    sys.stderr.write("%s: Error: %s is not a decimal numeric value\n" %
        (sys_argv[0], sys_argv[1]))
    sys.exit(1)

print "n =", n
if n < 1:
    sys.stderr.write("%s: Error: %s is not a positive value\n" %
        (sys_argv[0], sys_argv[1]))
    sys.exit(1)

in_filename = sys.argv[2]
print "%s: Finding %d most frequent words in file %s" % \
    (sys_argv[0], n, in_filename)

try:
    fil_in = open(in_filename)
except IOError:
    sys.stderr.write("%s: ERROR: Could not open in_filename %s\n" %
        (sys_argv[0], in_filename))
    sys.exit(1)

# Count the frequency of each word in the file.
word_freq_dict = {}
for lin in fil_in:
    words_in_line = lin.split()
    for word in words_in_line:
        if word in word_freq_dict:
            word_freq_dict[word] += 1
        else:
            word_freq_dict[word] = 1

# Sort the (word, frequency) pairs by decreasing frequency.
word_freq_list = word_freq_dict.items()
wfl = sorted(word_freq_list, key=lambda item: item[1], reverse=True)

print "The %d most frequent words sorted by decreasing frequency:" % n
len_wfl = len(wfl)
if n > len_wfl:
    print "n = %d, file has only %d unique words," % (n, len_wfl)
    print "so printing %d words" % len_wfl
print "Word: Frequency"
m = min(n, len_wfl)
for i in range(m):
    print wfl[i][0], ": ", wfl[i][1]

fil_in.close()
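
For comparison, here is a minimal sketch of the same counting-and-sorting logic using collections.Counter from the standard library (this was not part of my original solution; the names here are illustrative):

import sys
from collections import Counter

def most_frequent_words(filename, n):
    # Count every whitespace-delimited word in the file.
    with open(filename) as f:
        counts = Counter(word for line in f for word in line.split())
    # most_common(n) returns the n (word, frequency) pairs with the
    # highest counts, sorted by decreasing frequency.
    return counts.most_common(n)

if __name__ == '__main__':
    for word, freq in most_frequent_words(sys.argv[2], int(sys.argv[1])):
        print word, ":", freq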


And here is my initial solution in UNIX shell:

# bentley_knuth.sh

# Usage:
# ./bentley_knuth.sh n file
# where "n" is the number of most frequent words 
# you want to find in "file".

awk '
    {
        for (i = 1; i <= NF; i++)
            word_freq[$i]++
    }
END     {
            for (i in word_freq)
                print i, word_freq[i]
        }
' < $2 | sort -k2,2nr | sed ${1}q
- Vasudev Ram - Dancing Bison Enterprises