jugad2 - Vasudev Ram on software innovation: Docx

Wednesday, October 2, 2013

Convert Microsoft Word files to PDF with DOCXtoPDF

DOCX to PDF

Building upon my recent post, "Extract text from Word .docx files with python-docx", I came up with the idea of combining the DOCX text extraction functionality of python-docx with my xtopdf toolkit, to create a program that can convert the text in Microsoft Word DOCX files to PDF format.

[ Note: The conversion has some limitations. E.g. fonts, tables, etc. from the input are not preserved in the output. ]
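The formatting is lost because only the plain text of each paragraph is extracted. As a rough illustration of why (this is not python-docx's actual code), here is a minimal standard-library sketch. It assumes that a DOCX file is a ZIP archive whose main text is in word/document.xml. The names extract_paragraphs and make_sample_docx are made up for this example, and the sample archive is a stripped-down stand-in, not a file that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_paragraphs(docx_file):
    """Return the plain text of each paragraph in a DOCX file.

    docx_file may be a filename or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text runs; formatting elements (bold, fonts) are ignored.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs

def make_sample_docx():
    """Build a tiny, hypothetical DOCX-like archive in memory for the demo."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello, </w:t></w:r>'
        '<w:r><w:t>world.</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    return buf

if __name__ == "__main__":
    for paragraph in extract_paragraphs(make_sample_docx()):
        print(paragraph)
```

The bold marker on the first run is dropped and only the text survives, which is the same kind of loss the note above describes.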

Here is the program, called DOCXtoPDF.py. It will become a part of my xtopdf toolkit.

# DOCXtoPDF.py

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

import sys
import os
import os.path
import string
from textwrap import TextWrapper
from docx import opendocx, getdocumenttext
from PDFWriter import PDFWriter

def docx_to_pdf(infilename, outfilename):

    # Extract the text from the DOCX file named infilename and write it
    # to the PDF file named outfilename.

    try:
        infil = opendocx(infilename)
    except Exception, e:
        print "Error opening " + infilename
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    paragraphs = getdocumenttext(infil)

    pw = PDFWriter(outfilename)
    pw.setFont("Courier", 12)
    pw.setHeader("DOCXtoPDF - convert text in DOCX file to PDF")
    pw.setFooter("Generated by xtopdf and python-docx")
    wrapper = TextWrapper(width=70, drop_whitespace=False)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    for paragraph in new_paragraphs:
        lines = wrapper.wrap(paragraph)
        for line in lines:
            pw.writeLine(line)
        pw.writeLine("")

    pw.savePage()
    pw.close()
    
def usage():

    return "Usage: python DOCXtoPDF.py infile.docx outfile.pdf\n"

def main():

    try:
        # Check for correct number of command-line arguments.
        if len(sys.argv) != 3:
            print "Wrong number of arguments"
            print usage()
            sys.exit(1)
        infilename = sys.argv[1]
        outfilename = sys.argv[2]

        # Check for right infilename extension.
        infile_ext = os.path.splitext(infilename)[1]
        if infile_ext.upper() != ".DOCX":
            print "Input filename extension should be .DOCX"
            print usage()
            sys.exit(1)

        # Check for right outfilename extension.
        outfile_ext = os.path.splitext(outfilename)[1]
        if outfile_ext.upper() != ".PDF":
            print "Output filename extension should be .PDF"
            print usage()
            sys.exit(1)

        docx_to_pdf(infilename, outfilename)

    except Exception, e:
        sys.stderr.write("Error: " + repr(e) + "\n")
        sys.exit(1)

if __name__ == '__main__':
    main()

# EOF
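The wrapping step in the program can be tried on its own, without xtopdf or python-docx installed. Here is a small sketch of it. It collects the wrapped lines in a list instead of passing them to PDFWriter; wrap_paragraphs is a name made up for this example, and the sample text is arbitrary.

```python
from textwrap import TextWrapper

def wrap_paragraphs(paragraphs, width=70):
    """Wrap each paragraph to the given width, with a blank line after each."""
    wrapper = TextWrapper(width=width, drop_whitespace=False)
    lines = []
    for paragraph in paragraphs:
        lines.extend(wrapper.wrap(paragraph))
        lines.append("")  # blank line between paragraphs, as in DOCXtoPDF.py
    return lines

if __name__ == "__main__":
    sample = ["word " * 30, "A short paragraph."]
    for line in wrap_paragraphs(sample):
        print(line)
```

With drop_whitespace=False, the wrapper keeps the whitespace at line boundaries instead of discarding it, so joining the wrapped lines of a paragraph gives back the original text.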

To run DOCXtoPDF, give a command of the form:

python DOCXtoPDF.py infilename.docx outfilename.pdf

After this, the text content of the DOCX file will be in the PDF file.
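If you convert many files, typing both filenames each time gets tedious. Here is a possible helper, not part of DOCXtoPDF.py, that derives the output filename from the input filename using the same extension check as the program. The name default_pdf_name is hypothetical.

```python
import os.path

def default_pdf_name(infilename):
    """Return infilename with its .docx extension replaced by .pdf."""
    root, ext = os.path.splitext(infilename)
    # Same case-insensitive check that DOCXtoPDF.py applies to its input.
    if ext.upper() != ".DOCX":
        raise ValueError("Input filename extension should be .DOCX")
    return root + ".pdf"

if __name__ == "__main__":
    print(default_pdf_name("report.docx"))
```

The result could then be passed as the second argument on the DOCXtoPDF command line.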

- Enjoy.

Read other posts about xtopdf on this blog.
Read other posts about Python on this blog.

- Vasudev Ram - Dancing Bison Enterprises



Thursday, September 27, 2012

Docverter HTTP API converts marked-up docs to PDF, Docx, RTF or ePub

Docverter is an HTTP API that converts marked-up documents to PDF, Docx, RTF or ePub. It supports other input and output formats as well.

Docverter is a paid service.

It uses pandoc, the open source Swiss Army knife of format conversion tools, which I've blogged about a couple of times before.

UPDATE: The Docverter service is not available yet; it is in closed beta. When you try to sign up, you see a form asking for your email address, so that they can inform you when it is open for use. Like some other startups nowadays, they are gauging user interest by collecting the email addresses of interested people. In this case, though, the creator says that he already has some working code, which needs some improvement before users are let in.

There is an interesting thread about it on Hacker News, where the creator, HN user zrail, also participates, answering questions about the service, including why it is a paid service when pandoc is free.

Inspired by nature.
- dancingbison.com | @vasudevram | jugad2.blogspot.com
