Text Processing in Python

This is an intermediate-level book discussing text processing. As such it may be interesting to non-Python programmers who do a lot of text-processing and would like to see what Python can do. Its main audience, however, is Python programmers who are comfortable with the language and want to take their programming up a level. I highly recommend this book for these readerships. In fact, anyone who programs in Python would probably do well out of this book.

Apart from a final chapter on Internet-specific applications, the book's chapters survey, in order of increasing complexity, the variety of text-processing tools available in Python: beginning with 'Python basics', moving through string methods and regular expressions, and ending with EBNF parsers. Many sections contain problems and exercises, which can be used either as part of structured learning or just to open out some of the discussions in the text. Appendices give short but informative introductions to Python itself, Unicode and data compression.

The whole book, as well as the code itself and a range of Python text-processing utilities documented in the book, is available at the author's website, http://gnosis.cx/TPiP. Note, however, that the version of the book at the website is in a custom mark-up language called 'smart ASCII'. This format is quite readable and the site (and book) have code to convert it to HTML (but not to the LaTeX which was used to produce the hardcopy).

The book, including the code, is well written in a clear, unfussy and readable style. Over and above its documentary function, the commentary and examples are lively enough to act as a stimulus.

The book knows its target audience well and sets a brisk pace. There are no long padding sections on 'how to install Python' or on the interactive shell: the appendix on Python is aimed at programmers who use Python rarely; the opening chapter on 'Python basics' is a fresh look at some facilities the Python programmer might be taking for granted. For example, section 1.1.1 on page 1 is called "Utilizing Higher-Order Functions in Text Processing" (a higher-order function is a function that takes other functions as arguments or returns a function as its result). Every topic in the book gets all and only the space it deserves. There are two corollaries to this:

First, no topic is treated shoddily - as if it were only put in to fill space - and so even the most elementary topics are worth reading for the experienced programmer, as fresh light will be shed or a piece of handy code will turn up. This is even more the case for the 'improving' programmer.

Second, the reader is treated with similar respect. The head of each section tells the reader which topics are treated in depth, which briefly and which not at all; in the latter cases references to documentation are given (e.g., in the Python Library Reference).

As usual with this kind of documentation, facilities are illustrated with a code snippet or two. Unusually, the code snippets here are 'living code': illustrative rather than merely documentary, in that they show the facility in action, and suggestive, in that the code does just enough to encourage you to experiment with it and turn it to your own ends.

For example, the chapter on Regular Expressions uses the functions below to help the discussion of the re module:

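import re

# Python 2 code, as in the book: wrap every match of pattern in braces.
def re_show(pattern, s):
    print re.compile(pattern, re.M).sub(r"{\g<0>}", s.rstrip()), '\n'

# Wrap each replacement in braces so that substitutions stand out.
def re_new(pattern, replacement, s):
    print re.sub(pattern, '{' + replacement + '}', s)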

These two functions highlight (wrap in braces) a match or a replacement in a given text. As well as making the text more readable, they are useful in themselves: handy for learning how regexes work and/or for getting a regex just right before putting it in your script.
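The functions above use the Python 2 print statement. For readers on Python 3, a minimal sketch of the same idea might look like the following. It is an adaptation, not the book's code: the names highlight_matches and highlight_replacement are my own, and the functions return the highlighted string rather than printing it so that the result is easy to inspect.

```python
import re


def highlight_matches(pattern, s):
    """Return s with every match of pattern wrapped in braces."""
    # \g<0> stands for the whole match; re.M lets ^ and $ match at
    # the start and end of each line, not just of the whole string.
    return re.compile(pattern, re.M).sub(r"{\g<0>}", s.rstrip())


def highlight_replacement(pattern, replacement, s):
    """Return s with each match replaced by the brace-wrapped replacement."""
    return re.sub(pattern, "{" + replacement + "}", s)


if __name__ == "__main__":
    text = "The cat sat on the mat"
    print(highlight_matches(r"[cm]at", text))
    print(highlight_replacement(r"[cm]at", "dog", text))
```

With the sample sentence, the first call brackets "cat" and "mat" but leaves "sat" alone, which is exactly the kind of quick check that helps when tuning a pattern.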

The illustrative code given in the section on the Python Standard Library's HTMLParser builds a stack of HTML tags and gets you started on context-dependent interpretation of HTML (or XML) elements. Contrast this with the HTMLParser examples in the Python Library Reference or in [PIAN], which (a) concentrate on the 'tag collection' side of HTML parsing and (b) are difficult to generalise or reuse.
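To give a flavour of the tag-stack approach, here is a rough sketch written for Python 3, where the parser lives in html.parser. It is not the book's code: the class name TagStackParser, the VOID_TAGS set and the sample markup are all my own assumptions, chosen only to show how a stack lets each piece of text be interpreted in the context of its enclosing elements.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed.
# (An illustrative subset, not a complete list.)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagStackParser(HTMLParser):
    """Record each run of text together with the path of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # (context path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding any children
        # that were left unclosed; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


parser = TagStackParser()
parser.feed("<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>")
parser.close()

for context, text in parser.found:
    print(context, "->", text)
```

Because every text fragment arrives with its context path, a rule such as "treat text inside p/b differently from text directly inside p" becomes a simple test on the path.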

Apart from the inevitable occasional typo, there are three caveats I'd like to raise:

Firstly, the author is a fan of Functional Programming (FP) and uses it (though not liberally or gratuitously) throughout the book. This might put off those who don't like FP. On the other hand, FP techniques are introduced and motivated very well in the appendix on Python and in the opening chapter. Finally, the reader is not obliged to adopt FP in order to use the ideas from the book: apart from anything else, FP style can easily be converted to more traditional Python style, e.g.:

add = lambda x, y: x + y

is equivalent to

def add(x, y): return x + y
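The same kind of conversion works for larger fragments. As a rough sketch (the names compose, normalise and lines are mine, not the book's), here is a small text-cleaning pipeline written first with a higher-order function plus map and filter, and then as a conventional loop:

```python
def compose(f, g):
    """Higher-order function: take two functions, return their composition."""
    return lambda x: f(g(x))


lines = ["  Alpha ", "", " BETA", "   "]

# Functional style: build a new function, then map it over the
# lines that are not blank.
normalise = compose(str.lower, str.strip)
fp_result = list(map(normalise, filter(str.strip, lines)))

# Conventional style: the same steps spelled out in a loop.
loop_result = []
for line in lines:
    stripped = line.strip()
    if stripped:
        loop_result.append(stripped.lower())

print(fp_result)
print(loop_result)
```

Both versions strip each line, drop the blank ones and lower-case the rest, so a reader who prefers one style can translate the other mechanically.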

Secondly, given that this book was clearly typeset with LaTeX, I was slightly disappointed to see nothing on LaTeX processing with Python. As the author says on his website, this may be to discourage too many downloads from his website. Notwithstanding the 'need' for DRM, the absence of any discussion of scripting extensions or wrappers for typesetting languages like LaTeX is a flaw.

Finally, the book has a slight bias toward text processing understood as 'processing textual input'. This is understandable of course and no doubt correct. However, the book is correspondingly weak on 'processing textual output'. For example, section 5.2.2, "Parsing, Creating and Manipulating HTML Documents", actually covers only parsing (the sections on XML are also limited to parsing). Similarly, the discussion of email concentrates narrowly on reception rather than generation. While it may be fair to say that generation is simpler than comprehension, some discussion would have been useful.

The remit of this book is broader than its title might suggest: first, text processing covers a lot of things (e.g. handling markup languages, Internet protocols, even bioinformatics); second, the way the book is written encourages the reader to explore the ideas. Like a good monograph, it uses its focus (text processing) as a way to illuminate the whole field (programming in Python).

References:
