Hairy Sun

Matt's Blog on Some Geeky Topics

Real-life Generators and a Peeker Too

After a long day of manual labor last weekend, I spent a couple of minutes relaxing by converting some PDFs to mobi files so my mom could read them on her Kindle. Her Kindle supports PDF, but reading PDFs on Kindles (especially the non-DX e-ink variety) is a pain. You can zoom into sections, but that isn’t practical for long reading, and the default fonts are too small. One of her favorite features is bumping up the font size on mobi files, especially at night. So I obliged.

To that end I’ve created some code that helps in this process: ebookgenerators. Most readers don’t care about the process of converting PDF to mobi. How I cleaned up the text might be interesting though. I used a chain of Python generators!

A few have complained that my Iteration and Generator book doesn’t have enough real examples. Some form of this blog post will probably end up as a chapter there. So without further ado, here’s some code.

A concept I briefly mention in my book is a Peeker class. A peeker can look ahead during iteration. This is useful if deciding the output of an action requires more than one item. Here’s mine:

Peeker
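class PeekDone(Exception):
    pass


class Peeker(object):
    def __init__(self, seq):
        self.seq = iter(seq)
        self.buffer = []

    def pop(self):
        """ discard and return the next buffered (peeked) item, if any"""
        if self.buffer:
            return self.buffer.pop(0)

    def peek(self, n=0):
        """ this can raise an exception if peeking off the end. be
        aware and handle PeekDone appropriately"""
        # fill the buffer until index n exists, so peek(1) works even
        # when nothing has been peeked yet
        try:
            while n >= len(self.buffer):
                self.buffer.append(next(self.seq))
        except StopIteration:
            raise PeekDone('Exhausted')
        return self.buffer[n]

    def __iter__(self):
        return self

    def next(self):
        # serve already-peeked items before reading from the source
        if self.buffer:
            return self.buffer.pop(0)
        else:
            return next(self.seq)

    __next__ = next  # Python 3 spelling of the iterator protocol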
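
To make the look-ahead behavior concrete, here is a minimal Python 3 sketch of the same idea with a short session at the bottom. It is a port rather than the listing itself: it uses the built-in `next()` and `__next__`, and its `peek` fills the buffer in a loop so that `peek(1)` works even when nothing has been buffered yet. The sample input `'abc'` and the variable names are only for illustration.

```python
class PeekDone(Exception):
    """Raised when peeking past the end of the underlying iterator."""


class Peeker:
    def __init__(self, seq):
        self.seq = iter(seq)
        self.buffer = []

    def peek(self, n=0):
        # Fill the buffer until index n exists, or signal exhaustion.
        while len(self.buffer) <= n:
            try:
                self.buffer.append(next(self.seq))
            except StopIteration:
                raise PeekDone('Exhausted') from None
        return self.buffer[n]

    def __iter__(self):
        return self

    def __next__(self):
        # Serve buffered (already peeked) items before touching the source.
        if self.buffer:
            return self.buffer.pop(0)
        return next(self.seq)


letters = Peeker('abc')
first = next(letters)      # consumes 'a'
ahead = letters.peek()     # looks at 'b' without consuming it
ahead2 = letters.peek(1)   # looks two items ahead: 'c'
rest = list(letters)       # 'b' and 'c' are still delivered in order

try:
    letters.peek()         # nothing left to look at
    exhausted = False
except PeekDone:
    exhausted = True
```

Peeked items stay in the buffer until ordinary iteration (or `pop`) removes them, which is why `rest` still contains both letters.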

I use the PeekDone exception as a sentinel, rather than returning a special value. Here’s an example of a generator that removes double blank lines from lines of text using Peeker:

remove_double_returns
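def remove_double_returns(lines):
    # blank() is a helper that is not shown in this post
    lines = Peeker(lines)
    for line in lines:
        try:
            next_line = lines.peek()
        except PeekDone:
            # last line: nothing left to peek at
            yield line
            return

        yield line
        if blank(next_line):
            # drop the blank line that follows
            lines.pop()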
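
To see what this generator does to real input, here is a self-contained Python 3 sketch that runs it over a small sample. The compact `Peeker` is a stand-in for the class above, `blank` is a hypothetical helper (the post does not show its definition; this one treats whitespace-only lines as blank), and the sample lines are made up.

```python
class PeekDone(Exception):
    pass


class Peeker:
    """Compact Python 3 stand-in for the Peeker class in this post."""

    def __init__(self, seq):
        self.seq = iter(seq)
        self.buffer = []

    def pop(self):
        if self.buffer:
            return self.buffer.pop(0)

    def peek(self, n=0):
        try:
            while len(self.buffer) <= n:
                self.buffer.append(next(self.seq))
        except StopIteration:
            raise PeekDone('Exhausted') from None
        return self.buffer[n]

    def __iter__(self):
        return self

    def __next__(self):
        if self.buffer:
            return self.buffer.pop(0)
        return next(self.seq)


def blank(line):
    # Hypothetical helper: a line is blank if it holds only whitespace.
    return not line.strip()


def remove_double_returns(lines):
    lines = Peeker(lines)
    for line in lines:
        try:
            next_line = lines.peek()
        except PeekDone:
            yield line
            return
        yield line
        if blank(next_line):
            lines.pop()  # discard the blank line that was peeked at


sample = ['one\n', '\n', '\n', 'two\n', '\n', 'three\n']
cleaned = list(remove_double_returns(sample))
# cleaned == ['one\n', '\n', 'two\n', 'three\n']
```

With this `blank`, one blank line is dropped after each yielded line: the doubled blank after `one` collapses to a single blank, and the lone blank after `two` disappears.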

Someone fluent in awk could probably do that in two lines. But here’s one that I wouldn’t want to attempt in awk.

fix_space_in_paragraph
def fix_space_in_paragraph(lines):
    """ If paragraphs span pages (often) then there could be extra
    returns in the paragraphs....
    """
    # blank() and ends_sentence() are helpers not shown in this post
    lines = Peeker(lines)
    for line in lines:
        try:
            line2 = lines.peek()
        except PeekDone:
            yield line
            return
        try:
            # make sure a third line exists; at the end of the input
            # the last two lines pass through untouched
            lines.peek(1)
        except PeekDone:
            yield line
            yield line2
            return
        if blank(line2) and (not ends_sentence(line)):
            # don't use line2 so pop it
            lines.pop()
        yield line
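
The generator leans on two helpers, `blank` and `ends_sentence`, whose definitions are not shown in this post. As a rough sketch of what they might look like (both bodies are assumptions, as is the `should_drop_blank` wrapper, which only restates the test the generator applies):

```python
def blank(line):
    # Hypothetical: a line is blank if it holds only whitespace.
    return not line.strip()


def ends_sentence(line):
    # Hypothetical: the last visible character is sentence punctuation,
    # optionally followed by a closing quote or parenthesis.
    text = line.rstrip().rstrip('"\'\u201d\u2019)')
    return text.endswith(('.', '!', '?'))


def should_drop_blank(line, line2):
    # Same condition the generator checks before popping line2.
    return blank(line2) and not ends_sentence(line)


# A page break in mid-sentence leaves a stray blank line behind it.
mid_sentence = should_drop_blank('the quick brown\n', '\n')
# A blank line after a finished sentence is a real paragraph break.
after_sentence = should_drop_blank('It was late.\n', '\n')
```

Under these assumptions the blank line is dropped only when the preceding line stops in mid-sentence, which is the page-break case the docstring describes.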

Here’s a simple generator without Peeker. I need to ensure that paragraphs have an empty line between them so docutils does the right thing:

insert_extra_paragraph_line
def insert_extra_paragraph_line(lines):
    for line in lines:
        yield line
        if ends_paragraph(line):
            yield '\n'
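
Here is a small runnable sketch of that generator on sample input. The `ends_paragraph` rule below is an assumption (the post does not show the real helper); it simply treats a line ending in sentence punctuation as the end of a paragraph.

```python
def ends_paragraph(line):
    # Hypothetical rule: the line's text ends with sentence punctuation.
    return line.rstrip().endswith(('.', '!', '?'))


def insert_extra_paragraph_line(lines):
    for line in lines:
        yield line
        if ends_paragraph(line):
            # Emit an empty line so the paragraph is clearly separated.
            yield '\n'


text = ['A paragraph that wraps\n', 'onto a second line.\n', 'Next paragraph.\n']
result = list(insert_extra_paragraph_line(text))
# result == ['A paragraph that wraps\n', 'onto a second line.\n', '\n',
#            'Next paragraph.\n', '\n']
```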

In the end, using a chain of these generators, I was able to generate three mini-ebooks for my mother before she left for a week-long cruise.

My scripts for cleaning up the text looked something like this:

script
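import sys

import ebookgen


def run():
    # each stage wraps the previous one; the stages are generators, so
    # lines are only pulled from stdin once the final loop starts
    data = sys.stdin
    data = ebookgen.remove_leading_space(data)
    data = ebookgen.remove_dash_page(data)
    data = ebookgen.remove_carot_l(data)
    data = ebookgen.remove_two_spaces(data)
    data = ebookgen.remove_double_returns(data)
    data = ebookgen.insert_extra_paragraph_line(data)
    data = ebookgen.insert_rst_sections(data)

    for line in data:
        # lines carry their own newline, so write them out unchanged
        # (works the same on Python 2 and 3)
        sys.stdout.write(line)


if __name__ == '__main__':
    run()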
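
The chaining pattern in the script works because each stage is a generator that pulls from the one before it. The sketch below illustrates that with two stand-in stages. Their names match functions called in the script, but the bodies here are guesses for illustration only, and `traced` is a hypothetical helper that records when a line is actually read.

```python
def remove_leading_space(lines):
    # Hypothetical stage: strip indentation from each line.
    for line in lines:
        yield line.lstrip(' \t')


def remove_two_spaces(lines):
    # Hypothetical stage: collapse runs of spaces down to one space.
    for line in lines:
        while '  ' in line:
            line = line.replace('  ', ' ')
        yield line


def traced(lines, log):
    # Record each line as it is pulled, to make the laziness visible.
    for line in lines:
        log.append(line)
        yield line


log = []
data = traced(['   Hello   world\n', '  second  line\n'], log)
data = remove_leading_space(data)
data = remove_two_spaces(data)

pulled_before = len(log)     # building the chain reads nothing
first = next(data)           # pulls exactly one line through every stage
pulled_after_one = len(log)
rest = list(data)            # drains the remaining lines
```

Only one line is in flight at a time, so a chain like this can process a long book without holding the whole text in memory.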
