After a long day of manual labor last weekend, I spent a couple of minutes relaxing by converting some PDFs to mobi files so my mom could read them on her Kindle. Her Kindle supports PDF, but reading PDFs on Kindles (especially the non-DX e-ink variety) is a pain. You can zoom into sections, but that isn't practical for long reading, and the default fonts are too small. One of her favorite features is bumping up the font size on mobi files, especially at night. So I obliged.
To that end I've created some code that helps with this process: ebookgenerators. Most readers won't care about the process of converting PDF to mobi, but how I cleaned up the text might be interesting: I used a chain of Python generators!
A few have complained that my Iteration and Generator book doesn't have enough real examples. Some form of this blog post will probably end up as a chapter there. So without further ado, here's some code.
A concept I briefly mention in my book is a Peeker class. A peeker can look ahead during iteration. This is useful if deciding the output of an action requires more than one item. Here's mine:
    class PeekDone(Exception):
        pass


    class Peeker(object):
        def __init__(self, seq):
            self.seq = iter(seq)
            self.buffer = []  # items already read from seq but not yet consumed

        def pop(self):
            # discard (and return) the next buffered item, if any
            if self.buffer:
                return self.buffer.pop(0)

        def peek(self, n=0):
            """ this can raise an exception if peeking off the end. be
            aware and handle PeekDone appropriately"""
            try:
                if n == len(self.buffer):
                    self.buffer.append(next(self.seq))
            except StopIteration:
                raise PeekDone('Exhausted')
            return self.buffer[n]

        def __iter__(self):
            return self

        def next(self):
            # serve buffered (peeked) items before reading new ones
            if self.buffer:
                return self.buffer.pop(0)
            else:
                return next(self.seq)

        __next__ = next  # Python 3 name for the same method
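To see the look-ahead in action, here is a minimal sketch that restates the class in Python 3 form so the block runs on its own. The `while` loop in `peek` is a small deviation from the class above (it lets `peek(1)` work without a prior `peek()`), and the variable names in the demo are made up for illustration.

```python
class PeekDone(Exception):
    """Raised when peeking past the end of the underlying iterator."""


class Peeker:
    """An iterator wrapper that can look ahead without consuming items."""

    def __init__(self, seq):
        self.seq = iter(seq)
        self.buffer = []

    def peek(self, n=0):
        # Fill the buffer until it holds item n, then return it unconsumed.
        try:
            while len(self.buffer) <= n:
                self.buffer.append(next(self.seq))
        except StopIteration:
            raise PeekDone('Exhausted') from None
        return self.buffer[n]

    def pop(self):
        # Discard (and return) the next buffered item, if any.
        if self.buffer:
            return self.buffer.pop(0)

    def __iter__(self):
        return self

    def __next__(self):
        # Buffered (peeked) items are served before new ones are read.
        if self.buffer:
            return self.buffer.pop(0)
        return next(self.seq)


letters = Peeker('abc')
first = next(letters)       # consumes 'a'
upcoming = letters.peek()   # looks at 'b' without consuming it
rest = list(letters)        # 'b' is still delivered by normal iteration
```

Peeking never changes what iteration yields; it only moves items into the buffer early.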
I use the PeekDone exception as a sentinel, rather than returning a special value. Here's an example of a generator removing double blank lines from lines of text using Peeker:
    def remove_double_returns(lines):
        lines = Peeker(lines)
        for line in lines:
            try:
                next_line = lines.peek()
            except PeekDone:
                # last line: nothing left to look ahead at
                yield line
                return
            if blank(next_line):
                yield line
                lines.pop()  # drop the blank line that follows
            else:
                yield line
Someone fluent in awk could probably do that in two lines. But here's one that I wouldn't want to attempt in awk.
    def fix_space_in_paragraph(lines):
        """ If paragraphs span pages (often) then there could be extra
        returns in the paragraphs.... """
        lines = Peeker(lines)
        for line in lines:
            try:
                line2 = lines.peek()
            except PeekDone:
                yield line
                return
            try:
                line3 = lines.peek(1)
            except PeekDone:
                # only one line left after this one: emit both and stop
                yield line
                yield line2
                return
            if blank(line2) and (not ends_sentence(line)):
                # don't use line2 so pop it
                lines.pop()
            yield line
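Both generators above lean on two helpers, `blank` and `ends_sentence`, that aren't shown in this post. Here is a minimal sketch of what they might look like. These bodies are assumptions for illustration, inferred from the names only, and not the code from ebookgenerators.

```python
def blank(line):
    """Assumed behavior: True if the line holds nothing but whitespace."""
    return not line.strip()


def ends_sentence(line):
    """Assumed behavior: True if the line ends with '.', '!' or '?',
    ignoring trailing whitespace and closing quotes or brackets."""
    text = line.strip().rstrip('"\')]')
    return text.endswith(('.', '!', '?'))


# A line cut off mid-sentence by a page break versus a finished one.
mid_sentence = ends_sentence('and then the\n')
finished = ends_sentence('"Really?"\n')
```

A real implementation would likely need more care with abbreviations such as "Mr." at the end of a line.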
Here's a simple generator without Peeker. I need to ensure that paragraphs have an empty line between them so docutils does the right thing:
    def insert_extra_paragraph_line(lines):
        for line in lines:
            if ends_paragraph(line):
                yield line
                yield '\n'  # blank line so docutils sees a paragraph break
            else:
                yield line
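To make the effect concrete, here is a small runnable sketch of that generator on three lines of text. The `ends_paragraph` helper isn't shown in this post, so the version below is an assumption: it guesses that a paragraph's last line ends a sentence and is shorter than the wrap width. The `width` parameter and the sample `page` are made up for the demo.

```python
def ends_paragraph(line, width=60):
    # Assumption: the last line of a paragraph ends a sentence and
    # stops short of the wrap width.
    text = line.rstrip()
    return text.endswith(('.', '!', '?')) and len(text) < width


def insert_extra_paragraph_line(lines):
    for line in lines:
        if ends_paragraph(line):
            yield line
            yield '\n'  # blank separator line
        else:
            yield line


page = [
    'The first paragraph wraps across\n',
    'two lines and stops here.\n',
    'The second paragraph is short.\n',
]
result = list(insert_extra_paragraph_line(page))
```

With this input, a `'\n'` is inserted after the second and third lines, and the first line passes through unchanged.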
In the end, using a chain of these generators, I was able to generate three mini-ebooks for my mother before she left for a week-long cruise.
My scripts for cleaning up the text looked something like this:
    import sys

    import ebookgen


    def run():
        # each stage wraps the previous one; nothing runs until the loop below
        data = sys.stdin
        data = ebookgen.remove_leading_space(data)
        data = ebookgen.remove_dash_page(data)
        data = ebookgen.remove_carot_l(data)
        data = ebookgen.remove_two_spaces(data)
        data = ebookgen.remove_double_returns(data)
        data = ebookgen.insert_extra_paragraph_line(data)
        data = ebookgen.insert_rst_sections(data)
        for line in data:
            # lines keep their own newlines, so write them as-is
            sys.stdout.write(line)


    if __name__ == '__main__':
        run()
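The repeated `data = stage(data)` pattern can also be expressed as a list of stages folded together. The sketch below shows one way to do that with `functools.reduce`. The two stages are hypothetical stand-ins whose behavior is only guessed from names like `remove_leading_space` and `remove_two_spaces`, and `chain_stages` is a name invented for this example.

```python
from functools import reduce


def strip_leading_space(lines):
    # Stand-in stage: drop leading spaces from each line.
    for line in lines:
        yield line.lstrip(' ')


def collapse_double_spaces(lines):
    # Stand-in stage: turn each pair of spaces into a single space.
    for line in lines:
        yield line.replace('  ', ' ')


def chain_stages(lines, stages):
    """Wrap lines in each stage, in order. Because every stage is a
    generator, no line is processed until the result is iterated."""
    return reduce(lambda data, stage: stage(data), stages, lines)


raw = ['   Chapter  one\n', 'It was  late.\n']
cleaned = list(chain_stages(raw, [strip_leading_space, collapse_double_spaces]))
```

Keeping the stages in a list makes it easy to reorder them or switch one off while experimenting on a stubborn PDF, though the explicit reassignment style above is arguably easier to read.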