Python bindings for libepub using Shiboken

I have a Kindle and I like it very much, but unfortunately I can’t say the same about the format it uses for books. Not that I have made a detailed study of the mobi format and come to the conclusion that it is technically inferior. No. The problem is that ePub is so much more popular, which means more books and more tools to play with. Calibre is awesome for converting between eBook formats, but I prefer the simplicity of downloading a book, copying it to the device and reading right away.

When my Kindle is not at hand, my replacement reader is an N900 with MeBook, one of my favorite apps – it looks for books (in ePub format) on Feedbooks, downloads them to the device and shows them in a nice list with covers and all. Looking at the MeBook source code I learned that it uses libepub to read the ePub eBooks, so I thought it would be interesting to make a Python binding for libepub. And here we are.

Ah, and one more thing before we start: I found a related Python project on GitHub called libepub. I haven’t checked it out yet, but I mention it here for information’s sake.

Shiboken

Shiboken is the world famous tool used to generate the world famous PySide bindings (which happen to have reached 1.0 last week – yay!). It is one of the best tools around for generating Python bindings for C++ libraries, but that’s me talking – I work on it. libepub, however, is written in C, and even though Shiboken can generate bindings for a bunch of global functions and put them together in a Python module, that’s crappy. So I’ll have to make some preparations to make libepub appear beautiful in Python.

Pre-requisites: development files (i.e. the headers) for ApiExtractor, GeneratorRunner, Shiboken, and libepub. (For Debian/Ubuntu users this means: libapiextractor-dev, libgenrunner-dev, libshiboken-dev, and libepub-dev.) And the C++ compiler plus CMake, of course.

Reminder: ApiExtractor, GeneratorRunner and Shiboken are made with Qt (only the core libs, no UI for them), but the generated bindings do not depend on Qt at all. Except, obviously, if the library being wrapped already does.

libepub

First let’s have an overview of libepub. It has three structures used as opaque pointers:

// Contains information about the ePub file.
struct epub;
// Iterator object for the Table of Contents.
struct titerator;
// Iterator object for the book contents.
struct eiterator;

If you want to read the contents of an ePub file, call the function that will create an epub structure, iterate through its table of contents with a titerator, and then the contents themselves with an eiterator.

Most functions follow that “object oriented C” format seen in other libraries, with the first argument being a pointer to the structure that represents the “this” or “self” in OO languages. Examples:

void
epub_dump(struct epub* epub);

unsigned char**
epub_get_metadata(struct epub* epub,
                  enum epub_metadata type,
                  int* size);

int
epub_get_data(struct epub* epub,
              const char* name,
              char** data);
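
Python happens to make the same convention visible: a method call is just a function call with the object passed explicitly as the first argument. Here is a minimal sketch with a made-up Book class (it is not part of libepub or of the bindings, just an illustration of the "self as first argument" idea):

```python
class Book(object):
    """Made-up class, used only to illustrate the calling convention."""

    def __init__(self, title):
        self.title = title

    def describe(self):
        return "Book: " + self.title


book = Book("Beyond the Wall of Sleep")

# Object-oriented spelling: the instance comes before the dot.
method_style = book.describe()

# "Object oriented C" spelling: the instance is the first argument,
# just like epub_dump(epub) in libepub.
function_style = Book.describe(book)

print(method_style == function_style)
```

Both calls run the same code; only the spelling differs, and that difference is all the C++ wrapper in the next sections has to bridge.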

CMake

I’ll use CMake as the build system for the bindings because I feel comfortable with it – it is the same one used by PySide and the whole Shiboken generator tool chain. I won’t give too much attention to this part of the process, so just check the CMakeLists.txt files; a more detailed explanation of the process of building a binding can be found in the PySide Binding Generation Tutorial. For really, really basic information check the PySide CMake Primer.

Handmade C++ Wrapper for a C library

To generate Python bindings for a C++ library one must write an XML description called a type system, which declares what must be exposed in Python land (global functions, classes, namespaces, enums) and whether any of it needs modification. For example, if in C++ I have the class Rectangle, I would declare it in the type system this way:

<value-type name='Rectangle' />

All the methods belonging to Rectangle would be exposed to Python automatically. The same is not true for the C functions that represent the methods of the epub struct. Unfortunately there’s no way to tell the Shiboken generator that I want to have the epub structure and functions represented as a class with methods, so we have to make a thin C++ wrapper around the C structures. That would be a lot of work for a huge C library, but even with a tiny one I feel uncomfortable having to resort to this kind of hackery. The type system should be expressive enough to bind C structs and functions as if they were proper objects with methods. I’ll mark this for future community/out-of-work improvements on Shiboken.

Let’s see a snippet from epub_cpp_wrapper.h:

class EPub {
public:
    ...

    ~EPub() { epub_close(m_epub); }

    static inline EPub*
    open(const char* filename, int debug = 0) {
        struct epub* book = epub_open(filename, debug);
        if (book)
            return new EPub(book);
        return 0;
    }

    ...
    inline void dump() { epub_dump(m_epub); }

    inline int
    get_data(const char* name, char** data) {
        return epub_get_data(m_epub, name, data);
    }

    ...

private:
    explicit EPub(struct epub* ptr) : m_epub(ptr) {}
    // Copying is disabled: both members are private and never called.
    EPub(const EPub& other) {}
    EPub& operator=(const EPub& other) { return *this; }
    struct epub* m_epub;
};

Notice that all methods were marked as inline to make this wrapper as thin as possible. (GCC, I’m looking at you!)

EPub::open

The epub_open function becomes the static method EPub::open, which returns a new EPub object for the ePub file given by the filename parameter, or a null pointer if the file is invalid or doesn’t exist.
The constructor for this class was made private so the only way to create EPub objects is via EPub::open, which never creates an invalid EPub.
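
The same "factory instead of constructor" idea can be modelled in Python. This is only a sketch under assumptions: the Document class is made up, the validity check is reduced to "the file exists", and since Python has no private constructors a token argument stands in for one.

```python
import os


class Document(object):
    """Made-up class: instances can only be obtained through open()."""

    _token = object()  # stands in for the C++ private constructor

    def __init__(self, path, _token=None):
        if _token is not Document._token:
            raise TypeError("use Document.open() instead")
        self.path = path

    @classmethod
    def open(cls, path):
        # Like EPub::open: hand back nothing rather than an invalid object.
        if not os.path.isfile(path):
            return None
        return cls(path, _token=cls._token)


missing = Document.open("/no/such/file.epub")
print(missing is None)
```

Callers then only need a single check on the value returned by open(); every object they actually hold is known to be usable.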

~EPub

In C, the function responsible for freeing the epub structure is epub_close, but I won’t turn it into EPub::close() because its C++ equivalent is the class’ destructor.

A generated Python binding for what we have until now would look roughly like this:

import epub
book = epub.EPub.open('sample.epub')
title = book.get_metadata(epub.EPUB_TITLE)

Of course I haven’t explained how this would be generated, why the module would be called epub, what epub.EPUB_TITLE is, or how Python would know what to do with unsigned char**, but bear with me.

Moving enums around

EPUB_TITLE is a value from the epub_metadata enum. If the enum and its values are exposed to Python as they are, they’ll look like this:

import epub
epub.epub_metadata
epub.EPUB_TITLE

Which is pretty ugly and lame. Since epub_metadata is an enum related to the epub object (as the epub_ prefix tells us), it would be natural to move it inside the EPub class. In my fantasy world, the type system tag that describes a C++ enum to Python would have options to move it inside another object and to rename it. Something along these lines:

<enum-type name='epub_metadata'
           rename='metadata'
           move-into='EPub'
           remove-enum-value-prefix='EPUB_' />

And here I have another item for the Shiboken TODO list. Until this feature is implemented, I’ll have to do it manually.

class EPub {
public:
    enum metadata {
        ID = int(EPUB_ID),
        TITLE = int(EPUB_TITLE),
        CREATOR = int(EPUB_CREATOR),
        CONTRIB = int(EPUB_CONTRIB),
        SUBJECT = int(EPUB_SUBJECT),
        PUBLISHER = int(EPUB_PUBLISHER),
        DESCRIPTION = int(EPUB_DESCRIPTION),
        DATE = int(EPUB_DATE),
        TYPE = int(EPUB_TYPE),
        FORMAT = int(EPUB_FORMAT),
        SOURCE = int(EPUB_SOURCE),
        LANG = int(EPUB_LANG),
        RELATION = int(EPUB_RELATION),
        COVERAGE = int(EPUB_COVERAGE),
        RIGHTS = int(EPUB_RIGHTS),
        META = int(EPUB_META)
    };
    ...
    inline unsigned char**
    get_metadata(metadata type, int* size) {
        return epub_get_metadata(m_epub,
                                 epub_metadata(type),
                                 size);
    }
    ...
};

The epub_metadata enum values were cast to int to prevent ApiExtractor from emitting a bunch of warnings saying that it cannot tell who these guys are.

Anyways, it’s so awful… to have a fancy generator, and having to write all this… noooo!
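
For comparison, the renaming done by hand above boils down to stripping the EPUB_ prefix and hanging the values on the class. A rough Python sketch of that transformation follows; the strip_prefix helper and the numeric values are assumptions made for illustration (the real values live in libepub’s epub.h), and nothing like this exists in Shiboken today.

```python
# Constant names as they appear in the wrapper above. The numeric values are
# placeholders for illustration; the real ones are defined in epub.h.
C_NAMES = [
    "EPUB_ID", "EPUB_TITLE", "EPUB_CREATOR", "EPUB_CONTRIB",
    "EPUB_SUBJECT", "EPUB_PUBLISHER", "EPUB_DESCRIPTION", "EPUB_DATE",
    "EPUB_TYPE", "EPUB_FORMAT", "EPUB_SOURCE", "EPUB_LANG",
    "EPUB_RELATION", "EPUB_COVERAGE", "EPUB_RIGHTS", "EPUB_META",
]
C_ENUM = dict((name, value) for value, name in enumerate(C_NAMES))


def strip_prefix(constants, prefix):
    """Return a copy of `constants` with `prefix` removed from matching keys."""
    return dict(
        (name[len(prefix):], value)
        for name, value in constants.items()
        if name.startswith(prefix)
    )


class EPub(object):
    """Stand-in for the wrapped class; it only hosts the renamed values."""


# Attach TITLE, CREATOR, ... to the class, mirroring the nested C++ enum.
for short_name, value in strip_prefix(C_ENUM, "EPUB_").items():
    setattr(EPub, short_name, value)

print(EPub.TITLE == C_ENUM["EPUB_TITLE"])
```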

TIterator and EIterator

The C structures titerator and eiterator will be wrapped by the C++ classes TIterator and EIterator, respectively, and like the EPub class their constructors are private.

class TIterator {
public:
    enum type {
        NAVMAP = int(TITERATOR_NAVMAP),
        GUIDE = int(TITERATOR_GUIDE),
        PAGES = int(TITERATOR_PAGES)
    };
    ~TIterator() { epub_free_titerator(m_iter); }
    inline bool isValid() {
        return epub_tit_curr_valid(m_iter);
    }
    ...
private:
    friend class EPub;
    explicit TIterator(struct titerator* iter)
        : m_iter(iter) {}
    struct titerator* m_iter;
};

New instances of TIterator and EIterator are created by EPub methods, so EPub must be declared a friend of the iterator classes to reach their private constructors.

class EPub {
public:
    ...
    inline EIterator*
    get_iterator(EIterator::type type, int opt = 0) {
        struct eiterator* it = epub_get_iterator(m_epub, eiterator_type(type), opt);
        if (it)
            return new EIterator(it);
        return 0;
    }
    inline TIterator*
    get_titerator(TIterator::type type, int opt = 0) {
        struct titerator* it = epub_get_titerator(m_epub, titerator_type(type), opt);
        if (it)
            return new TIterator(it);
        return 0;
    }
    ...
};

python-epub

Now it’s time to tell the generator what goes into the bindings and what must change.

The global header

The global header is a file that includes all the other headers of the library that will be analyzed. In the global header the binding developer may also add some tweaks, like a #define that triggers some condition in the target library headers and thereby affects the generated binding.

The epub_global.h that I use here is very simple:

#ifndef EPUB_GLOBAL_H
#define EPUB_GLOBAL_H

#include <epub.h>
#include <epub_shared.h>
#include <epub_version.h>

#include "epub_cpp_wrappers.h"

// ApiExtractor complains if it finds only pre-definitions.
struct titerator {};
struct eiterator {};

#endif

As the comment says, ApiExtractor doesn’t like it when it finds forward declarations without definitions. I chose to add these two bogus definitions for struct titerator and struct eiterator. (For some reason the generator said nothing about struct epub.) Another option would be to leave out the bogus definitions and instead add a line to the type system file telling the generator to ignore warnings related to those structures.

The Type System description

Here follows a snippet from typesystem_epub.xml.

<?xml version='1.0'?>
<typesystem package='epub'>
    ...
    <rejection enum-name='epub_metadata' />
    ...
    <object-type name='EPub'>
        <enum-type name='metadata' />
        ...
    </object-type>
</typesystem>

The epub_metadata enum is not exported to Python (at least not as it is), and to keep the generator from emitting its warnings it must be explicitly rejected; its C++ substitute is added afterwards inside the EPub object type. The same happens to the other enums.

In the type system notation, object-type refers to objects that are passed around solely as pointers (like EPub, whose copy constructor and assignment operator are private). If, on the other hand, the object can be passed by value, it should be declared as a value-type, like the Rectangle example mentioned before.

Python’s Iterator Protocol

In Python, if an object supports the Iterator Protocol I can use it in for statements, like this:

from epub import EPub, TIterator
book = EPub.open('sample.epub')
for toc_it in book.get_titerator(TIterator.NAVMAP):
    if not toc_it.isValid():
        continue
    print 'link : ' + toc_it.link()
    print 'label: ' + toc_it.label()

Following the Python Iterator Protocol consists solely of an object implementing the methods __iter__() and __next__().

Note: in Python 2 the correct name for the method is next() and not __next__(). That's a minor bug in the generator that I just found.
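
To make the protocol concrete, here is a small pure Python sketch. The Countdown class is made up for the occasion and has nothing to do with the bindings; it defines __next__ (the Python 3 name) and aliases it to next (the Python 2 name) so the same code runs on both.

```python
class Countdown(object):
    """Made-up iterator that yields start, start - 1, ..., 1."""

    def __init__(self, start):
        self._current = start

    def __iter__(self):
        # An iterator returns itself here.
        return self

    def __next__(self):
        # Signal the end of the iteration with StopIteration.
        if self._current <= 0:
            raise StopIteration
        value = self._current
        self._current -= 1
        return value

    next = __next__  # Python 2 looks for this name instead


items = list(Countdown(3))
print(items)
```

A for statement does exactly what list() does here: it calls __iter__ once and then keeps asking for the next item until StopIteration is raised.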

The XML to add iterator protocol features into TIterator class will be lengthy, so I'll split it into two parts, the first dealing with type system code templates.

Type System templates

<?xml version='1.0'?>
<typesystem package='epub'>
    ...
    <template name='iterator.__iter__'>
        Py_INCREF(%PYSELF);
        %PYARG_0 = %PYSELF;
    </template>
    <template name='iterator.__next__'>
        if (%CPPSELF.next()) {
            <insert-template name='iterator.__iter__' />
        } else {
            PyErr_SetNone(PyExc_StopIteration);
        }
    </template>
    ...

The iterator methods for TIterator and EIterator have exactly the same implementation, so it’ll be smart to use type system templates and have the code, and its eventual bugs, in a single place.

The iterator.__iter__ method just needs to return the object itself with its reference counter incremented by one. iterator.__next__ calls the underlying C++ object’s next() and, if that succeeds, also returns the object itself with the refcounter incremented; when the end is reached it raises a Python StopIteration exception.

The type system variables %PYSELF, %PYARG_0 and %CPPSELF are replaced by values dependent on the context where they are used (e.g. the TIterator or EIterator classes). Check the documentation for their meaning.

Adding the iterator methods

    ...
    <object-type name='TIterator'>
        ...
        <modify-function signature='next()' remove='all' />
        <add-function signature='__iter__' return-type='PyObject*'>
            <inject-code class='target' position='beginning'>
                <insert-template name='iterator.__iter__' />
            </inject-code>
        </add-function>
        <add-function signature='__next__' return-type='PyObject*'>
            <inject-code class='target' position='beginning'>
                <insert-template name='iterator.__next__' />
            </inject-code>
        </add-function>
    </object-type>
     ...
</typesystem>

First the original C++ next() is removed, then the methods for the Python iterator protocol are added, using the <insert-template/> tag to insert the previously defined custom code. Exactly the same lines will be added to the EIterator class.

Just one more bit of hackery

Only one obstacle remains in the way of having proper Python iterators. When Python’s for statement is used to iterate through an iterable object, in the first round it calls the object’s __iter__ method, and immediately after it calls next, and keeps calling next for each new iteration.

The problem here is that our underlying C iterator is already loaded with the first item when it is created, so the way Python’s for iteration works (calling next right after __iter__) would cause the first item to be skipped. A workaround for this case is to use a flag in the C++ wrapper that records whether the iterator has just been created, so that it will not move forward when next is called on it for the first time.

class TIterator {
public:
    ...
    inline bool next() {
        if (m_isFirst) {
            m_isFirst = false;
            return true;
        }
        return epub_tit_next(m_iter);
    }
private:
    friend class EPub;
    explicit TIterator(struct titerator* iter)
        : m_iter(iter), m_isFirst(true) {}
    struct titerator* m_iter;
    bool m_isFirst;
};

I had this freedom because TIterator is a class completely under my (me, the binding developer) control. If struct titerator had been a C++ class from the beginning, that approach would not be the best. Perhaps libshiboken (the supporting library used by all Shiboken generated bindings) should provide a base iterator class to handle this particular difference between Python and C++ iterators. Or perhaps the generated class, when identified as an iterable by the presence of iterator protocol methods added by the binding developer, should have such provisions. The latter option seems best, and that’s one more item for my list of future improvements.
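
The skipped first item and the flag workaround can be reproduced in pure Python. The sketch below is a model, not the generated binding: Cursor is a made-up stand-in for the C iterator (already positioned on the first item), and the wrappers return the current value instead of returning themselves as the real TIterator does.

```python
class Cursor(object):
    """Made-up stand-in for the C iterator: starts on the first item."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def is_valid(self):
        return self._pos < len(self._items)

    def current(self):
        return self._items[self._pos]

    def advance(self):
        """Move to the next item; return False when there is none."""
        if self._pos + 1 < len(self._items):
            self._pos += 1
            return True
        return False


class NaiveIter(object):
    """Always advances before returning, so the first item is lost."""

    def __init__(self, cursor):
        self._cursor = cursor

    def __iter__(self):
        return self

    def __next__(self):
        if not self._cursor.advance():
            raise StopIteration
        return self._cursor.current()

    next = __next__  # Python 2 name


class FlaggedIter(NaiveIter):
    """Same wrapper, plus the 'is this the first call?' flag."""

    def __init__(self, cursor):
        NaiveIter.__init__(self, cursor)
        self._is_first = True

    def __next__(self):
        if self._is_first:
            self._is_first = False
            if not self._cursor.is_valid():
                raise StopIteration
            return self._cursor.current()  # do not advance yet
        return NaiveIter.__next__(self)

    next = __next__  # Python 2 name


chapters = ["cover", "chapter 1", "chapter 2"]
skipped = list(NaiveIter(Cursor(chapters)))
complete = list(FlaggedIter(Cursor(chapters)))
print(skipped)
print(complete)
```

The naive wrapper never yields "cover", while the flagged one yields all three entries, which is the behaviour the m_isFirst member buys in the C++ class.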

Custom Conversions

Returning unicode values

The methods EIterator::curr() and EIterator::curr_url() return values of char* type, which doesn’t have a converter (const char* does have one), so I’ve written a custom piece of code to convert them to Python’s unicode.

<?xml version='1.0'?>
<typesystem package='epub'>
    ...
    <template name='return_char_pointer'>
        char* %0 = %CPPSELF.%FUNCTION_NAME();
        if (%0) {
            %PYARG_0 = PyUnicode_DecodeUTF8(%0, strlen(%0), "strict");
        } else {
            Py_INCREF(Py_None);
            %PYARG_0 = Py_None;
        }
    </template>
    ...
    <object-type name='EIterator'>
        ...
        <modify-function signature='curr()'>
            <inject-code class='target' position='beginning'>
                <insert-template name='return_char_pointer' />
            </inject-code>
        </modify-function>
        <modify-function signature='curr_url()'>
            <inject-code class='target' position='beginning'>
                <insert-template name='return_char_pointer' />
            </inject-code>
        </modify-function>
        ...
    </object-type>
    ...
</typesystem>

Modifying a method’s signature

Some C++ method signatures couldn’t automatically be converted to meaningful Python code, so I added more custom code to handle the situation on a case by case basis. For example, the method

unsigned char**
EPub::get_metadata(metadata type, int* size);

The int* size argument receives a pointer to an int, which will contain the number of strings in the array returned as unsigned char**. In Python the method would merely return a tuple of unicode objects (or None when there is no data), and the call to get_metadata doesn’t need the size argument.

I believe the type system description is enough to see how the modification system works.

<object-type name='EPub'>
  <enum-type name='metadata' />
  <modify-function signature='get_metadata(EPub::metadata,int*)'>
    <modify-argument index='2'>
      <remove-argument />
    </modify-argument>
    <modify-argument index='return'>
      <replace-type modified-type='PyTuple' />
    </modify-argument>
    <inject-code class='target'>
      unsigned char** data = 0;
      int size;
      data = %CPPSELF.%FUNCTION_NAME(%1, &amp;size);
      if (data) {
        %PYARG_0 = PyTuple_New(size);
        PyObject* uni = 0;
        for (int i = 0; i &lt; size; ++i) {
          uni = PyUnicode_DecodeUTF8((const char*)data[i],
                                     strlen((const char*)data[i]),
                                     "strict");
          PyTuple_SetItem(%PYARG_0, i, uni);
        }
        for (int i = 0; i &lt; size; ++i)
          free(data[i]);
        free(data);
      } else {
        Py_INCREF(Py_None);
        %PYARG_0 = Py_None;
      }
    </inject-code>
  </modify-function>
  ...
</object-type>
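
What the injected code does with the char** can be tried out in plain Python with ctypes. This is a sketch under assumptions: strings_to_tuple is a made-up helper, the strings are sample values, and unlike the real injected code it frees nothing because Python owns the buffers here.

```python
import ctypes


def strings_to_tuple(data, size):
    """Model of the injected code: a NULL char** becomes None, otherwise
    each of the `size` entries is decoded from UTF-8 into a tuple."""
    if not data:  # a NULL ctypes pointer is falsy
        return None
    return tuple(data[i].decode("utf-8", "strict") for i in range(size))


# Build a char** holding two UTF-8 encoded strings (sample values only).
raw = [u"Beyond the Wall of Sleep".encode("utf-8"),
       u"caf\u00e9".encode("utf-8")]
array = (ctypes.c_char_p * len(raw))(*raw)
data = ctypes.cast(array, ctypes.POINTER(ctypes.c_char_p))

titles = strings_to_tuple(data, len(raw))
missing = strings_to_tuple(ctypes.POINTER(ctypes.c_char_p)(), 0)
print(titles)
print(missing)
```

The size that C needs as an out-parameter disappears from the Python side because the tuple already knows its own length.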

Downloading, building, etc.

Enough with the explanations, now let’s try the code. Remember that you’ll also need the development files that I mentioned a long time ago.

Clone the latest version from the git repository:

git clone git://github.com/setanta/python-epub.git

or download the tarball: python-epub.tar.bz2

Inside the source code directory create a build directory and …

cd python-epub
mkdir build
cd build
cmake ..
make
ctest

The last command deserves some talking about.

Testing, Testing, Testing

When working with binding development there’s a myriad of things that can go wrong, and a number of them go wrong in complete silence. With this in mind, I tell you that having unit tests makes the binding developer’s life bearable.

To see detailed results from the tests, run ctest with the -V (verbose) option.

Check the tests directory for examples on how to use the python-epub bindings.
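
To give an idea of what such a test can look like, here is a minimal unittest sketch. It is not copied from the repository: the test name is made up, and it assumes that the null pointer returned by EPub::open for a missing file shows up in Python as None. The test is skipped when the generated epub module is not importable.

```python
import unittest

try:
    import epub  # the generated module; only importable after building
except ImportError:
    epub = None

# Guard against an unrelated module that happens to be called "epub".
HAVE_BINDINGS = epub is not None and hasattr(epub, "EPub")


@unittest.skipUnless(HAVE_BINDINGS, "python-epub bindings not built")
class OpenTest(unittest.TestCase):
    def test_missing_file_gives_none(self):
        # Assumption: a null EPub* is converted to None by the binding.
        self.assertIsNone(epub.EPub.open("no-such-file.epub"))
```

Saved in a file, it can be run with python -m unittest from a directory where the built module can be found.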

UI Example (or else it would be boring)

But the UI itself is very boring: although it was made with the amazing PySide, I kept it as simple as possible, and it doesn’t even show images. On the other hand, with very little code I can now see what’s inside an ePub eBook.

From python-epub/build directory call the ePub viewer like this:

python ../simple-ui/bookviewer.py ../tests/beyond-the-wall-of-sleep.epub

It always expects a parameter with the path to an ePub file, in this case “Beyond the Wall of Sleep” by Lovecraft.

[Screenshot: Simplest ePub viewer (made with PySide)]
