Blog RSS Feed

Archive for the ‘Data Modeling’ Category

Bible Annotation Modeling and Querying in MySQL and CouchDB

Thursday, September 1st, 2011

If you’re storing people’s Bible annotations (notes, bookmarks, highlights, etc.) digitally, you want to be able to retrieve them later. Let’s look at some strategies for how to store and look up these annotations.

Know What You’re Modeling

First you need to understand the shape of the data. I don’t have access to a large repository of Bible annotations, but the Twitter and Facebook Bible citations from the Realtime Bible Search section of this website provide a good approximation of how people cite the Bible. (Quite a few Facebook posts appear to involve people responding to their daily devotions.) These tweets and posts are public, and private annotations may take on a slightly different form, but the general shape of the data should be similar: nearly all (99%) refer to a chapter or less.

Large dots at the bottom indicate many single-verse references. Chapter references are also fairly prominent. See below for more discussion.

Compare Bible Gateway reading habits, which are much heavier on chapter-level usage, but 98% of accesses still involve a chapter or less.

The Numbers

The data consists of about 35 million total references.

Percent of Total Description Example
73.5 Single verse John 3:16
17.1 Verse range in a single chapter John 3:16-17
8.4 Exactly one chapter John 3
0.7 Two or more chapters (at chapter boundaries) John 3-4
0.1 Verses spanning two chapters (not at chapter boundaries) John 3:16-4:2
0.1 Verses spanning three or more chapters (not at chapter boundaries) John 3:16-5:2

About 92.9% of posts or tweets cited only one verse or verse range; 7.1% mentioned more than one verse range. Of the latter, 77% cited exactly two verse ranges; the highest had 323 independent verse ranges. Of Facebook posts, 9.1% contained multiple verse ranges, compared to 4.2% of tweets. When there were multiple ranges, 43% of the time they referred to verses in different books from the other ranges; 39% referred to verses in the same book (but not in the same chapter); and 18% referred to verses in the same chapter. (This distribution is a unusual—normally close verses stick together.)

The data, oddly, doesn’t contain any references that span multiple books. Less than 0.01% of passage accesses span multiple books on Bible Gateway, which is probably a useful upper bound for this type of data.

Key Points

  1. Nearly all citations involve verses in the same chapter; only 1% involve verses in multiple chapters.
  2. Of the 1% spanning two or more chapters, most refer to exact chapter boundaries.
  3. Multiple-book references are even more unusual (under 0.01%) but have outsize effects: an annotation that references Genesis 1 to Revelation 22 would be relevant for every verse in the Bible.
  4. Around 7% of notes contained multiple independent ranges of verses—the more text you allow for an annotation, the more likely someone is to mention multiple verses.

Download

Download the raw social data (1.4 MB zip) under the usual CC-Attribution license.

Data Modeling

A Bible annotation consists of arbitrary content (a highlight might have one kind of content, while a proper note might have a title, body, attachments, etc., but modeling the content itself isn’t the point of this piece) tied to one or more Bible references:

  1. A single verse (John 3:16).
  2. A single range (John 3:16-17).
  3. Multiple verses or ranges (John 3:16, John 3:18-19)

The Relational Model

One user can have many rows of annotations, and one annotation can have many rows of verses that it refers to. To model a Bible annotation relationally, we set up three tables that look something like this:

users

user_id name
1

annotations

user_id annotation_id content
1 101
1 102
1 103

annotation_verses

The verse references here are integers to allow for easy range searches: 43 = John (the 43rd book in the typical Protestant Bible); 003 = the third chapter; the last three digits = the verse number.

I like using this approach over others (sequential integer or separate columns for book, chapter, and verse) because it limits the need for a lookup table. (You just need to know that 43 = John, and then you can find any verse or range of verses in that book.) It also lets you find all the annotations for a particular chapter without having to know how many verses are in the chapter. (The longest chapter in the Bible has 176 verses, so you know that all the verses in John 3, for example, fall between 43003001 and 43003176.) This main disadvantage is that you don’t necessarily know how many verses you’re selecting until after you’ve selected them. And using individual columns, unlike here, does allow you to run group by queries to get easy counts.

annotation_id start_verse end_verse
101 43003016 43003016
102 43003016 43003017
103 43003016 43003016
103 43003019 43003020

Querying

In a Bible application, the usual mode of accessing annotations is by passage: if you’re looking at John 3:16-18, you want to see all your annotations that apply to that passage.

Querying MySQL

In SQL terms:

select distinct(annotations.annotation_id)
from annotations, annotation_verses
where annotation_verses.start_verse <= 43003018 and
annotation_verses.end_verse >= 43003016 and
annotations.user_id = 1 and
annotations.annotation_id = annotation_verses.annotation_id
order by annotation_verses.start_verse asc, annotation_verses.end_verse desc

The quirkiest part of the SQL is the first part of the “where” clause, which at first glance looks backward: why is the last verse in the start_verse field and the first verse in the end_verse field? Because the start_verse and end_verse can span any range of verses, you need to make sure that you get any range that overlaps the verses you’re looking for: in other words, the start_verse is before the end of the range, and the end_verse is after the start.

Visually, you can think of each start_verse and end_verse pair as a line: if the line overlaps the shaded area you’re looking for, then it’s a relevant annotation. If not, it’s not relevant. There are six cases:

Start before, end before: John 3:15 / Start before, end inside: John 3:15-17 / Start before, end after: John 3:15-19 / Start inside, end inside: John 3:16-18 / Start inside, end after: John 3:17-19 / Start after, end after: John 3:19

The other trick in the SQL is the sort order: you generally want to see annotations in canonical order, starting with the longest range first. In other words, you start with an annotation about John 3, then to a section inside John 3, then to individual verses. In this way, you move from the broadest annotations to the narrowest annotations. You may want to switch up this order, but it makes a good default.

The relational approach works pretty well. If you worry about the performance implications of the SQL join, you can always put the user_id in annotation_verses or use a view or something.

Querying CouchDB

CouchDB is one of the oldest entrants in the NoSQL space and distinguishes itself by being both a key-value store and queryable using map-reduce: the usual way to access more than one document in a single query is to write Javascript to output the data you want. It lets you create complex keys to query by, so you might think that you can generate a key like [start_verse,end_verse] and query it like this: ?startkey=[0,43003016]&endkey=[43003018,99999999]

But no. Views are one-dimensional, meaning that CouchDB doesn’t even look at the second element in the key if the first one matches the query. For example, an annotation with both a start and end verse of 19001001 matches the above query, which isn’t useful for this purpose.

I can think of two ways to get around this limitation, both of which have drawbacks.

GeoCouch

CouchDB has a plugin called GeoCouch that lets you query geographic data, which actually maps well to this data model. (I didn’t come up with this approach on my own: see Efficient Time-based Range Queries in CouchDB using GeoCouch for the background.)

The basic idea is to treat each start_verse,end_verse pair as a point on a two-dimensional grid. Here’s the above social data plotted this way:

A diagonal line starts in the bottom left corner and continues to the top right. Large dots indicate popular verses, and book outlines are visible.

The line bisects the grid diagonally since an end_verse never precedes a start_verse: the diagonal line where start_verse = end_verse indicates the lower bound of any reference. Here are some points indicating where ranges fall on the plot:

This chart looks the same as the previous one but has points marked to illustrate that longer ranges are farther away from the bisecting line.

To find all the annotations relevant to John 3:16-18, we draw a region starting in the upper left and continuing to the point 43003018,43003016:

This chart looks the same as the previous one but has a box from the top left ending just above and past the beginning of John near the upper right of the chart.

GeoCouch allows exactly this kind of bounding-box query: ?bbox=0,43003016,43003018,99999999

You can even support multiple users in this scheme: just give everyone their own, independent box. I might occupy 1×1 (with an annotation at 1.43003016,1.43003016), while you might occupy 2×2 (with an annotation at 2.43003016,2.43003016); queries for our annotations would never overlap. Each whole number to the left of the decimal acts as a namespace.

The drawbacks:

  1. The results aren’t sorted in a useful way. You’ll need to do sorting on the client side or in a show function.
  2. You don’t get pagination.

Repetition at Intervals

Given the shape of the data, which is overwhelmingly chapter-bound (and lookups, which at least on Bible Gateway are chapter-based), you could simply repeat chapter-spanning annotations at the beginning of every chapter. In the worst case annotation (Genesis 1-Revelation 22), you end up with about 1200 repetitions.

For example, in the Genesis-Revelation case, for John 3 you might create a key like [43000000.01001001,66022021] so that it sorts at the beginning of the chapter—and if you have multiple annotations with different start verses, they stay sorted properly.

To get annotations for John 3:16-18, you’d query for ?startkey=[43003000]&endkey=[43003018,{}]

The drawbacks:

  1. You have to filter out all the irrelevant annotations: if you have a lot of annotations about John 3:14, you have to skip through them all before you get to the ones about John 3:16.
  2. You have to filter out duplicates when the range you’re querying for spans multiple chapters.
  3. You’re repeating yourself, though given how rarely a multi-chapter span (let alone a multi-book span) happens in the wild, it might not matter that much.

Other CouchDB Approaches

Both these approaches assume that you want to make only one query to retrieve the data. If you’re willing to make multiple queries, you could create different list functions and query them in parallel: for example, you could have one for single-chapter annotations and one for multi-chapter annotations. See interval trees and geohashes for additional ideas. You could also introduce a separate query layer, such as elasticsearch, to sit on top of CouchDB.