The "Book Genome Project" appeals to a good number of people in concept, but it can become confusing when you try to define what you're actually trying to measure. What is the DNA and RNA of a book? How can you extract it meaningfully and accurately at scale? Unlike songs, books are long and complicated, which makes it difficult to measure the literary equivalent of "beats per minute" or the type of instrument used, as the Music Genome Project run by Pandora.com does. With music, you know most of what there is to know about a song within the first 80 seconds - I'd assume - but a book rarely contains the fundamental essence of its story in the first chapter alone, so it is important to track the rise and fall of more subtle elements on a chapter-by-chapter basis.
So what is BookLamp's genome structure? I'll briefly touch on one branch of the multi-part genome structure, what we call Story DNA. Fundamentally, a story is made up of many components, but a good portion of it has to do with setting and content. Where does a story take place, and what are the elements that physically act in that story? When the BookLamp algorithms look at a book and break it down for comparison to other books, they divide Story DNA into two components, known as Setting and Actors. Story Setting is the environment a story takes place in, such as a forest, a city, or the sea. Story Actors, on the other hand, are the elements that act in those settings. An example of a Story Setting is the amount of "Forests & Trees" that appears in a book, vs "City Streets & Urban Environment" - two very different Story Settings. An example of a Story Actor would be "Medieval Weapons & Armor" - a physical element that acts in the environment. To put this in perspective, a book with 30% Forests & Trees and 10% Medieval Weapons would be a very different story from one with 30% City Streets & Urban Environment and the same 10% Medieval Weapons.
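To make that comparison concrete, here is a minimal Python sketch that treats each book as a profile of theme shares and scores how alike two profiles are. This is an illustration only, not BookLamp's actual algorithm: the choice of cosine similarity, the `cosine_similarity` function, and the dictionary layout are all assumptions made for the example. Only the theme names and the 30%/10% figures come from the paragraph above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two theme profiles (dicts of theme -> share).

    Hypothetical helper for illustration; not BookLamp's real comparison method.
    """
    themes = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in themes)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)

# The two books from the example above, with shares written as fractions.
forest_book = {"Forests & Trees": 0.30, "Medieval Weapons & Armor": 0.10}
city_book = {"City Streets & Urban Environment": 0.30, "Medieval Weapons & Armor": 0.10}

similarity = cosine_similarity(forest_book, city_book)
print(f"similarity: {similarity:.2f}")
```

Under this particular measure the two books score low even though they share the same amount of Medieval Weapons, because the dominant Setting differs. A real system would presumably weigh hundreds of themes at once.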
An example would be something like The Da Vinci Code, by Dan Brown, which has specific percentages of top thematic elements (there are hundreds of thematic elements for each book) like Religion & Religious Institutions, Police & Murder Investigation, Terrorism and Security, Art and Art Galleries, and Secret Societies & Communities. The system is able to identify how much of the book each theme makes up, and whether the book uses more or less of that theme than other books in the corpus.
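One simple way to picture the "more or less than other books" comparison is to subtract the corpus average for a theme from the book's own share. The sketch below does that in Python. It is an assumption about how such a comparison could work, not a description of BookLamp's system: `relative_theme_usage` is a hypothetical helper, and every number is an invented placeholder rather than a measured value for any real book.

```python
from statistics import mean

def relative_theme_usage(book, corpus):
    """Return, per theme, the book's share minus the corpus mean share.

    Positive values mean the book uses the theme more than the corpus average.
    Hypothetical helper for illustration only.
    """
    result = {}
    for theme, share in book.items():
        corpus_mean = mean(other.get(theme, 0.0) for other in corpus)
        result[theme] = share - corpus_mean
    return result

# Invented placeholder shares; these are NOT real measurements of any book.
corpus = [
    {"Religion & Religious Institutions": 0.02, "Art and Art Galleries": 0.01},
    {"Religion & Religious Institutions": 0.04},
    {"Art and Art Galleries": 0.05},
]
book = {"Religion & Religious Institutions": 0.12, "Art and Art Galleries": 0.01}

usage = relative_theme_usage(book, corpus)
for theme, delta in usage.items():
    direction = "more" if delta > 0 else "less"
    print(f"{theme}: {abs(delta):.2%} {direction} than the corpus average")
```

A plain difference is the easiest thing to read; a production system might normalize by the spread of each theme across the corpus instead.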
If you're familiar with The Da Vinci Code, you'll see where these themes apply, as well as the many more subtle themes that come and go throughout the novel. Because the software measures on a scene-by-scene basis, we know the exact make-up of the 36th chapter of a book, for example, as well as the first chapter. To give you a sense of the depth of the system, we currently measure and store a little over 30,120 points of data for every book we analyze, giving us a database with literally hundreds of millions of elements across the corpus, with every expectation that it will grow into the billions within the next few months. We don't talk much about the academic history of the project, but the research and development required to engineer and build this system over the last few years has been significant, drawing on the skills of engineers and researchers from universities all around the world. This level of detail, extrapolated across a database of tens of thousands of books, provides a very interesting picture of the world of literature.
While the value of Story DNA is great, it is only one branch of our Book Genome structure. Language DNA and Character DNA are also very important elements of what makes up a book, and can be critical to how a reader responds to it. No story elements are worthwhile unless the medium used to deliver them (the language) has a minimum level of appeal as well. After all, despite Twilight and Romeo and Juliet sharing a lot of storyline similarities, I doubt that floods of Stephenie Meyer fans ever instinctively ran out and bought copies of Shakespeare's plays. The language differences naturally represent tremendous barriers to enjoyment for some readers.
It's also worth noting that only a small portion of what makes a book appeal to a reader can be found in the objective analysis of the genome structure; our goal as a project and website is to measure what can be measured, and place that data in the hands of the user to help them discover the books they're interested in.
None of the materials on the site currently make use of any thematic elements at all. You'll see more of these features, though, as we begin to revamp the site from the ground up over the coming weeks and months.
Until next time,