
B. Analysis and Insights

Now, using cleaned.books.df, we can begin.

First, I wanted to see how publication year affected the distribution of the top 50 books across genres. The results can be found below in a Shiny app that aggregates genre count across year.

Some highlights:

Young Adult literature, though relatively under-the-radar for most of the 90s, spiked hard in 2005/2006, something that could possibly be attributed to a sudden influx of YA authors emerging in response to the sheer demand generated by a certain controversially-acclaimed vampire novel.
Contemporary literature has experienced a sharp increase in popularity over the past two years, rising to claim the top spot in 2019 (so far).
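
The aggregation behind the app can be sketched in a few lines of base R. This is only a minimal sketch, not the app's actual code: the toy data frame and its column names (pubYear, Fantasy, Romance) are assumptions modeled on the binary genre flags built during data cleaning.

```r
# Toy stand-in for cleaned.books.df; the values and column names are assumptions.
books <- data.frame(
  pubYear = c(2005, 2005, 2005, 2006, 2006),
  Fantasy = c(1, 0, 1, 0, 1),
  Romance = c(0, 1, 1, 1, 0)
)

# Sum each binary genre flag within each publication year.
genre.by.year <- aggregate(cbind(Fantasy, Romance) ~ pubYear, data = books, FUN = sum)
print(genre.by.year)
```

Each row of genre.by.year is one publication year, and each genre column holds the number of books flagged with that genre.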

Next, in order to explore how trends in synopses varied across genres, I utilized the wordcloud package as well as tidytext: performing sentiment analysis on each synopsis, then grouping the results by genre, proved to be quite illuminating.

Note - some charts may be inaccessible on mobile!

Some highlights:

Nonfiction, Historical, Contemporary, and Romance all trend positive in overall sentiment; none of the other genres exhibit this trend to anywhere close to the same extent. This may suggest that, in general, people prefer to read about positive things happening in situations that are close to real life (which can be extrapolated to suggest that, to some degree, people are averse to reading about negative situations that are analogous to real life).
The visualization offers reasonable evidence that a lot of popular Young Adult literature is quite formulaic; many books seem to revolve around the "chosen-one" archetype, illustrated by how some of the most popular words are "one", "will", and "world". Furthermore, YA synopses tend to hint at upcoming conflict, shown by the prevalence of "anticipation" and "fear" in the sentiment radar chart for this category.
Romance synopses, from a sentiment perspective, are much more positive than other genres: they're strong in anticipation, trust, and joy. In other words, if you're a romance writer, hopeful synopses are probably the way to go.
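
To make the idea of "sentiment per synopsis, grouped by genre" concrete, here is a minimal base-R sketch. It is not the tidytext pipeline used in the project: the three synopses are invented, the mini-lexicon is hypothetical, and it only scores positive versus negative words rather than the finer emotions (anticipation, trust, joy, fear) mentioned above.

```r
# Toy synopses with a genre label; both columns are illustrative assumptions.
books <- data.frame(
  genre = c("Romance", "Romance", "Thriller"),
  synopsis = c("A joyful story of love and hope.",
               "Love wins after a painful loss.",
               "A killer hunts in fear and darkness."),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon; a real analysis would use a full sentiment lexicon.
lexicon <- c(joyful = "positive", love = "positive", hope = "positive",
             wins = "positive", painful = "negative", loss = "negative",
             killer = "negative", fear = "negative", darkness = "negative")

# Lowercase the text, split on non-letters, look up each word,
# and return (positive words) minus (negative words).
score.synopsis <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  hits <- lexicon[words[words %in% names(lexicon)]]
  as.numeric(sum(hits == "positive") - sum(hits == "negative"))
}

books$score <- vapply(books$synopsis, score.synopsis, numeric(1), USE.NAMES = FALSE)

# Average the per-synopsis scores within each genre.
sentiment.by.genre <- tapply(books$score, books$genre, mean)
print(sentiment.by.genre)
```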

Armed with these preliminary insights, it was time to dive into prediction.

The first step here was identifying the relevant target metric. There were three potential candidates from earlier: recall...

Success Metrics

Rating
- Out of 5 stars - did people like it?
Rating Count
- Number of people who thought a particular book was worth their time to click on the rating bar
Review Count
- Number of people who felt strongly enough about a particular book to write an actual review

R yields the following summary statistics:

To come up with a success metric that takes all three factors into account (albeit to differing extents), we pose the following equation:

success = (rating / max(rating)) * reviewCount

This gives priority to books with more reviews and higher ratings: while ratings are a clear signal of quality, review counts are generally more indicative of people feeling strongly about a book, one way or another. And reader engagement is good, unless it's not.
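
As a quick worked example of the equation, here it is applied to three made-up books. The data frame and its values are illustrative assumptions; only the formula itself comes from the text above.

```r
# Toy ratings and review counts; the values and column names are illustrative.
books <- data.frame(
  title = c("A", "B", "C"),
  rating = c(4.5, 3.0, 4.0),
  reviewCount = c(100, 400, 50)
)

# Scale each rating by the best rating in the data, then weight by review count.
books$success <- (books$rating / max(books$rating)) * books$reviewCount

# Show the books from most to least successful.
print(books[order(-books$success), ])
```

Book B comes out on top despite having the lowest rating, which shows how heavily the review count weighs in this metric.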

After much trial-and-error, standard linear regression was selected as the prediction method, and after even more trial-and-error, the following model was produced.

success ~ titleWordCount + Fantasy + Romance + Mystery + Historical + Contemporary + Thriller + `Young Adult` + `Science Fiction` + negative + positive + anger + joy

This model factored in word-count metrics, genre, and the most-impactful elements of sentiment. To train, the data was split into 80% training and 20% testing.
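
For readers who want to see the shape of this step, below is a minimal sketch of an 80/20 split followed by a linear fit. The data is synthetic and only a handful of the predictors from the formula above are included; the exact splitting code used in the project may differ.

```r
set.seed(42)  # make the random split reproducible

# Synthetic stand-in for cleaned.books.df with a few of the model's predictors.
n <- 200
books <- data.frame(
  titleWordCount = sample(1:8, n, replace = TRUE),
  Fantasy = rbinom(n, 1, 0.3),
  positive = rpois(n, 5),
  negative = rpois(n, 4)
)
# Made-up relationship plus noise, purely so there is something to fit.
books$success <- 50 + 10 * books$Fantasy + 2 * books$positive + rnorm(n, sd = 20)

# Hold out a random 20% of rows for testing; train on the remaining 80%.
train.idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- books[train.idx, ]
test <- books[-train.idx, ]

# Fit an ordinary least-squares model on the training rows only.
fit <- lm(success ~ titleWordCount + Fantasy + positive + negative, data = train)

# Predict success for the held-out rows.
test.pred <- predict(fit, newdata = test)
print(coef(fit))
```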

In order to assess the performance of the model, we compare it to a naive model that predicts the mean success score every single time. By taking the RMSE of the naive model's performance and comparing it to the RMSE of this model, we find that this model performs roughly 4.91% better than the naive model. Evidently, there's a lot of room for further analysis, but it was an interesting thought experiment nevertheless.
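
The comparison can be sketched as follows. All numbers here are invented, and the percent-improvement formula (relative reduction in RMSE) is an assumption about how such a figure is typically computed, since the exact calculation is not spelled out above.

```r
# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative numbers only; these are not the project's data.
train.success <- c(10, 20, 30, 40)
test.success <- c(12, 18, 33, 41)
model.pred <- c(14, 19, 30, 38)  # hypothetical model predictions

# Naive baseline: always predict the mean success seen in training.
naive.pred <- rep(mean(train.success), length(test.success))

naive.rmse <- rmse(test.success, naive.pred)
model.rmse <- rmse(test.success, model.pred)

# Percent improvement: how much lower the model's RMSE is, relative to the baseline.
improvement <- 100 * (naive.rmse - model.rmse) / naive.rmse
print(c(naive = naive.rmse, model = model.rmse, improvement = improvement))
```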

C. Conclusion and Looking Forward

From the results, it's clear that while the data does show some interesting trends regarding overall preference, genres, and sentiment, the question of whether or not it's possible to predict a book's reception from the limited information available in its synopsis is leaning heavily towards a no. Books are complex, synopses are short - so, if you really want to judge a book, it's probably best to just read it.

Looking forward, it would be nice to run a more in-depth analysis using a much larger dataset: the one utilized for this project is inherently biased in that it only contains books that have attained a baseline success level (if they're on Goodreads' bestseller list, they must all be above average in some way). Since there's so much variance in synopses, a larger sample size may help expose further trends. Another useful addition might be to add a script capable of parsing the actual cover image and breaking it down into components to actually attempt "judging a book by its cover".

Furthermore, assuming the data could be obtained, it would be quite intriguing to run this using a book's actual text instead of its synopsis and to examine story arcs using sentiment.

D. Appendix

Full code + both Shiny apps: GitHub

Scraped and cleaned dataset: Drive

can you judge a book by its cover?

Introduction

Everyone knows you can't judge a book by its cover... or can you? Many authors spend hours and hours writing the perfect synopsis in the hope that potential readers will find themselves intrigued enough by those few lines on the back of a book jacket to take the leap. Let's find out if all their effort is worth it.

Data

Several potential data sources were considered for this project, including Amazon, the NYT Bestsellers list, and Goodreads. After further consideration, I opted to use Goodreads' comprehensive database alongside its custom bestseller list. Goodreads is one of the most renowned book-review websites, with over 2 billion books registered. Its custom bestseller list stood out because it lists readers' favorite books by publication year as opposed to year of popularity. Since the goal of this project was not only to assess the changing trends of the industry over time but also to predict more generally what makes particular books successful, Goodreads' database was the right choice: aside from publication years, it also offered full synopses, page counts, and ratings - providing us with more than enough data to dive in.

Objective

To examine reader preferences, analyze how they've changed over time, and predict reception for new material.

A. Data Collection and Processing

When you step into any bookstore, you probably assess a couple things before choosing your next adventure. You look down at the book in your hand, read the title - does it fit with the perceived genre? Does the book seem ridiculously long? And more than anything - what's it about?

This experience can be broken down into a variety of factors and dimensions. Luckily, the Goodreads data has them all.

Factors

Title*
- Are pithy titles better than longer titles?
Synopsis
- Does the overall sentiment of a synopsis matter? Do the specific words used matter?
Page Count
- Are longer books received better? Are shorter ones?

*text/sentiment analysis was considered here, but ultimately discounted since titles in literature are even more variable than synopses

Dimensions

Genre
- Which genres are most-popular?
Publication Year
- How have reader preferences varied over time?

With regards to the model to be discussed later, some important metrics for success conveniently available from Goodreads are as follows.

Success Metrics

Rating
- Out of 5 stars - did people like it?
Rating Count
- Number of people who thought a particular book was worth their time to click on the rating bar
Review Count
- Number of people who felt strongly enough about a particular book to write an actual review

How many books are we looking at? Initially, the plan was to take the top 200 published each year from years 1919-2019 in order to obtain a huge breadth of data. However, since each of these books would necessitate a page load taking up roughly 1 second (with an additional 2 seconds of wait time between requests so as not to inundate the server), it was determined that this would take roughly 200*100*(2+1) seconds, or 16.67 hours. Having only one computer, some rescoping was necessary - in the end, I settled on the top 50 books published each year from 1989-2019 (inclusive), giving us ~1550 rows of comprehensive data: not quite as breadthy, but still good enough and requiring only 1.3 hours of scraping.
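
The back-of-the-envelope timing can be reproduced in a few lines. The function name scrape.hours is just a label for this sketch; the per-book timings and book counts are the ones quoted above.

```r
# Seconds per book: roughly 1 s page load plus a 2 s polite delay.
seconds.per.book <- 1 + 2

# Total scraping time in hours for a given number of books per year and years.
scrape.hours <- function(books.per.year, n.years) {
  books.per.year * n.years * seconds.per.book / 3600
}

original.plan <- scrape.hours(200, 100)  # top 200 per year over roughly 100 years
n.years.final <- length(1989:2019)       # 1989-2019 inclusive
final.plan <- scrape.hours(50, n.years.final)

print(round(c(original = original.plan, final = final.plan), 2))
print(50 * n.years.final)  # number of rows in the final dataset
```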

All data was collected using R, though the WebScraper Chrome extension and the built-in Inspect Element tool were excellent resources for determining the tags.

Please refer to dp2.Rmd in the Appendix for the full, commented source.

At this point, all the data necessary to continue was contained within raw.books.df. However, there was still a lot that needed to be done to make everything palatable: the relevant code can be found again in the Appendix, but in summary:

Get pubYear strings in multiple formats (ex. "Published 2004 (first published 1994)") into neat single years.
Clean + strip out random characters from synopses.
Use regular expressions to clean non-word characters from ratings, rating counts, review counts, and page counts.
Concatenate all genre strings into one huge string of the form "Fiction, Fantasy, Romance, Science Fiction, etc.", split the mega-string by commas into a vector, identify the top 10 genres in the data via the table() function, then add a binary variable for each of those genres as a new column, and finally flag every book based on whether or not the genres show up in their original genre string.
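
The steps above can be sketched in base R on a toy data frame. This is an illustration rather than the code in dp2.Rmd: the column names are assumptions, the example keeps the first four-digit year in each string (the summary does not say which year is kept when two appear), and it strips non-digits rather than non-word characters for simplicity.

```r
# Toy stand-in for raw.books.df; the column names are assumptions.
raw <- data.frame(
  pubYear = c("Published 2004 (first published 1994)", "Published May 2011"),
  ratingCount = c("1,234 ratings", "56 ratings"),
  genres = c("Fiction, Fantasy, Romance", "Fiction, Science Fiction"),
  stringsAsFactors = FALSE
)

# Keep the first four-digit number in each string (an assumption, see above).
raw$year <- as.integer(regmatches(raw$pubYear, regexpr("[0-9]{4}", raw$pubYear)))

# Strip every non-digit character, then convert to a number.
raw$ratingCount <- as.numeric(gsub("[^0-9]", "", raw$ratingCount))

# Pool all genre strings, split on commas, and count how often each genre appears.
all.genres <- trimws(unlist(strsplit(paste(raw$genres, collapse = ", "), ",")))
genre.counts <- sort(table(all.genres), decreasing = TRUE)
top.genres <- names(genre.counts)[seq_len(min(10, length(genre.counts)))]

# Split each book's own genre string so genres are matched exactly
# (so "Fiction" is not confused with "Science Fiction").
book.genres <- lapply(strsplit(raw$genres, ","), trimws)

# Add one 0/1 column per top genre.
for (g in top.genres) {
  raw[[g]] <- as.integer(vapply(book.genres, function(x) g %in% x, logical(1)))
}

print(raw[, c("year", "ratingCount", top.genres)])
```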
