*NOTE: This post contains interactive charts, which are best viewed on a large screen.*


In this post, I analyse Goodreads’s Goodbooks-10k dataset. Goodreads is the most popular website for readers to share book reviews and maintain reading lists. As of 2020, Goodreads has more than 90 million users. The dataset contains 6 million user ratings for the 10,000 most popular books.

I conducted the analysis in Spark on Amazon’s Elastic MapReduce (EMR). The visualisation was done in Plotly.

For the code, see my GitHub repo.

The most popular book according to the number of ratings is The Hunger Games, followed by Harry Potter. In the top 10, Pride and Prejudice is the oldest book, published more than 100 years before any of the others.

[Chart: top Goodreads books by number of ratings]


Most unfinished

Have you ever felt the guilt of not finishing something you started? You are not alone. Here, I list the books that readers most often tag as “unfinished”, “just-cant-do-it”, “half-finished”, and the like.

  1. Catch-22, Joseph Heller
  2. A Game of Thrones (A Song of Ice and Fire, #1), George R.R. Martin
  3. The Book Thief, Markus Zusak
  4. Anna Karenina, Leo Tolstoy
  5. Lolita, Vladimir Nabokov
  6. American Gods (American Gods, #1), Neil Gaiman
  7. Jonathan Strange & Mr Norrell, Susanna Clarke
  8. Pride and Prejudice, Jane Austen
  9. Fifty Shades of Grey, E.L. James
  10. 1984, George Orwell
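A ranking like the one above can be sketched in plain Python as a filtered count. This is only an illustration, not the Spark code from the repo: the `(title, tag, count)` input shape, the function name, and the toy data are assumptions, and only the three tag names quoted in the post are included.

```python
from collections import Counter

# Tags treated as "did not finish" signals. These three are quoted in the
# post; the full list used in the original analysis is not known.
UNFINISHED_TAGS = {"unfinished", "just-cant-do-it", "half-finished"}

def most_unfinished(book_tags, top_n=10):
    """Rank books by how often readers applied an 'unfinished' tag.

    book_tags: iterable of (book_title, tag_name, count) tuples, where
    count is how many readers applied that tag to that book (assumed shape).
    """
    totals = Counter()
    for title, tag, count in book_tags:
        if tag.lower() in UNFINISHED_TAGS:
            totals[title] += count
    return totals.most_common(top_n)

# Toy data, invented purely for illustration.
sample = [
    ("Book A", "unfinished", 120),
    ("Book A", "fantasy", 900),
    ("Book B", "half-finished", 40),
    ("Book B", "just-cant-do-it", 95),
    ("Book C", "classics", 300),
]
ranking = most_unfinished(sample)
```

Raw counts favour widely read books, so a variant could divide each total by the book's overall tag count to get a rate instead.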

Most controversial

These are the books with the highest variance in their ratings. In other words, readers tend to give them either very high or very low ratings.

[Chart: most controversial books by rating variance]


So religious texts, Twilight, and Fifty Shades of Grey are the most polarising books. Who would have guessed?
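The "controversial" score is just the per-book variance of star ratings. Here is a minimal sketch in plain Python, assuming ratings arrive as `(book, stars)` pairs. The `min_ratings` cut-off and the toy data are assumptions for illustration and are not taken from the original Spark job.

```python
from collections import defaultdict
from statistics import pvariance

def rating_variance(ratings, min_ratings=2):
    """Return (book, variance) pairs sorted from most to least polarising.

    ratings: iterable of (book, stars) pairs.
    min_ratings: hypothetical cut-off so that books with very few ratings
    do not dominate the ranking.
    """
    by_book = defaultdict(list)
    for book, stars in ratings:
        by_book[book].append(stars)
    scores = {
        book: pvariance(stars)          # population variance of the stars
        for book, stars in by_book.items()
        if len(stars) >= min_ratings
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data, invented for illustration: one love-it-or-hate-it book and
# one book everybody rates the same.
sample = [("Polarising", s) for s in (1, 5, 1, 5)]
sample += [("Consensus", s) for s in (4, 4, 4, 4)]
result = rating_variance(sample)
```

A book rated only 1 or 5 stars in equal measure has a mean of 3 and a variance of 4, which is the largest value possible on a 1 to 5 scale.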

The book network

Next, I turn the dataset into a graph problem. Studying the relationship between books via the readers’ preference, we can identify relationships that may not be obvious.

In the graph below, two books are linked if they share more than 2,000 unique readers who gave both of them a 5-star rating. You can hover over the circles to see the titles.
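The linking rule can be sketched in plain Python as a pair count over each reader's 5-star books. This is a small in-memory illustration under assumed inputs, `(user_id, book, stars)` triples, and not the Spark implementation. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(ratings, min_shared=2000, stars_required=5):
    """Link two books when more than min_shared readers gave both top marks.

    ratings: iterable of (user_id, book, stars) triples (assumed shape).
    Returns {(book_a, book_b): shared_reader_count} with book_a < book_b.
    """
    favourites = defaultdict(set)       # user -> books they rated 5 stars
    for user, book, stars in ratings:
        if stars == stars_required:
            favourites[user].add(book)  # a set keeps each reader unique

    shared = defaultdict(int)
    for books in favourites.values():
        # Every pair of a reader's favourites counts once for that reader.
        for a, b in combinations(sorted(books), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n > min_shared}

# Toy data, invented for illustration, with a threshold of 1 instead of 2000.
sample = [
    (1, "HP1", 5), (1, "HP2", 5), (1, "1984", 3),
    (2, "HP1", 5), (2, "HP2", 5),
    (3, "HP1", 5), (3, "1984", 5),
]
edges = build_edges(sample, min_shared=1)
```

Counting pairs per reader grows quadratically with the number of favourites a reader has, which is one reason a distributed engine helps at the scale of 6 million ratings.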

In the centre are seven dark green circles - the seven Harry Potter books, along with To Kill a Mockingbird and The Hunger Games. Let’s call this the Mainstream Centre.


At the bottom left, you find the notorious Game of Thrones (A Song of Ice and Fire) series. It connects back to the centre via the first book in the series.

At the top, the Tolkien works are clustered together and linked to the Mainstream Centre via The Hobbit. In network analysis, such connecting nodes are often called gatekeepers, since they form the gateway between different groups.
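One way to make the gatekeeper idea concrete is to look for nodes whose removal splits the graph into more pieces (articulation points, or cut vertices). This is one possible reading of "gatekeeper", not necessarily the definition used for the chart. The brute-force approach below only suits small graphs, and the edges in the example are invented.

```python
def count_components(adjacency, skip=None):
    """Count connected components, optionally pretending `skip` is removed."""
    seen = set()
    components = 0
    for start in adjacency:
        if start == skip or start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:                     # depth-first walk of one component
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour != skip and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components

def gatekeepers(adjacency):
    """Nodes whose removal increases the number of connected components."""
    base = count_components(adjacency)
    return sorted(
        node for node in adjacency
        if count_components(adjacency, skip=node) > base
    )

def to_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

# Invented edges for illustration: a small "centre", a Tolkien cluster,
# and The Hobbit sitting between them.
toy_edges = [
    ("Harry Potter 1", "The Hunger Games"),
    ("The Hobbit", "Harry Potter 1"),
    ("The Hobbit", "The Hunger Games"),
    ("The Hobbit", "The Fellowship of the Ring"),
    ("The Hobbit", "The Two Towers"),
    ("The Fellowship of the Ring", "The Two Towers"),
]
found = gatekeepers(to_adjacency(toy_edges))
```

In this toy graph, removing The Hobbit cuts the Tolkien cluster off from the centre, while removing any other title leaves the graph in one piece.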

On the right side, George Orwell’s dystopian classics Animal Farm and 1984 stand in solitude. We also see that Pride and Prejudice is more “mainstream” than Jane Eyre.

These linkages can be used as a recommendation system. We can also help readers get out of their comfort zone by skipping a few intermediate nodes from their favourite books.
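The "skip a few intermediate nodes" idea can be sketched as a breadth-first search that returns the books exactly a given number of hops away from a favourite. This is a minimal illustration in plain Python. The function name, the `hops` parameter, and the toy graph are assumptions.

```python
from collections import deque

def books_at_distance(adjacency, start, hops):
    """Return the books exactly `hops` links away from `start`.

    adjacency: dict mapping each book to the set of books it is linked to.
    hops=1 gives direct neighbours (safe recommendations); larger values
    reach further out of the reader's comfort zone.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue                     # no need to look beyond the target ring
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return sorted(book for book, d in distance.items() if d == hops)

# Toy chain, invented for illustration: A - B - C - D, plus A - E.
toy = {
    "A": {"B", "E"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"A"},
}
close = books_at_distance(toy, "A", 1)
stretch = books_at_distance(toy, "A", 3)
```

Here `close` holds the direct neighbours of A, and `stretch` holds the book that is three links away.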

Let’s add more titles, lower the threshold for forming a connection, and zoom in on two particular books, as shown below.


To Kill a Mockingbird sits at the centre of the more “deep” books. It connects the Mainstream Centre to Shakespeare, John Steinbeck, and others. In other words, To Kill a Mockingbird is the gateway drug to serious literature.

1984 is also near the centre, and it forms a particular link with Fahrenheit 451. We know that both stories deal with a protagonist who lives and fights back in a dystopian world with a dictatorial government, hence the connection.

Without knowledge of the genres or authors of the books, simply by using the preference of readers, we were able to cluster books into groups of similar themes and genres.

Here we add even more titles. I made the graph below zoomable so you can explore the clusters on your own.


Endnotes

Network diagrams like the ones above are a very powerful tool for studying the relationships between entities. As we saw, they can reveal implicit connections between entities, which makes them useful as a recommendation system or a clustering/segmentation tool. They have been used successfully in fraud detection as well as travel planning and optimisation. It is a topic that I would love to dig deeper into when I have time. For now, I need to reduce the number of unfinished books on my shelf!

Bonus

This is what happened when I visualised all the connections among the 10,000 books - an inscrutable mess.