I thought I'd start small and simple. What is an easily available source of data in my life for some preliminary work? The answer was right next to me as I sat at my desk.

I am not a bibliophile by any stretch of the imagination; I try to make good use of the public library when I can, and I'd prefer to avoid spending copiously on books which will be read once and then collect dust. Over time, however, I have amassed a small collection which currently exceeds the capacity of my tiny IKEA bookcase.

I catalogued all the books in my collection and kept track of a few simple characteristics: number of pages, list price, publication year, binding, type (fiction, non-fiction or reference), subject, and whether or not I had read the book from cover-to-cover ("Completed").
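A catalogue like this can be loaded into R as a data frame. The following is only a rough sketch of what that might look like: the column names are taken from the summary output in this post, but the three rows are placeholders invented for illustration and are not entries from my actual `books.csv`.

```r
# Placeholder rows only: these are NOT entries from the real catalogue.
csv_text <- "Pages,Binding,Year,Type,Subject,Price,Completed
320,Softcover,2004,Non-fiction,Math,24.95,Y
210,Softcover,1998,Fiction,Classics,15.00,N
980,Hardcover,2008,Reference,Math,120.00,-"

# stringsAsFactors = TRUE turns the text columns into factors, so that
# summary() reports counts per category instead of just "character"
books <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(books)
summary(books)
```

Reading the categorical columns as factors is what lets `summary()` print counts for Binding, Type, Subject and Completed alongside the quartiles for the numeric columns.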

At the time of cataloguing I had a total of 60 books on my bookshelf. Summary of data:

    > source("books.R")
    [1] "Reading books.csv"
    > summary(books)
         Pages
     Min.   :  63.0
     1st Qu.: 209.5
     Median : 260.0
     Mean   : 386.1
     3rd Qu.: 434.0
     Max.   :1694.0
          Binding        Year               Type              Subject
     Hardcover:21   Min.   :1921   Fiction    :15   Math          :12
     Softcover:39   1st Qu.:1995   Non-fiction:34   Communications: 7
                    Median :2002   Reference  :11   Humour        : 6
                    Mean   :1994                    Coffee Table  : 5
                    3rd Qu.:2006                    Classics      : 4
                    Max.   :2011                    Sci-Fi        : 4
                                                    (Other)       :22
         Price          Completed
     Min.   :  1.00   -:16
     1st Qu.: 16.45   N:13
     Median : 20.49   Y:31
     Mean   : 35.41
     3rd Qu.: 30.37
     Max.   :155.90

Some of this information is a bit easier to interpret if provided in visual form:

Looking at the charts we can see that I'm not really into novels (fiction is only 15 of the 60 books), and that almost 1/5th of my library is reference books, due mainly to textbooks from university I still have kicking around. Of the books which are intended to be read cover-to-cover, I have not finished about 30% (13 of 44). "Not Applicable" refers to books like coffee-table and reference books, which are not intended to be read in their entirety.
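The proportions quoted above can be checked directly from the counts in the summary output. A minimal sketch, with the counts typed in by hand rather than read from the data frame (the variable names are just ones picked for this example):

```r
# Counts copied from the summary(books) output
type_counts      <- c(Fiction = 15, "Non-fiction" = 34, Reference = 11)
completed_counts <- c("-" = 16, N = 13, Y = 31)

# Share of the whole library by type
type_share <- type_counts / sum(type_counts)
round(type_share, 3)

# Unread share among books meant to be read cover-to-cover
# (the "-" / Not Applicable books are excluded from the denominator)
readable     <- completed_counts[c("N", "Y")]
unread_share <- readable[["N"]] / sum(readable)
round(unread_share, 3)
```

With the real data frame loaded, `prop.table(table(books$Type))` should give the same shares without typing the counts in.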

Breaking it down further we look at the division by subject/topic:

Interestingly enough, the topics in my book collection are varied (apparently I am well-read?), with the largest chunks being made up by math (both pop-science and textbooks) and communications (professional development reading in the last year).

Let's take a look at the relationship between the list price of books and other factors.

As expected, there does not appear to be any particular relationship between the publication year of a book and its list price. The outliers near the top of the price range are the textbooks, and the points at the far left of the publication-date axis are Kafka.

A more likely relationship would be that between a book's length and its price, as larger books are typically more costly. Having a look at the data for all the books it appears this could be the case:

We can coarsely fit a trendline to the data:

    > price <- books$Price
    > pages <- books$Pages
    > page_price_line <- lm(price ~ pages)
    > summary(page_price_line)

    Call:
    lm(formula = price ~ pages)

    Residuals:
        Min      1Q  Median      3Q     Max
    -56.620 -13.948  -6.641  -1.508 109.802

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  9.95801    6.49793   1.532    0.131
    pages        0.06592    0.01294   5.096 3.97e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 32.19 on 58 degrees of freedom
    Multiple R-squared:  0.3092,    Adjusted R-squared:  0.2973
    F-statistic: 25.96 on 1 and 58 DF,  p-value: 3.971e-06
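To make the fitted line a little more concrete, here is a small sketch that plugs the reported numbers back in by hand. Every value is copied from the regression output; the variable names are my own, and 260 pages is simply the median book length from the summary.

```r
# Values copied from the summary(page_price_line) output
intercept <- 9.95801
slope     <- 0.06592   # change in list price per additional page
slope_se  <- 0.01294
r_squared <- 0.3092
df_resid  <- 58

# Fitted price for a book of median length (260 pages)
predicted_price <- intercept + slope * 260
predicted_price

# The t value is the estimate divided by its standard error
t_value <- slope / slope_se
t_value

# With a single predictor, F = R^2 / (1 - R^2) * residual degrees of freedom
f_stat <- r_squared / (1 - r_squared) * df_resid
f_stat
```

The fitted price for a 260-page book comes out at roughly 27, and the recomputed t and F values agree with the printed ones up to rounding of the displayed figures.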

The p-value for the slope is tiny (about 4e-06), but the goodness of fit is poor (R-squared of about 0.31). The larger values (in both price and pages) are more dispersed, which suggests some sort of clustering is going on. We re-examine the plot, this time dividing by binding type:

The softcovers make up the majority of the tightly clustered values, while the values for the hardcovers are more spread out. The dashed line is the linear fit for the hardcovers and the solid line is the fit for the softcovers. However, the small number (*n=21*) and dispersion of the hardcover points make even this questionable. That point aside, we can see that on the whole hardcovers appear to be more expensive, as one would expect. This is illustrated in the box plot below:

However, there are a lot of outlying points on the plot. Looking at the scatterplot again, we divide by book type and the picture becomes clearer:

It is clear that the reference books make up the majority of the extreme values, away from those clustered in the lower regions of the plot, and thus could be treated separately.
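Treating the groups separately could look something like the sketch below. To keep it runnable on its own it uses synthetic data generated on the spot, with a made-up pricing rule, so none of the numbers it prints say anything about my actual collection. Only the column names (`Pages`, `Price`, `Type`) follow the real data; `toy_books` is a hypothetical stand-in.

```r
set.seed(1)  # synthetic data for illustration only, not the actual catalogue

n     <- 60
type  <- factor(sample(c("Fiction", "Non-fiction", "Reference"), n, replace = TRUE))
pages <- round(runif(n, 100, 900))

# Made-up pricing rule: reference books cost more per page than the rest
price <- ifelse(type == "Reference", 20 + 0.12 * pages, 8 + 0.03 * pages) +
  rnorm(n, sd = 5)
toy_books <- data.frame(Pages = pages, Price = price, Type = type)

# Fit one straight line per book type and collect the coefficients
fits <- lapply(split(toy_books, toy_books$Type),
               function(d) coef(lm(Price ~ Pages, data = d)))
do.call(rbind, fits)

# Refit with the reference books set aside
non_ref     <- subset(toy_books, Type != "Reference")
non_ref_fit <- lm(Price ~ Pages, data = non_ref)
summary(non_ref_fit)$r.squared
```

On the real data the same pattern would apply with `books` in place of `toy_books`, though with only 11 reference books any separate fit for that group would be rough at best.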

**Closing notes:**

- I did not realize how many non-fiction / general interest / popular reading books have subtitles (*e.g. Zero - The Biography of A Dangerous Idea*) until cataloguing the ones I own. I suppose this is to make them seem more interesting, in the hope that people browsing at bookstores will read the blurb on the back and be enticed to purchase the book.

- Page numbering appears to be completely arbitrary. Where I could, I used the last numbered page of each book. Some books number the last page in the book, others the last full page of text, and still others the last written page before the supplementary material at the back (index, appendix, etc.). The first numbered page also varies, depending on how things like the table of contents, introduction, prologue and copyright notices are counted.

- Textbooks are expensive. Unreasonably so.

- Amazon has metadata for each book, which you can see under "Details" when you view it. I had to look up some things, like price, when they were not listed on the book itself; in these cases I used Amazon's "list price", the crossed-out value at the top of the page for a book. I imagine there is an enormous trove of data there which would lend itself to much more interesting and detailed analysis than I could perform here.