Context:

In 1797, Jean-Auguste-Dominique Ingres travelled to Paris to study in the studio of Jacques Louis David. Ingres noted that one should:

“never use bright colours […] they are anti-historic. It is better to fall into gray than to into bright colours.”

Therefore, I wondered if there was a statistically significant difference between the saturation values used by Ingres and David.

Hypotheses

Null Hypothesis: There is no statistically significant difference between the saturation values used by Ingres and David (95% confidence interval)

Alternative Hypothesis: There is a statistically significant difference between the saturation values used by Ingres and David as a result of a ‘real’ effect

Load Data

Data downloaded from Web Gallery of Art. As of September 2017, there are 44,307 images in the database.

setwd("D:/R projects/Web Gallery of Art/Data")

web_gal <- read.csv('catalog_1.csv',sep=';',stringsAsFactors=F)

Exploration

Before delving into the research question, let’s explore the dataset

str(web_gal)
## 'data.frame':    44307 obs. of  11 variables:
##  $ AUTHOR   : chr  "AACHEN, Hans von" "AACHEN, Hans von" "AACHEN, Hans von" "AACHEN, Hans von" ...
##  $ BORN.DIED: chr  "(b. 1552, Köln, d. 1615, Praha)" "(b. 1552, Köln, d. 1615, Praha)" "(b. 1552, Köln, d. 1615, Praha)" "(b. 1552, Köln, d. 1615, Praha)" ...
##  $ TITLE    : chr  "Allegory" "Bacchus, Ceres and Cupid" "Joking Couple" "Portrait of Emperor Rudolf II" ...
##  $ DATE     : chr  "1598" "-" "-" "1590s" ...
##  $ TECHNIQUE: chr  "Oil on copper, 56 x 47 cm" "Oil on canvas, 163 x 113 cm" "Copperplate, 25 x 20 cm" "Oil on canvas, 60 x 48 cm" ...
##  $ LOCATION : chr  "Alte Pinakothek, Munich" "Kunsthistorisches Museum, Vienna" "Kunsthistorisches Museum, Vienna" "Kunsthistorisches Museum, Vienna" ...
##  $ URL      : chr  "https://www.wga.hu/html/a/aachen/allegory.html" "https://www.wga.hu/html/a/aachen/bacchus.html" "https://www.wga.hu/html/a/aachen/j_couple.html" "https://www.wga.hu/html/a/aachen/rudolf2.html" ...
##  $ FORM     : chr  "painting" "painting" "painting" "painting" ...
##  $ TYPE     : chr  "mythological" "mythological" "genre" "portrait" ...
##  $ SCHOOL   : chr  "German" "German" "German" "German" ...
##  $ TIMEFRAME: chr  "1601-1650" "1601-1650" "1601-1650" "1601-1650" ...

The database is missing the URLs to the paintings. In order to analyse the saturation values, these will have to be added.

web_gal$IMAGE_URL <- gsub("/html/", "/art/", web_gal$URL) 
web_gal$IMAGE_URL <- gsub(".html", ".jpg", web_gal$IMAGE_URL)
web_gal$IMAGE_URL[1:5]
## [1] "https://www.wga.hu/art/a/aachen/allegory.jpg"
## [2] "https://www.wga.hu/art/a/aachen/bacchus.jpg" 
## [3] "https://www.wga.hu/art/a/aachen/j_couple.jpg"
## [4] "https://www.wga.hu/art/a/aachen/rudolf2.jpg" 
## [5] "https://www.wga.hu/art/a/aachen/selfport.jpg"
web_gal$SCHOOL <- as.factor(web_gal$SCHOOL)
web_gal$TYPE <- as.factor(web_gal$TYPE)
web_gal$FORM <- as.factor(web_gal$FORM)

type <- ggplot(data = web_gal, aes(x = reorder(web_gal$SCHOOL, web_gal$SCHOOL, length), fill = TYPE)) + geom_bar(stat = "count") + coord_flip() + ylab('') + xlab("ART SCHOOL")

form <- ggplot(data = web_gal, aes(x = reorder(web_gal$SCHOOL, web_gal$SCHOOL, length), fill = FORM)) + geom_bar(stat = "count") + coord_flip() + ylab('') + xlab("ART SCHOOL")

grid.arrange(type, form, ncol=2)

As probably expected, the large majority of paintings in the database are from the Italian School, with religious being the most popular Type. Furthermore, ‘painting’ is the most popular Form.

Now to see the Date distribution of the database.

web_gal$DATE <- gsub("[^0-9]", "", web_gal$DATE) 
web_gal$DATE <- substr(web_gal$DATE, start = 1, stop = 4)

web_gal$DATE <- as.numeric(web_gal$DATE)

ggplot(data = web_gal, aes(x = DATE, fill = TYPE)) + 
    geom_histogram() + scale_x_continuous(limits = c(min(web_gal$DATE), 2017))

note: 4600 NA DATE values were removed. Again, not surprisingly, the large majority of paintings from ~1000 AD to ~1700 AD are religious.

More analysis could be done regarding Italian paintings, however that will have to be another project.

Ingres and David comparison

For now, let’s subset the paintings by Ingres and David.

dav_ing <- web_gal[web_gal$AUTHOR == "DAVID, Jacques-Louis" | web_gal$AUTHOR == "INGRES, Jean-Auguste-Dominique" & web_gal$FORM == "painting" & !web_gal$TYPE == "other",]

Before downloading the images (using the previously created Image URL), the title of the paintings are extracted, and used to name the file.

title <- dav_ing$TITLE
url <- dav_ing$IMAGE_URL

for (i in 1:length(url)) {
  try(download.file(url[i], destfile= paste("D:/R projects/Web Gallery of Art/Artworks/", title[i], ".jpg", sep = ""), mode="wb"))
  Sys.sleep(2)
}

Downloaded images:

Read each painting and create list containing the Red, Green, and Blue values of each pixel that makes the painting.

downloaded_paintings <- list.files(path = "D:/R projects/Web Gallery of Art/Artworks/", pattern="\\.jpg$")

paintings_list <- list()

for (i in 1:length(downloaded_paintings)) {
  paintings <- readJPEG(paste("D:/R projects/Web Gallery of Art/Artworks/", downloaded_paintings[i], sep=""), native = FALSE)
  paintings_list[[i]] <- paintings 
}

Convert Red, Green, Blue (RGB) values to Hue, Saturation, and Value (HSV). This is done as HSV is the color model that comes closest to mimicking how humans perceive color (https://www.iteea.org/File.aspx?id=86467&v=1beee30d)

paintings_HSV <- lapply(paintings_list, function(paintings_list) apply(paintings_list, c(1,2), RGB2HSV))

In order to identify who the painting is by, the AUTHOR is attached to the name attribute.

downloaded_paintings_gsub <- gsub("\\.jpg$","", list.files(path = "D:/R projects/Web Gallery of Art/Artworks/", pattern="\\.jpg$"))

dav_ing <- dav_ing[dav_ing$TITLE %in% downloaded_paintings_gsub,]
dav_ing <- dav_ing[!duplicated(dav_ing$TITLE),]
dav_ing$AUTHOR <- as.factor(dav_ing$AUTHOR)

names(paintings_HSV) <- dav_ing$AUTHOR

Before looking at Saturation independently, Hue and Saturation will be explored. To do this, the mean Hue and Saturation of each painting will be extracted.

paintings_HSV_mean <- lapply(paintings_HSV, function(paintings_HSV) apply(paintings_HSV, 1, mean))

paintings_HSV_mean_hue <- unlist(lapply(paintings_HSV_mean, "[", 1))
paintings_HSV_mean_sat <- unlist(lapply(paintings_HSV_mean, "[", 2))

Bind HSV values, with information about Artist.

dav_ing_subset_df <- as.data.frame(cbind(dav_ing["AUTHOR"], dav_ing["DATE"], paintings_HSV_mean_hue, paintings_HSV_mean_sat))

str(dav_ing_subset_df)

In order to visualise the mean Hues and Saturations, the values are plotted on a colour spectrum wheel.

d <- expand.grid(h=seq(0,0.95,0.05), s=seq(0,0.9,0.1))
ggplot() +
  coord_polar(theta="x") +  scale_x_continuous(name="Hue", limits=c(0,1), breaks=seq(0.025,0.925,0.1), labels=seq(0,0.9,0.1)) +
  scale_y_continuous(name="Saturation", breaks=seq(0, 1, 0.2)) +
  scale_fill_identity() +
  geom_rect(data=d, mapping=aes(xmin=h, xmax=h+resolution(h), ymin=s, ymax=s+resolution(s), fill=hsv(h,s)), color="white", size=0.1) + geom_point(aes(x = dav_ing_subset_df$paintings_HSV_mean_hue, y = dav_ing_subset_df$paintings_HSV_mean_sat, colour = dav_ing_subset_df$AUTHOR), data = dav_ing_subset_df, size = 3) + labs(colour = "Artist", title = "Mean colours in Ingres' painting similar to David's", subtitle = "Mean Hue and Saturation used by David and Ingres") +  theme_tufte() 

From this view, it does appear that Ingres uses less saturated mean values in his historical paintings. However, the use of mean values limits futher statistical analysis. Furthermore, mean does not allow for distribution to be interrogated. Therefore, further analysis will look at the individual pixel Saturation values of each painting.

Ingres and David pixel-by-pixel Saturation

In order to do this, the saturation values for each pixel in each painting will be extracted for both Ingres and David.

sublist_dav <- paintings_HSV[grep("DAVID, Jacques-Louis", names(paintings_HSV))]
paintings_dav_sat <- lapply(sublist_dav, '[', 2,,)
dav_sat_vector <- unlist(paintings_dav_sat)
dav_sat_vector_random <- sample(dav_sat_vector, 10000000)
paintings_ing <- paintings_HSV[grep("INGRES, Jean-Auguste-Dominique", names(paintings_HSV))]
paintings_ing_sat <- lapply(paintings_ing, '[', 2,,)
ing_sat_vector <- unlist(paintings_ing_sat)
ing_sat_vector_random <- sample(ing_sat_vector, 10000000)

Due to computational reasons, 10 million random Saturation values from both Ingres and David’s paintings were extracted.

From the boxplot, it is apparant that the median saturation value for Ingres is less than the median saturation value for David.

ing_plot <- ggplot() + geom_histogram(aes(x = ing_sat_vector_random), bins = '30', fill = gray.colors(30, start =1, end = 0)) + theme_hc(bgcolor = "darkunica") + scale_y_continuous(labels = comma, limits=c(0, 1000000)) + labs(x = "Saturation in Ingres' Paintings", y = "Count") + theme(legend.position="none")

dav_plot <- ggplot() + geom_histogram(aes(x =dav_sat_vector_random), bins = '30', fill = gray.colors(30, start =1, end = 0)) + theme_hc(bgcolor = "darkunica") + scale_y_continuous(labels = comma, limits=c(0, 1000000))+ labs(x = "Saturation in David's Paintings", y = "Count") + theme(legend.position="none", axis.title.y=element_blank())

grid.arrange(ing_plot, dav_plot, ncol=2)

Statistical Assessment of Saturation Values

To further check whether the saturation values are normally distributed, the skewness will be examined. If the values are normally ditributed, the skewness should be less than half of the standard error.

Ingres Skewness: 0.5815511 Ingres standard error: 6.743155e-05

David Skewness: 0.3710303 David standard error: 7.011028e-05

The positive skewness value implies that the distribution of saturation values for both Ingres and David is skewed to the right. This is evident from looking at the histograms above.

Due to the skewness values being more than twice the standard error, both the saturation values for Ingres and David are not symmetric, and therefore not normally distributed. Due to the distribution being non-parametric, the Mann-Whitney-Wilcoxon Test was used.

wilcox.test(ing_sat_vector_random, dav_sat_vector_random) 
    Wilcoxon rank sum test with continuity correction

data:  ing_sat_vector_random and dav_sat_vector_random
W = 4.8249e+13, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

As the p-value is p < 0.001, which is less than the .05 significance level, we reject the null hypothesis. Therefore, we can conclude that the Ingres used less saturated values relative to David.

Ingres and David pixel-by-pixel Hue

Now that we know that the difference in saturation values used by Ingres and David are statistically significant, let’s do the same for Hue values.

sublist_dav <- paintings_HSV[grep("DAVID, Jacques-Louis", names(paintings_HSV))]
paintings_dav_hue <- lapply(sublist_dav, '[', 1,,)
dav_hue_vector <- unlist(paintings_dav_hue)
dav_hue_vector_random <- sample(dav_hue_vector, 10000000)
paintings_ing <- paintings_HSV[grep("INGRES, Jean-Auguste-Dominique", names(paintings_HSV))]
paintings_ing_hue <- lapply(paintings_ing, '[', 1,,)
ing_hue_vector <- unlist(paintings_ing_hue)
ing_hue_vector_random <- sample(ing_hue_vector, 10000000)
ing_plot_hue <- ggplot() + geom_histogram(aes(x = ing_hue_vector_random), bins = '30', fill = rainbow(30)) + theme_hc(bgcolor = "darkunica") + scale_y_continuous(labels = comma, limits=c(0, 2500000)) + labs(x = "Hue in Ingres' Paintings", y = "Count") + theme(legend.position="none")

dav_plot_hue <- ggplot() + geom_histogram(aes(x =dav_hue_vector_random), bins = '30', fill = rainbow(30)) + theme_hc(bgcolor = "darkunica") + scale_y_continuous(labels = comma, limits=c(0, 2500000))+ labs(x = "Hue in David's Paintings", y = "Count") + theme(legend.position="none", axis.title.y=element_blank())

grid.arrange(ing_plot_hue, dav_plot_hue, ncol=2)

Statistical Assessment of Hue Values

wilcox.test(ing_sat_vector_random, dav_sat_vector_random) 
    Wilcoxon rank sum test with continuity correction

data:  ing_hue_vector_random and dav_hue_vector_random
W = 4.8249e+13, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

As the p-value is p < 0.001, which is less than the .05 significance level, we reject the null hypothesis. Therefore, we can conclude that the Ingres used different hues than David in his paintings.

Artist prediction based on Saturation and Hue values

From the above statistical tests, we have deduced that both the saturation and hue values used by Ingres and David are statistically different. Therefore, we will explore the possibility of using the saturation and hue values to predict which artist used the colour.

Let’s first explore the relationship between hue and saturation for both artists. First, let’s create a dataframe.

dav_scatter <- as.data.frame(dav_sat_vector_random)
dav_scatter$Artist <- 'DAVID, Jacques-Louis'
dav_scatter <- cbind(dav_hue_vector_random, dav_scatter)
colnames(dav_scatter) <- c("Hue","Saturation", "Artist")

ing_scatter <- as.data.frame(ing_sat_vector_random)
ing_scatter$Artist <- 'INGRES, Jean-Auguste-Dominique'
ing_scatter <- cbind(ing_hue_vector_random, ing_scatter)
colnames(ing_scatter) <- c("Hue","Saturation", "Artist")

combined_scatter <- rbind(dav_scatter, ing_scatter)

In order to be able plot, let’s reduce the number of rows to 75,000. By using sample to do the split, Ingres and David values are roughly 50/50 in quantity.

combined_scatter_random <- combined_scatter[sample(nrow(combined_scatter), 750000, replace = FALSE, prob = NULL),]
nrow(combined_scatter_random[combined_scatter_random$Artist == 'DAVID, Jacques-Louis',])
nrow(combined_scatter_random[combined_scatter_random$Artist == 'INGRES, Jean-Auguste-Dominique',])
dav_sc <- ggplot(combined_scatter_random[combined_scatter_random$Artist == 'DAVID, Jacques-Louis',], aes(Hue, Saturation)) + geom_point(alpha = 1/10) + labs(title = "David, Jacques-Louis")

ing_sc <- ggplot(combined_scatter_random[combined_scatter_random$Artist == 'INGRES, Jean-Auguste-Dominique',], aes(Hue, Saturation)) + geom_point(alpha = 1/10) + labs(title = "Ingres, Jean-Auguste-Dominique")

grid.arrange(ing_sc,dav_sc, ncol=2)

Let’s see how they look plotted together

ggplot(combined_scatter_random, aes(Hue, Saturation)) + geom_point(aes(colour = Artist))

Prediction preparation

Split into train and test data. 60% train, 40% test.

samp <- sample(nrow(combined_scatter_random), 0.6 * nrow(combined_scatter_random))
train <- combined_scatter_random[samp, ]
test <- combined_scatter_random[-samp, ]
model <- randomForest(as.factor(Artist) ~ Hue + Saturation, data = train, importance = TRUE, ntree = 10)

From the importance plot, the saturation has greater effect on accuracy were it to be removed, compared to Hue.

varImpPlot(model)

Prediction!

Prediction <- predict(model, test)
table(Prediction, test$Artist)

Prediction                       DAVID, Jacques-Louis INGRES, Jean-Auguste-Dominique
  DAVID, Jacques-Louis                          83324                          48340
  INGRES, Jean-Auguste-Dominique                66494                         101842

Random Forest predicts with 62% accurracy.