Let us explore some common geoms. See also OrgPad and the ggplot2 cheatsheet by RStudio.
library(tidyverse)
A large dataset at a glance:
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.30, 0.23, 0.22, 0...
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Very Good, Fair, Ver...
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I, E, H, J, J, G, I...
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, SI1, SI2, SI2, I1...
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64.0, 62.8, 60.4, 6...
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58, 54, 54, 56, 59,...
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 342, 344, 345, 345,...
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.25, 3.93, 3.88, 4...
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.28, 3.90, 3.84, 4...
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.73, 2.46, 2.33, 2...
summary(diamonds)
carat cut color clarity depth table
Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00 Min. :43.00
1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00 1st Qu.:56.00
Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80 Median :57.00
Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75 Mean :57.46
3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50 3rd Qu.:59.00
Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00 Max. :95.00
J: 2808 (Other): 2531
price x y z
Min. : 326 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
Median : 2401 Median : 5.700 Median : 5.710 Median : 3.530
Mean : 3933 Mean : 5.731 Mean : 5.735 Mean : 3.539
3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
Max. :18823 Max. :10.740 Max. :58.900 Max. :31.800
geom_histogram for one continuous variable
diamonds %>%
ggplot() + geom_histogram(mapping = aes(x = price, y = after_stat(count)),
binwidth = 500) # each bin spans 500 USD; after_stat() replaces the deprecated ..count.. notation
Help: computed variables
count: number of points in bin
density: density of points in bin, scaled to integrate to 1
ncount: count, scaled to maximum of 1
ndensity: density, scaled to maximum of 1
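To see what these computed variables contain, the values behind a histogram can be pulled out with layer_data(). The following is a minimal sketch rather than part of the original material; the object names p_density and hist_data are only illustrative.

```r
library(ggplot2)

# Same histogram as above, but with density instead of count on the y axis
p_density <- ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price, y = after_stat(density)),
                 binwidth = 500)

# layer_data() returns the data frame the stat computed for the layer,
# including count, density, ncount and ndensity for every bin
hist_data <- layer_data(p_density)
head(hist_data[, c("xmin", "xmax", "count", "density", "ncount", "ndensity")])

# density multiplied by bin width, summed over all bins
sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
```

Swapping density for ncount or ndensity in after_stat() rescales the y axis in the same way without changing the shape of the histogram.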
geom_bar for one discrete variable
diamonds %>%
ggplot() + geom_bar(mapping = aes(x = clarity))
geom_bar for two discrete variables. Position
diamonds %>%
ggplot() + geom_bar(mapping = aes(x = clarity, fill = cut), position = "stack")
diamonds %>%
ggplot() + geom_bar(mapping = aes(x = clarity, fill = cut), position = "fill")
diamonds %>%
ggplot() + geom_bar(mapping = aes(x = clarity, fill = cut), position = "dodge")
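The three positions draw the same counts in different arrangements. As a rough illustration of what position = "fill" displays, the share of each cut within each clarity level can be computed by hand; this is a sketch with dplyr, and the names cut_share and share are assumptions made for the example.

```r
library(dplyr)
library(ggplot2)

# Count every clarity/cut combination, then turn the counts into
# shares within each clarity level (what position = "fill" shows)
cut_share <- diamonds %>%
  count(clarity, cut) %>%
  group_by(clarity) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

cut_share
```

With position = "stack" the bar segments have height n, with position = "fill" they have height share, and with position = "dodge" the n values are drawn side by side.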
geom_point and its derived plots
A good dataset to demonstrate geom_point() is diamonds, because it has many observations and we will have to deal with overplotting (too many points in the exact same place).
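One way to get a feeling for the amount of overplotting is to compare the number of rows with the number of distinct carat/price positions. This is a small sketch, not part of the original notes; n_points and n_positions are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Number of diamonds in the dataset
n_points <- nrow(diamonds)

# Number of distinct (carat, price) positions they occupy
n_positions <- diamonds %>%
  distinct(carat, price) %>%
  nrow()

# Points that land exactly on top of another point
n_points - n_positions
```

Points that share a position exactly are only part of the problem: points that are merely close to each other also merge visually once they are drawn with a visible size.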
geom_point() shows the relation between two continuous variables, and optionally further variables mapped to other aesthetics, as seen below.
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(stat = "identity", position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE) # default parameter values written out explicitly
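A third variable can be added by mapping it to another aesthetic. The sketch below maps clarity to colour; the choice of clarity and the name p_colour are assumptions made for illustration.

```r
library(ggplot2)

# Scatter plot of price against carat, with clarity mapped to colour
p_colour <- ggplot(diamonds,
                   mapping = aes(x = carat, y = price, colour = clarity)) +
  geom_point()

p_colour
```

Because colour is set inside aes(), ggplot2 adds a legend and picks one colour per clarity level.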
Deal with overplotting
alpha
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(stat = "identity", position = "identity", alpha = 0.1) #alpha added
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_point(stat = "identity", position = "identity", shape = ".") # smallest visible point shape, about one pixel
geom_hex
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_hex(bins = 20) + # geom_hex() needs the hexbin package to be installed
scale_x_continuous(breaks = seq(from = 0,to = 5, by = 0.5)) +
scale_y_continuous(breaks = seq(from = 0, to = max(diamonds$price), by = 1000))
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_density_2d_filled(contour_var = "count", binwidth = 10) +
scale_x_continuous(breaks = seq(from = 0,to = 5, by = 0.5)) +
scale_y_continuous(breaks = seq(from = 0, to = max(diamonds$price), by = 1000))
diamonds %>%
ggplot(mapping = aes(x = carat, y = price)) +
geom_smooth(method = "gam")
jitter when the dataset is smaller
The table variable has fewer distinct values than the other numeric variables.
diamonds %>%
ggplot() + geom_histogram(mapping = aes(x = table, y = after_stat(count)), binwidth = 1)
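The statement about table can be checked by counting the distinct values in every numeric column. This is a quick sketch with dplyr; distinct_counts is an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Number of distinct values in each numeric column of diamonds
distinct_counts <- diamonds %>%
  summarise(across(where(is.numeric), n_distinct))

distinct_counts
```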
Plotted against carat, table is going to be overplotted: the plot cannot possibly show over 53,000 separate points, because table increments by 1.
diamonds %>%
ggplot() + geom_point(mapping = aes(x = table, y = carat), shape = ".")
diamonds %>%
ggplot() + geom_point(mapping = aes(x = table, y = carat),
position = position_jitter(width = 0.6, height = 0.05, seed = 1222),
shape = ".")
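The seed argument makes the random displacement reproducible, so the jittered plot looks the same every time it is drawn. The sketch below builds the layer data twice and compares the results; p_jitter, first_draw and second_draw are names invented for this example.

```r
library(ggplot2)

# Same jittered scatter plot as above, stored in an object
p_jitter <- ggplot(diamonds) +
  geom_point(mapping = aes(x = table, y = carat),
             position = position_jitter(width = 0.6, height = 0.05,
                                        seed = 1222),
             shape = ".")

# Build the layer twice; with a fixed seed both builds should
# place every point at the same jittered coordinates
first_draw  <- layer_data(p_jitter)
second_draw <- layer_data(p_jitter)
identical(first_draw$x, second_draw$x)
identical(first_draw$y, second_draw$y)
```

The width and height values are a judgment call: a width of 0.6 spreads points over slightly more than the gap between neighbouring table values, which hides the banding but also blurs the exact values.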
geom_count for two discrete variables
diamonds %>%
ggplot(mapping = aes(x = cut, y = clarity)) +
geom_count()
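The point sizes drawn by geom_count() correspond to the number of observations for each cut/clarity pair. As a sketch of that correspondence, the same numbers can be obtained with dplyr::count() and compared with the layer data; pair_counts, p_count and count_data are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Number of diamonds for every combination of cut and clarity
pair_counts <- diamonds %>%
  count(cut, clarity)

# The plot from above, stored so that its computed data can be inspected
p_count <- ggplot(diamonds, mapping = aes(x = cut, y = clarity)) +
  geom_count()

# The stat behind geom_count() stores the count per position in column n
count_data <- layer_data(p_count)

sum(pair_counts$n)
sum(count_data$n)
```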