Penguins Dataset Overview – iris alternative in R (2024)

If there’s one dataset that data scientists and data analysts reach for most often while learning or teaching, it’s either iris (mostly R users) or titanic (mostly Python users).

The iris dataset isn’t popular just because it’s easily accessible; it can also be used to demonstrate many data science concepts such as correlation, regression and classification.

The objective of this post is to introduce you to the penguins dataset and get you started with a few code snippets so that you can take off on your own!

We now have another iris-like dataset, this time about penguins, thanks to Allison Horst, who packaged it as the R package palmerpenguins under a CC-0 license.


Installation

palmerpenguins is available on CRAN, so install.packages("palmerpenguins") works; you can also install the development version from GitHub:

remotes::install_github("allisonhorst/palmerpenguins")

Accessing Data

After a successful installation, you’ll find that the package ships two datasets: penguins and penguins_raw. You can check their help pages (for example ?penguins_raw) to learn more about each of them.
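A quick way to compare the two datasets is to look at their dimensions. This is a minimal sketch that assumes the package is already installed; the variable names dim_clean and dim_raw are only for illustration.

```r
library(palmerpenguins)

# penguins is the simplified version, penguins_raw keeps the original fields
dim_clean <- dim(penguins)
dim_raw <- dim(penguins_raw)

dim_clean
dim_raw

# column names of the raw data, to see what was simplified away
names(penguins_raw)
```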

Loading Library

library(tidyverse)
library(palmerpenguins)

Meta – Glimpse of penguins dataset

The penguins dataset has the following 7 columns and 344 rows (newer releases of the package add an eighth column, year):

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"

Of the 7 columns, 3 are categorical (species, island, sex) and the rest are numeric.

glimpse(penguins)
## Rows: 344
## Columns: 7
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
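If you’d rather pick out the categorical columns programmatically than read them off the glimpse() output, something like the following base R sketch should do. The names col_types and factor_cols are made up for this example.

```r
library(palmerpenguins)

# first class of every column, e.g. "factor", "numeric" or "integer"
col_types <- sapply(penguins, function(col) class(col)[1])
col_types

# names of the categorical (factor) columns
factor_cols <- names(penguins)[sapply(penguins, is.factor)]
factor_cols
```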

Penguins Data Column Definition

species: a factor denoting penguin species (Adélie, Chinstrap and Gentoo)

island: a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)

bill_length_mm: a number denoting bill length (millimeters)

bill_depth_mm: a number denoting bill depth (millimeters)

flipper_length_mm: an integer denoting flipper length (millimeters)

body_mass_g: an integer denoting body mass (grams)

sex: a factor denoting penguin sex (female, male)
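The factor levels described above can be checked directly. A small sketch, assuming the package is installed; note that the species level is spelled Adelie (without the accent) in the data.

```r
library(palmerpenguins)

# levels of the three categorical columns
species_levels <- levels(penguins$species)
island_levels <- levels(penguins$island)
sex_levels <- levels(penguins$sex)

species_levels
island_levels
sex_levels
```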

Missing Values

One advantage of penguins over iris is that it contains missing values (NA). That matters for a teaching dataset, because handling missing data is part of almost every real analysis.

penguins %>%
  # count the missing values in every column (across() replaces the deprecated funs())
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(cols = everything(), names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) +
  geom_col(fill = 'darkorange') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Penguins - NA Count')
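If you only need the numbers and not the chart, a base R sketch like this gives the same NA counts per column. The names na_counts and complete_rows are just illustrative.

```r
library(palmerpenguins)

# number of missing values per column, largest first
na_counts <- sort(colSums(is.na(penguins)), decreasing = TRUE)
na_counts

# number of rows without any missing value
complete_rows <- sum(complete.cases(penguins))
complete_rows
```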


Simple Scatter Plot

Like iris, you can quickly make a scatter plot matrix of all columns using base R’s plot():

plot(penguins)
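The full matrix also includes the categorical columns, which can be hard to read. A possible variation, sketched here with base R’s pairs(), keeps only the four body measurements and colours the points by species. The colour choice and the names measurements and species_cols are assumptions for this example.

```r
library(palmerpenguins)

# keep only the four body measurements
measurements <- penguins[, c("bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g")]

# one colour per species, in factor-level order
species_cols <- c("darkorange", "purple", "cyan4")[as.integer(penguins$species)]

pairs(measurements, col = species_cols, pch = 19)
```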

### Bar Plot

In this bar plot, we visualize the count of each species in the penguins dataset.

penguins %>%
  count(species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(title = 'Penguins Species & Count')
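The counts behind the bars can also be computed without plotting. Here is a minimal base R sketch that additionally shows each species’ share of the dataset; species_counts and species_share are illustrative names.

```r
library(palmerpenguins)

# number of penguins per species
species_counts <- table(penguins$species)
species_counts

# share of each species in percent, rounded to one decimal
species_share <- round(100 * prop.table(species_counts), 1)
species_share
```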

### Bar Plot for each Species

In this bar plot, we visualize the species counts for each sex, using a faceted plot.

penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  facet_wrap(~sex) +
  theme_minimal() +
  labs(title = 'Penguins Species ~ Gender')
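The same sex-by-species counts can be laid out as a contingency table. This sketch uses base R’s complete.cases() in place of drop_na(); the names complete and sex_by_species are made up for the example.

```r
library(palmerpenguins)

# drop rows with any missing value, similar to what drop_na() does
complete <- penguins[complete.cases(penguins), ]

# rows are sexes, columns are species
sex_by_species <- table(complete$sex, complete$species)
sex_by_species
```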


Correlation Matrix

penguins %>%
  select_if(is.numeric) %>%
  drop_na() %>%
  cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
## bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
## flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
## body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000
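As an alternative to dropping rows first, cor() can skip incomplete rows itself via its use argument. A minimal sketch, assuming we only want the four measurement columns; measurements and cor_matrix are illustrative names.

```r
library(palmerpenguins)

measurements <- penguins[, c("bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g")]

# use = "complete.obs" drops rows that have an NA in any of the columns
cor_matrix <- cor(measurements, use = "complete.obs")
round(cor_matrix, 2)

# pull out a single pair of interest
cor_matrix["flipper_length_mm", "body_mass_g"]
```

The strongest relationship in the matrix is the one between flipper length and body mass, which is why the scatter plots that follow focus on that pair.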

Scatter Plot – Penguins Size Relation wrt Species

In this scatter plot, we’ll visualize the relationship between flipper_length_mm and body_mass_g for each species.

library(tidyverse)

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species), size = 3, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme_minimal()
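Since regression is one of the concepts this kind of dataset is good for, here is a rough sketch of a simple linear model on the same two variables. It pools all species together, which is a simplification chosen for this example, and fit is just an illustrative name.

```r
library(palmerpenguins)

# lm() drops rows with missing values by default (na.action = na.omit)
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

coef(fit)                # intercept and slope (grams per mm of flipper)
summary(fit)$r.squared   # share of variance explained by flipper length
nobs(fit)                # number of rows actually used in the fit
```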


Scatter Plot – Penguins Size Relation wrt Island

library(tidyverse)

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = island, shape = species), size = 3, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for each island",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin island",
       shape = "Penguin species") +
  theme_minimal()
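When reading a plot that mixes island and species, it helps to know which species live on which island. A small base R sketch of that cross-tabulation follows; island_by_species is an illustrative name.

```r
library(palmerpenguins)

# rows are islands, columns are species
island_by_species <- table(penguins$island, penguins$species)
island_by_species

# number of different species observed on each island
species_per_island <- rowSums(island_by_species > 0)
species_per_island
```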


References

citation('palmerpenguins')
## 
## To cite palmerpenguins in publications use:
## 
##   Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism
##   and Environmental Variability within a Community of Antarctic
##   Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081.
##   https://doi.org/10.1371/journal.pone.0090081
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)},
##     author = {Gorman KB and Williams TD and Fraser WR},
##     journal = {PLoS ONE},
##     year = {2014},
##     volume = {9(3)},
##     number = {e90081},
##     pages = {-13},
##     url = {https://doi.org/10.1371/journal.pone.0090081},
##   }

Author: Annamae Dooley
