Welcome R - Where to Get Data?

R is a tool for working with data. But where can a beginner find data to practice on?

This post outlines two types of data sources: ready-to-go datasets and synthetic data.

Ready-to-go Datasets

Ready-to-go datasets are datasets which are immediately usable for analysis, visualisation, and machine learning. They are also self-contained and come in standard formats.

Within R, the function data() lists the datasets available in the currently attached packages.

Here is a sample of its output (truncated):

Data sets in package ‘datasets’:

AirPassengers               Monthly Airline Passenger Numbers 1949-1960
BJsales                     Sales Data with Leading Indicator
...

Data sets in package ‘dplyr’:

band_instruments            Band membership
band_instruments2           Band membership
...

Data sets in package ‘ggplot2’:

diamonds                    Prices of over 50,000 round cut diamonds
economics                   US economic time series
...

Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.
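
The listing can also be inspected programmatically. The following is a minimal sketch, assuming only the base 'datasets' package that ships with R; the variable names (listing, available, passengers) are illustrative and not part of the original post:

```r
# data() returns an object whose $results matrix has one row per dataset
listing <- data(package = "datasets")
available <- listing$results[, c("Item", "Title")]
head(available, 3)

# A dataset from an installed package can be accessed by its qualified name
passengers <- datasets::AirPassengers
class(passengers)
summary(passengers)
```

Restricting the listing to one package keeps the output short; passing a different package name works the same way, provided that package is installed.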

Outside R, datasets can be found through various online repositories.

Synthetic Data

Synthetic data is data which is algorithmically generated. Its defining feature is full reproducibility: rerunning the same data-generation algorithm with the same random seed yields the same data.
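
The role of the seed can be checked directly. This is a small sketch using only base R; the variable names are illustrative:

```r
# Two draws made after setting the same seed are identical
set.seed(0)
first_draw <- rnorm(5)

set.seed(0)
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE

# A different seed gives a different sample
set.seed(1)
third_draw <- rnorm(5)

identical(first_draw, third_draw)   # FALSE
```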

Here is a simple example. The data consists of two variables: X and Y. X is sampled from the standard normal distribution, while Y is generated by exponentiating X and adding noise, which is also sampled from the standard normal distribution. The random seed is set to 0:

library(tidyverse)

# Fix the random seed so the generated data is reproducible
set.seed(0)

n <- 100
data <- tibble(
  # X: draws from the standard normal distribution
  x = rnorm(n, mean = 0, sd = 1),
  # Y: exp(X) plus standard normal noise
  y = exp(x) + rnorm(n, mean = 0, sd = 1)
)

# Scatter plot with a straight-line (least-squares) fit overlaid
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "brown") +
  theme_minimal() +
  labs(
    title = "Linear Fit to Exponential Relation",
    x = "X",
    y = "Y"
  ) +
  theme(
    axis.title.y = element_text(angle = 0, vjust = 0.5),
    plot.background = element_rect(fill = "white", colour = "white"),
    plot.margin = margin(30, 30, 30, 30)
  )

The plot produced by the code above shows the relation between X and Y, with a straight-line fit overlaid.
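
Because the generating process is known, the mismatch between a straight line and the exponential relation can also be quantified. The sketch below is an illustration rather than part of the original analysis: it regenerates data by the same recipe in base R and compares the R-squared of a fit on X with that of a fit on exp(X). The names linear_fit, exponential_fit, r2_linear and r2_exponential are assumptions introduced here:

```r
# Regenerate the synthetic data with the same recipe and seed
set.seed(0)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)
y <- exp(x) + rnorm(n, mean = 0, sd = 1)

# Straight-line fit versus a fit on the true functional form
linear_fit <- lm(y ~ x)
exponential_fit <- lm(y ~ exp(x))

# Share of variance in Y explained by each model
r2_linear <- summary(linear_fit)$r.squared
r2_exponential <- summary(exponential_fit)$r.squared
c(linear = r2_linear, exponential = r2_exponential)
```

Knowing the true relation in advance is what makes synthetic data convenient for practice: a model's output can be checked against the process that produced the data.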

Conclusion

This post has outlined where to find ready-to-go datasets, both within R and outside it, and how to generate synthetic datasets. With this covered, nothing should stand in the way of selecting a dataset and honing one's R skills on it.
