Welcome R - Where to Get Data?

R is a tool for working with data. As a beginner in R, where can you find data to practise on?
This post outlines two types of data sources: ready-to-go datasets and synthetic data.
Ready-to-go Datasets
Ready-to-go datasets are datasets which are immediately usable for analysis, visualisation, and machine learning. They are also self-contained and come in standard formats.
Within R, the function data() lists the datasets available in the currently attached packages.
Here is a sample of its output (truncated):
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
...
Data sets in package ‘dplyr’:
band_instruments Band membership
band_instruments2 Band membership
...
Data sets in package ‘ggplot2’:
diamonds Prices of over 50,000 round cut diamonds
economics US economic time series
...
Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.
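As a minimal sketch of how that listing can be used programmatically: the object returned by data() carries a character matrix in its results element, and any listed dataset can then be loaded by name. The variable names (available, listing) are illustrative, and AirPassengers is picked only because it appears in the output above.

```r
# Capture the listing instead of printing it
available <- data()

# The results element is a character matrix with one row per dataset
listing <- as.data.frame(available$results, stringsAsFactors = FALSE)
head(listing[, c("Package", "Item", "Title")])

# Restrict the listing to a single package
datasets_only <- data(package = "datasets")
nrow(datasets_only$results)

# Load one dataset by name and take a first look at it
data("AirPassengers")
class(AirPassengers)
summary(AirPassengers)
```

Assigning the result of data() to a variable avoids opening the listing in a pager, which is handy inside scripts.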
Outside R, datasets can be found in various repositories, such as:
- Kaggle Datasets - besides basic information and download options, also includes community code and discussion.
- UC Irvine Machine Learning Repository - a collection of several hundred datasets.
- Awesome Public Datasets - a list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses.
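Datasets from repositories like these are commonly distributed as CSV files. The sketch below shows one way to read such a file with base R. To keep it runnable without a download, it first writes a tiny placeholder file to a temporary location; the column names and values are made up for illustration and do not come from any of the repositories above.

```r
# Stand-in for a downloaded file: write a tiny placeholder CSV to a temp path
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,length_cm,weight_g",
    "a,10.5,200",
    "b,12.0,250",
    "c,9.8,180"
  ),
  csv_path
)

# Read the file into a data frame; in practice csv_path would point
# at the file downloaded from the repository
downloaded <- read.csv(csv_path, stringsAsFactors = FALSE)

# Check the column types and the first rows before any analysis
str(downloaded)
head(downloaded)
```

With the tidyverse loaded, readr::read_csv(csv_path) is a common alternative that returns a tibble instead of a data frame.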
Synthetic Data
Synthetic data is data which is algorithmically generated. Its defining feature is 100% reproducibility given the data-generation algorithm and the random seed.
Here is a simple example. The data consists of two variables: X and Y. X is sampled from the standard normal distribution, while Y is generated by exponentiating X and adding noise, which is also sampled from the standard normal distribution. The random seed is set to 0:
library(tidyverse)

# Fix the random seed so the same data is generated on every run
set.seed(0)
n <- 100

# x is standard normal; y is exp(x) plus standard normal noise
data <- tibble(
  x = rnorm(n, mean = 0, sd = 1),
  y = exp(x) + rnorm(n, mean = 0, sd = 1)
)

# Scatter plot with a straight-line (linear model) fit overlaid
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "brown") +
  theme_minimal() +
  labs(
    title = "Linear Fit to Exponential Relation",
    x = "X",
    y = "Y"
  ) +
  theme(
    axis.title.y = element_text(angle = 0, vjust = 0.5),
    plot.background = element_rect(fill = "white", colour = "white"),
    plot.margin = margin(30, 30, 30, 30)
  )
Here is the relation between X and Y visualised:

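The reproducibility claim can be checked directly. The sketch below wraps the same generation recipe in a hypothetical helper, make_data (not part of the original example), and uses a base R data frame so that it runs without the tidyverse. Regenerating with the same seed should give identical data, while a different seed should not.

```r
# Hypothetical helper: same recipe as the example, with the seed as an argument
make_data <- function(seed, n = 100) {
  set.seed(seed)
  x <- rnorm(n, mean = 0, sd = 1)
  y <- exp(x) + rnorm(n, mean = 0, sd = 1)
  data.frame(x = x, y = y)
}

first  <- make_data(seed = 0)
second <- make_data(seed = 0)
other  <- make_data(seed = 1)

# Same algorithm and same seed: the two data frames match exactly
identical(first, second)

# Same algorithm but a different seed: the data differs
identical(first, other)
```

Passing the seed as an argument keeps the seed and the algorithm together, which are the two ingredients that reproducibility depends on.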
Conclusion
This post has outlined sources of ready-to-go datasets, both within and outside R, as well as a data-generation approach for synthetic datasets. With this covered, nothing should stand in the way of selecting a dataset and honing one's R skills on it.