Welcome R - Where to Get Data?

Yamesant

25 Mar 2025

R is a tool for working with data. But where does a beginner get data to practise on?

This post outlines two types of data sources: ready-to-go datasets and synthetic data.

Ready-to-go Datasets

Ready-to-go datasets are datasets that can be used immediately for analysis, visualisation, and machine learning. They are self-contained and come in standard formats.

Within R, the function data() lists the datasets available in the currently attached packages, that is, the packages loaded with library().

Here is a truncated sample of its output:

Data sets in package ‘datasets’:

AirPassengers               Monthly Airline Passenger Numbers 1949-1960
BJsales                     Sales Data with Leading Indicator
...

Data sets in package ‘dplyr’:

band_instruments            Band membership
band_instruments2           Band membership
...

Data sets in package ‘ggplot2’:

diamonds                    Prices of over 50,000 round cut diamonds
economics                   US economic time series
...

Use ‘data(package = .packages(all.available = TRUE))’
to list the data sets in all *available* packages.
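As a minimal sketch of how such a listing can be used, the snippet below captures the listing for the base datasets package as a character matrix instead of printing it, then inspects one of the listed datasets by name. It uses base R only; the variable name listing is just an illustrative choice.

```r
# Capture the listing for the 'datasets' package instead of printing it.
# data() returns an object whose 'results' component is a character matrix
# with the columns Package, LibPath, Item and Title.
listing <- data(package = "datasets")$results
head(listing[, c("Item", "Title")])

# Datasets from the 'datasets' package are available directly by name.
class(AirPassengers)
summary(AirPassengers)
```

For a dataset that ships with a package that is not attached, data("name", package = "pkg") loads it into the current environment.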

Outside R, datasets can be found in various repositories, such as:

Kaggle Datasets - besides basic information and download options, each dataset also comes with community code and discussion.
UC Irvine Machine Learning Repository - a collection of several hundred datasets.
Awesome Public Datasets - a list of high-quality, topic-centric public data sources, collected and tidied from blogs, answers, and user responses.
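Datasets from repositories like these are often downloaded as CSV files. The sketch below shows one way to read such a file with base R. So that it runs on its own, it first writes a tiny stand-in file to a temporary directory; the file name and the columns (id, height_cm, group) are hypothetical and not taken from any real dataset.

```r
# Stand-in for a file downloaded from a dataset repository.
# The file name and its columns are hypothetical.
csv_path <- file.path(tempdir(), "downloaded_dataset.csv")
writeLines(
  c("id,height_cm,group", "1,172.5,a", "2,165.0,b", "3,180.2,a"),
  csv_path
)

# Read the CSV into a data frame and take a first look at it.
df <- read.csv(csv_path, stringsAsFactors = FALSE)
str(df)
summary(df)
```

With a real download, only csv_path needs to change to point at the downloaded file.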

Synthetic Data

Synthetic data is data that is generated algorithmically. Its defining feature is reproducibility: given the same data-generation algorithm and the same random seed, the same data is produced on every run.
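As a quick check of this reproducibility, the sketch below draws two samples after resetting the same seed and compares them. The variable names are illustrative, and the comparison assumes both draws run in the same R session with the same random number generator settings.

```r
# Draw five standard normal values after setting the seed...
set.seed(0)
first_draw <- rnorm(5)

# ...then reset the same seed and draw again.
set.seed(0)
second_draw <- rnorm(5)

# The two draws are element-for-element the same.
identical(first_draw, second_draw)
```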

Here is a simple example. The data consists of two variables: X and Y. X is sampled from the standard normal distribution, while Y is generated by exponentiating X and adding noise, which is also sampled from the standard normal distribution. The random seed is set to 0:

library(tidyverse)

# Fix the random seed so the same data is generated on every run
set.seed(0)

n <- 100
data <- tibble(
  x = rnorm(n, mean = 0, sd = 1),
  # tibble() evaluates columns in order, so x is already available here
  y = exp(x) + rnorm(n, mean = 0, sd = 1)
)

# Scatter plot with a straight-line (linear model) fit overlaid
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "brown") +
  theme_minimal() +
  labs(
    title = "Linear Fit to Exponential Relation",
    x = "X",
    y = "Y"
  ) +
  theme(
    axis.title.y = element_text(angle = 0, vjust = 0.5),
    plot.background = element_rect(fill = "white", colour = "white"),
    plot.margin = margin(30, 30, 30, 30)
  )

Figure: the relation between X and Y, shown as a scatter plot with the linear fit overlaid.
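To put a rough number on how well a straight line captures this relation, one option is to compare it with a fit on exp(x), which matches how Y was generated. The sketch below regenerates the data in the same way using base R only and compares the R-squared values of the two fits. The names linear_fit and exp_fit are illustrative, and R-squared is only one of several possible measures.

```r
# Regenerate the synthetic data with base R (same seed and recipe as above).
set.seed(0)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)
y <- exp(x) + rnorm(n, mean = 0, sd = 1)

# Straight-line fit, as drawn in the plot.
linear_fit <- lm(y ~ x)

# Fit on exp(x), matching the way y was generated.
exp_fit <- lm(y ~ exp(x))

# Share of variance in y explained by each fit.
summary(linear_fit)$r.squared
summary(exp_fit)$r.squared
```

Because Y was built from exp(X), the second fit would be expected to explain more of the variance than the straight line.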

Conclusion

This post has outlined where to find ready-to-go datasets, both within R and outside it, as well as how to generate synthetic datasets. With this covered, nothing should stand in the way of selecting a dataset and honing one's R skills on it.
