4 Tabular Data

Important

Effective Data Science is still a work-in-progress. This chapter is undergoing heavy restructuring and may be confusing or incomplete.

If you would like to contribute to the development of EDS, you may do so at https://github.com/zakvarty/data_science_notes.

4.1 Loading Tabular Data

Recall that simpler, open source formats improve accessibility and reproducibility. We will begin by reading in three open data formats for tabular data.

random-data.csv
random-data.tsv
random-data.txt

Each of these data sets contains 26 observations of 4 variables:

id, a Roman letter identifier;
gaussian, standard normal random variates;
gamma, gamma(1,1) random variates;
uniform, uniform(0,1) random variates.

4.1.1 Base R

random_df <- read.csv(file = 'random-data.csv')
print(random_df)
#>    id    gaussian      gamma    uniform
#> 1   a -1.20706575 0.98899970 0.22484576
#> 2   b  0.27742924 0.03813386 0.08498474
#> 3   c  1.08444118 1.09462335 0.63729826
#> 4   d -2.34569770 1.49301101 0.43101637
#> 5   e  0.42912469 5.40361248 0.07271609
#> 6   f  0.50605589 1.72386539 0.80240202
#> 7   g -0.57473996 1.95357133 0.32527830
#> 8   h -0.54663186 0.07807803 0.75728904
#> 9   i -0.56445200 0.21198194 0.58427152
#> 10  j -0.89003783 0.20803673 0.70883941
#> 11  k -0.47719270 2.08607862 0.42697577
#> 12  l -0.99838644 0.49463708 0.34357270
#> 13  m -0.77625389 0.77171305 0.75911999
#> 14  n  0.06445882 0.37216648 0.42403021
#> 15  o  0.95949406 1.88207991 0.56088725
#> 16  p -0.11028549 0.76622568 0.11613577
#> 17  q -0.51100951 0.50488585 0.30302180
#> 18  r -0.91119542 0.22979791 0.47880269
#> 19  s -0.83717168 0.75637275 0.34483055
#> 20  t  2.41583518 0.62435969 0.60071414
#> 21  u  0.13408822 0.64638373 0.07608332
#> 22  v -0.49068590 0.11247545 0.95599261
#> 23  w -0.44054787 0.11924307 0.02220682
#> 24  x  0.45958944 4.91805535 0.84171063
#> 25  y -0.69372025 0.60282666 0.63244245
#> 26  z -1.44820491 0.64446571 0.31009417

Output is a data.frame object. (List of vectors with some nice methods)

4.1.2 `{readr}`

random_tbl <- readr::read_csv(file = 'random-data.csv')
#> Rows: 26 Columns: 4
#> ── Column specification ─────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): id
#> dbl (3): gaussian, gamma, uniform
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(random_tbl)
#> # A tibble: 26 × 4
#>   id    gaussian  gamma uniform
#>   <chr>    <dbl>  <dbl>   <dbl>
#> 1 a       -1.21  0.989   0.225 
#> 2 b        0.277 0.0381  0.0850
#> 3 c        1.08  1.09    0.637 
#> 4 d       -2.35  1.49    0.431 
#> 5 e        0.429 5.40    0.0727
#> 6 f        0.506 1.72    0.802 
#> # ℹ 20 more rows

Output is a tibble object. (List of vectors with some nicer methods)

4.1.2.1 Benefits of `readr::read_csv()`

Increased speed (approx. 10x) and progress bar.
Strings are not coerced to factors. No more stringsAsFactors = FALSE
No row names and nice column names.
Reproducibility bonus: does not depend on operating system.

4.1.3 WTF: Tibbles

4.1.3.1 Printing

Default to first 10 rows and as many columns as will comfortably fit on your screen.
Can adjust this behaviour in the print call:

# print first three rows and all columns
print(random_tbl, n = 3, width = Inf)
#> # A tibble: 26 × 4
#>   id    gaussian  gamma uniform
#>   <chr>    <dbl>  <dbl>   <dbl>
#> 1 a       -1.21  0.989   0.225 
#> 2 b        0.277 0.0381  0.0850
#> 3 c        1.08  1.09    0.637 
#> # ℹ 23 more rows

Bonus: Colour formatting in IDE and each column tells you it’s type.

4.1.3.2 Subsetting

Subsetting tibbles will always return another tibble.

# Row Subsetting
random_tbl[1, ] # returns tibble
random_df[1, ]  # returns data.frame

# Column Subsetting
random_tbl[ , 1]      # returns tibble
random_df[ , 1]       # returns vector

# Combined Subsetting
random_tbl[1, 1]      # returns 1x1 tibble
random_df[1, 1]       # returns single value

This helps to avoids edge cases associated with working on data frames.

4.1.4 Other `{readr}` functions

See readr documentation, there are lots of useful additional arguments that can help you when reading messy data.

Functions for reading and writing other types of tabular data work analogously.

4.1.4.1 Reading Tabular Data

library(readr)
read_tsv("random-data.tsv")
read_delim("random-data.txt", delim = " ")

4.1.4.2 Writing Tabular Data

write_csv(random_tbl, "random-data-2.csv")
write_tsv(random_tbl, "random-data-2.tsv")
write_delim(random_tbl, "random-data-2.tsv", delim = " ")

4.1.5 Need for Speed

Some times you have to load lots of large data sets, in which case a 10x speed-up might not be sufficient.

If each data set still fits inside RAM, then check out data.table::fread() which is optimised for speed. (Alternatives exist for optimal memory usage and data too large for working memory, but not covered here.)

Note: While it can be much faster, the resulting data.table object lacks the consistancy properties of a tibble so be sure to check for edge cases, where the returned value is not what you might expect.

4.2 Tidy Data

4.2.1 Wide vs. Tall Data

4.2.1.1 Wide Data

First column has unique entries
Easier for humans to read and compute on
Harder for machines to compute on

4.2.1.2 Tall Data

First column has repeating entries
Harder for humans to read and compute on
Easier for machines to compute on

4.2.1.3 Examples

Example 1 (Wide)

Person	Age	Weight	Height
Bob	32	168	180
Alice	24	150	175
Steve	64	144	165

Example 1 (Tall)

Person	Variable	Value
Bob	Age	32
Bob	Weight	168
Bob	Height	180
Alice	Age	24
Alice	Weight	150
Alice	Height	175
Steve	Age	64
Steve	Weight	144
Steve	Height	165

[Source: Wikipedia - Wide and narrow data]

Example 2 (Wide)

Team	Points	Assists	Rebounds
A	88	12	22
B	91	17	28
C	99	24	30
D	94	28	31

Example 2 (Tall)

Team	Variable	Value
A	Points	88
A	Assists	12
A	Rebounds	22
B	Points	91
B	Assists	17
B	Rebounds	28
C	Points	99
C	Assists	24
C	Rebounds	30
D	Points	94
D	Assists	28
D	Rebounds	31

[Source: Statology - Long vs wide data]

4.2.1.4 Pivoting Wider and Longer

Error control at input and analysis is format-dependent.
Switching between long and wide formats useful to control errors.
Easy with the tidyr package functions

tidyr::pivot_longer()
tidyr::pivot_wider()

4.2.2 Tidy What?

Tidy Data is an opinionated way to store tabular data.

Image Source: Chapter 12 of R for Data Science.

Each column corresponds to a exactly one measured variable
Each row corresponds to exactly one observational unit
Each cell contains exactly one value.

Benefits of tidy data

Consistent data format: Reduces cognitive load and allows specialised tools (functions) to efficiently work with tabular data.
Vectorisation: Keeping variables as columns allows for very efficient data manipulation. (this goes back to data frames and tibbles being lists of vectors)

4.2.3 Example - Tidy Longer

Consider trying to plot these data as time series. The year variable is trapped in the column names!

#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

To tidy this data, we need to pivot_longer(). We will turn the column names into a new year variable and retaining cell contents as a new variable called cases.

library(magrittr)

countries %>% 
  tidyr::pivot_longer(cols = c(`1999`,`2000`), names_to = "year", values_to = "cases")
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <dbl>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

Much better!

4.2.4 Example - Tidy Wider

There are other times where we might have to widen our data to tidy it.

This example is not tidy. Why not?

Team	Variable	Value
A	Points	88
A	Assists	12
A	Rebounds	22
B	Points	91
B	Assists	17
B	Rebounds	28
C	Points	99
C	Assists	24
C	Rebounds	30
D	Points	94
D	Assists	28
D	Rebounds	31

The observational unit here is a team. However, each variable should be a stored in a separate column, with cells containing their values.

To tidy this data we first generate it as a tibble. We use the tribble() function, which allows us to create a tibble row-wise rather than column-wise.

We can then tidy it by creating new columns for each value of the current Variable column and taking the values for these from the current Value column.

tournament %>% 
  tidyr::pivot_wider(
    id_cols = "Team", 
    names_from = "Variable",
    values_from = "Value")
#> # A tibble: 4 × 4
#>   Team  Points Assists Rebounds
#>   <chr>  <dbl>   <dbl>    <dbl>
#> 1 A         88      12       22
#> 2 B         91      17       28
#> 3 C         99      24       30
#> 4 D         94      28       31

4.2.5 Other helpful functions

The pivot_*() family of functions resolve issues with rows (too many observations per row or rows per observation).

There are similar helper functions to solve column issues:

Multiple variables per column: tidyr::separate(),
Multiple columns per variable: tidyr::unite().

4.2.6 Missing Data

In tidy data, every cell contains a value. Including cells with missing values.

Missing values are coded as NA (generic) or a type-specific NA, such as NA_character_.
The readr family of read_*() function have good defaults and helpful na argument.
Explicitly code NA values when collecting data, avoid ambiguity: ” “, -999 or worst of all 0.
More on missing values in EDA videos…

4.3 Wrapping Up

Reading in tabular data by a range of methods
Introduced the tibble and tidy data (+ tidy not always best)
Tools for tidying messy tabular data

4.4 Session Information

R version 4.4.3 (2025-02-28)

Platform: x86_64-apple-darwin20

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: magrittr(v.2.0.3)

loaded via a namespace (and not attached): crayon(v.1.5.3), vctrs(v.0.6.5), cli(v.3.6.3), knitr(v.1.49), rlang(v.1.1.4), xfun(v.0.50), purrr(v.1.0.2), generics(v.0.1.3), jsonlite(v.1.8.9), glue(v.1.8.0), bit(v.4.5.0.1), htmltools(v.0.5.8.1), hms(v.1.1.3), rmarkdown(v.2.29), pander(v.0.6.5), evaluate(v.1.0.3), tibble(v.3.2.1), tzdb(v.0.4.0), fastmap(v.1.2.0), yaml(v.2.3.10), lifecycle(v.1.0.4), compiler(v.4.4.3), dplyr(v.1.1.4), Rcpp(v.1.0.14), htmlwidgets(v.1.6.4), pkgconfig(v.2.0.3), tidyr(v.1.3.1), rstudioapi(v.0.17.1), digest(v.0.6.37), R6(v.2.5.1), utf8(v.1.2.4), tidyselect(v.1.2.1), readr(v.2.1.5), parallel(v.4.4.3), vroom(v.1.6.5), pillar(v.1.10.1), withr(v.3.0.2), tools(v.4.4.3) and bit64(v.4.5.2)