Appendix A — Reading List
Effective Data Science is still a work in progress. This chapter is largely complete and just needs final proofreading.
If you would like to contribute to the development of EDS, you may do so at https://github.com/zakvarty/data_science_notes.
This reading list is organised by topic, according to each week of the course. Within each topic, the materials are split into three categories.
Core Materials: These form a core part of the course activities.
Reference Materials: These will be used extensively in the course, but should be seen as helpful guides, rather than required reading from cover to cover.
Materials of Interest: These will not form a core part of the course, but will give you a deeper understanding of, or an interesting perspective on, the weekly topic. There might be some other fun stuff in here too.
A.1 Effective Data Science Workflows
Core Materials
- The Tidyverse R Style Guide by Hadley Wickham.
- Wilson et al. (2017). Good Enough Practices in Scientific Computing. PLOS Computational Biology.
Reference Materials
- R for Data Science, Chapters 2, 6 and 8, by Hadley Wickham and Garrett Grolemund. These chapters cover R workflow basics and a script- and project-based workflow.
- Documentation for the {here} package.
- R Packages Book (Second Edition) by Hadley Wickham and Jenny Bryan.
Materials of Interest
- STAT545, Part 1 by Jennifer Bryan and the STAT 545 TAs.
- What they forgot to teach you about R, Chapters 2-4 by Jennifer Bryan and Jim Hester.
- Broman et al. (2017). Recommendations to Funding Agencies for Supporting Reproducible Research. American Statistical Association.
- Advanced R by Hadley Wickham. The section introductions on functional and object-oriented approaches to programming.
- Atlassian Article on Agile Project Management.
- The Pragmatic Programmer, 20th Anniversary Edition by David Thomas and Andrew Hunt. The section on DRY coding and a few others are freely available.
- Efficient R Programming by Colin Gillespie and Robin Lovelace. Chapter 5, on Efficient Input/Output, is relevant to this week. Chapter 4, on Efficient Workflows, links nicely with last week’s topics.
- Towards A Principled Bayesian Workflow by Michael Betancourt.
- Happy Git and GitHub for the useR by Jennifer Bryan
- Make Tutorial by the Monash Informatics Platform.
- Makefiles for R and LaTeX projects blog post by Rob Hyndman
- Makefile tutorial by Chase Lambert
A.2 Acquiring and Sharing Data
Core Materials
- R for Data Science, Chapters 9-12, by Hadley Wickham. These chapters introduce tibbles as a data structure, how to import data into R and how to wrangle that data into tidy format.
- Efficient R Programming by Colin Gillespie and Robin Lovelace. Chapter 5, on Efficient Input/Output, is relevant to this week.
- Wickham (2014). Tidy Data. Journal of Statistical Software. The paper that brought tidy data to the mainstream.
Reference Materials
- The {readr} documentation
- The {data.table} documentation and vignette
- The {rvest} documentation
- The {tidyr} documentation
Materials of Interest
- Introduction to APIs by Brian Cooksey
- R for Data Science (Second Edition), chapters within the Import section. These cover importing data from spreadsheets and databases, using Apache Arrow, and importing hierarchical data, as well as web scraping.
A.3 Data Exploration and Visualisation
Core Materials
- Exploratory Data Analysis with R by Roger Peng. Chapters 3 and 4 are core reading, respectively introducing data frame manipulation with {dplyr} and an example workflow for exploratory data analysis. Other chapters may be useful as references.
- Flexible Imputation of Missing Data by Stef van Buuren. Sections 1.1-1.4 give a thorough introduction to missing data problems.
Reference Materials
- A ggplot2 Tutorial for Beautiful Plotting in R (https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/) by Cédric Scherer.
- The {dplyr} documentation
- R for Data Science (First Edition), chapters on Data Transformations, Exploratory Data Analysis and Relational Data.
- Equivalent sections in R for Data Science (Second Edition).
Materials of Interest
- Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics.
- Better Data Visualisations by Jonathan Schwabish.
- Data Visualization: A Practical Introduction by Kieran Healy
A.4 Preparing for Production
Core Materials
- The Ethical Algorithm by M Kearns and A Roth (Chapter 4).
- Ribeiro et al. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier.
Reference Materials
- The Docker Curriculum by Prakhar Srivastav.
- LIME package documentation on CRAN.
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable by Christoph Molnar.
- Advanced R (Second Edition) by Hadley Wickham. Chapter 23 on measuring performance and Chapter 24 on improving performance.
Materials of Interest
- The ASA Statement on \(p\)-values: Context, Process and Purpose.
- The Garden of Forking Paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time, by A Gelman and E Loken (2013).
- Understanding LIME tutorial by T Pedersen and M Benesty.
- Advanced R (Second Edition) by Hadley Wickham. Chapter 25 on writing R code in C++.
A.5 Data Science Ethics
Core Materials
- The Ethical Algorithm by M Kearns and A Roth. Chapters 1 and 2 on Algorithmic Privacy and Algorithmic Fairness.
- Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification by Joy Buolamwini and Timnit Gebru (2018). Proceedings of the 1st Conference on Fairness, Accountability and Transparency.
- Robust De-anonymization of Large Sparse Datasets by Arvind Narayanan and Vitaly Shmatikov (2008). IEEE Symposium on Security and Privacy.
Reference Materials
- Fairness and Machine Learning: Limitations and Opportunities by Solon Barocas, Moritz Hardt and Arvind Narayanan.
- Professional Guidelines on Data Ethics from:
Materials of Interest
- Algorithmic Fairness (2020). Pre-print of review paper by Dana Pessach and Erez Shmueli.