Checklist

Note

Effective Data Science is still a work in progress. This chapter is largely complete and just needs final proofreading.

If you would like to contribute to the development of EDS, you may do so at https://github.com/zakvarty/data_science_notes.

Videos / Chapters

Reading

Use the Acquiring and Sharing Data section of the reading list to support and guide your exploration of this week’s topics. Note that these texts are divided into core reading, reference materials and materials of interest.

Tasks

Core:

  • Revisit the projects that you explored on GitHub last week. This time look for any data or documentation files.

    • Are there any file types that are new to you?
    • If so, are there packages or helper functions that would let you read this data into R?
    • Why might you not find many data files on GitHub?
  • Play CSS Diner to familiarise yourself with some CSS selectors.

  • Identify 3 APIs that give access to data on topics that interest you. Write a post on the discussion forum describing the APIs and use one of them to load some data into R.

  • Scraping Book Reviews:

    • Visit the Amazon page for R for Data Science. Write code to scrape the percentage of customers giving each “star” rating (5⭐, …, 1⭐).
    • Turn your code into a function that will return a tibble of the form:
      product       n_reviews  percent_5_star  percent_4_star  percent_3_star  percent_2_star  percent_1_star  url
      example_name  1000       20              20              20              20              20              www.example.com
    • Generalise your function to work for other Amazon products, where the function takes as input a vector of product names and an associated vector of URLs.
    • Use your function to compare the reviews of the following three books: R for Data Science, R Packages and ggplot2.
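
As a starting point for the scraping task above, the following is a minimal sketch of how CSS selectors and rvest can turn page markup into a one-row tibble of the requested form. It runs on a small made-up HTML snippet rather than a live page. The ids and classes (#product-title, #review-count, #histogram, td.pct) and the helper to_number() are assumptions for illustration only; the real product page will use different markup, which you will need to inspect yourself.

```r
library(rvest)
library(tibble)

# Made-up markup standing in for a product page. The ids and classes are
# assumptions for illustration; a real page will have its own structure.
page <- minimal_html('
  <h1 id="product-title">example_name</h1>
  <span id="review-count">1,000 ratings</span>
  <table id="histogram">
    <tr><td class="label">5 star</td><td class="pct">20%</td></tr>
    <tr><td class="label">4 star</td><td class="pct">20%</td></tr>
    <tr><td class="label">3 star</td><td class="pct">20%</td></tr>
    <tr><td class="label">2 star</td><td class="pct">20%</td></tr>
    <tr><td class="label">1 star</td><td class="pct">20%</td></tr>
  </table>
')

# Hypothetical helper: keep only the digits, then convert to numeric.
# (This would mishandle decimal values; it is enough for whole numbers.)
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

# CSS selectors pick out elements; html_text2() extracts their text.
product   <- html_text2(html_element(page, "#product-title"))
n_reviews <- to_number(html_text2(html_element(page, "#review-count")))
percents  <- to_number(html_text2(html_elements(page, "#histogram td.pct")))

# Assemble a single row in the layout requested by the task.
# The rows of the made-up table run from 5 stars down to 1 star.
review_summary <- tibble(
  product        = product,
  n_reviews      = n_reviews,
  percent_5_star = percents[1],
  percent_4_star = percents[2],
  percent_3_star = percents[3],
  percent_2_star = percents[4],
  percent_1_star = percents[5],
  url            = "www.example.com"
)

review_summary
```

For a live page you would replace minimal_html() with read_html() on the product URL and swap in selectors found by inspecting that page.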

Bonus:

  • Add this function to the R package you made last week, remembering to add tests and documentation.

Live Session

In the live session we will begin with a discussion of this week’s tasks. We will then work through some examples of how to read data from non-standard sources.

Please come to the live session prepared to discuss the following points:

  • Roger Peng states that R objects can be exported and imported using saveRDS() and readRDS() for fast and space-efficient data storage. What is the downside to doing so?

  • What data types have you come across (that we have not discussed already) and in what context are they used?

  • What do you need to consider more carefully when scraping data than when using an API?
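
To help you prepare for the first discussion point above, here is a minimal sketch of a saveRDS() and readRDS() round trip alongside a CSV export of the same data. The built-in mtcars data set and the temporary file paths are choices made for this illustration and are not taken from Peng's text.

```r
# Write a built-in data frame to temporary files in two formats.
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(mtcars, rds_path)
write.csv(mtcars, csv_path)

# readRDS() returns the stored object, which we assign to a new name.
restored <- readRDS(rds_path)

# Check whether the restored object matches the original exactly.
identical(restored, mtcars)

# Compare the size on disk of the two files (in bytes).
file.size(rds_path)
file.size(csv_path)
```

Try opening each file in a text editor, or think about who else might need to read them, before the session.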