# Checklist
Effective Data Science is still a work-in-progress. This chapter is largely complete and just needs final proofreading.
If you would like to contribute to the development of EDS, you may do so at https://github.com/zakvarty/data_science_notes.
## Videos / Chapters
- Tabular Data (27 min) [slides]
- Web Scraping (22 min) [slides]
## Reading
Use the Acquiring and Sharing Data section of the reading list to support and guide your exploration of this week’s topics. Note that these texts are divided into core reading, reference materials and materials of interest.
## Tasks
Core:
- Revisit the projects that you explored on GitHub last week. This time look for any data or documentation files.
  - Are there any file types that are new to you?
  - If so, are there packages or helper functions that would let you read this data into R? (Some common reader functions are sketched after this list.)
  - Why might you not find many data files on GitHub?
- Play CSS Diner to familiarise yourself with some CSS selectors.
- Identify 3 APIs that give access to data on topics that interest you. Write a post on the discussion forum describing the APIs and use one of them to load some data into R. (A minimal example of calling an API from R is sketched after this list.)
- Scraping Book Reviews (a hedged scraping sketch follows this list):
  - Visit the Amazon page for R for Data Science. Write code to scrape the percentage of customers giving each “star” rating (5⭐, …, 1⭐).
  - Turn your code into a function that will return a tibble of the form:

    | product | n_reviews | percent_5_star | percent_4_star | percent_3_star | percent_2_star | percent_1_star | url |
    |---|---|---|---|---|---|---|---|
    | example_name | 1000 | 20 | 20 | 20 | 20 | 20 | www.example.com |

  - Generalise your function to work for other Amazon products, where the function takes as input a vector of product names and an associated vector of URLs.
  - Use your function to compare the reviews of the following three books: R for Data Science, R packages and ggplot2.
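If the projects turn up unfamiliar file types, a handful of packages cover most formats you are likely to meet. A minimal sketch, assuming some hypothetical file paths:

```r
library(readr)    # read_csv(), read_tsv(): delimited text files
library(readxl)   # read_excel(): .xls and .xlsx spreadsheets
library(jsonlite) # fromJSON(): JSON files and API responses
library(haven)    # read_dta(), read_sav(), read_sas(): Stata, SPSS, SAS

# The paths below are hypothetical placeholders
csv_data   <- read_csv("data/example.csv")
xlsx_data  <- read_excel("data/example.xlsx")
json_data  <- fromJSON("data/example.json")
stata_data <- read_dta("data/example.dta")
```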
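For the API task, a minimal sketch of loading JSON from an API straight into R. The public GitHub REST API stands in here only because it needs no authentication for light use; substitute whichever API you chose:

```r
library(jsonlite)

# Request repository metadata from the public GitHub REST API;
# fromJSON() fetches the URL and parses the JSON response into an R list.
repo <- fromJSON("https://api.github.com/repos/tidyverse/ggplot2")

repo$stargazers_count # pull out a single field of interest
```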
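For the scraping task, a sketch of the general shape using rvest. The CSS selector below is a hypothetical placeholder (Amazon's markup changes often and it may block automated requests), and the n_reviews column is omitted for brevity; inspect the live page with your browser's developer tools and adjust:

```r
library(rvest)
library(tibble)

# Scrape the star-rating percentages for one product page.
# NB: "#histogramTable .a-text-right" is a HYPOTHETICAL selector --
# find the element that actually holds the five percentage values.
scrape_star_ratings <- function(product, url) {
  page <- read_html(url)

  pct_text <- page |>
    html_elements("#histogramTable .a-text-right") |>
    html_text2()

  pct <- readr::parse_number(pct_text) # "20%" -> 20

  tibble(
    product = product,
    percent_5_star = pct[1],
    percent_4_star = pct[2],
    percent_3_star = pct[3],
    percent_2_star = pct[4],
    percent_1_star = pct[5],
    url = url
  )
}

# Generalised version: takes a vector of product names and an
# associated vector of URLs, returning one combined tibble.
scrape_star_ratings_many <- function(products, urls) {
  purrr::map2(products, urls, scrape_star_ratings) |>
    purrr::list_rbind()
}
```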
Bonus:
- Add this function to the R package you made last week, remembering to add tests and documentation. (One possible workflow is sketched below.)
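For the bonus task, one possible workflow built on the usethis and devtools helpers (the function name carries over from the scraping sketch above):

```r
usethis::use_r("scrape_star_ratings")    # create R/scrape_star_ratings.R
usethis::use_package("rvest")            # record dependencies in DESCRIPTION
usethis::use_package("tibble")
usethis::use_test("scrape_star_ratings") # create a matching testthat file
devtools::document()                     # build .Rd help from roxygen comments
devtools::test()                         # run the test suite
```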
## Live Session
In the live session we will begin with a discussion of this week’s tasks. We will then work through some examples of how to read data from non-standard sources.
Please come to the live session prepared to discuss the following points:
- Roger Peng states that files can be imported and exported using `readRDS()` and `saveRDS()` for fast and space-efficient data storage. What is the downside to doing so? (A short round-trip example follows this list.)
- What data types have you come across (that we have not discussed already) and in what contexts are they used?
- What do you have to give greater consideration to when scraping data than when using an API?
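As a starting point for the first question, a quick round-trip with these two functions:

```r
# .rds files are fast and compact and preserve R objects exactly,
# but the format is R-specific: collaborators working in Python,
# Excel, etc. cannot open them without extra tooling.
x <- data.frame(a = 1:3, b = letters[1:3])
saveRDS(x, "x.rds")   # serialise one R object to disk
y <- readRDS("x.rds") # restore it, types and attributes intact
identical(x, y)       # TRUE
```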