Effective Data Science

Author

Zak Varty

Published

January 17, 2025

About this Course

Model building and evaluation are necessary but not sufficient skills for the effective practice of data science. In this module you will develop the technical and personal skills that are required to work successfully as a data scientist within an organisation.

During this module you will critically explore how to:

  • effectively scope and manage a data science project;
  • work openly and reproducibly;
  • efficiently acquire, manipulate, and present data;
  • interpret and explain your work for a variety of stakeholders;
  • ensure that your work can be put into production;
  • assess the ethical implications of your work as a data scientist.

This interdisciplinary course will draw from fields including statistics, computing, management science and data ethics. Each topic will be investigated through a selection of lecture videos, conference presentations and academic papers, supported by hands-on lab exercises and readings on industry best practices as published by recognised professional bodies.

Schedule

These notes are intended for students on the course MATH70076: Data Science in the academic year 2024/25.

As the course is scheduled to take place over five weeks, the suggested schedule is:

  • 1st week: effective data science workflows;
  • 2nd week: acquiring and sharing data;
  • 3rd week: exploratory data analysis and visualisation;
  • 4th week: preparing for production;
  • 5th week: ethics and context of data science.

An alternative PDF version of these notes may be downloaded here. Please be aware that the PDF version is secondary to this course webpage and will be updated less frequently.

Learning outcomes

On successful completion of this module students should be able to:

  1. Independently scope and manage a data science project;
  2. Source data from the internet through web scraping and APIs;
  3. Clean, explore and visualise data, justifying and documenting the decisions made;
  4. Evaluate the need for (and implement) approaches that are explainable, reproducible and scalable;
  5. Appraise the ethical implications of a data science project, particularly the risks of compromising privacy or fairness and the potential to cause harm.

Allocation of Study Hours

Lectures: 10 hours (2 hours per week)

Group Teaching: 5 hours (1 hour per week)

Lab / Practical: 10 hours (2 hours per week)

Independent Study: 100 hours (11 hours per week + 45 hours coursework)

Drop-In Sessions: Each week there will be an optional drop-in session to address any questions about the course or material. This is where you can get support from the course lecturer or GTA on the topics covered each week, either individually or in small groups.

These will be held on Fridays 14:00-15:00 in Huxley 711C.

Assessment Structure

The course will be assessed entirely by coursework, reflecting the practical and pragmatic nature of the course material.

Coursework 1 (30%): A reproducible data journalism task, to be completed during one week of the course.

Coursework 2 (70%): A student-led development of a data product, e.g. an in-depth statistical analysis, portfolio website, R package or dashboard. It will be released during the course and submitted following the examination period in the Summer term.

Acknowledgements

These notes were created by Dr Zak Varty. They were inspired by a previous lecture series by Dr Purvasha Chakravarti at Imperial College London and draw from many resources made available by the R community, which are attributed throughout.