
Checkpoint 2: EDA, Visualization, and Machine Learning Plan

Context

This is checkpoint #2 of your final project. It builds on what you've done in checkpoint #1.

Keep in Mind

Nothing that you've done in checkpoint 1 is set in stone. You can always go back and change your dataset, your question, or your approach. This checkpoint is meant to help you refine your project and make sure that you are on the right track.

IT work is never linear; it's always iterative and incremental. You will be making changes to your project as you go. This is a good thing! It means that you are learning and growing as a data scientist.

Overview

In the past few weeks we addressed a few key topics in data science: data cleaning, exploratory data analysis, and visualization. In this checkpoint, you will be asked to apply these skills to your final project dataset. You will also be asked to plan how you can apply machine learning in your project.

What the checkpoint should include

In addition to addressing any feedback that was provided to you by your peers and/or the teaching team, your checkpoint should include the following sections:

1. Exploratory Data Analysis (EDA)

You may have (hopefully) found more datasets for your project. If so, you should include them under the "Data Sources" section. If not, that means that so far your data is able to answer your analysis questions. Apply EDA and use visualizations to answer questions such as:

  • What insights and interesting information are you able to extract at this stage?
  • What are the distributions of your variables?
  • Are there any correlations between your variables?
  • What issues can you see in your data at this point?
  • Are there any outliers or anomalies? Are they relevant to your analysis, or should they be removed?
  • Are there any missing values? How are you going to deal with them?
  • Are there any duplicate values? How are you going to deal with them?
  • Are there any data types that need to be changed?
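As a starting point, the questions above can be sketched with pandas. This is a minimal illustration, not a required approach: the toy `df`, its column names, and the helper `eda_summary` are hypothetical stand-ins for your own dataset, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your project dataset.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, np.nan, 29, 120],
    "income": [32000, 54000, 54000, 61000, 48000, 39000, 58000],
    "signup_date": ["2022-01-05", "2022-02-11", "2022-02-11",
                    "2022-03-02", "2022-03-19", "2022-04-07", "2022-04-30"],
})


def eda_summary(frame):
    """Collect quick answers to the EDA questions for one DataFrame."""
    numeric = frame.select_dtypes(include="number")
    # Flag outliers with the 1.5 * IQR rule (an assumed convention).
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "dtypes": frame.dtypes.astype(str).to_dict(),     # types that may need changing
        "missing": frame.isna().sum().to_dict(),          # missing values per column
        "duplicate_rows": int(frame.duplicated().sum()),  # fully duplicated rows
        "outliers": outlier_mask.sum().to_dict(),         # flagged values per numeric column
        "correlations": numeric.corr(),                   # pairwise Pearson correlations
        "distributions": numeric.describe(),              # count, mean, quartiles, ...
    }


summary = eda_summary(df)
```

Here `signup_date` is stored as text, which is the kind of data type issue the last question asks about. Each entry in `summary` is only a prompt for a decision; whether a flagged value is an error or a meaningful observation depends on your analysis question.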

Data Visualization

You should use several visualizations to help answer the questions above.

  • You should have at least 4 visualizations in your notebook, to represent different aspects and valuable insights of your data.
  • You can use any visualization library that you want.
  • You can use any type of visualization that best represents your data.
    • You'll need to provide a short description of each visualization and explain what it represents, and what insights you can extract from it.
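One possible way to lay out four visualizations is sketched below with matplotlib, which is only one of the libraries you could pick. The toy `df`, its columns, and the helper `plot_overview` are hypothetical; choose chart types that suit your own variables.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for your project dataset.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47, 38, 31],
    "income": [32000, 54000, 61000, 39000, 72000, 66000, 58000, 45000],
    "segment": ["a", "b", "b", "a", "c", "c", "b", "a"],
})


def plot_overview(frame):
    """Draw four views of the data: distribution, relationship, spread, counts."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # 1. Distribution of one numeric variable.
    axes[0, 0].hist(frame["income"].dropna(), bins=5)
    axes[0, 0].set_title("Income distribution")

    # 2. Relationship between two numeric variables.
    pairs = frame.dropna(subset=["age", "income"])
    axes[0, 1].scatter(pairs["age"], pairs["income"])
    axes[0, 1].set_title("Age vs. income")

    # 3. Spread and potential outliers of one variable.
    axes[1, 0].boxplot(frame["age"].dropna().to_numpy())
    axes[1, 0].set_title("Age spread")

    # 4. Counts per category.
    counts = frame["segment"].value_counts().sort_index()
    axes[1, 1].bar(counts.index.tolist(), counts.to_numpy())
    axes[1, 1].set_title("Rows per segment")

    fig.tight_layout()
    return fig


fig = plot_overview(df)
```

In a notebook each chart would normally sit in its own cell, followed by a short Markdown description of what it shows and what insight it supports.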

2. Data Cleaning and Transformation

In this section, you'll clean data per your findings in the EDA section. You will be handling issues such as:

  • Missing values
  • Duplicate values
  • Anomalies and Outliers
  • Data type transformations

You will need to describe the data cleaning process you went through to prepare your data for analysis. This includes your code and the reasoning behind your decisions (if not already discussed in the EDA section).
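The four cleaning steps listed above could look roughly like this in pandas. It is a sketch under stated assumptions: the toy `raw` frame and the helper `clean` are hypothetical, the plausible age range of 0 to 110 is an invented domain rule, and median imputation is only one of several defensible ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: numbers stored as text, a duplicate row,
# a missing age, and an implausible age of 120.
raw = pd.DataFrame({
    "age": [23, 35, 35, 41, np.nan, 29, 120],
    "income": ["32000", "54000", "54000", "61000", "48000", "39000", "58000"],
    "signup_date": ["2022-01-05", "2022-02-11", "2022-02-11",
                    "2022-03-02", "2022-03-19", "2022-04-07", "2022-04-30"],
})


def clean(frame):
    """Apply the cleaning decisions and return a new DataFrame."""
    # Duplicate values: drop rows that repeat an earlier row exactly.
    out = frame.drop_duplicates().copy()

    # Data types: parse text columns into numeric and datetime types.
    out["income"] = pd.to_numeric(out["income"], errors="coerce")
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")

    # Anomalies and outliers: treat ages outside an assumed plausible
    # range (0 to 110) as missing rather than deleting the whole row.
    out.loc[~out["age"].between(0, 110), "age"] = np.nan

    # Missing values: impute with the median of the remaining ages.
    out["age"] = out["age"].fillna(out["age"].median())

    return out.reset_index(drop=True)


cleaned = clean(raw)
```

Keeping the steps in one function makes each decision easy to point at when you explain your reasoning in the notebook.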

3. Machine Learning Plan

In this section, you'll plan how you can apply machine learning to your project. You're not expected to actually implement machine learning in this checkpoint, but this checkpoint will help identify issues that you may encounter, such as not having enough data, not having enough features, or having too many features.

🦉 Advice

Use your findings from this checkpoint to start preparing for the next checkpoint, in which you will be implementing machine learning.

Use this section to address the following questions:

  • What type of machine learning model are you planning to use?
  • What challenges have you identified, or do you anticipate, in building your machine learning model?
  • How are you planning to address these challenges?
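To ground the planning questions above, a few quick checks on the cleaned data can surface the "not enough data" or "too many features" issues before any model is built. The sketch below is illustrative only: the toy `df`, the target column `churned`, and the helper `ml_readiness` are hypothetical, and the quantities it reports are rough heuristics rather than course requirements.

```python
import pandas as pd

# Hypothetical cleaned data with a binary target column named "churned".
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 47, 38, 31],
    "income": [32000, 54000, 61000, 39000, 72000, 66000, 58000, 45000],
    "segment": ["a", "b", "b", "a", "c", "c", "b", "a"],
    "churned": [0, 0, 0, 0, 0, 0, 1, 1],
})


def ml_readiness(frame, target):
    """Summarize properties that commonly cause trouble when modeling."""
    features = frame.drop(columns=[target])
    n_rows, n_features = features.shape
    class_share = frame[target].value_counts(normalize=True)
    return {
        "n_rows": n_rows,
        "n_features": n_features,
        # Few rows per feature hints at too little data or too many features.
        "rows_per_feature": n_rows / n_features,
        # A small minority share hints at class imbalance (classification only).
        "minority_class_share": float(class_share.min()),
        # Non-numeric features usually need encoding before modeling.
        "non_numeric_features": features.select_dtypes(exclude="number").columns.tolist(),
    }


report = ml_readiness(df, target="churned")
```

Numbers like these do not choose a model for you, but they give concrete evidence to cite when you describe the challenges you anticipate and how you plan to address them.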

4. Prior Feedback and Updates

  • What feedback did you receive from your peers and/or the teaching team?
  • What changes have you made to your project based on this feedback?

Due Dates

  • Assignment Due Date: Oct 16, 2022
  • Peer-Review Due Date: Oct 23, 2022

Submission

1. Your Notebook

You'll be submitting the same GitHub repository link you submitted for Checkpoint #1. You should add the sections mentioned above to that notebook.

  • Make sure the current last cell in your notebook stays the last cell:
!jupyter nbconvert --to python source.ipynb
  • Run that cell again to convert your notebook to a Python script. This is the file your peers will review. They can still open the notebook to see the visualizations, but the script makes it easier to provide feedback.

2. Peer-Review

You'll also be required to provide a peer review of one of your fellow classmates' work. You'll have access to their submission and will need to provide thoughtful feedback. Your feedback isn't limited to the project idea and design; it should also cover the coding practices followed. Please provide your feedback on the Feedback Pull Request on their GitHub repository, the same way I did on your Python Exercises Assignment.

Every notebook should have an accompanying Python script with the same name, except that it ends with *.py instead of *.ipynb. This is the file that you'll be reviewing. You should still review the notebook itself to see the visualizations and the output of the code.

The assignment resources include links to the Warm-Cool-Hard Feedback protocol and the clean code practices document on the course's website. You're encouraged to draw on your own personal and industrial experience in your feedback.

Please check the rubric for details on how you'll be graded on this assignment.


(1) Feedback can be found on the discussion board, on the assignment submission, and on the different components of the assignment rubric (make sure you check all 3 places).

Resources