Data Scientist Job Simulator

Welcome to the Data Scientist Job Simulator! In this interactive experience, you'll work as a data scientist to analyze customer churn data and develop strategies to reduce customer attrition.

Choose how you'd like to proceed:

📊

Start Full Simulation

Experience the complete data science journey from initial briefing through analysis and recommendations.

Choose this if: You want to experience the full simulation with storyline and context.

⚡

Complete the Analysis

Skip directly to answering questions about your analysis of the dataset.

Choose this if: You've already downloaded and analyzed the dataset and want to submit your findings.

Meeting Your Manager

Joan Guadalupe (Data Science Manager)

Hello, and welcome to the Analytics team! I'm Joan Guadalupe, the Data Science Manager. It's great to have you on board. We have an important project for you to work on right away.

You

Thanks for the warm welcome! I'm excited to be here and eager to get started.

Joan Guadalupe (Data Science Manager)

That's the spirit! Our telecommunications division has been experiencing challenges with customer churn. They've collected customer data and want us to analyze it to understand what factors might be contributing to customers leaving.

Project Briefing

Joan Guadalupe (Data Science Manager)

Let me give you the full picture of what we're dealing with. Our telecom client has been losing customers at an alarming rate over the past quarter. Preliminary analysis indicates the churn rate is around 43%, which is concerning.

Joan Guadalupe (Data Science Manager)

They've provided us with a dataset that includes customer demographics, subscription details, charges, data usage, contract information, and whether they've churned. The key business questions they want answered are:

  1. What factors are most strongly correlated with customer churn?
  2. Can we build a model to predict which customers are at risk of churning?
  3. What actionable recommendations can we provide to reduce churn?

You

Got it. So we need to perform exploratory data analysis, identify key factors related to churn, build a predictive model, and provide actionable insights.

Joan Guadalupe (Data Science Manager)

Exactly! I'll share the dataset with you now. It contains information from 100 customers. Please note that there are some data quality issues that you'll need to address before analysis: missing values, inconsistent formats, and a few outliers.

Dataset Overview

Joan Guadalupe (Data Science Manager)

Here's the customer churn dataset. Let me walk you through the columns:

  • CustomerID: Unique identifier for each customer
  • Age: Customer's age in years
  • Gender: Customer's gender (Male/Female)
  • SubscriptionLength: Length of customer's subscription (in months or years)
  • MonthlyCharges: The amount charged to the customer monthly
  • TotalCharges: The total amount charged to the customer
  • DataUsage: Customer's data usage (in MB or GB)
  • ContractType: Type of contract (Month-to-month, One year, Two year)
  • PaymentMethod: Method of payment (Electronic check, Mailed check, Bank transfer)
  • SignUpDate: Date when the customer signed up
  • SupportCalls: Number of calls to customer service
  • Satisfaction: Customer satisfaction rating (0-6 scale)
  • Churned: Whether the customer has left (True/False)

Joan Guadalupe (Data Science Manager)

Once you download the dataset, you may need to pre-process it before analyzing it and applying machine learning methods. When you're done, please come back and continue to the 'Complete the Analysis' section to answer questions about the dataset. All the best!

Data Preview

CustomerID  Age  Gender  SubscriptionLength  MonthlyCharges  ContractType    SupportCalls  Satisfaction  Churned
1           52   null    null                null            month-to-month  2             null          False
2           42   Male    9 months            64.42           Two year        5             3             True
3           54   Male    37 months           62.37           Two year        6             5             True
4           67   Female  33 months           39.35           Two year        9             2             True
5           41   Female  51 months           27.45           Two year        1             3             False
6           -25  Male    42 months           91.78           Two year        4             1             True

Note: This is just a preview of the first 6 rows. The actual dataset contains 100 rows.

Download the dataset and analyze it using your preferred data analysis tool (Python, R, Excel, etc.). When you've completed your analysis, return to continue the simulation.
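As a starting point for the analysis, the sketch below loads a few rows and counts missing values per column. It assumes the download is a comma-separated file and that missing entries appear as the literal text `null`, as in the preview; the actual file format and missing-value tokens may differ. The sample string, `load_rows`, and `missing_counts` are hypothetical names used only for illustration.

```python
import csv
import io

# Three rows copied from the preview, written as CSV (the CSV layout is an assumption).
SAMPLE = """CustomerID,Age,Gender,SubscriptionLength,MonthlyCharges,ContractType,SupportCalls,Satisfaction,Churned
1,52,null,null,null,month-to-month,2,null,False
2,42,Male,9 months,64.42,Two year,5,3,True
6,-25,Male,42 months,91.78,Two year,4,1,True
"""

# Tokens treated as missing (assumed; adjust after inspecting the real file).
MISSING = {"", "null", "NULL", "NaN"}


def load_rows(handle):
    """Read CSV rows into dicts, mapping missing-value tokens to None."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append(
            {col: (None if val.strip() in MISSING else val.strip())
             for col, val in raw.items()}
        )
    return rows


def missing_counts(rows):
    """Count None entries per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


rows = load_rows(io.StringIO(SAMPLE))
counts = missing_counts(rows)
for column, n_missing in counts.items():
    print(f"{column}: {n_missing} missing")
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle. All values stay as strings at this stage, so problems such as the negative age remain visible for the preprocessing step.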

Data Preprocessing

Based on your analysis of the dataset, select how you would handle each preprocessing challenge:

1. How would you handle missing values in the Gender column (5 missing values)?

Impute with mode (Male)

Replace missing values with the most common gender in the dataset.

Create an 'Unknown' category

Treat missing gender as a separate 'Unknown' category.

Drop rows with missing Gender

Remove the 5 rows with missing gender values.

Leave as null

Perform analysis with missing values as is.

Replace with random values

Randomly assign 'Male' or 'Female' to missing values without considering dataset patterns.
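Two of the options above, the separate 'Unknown' category and mode imputation, can be compared with a short sketch. The sample list is invented for illustration, and `fill_unknown` and `fill_mode` are hypothetical helper names; missing values are assumed to have been read as `None`.

```python
from collections import Counter

# Invented sample; None marks a missing Gender value.
genders = ["Male", None, "Female", "Male", None, "Female", "Male"]


def fill_unknown(values, label="Unknown"):
    """Keep missingness visible by giving it its own category."""
    return [label if v is None else v for v in values]


def fill_mode(values):
    """Replace missing entries with the most common observed value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


with_unknown = fill_unknown(genders)
with_mode = fill_mode(genders)
print(Counter(with_unknown))
print(Counter(with_mode))
```

The 'Unknown' variant preserves the fact that a value was missing, while mode imputation shifts the observed gender balance toward the majority value.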

2. How would you handle the inconsistent formats in SignUpDate?

Convert all to a standard datetime format

Parse and standardize all date formats (MM/DD/YYYY, YYYY-MM-DD, DD-MM-YYYY) to a consistent format.

Extract year and month only

Extract just the year and month components and discard the day information.

Convert to customer tenure

Calculate months since sign-up to create a consistent numerical tenure feature.

Use as separate features

Keep different date formats as separate categorical features in the analysis.

Ignore SignUpDate entirely

Drop the column and don't use any time-related information in the analysis.
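The first and third options can be combined: parse each date into one standard form, then derive tenure in months. The sketch below is a minimal version that assumes only the three formats named above occur and that they can be told apart by trying each in turn. The sample strings and the snapshot date are invented, and `parse_signup` and `tenure_months` are hypothetical helpers.

```python
from datetime import date, datetime

# The three formats named in the question: MM/DD/YYYY, YYYY-MM-DD, DD-MM-YYYY.
FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y")


def parse_signup(text):
    """Try each known format; return a date, or None if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def tenure_months(signup, as_of):
    """Whole months between signup and the snapshot date, never negative."""
    months = (as_of.year - signup.year) * 12 + (as_of.month - signup.month)
    if as_of.day < signup.day:
        months -= 1  # the current month is not yet complete
    return max(months, 0)


raw = ["03/15/2022", "2021-11-30", "15-03-2022", "not a date"]
parsed = [parse_signup(r) for r in raw]
as_of = date(2024, 1, 1)  # hypothetical snapshot date
tenures = [tenure_months(d, as_of) if d else None for d in parsed]
print(parsed)
print(tenures)
```

Unparseable values come back as `None` rather than raising, so they can be counted and handled like any other missing value.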

3. How would you handle the Age outliers (negative values and unrealistically high values)?

Remove or cap values outside realistic range

Remove or cap ages outside a realistic range (e.g., 18-80 years).

Replace with median age

Replace outliers with the median age from the dataset.

Remove rows with age outliers

Drop all rows containing unrealistic age values.

Keep as is

Leave age outliers in the dataset without modification.

Convert age to binary variable

Create a binary "adult" flag (1 if age >= 18, 0 otherwise) and discard the continuous age information.
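The capping, median-replacement, and row-removal options can be placed side by side. The sketch uses the 18-80 range given as an example above; the age list is partly taken from the preview (including -25) with one invented high value (150). The function names are hypothetical.

```python
from statistics import median

# Ages from the preview plus invented values 150 and 33.
ages = [52, 42, 54, 67, 41, -25, 150, 33]

LOW, HIGH = 18, 80  # example range from the question text


def is_valid(age):
    return LOW <= age <= HIGH


def cap(values):
    """Clamp every age into [LOW, HIGH]."""
    return [min(max(a, LOW), HIGH) for a in values]


def replace_with_median(values):
    """Swap out-of-range ages for the median of the in-range ages."""
    med = median(a for a in values if is_valid(a))
    return [a if is_valid(a) else med for a in values]


def drop_invalid(values):
    """Keep only in-range ages (in a full dataset, the whole row is dropped)."""
    return [a for a in values if is_valid(a)]


capped = cap(ages)
median_filled = replace_with_median(ages)
kept = drop_invalid(ages)
print(capped)
print(median_filled)
print(kept)
```

Capping turns -25 into 18, which asserts an age that was never observed; median replacement and removal avoid that at the cost of flattening or shrinking the data. With only 100 rows, each dropped row is 1% of the dataset.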

4. How would you handle the inconsistent DataUsage formats (MB vs GB)?

Convert all to GB

Standardize by converting all values to GB (e.g., 1000 MB = 1 GB).

Convert all to MB

Standardize by converting all values to MB (e.g., 1 GB = 1000 MB).

Create usage categories

Create categorical bins (Low, Medium, High) based on usage amounts.

Use separate features

Create separate features for MB and GB values.

Treat as text data

Analyze the usage values as string data without numeric conversion.
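Converting everything to one unit might look like the sketch below. It assumes usage is stored as text such as `500 MB` or `2.5GB` and uses the decimal convention stated above (1000 MB = 1 GB); the real column may use other spellings. The pattern, samples, and `to_gb` helper are illustrative assumptions.

```python
import re

# A number followed by an MB or GB unit, with optional spaces (assumed format).
PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(MB|GB)\s*$", re.IGNORECASE)


def to_gb(text):
    """Return usage in GB as a float, or None if the text cannot be parsed."""
    if text is None:
        return None
    match = PATTERN.match(text)
    if not match:
        return None
    value = float(match.group(1))
    unit = match.group(2).upper()
    return value / 1000 if unit == "MB" else value


samples = ["500 MB", "2.5GB", "1500mb", "null", None]
converted = [to_gb(s) for s in samples]
print(converted)
```

Converting to MB instead is the same function with the branch reversed (multiply GB values by 1000); either choice works as long as it is applied to every row.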

5. How would you handle the inconsistent ContractType capitalization and formatting?

Standardize capitalization and format

Convert all to consistent format (e.g., "Month-to-month", "One year", "Two year").

Create binary features

Create one-hot encoded features for contract types regardless of formatting.

Create contract length in months

Convert contract types to numeric values (1, 12, 24 months).

Keep as is

Use contract types as they appear in the dataset.

Replace with churn probability

Replace contract type with the average churn rate for each contract type, creating data leakage.
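Standardizing the labels and deriving a contract length in months can be done in one lookup. The sketch assumes the variants differ only in capitalization, spacing, and punctuation (the preview shows `month-to-month` alongside `Two year`); other spellings would need extra entries. `CANONICAL` and `normalize_contract` are hypothetical names.

```python
import re

# Lookup key: lowercase letters and digits only. Value: (label, length in months).
CANONICAL = {
    "monthtomonth": ("Month-to-month", 1),
    "oneyear": ("One year", 12),
    "twoyear": ("Two year", 24),
}


def normalize_contract(text):
    """Map a raw contract label to (canonical label, months), or None if unknown."""
    key = re.sub(r"[^a-z0-9]", "", text.lower())
    return CANONICAL.get(key)


raw_labels = ["month-to-month", " TWO YEAR ", "One_Year", "Month to Month", "weekly"]
normalized = [normalize_contract(label) for label in raw_labels]
for raw, result in zip(raw_labels, normalized):
    print(repr(raw), "->", result)
```

Returning `None` for unrecognized labels makes unexpected values easy to spot instead of silently creating a new category.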

Exploratory Data Analysis

Based on your analysis of the dataset, select the 5 most important findings that would help understand customer churn:

Overall churn rate is 44%

44% of customers have churned, which is very high for the telecom industry.

Month-to-month contracts have ~62% churn rate

Customers with month-to-month contracts churn at nearly twice the rate of those with two-year contracts.

One-year contracts have ~24% churn rate

Customers with one-year contracts have the lowest churn rate among contract types.

High correlation between satisfaction and churn

Customers with lower satisfaction scores (0-2) have significantly higher churn rates (~60%) than those with higher scores (4-6, ~35%).

More support calls correlate with higher churn

Customers with 6+ support calls have a ~70% churn rate compared to ~25% for those with 0-2 calls.

Gender doesn't strongly predict churn

Male and female customers have similar churn rates with no statistically significant difference.

Electronic check payment method has higher churn

Customers using electronic checks for payment have higher churn rates than other payment methods.

Negative correlation between subscription length and churn

Customers with longer subscription lengths are less likely to churn.

Age is not a strong predictor of churn

Customer age doesn't show a significant correlation with churn behavior.

Higher monthly charges correlate with increased churn

Customers paying higher monthly fees are more likely to churn, especially on month-to-month contracts.

Data usage directly causes churn

Customers with higher data usage always churn more frequently, indicating a causal relationship.

Customer ID patterns predict churn

There's a significant pattern between customer ID numbers and likelihood to churn.
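Most of the findings above are grouped churn rates, which can be checked with a small helper. The rows below are invented and do not reproduce the percentages quoted in the options; they only show the calculation. `churn_rate_by` and `call_bucket` are hypothetical names, and the bucket boundaries follow the 0-2 and 6+ groups mentioned above.

```python
from collections import defaultdict

# Invented, already-cleaned rows (not taken from the real dataset).
rows = [
    {"ContractType": "Month-to-month", "SupportCalls": 7, "Churned": True},
    {"ContractType": "Month-to-month", "SupportCalls": 1, "Churned": True},
    {"ContractType": "Month-to-month", "SupportCalls": 6, "Churned": False},
    {"ContractType": "One year", "SupportCalls": 0, "Churned": False},
    {"ContractType": "One year", "SupportCalls": 2, "Churned": False},
    {"ContractType": "Two year", "SupportCalls": 8, "Churned": True},
    {"ContractType": "Two year", "SupportCalls": 1, "Churned": False},
    {"ContractType": "Month-to-month", "SupportCalls": 3, "Churned": True},
]


def churn_rate_by(data, key):
    """Share of churned customers within each group produced by key(row)."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in data:
        group = key(row)
        totals[group] += 1
        churned[group] += int(row["Churned"])
    return {group: churned[group] / totals[group] for group in totals}


def call_bucket(row):
    calls = row["SupportCalls"]
    if calls <= 2:
        return "0-2"
    return "3-5" if calls <= 5 else "6+"


overall = sum(r["Churned"] for r in rows) / len(rows)
by_contract = churn_rate_by(rows, lambda r: r["ContractType"])
by_calls = churn_rate_by(rows, call_bucket)
print(f"overall: {overall:.0%}")
print(by_contract)
print(by_calls)
```

With 100 customers, some groups will hold only a handful of rows, so group sizes are worth reporting next to each rate.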


Machine Learning Approach

Based on your analysis, select the best approach for building a predictive model for customer churn:

1. Which features would you select as most important for your model?

ContractType, Satisfaction, MonthlyCharges, SupportCalls, SubscriptionLength

Focus on the features with strongest correlation to churn and business relevance.

All features except CustomerID and SignUpDate

Use most available features but exclude non-predictive identifiers.

Only ContractType and SupportCalls

Focus only on the two strongest individual predictors.

Create interaction features between MonthlyCharges and ContractType

Engineer features that capture the relationship between pricing and contract terms.

Include CustomerID as a predictive feature

Use the customer identifier as a predictive variable in your model.

Only use Gender and Age

Focus only on demographic information for prediction.
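The interaction-feature option can be sketched as one-hot contract flags multiplied by the monthly charge. The example row uses values from the preview (customer 2); the feature names and the `encode` helper are hypothetical, and the contract label is assumed to be standardized already.

```python
CONTRACTS = ["Month-to-month", "One year", "Two year"]


def encode(row):
    """One-hot encode ContractType and add charge-by-contract interaction terms."""
    features = {"MonthlyCharges": row["MonthlyCharges"]}
    for contract in CONTRACTS:
        flag = 1.0 if row["ContractType"] == contract else 0.0
        features[f"Contract_{contract}"] = flag
        # The interaction is nonzero only for the customer's own contract type.
        features[f"Charges_x_{contract}"] = flag * row["MonthlyCharges"]
    return features


row = {"MonthlyCharges": 64.42, "ContractType": "Two year"}
features = encode(row)
for name, value in features.items():
    print(f"{name}: {value}")
```

The interaction columns let a linear model learn a different price sensitivity per contract type, which a single MonthlyCharges coefficient cannot express.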

2. Which algorithm would you choose for this classification problem?

Logistic Regression

Use logistic regression for interpretability and establishing a baseline.

Random Forest

Use random forest to capture non-linear relationships and get feature importance.

Support Vector Machine

Use SVM for potentially higher accuracy with proper tuning.

Comparison of multiple algorithms

Compare logistic regression, random forest, and gradient boosting to select the best performer.

Deep Neural Network with 10+ layers

Build a complex deep learning model regardless of the dataset size.

Simple rule-based system

Define fixed thresholds for each variable without using machine learning.
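To make the logistic regression baseline concrete, the sketch below fits a one-feature model by plain gradient descent using only the standard library. The data is invented and perfectly separable, so it shows the mechanics rather than realistic performance; in practice a library implementation would normally be used. The learning rate, epoch count, and function names are illustrative assumptions.

```python
import math

# Invented toy data: support calls per customer and whether they churned.
calls = [0, 1, 2, 2, 6, 7, 8, 9]
churned = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(churn) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


xs = [c / 10 for c in calls]  # scale the feature to roughly [0, 1]
w, b = fit_logistic(xs, churned)
predictions = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
accuracy = sum(p == y for p, y in zip(predictions, churned)) / len(churned)
print(f"w={w:.2f}, b={b:.2f}, training accuracy={accuracy:.2f}")
```

The positive weight `w` is what makes this model interpretable: more support calls raise the predicted churn probability. The accuracy printed here is measured on the training data only, so it says nothing about generalization.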

3. How would you evaluate your model's performance?

Accuracy only

Focus on overall prediction accuracy as the main metric.

Precision and recall

Evaluate based on precision (reducing false positives) and recall (reducing false negatives).

ROC-AUC and precision-recall curves

Use curves and area under curve metrics to compare models across different thresholds.

Cross-validation with business-focused metrics

Use stratified k-fold cross-validation with metrics tied to business cost of false positives vs. false negatives.

No validation, use training data performance

Evaluate model effectiveness based only on how well it predicts the training data.

Manual inspection of predictions

Choose a few random examples and manually check if predictions make sense.
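The metrics named in these options can be computed directly from confusion counts. The sketch below also shows a simple stratified fold assignment and a cost-weighted score; the labels, predictions, and cost figures (a missed churner costing five times a wasted retention offer) are invented for illustration, and the helper names are hypothetical.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = churned (invented labels)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # invented model predictions

FN_COST, FP_COST = 5, 1  # hypothetical business costs


def confusion_counts(truth, pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds (no shuffling)."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        indices = [i for i, v in enumerate(labels) if v == label]
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
business_cost = FN_COST * fn + FP_COST * fp
folds = stratified_folds(y_true, k=2)
print(f"precision={precision}, recall={recall}, accuracy={accuracy}")
print(f"business cost={business_cost}")
print(folds)
```

Each fold ends up with the same share of churners as the full list, which is the point of stratification on a dataset this small. A real run would shuffle within each class before dealing indices into folds.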

4. How would you handle the class imbalance in the churn data?

No special handling needed

The data is close enough to balanced (44% churn rate) that no special techniques are required.

Use SMOTE for synthetic samples

Generate synthetic examples of the minority class to balance the dataset.

Adjust class weights

Apply class weights in the model to give more importance to the minority class.

Optimize classification threshold

Adjust the prediction threshold based on business priorities and cost-benefit analysis.

Duplicate majority class samples

Make copies of non-churned customers to create an even more imbalanced dataset.

Always predict the majority class

Simply predict that no customers will churn since non-churners are the slight majority.
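Two of these options reduce to small calculations. The first part of the sketch computes "balanced" class weights, n_samples / (n_classes × class_count), for the 44/56 split stated above. The second picks the classification threshold that minimizes a cost function. The scores, labels, and cost values are invented, and the helper names are hypothetical.

```python
# Class weights for the 44% churn rate over 100 customers.
counts = {1: 44, 0: 56}  # 1 = churned
n_samples = sum(counts.values())
weights = {label: n_samples / (len(counts) * n) for label, n in counts.items()}
print(weights)

# Threshold selection on invented model scores.
scores = [0.9, 0.8, 0.65, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
FN_COST, FP_COST = 5, 1  # hypothetical: a missed churner costs 5x a wasted offer


def cost_at(threshold):
    """Total cost when customers with score >= threshold are flagged as churners."""
    cost = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1 and not flagged:
            cost += FN_COST
        elif label == 0 and flagged:
            cost += FP_COST
    return cost


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best_threshold = min(thresholds, key=cost_at)
best_cost = cost_at(best_threshold)
print(f"default 0.5 cost: {cost_at(0.5)}")
print(f"best threshold: {best_threshold} (cost {best_cost})")
```

The weights come out close to 1 for both classes, which supports the view that a 44% churn rate needs little rebalancing. The threshold, by contrast, moves well below 0.5 once missed churners are costed more heavily than false alarms.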

Actionable Recommendations

Based on your analysis, select the 3 most effective recommendations to reduce customer churn:

Incentivize longer contracts

Offer discounts or added benefits for customers who switch from month-to-month to one or two-year contracts.

Improve customer support

Develop a proactive intervention system for customers with multiple support calls to address their issues before they decide to leave.

Satisfaction improvement program

Implement a targeted outreach program for customers with low satisfaction scores (0-2) to address their concerns.

Review pricing strategy

Evaluate and potentially adjust pricing for customers with high monthly charges, especially those on month-to-month contracts.

Payment method alternatives

Encourage customers using electronic checks to switch to other payment methods with better retention rates.

Early warning system

Implement the predictive model to identify customers at high risk of churning and target them with retention offers.

Loyalty rewards program

Create incentives that increase in value with subscription length to reward customer loyalty.

Increase prices to boost revenue

Raise monthly charges for all customers to maximize short-term revenue before they potentially churn.

Ignore churn and focus only on acquisition

Accept current churn rates as inevitable and focus exclusively on acquiring new customers at a faster rate.

Make it harder to cancel service

Add complications to the cancellation process to reduce the number of customers who successfully cancel.

Target marketing based on CustomerID

Direct marketing efforts based on patterns discovered in customer ID numbers.

Implement mandatory customer surveys

Require all customers to complete satisfaction surveys monthly to gather more data.


Simulation Complete

Joan Guadalupe (Data Science Manager)

Excellent work on the customer churn analysis project! I'm impressed with your approach and insights. Let me share your evaluation based on how you tackled each part of the assignment.

Your Final Score: 0 (0%)

Score Breakdown:

Section                    Your Score  Maximum Score
Data Preprocessing         0           25
Exploratory Data Analysis  0           25
Machine Learning Approach  0           20
Final Recommendations      0           15
Total                      0           85

Joan Guadalupe (Data Science Manager)

Thank you for completing this data science simulation. I hope you found it insightful and educational. Feel free to retry the simulation to explore different approaches and see how they impact your results.

In a real data science role, you would now proceed to implement your recommendations and monitor their impact, creating a feedback loop to continuously improve the model and business outcomes.