Data Scientist Job Simulator
Welcome to the Data Scientist Job Simulator! In this interactive experience, you'll work as a data scientist to analyze customer churn data and develop strategies to reduce customer attrition.
Choose how you'd like to proceed:
Start Full Simulation
Experience the complete data science journey from initial briefing through analysis and recommendations.
Choose this if: You want to experience the full simulation with storyline and context.
Complete the Analysis
Skip directly to answering questions about your analysis of the dataset.
Choose this if: You've already downloaded and analyzed the dataset and want to submit your findings.
Meeting Your Manager
Hello, and welcome to the Analytics team! I'm Joan Guadalupe, the Data Science Manager. It's great to have you on board. We have an important project for you to work on right away.
Thanks for the warm welcome! I'm excited to be here and eager to get started.
That's the spirit! Our telecommunications division has been experiencing challenges with customer churn. They've collected customer data and want us to analyze it to understand what factors might be contributing to customers leaving.
Project Briefing
Let me give you the full picture of what we're dealing with. Our telecom client has been losing customers at an alarming rate over the past quarter. Preliminary analysis indicates the churn rate is around 43%, which is concerning.
They've provided us with a dataset that includes customer demographics, subscription details, charges, data usage, contract information, and whether they've churned. The key business questions they want answered are:
- What factors are most strongly correlated with customer churn?
- Can we build a model to predict which customers are at risk of churning?
- What actionable recommendations can we provide to reduce churn?
Got it. So we need to perform exploratory data analysis, identify key factors related to churn, build a predictive model, and provide actionable insights.
Exactly! I'll share the dataset with you now. It contains information from 100 customers. Please note that there are some data quality issues you'll need to address before analysis: missing values, inconsistent formats, and a few outliers.
Dataset Overview
Here's the customer churn dataset. Let me walk you through the columns:
- CustomerID: Unique identifier for each customer
- Age: Customer's age in years
- Gender: Customer's gender (Male/Female)
- SubscriptionLength: Length of customer's subscription (in months or years)
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- DataUsage: Customer's data usage (in MB or GB)
- ContractType: Type of contract (Month-to-month, One year, Two year)
- PaymentMethod: Method of payment (Electronic check, Mailed check, Bank transfer)
- SignUpDate: Date when the customer signed up
- SupportCalls: Number of calls to customer service
- Satisfaction: Customer satisfaction rating (0-6 scale)
- Churned: Whether the customer has left (True/False)
Once you download the dataset, you may need to preprocess it before analyzing it and applying machine learning methods. When you're done, come back and continue to the 'Complete the Analysis' section to answer questions about the dataset. All the best!
Data Preview
CustomerID | Age | Gender | SubscriptionLength | MonthlyCharges | ContractType | SupportCalls | Satisfaction | Churned |
---|---|---|---|---|---|---|---|---|
1 | 52 | null | null | null | month-to-month | 2 | null | False |
2 | 42 | Male | 9 months | 64.42 | Two year | 5 | 3 | True |
3 | 54 | Male | 37 months | 62.37 | Two year | 6 | 5 | True |
4 | 67 | Female | 33 months | 39.35 | Two year | 9 | 2 | True |
5 | 41 | Female | 51 months | 27.45 | Two year | 1 | 3 | False |
6 | -25 | Male | 42 months | 91.78 | Two year | 4 | 1 | True |
Note: This is just a preview of the first 6 rows. The actual dataset contains 100 rows.
Download the dataset and analyze it using your preferred data analysis tool (Python, R, Excel, etc.). When you've completed your analysis, return to continue the simulation.
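If you work in Python, note that the SubscriptionLength values in the preview are strings such as "9 months" rather than numbers. The sketch below shows one possible way to turn them into a numeric month count. The helper name `subscription_months` is hypothetical, and the "N years" spelling is an assumption based on the column description, since the preview only shows month values.

```python
import re
from typing import Optional

# Assumed layout: a number followed by "month(s)" or "year(s)".
_LENGTH_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(month|year)s?\s*$", re.IGNORECASE)


def subscription_months(raw: Optional[str]) -> Optional[float]:
    """Return the subscription length in months, or None if it cannot be parsed."""
    if raw is None:
        return None
    match = _LENGTH_RE.match(raw)
    if match is None:
        return None  # e.g. the literal text "null" or an unexpected layout
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value * 12 if unit == "year" else value
```

Values that do not match the assumed layout come back as None, so they can be counted and reviewed instead of being silently converted.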
Data Preprocessing
Based on your analysis of the dataset, select how you would handle each preprocessing challenge:
1. How would you handle missing values in the Gender column (5 missing values)?
Impute with mode (Male)
Replace missing values with the most common gender in the dataset.
Create an 'Unknown' category
Treat missing gender as a separate 'Unknown' category.
Drop rows with missing Gender
Remove the 5 rows with missing gender values.
Leave as null
Perform analysis with missing values as is.
Replace with random values
Randomly assign 'Male' or 'Female' to missing values without considering dataset patterns.
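Two of the options above, the separate 'Unknown' category and mode imputation, can be sketched in a few lines of plain Python. This is an illustration only: the function names are hypothetical, and missing values are assumed to be represented as None.

```python
from collections import Counter


def fill_unknown(values, label="Unknown"):
    """Treat missing entries as their own category."""
    return [label if v is None else v for v in values]


def fill_mode(values):
    """Replace missing entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Gender values from the six preview rows (row 1 is missing).
genders = [None, "Male", "Male", "Female", "Female", "Male"]
print(fill_unknown(genders))
print(fill_mode(genders))
```

Keeping an explicit 'Unknown' label preserves the fact that the value was missing, whereas mode imputation hides it.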
2. How would you handle the inconsistent formats in SignUpDate?
Convert all to a standard datetime format
Parse and standardize all date formats (MM/DD/YYYY, YYYY-MM-DD, DD-MM-YYYY) to a consistent format.
Extract year and month only
Extract just the year and month components and discard the day information.
Convert to customer tenure
Calculate months since sign-up to create a consistent numerical tenure feature.
Use as separate features
Keep different date formats as separate categorical features in the analysis.
Ignore SignUpDate entirely
Drop the column and don't use any time-related information in the analysis.
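As a rough sketch of the date-handling options above, the three formats named in the first option can be tried in turn and the result converted into a tenure in months. The function names and the reference date are assumptions for illustration; the sketch also assumes that every slash-separated date is MM/DD/YYYY, which should be checked against the actual file.

```python
from datetime import date, datetime
from typing import Optional

# Formats listed in the option text: MM/DD/YYYY, YYYY-MM-DD, DD-MM-YYYY.
SIGNUP_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y")


def parse_signup(raw: Optional[str]) -> Optional[date]:
    """Try each known format; return None if none of them fits."""
    if raw is None:
        return None
    for fmt in SIGNUP_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def tenure_months(signup: date, as_of: date) -> int:
    """Whole calendar months between sign-up and a reference date (day of month ignored)."""
    return (as_of.year - signup.year) * 12 + (as_of.month - signup.month)
```

For example, `tenure_months(parse_signup("15-03-2021"), date(2023, 1, 1))` gives 22 under these assumptions.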
3. How would you handle the Age outliers (negative values and unrealistically high values)?
Remove or cap values outside realistic range
Remove or cap ages outside a realistic range (e.g., 18-80 years).
Replace with median age
Replace outliers with the median age from the dataset.
Remove rows with age outliers
Drop all rows containing unrealistic age values.
Keep as is
Leave age outliers in the dataset without modification.
Convert age to binary variable
Create a binary "adult" flag (1 if age >= 18, 0 otherwise) and discard the continuous age information.
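To make the capping and median-replacement options concrete, here is a minimal sketch using the 18-80 range given as an example above. The function names are hypothetical, and the right bounds for the real dataset are a judgment call.

```python
from statistics import median


def cap_ages(ages, low=18, high=80):
    """Clip every age into the [low, high] range."""
    return [min(max(age, low), high) for age in ages]


def replace_with_median(ages, low=18, high=80):
    """Replace out-of-range ages with the median of the in-range ages."""
    valid = [age for age in ages if low <= age <= high]
    fallback = median(valid)
    return [age if low <= age <= high else fallback for age in ages]


# Ages from the six preview rows; -25 is clearly invalid.
preview_ages = [52, 42, 54, 67, 41, -25]
print(cap_ages(preview_ages))
print(replace_with_median(preview_ages))
```

Capping turns -25 into 18, which is still a made-up value, while median replacement uses a typical age from the valid rows; either way it is worth recording which rows were changed.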
4. How would you handle the inconsistent DataUsage formats (MB vs GB)?
Convert all to GB
Standardize by converting all values to GB (e.g., 1000 MB = 1 GB).
Convert all to MB
Standardize by converting all values to MB (e.g., 1 GB = 1000 MB).
Create usage categories
Create categorical bins (Low, Medium, High) based on usage amounts.
Use separate features
Create separate features for MB and GB values.
Treat as text data
Analyze the usage values as string data without numeric conversion.
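The unit conversion described above could look like the following sketch, which standardizes everything to GB using 1000 MB = 1 GB as stated in the options. The DataUsage column is not shown in the preview, so the "500 MB" / "1.5 GB" string layout and the helper name `usage_gb` are assumptions.

```python
import re
from typing import Optional

# Assumed layout: a number followed by a unit, e.g. "500 MB" or "1.5GB".
_USAGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(MB|GB)\s*$", re.IGNORECASE)


def usage_gb(raw: Optional[str]) -> Optional[float]:
    """Return data usage in GB, or None if the value cannot be parsed."""
    if raw is None:
        return None
    match = _USAGE_RE.match(raw)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).upper()
    return value / 1000 if unit == "MB" else value
```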
5. How would you handle the inconsistent ContractType capitalization and formatting?
Standardize capitalization and format
Convert all to consistent format (e.g., "Month-to-month", "One year", "Two year").
Create binary features
Create one-hot encoded features for contract types regardless of formatting.
Create contract length in months
Convert contract types to numeric values (1, 12, 24 months).
Keep as is
Use contract types as they appear in the dataset.
Replace with churn probability
Replace contract type with the average churn rate for each contract type, creating data leakage.
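Standardizing the contract labels, and optionally mapping them to a contract length in months, might be sketched as follows. The canonical spellings and the 1/12/24 mapping come from the options above; the function name and the handling of separators such as underscores are assumptions.

```python
import re
from typing import Optional

CANONICAL = {
    "monthtomonth": "Month-to-month",
    "oneyear": "One year",
    "twoyear": "Two year",
}

# Contract length in months, as listed in the options.
CONTRACT_MONTHS = {"Month-to-month": 1, "One year": 12, "Two year": 24}


def normalize_contract(raw: Optional[str]) -> Optional[str]:
    """Map variants such as 'month-to-month' or ' TWO YEAR ' to one canonical label."""
    if raw is None:
        return None
    key = re.sub(r"[^a-z]", "", raw.lower())  # drop case, spaces, hyphens, underscores
    return CANONICAL.get(key)  # None for labels that are not recognized
```

For example, `normalize_contract("month-to-month")` returns "Month-to-month", matching the mixed capitalization visible in the preview rows.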
Exploratory Data Analysis
Based on your analysis of the dataset, select the 5 most important findings that would help understand customer churn:
Overall churn rate is 44%
44% of customers have churned, which is very high for the telecom industry.
Month-to-month contracts have ~62% churn rate
Customers with month-to-month contracts churn at nearly twice the rate of those with two-year contracts.
One-year contracts have ~24% churn rate
Customers with one-year contracts have the lowest churn rate among contract types.
High correlation between satisfaction and churn
Customers with lower satisfaction scores (0-2) have significantly higher churn rates (~60%) than those with higher scores (4-6, ~35%).
More support calls correlate with higher churn
Customers with 6+ support calls have a ~70% churn rate compared to ~25% for those with 0-2 calls.
Gender doesn't strongly predict churn
Male and female customers have similar churn rates with no statistically significant difference.
Electronic check payment method has higher churn
Customers using electronic checks for payment have higher churn rates than other payment methods.
Negative correlation between subscription length and churn
Customers with longer subscription lengths are less likely to churn.
Age is not a strong predictor of churn
Customer age doesn't show a significant correlation with churn behavior.
Higher monthly charges correlate with increased churn
Customers paying higher monthly fees are more likely to churn, especially on month-to-month contracts.
Data usage directly causes churn
Customers with higher data usage always churn more frequently, indicating a causal relationship.
Customer ID patterns predict churn
There's a significant pattern between customer ID numbers and likelihood to churn.
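Most of the findings above are churn rates compared across groups. A minimal sketch of that calculation is shown below; `churn_rate_by` is a hypothetical helper, and the sample rows are the six preview rows with contract labels already standardized. Six rows are far too few to be representative, so the output here will not match the figures quoted above for the full 100-row dataset.

```python
from collections import defaultdict


def churn_rate_by(rows, key):
    """Return {group value: share of customers in that group who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row["Churned"]:
            churned[group] += 1
    return {group: churned[group] / totals[group] for group in totals}


preview = [
    {"ContractType": "Month-to-month", "Churned": False},
    {"ContractType": "Two year", "Churned": True},
    {"ContractType": "Two year", "Churned": True},
    {"ContractType": "Two year", "Churned": True},
    {"ContractType": "Two year", "Churned": False},
    {"ContractType": "Two year", "Churned": True},
]
print(churn_rate_by(preview, "ContractType"))
```

The same helper can be pointed at any categorical or binned column, for example support-call ranges or satisfaction bands.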
Machine Learning Approach
Based on your analysis, select the best approach for building a predictive model for customer churn:
1. Which features would you select as most important for your model?
ContractType, Satisfaction, MonthlyCharges, SupportCalls, SubscriptionLength
Focus on the features with strongest correlation to churn and business relevance.
All features except CustomerID and SignUpDate
Use most available features but exclude non-predictive identifiers.
Only ContractType and SupportCalls
Focus only on the two strongest individual predictors.
Create interaction features between MonthlyCharges and ContractType
Engineer features that capture the relationship between pricing and contract terms.
Include CustomerID as a predictive feature
Use the customer identifier as a predictive variable in your model.
Only use Gender and Age
Focus only on demographic information for prediction.
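The interaction-feature option above can be illustrated with a small sketch: one-hot encode the contract type and multiply each indicator by the monthly charge, so a model can learn a separate price effect per contract type. The function and feature names are hypothetical, and the contract labels are assumed to be standardized already.

```python
CONTRACT_TYPES = ("Month-to-month", "One year", "Two year")


def contract_charge_features(monthly_charges, contract_type):
    """Build one-hot contract indicators plus charge-by-contract interaction terms."""
    features = {}
    for name in CONTRACT_TYPES:
        indicator = 1.0 if contract_type == name else 0.0
        features[f"Contract_{name}"] = indicator
        features[f"Charges_x_{name}"] = indicator * monthly_charges
    return features


# Row 2 of the preview: a two-year contract at 64.42 per month.
print(contract_charge_features(64.42, "Two year"))
```

With only 100 rows, every extra feature increases the risk of overfitting, so interaction terms like these are best compared against a simpler feature set.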
2. Which algorithm would you choose for this classification problem?
Logistic Regression
Use logistic regression for interpretability and establishing a baseline.
Random Forest
Use random forest to capture non-linear relationships and get feature importance.
Support Vector Machine
Use SVM for potentially higher accuracy with proper tuning.
Comparison of multiple algorithms
Compare logistic regression, random forest, and gradient boosting to select the best performer.
Deep Neural Network with 10+ layers
Build a complex deep learning model regardless of the dataset size.
Simple rule-based system
Define fixed thresholds for each variable without using machine learning.
3. How would you evaluate your model's performance?
Accuracy only
Focus on overall prediction accuracy as the main metric.
Precision and recall
Evaluate based on precision (reducing false positives) and recall (reducing false negatives).
ROC-AUC and precision-recall curves
Use curves and area under curve metrics to compare models across different thresholds.
Cross-validation with business-focused metrics
Use stratified k-fold cross-validation with metrics tied to business cost of false positives vs. false negatives.
No validation, use training data performance
Evaluate model effectiveness based only on how well it predicts the training data.
Manual inspection of predictions
Choose a few random examples and manually check if predictions make sense.
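For reference, precision and recall as described above can be computed directly from true and predicted labels. This is a small sketch with a hypothetical function name and made-up labels; in practice it would be applied to held-out data (for example within each cross-validation fold), not the training set.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, where True means churned."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
print(precision_recall(y_true, y_pred))
```

Low precision means retention offers go to customers who would have stayed anyway; low recall means customers who leave were never flagged.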
4. How would you handle the class imbalance in the churn data?
No special handling needed
The data is close enough to balanced (44% churn rate) that no special techniques are required.
Use SMOTE for synthetic samples
Generate synthetic examples of the minority class to balance the dataset.
Adjust class weights
Apply class weights in the model to give more importance to the minority class.
Optimize classification threshold
Adjust the prediction threshold based on business priorities and cost-benefit analysis.
Duplicate majority class samples
Make copies of non-churned customers to create an even more imbalanced dataset.
Always predict the majority class
Simply predict that no customers will churn since non-churners are the slight majority.
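The threshold option above can be sketched as a simple cost comparison: for each candidate threshold, add up the cost of missed churners and of unnecessary retention offers, then keep the cheapest. The cost values, scores, and function names below are invented for illustration and are not taken from the dataset.

```python
def expected_cost(y_true, scores, threshold, cost_fn=5.0, cost_fp=1.0):
    """Total cost of missed churners (FN) and unnecessary offers (FP) at a threshold."""
    cost = 0.0
    for churned, score in zip(y_true, scores):
        flagged = score >= threshold
        if churned and not flagged:
            cost += cost_fn
        elif flagged and not churned:
            cost += cost_fp
    return cost


def best_threshold(y_true, scores, candidates, **costs):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: expected_cost(y_true, scores, t, **costs))


# Invented labels and predicted churn probabilities.
y_true = [True, True, False, False, True, False]
scores = [0.9, 0.4, 0.35, 0.1, 0.6, 0.55]
print(best_threshold(y_true, scores, [0.3, 0.5, 0.7]))
```

When losing a customer is assumed to cost much more than a wasted offer, the lowest threshold tends to win, as it does in this toy example; with different costs the choice can change.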
Actionable Recommendations
Based on your analysis, select the 3 most effective recommendations to reduce customer churn:
Incentivize longer contracts
Offer discounts or added benefits for customers who switch from month-to-month to one- or two-year contracts.
Improve customer support
Develop a proactive intervention system for customers with multiple support calls to address their issues before they decide to leave.
Satisfaction improvement program
Implement a targeted outreach program for customers with low satisfaction scores (0-2) to address their concerns.
Review pricing strategy
Evaluate and potentially adjust pricing for customers with high monthly charges, especially those on month-to-month contracts.
Payment method alternatives
Encourage customers using electronic checks to switch to other payment methods with better retention rates.
Early warning system
Implement the predictive model to identify customers at high risk of churning and target them with retention offers.
Loyalty rewards program
Create incentives that increase in value with subscription length to reward customer loyalty.
Increase prices to boost revenue
Raise monthly charges for all customers to maximize short-term revenue before they potentially churn.
Ignore churn and focus only on acquisition
Accept current churn rates as inevitable and focus exclusively on acquiring new customers at a faster rate.
Make it harder to cancel service
Add complications to the cancellation process to reduce the number of customers who successfully cancel.
Target marketing based on CustomerID
Direct marketing efforts based on patterns discovered in customer ID numbers.
Implement mandatory customer surveys
Require all customers to complete satisfaction surveys monthly to gather more data.
Simulation Complete
Excellent work on the customer churn analysis project! I'm impressed with your approach and insights. Let me share your evaluation based on how you tackled each part of the assignment.
Score Breakdown:
Section | Your Score | Maximum Score |
---|---|---|
Data Preprocessing | 0 | 25 |
Exploratory Data Analysis | 0 | 25 |
Machine Learning Approach | 0 | 20 |
Final Recommendations | 0 | 15 |
Total | 0 | 85 |
Thank you for completing this data science simulation. I hope you found it insightful and educational. Feel free to retry the simulation to explore different approaches and see how they impact your results.
In a real data science role, you would now proceed to implement your recommendations and monitor their impact, creating a feedback loop to continuously improve the model and business outcomes.