Commit
remove sensitive contents
JifuZhao committed Aug 27, 2018
1 parent 1db5154 commit a92c2eb
Showing 41 changed files with 11 additions and 293,117 deletions.
26 changes: 0 additions & 26 deletions 01. Conversion Rate.ipynb
Original file line number Diff line number Diff line change
@@ -1,31 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Optimizing conversion rate is likely the most common work of a data scientist, and rightfully so.\n",
"\n",
"The data revolution has a lot to do with the fact that now we are able to collect all sorts of data about people who buy something on our site as well as people who don't. This gives us a tremendous opportunity to understand what's working well (and potentially scale it even further) and what's not working well (and fix it).\n",
"\n",
"The goal of this challenge is to build a model that predicts conversion rate and, based on the model, come up with ideas to improve revenue.\n",
"\n",
"This challenge is significantly easier than all others in this collection. There are no dates, no tables to join, no feature engineering required, and the problem is really straightforward. Therefore, it is a great starting point to get familiar with data science take-home challenges."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"We have data about users who hit our site: whether they converted or not as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users and the number of pages visited during that session (as a proxy for site activity/time spent on site).\n",
"\n",
"Your project is to:\n",
"* Predict conversion rate\n",
"* Come up with recommendations for the product team and the marketing team to improve conversion rate"
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
31 changes: 0 additions & 31 deletions 02. Spanish Translation AB Test.ipynb
@@ -1,36 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"A/B tests play a huge role in website optimization. Analyzing A/B tests data is a very important data scientist responsibility. Especially, data scientists have to make sure that results are reliable, trustworthy, and conclusions can be drawn.\n",
"\n",
"Furthermore, companies often run tens, if not hundreds, of A/B tests at the same time. Manually analyzing all of them would require lot of time and people. Therefore, it is common practice to look at the typical A/B test analysis steps and try to automate as much as possible. This frees up time for the data scientists to work on more high level topics.\n",
"\n",
"In this challenge, you will have to analyze results from an A/B test. Also, you will be asked to design an algorithm to automate some steps."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"Company XYZ is a worldwide e-commerce site with localized versions of the site.\n",
"\n",
"A data scientist at XYZ noticed that Spain-based users have a much higher conversion rate than any other Spanish-speaking country. She therefore went and talked to the international team in charge of Spain And LatAm to see if they had any ideas about why that was happening.\n",
"\n",
"Spain and LatAm country manager suggested that one reason could be translation. All Spanish- speaking countries had the same translation of the site which was written by a Spaniard. They agreed to try a test where each country would have its one translation written by a local. That is, Argentinian users would see a translation written by an Argentinian, Mexican users by a Mexican and so on. Obviously, nothing would change for users from Spain.\n",
"\n",
"After they run the test however, they are really surprised cause the test is negative. I.e., it appears that the non-localized translation was doing better!\n",
"\n",
"You are asked to:\n",
"* Confirm that the test is actually negative. That is, it appears that the old version of the site with just one translation across Spain and LatAm performs better\n",
"* Explain why that might be happening. Are the localized translations really worse?\n",
"* If you identified what was wrong, design an algorithm that would return FALSE if the same problem is happening in the future and TRUE if everything is good and the results can be trusted."
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
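The third task above asks for a check that returns FALSE when the test setup cannot be trusted. One possible sketch, assuming the underlying problem is that test and control users are not split evenly within each country: the function below compares each segment's share of test users with the overall share. The field names `country` and `test` and the `tolerance` threshold are illustrative assumptions, not taken from the data set, and a real analysis would more likely use a statistical test than a fixed threshold.

```python
from collections import Counter


def randomization_looks_ok(rows, segment_key="country", group_key="test", tolerance=0.05):
    """Return True when every segment's share of test users is close to the overall share.

    rows: list of dicts holding a segment label and a 0/1 test flag
    (hypothetical field names, chosen for illustration only).
    """
    if not rows:
        raise ValueError("no rows to check")
    overall_share = sum(r[group_key] for r in rows) / len(rows)
    seg_total = Counter(r[segment_key] for r in rows)
    seg_test = Counter(r[segment_key] for r in rows if r[group_key] == 1)
    for segment, n in seg_total.items():
        # A segment that is over- or under-represented in the test group
        # suggests the split was not random with respect to that segment.
        if abs(seg_test[segment] / n - overall_share) > tolerance:
            return False
    return True
```

A fixed tolerance is easy to read but ignores sample size; small segments will trip it by chance, so it is best treated as a first screen.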
26 changes: 0 additions & 26 deletions 03. Employee Retention.ipynb
@@ -1,31 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Employee turn-over is a very costly problem for companies. The cost of replacing an employee if often larger than 100K USD, taking into account the time spent to interview and find a replacement, placement fees, sign-on bonuses and the loss of productivity for several months.\n",
"\n",
"It is only natural then that data science has started being applied to this area. Understanding why and when employees are most likely to leave can lead to actions to improve employee retention as well as planning new hiring in advance. This application of DS is sometimes called *people analytics or people data science* (if you see a job title: people data scientist, this is your job).\n",
"\n",
"In this challenge, you have a data set with info about the employees and have to predict when employees are going to quit by understanding the main drivers of employee churn."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"We got employee data from a few companies. We have data about all employees who joined from 2011/01/24 to 2015/12/13. For each employee, we also know if they are still at the company as of 2015/12/13 or they have quit. Beside that, we have general info about the employee, such as avg salary during her tenure, dept, and yrs of experience.\n",
"\n",
"As said above, the goal is to predict employee retention and understand its main drivers. Specifically, you should:\n",
"* Assume, for each company, that the headcount starts from zero on 2011/01/23. Estimate employee headcount, for each company, on each day, from 2011/01/24 to 2015/12/13. That is, if by 2012/03/02 2000 people have joined company 1 and 1000 of them have already quit, then company headcount on 2012/03/02 for company 1 would be 1000. \n",
" - **You should create a table with 3 columns: day, employee_headcount, company_id.**\n",
"* What are the main factors that drive employee churn? Do they make sense? Explain your findings.\n",
"* If you could add to this data set just one variable that could help explain employee churn, what would that be?"
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
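The headcount table described above can be built by recording +1 on each join date and -1 on each quit date, then taking a running sum per company. This is a minimal sketch under stated assumptions: the input is a list of `(company_id, join_date, quit_date)` tuples (a hypothetical layout, not the data set's actual schema), `quit_date` is `None` for current employees, and an employee stops counting on the day they quit.

```python
from collections import defaultdict
from datetime import date, timedelta


def headcount_table(employees, start, end):
    """Return rows of (day, employee_headcount, company_id) for every day in [start, end].

    employees: iterable of (company_id, join_date, quit_date or None).
    Assumes all join dates fall on or after `start`.
    """
    delta = defaultdict(int)  # net change in headcount per (company, day)
    companies = set()
    for company_id, join_date, quit_date in employees:
        companies.add(company_id)
        delta[(company_id, join_date)] += 1
        if quit_date is not None:
            delta[(company_id, quit_date)] -= 1

    rows = []
    for company_id in sorted(companies):
        headcount = 0  # headcount starts from zero the day before `start`
        day = start
        while day <= end:
            headcount += delta.get((company_id, day), 0)
            rows.append((day, headcount, company_id))
            day += timedelta(days=1)
    return rows
```

Calling `headcount_table(employees, date(2011, 1, 24), date(2015, 12, 13))` would cover the period named in the challenge.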
30 changes: 0 additions & 30 deletions 04. Identifying Fraudulent Activities.ipynb
@@ -1,35 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"E-commerce websites often transact huge amounts of money. And whenever a huge amount of money is moved, there is a high risk of users performing fraudulent activities, e.g. using stolen credit cards, doing money laundry, etc.\n",
"\n",
"Machine Learning really excels at identifying fraudulent activities. Any website where you put your credit card information has a risk team in charge of avoiding frauds via machine learning.\n",
"\n",
"The goal of this challenge is to build a machine learning model that predicts the probability that the first transaction of a new user is fraudulent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"Company XYZ is an e-commerce site that sells hand-made clothes.\n",
"\n",
"You have to build a model that predicts whether a user has a high probability of using the site to perform some illegal activity or not. This is a super common task for data scientists.\n",
"\n",
"You only have information about the user first transaction on the site and based on that you have to make your classification (\"fraud/no fraud\").\n",
"\n",
"These are the tasks you are asked to do:\n",
"* For each user, determine her country based on the numeric IP address.\n",
"* Build a model to predict whether an activity is fraudulent or not. Explain how different assumptions about the cost of false positives vs false negatives would impact the model.\n",
"* Your boss is a bit worried about using a model she doesn't understand for something as important as fraud detection. How would you explain her how the model is making the predictions? Not from a mathematical perspective (she couldn't care less about that), but from a user perspective. What kinds of users are more likely to be classified as at risk? What are their characteristics?\n",
"* Let's say you now have this model which can be used live to predict in real time if an activity is fraudulent or not. From a product perspective, how would you use it? That is, what kind of different user experiences would you build based on the model output?"
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
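The first task above, mapping a numeric IP address to a country, is a range lookup. A minimal sketch, assuming the lookup table is a list of `(lower_bound, upper_bound, country)` tuples with inclusive, non-overlapping ranges (a hypothetical layout): sort the ranges once, then binary-search the lower bounds for each address.

```python
from bisect import bisect_right


def build_country_lookup(ranges):
    """Return a function mapping a numeric IP to a country, or None if no range matches.

    ranges: iterable of (lower_bound, upper_bound, country); bounds are inclusive
    and ranges are assumed not to overlap.
    """
    ordered = sorted(ranges)
    lowers = [lo for lo, _, _ in ordered]

    def country_for(ip):
        # Index of the last range whose lower bound is <= ip.
        i = bisect_right(lowers, ip) - 1
        if i >= 0 and ordered[i][0] <= ip <= ordered[i][1]:
            return ordered[i][2]
        return None

    return country_for
```

Binary search keeps each lookup logarithmic in the number of ranges, which matters when every transaction needs a country.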
30 changes: 0 additions & 30 deletions 05. Funnel Analysis.ipynb
@@ -1,35 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"The goal is to perform [funnel analysis](https://en.wikipedia.org/wiki/Funnel_analysis) for an e-commerce website.\n",
"\n",
"Typically, websites have a clear path to conversion: for instance, you land on the home page, then you search, select a product, and buy it. At each of these steps, some users will drop off and leave the site. The sequence of pages that lead to conversion is called 'funnel'.\n",
"\n",
"Data Science can have a tremendous impact on funnel optimization. Funnel analysis allows to understand where/when our users abandon the website. It gives crucial insights on user behavior and on ways to improve the user experience. Also, it often allows to discover bugs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"You are looking at data from an e-commerce website. The site is very simple and has just 4 pages:\n",
"* The first page is the **home page**. When you come to the site for the first time, you can only land on the home page as a first page.\n",
"* From the home page, the user can perform a search and land on the **search page**. \n",
"* From the search page, if the user clicks on a product, she will get to the **payment page**, where she is asked to provide payment information in order to buy that product.\n",
"* If she does decide to buy, she ends up on the **confirmation page**\n",
"\n",
"The company CEO isn't very happy with the volume of sales and, especially, of sales coming from new users. Therefore, she asked you to investigate whether there is something wrong in the conversion funnel or, in general, if you could suggest how conversion rate can be improved.\n",
"\n",
"Specifically, she is interested in :\n",
"* **A full picture of funnel conversion rate** for both desktop and mobile\n",
"* Some insights on **what the product team should focus on** in order to improve conversion rate as well as anything you might discover that could help improve conversion rate."
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
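The funnel conversion rate asked for above is the fraction of users who move from each page to the next. A minimal sketch, assuming you already have the number of users who reached each page (the step names below are illustrative labels, not column names from the data set); it would be run once for desktop and once for mobile.

```python
def funnel_rates(counts, steps=("home", "search", "payment", "confirmation")):
    """Return the step-to-step conversion rate for consecutive funnel pages.

    counts: dict mapping a step name to the number of users who reached it.
    """
    rates = {}
    for prev, cur in zip(steps, steps[1:]):
        reached_prev = counts[prev]
        # Guard against an empty step so the ratio is always defined.
        rates[f"{prev}->{cur}"] = counts[cur] / reached_prev if reached_prev else 0.0
    return rates
```

Comparing the per-step rates across devices shows where one platform loses users that the other keeps.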
25 changes: 0 additions & 25 deletions 06. Pricing Test.ipynb
@@ -1,30 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Pricing optimization is, non surprisingly, another area where data science can provide huge value.\n",
"\n",
"The goal here is to evaluate whether a pricing test running on the site has been successful. As always, you should focus on user segmentation and provide insights about segments who behave differently as well as any other insights you might find."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"Company XYZ sells a software for $\\$39$. Since revenue has been flat for some time, the VP of Product has decided to run a test increasing the price. She hopes that this would increase revenue. In the experiment, $66\\%$ of the users have seen the old price ($\\$39$), while a random sample of $33\\%$ users a higher price ($\\$59$).\n",
"\n",
"The test has been running for some time and the VP of Product is interested in understanding how it went and whether it would make sense to increase the price for all the users.\n",
"\n",
"Especially he asked you the following questions:\n",
"* **Should the company sell its software for $\\$39$ or $\\$59$?**\n",
"* The VP of Product is interested in having a holistic view into user behavior, especially focusing on actionable insights that might increase conversion rate. **What are your main findings looking at the data?**\n",
"* [Bonus] The VP of Product feels that the test has been running for too long and he should have been able to get statistically significant results in a shorter time. Do you agree with her intuition? **After how many days you would have stopped the test?** Please, explain why."
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
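A higher price usually lowers the conversion rate, so the two price groups are better compared on revenue per visitor than on conversion rate alone. A minimal sketch of that metric follows; the function name and its inputs are illustrative assumptions, and a full answer would also need a significance test on the difference.

```python
def revenue_per_visitor(price, visitors, conversions):
    """Average revenue generated per visitor who was shown the given price."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return price * conversions / visitors
```

Computing this once for the $39 group and once for the $59 group puts both on the same scale even though the groups differ in size.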
26 changes: 0 additions & 26 deletions 07. Marketing Email Campaign.ipynb
@@ -1,31 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Optimizing marketing campaigns is one of the most common data science tasks. Among the many possible marketing tools, one of the most efficient is using emails.\n",
"\n",
"Emails are great cause they are free and can be easily personalized. Email optimization involves personalizing the text and/or the subject, who should receive it, when should be sent, etc. Machine Learning excels at this."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"The marketing team of an e-commerce site has launched an email campaign. This site has email addresses from all the users who created an account in the past.\n",
"\n",
"They have chosen a random sample of users and emailed them. The email let the user know about a new feature implemented on the site. From the marketing team perspective, a success is if the user clicks on the link inside of the email. This link takes the user to the company site.\n",
"\n",
"You are in charge of figuring out how the email campaign performed and were asked the following questions:\n",
"* What percentage of users opened the email and what percentage clicked on the link within the email?\n",
"* The VP of marketing thinks that it is stupid to send emails to a random subset and in a random way. Based on all the information you have about the emails that were sent, can you build a model to optimize in future email campaigns to maximize the probability of users clicking on the link inside the email?\n",
"* By how much do you think your model would improve click through rate ( defined as # of users who click on the link / total users who received the email). How would you test that?\n",
"* Did you find any interesting pattern on how the email campaign performed for different segments of users? Explain."
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
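The first question above reduces to two ratios over the emails sent, with click-through rate defined as in the task: users who clicked divided by users who received the email. A minimal sketch, assuming the opened and clicked emails are available as collections of IDs (the argument names are hypothetical):

```python
def campaign_rates(emails_sent, opened_ids, clicked_ids):
    """Return the open rate and click-through rate, both relative to emails sent."""
    if emails_sent <= 0:
        raise ValueError("emails_sent must be positive")
    opened = set(opened_ids)    # de-duplicate in case an ID is logged twice
    clicked = set(clicked_ids)
    return {
        "open_rate": len(opened) / emails_sent,
        "click_through_rate": len(clicked) / emails_sent,
    }
```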
27 changes: 0 additions & 27 deletions 08. Song Challenge.ipynb
@@ -1,32 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Company XYZ is a very early stage startup. They allow people to stream music from their mobile for free. Right now, they still only have songs from the Beatles in their music collection, but they are planning to expand soon.\n",
"\n",
"They still have all their data in json files and they are interested in getting some basic info about their users as well as building a very preliminary song recommendation model in order to increase user engagement.\n",
"\n",
"Working with json files is important. If you join a very early stage start-up, they might not have a nice database and all data will be in jsons. Third party data are often stored in json files as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"You are the fifth employee at company XYZ. The good news is that if the company becomes big, you will become very rich with the stocks. The bad news is that, at such an early stage, the data is usually very messy. All their data is stored in json files.\n",
"\n",
"The company CEO asked you very specific questions:\n",
"* What are the top 3 and the bottom 3 states in terms of number of users?\n",
"* What are the top 3 and the bottom 3 states in terms of user engagement? You can choose how to mathematically define user engagement. What the CEO cares about here is in which states users are using the product a lot/very little.\n",
"* The CEO wants to send a gift to the first user who signed-up for each state. That is, the first user who signed-up from California, from Oregon, etc. Can you give him a list of those users?\n",
"* Build a function that takes as an input any of the songs in the data and returns the most likely song to be listened next. That is, if, for instance, a user is currently listening to \"Eight Days A Week\", which song has the highest probability of being played right after it by the same user? This is going to be v1 of a song recommendation model.\n",
"* How would you set up a test to check whether your model works well and is improving engagement?"
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
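The next-song function described above can be approximated by counting, per user, which song directly follows which, and returning the most frequent follower. This is a sketch of a simple first-order transition count, not necessarily the approach the notebook takes; the `(user_id, timestamp, song)` input layout is an assumption.

```python
from collections import Counter, defaultdict


def build_next_song_model(plays):
    """Return a function mapping a song to its most frequent immediate successor.

    plays: iterable of (user_id, timestamp, song); timestamps must be sortable.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, song in plays:
        by_user[user_id].append((timestamp, song))

    transitions = defaultdict(Counter)  # song -> Counter of songs played right after it
    for events in by_user.values():
        events.sort()  # chronological order within each user
        for (_, current), (_, following) in zip(events, events[1:]):
            transitions[current][following] += 1

    def next_song(song):
        followers = transitions.get(song)
        if not followers:
            return None  # song never seen, or never followed by another play
        return followers.most_common(1)[0][0]

    return next_song
```

Counting only within a user's own history matches the "by the same user" wording of the task; transitions across different users are never formed.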
27 changes: 0 additions & 27 deletions 09. Clustering Grocery Items.ipynb
@@ -1,32 +1,5 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Goal\n",
"Online shops often sell tons of different items and this can become very messy very quickly!\n",
"\n",
"Data science can be extremely useful to automatically organize the products in categories so that they can be easily found by the customers.\n",
"\n",
"The goal of this challenge is to look at user purchase history and create categories of items that are likely to be bought together and, therefore, should belong to the same section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Description\n",
"Company XYZ is an online grocery store. In the current version of the website, they have manually grouped the items into a few categories based on their experience.\n",
"\n",
"However, they now have a lot of data about user purchase history. Therefore, they would like to put the data into use!\n",
"This is what they asked you to do:\n",
"* The company founder wants to meet with some of the best customers to go through a focus group with them. You are asked to send the ID of the following customers to the founder:\n",
" - the customer who bought the most items overall in her lifetime\n",
" - for each item, the customer who bought that product the most\n",
"* Cluster items based on user co-purchase history. That is, create clusters of products that have the highest probability of being bought together. The goal of this is to replace the old/manually created categories with these new ones. Each item can belong to just one cluster."
]
},
{
"cell_type": "code",
"execution_count": 1,
Expand Down
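Clustering items by co-purchase history needs, as a first step, a measure of how often two items are bought together. A minimal sketch of that step only, assuming each purchase is available as a list of item IDs (a hypothetical layout); the resulting counts could then feed whichever clustering method is chosen.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how many baskets contain each unordered pair of distinct items.

    baskets: iterable of item-ID collections, one per purchase.
    Keys of the result are (smaller_id, larger_id) tuples.
    """
    counts = Counter()
    for basket in baskets:
        # set() ignores repeated items; sorted() gives each pair a canonical order.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts
```

Raw counts favor popular items, so some normalization (for example by each item's own purchase frequency) is usually worth considering before clustering.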
0 comments on commit a92c2eb