Chapter 3 Python / R analyses

3.1 Sample Questions

3.1.1 Active users on a messaging application

  • Data Analysis
  • Python
  • Pandas
  • Data Manipulation
  • External Dataset

Here is a table schema for a P2P messaging application. The table contains send/receive message data for the application’s users. The structure is as follows:

Table name: user_messaging

Column name    Description
date           date of the message
sender_id      id of the message sender
receiver_id    id of the message receiver

Using Python, calculate what fraction of senders sent messages to at least 9 unique people on March 1, 2018.


# In Python
import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/ddi_message_app_data.csv')

print(data.head())
##          date  sender_id  receiver_id
## 0  2018-03-01          5            2
## 1  2018-03-01          8            6
## 2  2018-03-01          1            2
## 3  2018-03-01          4            8
## 4  2018-03-01          2            7
print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 999 entries, 0 to 998
## Data columns (total 3 columns):
## date           999 non-null object
## sender_id      999 non-null int64
## receiver_id    999 non-null int64
## dtypes: int64(2), object(1)
## memory usage: 23.5+ KB
## None
def fraction_sent_by_day(data: pd.DataFrame, day: str, count: int) -> float:
    # Restrict to messages sent on the given day
    df_day = data[data['date'] == day]
    # Count unique receivers per sender (duplicate sends to the
    # same person only count once)
    unique_receivers = df_day.groupby('sender_id')['receiver_id'].nunique()
    # Fraction of senders who messaged at least `count` unique people
    return (unique_receivers >= count).mean()

fraction_sent_by_day(data, '2018-03-01', 9)
## 0.9
# In R
library(tidyverse)
df_raw <- read.csv('https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/ddi_message_app_data.csv')

df <- df_raw %>%
    # Messages sent on March 1, 2018
    filter(date == '2018-03-01') %>%
    # Number of unique receivers per sender
    group_by(sender_id) %>%
    summarise(unique_sends = n_distinct(receiver_id)) %>%
    # Fraction of senders with at least 9 unique receivers
    summarise(answer = sum(unique_sends >= 9) / n())

df
## # A tibble: 1 × 1
##   answer
##    <dbl>
## 1    0.9

3.1.2 Time for a response on a messaging application

  • Data Analysis
  • Python
  • Pandas
  • Data Manipulation

You are given a dataset with information about messages sent between users in a P2P messaging application. Below is the dataset’s schema:

Column name   Data type   Description
date          string      date of the message sent/received, format is ‘YYYY-mm-dd’
timestamp     integer     timestamp of the message sent/received, epoch seconds
sender_id     integer     id of the message sender
receiver_id   integer     id of the message receiver

Given this, write code to find the fraction of messages that receive a response within five minutes, i.e. messages where the receiver sends a message back to the original sender within 5 minutes.

# In Python
import pandas as pd 

df = pd.read_csv('data/data_analysis_2-messages.csv')
df.head()
##          date   timestamp  sender_id  receiver_id
## 0  2018-03-01  1519923378          1            5
## 1  2018-03-01  1519942810          1            4
## 2  2018-03-01  1519918950          1            5
## 3  2018-03-01  1519930114          1            2
## 4  2018-03-01  1519920410          1            2

3.1.3 Cleaning and analyzing employee data

  • Data Analysis
  • Pandas
  • Data Manipulation
  • Data Cleaning

Below is a snippet from a table that contains information about employees who work at Company XYZ:

employee_name  employee_id  date_joined  age  yrs_of_experience
Andy           123456       2015-02-15   45   24
Beth           789456       NaN          36   15
Cindy          654123       2017-05-16   34   14
Dale           963852       2018-01-15   25   4

Company XYZ recently migrated database systems, causing some of the date_joined records to be NULL. You’re told by an analyst in human resources that NULL records for the date_joined field indicate the employee joined prior to 2010. You also find that there are multiple employees with the same name, as well as duplicate records for some employees.

Given this, write code to find the number of employees who joined each month. You can group all of the null values as Dec 1, 2009.

# In R
library(tidyverse)
library(lubridate)
df <- data.frame(employee_name = c("Andy", "Beth", "Cindy", "Dale", "Andy"),
                 employee_id = c(123456, 789456, 654123, 963852, 123222),
                 date_joined = c("2015-02-15", NA, "2017-05-16", "2018-01-15", "2018-01-15"),
                 age = c(45, 36, 34, 25, 45),
                 yrs_of_experience = c(24, 15, 14, 4, 24))

df %>%
    # Per HR guidance, treat NULL join dates as Dec 1, 2009
    replace_na(list(date_joined = "2009-12-01")) %>%
    mutate(yr_month = floor_date(as_date(date_joined), "month")) %>%
    # Collapse duplicate records for the same employee
    group_by(employee_name, age, yrs_of_experience) %>%
    slice(1) %>%
    ungroup() %>%
    # Count employees joining each month
    group_by(yr_month) %>%
    summarize(n = n()) %>%
    arrange(yr_month)
## # A tibble: 4 × 2
##   yr_month       n
##   <date>     <int>
## 1 2009-12-01     1
## 2 2015-02-01     1
## 3 2017-05-01     1
## 4 2018-01-01     1

3.1.4 Analyzing employee data

Below is a snippet from a table that contains information about employees who work at Company XYZ:

Column name        Data type   Example value   Description
employee_name      string      Cindy           Name of employee
employee_id        integer     1837204         Unique id for each employee
yrs_of_experience  integer     14              Total working years of experience
yrs_at_company     integer     10              Total working years at Company XYZ
compensation       integer     100000          Dollar value of employee compensation
career_track       string      technical       Potential values: technical, non-technical, executive

Company XYZ’s Human Resources department is trying to understand compensation across the company and has asked you to pull data to help them make decisions regarding employee compensation.

Question: Can you pull the average, median, minimum, maximum, and standard deviation of salary across 5-year experience buckets at Company XYZ? (i.e. get the corresponding statistics for experience buckets 0-5, 5-10, 10-15, etc.)

# In R (the table is not provided, so we simulate yrs_of_experience;
# see the sketch after the output for the real compensation query)
df <- data.frame(yrs_of_experience = c(sample(1:25, 100, replace = TRUE)))

df %>%
    # Bucket experience into 5-year bins: [0,5), [5,10), [10,15), [15,20), 20+
    mutate(exp_cat = case_when(
        yrs_of_experience < 5 ~ "0-5",
        yrs_of_experience >= 5 & yrs_of_experience < 10 ~ "5-10",
        yrs_of_experience >= 10 & yrs_of_experience < 15 ~ "10-15",
        yrs_of_experience >= 15 & yrs_of_experience < 20 ~ "15-20",
        TRUE ~ "Over 20")) %>%
    group_by(exp_cat) %>%
    summarize(average = mean(yrs_of_experience),
              median = median(yrs_of_experience),
              min = min(yrs_of_experience),
              max = max(yrs_of_experience),
              sd = sd(yrs_of_experience))
## # A tibble: 5 × 6
##   exp_cat average median   min   max    sd
##   <chr>     <dbl>  <dbl> <int> <int> <dbl>
## 1 0-5        2.69    3       1     4  1.30
## 2 10-15     11.7    12      10    14  1.49
## 3 15-20     16.6    16      15    19  1.54
## 4 5-10       6.57    6.5     5     9  1.50
## 5 Over 20   22.0    21      20    25  1.95