Exploratory Data Analysis (Restaurant Sales in NYC Data)

Restaurant Food Delivery Project

This portfolio project was an opportunity to analyze a dataset that can provide insights to the restaurant management service in NYC.¶

The aim of the project was to answer the following questions through exploratory data analysis.¶

What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?

How many orders are not rated? Is this data clean? Why or why not?

What are the most popular cuisine types?

How does their price range and review rating compare to those that don't perforT as well?

try to find out why they perform better.

Are they ordering more on weekdays or weekends?

Is preparation time longer on weekdays or weekends?

What is the delivery time on weekdays vs weekends?

Which are the top restaurants with orders received?

What is the most popular cuisine type on weekends?

what percentage of orders are less than $20? less than $30? $40? $50? what is the average delivery time in NYC?

The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders hey placed cost of cuisine typ comparison ratings based on cuisine type preparation time based o cuisine type.

The company want p to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4.

Find the restaurants fulfilling the criteria to get the promotional offer. The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders.

The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.)

The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during week days and weekends?

What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? ove the business?

I used the process of Exploratory Data Analysis EDA to answer the above questions¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder

In [2]:

df=pd.read_csv ('foodhub.csv')
df.head()

Out[2]:

	order_id	customer_id	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	1477147	337525	Hangawi	Korean	30.75	Weekend	Not given	25	20
1	1477685	358141	Blue Ribbon Sushi Izakaya	Japanese	12.08	Weekend	Not given	25	23
2	1477070	66393	Cafe Habana	Mexican	12.23	Weekday	5	23	28
3	1477334	106968	Blue Ribbon Fried Chicken	American	29.20	Weekend	3	25	15
4	1478249	76942	Dirty Bird to Go	American	11.59	Weekday	4	25	24

In [3]:

#Is this data clean?

df.shape

Out[3]:

(1898, 9)

In [4]:

df.columns

Out[4]:

Index(['order_id', 'customer_id', 'restaurant_name', 'cuisine_type',
       'cost_of_the_order', 'day_of_the_week', 'rating',
       'food_preparation_time', 'delivery_time'],
      dtype='object')

1. Is the date clean? why or why not?¶

In [5]:

df.isnull().sum()

Out[5]:

order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64

In [6]:

df.duplicated().sum()

Out[6]:

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Data is cleaned, no null values, no duplicate values¶

2. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?¶

In [8]:

df.describe()

Out[8]:

	order_id	customer_id	cost_of_the_order	food_preparation_time	delivery_time
count	1.898000e+03	1898.000000	1898.000000	1898.000000	1898.000000
mean	1.477496e+06	171168.478398	16.498851	27.371970	24.161749
std	5.480497e+02	113698.139743	7.483812	4.632481	4.972637
min	1.476547e+06	1311.000000	4.470000	20.000000	15.000000
25%	1.477021e+06	77787.750000	12.080000	23.000000	20.000000
50%	1.477496e+06	128600.000000	14.140000	27.000000	25.000000
75%	1.477970e+06	270525.000000	22.297500	31.000000	28.000000
max	1.478444e+06	405334.000000	35.410000	35.000000	33.000000

The Minimum time for 20 minutes, the maximum time is 35 minutes and the average time is 27.37 minutes¶

3. How many orders are not rated?¶

In [9]:

count_rate = df['rating'].value_counts()

In [10]:

count_rate

Out[10]:

rating
Not given    736
5            588
4            386
3            188
Name: count, dtype: int64

There are no reviews for 736 orders¶

4. What are the most popular cuisine types?¶

How does their price range and review rating compare to those that don't perform as well? try to find out why they perform better.

In [11]:

count_cusine = df['cuisine_type'].value_counts()

In [12]:

count_cusine

Out[12]:

cuisine_type
American          584
Japanese          470
Italian           298
Chinese           215
Mexican            77
Indian             73
Middle Eastern     49
Mediterranean      46
Thai               19
French             18
Southern           17
Korean             13
Spanish            12
Vietnamese          7
Name: count, dtype: int64

The 4 most popular cuisine_type are; American (584) Japanese (470) Italian (298) Chinese (215)

In [13]:

def calculate_average_cost_by_cuisine(df):
    # Grouping by 'cuisine_type' and calculate the mean of average 'cost_of_the_order'
    average_costs = df.groupby('cuisine_type')['cost_of_the_order'].mean()
    return average_costs

In [14]:

average_costs = calculate_average_cost_by_cuisine(df)
print(average_costs)

cuisine_type
American          16.319829
Chinese           16.305209
French            19.793889
Indian            16.919726
Italian           16.418691
Japanese          16.304532
Korean            14.001538
Mediterranean     15.474783
Mexican           16.933117
Middle Eastern    18.820612
Southern          19.300588
Spanish           18.994167
Thai              19.207895
Vietnamese        12.882857
Name: cost_of_the_order, dtype: float64

In [15]:

average_costs_sorted = average_costs.sort_values(ascending=True)
plt.figure(figsize=(10, 6))
average_costs_sorted.plot(kind='bar')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Cost of the Order')
plt.title('Average Cost of Orders by Cuisine Type')
plt.xticks(rotation=45)
plt.show()

No description has been provided for this image

In [16]:

def calculate_average_rating_by_cuisine(df):
    # Grouping by 'cuisine_type' and calculate the mean of average rating
     # Replace "Not given" with NaN
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
    average_rating = df.groupby('cuisine_type')['rating'].mean()
    return average_rating

In [17]:

average_rating = calculate_average_rating_by_cuisine(df)
print(average_rating)

cuisine_type
American          4.298913
Chinese           4.338346
French            4.300000
Indian            4.540000
Italian           4.360465
Japanese          4.373626
Korean            4.111111
Mediterranean     4.218750
Mexican           4.416667
Middle Eastern    4.235294
Southern          4.307692
Spanish           4.833333
Thai              4.666667
Vietnamese        4.000000
Name: rating, dtype: float64

In [18]:

std_dev_mean_ratings = average_rating.std()
print(std_dev_mean_ratings)

0.21328238900474772

In [19]:

dataframe=df.loc[:,['restaurant_name','cuisine_type','cost_of_the_order','day_of_the_week','rating','food_preparation_time', 'delivery_time']]

In [20]:

dataframe.head()

Out[20]:

	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	Hangawi	Korean	30.75	Weekend	NaN	25	20
1	Blue Ribbon Sushi Izakaya	Japanese	12.08	Weekend	NaN	25	23
2	Cafe Habana	Mexican	12.23	Weekday	5.0	23	28
3	Blue Ribbon Fried Chicken	American	29.20	Weekend	3.0	25	15
4	Dirty Bird to Go	American	11.59	Weekday	4.0	25	24

What is the correlationship that exists within the dataset?¶

In [21]:

# Label encode categorical columns
label_encoder = LabelEncoder()

# List of categorical columns to encode
categorical_columns = ['restaurant_name', 'cuisine_type', 'day_of_the_week']

for column in categorical_columns:
    dataframe[column] = label_encoder.fit_transform(dataframe[column])

In [22]:

dataframe.head()

Out[22]:

	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	67	6	30.75	1	NaN	25	20
1	23	5	12.08	1	NaN	25	23
2	29	8	12.23	0	5.0	23	28
3	20	0	29.20	1	3.0	25	15
4	47	0	11.59	0	4.0	25	24

In [23]:

dataframe.corr()

Out[23]:

	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
restaurant_name	1.000000	-0.110332	-0.012247	-0.001656	0.031119	0.007109	0.026797
cuisine_type	-0.110332	1.000000	0.035184	-0.012140	0.028304	-0.028737	-0.005913
cost_of_the_order	-0.012247	0.035184	1.000000	0.015960	0.033983	0.041527	-0.029949
day_of_the_week	-0.001656	-0.012140	0.015960	1.000000	0.030727	0.021970	-0.534801
rating	0.031119	0.028304	0.033983	0.030727	1.000000	-0.006083	-0.009804
food_preparation_time	0.007109	-0.028737	0.041527	0.021970	-0.006083	1.000000	0.011094
delivery_time	0.026797	-0.005913	-0.029949	-0.534801	-0.009804	0.011094	1.000000

In [24]:

sns.heatmap(dataframe.corr(),
            annot=True,
           cbar=False)
plt.show()

a. The price range and reviews do no directly explain why some cuisine perform better because the average rating of cuisine is between 4.0 and 4.83 and standard deviation of 0.21 which implies the data points are very close to the mean.¶

b. In same light, there is no linear relationship within the dataset.¶

In [25]:

dataframe2=df.loc[:,['restaurant_name','cuisine_type','cost_of_the_order','day_of_the_week','rating','food_preparation_time', 'delivery_time']]

In [26]:

dataframe.head()

Out[26]:

	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time
0	67	6	30.75	1	NaN	25	20
1	23	5	12.08	1	NaN	25	23
2	29	8	12.23	0	5.0	23	28
3	20	0	29.20	1	3.0	25	15
4	47	0	11.59	0	4.0	25	24

In [27]:

# Group by 'cuisine_type' and 'day_of_the_week' and count the number of orders
order_counts = df.groupby(['cuisine_type', 'day_of_the_week']).size().unstack(fill_value=0)

# Display the result
print(order_counts)

day_of_the_week  Weekday  Weekend
cuisine_type                     
American             169      415
Chinese               52      163
French                 5       13
Indian                24       49
Italian               91      207
Japanese             135      335
Korean                 2       11
Mediterranean         14       32
Mexican               24       53
Middle Eastern        17       32
Southern               6       11
Spanish                1       11
Thai                   4       15
Vietnamese             3        4

In [28]:

average_times = df.groupby('cuisine_type')[['food_preparation_time', 'delivery_time']].mean()
print(average_times)

                food_preparation_time  delivery_time
cuisine_type                                        
American                    27.440068      24.193493
Chinese                     27.511628      23.855814
French                      26.888889      25.333333
Indian                      27.109589      24.082192
Italian                     27.483221      24.567114
Japanese                    27.510638      24.131915
Korean                      25.461538      20.923077
Mediterranean               27.000000      23.586957
Mexican                     26.727273      24.389610
Middle Eastern              26.673469      24.081633
Southern                    27.588235      23.823529
Spanish                     26.916667      23.416667
Thai                        27.315789      23.157895
Vietnamese                  25.714286      26.142857

In [29]:

# Reset index to use cuisine_type as a column for plotting
average_times = average_times.reset_index()
plt.figure(figsize=(9, 5))
sns.barplot(x='cuisine_type', y='food_preparation_time', data=average_times, palette='viridis')
plt.title('Average Food Preparation Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Food Preparation Time (minutes)')
plt.xticks(rotation=45)
plt.show()

In [30]:

# Plot the average delivery time
plt.figure(figsize=(9, 5))
sns.barplot(x='cuisine_type', y='delivery_time', data=average_times, palette='viridis')
plt.title('Average Delivery Time by Cuisine Type')
plt.xlabel('Cuisine Type')
plt.ylabel('Average Delivery Time (minutes)')
plt.xticks(rotation=45)
plt.show()

I have tried to look at the discrepancies in the data from total orders by weekday and weekend, and also looking at the average preparation time and selivery time.¶

However, the cuisine performance could be a preferance of the population charactics; such as background influence.

5. Are they ordering more on weekdays or weekends?¶

The above analysis shows that more orders come on the weekends looking at the counts

6. Is preparation time longer on weekdays or weekends?¶

In [31]:

average_times_by_week = df.groupby('day_of_the_week')[['food_preparation_time', 'delivery_time']].mean()
print(average_times_by_week)

                 food_preparation_time  delivery_time
day_of_the_week                                      
Weekday                      27.212066      28.340037
Weekend                      27.436714      22.470022

Preparation time is slightly longer on weekend with an average of 27.44 , fairly explained by the high orders received.¶

On the orther have delivery time is higher on the weekdays.

7. which are the top restaurants with orders received?¶

In [33]:

count_restaurants = df['restaurant_name'].value_counts()
freq_above_30 = count_restaurants[count_restaurants > 30]

In [34]:

print(freq_above_30)

restaurant_name
Shake Shack                      219
The Meatball Shop                132
Blue Ribbon Sushi                119
Blue Ribbon Fried Chicken         96
Parm                              68
RedFarm Broadway                  59
RedFarm Hudson                    55
TAO                               49
Han Dynasty                       46
Blue Ribbon Sushi Bar & Grill     44
Nobu Next Door                    42
Rubirosa                          37
Sushi of Gari 46                  37
Name: count, dtype: int64

The top three restaurants are;¶

Shake Shack (219), The Meatball Shop (132), and Blue Ribbon Sushi (119)

In [35]:

filtered_restaurants = df['restaurant_name'].value_counts()[df['restaurant_name'].value_counts() > 30]
# Sort ratings by their counts in ascending order
sorted_res = filtered_restaurants.index

plt.figure(figsize = (10,8))
sns.countplot(y='restaurant_name', data=df[df['restaurant_name'].isin(filtered_restaurants.index)], order=sorted_res,palette='viridis')
plt.xlabel('Frequency')
plt.ylabel('Restaurant')
plt.show()

The most popular restaurants by orders are Blue Shake Shack 219, The Meatball Shop 132, and Blue Ribbon Sushi 119

8. what is the most popular cuisine type on weekends?¶

In [36]:

weekend_cuisine_order = df.groupby(['cuisine_type', 'day_of_the_week']).size().unstack(fill_value=0)

In [37]:

print(weekend_cuisine_order)

day_of_the_week  Weekday  Weekend
cuisine_type                     
American             169      415
Chinese               52      163
French                 5       13
Indian                24       49
Italian               91      207
Japanese             135      335
Korean                 2       11
Mediterranean         14       32
Mexican               24       53
Middle Eastern        17       32
Southern               6       11
Spanish                1       11
Thai                   4       15
Vietnamese             3        4

In [38]:

# Filter DataFrame for 
filtered_df = df[df['day_of_the_week'].isin(['Weekend'])]
#order
sorted_cuisines = filtered_df.index

# Plotting with Seaborn
plt.figure(figsize=(12, 6))
sns.countplot(data=filtered_df, x='cuisine_type', hue='day_of_the_week', palette='viridis')
plt.title('Cuisine Type Count for Monday and Tuesday')
plt.xlabel('Cuisine Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

9. what percentage of orders are less than $20? less than $30? $40? $50?¶

In [39]:

def percentage_below_thresholds(df, thresholds, column_name='cost_of_the_order'):
    total_orders = df.shape[0]
    percentages = {}
    
    for threshold in thresholds:
        orders_below_threshold = df[df[column_name] < threshold].shape[0]
        percentage_below_threshold = (orders_below_threshold / total_orders) * 100
        percentages[threshold] = percentage_below_threshold
    
    return percentages

# Define the thresholds
thresholds = [20, 30, 40, 50]

# Get the percentages
percentages = percentage_below_thresholds(df, thresholds)

# Print the results
for threshold, percentage in percentages.items():
    print(f"Percentage of orders less than ${threshold}: {percentage:.2f}%")

Percentage of orders less than $20: 70.76%
Percentage of orders less than $30: 94.73%
Percentage of orders less than $40: 100.00%
Percentage of orders less than $50: 100.00%

what is the average delivery time in NYC?

In [40]:

df['delivery_time'].mean()

Out[40]:

24.161749209694417

IDs of these customers and the number of orders they placed¶

In [41]:

def top_customers(df, n=3):
    # Group by customer_id and count the number of orders for each customer
    customer_order_counts = df['customer_id'].value_counts()
    
    # Get the top n customers
    top_n_customers = customer_order_counts.head(n)
    
    # Convert to DataFrame for better readability
    top_n_customers_df = top_n_customers.reset_index()
    top_n_customers_df.columns = ['customer_id', 'number_of_orders']
    
    return top_n_customers_df

# Get the top 3 customers
top_3_customers = top_customers(df, 3)

print(top_3_customers)

   customer_id  number_of_orders
0        52832                13
1        47440                10
2        83287                 9

In [ ]:

In [42]:

def top_customers_cuisines(df, n=3):
    # Group customer_id and count the number of orders for each customer
    customer_order_counts = df['customer_id'].value_counts()
    
    # Get top n customers
    top_n_customers = customer_order_counts.head(n).index
    
    # Filter DFrame to include only orders from the top n customers
    top_customers_orders = df[df['customer_id'].isin(top_n_customers)]
    
    # Get unique cuisines from these orders
    top_cuisines = top_customers_orders['cuisine_type'].unique()
    
    return top_cuisines

# Get cuisines for top 3 customers
top_cuisines = top_customers_cuisines(df, 3)

print(top_cuisines)

['American' 'Chinese' 'Middle Eastern' 'Japanese' 'Mexican' 'Korean'
 'Indian' 'Spanish' 'French']

In [43]:

def top_customers_top_cuisines_cost(df, n_customers=3, n_cuisines=9):
    # Group customer_id and count the number of orders for each customer
    customer_order_counts = df['customer_id'].value_counts()
    
    # Get top n_customers customers based on number of orders
    top_customers = customer_order_counts.head(n_customers).index
    
    # Filter DataFrame to include only orders from top customers
    top_customers_orders = df[df['customer_id'].isin(top_customers)]
    
    # Calculate the total cost of orders for each cuisine type
    cuisine_total_cost = top_customers_orders.groupby('cuisine_type')['cost_of_the_order'].sum()
    
    # Get top n_cuisines cuisines based on total order cost
    top_cuisines = cuisine_total_cost.sort_values(ascending=False).head(n_cuisines)
    
    return top_cuisines

# Get total cost for top 3 cuisines ordered by top  customers
top_cuisines_cost = top_customers_top_cuisines_cost(df, 3, 9)

print(top_cuisines_cost)

cuisine_type
American          232.64
Japanese           96.58
Indian             54.66
Middle Eastern     40.89
Chinese            31.53
French             24.20
Mexican            16.06
Spanish            14.50
Korean             12.23
Name: cost_of_the_order, dtype: float64

In [44]:

plt.figure(figsize=(10, 7))
plt.pie(top_cuisines_cost, labels=top_cuisines_cost.index, autopct='%1.1f%%', startangle=140, )
plt.title('Total Cost Distribution by Cuisine')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

What are the ratings based on cuisine type and average rating from top customers only?¶

In [45]:

def top_customers_cuisine_rating_stats(df, n_customers=3):
    # customer_id and count number of orders for each customer
    customer_order_counts = df['customer_id'].value_counts()
    
    #  top n_customers customers based on number of orders
    top_n_customers = customer_order_counts.head(n_customers).index
    
    #  DF to include only orders from top n_customers customers
    top_customers_orders = df[df['customer_id'].isin(top_n_customers)]
    
    # empty lists to store ratings and number of ratings per cuisine
    cuisine_ratings = []
    cuisine_num_ratings = []
    
    # Iterate over each cuisine type
    for cuisine in top_customers_orders['cuisine_type'].unique():
        # orders for the current cuisine
        cuisine_orders = top_customers_orders[top_customers_orders['cuisine_type'] == cuisine]
        
        # 'Not given' ratings to NaN
        cuisine_orders['rating'] = pd.to_numeric(cuisine_orders['rating'], errors='coerce')
        
        # number of ratings and average rating for the current cuisine
        num_ratings = cuisine_orders['rating'].count()
        avg_rating = cuisine_orders['rating'].mean()
        
        # results to lists
        cuisine_ratings.append(avg_rating)
        cuisine_num_ratings.append(num_ratings)
    
    # DF to store results
    results_df = pd.DataFrame({
        'cuisine_type': top_customers_orders['cuisine_type'].unique(),
        'num_ratings': cuisine_num_ratings,
        'avg_rating': cuisine_ratings
    })
    
    return results_df

# Get ratings statistics for top customers for each cuisine
ratings_stats_per_cuisine = top_customers_cuisine_rating_stats(df, 3)

print(ratings_stats_per_cuisine)

     cuisine_type  num_ratings  avg_rating
0        American           11    4.090909
1         Chinese            1    3.000000
2  Middle Eastern            2    3.500000
3        Japanese            3    5.000000
4         Mexican            0         NaN
5          Korean            0         NaN
6          Indian            1    5.000000
7         Spanish            0         NaN
8          French            0         NaN

C:\Users\Brene\AppData\Local\Temp\ipykernel_16032\3916300919.py:21: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cuisine_orders['rating'] = pd.to_numeric(cuisine_orders['rating'], errors='coerce')

Preparation time based on cuisine type

In [46]:

def top_customers_cuisine_preparation_time(df, n_customers=3):
    # customer_id and count the number of orders for each customer
    customer_order_counts = df['customer_id'].value_counts()
    
    # top n_customers customers based on number of orders
    top_n_customers = customer_order_counts.head(n_customers).index
    
    # DataFrame to include only orders from the top n_customers customers
    top_customers_orders = df[df['customer_id'].isin(top_n_customers)]
    
    # Total preparation time for each cuisine type
    cuisine_total_preparation_time = top_customers_orders.groupby('cuisine_type')['food_preparation_time'].sum()
    
    return cuisine_total_preparation_time

# total preparation time for each cuisine type for the top 3 customers
total_preparation_time_per_cuisine = top_customers_cuisine_preparation_time(df, 3)

print(total_preparation_time_per_cuisine)

cuisine_type
American          403
Chinese            67
French             28
Indian             46
Japanese          206
Korean             32
Mexican            35
Middle Eastern     51
Spanish            35
Name: food_preparation_time, dtype: int64

In [47]:

total_preparation_time_per_cuisine.sort_values( ascending=False, inplace=True)
plt.figure(figsize=(8, 6))
data=total_preparation_time_per_cuisine.plot(kind='bar', )
plt.title('Total Food Preparation Time by Cuisine (Top 3 Customers)')
plt.xlabel('Cuisine Type')
plt.ylabel('Total Preparation Time')
plt.xticks(rotation=45)
plt.show()

The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer¶

In [48]:

def top_restaurants_by_ratings(df, min_rating_count=50, min_avg_rating=4):
    # Filter out "Not given" ratings to numeric values
    df_filtered = df[df['rating'] != 'Not given'].copy()
    df_filtered['rating'] = pd.to_numeric(df_filtered['rating'])

    #restaurant_name and calculate rating count and average rating
    rating_stats = df_filtered.groupby('restaurant_name')['rating'].agg(['count', 'mean']).reset_index()
    rating_stats.columns = ['restaurant_name', 'rating_count', 'avg_rating']

    # the conditions to be met
    top_restaurants = rating_stats[(rating_stats['rating_count'] >= min_rating_count) & (rating_stats['avg_rating'] >= min_avg_rating)]

    return top_restaurants

# top restaurants based on the criteria
top_restaurants = top_restaurants_by_ratings(df, min_rating_count=50, min_avg_rating=4)
print(top_restaurants)

               restaurant_name  rating_count  avg_rating
20   Blue Ribbon Fried Chicken            64    4.328125
21           Blue Ribbon Sushi            73    4.219178
136                Shake Shack           133    4.278195
153          The Meatball Shop            84    4.511905

In [49]:

top_restaurants.head()

Out[49]:

	restaurant_name	rating_count	avg_rating
20	Blue Ribbon Fried Chicken	64	4.328125
21	Blue Ribbon Sushi	73	4.219178
136	Shake Shack	133	4.278195
153	The Meatball Shop	84	4.511905

The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders.¶

In [50]:

def top_restaurants_by_ratings(df, min_rating_count=50, min_avg_rating=4):
    #"Not given" ratings and convert ratings to numeric values
    df_filtered = df[df['rating'] != 'Not given'].copy()
    df_filtered['rating'] = pd.to_numeric(df_filtered['rating'])

    # restaurant_name and calculate rating count and average rating
    rating_stats = df_filtered.groupby('restaurant_name')['rating'].agg(['count', 'mean']).reset_index()
    rating_stats.columns = ['restaurant_name', 'rating_count', 'avg_rating']

    #  based on the conditions
    top_restaurants = rating_stats[(rating_stats['rating_count'] >= min_rating_count) & (rating_stats['avg_rating'] >= min_avg_rating)]

    return top_restaurants

# top restaurants based on the criteria
top_restaurants = top_restaurants_by_ratings(df, min_rating_count=50, min_avg_rating=4)
top_restaurant_orders = df[df['restaurant_name'].isin(top_restaurants['restaurant_name'])]
def calculate_advertising_charges(df):
    # Calculate the advertising charge based on the cost_of_the_order
    df['advertising_charge'] = df['cost_of_the_order'].apply(lambda x: 0.25 * x if x > 20 else (0.15 * x if x > 5 else 0))
    return df

#Advertising charges to the orders from top restaurants
top_restaurant_orders_with_charges = calculate_advertising_charges(top_restaurant_orders)

print(top_restaurant_orders_with_charges)

      order_id  customer_id            restaurant_name cuisine_type  \
3      1477334       106968  Blue Ribbon Fried Chicken     American   
6      1477894       157711          The Meatball Shop      Italian   
12     1476966       129969  Blue Ribbon Fried Chicken     American   
15     1477414        66222                Shake Shack     American   
19     1477354        67487          Blue Ribbon Sushi     Japanese   
...        ...          ...                        ...          ...   
1887   1476873       237616                Shake Shack     American   
1888   1477353       106324          The Meatball Shop      Italian   
1891   1476981       138586                Shake Shack     American   
1895   1477819        35309          Blue Ribbon Sushi     Japanese   
1897   1478056       120353          Blue Ribbon Sushi     Japanese   

      cost_of_the_order day_of_the_week  rating  food_preparation_time  \
3                 29.20         Weekend     3.0                     25   
6                  6.07         Weekend     NaN                     28   
12                24.30         Weekend     5.0                     23   
15                16.20         Weekend     5.0                     33   
19                16.20         Weekend     4.0                     35   
...                 ...             ...     ...                    ...   
1887               5.82         Weekend     NaN                     26   
1888              16.20         Weekend     5.0                     21   
1891               5.82         Weekend     NaN                     22   
1895              25.22         Weekday     NaN                     31   
1897              19.45         Weekend     NaN                     28   

      delivery_time  advertising_charge  
3                15              7.3000  
6                21              0.9105  
12               17              6.0750  
15               25              2.4300  
19               26              2.4300  
...             ...                 ...  
1887             30              0.8730  
1888             26              2.4300  
1891             28              0.8730  
1895             24              6.3050  
1897             24              2.9175  

[566 rows x 10 columns]

C:\Users\Brene\AppData\Local\Temp\ipykernel_16032\2839648762.py:20: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['advertising_charge'] = df['cost_of_the_order'].apply(lambda x: 0.25 * x if x > 20 else (0.15 * x if x > 5 else 0))

In [51]:

top_restaurant_orders_with_charges.head()

Out[51]:

	order_id	customer_id	restaurant_name	cuisine_type	cost_of_the_order	day_of_the_week	rating	food_preparation_time	delivery_time	advertising_charge
3	1477334	106968	Blue Ribbon Fried Chicken	American	29.20	Weekend	3.0	25	15	7.3000
6	1477894	157711	The Meatball Shop	Italian	6.07	Weekend	NaN	28	21	0.9105
12	1476966	129969	Blue Ribbon Fried Chicken	American	24.30	Weekend	5.0	23	17	6.0750
15	1477414	66222	Shake Shack	American	16.20	Weekend	5.0	33	25	2.4300
19	1477354	67487	Blue Ribbon Sushi	Japanese	16.20	Weekend	4.0	35	26	2.4300

In [52]:

#Revenue generated across all orders 
top_restaurant_orders_with_charges['advertising_charge'].sum()

Out[52]:

1824.0994999999998

Advert Charges stands at 1824 dollars

In [53]:

# Add a new column 'total_time' which is the sum of 'food_preparation_time' and 'delivery_time'
df['total_time'] = df['food_preparation_time'] + df['delivery_time']

def percentage_of_long_orders(df, threshold=60):
    # %tage of orders with 'total_time' >= threshold
    total_orders = len(df)
    long_orders = len(df[df['total_time'] >= threshold])
    percentage = (long_orders / total_orders) * 100
    return percentage

# %tage of orders that take 60 minutes or more
percentage_60_plus = percentage_of_long_orders(df, 60)
print(f"Percentage of orders that take 60 minutes or more: {percentage_60_plus:.2f}%")

Percentage of orders that take 60 minutes or more: 12.91%

The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends?

In [54]:

average_delivery = df.groupby('day_of_the_week')['delivery_time'].mean()

In [55]:

print(average_delivery)

day_of_the_week
Weekday    28.340037
Weekend    22.470022
Name: delivery_time, dtype: float64

Based on the analysis, we can derive several conclusions and recommendations to help improve the business:

Conclusions¶

Order Distribution by Cuisine¶

The majority of orders come from American, Japanese, and Italian cuisines. Lesser popular cuisines include Vietnamese, Spanish, and Korean.

Average Order Cost¶

Higher average order costs are associated with cuisines liThai, Southern and Frenchean. Lower average order costs are seen in cuisines liVietnamese, Mediterranean, and Koreanern.

Order Frequency by Day of the Week¶

The distribution of orders varies across weekdays and weekendCe cuisines might have higher demand on specific days (e.g., weekends might see more orders foverall of cuisine).

Food Preparation and Delivery Times¶

Average food preparation and delivery times differ by cuisine. Some cuisines have longer preparation and delivery times which might affect customer satisfaction.

Order Time Distribution¶

A significant percentage of orders take more th5n 60 minutes, which could impact customer satisfaction.

Top Customers¶

Identifying top customers and their preferences helps in personalizing marketing efforts. The top customers contribute significantly to revenue and order freque ncy.

Top Restaurants¶

Restaurants with high average ratings and frequent orders can be promoted to maintain and enhance customer loyalty.

Advertising Charges¶

Implementing advertising charges based on order costs could provide an additional revenue stream.

Recommendations¶

Focus on Popular Cuisines¶

Invest in marketing and promotions for the most popular cuisines (American, Japanese, Italian) to leverage their high demand. Introduce special offers or discounts to boost orders for less popular cuisines.

Improve Preparation and Delivery Efficiency¶

Analyze the preparation and delivery processes for cuisines with longer times to identify bottlenecks. Implement measures to streamline operations and reduce total time to enhance customer satisfaction.

Dynamic Pricing and Discounts¶

Implement dynamic pricing strategies based on demand patterns observed on different days of the week. Offer discounts on slower days to balance the order distribution.

Enhance Customer Experience¶

Focus on reducing the percentage of orders taking more than 60 minutes by optimizing logistics and staff management. Regularly gather customer feedback to identify pain points and areas for improvement.

Personalized Marketing¶

Use insights from top customers to personalize marketing campaigns and offers. Encourage repeat orders from top customers through loyalty programs and exclusive deals.

Promote High-Rated Restaurants¶

Highlight and promote restaurants with high average ratings and frequent orders in different marketing materials. Collaborate with these restaurants for special events like workshops/seminars or promotions.