Introduction
This is a mid-term project of R language course in NSYSU -- visual analysis of datasets, problems identify and problems solving . First, introduce Olist company and the datasets. And conduct EDA (exploratory data analysis) to understand Olist’s operating and identify problems. Final, offer the solutions to the problems and benefits evaluation. The R language is used for data analysis and diagrams in this project.
Content
1. What is Olist?
2. Dataset
2.1. Daraset characteristic
2.2.Variable
3. Before analysising
4. Analysising
4.1. Motivation
4.2. Exploratory data analysis
4.3. Solution proposal
4.4. Benefit
5. Conclusion
6. Reference
1. What is Olist?
Olist is a Brazilian departmental store that operates in e-commerce segment, but It operates as a SaaS (Software as a Service) technology company since 2015. It offers a e-commerce marketplace solution to retailers of all sizes to increase their sales.
Olist’s solution consists of three aspects: Software, Contracts with the main marketplaces and Reputation sharing. Olist’s Business Model diagram as below shows how Olist links marketplaces, consumers, and retailers.
2. Datasets
2.1. Daraset characteristic
On Kaggle, its data of 100k orders from September 2016 to August 2018. And each of dataset saved information about orders, products, consumers, and sellers. The data model has been described in image below and it is organised and normalised for each category.
Source: https://www.kaggle.com/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs
2.2 . Variable
Orders
order_id:Order ID number
customer_id: Customer ID number
order_status:Status of order process
order_purchase_timestamp:Time of the product is ordered
order_approved_at:Time of payment for the order
order_delivered_carrier_date:Time of notifying the logistics in order process
order_delivered_customer_date:Time of the product arrived at the consumer
order_estimated_delivery_date:Estimated arrival time of the order
Items
order_item_id:The number of the product in the order (ex:If there are three items in the order, the order_item_id is 1, 2, 3)
product_id:Product ID number
seller_id :Seller ID number
shipping_limit_date:The deadline for the seller deliver the goods to the logistics
price:Product price
freight_value:delivery fee
Reviews
review_id:The comment ID number
review_score:Customer satisfaction
review_comment_title:Comment title
review_comment_message:Comment message
review_creation_date:Time of inviteing customer make a comment
review_answer_timestamp:Time of customer finish the comment
Products
product_category_name:Product category
product_name_lenght:Number of words in product name
product_description_lenght:Number of words in product description
product_photos_qty:Number of product photo
product_weight_g:Weight of product(gram)
product_length_cm:Length of product (centimeter)
product_height_cm:Height of product(centimeter)
product_width_cm:Width of product(centimeter)
Customers
customer_unique_id:customer ID number
customer_zip_code_prefix:Zip code of customer
customer_city:City of customer
customer_state:State of customer
seller_zip_code_prefix:Zip code of customer
seller_city:City of seller
seller_state:State of seller
3. Before analysing
Check datasets have any missing value and repeat value, and remove them.
order_delivered_carrier_date、order_delivered_customer_date
These two variable has some missing value, infer that some order be canceled lead to order process be interupt.
review_comment_title、review_comment_message
These two variable has a lot missing value, infer that Olist does not enforce customers make the comment.
In order to analysis, 8 data sets are merged into a main data set according to the order id. Each data (row) represents a transaction information. If there are two rows (or more) with same oreder_id, it means that the two transactions belong to the same order.
Source: https://medium.com/@albert90162/巴西電商olist-資料集分析-上-探索性資料分析-85d376f6af43
4. Analysising
4.1. Motivation
Number of purchases
In these 100,000 transaction record, customers seldom ordered than twice, infer that the datasets maybe is random sampling, not the complete trasaction record. Thus, it is not easy realize real customers’ situation. Moreover, other analysts have made a detailed analysis of the purchase time, distribution area, sales profile, and customer reviews. So, this analysis will focus on the delivery part because the dataset of order delivery is complete compared with other datasets.
Wordcloud
This is a wordcloud made from the customer review. Time, delivery, and arrived are customers care. Thus, delivery service is very important to Olist customers.
4.2. Exploratory and analysis datasets
Delivery time and product category
The payment time is roughly the same. The stock up time is the time from platform approve order to the seller delivery the goods. It can be inferred that office furniture need assembly or customized production, the stocking time is the longest. If Seller can offer disassembled furniture — reducing assembly time and quickly shipping (such as IKEA’s product).
All category of goods, the delivery time is the longest of the total time, and the delivery time has no corelation with category
Correlation coefficient matrix
Correlation coefficient between delivery time (t3) and delay days is 0.57. If the delivery time is longer, the goods ordered by the customer are more likely to be delayed in arrival.
And the factors that affect shipping costs, the correlation coefficient between weight and freight is 0.61, it means that heavier the weight, higher the freight. The correlation coefficient between volume and freight is 0.58, so the freight will increase if the volume is large. In addition, delivery time is also slightly negatively correlated with customer ratings.
The relationship between delivery time and customer ratings
From the figure, the delivery time (t3) is negatively correlated with the rating.
In the box chart where customers scored one, 50% of customers received a longer delivery time, confirming that the longer the delivery time, the lower the score.
In addition, we also found that most people (75%) who scored three believed that delivery time should be within 14 days.
Therefore, we use 14 days as the benchmark. If over 14 delivery days, it means that exceeding the standard.
Distribution of delivery time
Most of the delivery time for goods ordered is within 9 days, and there are 15.1% orders are more than 14 days.
Average delivery time per month
The delay situation were up in April 2017 and February to March 2018 because there were largest strikes in Brazil In the periors. In addition to the strikes, delivery time is also susceptible by major festivals, so in December you can see a significant increase in delivery time. If Olist have own logistics center, it will be able to control the delivery time.
4.3. Solution proposal
It is going to solve the problem of delivery time, improve the delay situation and customer satisfaction. First analyze the delivery time of each state to determind the logistics center is located. And which products are placed in the logistics center. Finally, estimate the performance of build the logistics center, including the time saved and freight.
Location distribution of customers and sellers
The customers of olist are mostly located in the eastern of Brazil; while the sellers are mainly concentrated in the southeast. Therefore, the logistics center will serve customers in the eastern.
Average delivery time and revenue of each state
Most of the transactions are concentrated in the southeast, the distance of delivery is short, so the average delivery time in the southeast is shorter than other regions.
The red area on the right image shows a high revenue, indicating that Olist’s revenue is concentrated in the southeast.
Average delivery time of the top 10 states by transaction
The delivery time in the red area is significant long, such as SC-BA, PE-RS, RS-PE, GO-PE.
Routes with an average delivery time of more than two weeks
Most of the routes are the north-south transportation routes because they are too long. There are BA, MG, GO, and DF in the middle of these route. In order to improve the delivery time of these routes, the suggesting is build the logistics center.
Orders passing through BA have high chance be delayed. In addition, BA is dominated by agriculture and tourism as its main sources of income so the cost of acquiring land is relatively cheap. Therefore, setting up a logistics center in BA(Bahia) will help the transportation in eastern Brazil.
Revenue, number of orders, and freight for each category
Goods with high orders and revenues from 2016 to 2018 are in the upper right corner, such as beauty, bathroom products, sports goods, etc.
Sales quantity by category in each state
Among the 10 routes that the logistics center is responsible for, the most popular product category is roughly the same as the hot-selling product in the previous picture. Therefore, recommend that Olist should put these commodities in the logistics center.
4.4. Benefit
Logistics center’s benefit — delivery time
The delivery time of each route has a significant reduction. On average, it can save 5 days of delivery time. And all goods will arrive within two weeks.
Logistics center’s benefit — freight
The freight for each route has been reduced, it can save 40% of the freight.
5. Conclusion
This project is my first time participated in data analysis. I learned that programming language is very important, but it’s even more important to be able to make the best decision based on the analyzed result information. This project was made by our team in this course. In order to commemorate this hard work, so specially organized into this article.
If there are anything I can improve, please feel free to let me know.
Thank you.
6. Reference
R Language, Probability & Business
Brazilian E-Commerce Public Dataset by Olist
https://www.kaggle.com/olistbr/brazilian-ecommerce
Olist eCommerce-Analytics, Quasi Poisson+Poly Regs
https://www.kaggle.com/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs
Olist-巴西電商訂單資料分析 (上) — 探索性資料分析
https://medium.com/@albert90162/巴西電商olist-資料集分析-上-探索性資料分析-85d376f6af43