Analysis of Brazilian E-Commerce Datasets -- Olist

Owen
8 min readJul 11, 2021

--

Source: https://blog.olist.com/nova-identidade-do-olist/

Introduction

This is a mid-term project of R language course in NSYSU -- visual analysis of datasets, problems identify and problems solving . First, introduce Olist company and the datasets. And conduct EDA (exploratory data analysis) to understand Olist’s operating and identify problems. Final, offer the solutions to the problems and benefits evaluation. The R language is used for data analysis and diagrams in this project.

Content

1. What is Olist?

2. Dataset

2.1. Daraset characteristic

2.2.Variable

3. Before analysising

4. Analysising

4.1. Motivation

4.2. Exploratory data analysis

4.3. Solution proposal

4.4. Benefit

5. Conclusion

6. Reference

1. What is Olist?

Source: https://www.kaggle.com/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs

Olist is a Brazilian departmental store that operates in e-commerce segment, but It operates as a SaaS (Software as a Service) technology company since 2015. It offers a e-commerce marketplace solution to retailers of all sizes to increase their sales.

Olist’s solution consists of three aspects: Software, Contracts with the main marketplaces and Reputation sharing. Olist’s Business Model diagram as below shows how Olist links marketplaces, consumers, and retailers.

2. Datasets

2.1. Daraset characteristic

Source: https://www.kaggle.com/olistbr/brazilian-ecommerce

On Kaggle, its data of 100k orders from September 2016 to August 2018. And each of dataset saved information about orders, products, consumers, and sellers. The data model has been described in image below and it is organised and normalised for each category.

Source: https://www.kaggle.com/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs

2.2 . Variable

Orders

order_id:Order ID number

customer_id: Customer ID number

order_status:Status of order process

order_purchase_timestamp:Time of the product is ordered

order_approved_at:Time of payment for the order

order_delivered_carrier_date:Time of notifying the logistics in order process

order_delivered_customer_date:Time of the product arrived at the consumer

order_estimated_delivery_date:Estimated arrival time of the order

Items

order_item_id:The number of the product in the order (ex:If there are three items in the order, the order_item_id is 1, 2, 3)

product_id:Product ID number

seller_id :Seller ID number

shipping_limit_date:The deadline for the seller deliver the goods to the logistics

price:Product price

freight_value:delivery fee

Reviews

review_id:The comment ID number

review_score:Customer satisfaction

review_comment_title:Comment title

review_comment_message:Comment message

review_creation_date:Time of inviteing customer make a comment

review_answer_timestamp:Time of customer finish the comment

Products

product_category_name:Product category

product_name_lenght:Number of words in product name

product_description_lenght:Number of words in product description

product_photos_qty:Number of product photo

product_weight_g:Weight of product(gram)

product_length_cm:Length of product (centimeter)

product_height_cm:Height of product(centimeter)

product_width_cm:Width of product(centimeter)

Customers

customer_unique_id:customer ID number

customer_zip_code_prefix:Zip code of customer

customer_city:City of customer

customer_state:State of customer

seller_zip_code_prefix:Zip code of customer

seller_city:City of seller

seller_state:State of seller

3. Before analysing

Check datasets have any missing value and repeat value, and remove them.

order_delivered_carrier_date、order_delivered_customer_date

These two variable has some missing value, infer that some order be canceled lead to order process be interupt.

review_comment_title、review_comment_message

These two variable has a lot missing value, infer that Olist does not enforce customers make the comment.

In order to analysis, 8 data sets are merged into a main data set according to the order id. Each data (row) represents a transaction information. If there are two rows (or more) with same oreder_id, it means that the two transactions belong to the same order.

Source: https://medium.com/@albert90162/巴西電商olist-資料集分析-上-探索性資料分析-85d376f6af43

4. Analysising

4.1. Motivation

Number of purchases

In these 100,000 transaction record, customers seldom ordered than twice, infer that the datasets maybe is random sampling, not the complete trasaction record. Thus, it is not easy realize real customers’ situation. Moreover, other analysts have made a detailed analysis of the purchase time, distribution area, sales profile, and customer reviews. So, this analysis will focus on the delivery part because the dataset of order delivery is complete compared with other datasets.

Wordcloud

This is a wordcloud made from the customer review. Time, delivery, and arrived are customers care. Thus, delivery service is very important to Olist customers.

4.2. Exploratory and analysis datasets

Delivery time and product category

The payment time is roughly the same. The stock up time is the time from platform approve order to the seller delivery the goods. It can be inferred that office furniture need assembly or customized production, the stocking time is the longest. If Seller can offer disassembled furniture — reducing assembly time and quickly shipping (such as IKEA’s product).

All category of goods, the delivery time is the longest of the total time, and the delivery time has no corelation with category

Correlation coefficient matrix

Correlation coefficient between delivery time (t3) and delay days is 0.57. If the delivery time is longer, the goods ordered by the customer are more likely to be delayed in arrival.

And the factors that affect shipping costs, the correlation coefficient between weight and freight is 0.61, it means that heavier the weight, higher the freight. The correlation coefficient between volume and freight is 0.58, so the freight will increase if the volume is large. In addition, delivery time is also slightly negatively correlated with customer ratings.

The relationship between delivery time and customer ratings

From the figure, the delivery time (t3) is negatively correlated with the rating.

In the box chart where customers scored one, 50% of customers received a longer delivery time, confirming that the longer the delivery time, the lower the score.

In addition, we also found that most people (75%) who scored three believed that delivery time should be within 14 days.

Therefore, we use 14 days as the benchmark. If over 14 delivery days, it means that exceeding the standard.

Distribution of delivery time

Most of the delivery time for goods ordered is within 9 days, and there are 15.1% orders are more than 14 days.

Average delivery time per month

The delay situation were up in April 2017 and February to March 2018 because there were largest strikes in Brazil In the periors. In addition to the strikes, delivery time is also susceptible by major festivals, so in December you can see a significant increase in delivery time. If Olist have own logistics center, it will be able to control the delivery time.

4.3. Solution proposal

It is going to solve the problem of delivery time, improve the delay situation and customer satisfaction. First analyze the delivery time of each state to determind the logistics center is located. And which products are placed in the logistics center. Finally, estimate the performance of build the logistics center, including the time saved and freight.

Location distribution of customers and sellers

The customers of olist are mostly located in the eastern of Brazil; while the sellers are mainly concentrated in the southeast. Therefore, the logistics center will serve customers in the eastern.

Average delivery time and revenue of each state

Most of the transactions are concentrated in the southeast, the distance of delivery is short, so the average delivery time in the southeast is shorter than other regions.

The red area on the right image shows a high revenue, indicating that Olist’s revenue is concentrated in the southeast.

Average delivery time of the top 10 states by transaction

The delivery time in the red area is significant long, such as SC-BA, PE-RS, RS-PE, GO-PE.

Routes with an average delivery time of more than two weeks

Most of the routes are the north-south transportation routes because they are too long. There are BA, MG, GO, and DF in the middle of these route. In order to improve the delivery time of these routes, the suggesting is build the logistics center.

Orders passing through BA have high chance be delayed. In addition, BA is dominated by agriculture and tourism as its main sources of income so the cost of acquiring land is relatively cheap. Therefore, setting up a logistics center in BA(Bahia) will help the transportation in eastern Brazil.

Revenue, number of orders, and freight for each category

Goods with high orders and revenues from 2016 to 2018 are in the upper right corner, such as beauty, bathroom products, sports goods, etc.

Sales quantity by category in each state

Among the 10 routes that the logistics center is responsible for, the most popular product category is roughly the same as the hot-selling product in the previous picture. Therefore, recommend that Olist should put these commodities in the logistics center.

4.4. Benefit

Logistics center’s benefit — delivery time

The delivery time of each route has a significant reduction. On average, it can save 5 days of delivery time. And all goods will arrive within two weeks.

Logistics center’s benefit — freight

The freight for each route has been reduced, it can save 40% of the freight.

5. Conclusion

This project is my first time participated in data analysis. I learned that programming language is very important, but it’s even more important to be able to make the best decision based on the analyzed result information. This project was made by our team in this course. In order to commemorate this hard work, so specially organized into this article.

If there are anything I can improve, please feel free to let me know.

Thank you.

6. Reference

R Language, Probability & Business

https://bap.cm.nsysu.edu.tw/

Brazilian E-Commerce Public Dataset by Olist

https://www.kaggle.com/olistbr/brazilian-ecommerce

Olist eCommerce-Analytics, Quasi Poisson+Poly Regs

https://www.kaggle.com/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs

Olist-巴西電商訂單資料分析 (上) — 探索性資料分析

https://medium.com/@albert90162/巴西電商olist-資料集分析-上-探索性資料分析-85d376f6af43

Contact

✉️ Email: owenhsu94@gmail.com

Linkedin: https://www.linkedin.com/in/wen-wei-hsu-548000153/

--

--