
Data Warehousing for an e-commerce website
Using Apache Hive
PROJECT DESCRIPTION
The purpose of this project was to get an understanding of how we can use Apache Hive (through Cloudera) to build a Data Warehouse for an e-commerce website. Our job was to apply transformations and analysis tasks over the data generated through customer orders. Based on the nature of the data, we initially structured the data into a simple star schema, and then applied further analysis.
Sub-tasks of this project included:
- Executing the necessary commands in order to create the correct folder structure within the HDFS folder in cloudera.
- Loading the data in the corresponding HDFS folders and making them accessible through Hive.
- Creating the corresponding database, tables and “loading” statements.
- Creating staging tables and loading the data from external tables to pre-defined staging tables, while converting the columns into the correct data type.
- Building surrogate keys for specific dimensions and storing them in tables
- Creating/populating specific dimensions (e.g. time/date dimensions) for better data analysis
THEMES
Apache Hive, Cloudera, Data Warehousing
DATASET INFO
The dataset consisted of orders made in an e-commerce website and their details.
The main orders file is structured as follows:
Field
Description
ORDER_NUMBER
Order unique identifier
ORDER_DATE
Date of Order
SHIPPED_DATE
Date the order shipped
STATUS
Order status
COMMENTS
Comments over the order
CUSTOMER_NUMBER
Customer unique identifier
CUSTOMER_NAME
Customer name
CUST_CITY
City of customer
CUST_STATE
State of customer
CUST_COUNTRY
Country of customer
CUST_COUNTRY_ISO
ISO code of customer country
SALES_CURRENCY
Currency of sales price
SALES_REP_ID
Sales representative id
SALES_REP_FIRSTNAME
First name of Sales representative
SALES_REP_LASTNAME
Surname of Sales representative
OFFICE_CODE
Code of office the sales representative belongs to
REPORTING_PATH
Sales representative reporting path
OFFICE_CITY
City of office
OFFICE_STATE
State of office
OFFICE_TERRITORY
Territory of office
OFFICE_COUNTRY
Office country
There is also a companion file, with additional details about the orders:Field
ORDER_NUMBER
Description
Order unique identifier
PRODUCT_CODE
Unique identifier of the product
PRODUCT_NAME
Name of the product
PRODUCT_CATEGORY
Category of the product
PRODUCT_VENDOR
Vendor of the product
QUANTITY_IN_STOCK
Quantity in stock
BUY_PRICE
Purchase price, per item
QUANTITY_ORDERED
Number of items ordered, for the specific product
UNIT_PRICE
Price of the product (per item)
CODE
Code in Hive Query Language are available upon request
COLLABORATORS
Sotiris Baratsas © 2022. All rights reserved.