• Data Warehousing for an e-commerce website

    Using Apache Hive

  • PROJECT DESCRIPTION

    The purpose of this project was to get an understanding of how we can use Apache Hive (through Cloudera) to build a Data Warehouse for an e-commerce website. Our job was to apply transformations and analysis tasks over the data generated through customer orders. Based on the nature of the data, we initially structured the data into a simple star schema, and then applied further analysis.

     

    Sub-tasks of this project included:

    • Executing the necessary commands in order to create the correct folder structure within the HDFS folder in cloudera.
    • Loading the data in the corresponding HDFS folders and making them accessible through Hive.
    • Creating the corresponding database, tables and “loading” statements. 
    • Creating staging tables and loading the data from external tables to pre-defined staging tables, while converting the columns into the correct data type.
    • Building surrogate keys for specific dimensions and storing them in tables
    • Creating/populating specific dimensions (e.g. time/date dimensions) for better data analysis

    THEMES

    Apache Hive, Cloudera, Data Warehousing

    DATASET INFO

    The dataset consisted of orders made in an e-commerce website and their details.

     

    The main orders file is structured as follows:

     

    Field

    Description

    ORDER_NUMBER

    Order unique identifier

    ORDER_DATE

    Date of Order

    SHIPPED_DATE

    Date the order shipped

    STATUS

    Order status

    COMMENTS

    Comments over the order

    CUSTOMER_NUMBER

    Customer unique identifier

    CUSTOMER_NAME

    Customer name

    CUST_CITY

    City of customer

    CUST_STATE

    State of customer

    CUST_COUNTRY

    Country of customer

    CUST_COUNTRY_ISO

    ISO code of customer country

    SALES_CURRENCY

    Currency of sales price

    SALES_REP_ID

    Sales representative id

    SALES_REP_FIRSTNAME

    First name of Sales representative

    SALES_REP_LASTNAME

    Surname of Sales representative

    OFFICE_CODE

    Code of office the sales representative belongs to

    REPORTING_PATH

    Sales representative reporting path

    OFFICE_CITY

    City of office

    OFFICE_STATE

    State of office

    OFFICE_TERRITORY

    Territory of office

    OFFICE_COUNTRY

    Office country

     

    There is also a companion file, with additional details about the orders:

     

    Field

    ORDER_NUMBER

    Description

    Order unique identifier

    PRODUCT_CODE

    Unique identifier of the product

    PRODUCT_NAME

    Name of the product

    PRODUCT_CATEGORY

    Category of the product

    PRODUCT_VENDOR

    Vendor of the product

    QUANTITY_IN_STOCK

    Quantity in stock

    BUY_PRICE

    Purchase price, per item

    QUANTITY_ORDERED

    Number of items ordered, for the specific product

    UNIT_PRICE

    Price of the product (per item)

    CODE

    Code in Hive Query Language are available upon request

    COLLABORATORS