• Predicting blog comments with Machine Learning methods

    PySpark - Random Forest - Gradient Boosting

  • PROJECT DESCRIPTION

    In this project we tried to predict the number of comments that a blog post receives, based on features of the post. The data come from the UCI Machine Learning Repository and were originally used in a paper by Krisztian Buza (2014): Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152).

     

    We tried different Machine Learning methods to see which one produced the best results in terms of root mean squared error (RMSE, lower is better):

    • Linear Regression produced the weakest results, with RMSE = 30.304.
    • The Decision Tree regressor produced better results than Linear Regression, with RMSE = 23.9206.
    • The Random Forest algorithm produced the best results, with RMSE = 23.2699.
    • The Gradient Boosting algorithm produced comparable results, with RMSE = 23.7586, slightly behind the Random Forest.
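
    The comparison above rests on a single metric, so a short sketch may help make it concrete. The snippet below is not the project's code: the function name `rmse` and the dictionary `reported_rmse` are hypothetical names chosen for illustration, and plain Python is used instead of PySpark so that it runs without a cluster. It shows how RMSE is computed from actual and predicted values, then ranks the four models using only the scores reported above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one observation")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# RMSE values copied from the results list above (not recomputed here).
reported_rmse = {
    "Linear Regression": 30.304,
    "Decision Tree": 23.9206,
    "Random Forest": 23.2699,
    "Gradient Boosting": 23.7586,
}

# Lower RMSE is better, so sort in ascending order.
ranking = sorted(reported_rmse, key=reported_rmse.get)

baseline = reported_rmse["Linear Regression"]
for name in ranking:
    score = reported_rmse[name]
    reduction = 100 * (baseline - score) / baseline
    print(f"{name}: RMSE = {score:.4f} ({reduction:.1f}% below Linear Regression)")
```

    Because RMSE squares each error before averaging, a few posts with very large comment counts can dominate the score, which is worth keeping in mind when reading the gap between the linear model and the tree-based ones.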

    THEMES

    Apache Spark, Python, PySpark, Machine Learning, Linear Regression, Decision Tree, Random Forest, Gradient Boosting

    CODE

    Code in Python is available upon request.