
Predicting blog comments with Machine Learning methods
PySpark - Random Forest - Gradient Boosting
PROJECT DESCRIPTION
In this project we tried to predict the number of comments that a blog post receives, based on features of the post. We obtained the data from the UCI Machine Learning Repository. The data were originally used in a paper by Krisztian Buza (2014): Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152).
We tried different Machine Learning methods to see which one produced the best results in terms of RMSE (root mean squared error; lower is better):
- Linear Regression performed worst, with RMSE = 30.304.
- The Decision Tree regressor produced better results than Linear Regression, with RMSE = 23.9206.
- The Random Forest algorithm produced the best results, with RMSE = 23.2699.
- The Gradient Boosting algorithm produced comparable results, with RMSE = 23.7586: slightly better than the single Decision Tree, slightly worse than Random Forest.
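For readers unfamiliar with the metric, the comparison above can be sketched in plain Python. This is only an illustration, not the project's PySpark code: the `rmse` helper and the sample comment counts are hypothetical, and only the four scores in `reported_rmse` come from the results listed above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical comment counts for four posts (NOT taken from the dataset).
actual_comments = [0, 2, 10, 40]
predicted_comments = [1, 2, 7, 30]
print("Example RMSE:", rmse(actual_comments, predicted_comments))

# RMSE values reported in this project, ranked from best (lowest) to worst.
reported_rmse = {
    "Linear Regression": 30.304,
    "Decision Tree": 23.9206,
    "Random Forest": 23.2699,
    "Gradient Boosting": 23.7586,
}
ranking = sorted(reported_rmse, key=reported_rmse.get)
for name in ranking:
    print(f"{name}: {reported_rmse[name]}")
```

Because the errors are squared before averaging, a few posts with very large comment counts can dominate the score, which is worth keeping in mind when comparing models on this kind of data.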
THEMES
Apache Spark, Python, PySpark, Machine Learning, Linear Regression, Decision Tree, Random Forest, Gradient Boosting
CODE
Code in Python is available upon request.
Sotiris Baratsas © 2022. All rights reserved.