Overview
Multiple Linear Regression is a powerful statistical technique used for predicting the value of a dependent variable based on multiple independent variables. It allows you to take the information carried by the features in a dataset and learn a weight for each one, indicating how much that feature matters when calculating a prediction.
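As a rough illustration of the idea (not taken from the notebook), a prediction is just the weighted sum of the feature values plus an intercept. The numbers below are entirely made up:

```python
import numpy as np

# Purely illustrative values for one car with three features (e.g. year, mileage, engineSize)
x = np.array([2017.0, 35000.0, 1.6])   # feature values
w = np.array([0.9, -0.0001, 2.5])      # learned weights, one per feature
b = -1800.0                            # learned intercept

# The prediction is simply the weighted sum of the features plus the intercept
price_prediction = np.dot(w, x) + b
print(price_prediction)
```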
In this project, I analyzed a dataset (publicly available on Kaggle) containing details about used Volkswagen cars (model, year, price, transmission, mileage, fuelType, tax, mpg, engineSize) and successfully created a model that provides used car price predictions. In order to ensure a fair estimate of the model's accuracy, I followed standard Machine Learning procedures, which included splitting the original dataset into three parts: a training set, a cross-validation set, and a test set. The test set was not used by any of the models except the final model that was chosen.
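A minimal sketch of that split, assuming scikit-learn, a pandas DataFrame loaded from an assumed file name, and illustrative proportions (not necessarily those used in the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("vw.csv")             # assumed filename for the Kaggle dataset
X = df.drop(columns=["price"])
y = df["price"]

# First carve off a test set, then split the remainder into training and
# cross-validation sets (roughly 60% / 20% / 20% -- illustrative proportions).
X_train_cv, X_test, y_train_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_cv, y_train_cv, test_size=0.25, random_state=42)
```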
Feature Engineering
Feature engineering is a core part of this project. I took the raw features found in the data and engineered new features from them, which helped fit a curve that closely matches the actual prices found in the dataset. Some techniques used include (see the sketch after this list):
- Creating polynomial features from numerical data.
- Using one-hot encoding to create a new set of features based on categorical data (transmission, fuelType, etc.).
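A sketch of both steps, continuing from the DataFrame `df` loaded above and assuming pandas and scikit-learn; the chosen columns and polynomial degree are illustrative rather than the exact ones used in the notebook:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# One-hot encode the categorical columns (column names as they appear in the dataset)
df_encoded = pd.get_dummies(df, columns=["transmission", "fuelType", "model"])

# Add degree-2 polynomial features built from the numerical columns
numeric_cols = ["year", "mileage", "tax", "mpg", "engineSize"]
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_values = poly.fit_transform(df_encoded[numeric_cols])
poly_df = pd.DataFrame(poly_values,
                       columns=poly.get_feature_names_out(numeric_cols),
                       index=df_encoded.index)

# Combine the polynomial features with the one-hot encoded columns
df_features = pd.concat([df_encoded.drop(columns=numeric_cols), poly_df], axis=1)
```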
Accuracy
Concepts such as bias and variance are discussed, and I explored how to minimize both so that predictions on the training and cross-validation sets are closely aligned and accurate. To quantify how well each model performs, I measured its error using MSE (Mean Squared Error).
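A minimal sketch of that comparison, reusing the hypothetical split from earlier and assuming scikit-learn (the notebook may compute MSE differently):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)

# A large gap between the two errors suggests high variance (overfitting);
# a high error on both suggests high bias (underfitting).
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_cv = mean_squared_error(y_cv, model.predict(X_cv))
print(f"train MSE: {mse_train:.2f}  cross-validation MSE: {mse_cv:.2f}")
```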
Towards the end of the notebook, many additional features were added to a new model, causing high variance; in other words, the model began to overfit the data. View the notebook to see how, through the use of regularization, I was able to address this issue and arrive at a final model that does a great job of predicting used car prices.
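One common way to apply regularization in this setting is ridge (L2) regression; the snippet below is a sketch of that idea under the same assumed setup as above, not necessarily the exact approach taken in the notebook:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# alpha controls the regularization strength: larger values shrink the weights
# more aggressively, trading a little bias for a large reduction in variance.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X_train, y_train)

mse_cv = mean_squared_error(y_cv, ridge_model.predict(X_cv))
print(f"cross-validation MSE with regularization: {mse_cv:.2f}")
```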
The final model has an MSE of 2.87, down from 20.4 for the original (albeit simple) model. These MSE values represent the average squared difference between the predicted and actual car prices, with prices measured in thousands of pounds. An MSE of 20.4 means the simple model's predictions were, on average, about £4.5K off the actual car prices (the square root of 20.4). After refining the model and applying regularization to minimize overfitting, the MSE dropped to 2.87, or roughly £1.7K off the actual prices, indicating a significant improvement in the prediction accuracy of the model.
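The conversion from MSE back to a typical price error is just a square root, since prices are in thousands of pounds:

```python
import math

# Square root of the MSE gives the typical prediction error in thousands of pounds
print(math.sqrt(20.4))   # ~4.52 -> roughly £4.5K off, for the initial simple model
print(math.sqrt(2.87))   # ~1.69 -> roughly £1.7K off, for the final regularized model
```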
Conclusion
In conclusion, this project provides a great introduction to building a Multiple Linear Regression model and covers common techniques to address issues that might come up along the way. I hope you enjoy reviewing it as much as I enjoyed building it!