Project 3

Project Description

Using two machine learning methods, predict population values at 100 x 100 meter resolution throughout your selected country.
Validate the two models using different methods presented in this class.
Write a report assessing the two approaches and which of the two models was more accurate. Be sure to account for spatial variation throughout your selected location and provide substantive explanations for why those variations occurred.

Data

I chose to model the population of Jordan. Jordan is a small country (in both area and number of residents), so there were not very many administrative boundaries, which made the data easier for the models to process for the whole country.

Actual population of Jordan

ActPop

Validation Methods:

Simple difference between actual and predicted population values
Mean Absolute Error
Root Mean Squared Error
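
The three validation methods listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this project: the function names and the four sample cell values are hypothetical, and real inputs would be the flattened actual and predicted population rasters. The sign convention (predicted minus actual) matches the difference maps in this report, so negative values mean underprediction.

```python
import math

def differences(actual, predicted):
    """Per-cell difference: predicted minus actual (negative = underprediction)."""
    return [p - a for a, p in zip(actual, predicted)]

def mean_absolute_error(actual, predicted):
    """Average size of the per-cell error, ignoring its sign."""
    diffs = differences(actual, predicted)
    return sum(abs(d) for d in diffs) / len(diffs)

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared per-cell error; penalizes large misses more."""
    diffs = differences(actual, predicted)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical values for four grid cells (two urban, two rural), for illustration only.
actual = [120.0, 80.0, 5.0, 0.0]
predicted = [90.0, 70.0, 9.0, 2.0]

diffs = differences(actual, predicted)
total_difference = sum(diffs)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

Because RMSE squares each error before averaging, a few badly underpredicted urban cells raise it much more than they raise MAE, which is why the two are worth reporting together.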

Linear Regression

Population predicted by Linear Regression Model

LRpop

Difference between LR predicted population and actual population (subtracted actual population from predicted population)

LRpopdiff LRstats diffLR

The map of population difference across the country, together with the statistics comparing actual and predicted population, indicates that the linear regression model underpredicts the population. According to this simple validation, the most underpredicted areas were the urban areas and the most overpredicted areas were the rural areas. Because this first validation method was very simple, I also calculated MAE and RMSE.

LR Mean Absolute Error

LRMAE

LR Root Mean Squared Error

LRrmse

Both the MAE and RMSE further emphasize the underprediction in the urban areas. The next step is to see whether the Random Forest model predicts population better.

Random Forest

Population predicted by Random Forest Model

RFpop

Most important variable for predicting population

LRvarimp

Difference between RF predicted population and actual population (subtracted actual population from predicted population)

RFpopdiff RFStats diffRF

The map of population difference across the country, together with the statistics comparing actual and predicted population, indicates that the random forest model also underpredicts the population. According to this simple validation, the most underpredicted areas were the urban areas and the most overpredicted areas were the rural areas. Because this first validation method was very simple, I also calculated MAE and RMSE.

According to the simple difference validation, the linear regression model predicted population only slightly better than the random forest model. Linear regression had a difference of 8,959,955 and random forest had a difference of 9,000,918.
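
To put "slightly better" in numbers, the two reported differences can be compared directly. The snippet below is only a worked arithmetic check on the two figures quoted above; the variable names are hypothetical.

```python
# Differences reported by the simple difference validation.
lr_difference = 8959955   # linear regression
rf_difference = 9000918   # random forest

# Absolute gap between the two models.
gap = rf_difference - lr_difference

# Gap relative to the linear regression difference, as a percentage.
relative_gap_pct = gap / lr_difference * 100

print(f"gap: {gap}, relative gap: {relative_gap_pct:.2f}%")
```

The gap is 40,963, or about 0.46% of the linear regression difference, so under this measure the two models are nearly indistinguishable.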

RF Mean Absolute Error

RFMAE

RF Root Mean Squared Error

RFrmse

Both the MAE and RMSE further emphasize the underprediction in the urban areas. The Random Forest model did not perform better. If we zoom in on the most populated area, we can see the error more clearly.
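
Zooming in on the most populated area amounts to cropping the difference grid around the cell with the highest actual population. The sketch below shows one way to do that on plain nested lists; `crop_around_peak` and the small 4 x 4 grids are hypothetical and only illustrate the idea, they are not the rasters or the code used for this project.

```python
def crop_around_peak(reference, values, half_width):
    """Return the window of `values` centred on the largest cell of `reference`.

    The window is clipped at the grid edges, so it can be smaller than
    (2 * half_width + 1) cells on a side near a border.
    """
    cells = ((r, c) for r in range(len(reference)) for c in range(len(reference[0])))
    peak_row, peak_col = max(cells, key=lambda rc: reference[rc[0]][rc[1]])
    top = max(0, peak_row - half_width)
    left = max(0, peak_col - half_width)
    return [
        row[left:peak_col + half_width + 1]
        for row in values[top:peak_row + half_width + 1]
    ]

# Hypothetical actual population and (predicted - actual) difference grids.
actual = [
    [0, 1, 2, 1],
    [1, 40, 90, 3],
    [0, 30, 60, 2],
    [0, 1, 2, 0],
]
difference = [
    [1, 0, 1, 0],
    [0, -15, -40, 1],
    [1, -10, -25, 0],
    [0, 1, 0, 1],
]

window = crop_around_peak(actual, difference, half_width=1)
```

In this toy example the cropped window holds the large negative differences, which mirrors the pattern of urban underprediction described above.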

Spatial Variation

diffAmman

The spatial variation in the prediction error is most likely because the most important variable in the models was nighttime lights. Light is more concentrated in cities, but it doesn’t account for building height. If there is a skyscraper, there may be more people living in that one grid cell than the model predicts, because less light is emitted per person there than in rural areas.