Data310_workbook


Project 1

First Steps

I used the zillow_scrape.py file provided in class to scrape 400 listings from Zillow for houses for sale in Detroit, MI. I chose Detroit because I am moving to Michigan next year and wanted to learn about the housing market there. The script exported the data as a .csv file, which I then imported into a separate Python file for data processing.
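As a rough sketch of that import step (the filename here is a placeholder, not the exact name the script exported):

import pandas as pd

# Load the scraped Detroit listings; the filename is assumed
df1 = pd.read_csv('detroit_listings.csv')
print(df1.shape)   # about 400 rows, one per listing
print(df1.head())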

Data Description

Detroit Addresses

Figure 1. Mapped Addresses using converted latitude and longitude

I created a webmap of all of the addresses, but I couldn't figure out how to host an .html file on GitHub, so here's a screenshot of the web map instead. The homes are clustered, and zooming in on the map reveals the individual homes for sale.
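The map uses marker clustering; a minimal sketch of that approach with folium (the latitude, longitude, and price column names are assumptions) is:

import folium
from folium.plugins import MarkerCluster

# Center on Detroit and cluster nearby listings together
m = folium.Map(location=[42.33, -83.05], zoom_start=11)
cluster = MarkerCluster().add_to(m)
for _, row in df1.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"${row['price']:,.0f}",
    ).add_to(cluster)

m.save('detroit_map.html')   # the web map shown in the screenshot above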

Data Summary
| Variable | Max | Min | Mean |
| --- | --- | --- | --- |
| Price ($) | 2,200,000 | 1,000 | 146,000 |
| Number of Bedrooms | 7 | 2 | 2.5 |
| Number of Baths | 7 | 1 | 2.5 |
| Square Footage (ft^2) | 7,075 | 770 | 2,041 |
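The values in the table came from aggregating the dataframe; a short sketch (column names assumed) is:

# Max, min, and mean for the variables summarized above
summary = df1[['price', 'bedrooms', 'bathrooms', 'sqft']].agg(['max', 'min', 'mean'])
print(summary)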

Model Architecture

The model I used is the same one that we used for predicting the Matthews, VA house prices. Because I decided to use location as an additional variable, I have 4 independent variables instead of 3. The dependent variable for the model is house price. The model is Sequential with 1 dense layer, 4 input variables (square footage, number of bedrooms, number of bathrooms, latitude), and 1 output. The variables are normalized to help the model train efficiently: square footage is divided by 1,000, price by 100,000, and latitude by 10. I only used latitude, and not both latitude and longitude, because I thought including both might be redundant for the model. The model ran for 500 epochs; with every epoch, the model evaluates the loss and then adjusts the weights for the next pass.

import numpy as np
import tensorflow as tf

# df1 is the DataFrame of scraped listings loaded from the exported .csv
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[4])])
model.compile(optimizer='sgd', loss='mean_squared_error')

x1 = np.array(df1.iloc[:, 2], dtype=float)  # number of bedrooms
x2 = np.array(df1.iloc[:, 3], dtype=float)  # number of bathrooms
x3 = np.array(df1.iloc[:, 6], dtype=float)  # square footage
x4 = np.array(df1.iloc[:, 9], dtype=float)  # latitude
xs = np.stack([x1, x2, x3, x4], axis=1)     # shape (n_listings, 4)
ys = np.array(df1.iloc[:, 5], dtype=float)  # house price (target)

model.fit(xs, ys, epochs=500)

p = model.predict(xs)                       # predicted price for every listing
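The normalization described above is not shown in the snippet; a minimal sketch of where it could be applied (just before model.fit, assuming the scaling is not already done when df1 is prepared) looks like this:

# Scale the variables as described above: sqft by 1,000, price by 100,000, latitude by 10
x3 = x3 / 1000       # square footage in thousands of ft^2
x4 = x4 / 10         # latitude divided by 10
ys = ys / 100000     # price in units of $100,000
xs = np.stack([x1, x2, x3, x4], axis=1)

If the target is scaled this way, the predictions p have to be multiplied back by 100,000 before comparing them to the listed prices.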

Model Output

The model was not very accurate; the MSE is 52257562878.595474. Even though the model wasn't very accurate, it underpredicted and overpredicted about the same number of homes: it underpredicted 146 and overpredicted 162. The houses that the model underpredicted the most were the most expensive houses, and the houses that were overpredicted tended to be the least expensive.
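The MSE and the under/over counts come from comparing the predictions to the listed prices; a minimal sketch of that calculation (assuming p and ys are both in dollars) is:

# Mean squared error and counts of under- vs. overpredicted homes
preds = p.flatten()
mse = np.mean((preds - ys) ** 2)
under = np.sum(preds < ys)   # predicted less than the listed price
over = np.sum(preds > ys)    # predicted more than the listed price
print(mse, under, over)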

When I ran the model without considering location, the MSE was 48338586251.65768, which is actually better. Perhaps if I had included both longitude and latitude, it would have increased the accuracy of my model instead of decreasing it.

ActualvsPredicted

Figure 2. Listed House Prices compared to Model-Predicted House Prices (USD) when the model includes latitude. Notice the difference in axis scales.

Points plotted above the line are overpredicted; points plotted below the line are underpredicted.

ActualvsPredicted

Figure 3. Listed House Prices compared to Model-Predicted House Prices (USD) when the model uses 3 variables (does not include latitude). Notice the difference in axis scales.

Points plotted above the line are overpredicted; points plotted below the line are underpredicted.
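Figures 2 and 3 plot the listed price against the prediction with a reference line where the two are equal; a sketch of that plot with matplotlib is:

import matplotlib.pyplot as plt

# Listed price vs. model prediction, with a y = x reference line
plt.scatter(ys, preds, s=10)
lims = [min(ys.min(), preds.min()), max(ys.max(), preds.max())]
plt.plot(lims, lims, color='red')   # points above the line are overpredicted
plt.xlabel('Listed price ($)')
plt.ylabel('Predicted price ($)')
plt.show()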

Difference

Figure 4. Listed House Prices compared to the difference between listed and predicted prices. Notice the difference in axis scales.

Houses plotted below zero are underpredicted and thus are a bad value for the homeowners; houses plotted above zero are overpredicted and would be a good value for the homeowners. The graph is fairly hard to read because some of the most expensive houses were underpredicted by such a large amount.

Best Values for current Homeowners: Homes that were overpredicted

GoodValue

Worst Values for current Homeowners: Homes that were underpredicted

BadValue
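The two lists above come from ranking the homes by how far the prediction lands from the listed price; a hedged sketch of that ranking (the 'address' column name is an assumption) is:

# Difference between prediction and listed price: positive = overpredicted
df1['difference'] = preds - ys

best = df1.sort_values('difference', ascending=False).head(10)   # most overpredicted
worst = df1.sort_values('difference', ascending=True).head(10)   # most underpredicted
print(best[['address', 'price', 'difference']])
print(worst[['address', 'price', 'difference']])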