- Before we start the report, here are a few things to know up front:
1. This is the first project from Udacity Data Scientist Term 2.
2. This project uses two datasets: the Airbnb Seattle and Boston host and review data.
3. This report discusses three questions about the datasets.
- You can get all the code for this project from https://github.com/kylechenoO/airbnb
Explore and Preprocess Data
- The data covers two main areas: Seattle and Boston. The two datasets are very similar and differ only in some details. I won't discuss those differences here, but I noted them as comments in the code. If needed, you can find them in my GitHub repository, linked before this section.
- To preprocess the data, I checked the features that contain NaN values one by one and decided which could be dropped and which could not. I also applied one-hot encoding to some features. Take the price and available features in df_seattle_clendar_raw as an example (you can find the details in my code): the rows with a NaN price cannot be dropped, because those are the dates on which the listing is not available at all. The NaN is expected there, so I just keep those rows in.
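- The NaN check and the one-hot step described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the three-column `calendar` frame, its sample values, and the `'t'`/`'f'` coding of `available` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the calendar table (the real one is
# df_seattle_clendar_raw); values and the 't'/'f' coding are assumed.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "available": ["t", "f", "t", "f"],
    "price": ["$85.00", np.nan, "$120.00", np.nan],
})

# Share of missing values per column, to decide what can be dropped.
nan_share = calendar.isna().mean()

# Check that every missing price belongs to a date that is not available.
# If this holds, the NaN carries meaning and the rows should be kept.
missing_price_is_unavailable = (
    calendar.loc[calendar["price"].isna(), "available"] == "f"
).all()

# One-hot encode the availability flag into available_f / available_t.
calendar_encoded = pd.get_dummies(calendar, columns=["available"], prefix="available")
```

- If `missing_price_is_unavailable` came out `False` on the real data, some NaN prices would be genuinely missing values and would need separate handling.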
Getting to the Main Research
- After preprocessing the data, I tried to find patterns in it. To do that, I cut the dataframes down and kept only the features id, city, price, security_deposit, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, and review_scores_value from df_seattle_listings and df_boston_listings. It looks like this:
- The price and security_deposit features need to be converted to int type. So I wrote a function that removes the '$', '.', and ',' characters from those columns. After that, the dataframe info looks like this:
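- A minimal sketch of the column selection and the price conversion is shown below. The sample rows, the `KEEP` list (shortened here), and the helper name `money_to_int` are hypothetical. The sketch also makes one assumption that differs from the description above: it strips only '$' and ',' and then rounds to whole dollars, because deleting the '.' outright would turn '$85.00' into 8500.

```python
import pandas as pd

# Hypothetical sample rows standing in for df_seattle_listings.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["Seattle", "Seattle", "Seattle"],
    "price": ["$85.00", "$1,200.00", "$150.00"],
    "security_deposit": ["$100.00", None, "$1,000.00"],
    "review_scores_rating": [95.0, 88.0, None],
    "name": ["a", "b", "c"],  # a column that is not needed and gets dropped
})

# Shortened version of the feature list kept for the analysis.
KEEP = ["id", "city", "price", "security_deposit", "review_scores_rating"]


def money_to_int(series):
    """Convert strings such as '$1,200.00' to whole dollars (nullable Int64)."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    # Anything that is not a number (e.g. a missing deposit) becomes NaN.
    numbers = pd.to_numeric(cleaned, errors="coerce")
    return numbers.round().astype("Int64")


subset = listings[KEEP].copy()
for col in ["price", "security_deposit"]:
    subset[col] = money_to_int(subset[col])
```

- The nullable `Int64` dtype is used so that listings without a security deposit stay missing instead of being forced to 0, which is a design choice of this sketch.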
Question 1: Is there any pattern between price and review_scores_rating?
- OK, let's start looking for a pattern between price and review_scores_rating.
- First, we need a scatter plot of Price against Review Scores Rating. It looks like this:
- The scatter plot of Price against Review Scores Rating tells us something. As the price of a listing goes up, the distribution of its review scores rating tends to move up as well. You will also find that for listings priced below $400, the probability of a score between 70 and 90 is quite high. So don't try too hard to save money: in this data, the more expensive listings tend to be worth it.
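- As a numeric companion to the scatter plot, the ratings can be summarised per price band. This is a sketch on made-up numbers, not a result from the Airbnb data; the band edges and the sample values are assumptions, and grouping by band is my addition, not a step from the original analysis.

```python
import pandas as pd

# Made-up sample; the real frame holds price and review_scores_rating
# for the Seattle and Boston listings.
df = pd.DataFrame({
    "price": [60, 90, 150, 220, 380, 450, 600],
    "review_scores_rating": [82.0, 90.0, 93.0, 95.0, 88.0, 97.0, 98.0],
})

# Cut the price into bands; the edges are arbitrary and chosen for illustration.
bands = pd.cut(
    df["price"],
    bins=[0, 100, 200, 400, float("inf")],
    labels=["<=100", "101-200", "201-400", ">400"],
)

# Count, mean and minimum rating inside each price band.
summary = (
    df.groupby(bands, observed=True)["review_scores_rating"]
    .agg(["count", "mean", "min"])
)

# Pearson correlation between price and rating over the whole sample.
price_rating_corr = df["price"].corr(df["review_scores_rating"])
```

- If the pattern in the plot holds, the `min` column should rise with the band on the real data, showing fewer low scores among expensive listings.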
Question 2: Is there any pattern between price and any of the other scores?
- For this question, I wrote a loop that draws one plot per feature, because there are a lot of features that need to be plotted one by one.
- This report shows only these two plots, because price shows no relation with any of the other features:
- After plotting every feature against price, it seems that the price/security_deposit plot is the only one worth discussing, apart from the price/review_scores_rating plot we analyzed before. The price/security_deposit plot suggests that security_deposit is not related to price. Maybe the hosts of expensive listings do not care much about the security deposit, because they are targeting high-end customers.
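- The per-feature loop can be sketched as below. To keep the example free of plotting dependencies it computes a Pearson correlation with price for each feature, which is my numeric stand-in for the plots and not what the original loop does. The sample values are made up and only a few of the score columns are included.

```python
import pandas as pd

# Made-up sample with a few of the features compared against price.
df = pd.DataFrame({
    "price": [60, 90, 150, 220, 380],
    "security_deposit": [100, 0, 200, 150, 500],
    "review_scores_cleanliness": [9, 10, 9, 10, 10],
    "review_scores_location": [8, 9, 10, 9, 10],
})

# Every column except price is compared with price, one at a time.
features = [col for col in df.columns if col != "price"]

correlations = {}
for col in features:
    # Drop rows where either value is missing before comparing the pair.
    pair = df[["price", col]].dropna()
    correlations[col] = pair["price"].corr(pair[col])

# Order the features by the strength of their relationship with price.
result = pd.Series(correlations)
ranked = result.reindex(result.abs().sort_values(ascending=False).index)
```

- To get the plots instead, the correlation line inside the loop could be replaced with a call such as `pair.plot.scatter(x="price", y=col)`, which requires matplotlib.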
Question 3: Is there any pattern between location and price?
- In this section, I concatenate the two dataframes into one and apply one-hot encoding to the city feature.
- Before applying one-hot encoding, the city values need to be checked and the locations transformed into a consistent format. Then I plot the result:
- It seems that many of the expensive listings fall in classes 3 and 20. These are the city centers of Seattle and Boston. City centers are always convenient for shopping and transport, so they are worth the price.
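- The combining, cleaning, and encoding steps of this section can be sketched as follows. The two small frames and their city spellings are made up for the example. The integer `city_class` column is an assumption about how class numbers like the ones in the plot could be produced; the repository may do it differently.

```python
import pandas as pd

# Made-up stand-ins for df_seattle_listings and df_boston_listings.
df_seattle = pd.DataFrame({
    "city": ["Seattle", "seattle ", "SEATTLE"],
    "price": [120, 95, 80],
})
df_boston = pd.DataFrame({
    "city": ["Boston", " boston", "Cambridge"],
    "price": [200, 150, 110],
})

# Stack the two frames into one and renumber the rows.
df_all = pd.concat([df_seattle, df_boston], ignore_index=True)

# Bring the city names into one format: no stray spaces, title case.
df_all["city"] = df_all["city"].str.strip().str.title()

# Assumed way to get one integer class per city (alphabetical order).
df_all["city_class"] = df_all["city"].astype("category").cat.codes

# One-hot encode the cleaned city column.
city_dummies = pd.get_dummies(df_all["city"], prefix="city")
df_encoded = pd.concat([df_all, city_dummies], axis=1)

# Median price per city, highest first, as a numeric view of the plot.
median_price = df_all.groupby("city")["price"].median().sort_values(ascending=False)
```

- Without the cleaning step, 'Seattle', 'seattle ' and 'SEATTLE' would each get their own one-hot column, which is why the format check comes before the encoding.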
- As you can see, I didn't use any ML models to classify patterns or anything else, because I found it was not necessary. The data can tell us a lot on its own. I analyzed Airbnb's Seattle and Boston datasets and tried to find patterns between price, review scores, and location. They were not easy to find. In the end, I found that the city centers of Seattle and Boston are the best places to own a house and rent it out on Airbnb. A house there is easier to rent out and can be rented at a better price.
- To sum up: listings in the city center are more expensive than the others, the higher-priced listings are worth renting, and a higher price does not really go together with a higher security_deposit, perhaps because those listings target high-end customers. So if you want to buy a house in Seattle or Boston and rent it out on Airbnb, I suggest you choose one in the center of the city.