Application of machine learning methods for estimating apartment prices in the Czech Republic
Thesis title in Czech: | Aplikace metod strojového učení pro odhad cen bytů v České republice |
---|---|
Title in English: | Application of machine learning methods for estimating apartment prices in the Czech Republic |
Academic year of topic announcement: | 2017/2018 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Economic Studies (23-IES) |
Supervisor: | prof. PhDr. Ladislav Krištoufek, Ph.D. |
Author: | hidden - assigned by the supervisor |
Date of registration: | 13.06.2018 |
Date of assignment: | 13.06.2018 |
Date and time of defence: | 16.09.2019 09:00 |
Venue of defence: | Opletalova - Opletalova 26, O206, Opletalova - room No. 206 |
Date of electronic submission: | 30.07.2019 |
Date of defence held: | 16.09.2019 |
Opponents: | doc. PhDr. Jozef Baruník, Ph.D. |
URKUND check: |
Preliminary scope of work |
Proposed Topic: The thesis will focus on various models for estimating property prices in the Czech Republic. We will cover apartments only, as the available data are detailed and the sample size is adequate for our needs. The analysis will be performed on cross-sectional data obtained from the main real estate sites on the Czech market. The expected sample size is more than 10,000 observations, which should make our later conclusions sufficiently robust. Moreover, a large dataset is often a critical condition for applying machine learning methods.
Apartment offer prices are publicly known; thus, no dramatic deviations are expected to appear. The models should be able to forecast the price of every new offer precisely, based on historical evidence. Finally, the models should make it possible to detect mispriced properties. The output of the empirical research will be compared with the results of similar works. The aim of this work is not only to determine the most relevant parameters for estimating property market prices but also to find the approach that provides the most accurate prediction. The final discussion will cover the potential replacement of the comparative method used in everyday practice.
Hypotheses:
1) Machine learning methods provide more accurate estimates of apartment prices.
2) Advertising text descriptions have a significant impact on the offer prices of properties.
3) The set of independent variables that determine apartment prices differs between Prague and the rest of the Czech Republic.
Methodology: In the thesis, we will use two groups of methods for price estimation. First, linear regression will be employed. Although linear models are not used for expert evidence in common practice, numerous papers on them have been published. The following part will be dedicated to machine learning techniques, as their popularity has grown in recent years. We will apply the least absolute shrinkage and selection operator (LASSO), decision trees, random forests and nearest neighbor methods. The extensive empirical analysis will be the main component of the thesis, and the combination of different approaches should shed light on the Czech apartment market. We expect that a substantial part of the apartment price is determined by factors that cannot be easily quantified. On that account, we will refine the models using text mining. To support or reject the third hypothesis, we will create restricted models focused on Prague. We believe that the characteristics of the market in the Czech capital differ from those in the rest of the country.
Expected Contribution: In-depth analysis of the content of listing descriptions is a relatively new and infrequently used approach. Hence, natural language processing (NLP) should provide significant added value to the economic research. Moreover, the text will be predominantly in Czech. Because of the complicated grammar of the language, the analysis will be more complex than in papers working with English text. Furthermore, this approach will be used for the first time in the analysis of housing prices on the Czech market. From a practical perspective, the detection of mispricing by the models gives investors the opportunity to find the best pick, as even a small deviation from the correct price can turn into an outstanding deal. Text analysis should uncover hidden gems through automated data processing on a periodic basis. Additionally, the models should take into consideration the impact of large cities on the price, which is not always considered in similar works.
Outline:
1) Introduction
2) Theoretical Background
3) Literature Review
4) Data
5) Methodology
a) Application of Text Mining
b) Conventional Econometric Estimation Methods
c) Machine Learning Methods
6) Results and Model Comparison
7) Conclusion |
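The nearest neighbor estimation and the detection of mispriced offers mentioned in the proposal can be illustrated with a minimal sketch. It is not the method used in the thesis: the listings, the two features (floor area and distance to the city centre) and the function names `knn_price` and `mispricing` are all assumptions made up for this illustration, and only the Python standard library is used.

```python
from math import dist
from statistics import mean

# Hypothetical listings: (floor area in m2, distance to centre in km, offer price in CZK).
# All numbers are invented for illustration only.
LISTINGS = [
    (50.0, 2.0, 5_000_000),
    (52.0, 2.5, 5_100_000),
    (48.0, 1.8, 4_900_000),
    (80.0, 6.0, 6_000_000),
    (82.0, 6.5, 6_100_000),
    (78.0, 5.5, 5_900_000),
]


def knn_price(area, distance_km, listings, k=3):
    """Estimate a price as the mean offer price of the k nearest listings."""
    # Plain Euclidean distance on raw features; a real model would
    # standardise the features first so that no single unit dominates.
    nearest = sorted(
        listings, key=lambda row: dist((area, distance_km), row[:2])
    )[:k]
    return mean(price for _, _, price in nearest)


def mispricing(area, distance_km, offer_price, listings, k=3):
    """Relative deviation of an offer from its k-NN estimate (negative = cheaper)."""
    estimate = knn_price(area, distance_km, listings, k)
    return (offer_price - estimate) / estimate


# A hypothetical new offer that looks cheap relative to its neighbors.
new_offer = (51.0, 2.2, 4_000_000)
estimate = knn_price(new_offer[0], new_offer[1], LISTINGS)
deviation = mispricing(*new_offer, LISTINGS)
print(f"estimate: {estimate:,.0f} CZK, deviation: {deviation:+.1%}")
```

A strongly negative deviation marks an offer as a candidate for closer inspection. How large a deviation counts as mispricing is a modelling choice that the proposal leaves open.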