Analysis of stock market sentiment with social media
Thesis title in Czech: | Analýza sentimentu akciového trhu pomocí sociálních médií |
---|---|
Thesis title in English: | Analysis of stock market sentiment with social media |
Keywords (translated from Czech): | Twitter sentiment, word embeddings, volatility, text representation |
Keywords in English: | Twitter sentiment, Word embeddings, volatility |
Academic year of topic announcement: | 2015/2016 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Economic Studies (23-IES) |
Supervisor: | doc. PhDr. Jozef Baruník, Ph.D. |
Author: | hidden - assigned by the supervisor |
Date of registration: | 22.07.2016 |
Date of assignment: | 22.07.2016 |
Date and time of defence: | 20.06.2018 08:30 |
Venue of defence: | Opletalova - Opletalova 26, O206, Opletalova - room no. 206 |
Date of electronic submission: | 11.05.2018 |
Date of completed defence: | 20.06.2018 |
Opponents: | PhDr. Pavel Vacek, Ph.D. |
URKUND check: |
List of references |
Alanyali, M., H. S. Moat, & T. Preis (2013): “Quantifying the relationship between financial news and the stock market.” Scientific reports 3: p. 3578.
Asur, S. & B. A. Huberman (2010): “Predicting the future with social media.” In “Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on,” volume 1, pp. 492–499. IEEE.
Bird, S. & E. Loper (2004): “Nltk: the natural language toolkit.” In “Proceedings of the ACL 2004 on Interactive poster and demonstration sessions,” p. 31. Association for Computational Linguistics.
Bollen, J., H. Mao, & A. Pepe (2011a): “Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena.” ICWSM 11: pp. 450–453.
Bollen, J., H. Mao, & X. Zeng (2011b): “Twitter mood predicts the stock market.” Journal of computational science 2(1): pp. 1–8.
Carter, R. (2011): English grammar today: An A-Z of spoken and written grammar. Ernst Klett Sprachen.
De Boom, C., S. Van Canneyt, T. Demeester, & B. Dhoedt (2016): “Representation learning for very short texts using weighted word embedding aggregation.” Pattern Recognition Letters 80: pp. 150–156.
Dimpfl, T. & S. Jank (2016): “Can internet search queries help to predict stock market volatility?” European Financial Management 22(2): pp. 171–192.
Fama, E. F. (1970): “Efficient capital markets: A review of theory and empirical work.” The journal of Finance 25(2): pp. 383–417.
Fellbaum, C. (1998): WordNet. Wiley Online Library.
Fremunt, M. (2015): “Predictability of security returns using twitter sentiment.”
Gentzkow, M., B. T. Kelly, & M. Taddy (2017): “Text as data.” Technical report, National Bureau of Economic Research.
Gilbert, C. H. E. (2014): “Vader: A parsimonious rule-based model for sentiment analysis of social media text.” In “Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.”
Go, A., R. Bhayani, & L. Huang (2009): “Twitter sentiment classification using distant supervision.” CS224N Project Report, Stanford 1(2009): p. 12.
Kolchyna, O., T. T. Souza, P. Treleaven, & T. Aste (2015): “Twitter sentiment analysis: Lexicon method, machine learning method and their combination.” arXiv preprint arXiv:1507.00955.
Mao, H., S. Counts, & J. Bollen (2011): “Predicting financial markets: Comparing survey, news, twitter and search engine data.” arXiv preprint arXiv:1112.1051.
Marcin Zablocki (2017): “Sentiment analysis of Tweets with Python, NLTK, Word2Vec and SciKit-Learn.” https://zablo.net/blog/post/twitter-sentiment-analysis-python-scikit-word2vec-nltk-xgboost. Online; accessed 22 April 2018.
Mikolov, T., K. Chen, G. Corrado, & J. Dean (2013): “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781.
Pak, A. & P. Paroubek (2010): “Twitter as a corpus for sentiment analysis and opinion mining.” In “LREC,” volume 10.
Pang, B., L. Lee, & S. Vaithyanathan (2002): “Thumbs up?: sentiment classification using machine learning techniques.” In “Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10,” pp. 79–86. Association for Computational Linguistics.
Pang, B., L. Lee et al. (2008): “Opinion mining and sentiment analysis.” Foundations and Trends® in Information Retrieval 2(1–2): pp. 1–135.
Pennington, J., R. Socher, & C. D. Manning (2014): “Glove: Global vectors for word representation.” In “Empirical Methods in Natural Language Processing (EMNLP),” pp. 1532–1543.
Ranco, G., D. Aleksovski, G. Caldarelli, M. Grčar, & I. Mozetič (2015): “The effects of twitter sentiment on stock price returns.” PloS one 10(9): p. e0138441.
Rao, T. & S. Srivastava (2012): “Analyzing stock market movements using twitter sentiment analysis.” In “Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012),” pp. 119–123. IEEE Computer Society.
Salton, G., A. Wong, & C.-S. Yang (1975): “A vector space model for automatic indexing.” Communications of the ACM 18(11): pp. 613–620.
Schütze, H., C. D. Manning, & P. Raghavan (2008): Introduction to information retrieval, volume 39. Cambridge University Press.
Si, J., A. Mukherjee, B. Liu, Q. Li, H. Li, & X. Deng (2013): “Exploiting topic based twitter sentiment for stock prediction.” ACL (2) 2013: pp. 24–29.
Souza, T. T. P., O. Kolchyna, P. C. Treleaven, & T. Aste (2015): “Twitter sentiment analysis applied to finance: A case study in the retail industry.” arXiv preprint arXiv:1507.00784.
Zhang, W., S. Skiena et al. (2010): “Trading strategies to exploit blog and news sentiment.” In “ICWSM.”
Zheludev, I., R. Smith, & T. Aste (2014): “When can social media lead financial markets?” Scientific reports 4: p. 4213. |
Preliminary scope of work |
Motivation
Thousands of tweets are posted every minute. A considerable part of this massive stream of information consists of personal opinions or expressions of the posters' attitudes towards some topic. Several papers have studied methods for extracting the sentiment contained in Twitter micro-messages. For example, Kolchyna et al. (2015) used both machine learning and lexical methods for sentiment classification. They further showed that the machine learning approach outperformed the lexical approach, and that the two combined yielded an even more precise classification. According to the efficient market hypothesis proposed by Fama (1970), there should be no way to exploit knowledge of the sentiment to generate abnormal returns on the stock market. Recently, however, much research has examined the effect of Twitter-based sentiment on stock markets. For example, Fremunt (2015) analyzed aggregated Twitter data using a lexical method and found a significant relation at long investment horizons, an effect that diminishes as the horizon shortens. Ranco et al. (2015) likewise showed a significant dependence for several days after events characterized by Twitter peaks, although it resulted in only small cumulative abnormal returns. In this thesis we will use Twitter sentiment analysis to model the returns and volatility of selected stock market indices. To do so, we will analyze a large dataset of all tweets that include specific keywords related to the stock indices of thirteen technology companies, such as Google and Microsoft, over the past 20 months. The dataset was collected using the official Twitter Streaming API.
Hypotheses
Hypothesis #1: We are able to quantify and measure the sentiment or mood in society using Twitter and other text-based social media, such as blogs.
Hypothesis #2: Using suitable Twitter hashtags and word combinations as sentiment indicators, we are able to construct an index with better predictive power than a lexical index based on aggregated words.
Hypothesis #3: Sentiment is a significant variable for explaining future returns and especially future volatility.
Hypothesis #4: In the case of significant exogenous events (earthquakes, terrorist attacks, etc.), the market reaction is faster than the corresponding reaction of social media, leaving no room for arbitrage.
Methodology
With such an enormous number of Twitter messages, we will need a methodology suited to very large datasets in order to process all relevant sentiment information in the tweets in reasonable time. This will be done using suitable algorithms and data structures. The correlation between the sentiment indices and real financial data will be measured using standard discrete time-series methods such as Granger causality tests and vector autoregression. In addition to these methods, the fit between the two types of data will also be measured using other approaches such as wavelets. To fully exploit the benefits of the high-frequency dataset, we will model volatility using techniques based on realized volatility, such as heterogeneous autoregressive models. We will compare the performance of these models with standard GARCH-family models. To detect outlying events in stock market data, the bipower variation method will be used. This method will allow us to mark outliers in the market data and match them to events on Twitter using the methods described in the paragraph above.
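The realized-volatility and bipower-variation steps named in the methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the function names, the toy return series, and the conventional 5-day and 22-day windows are choices made only for this example. Realized variance is the sum of squared intraday returns, bipower variation scales the sum of products of adjacent absolute returns by π/2 and is robust to jumps, and a positive difference between the two flags a day with an outlying price move.

```python
import math


def realized_variance(returns):
    """Realized variance: sum of squared intraday returns."""
    return sum(r * r for r in returns)


def bipower_variation(returns):
    """Bipower variation: (pi / 2) * sum of |r_i| * |r_{i-1}|."""
    pairs = zip(returns[1:], returns[:-1])
    return (math.pi / 2) * sum(abs(a) * abs(b) for a, b in pairs)


def jump_component(returns):
    """Realized variance in excess of bipower variation, floored at zero."""
    return max(realized_variance(returns) - bipower_variation(returns), 0.0)


def har_regressors(daily_rv, t, week=5, month=22):
    """Daily, weekly and monthly average realized variance known at day t.

    These are the explanatory variables of a heterogeneous autoregressive
    (HAR) regression for day t + 1. The window lengths of 5 and 22 trading
    days are an assumption of this sketch.
    """
    if t < month - 1:
        raise ValueError("not enough history for the monthly window")
    rv_day = daily_rv[t]
    rv_week = sum(daily_rv[t - week + 1 : t + 1]) / week
    rv_month = sum(daily_rv[t - month + 1 : t + 1]) / month
    return rv_day, rv_week, rv_month


# Toy intraday returns (hypothetical numbers): a calm day and a day
# containing one large move.
calm_day = [0.001, -0.001, 0.001, -0.001]
jump_day = [0.001, -0.001, 0.02, -0.001]

print(jump_component(calm_day))
print(jump_component(jump_day))
```

In a HAR specification, next-day realized variance is regressed on the three averages returned by `har_regressors`. A sentiment index could enter as an additional regressor, which is one possible reading of Hypothesis #3 rather than a design stated in the proposal.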
Expected Contribution
The main goal of the thesis is to show the relation between Twitter-based sentiment and financial markets on different time horizons, with an emphasis on the analysis of volatility. We will use a unique dataset of tweets specifically tailored for sentiment analysis of technology stock indices. The combination of the Twitter data and tick-by-tick stock market data lets us use all the benefits of high-frequency methodology and continuous approximation of the time series. With such a dataset, a machine learning approach allows us to create sentiment indices from tweets that closely fit the market data. In this way, we are able to test our hypotheses with much greater precision than past research. The thesis will examine the possibilities of creating trading algorithms based on Twitter sentiment. We also want to explore new risk management approaches based on social media. For risk management and hedging strategies, measuring the magnitude of the social media reaction to market events, or vice versa, can be extremely appealing.
Outline
1. Introduction
2. Literature review
3. Methodology
4. Data description
5. Results of analysis
6. Discussion of the results
7. Conclusion |