Thesis details
Analysis of stock market sentiment with social media
Thesis title in Czech: Analýza sentimentu akciového trhu pomocí sociálních médií
Thesis title in English: Analysis of stock market sentiment with social media
Keywords in Czech: sentiment Twitteru, vnoření slov, volatilita, reprezentace textu
Keywords in English: Twitter sentiment, Word embeddings, volatility
Academic year of topic announcement: 2015/2016
Thesis type: master's thesis
Thesis language: English
Department: Institute of Economic Studies (23-IES)
Supervisor: doc. PhDr. Jozef Baruník, Ph.D.
Author: hidden - assigned by the supervisor
Date of registration: 22.07.2016
Date of assignment: 22.07.2016
Date and time of defence: 20.06.2018 08:30
Venue of defence: Opletalova - Opletalova 26, O206, Opletalova - room no. 206
Date of electronic submission: 11.05.2018
Date of completed defence: 20.06.2018
Opponents: PhDr. Pavel Vacek, Ph.D.
 
 
 
List of academic literature
Alanyali, M., H. S. Moat, & T. Preis (2013): “Quantifying the relationship
between financial news and the stock market.” Scientific reports 3: p. 3578.
Asur, S. & B. A. Huberman (2010): “Predicting the future with social media.”
In “Web Intelligence and Intelligent Agent Technology (WI-IAT),
2010 IEEE/WIC/ACM International Conference on,” volume 1, pp. 492–
499. IEEE.
Bird, S. & E. Loper (2004): “Nltk: the natural language toolkit.” In “Proceedings
of the ACL 2004 on Interactive poster and demonstration sessions,”
p. 31. Association for Computational Linguistics.
Bollen, J., H. Mao, & A. Pepe (2011a): “Modeling public mood and emotion:
Twitter sentiment and socio-economic phenomena.” ICWSM 11: pp.
450–453.
Bollen, J., H. Mao, & X. Zeng (2011b): “Twitter mood predicts the stock
market.” Journal of computational science 2(1): pp. 1–8.
Carter, R. (2011): English grammar today: An A-Z of spoken and written
grammar. Ernst Klett Sprachen.
De Boom, C., S. Van Canneyt, T. Demeester, & B. Dhoedt (2016):
“Representation learning for very short texts using weighted word embedding
aggregation.” Pattern Recognition Letters 80: pp. 150–156.
Dimpfl, T. & S. Jank (2016): “Can internet search queries help to predict
stock market volatility?” European Financial Management 22(2): pp. 171–
192.
Fama, E. F. (1970): “Efficient capital markets: A review of theory and empirical
work.” The journal of Finance 25(2): pp. 383–417.
Fellbaum, C. (1998): WordNet. Wiley Online Library.
Fremunt, M. (2015): “Predictability of security returns using twitter sentiment.”
Gentzkow, M., B. T. Kelly, & M. Taddy (2017): “Text as data.” Technical
report, National Bureau of Economic Research.
Gilbert, C. H. E. (2014): “Vader: A parsimonious rule-based model for
sentiment analysis of social media text.” In “Eighth International Conference
on Weblogs and Social Media (ICWSM-14). Available at (20/04/16)
http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.”
Go, A., R. Bhayani, & L. Huang (2009): “Twitter sentiment classification
using distant supervision.” CS224N Project Report, Stanford 1(2009): p. 12.
Kolchyna, O., T. T. Souza, P. Treleaven, & T. Aste (2015): “Twitter
sentiment analysis: Lexicon method, machine learning method and their
combination.” arXiv preprint arXiv:1507.00955.
Mao, H., S. Counts, & J. Bollen (2011): “Predicting financial markets:
Comparing survey, news, twitter and search engine data.” arXiv preprint
arXiv:1112.1051.
Marcin Zablocki (2017): “Sentiment analysis of Tweets with Python,
NLTK, Word2Vec and SciKit-Learn.”
https://zablo.net/blog/post/twitter-sentiment-analysis-python-scikit-word2vec-nltk-xgboost.
Online; accessed 22 April 2018.
Mikolov, T., K. Chen, G. Corrado, & J. Dean (2013): “Efficient estimation
of word representations in vector space.” arXiv preprint arXiv:1301.3781.
Pak, A. & P. Paroubek (2010): “Twitter as a corpus for sentiment analysis
and opinion mining.” In “LREC,” volume 10.
Pang, B., L. Lee, & S. Vaithyanathan (2002): “Thumbs up?: sentiment
classification using machine learning techniques.” In “Proceedings of the
ACL-02 conference on Empirical methods in natural language processing,
Volume 10,” pp. 79–86. Association for Computational Linguistics.
Pang, B., L. Lee et al. (2008): “Opinion mining and sentiment analysis.”
Foundations and Trends® in Information Retrieval 2(1–2): pp. 1–135.
Pennington, J., R. Socher, & C. D. Manning (2014): “Glove: Global vectors
for word representation.” In “Empirical Methods in Natural Language
Processing (EMNLP),” pp. 1532–1543.
Ranco, G., D. Aleksovski, G. Caldarelli, M. Grčar, & I. Mozetič
(2015): “The effects of twitter sentiment on stock price returns.” PloS one
10(9): p. e0138441.
Rao, T. & S. Srivastava (2012): “Analyzing stock market movements using
twitter sentiment analysis.” In “Proceedings of the 2012 International
Conference on Advances in Social Networks Analysis and Mining (ASONAM
2012),” pp. 119–123. IEEE Computer Society.
Salton, G., A. Wong, & C.-S. Yang (1975): “A vector space model for
automatic indexing.” Communications of the ACM 18(11): pp. 613–620.
Schütze, H., C. D. Manning, & P. Raghavan (2008): Introduction to information
retrieval, volume 39. Cambridge University Press.
Si, J., A. Mukherjee, B. Liu, Q. Li, H. Li, & X. Deng (2013): “Exploiting
topic based twitter sentiment for stock prediction.” ACL (2) 2013: pp.
24–29.
Souza, T. T. P., O. Kolchyna, P. C. Treleaven, & T. Aste (2015): “Twitter
sentiment analysis applied to finance: A case study in the retail industry.”
arXiv preprint arXiv:1507.00784.
Zhang, W., S. Skiena et al. (2010): “Trading strategies to exploit blog and
news sentiment.” In “ICWSM.”
Zheludev, I., R. Smith, & T. Aste (2014): “When can social media lead
financial markets?” Scientific reports 4: p. 4213.
Preliminary scope of work
Motivation

Thousands of tweets are posted every minute. A considerable
part of this massive stream of information includes personal opinions or expressions
of the posters' attitudes toward some topic. Several papers have studied
methods for extracting the sentiment from Twitter micro-messages. For example,
Kolchyna et al. (2015) used both machine learning and lexical methods for sentiment
classification. They further showed that the machine learning approach
outperformed the lexical approach and that combining the two
yielded even more precise classification.
According to the efficient market hypothesis proposed by Fama (1970), there should
be no way to exploit knowledge of the sentiment to generate abnormal
returns on the stock market. However, much recent research has examined
the effect of Twitter-based sentiment on stock markets. For example, Fremunt
(2015) analyzed aggregated Twitter data using a lexical method and found
a significant relation at long investment horizons, an effect that diminishes as
the horizon shortens. Ranco et al. (2015) likewise showed significant dependence
for several days after events characterized by Twitter peaks, although
these resulted in only small cumulative abnormal returns.
In this thesis, we will use Twitter sentiment analysis to model the returns and volatility
of selected stock market indices. To do so, we will analyze a huge dataset of all
tweets that include specific keywords related to the stock indices of thirteen technology
companies, such as Google and Microsoft, over the past 20 months. The dataset
was collected using the official Twitter streaming API.

Hypotheses

Hypothesis #1: We are able to quantify and measure the sentiment or mood
in society using Twitter and other text-based social media, such as blogs.

Hypothesis #2: Using suitable Twitter hashtags and word combinations as
sentiment indicators, we are able to construct an index with better predictive power than
a lexical index based on aggregated words.

Hypothesis #3: Sentiment is a significant variable for explaining future returns
and especially future volatility.

Hypothesis #4: In the case of significant exogenous events (earthquakes, terrorist
attacks, etc.), the market reacts faster than social
media, leaving no room for arbitrage.
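
As a rough illustration of the lexical index based on aggregated words that Hypothesis #2 uses as a baseline, the following minimal Python sketch scores each tweet against a word list and averages the scores per day. The lexicon, the function names and the daily aggregation are assumptions made for this example only; the proposal does not specify them.

```python
from collections import defaultdict

# Hypothetical toy lexicon (an assumption); a real study would use a published one.
LEXICON = {"gain": 1.0, "bullish": 1.0, "beat": 0.5,
           "loss": -1.0, "bearish": -1.0, "miss": -0.5}


def tweet_score(text):
    """Average lexicon score of the words in a tweet (0.0 if no word matches)."""
    words = [w.strip(".,!?#$").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def daily_index(tweets):
    """Map each day to the mean score of its tweets; tweets is an iterable of (day, text)."""
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(tweet_score(text))
    return {day: sum(s) / len(s) for day, s in sorted(by_day.items())}
```

A hashtag-based or machine-learned index would replace only `tweet_score`, so both indices could be compared on the same aggregation.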

Methodology

Given such an enormous number of Twitter messages, we will need
a methodology suited to very large datasets in order to process all relevant
information about the sentiment in the tweets in reasonable time. This will
be done using suitable algorithms and data structures.
The relationship between the sentiment indices and real financial data will be measured using
standard discrete-time methods for time series, such as Granger causality
tests and vector autoregression. In addition to these methods, the fit
between the two types of data will also be measured using other approaches, such
as wavelets.
In order to fully exploit the benefits of the high-frequency dataset, we will model
volatility using techniques based on realized volatility, such as heterogeneous
autoregressive (HAR) models. We will compare the performance of these models with standard
GARCH-family models. To detect outlying events in stock market data, the bipower
variation method will be used. This method will allow us to mark outliers in market
data and match them to events on Twitter using the methods described in the preceding
paragraph.
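
To make the Granger causality idea concrete, here is a minimal sketch in plain Python, assuming a single lag and one candidate predictor. It compares a restricted regression of y on its own lag with an unrestricted one that also includes the lagged x, and returns the resulting F-statistic. The function names and the one-lag choice are assumptions for illustration; an actual analysis would select the lag order and would likely rely on an established statistics library.

```python
def ols_rss(rows, y):
    """Residual sum of squares of an OLS fit of y on the regressors in rows."""
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(a[i][c]))
        a[c], a[p] = a[p], a[c]
        for i in range(k):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [u - f * v for u, v in zip(a[i], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))


def granger_f(x, y):
    """F-statistic for H0: x[t-1] does not help predict y[t] beyond y[t-1]."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    dof = len(target) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)  # one restriction is tested
```

Under the null hypothesis the statistic is compared with an F(1, dof) distribution; here x would be a sentiment index and y a return or volatility series.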
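
The realized-volatility quantities mentioned above can be sketched as follows, assuming a list of intraday returns per day. Realized variance sums squared returns, bipower variation sums products of adjacent absolute returns scaled by pi/2, and the truncated difference between the two serves as a simple jump measure. The HAR helper builds daily, weekly and monthly averages of past realized variance as regressors for the next day's value. The function names and the 5-day and 22-day windows are conventional assumptions, not details given in the proposal.

```python
import math


def realized_variance(returns):
    """Sum of squared intraday returns for one day."""
    return sum(r * r for r in returns)


def bipower_variation(returns):
    """(pi/2) times the sum of products of adjacent absolute returns."""
    scale = math.pi / 2.0
    return scale * sum(abs(a) * abs(b)
                       for a, b in zip(returns[1:], returns[:-1]))


def jump_component(returns):
    """Part of realized variance not explained by bipower variation, floored at 0."""
    return max(realized_variance(returns) - bipower_variation(returns), 0.0)


def har_regressors(daily_rv, week=5, month=22):
    """Pairs ((daily, weekly mean, monthly mean) of past RV, next-day RV)."""
    rows = []
    for t in range(month - 1, len(daily_rv) - 1):
        d = daily_rv[t]
        w = sum(daily_rv[t - week + 1:t + 1]) / week
        m = sum(daily_rv[t - month + 1:t + 1]) / month
        rows.append(((d, w, m), daily_rv[t + 1]))
    return rows
```

Days with a large `jump_component` are candidates for the outlying events to be matched against Twitter activity, and the rows from `har_regressors` can be passed to any least-squares routine, optionally extended with a sentiment index as an extra regressor.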

Expected Contribution

The main goal of the thesis is to show the relation between Twitter-based
sentiment and financial markets on different time horizons, with emphasis on
the analysis of volatility. We will use a unique dataset of tweets specifically tailored for
sentiment analysis of technology stock indices. The combination of the Twitter
data and tick-by-tick stock market data lets us use all the benefits of high-frequency
methodology and continuous-time approximation of the time series. With such a dataset,
a machine learning approach allows us to create tweet-based sentiment indices that
fit the market data precisely. In this way, we are able to test our hypotheses with
much greater precision than past research.
The thesis will examine the possibilities of creating trading algorithms based on
Twitter sentiment. We also want to explore new risk management
approaches based on social media. For risk management and hedging strategies,
measuring the magnitude of the social media reaction to market events, or vice versa, can be
extremely appealing.

Outline
1. Introduction
2. Literature review
3. Methodology
4. Data description
5. Results of analysis
6. Discussion of the results
7. Conclusion
 
Charles University | Charles University Information System