Are there automotive stock signals in tweets?

Developing a machine learning framework.

With 336 million active users, Twitter is a critical tool for customers to express opinions about products, brands, and companies. With such a rich source of publicly available data, businesses can gather product feedback, improve customer relations and drive proactive, evidence-based business decisions. With on the order of a billion drivers worldwide, the automotive industry has a large stake in maintaining positive business-client relations. In this article we focus on the technology: we use social media data from Twitter and report the results of a machine learning proof-of-concept study relating the sentiment toward five top car brands during a two-week period to their daily stock returns. To this end, we constructed the required data infrastructure, using an AWS server to collect tweets, and performed a statistical correlation analysis on this data.

To carry out this analysis, we collected 847,318 tweets from around the world, together with stock data, for the two-week period of August 1 – August 13 for the Toyota, Tesla, Ford, Mercedes and Porsche brands. The tweets were gathered with Twython, a Python 3 wrapper for the Twitter API, by targeting specific keywords related to each car brand. The Python code that collected the tweets ran on an AWS EC2 instance, and the collected data was stored in a SQL database. As an example of the type of tweets collected and the messiness of the data, a few unfiltered tweets related to Porsche are shown below.
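Returning to the collection step itself: the keyword matching and SQL storage described above might be sketched as follows. This is a minimal illustration, not the project's code. SQLite stands in for the unspecified SQL database, the keyword lists are hypothetical, and the tweet dictionaries are assumed to carry the `id_str`, `created_at` and `text` fields that the Twitter API returns.

```python
import re
import sqlite3

# Hypothetical keyword lists; the article does not give the actual search terms.
BRAND_KEYWORDS = {
    "Toyota": ["toyota"],
    "Tesla": ["tesla"],
    "Ford": ["ford"],
    "Mercedes": ["mercedes", "benz"],
    "Porsche": ["porsche"],
}


def match_brands(text):
    """Return the brands whose keywords appear as whole words in the text."""
    lowered = text.lower()
    return [
        brand
        for brand, words in BRAND_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(w) + r"\b", lowered) for w in words)
    ]


def open_store(path=":memory:"):
    """Create the tweet table (SQLite is an assumption, not the original database)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id TEXT, brand TEXT, created_at TEXT, text TEXT, "
        "PRIMARY KEY (tweet_id, brand))"
    )
    return conn


def store_tweet(conn, tweet):
    """Store one row per matched brand and return the number of brands matched.

    Rows whose (tweet_id, brand) pair already exists are silently skipped.
    """
    rows = [
        (tweet["id_str"], brand, tweet["created_at"], tweet["text"])
        for brand in match_brands(tweet["text"])
    ]
    conn.executemany("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Matching on whole words keeps a tweet containing "afford" from being filed under Ford, and the composite primary key means a tweet delivered twice by the collector is stored only once per brand.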

Once the data was collected, it was loaded into a pandas DataFrame. It required significant processing: roughly 50-60% of the raw tweets were duplicates or near-duplicates, and many were filled with words that carried no useful information. Several data cleaning steps were carried out, including punctuation removal, stop word removal and lemmatization, and topic modelling was used to help identify and remove clusters of tweets not related to car brands.
Next, a sentiment model was constructed using a bag-of-words representation, trained on Amazon review data along with the Sentiment140 dataset. The final model was an ensemble consisting of logistic regression, naive Bayes and the TextBlob classifier. The ensemble output was the model's prediction at a specified target confidence level (a value from 0 to 1), which was our main hyperparameter. These models achieved precision and recall scores above 80% on the test sets. The total results of the two-week sentiment analysis of the five car brands for a specific value of this hyperparameter are shown in the following graphic, with the positive sentiment tweets in green, negative in red, and neutral in grey.
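A rough sketch of the simpler cleaning steps is given below, using only the Python standard library. The stop-word list is a tiny illustrative one, and duplicates are detected by comparing cleaned token sequences, which is an assumption about how "almost identical" tweets might be caught. Lemmatization and topic modelling are left out.

```python
import re
import string

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "my", "rt"}


def clean_tweet(text):
    """Lowercase, strip URLs, mentions and punctuation, then drop stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def drop_duplicates(tweets):
    """Keep the first tweet for each distinct cleaned token sequence."""
    seen, unique = set(), []
    for text in tweets:
        key = tuple(clean_tweet(text))
        if key and key not in seen:  # also discards tweets left empty by cleaning
            seen.add(key)
            unique.append(text)
    return unique
```

Comparing cleaned tokens rather than raw strings means a retweet that differs only by an "RT @user:" prefix or a trailing link collapses onto the original tweet.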

Fig. 1: Semi-pie plots of the total sentiments collected for the five automotive brands studied here (threshold value of 0.2). Green, red, and grey show the percentages of positive, negative, and neutral tweets, respectively.
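To make the role of the threshold concrete, here is one plausible reading of it, offered as an assumption rather than as the rule used in the study. Each classifier is taken to supply a probability that a tweet is positive; the probabilities are averaged and rescaled to a polarity between -1 and 1; the tweet is labelled neutral unless the polarity exceeds the threshold in magnitude. The function name `ensemble_label` is hypothetical.

```python
def ensemble_label(probabilities, threshold=0.2):
    """Average P(positive) over the classifiers and apply a confidence threshold."""
    mean_p = sum(probabilities) / len(probabilities)
    polarity = 2.0 * mean_p - 1.0  # rescale from [0, 1] to [-1, 1]
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"
```

Under this reading, raising the threshold moves borderline tweets into the neutral class, which is why the shares in a plot like Fig. 1 depend on the chosen value.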

For each day, all tweets related to a specific automotive company were fed into the trained machine learning model at several hyperparameter values, producing daily time series of the percentages of positive and negative sentiment tweets. A normalized, Bayesian-weighted average of these time series over the different hyperparameter values was then calculated, generating a single time series each for the combined normalized daily positive and negative sentiment signals. In the figure below, we plot the sentiment signals for Tesla during the observation period.

Fig. 2: The generated time series of the normalized positive (top) and negative (bottom) tweet sentiment signals for Tesla over the time period investigated. The solid lines indicate the true positive/negative signals, while the dashed lines are the estimated false positive/negative signals, respectively.
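The construction of such a daily signal can be sketched in two steps: compute the daily share of positive and negative tweets, then average the series obtained at different thresholds. The article does not spell out the Bayesian weighting, so the `weights` argument below is a hypothetical placeholder for it, and both function names are illustrative.

```python
from collections import Counter, defaultdict


def daily_sentiment_shares(labelled_tweets):
    """Map each day to (fraction positive, fraction negative).

    `labelled_tweets` is an iterable of (day, label) pairs.
    """
    counts = defaultdict(Counter)
    for day, label in labelled_tweets:
        counts[day][label] += 1
    shares = {}
    for day in sorted(counts):
        total = sum(counts[day].values())
        shares[day] = (
            counts[day]["positive"] / total,
            counts[day]["negative"] / total,
        )
    return shares


def combine_thresholds(series_by_threshold, weights):
    """Weighted average of daily series computed at different thresholds.

    `weights` maps each threshold to its weight; how the study derived its
    weights is not stated, so these values are assumed inputs.
    """
    total_weight = sum(weights[t] for t in series_by_threshold)
    days = next(iter(series_by_threshold.values())).keys()
    return {
        day: sum(weights[t] * series_by_threshold[t][day] for t in series_by_threshold)
        / total_weight
        for day in days
    }
```

Dividing by the total weight keeps the combined signal on the same 0 to 1 scale as the individual daily shares.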

The best estimate of the sentiment signal (green for positive, red for negative) is drawn in the darker color, while the lighter color indicates the expected error based on the metrics of the model during training. The large spikes correspond to days when there were many tweets with either positive or negative sentiment, and the troughs to days when there were few.

With the positive and negative sentiment time series in hand, we now look at how to relate these signals to the stock return data. According to the Capital Asset Pricing Model (CAPM), the return of a stock, \(\boldsymbol r\), is linear in the return of the market \(\boldsymbol{r_M}\), with slope \(\boldsymbol{\beta_M}\) and a small offset \(\boldsymbol\alpha\):

$$\boldsymbol{r = \alpha + \beta_M r_M + \epsilon,}$$

where \(\boldsymbol\epsilon\) is random noise. To test the effect of sentiment signals on this model, we added two factors to the base CAPM: one for the normalized positive sentiment time series \(\boldsymbol P\), with coefficient \(\boldsymbol{\beta_P}\), and another for the normalized negative signal \(\boldsymbol N\), with coefficient \(\boldsymbol{\beta_N}\). If the positive and negative signals that we computed had no relation to the returns of the stock, we would expect these two coefficients to be zero. Furthermore, the sign of each fitted coefficient tells us whether the sentiment signal tends to increase or decrease the return of a stock:

$$\boldsymbol{r = \alpha + \beta_M r_M + \beta_P P + \beta_N N + \epsilon.}$$
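As an illustration of fitting this extended model, the sketch below obtains point estimates by ordinary least squares with NumPy. The data is synthetic, with coefficients and noise levels chosen arbitrarily, and it uses many more days than the two-week study so that the estimates are stable. The function name `fit_sentiment_capm` is hypothetical.

```python
import numpy as np


def fit_sentiment_capm(r, r_m, p, n):
    """Least-squares estimates of (alpha, beta_M, beta_P, beta_N)."""
    # Design matrix: intercept, market return, positive and negative signals.
    X = np.column_stack([np.ones_like(r_m), r_m, p, n])
    coeffs, *_ = np.linalg.lstsq(X, r, rcond=None)
    return coeffs


# Synthetic data with known coefficients, purely for illustration.
rng = np.random.default_rng(0)
days = 200
r_m = rng.normal(0.0, 0.01, days)   # market returns
p = rng.normal(0.0, 1.0, days)      # normalized positive sentiment signal
n = rng.normal(0.0, 1.0, days)      # normalized negative sentiment signal
true = np.array([0.0005, 1.2, 0.003, -0.002])
noise = rng.normal(0.0, 0.001, days)
r = true[0] + true[1] * r_m + true[2] * p + true[3] * n + noise

alpha, beta_m, beta_p, beta_n = fit_sentiment_capm(r, r_m, p, n)
print(f"alpha={alpha:.4f} beta_M={beta_m:.2f} beta_P={beta_p:.4f} beta_N={beta_n:.4f}")
```

With real data, `r`, `r_m`, `p` and `n` would be aligned daily arrays for one stock; least squares yields only point values, with no measure of how uncertain they are.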

The problem now becomes a linear regression problem. However, besides a simple point estimate of the numerical values of \(\boldsymbol\alpha\), \(\boldsymbol{\beta_M}\), \(\boldsymbol{\beta_N}\), and \(\boldsymbol{\beta_P}\), we would like to know the uncertainty of those values. For this purpose, we carried out a Bayesian fit of our linear model, which not only generates point estimates of the parameters but also provides the probability distribution of their possible values, which we use to rigorously quantify the uncertainty of the fitted parameters. Using the sentiment and stock data from the two weeks of the proof-of-principle study, the following car brands indicated a sensitivity to the sentiment data at the 68% confidence level:

Fig. 3: The table summarizes the direction of the correlations indicated by the 68% confidence region around the mean from the analysis carried out in this work. A green arrow indicates a positive correlation, a red arrow a negative one, and a grey band no association.

The green upward arrows denote a positive factor, indicating that the stock return tended to increase when the given sentiment was stronger, while the red downward arrows express a negative association. The grey line shows that no effect was detected at the 68% confidence region around the mean value of the distributions. The coefficient related to the market influence was consistently the largest factor, and the known CAPM beta values from stock databases were consistent with the uncertainty range of the \(\boldsymbol{\beta_M}\) parameter found in this project; however, more data would be required to verify that. The \(\boldsymbol\alpha\) parameters for each stock were found to be consistently small and consistent with zero, as expected from the CAPM.
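The step from a posterior distribution to an arrow or a grey entry can be sketched with a closed-form Bayesian linear regression. This stands in for whatever fitting procedure the study actually used: it assumes a zero-mean Gaussian prior on the coefficients and a known noise level, and it approximates the 68% region by the posterior mean plus or minus one standard deviation. All names and numbers are illustrative.

```python
import numpy as np


def bayes_linear_posterior(X, y, noise_sd, prior_sd=10.0):
    """Posterior mean and standard deviation of the regression coefficients.

    Assumes a N(0, prior_sd^2) prior on each coefficient and Gaussian noise
    with known standard deviation noise_sd.
    """
    precision = X.T @ X / noise_sd**2 + np.eye(X.shape[1]) / prior_sd**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_sd**2
    return mean, np.sqrt(np.diag(cov))


def direction(mean, sd):
    """Classify a coefficient by whether mean +/- 1 sd (about 68%) excludes zero."""
    if mean - sd > 0:
        return "up"
    if mean + sd < 0:
        return "down"
    return "none"


# Synthetic nine-day example; every number here is made up for illustration.
rng = np.random.default_rng(1)
days = 9
X = np.column_stack([
    np.ones(days),                 # intercept (alpha)
    rng.normal(0.0, 0.01, days),   # market return
    rng.normal(0.0, 1.0, days),    # positive sentiment signal
    rng.normal(0.0, 1.0, days),    # negative sentiment signal
])
y = X @ np.array([0.0, 1.0, 0.004, 0.0]) + rng.normal(0.0, 0.002, days)

mean, sd = bayes_linear_posterior(X, y, noise_sd=0.002)
for name, m, s in zip(["alpha", "beta_M", "beta_P", "beta_N"], mean, sd):
    print(name, direction(m, s))
```

With so few days the posterior is wide, so many coefficients land in the "none" class; this is the sense in which a longer observation period would improve statistical power.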

Overall, in this project we developed the tools required to build a data pipeline that collects tweets about specific companies and carries out sentiment analysis of those tweets. This is followed by a statistical analysis that determines the sensitivity of the stock to the sentiment signals, with uncertainties that quantify our degree of confidence. This analysis could be useful for businesses, allowing them to track and account for sentiment factors that may be influencing their stock returns.


Stay tuned for upcoming articles, where we will use a two-month data set to track how the sentiment-stock correlations vary with time, which may also improve the statistical power of our analysis.
