## Are there automotive stock signals in tweets?

## Developing a machine learning framework.

With 336 million active users, **Twitter** is a critical tool for customers to express opinions about products, brands, and companies. Businesses can leverage this rich source of publicly available data to gather product feedback, improve customer relations, and drive proactive, evidence-based business decisions. With a billion drivers worldwide, the **automotive industry** has a large stake in maintaining positive business-client relations. In this article we focus on the technology: we use social media data from Twitter and report the results of a **machine learning** proof-of-concept study that relates the **sentiment** toward five top car brands during a two-week period to their **daily stock returns**. To do so, we built the required **data infrastructure**, using an **AWS** server to collect tweets, and performed a **statistical correlation analysis** on this data.

To carry out this analysis, we collected **847,318 tweets** from around the **world**, along with **stock data**, for the two-week period of **August 1 – August 13** for the **Toyota**, **Tesla**, **Ford**, **Mercedes** and **Porsche** brands. The tweets were collected with **Twython**, a **python3** wrapper for the Twitter **API**, by targeting specific keywords related to each car brand. The Python code that collected the tweets ran on an **AWS EC2** instance, and the collected data was stored in a **SQL database**. The unfiltered tweets, such as those related to **Porsche**, give a sense of the type of tweets collected and the messiness of the data.
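As a rough illustration of the storage step, the sketch below writes collected tweets into a SQL table. It assumes SQLite through Python's built-in `sqlite3` module and a hypothetical `tweets` schema; the article only states that a SQL database was used, so the engine, the table name, the columns, and the `store_tweets` helper are all illustrative. The rows in the usage example are placeholders standing in for what the Twython-based collector would return.

```python
import sqlite3

# Hypothetical schema: the article does not describe the actual table layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   INTEGER PRIMARY KEY,
    brand      TEXT NOT NULL,
    created_at TEXT NOT NULL,
    text       TEXT NOT NULL
)
"""


def store_tweets(conn, rows):
    """Insert (tweet_id, brand, created_at, text) tuples and return how many were new.

    Rows whose tweet_id is already stored are skipped, so re-running the
    collector over an overlapping window does not create exact duplicates.
    """
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweets (tweet_id, brand, created_at, text) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    after = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    return after - before


if __name__ == "__main__":
    # Placeholder rows, not real collected tweets.
    sample = [
        (1, "porsche", "1970-01-01T00:00:00", "placeholder tweet text A"),
        (2, "tesla", "1970-01-01T00:00:00", "placeholder tweet text B"),
        (1, "porsche", "1970-01-01T00:00:00", "placeholder tweet text A"),
    ]
    connection = sqlite3.connect(":memory:")
    print(store_tweets(connection, sample), "new rows stored")
```

Keying the table on the tweet id only guards against storing the same tweet twice; near-identical retweets have different ids and still need the text-level cleaning described next.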

Once collected, the data was loaded into a **Pandas** dataframe. It required significant processing, as between **50% and 60%** of the **raw tweets** were duplicates or near-duplicates, and many were filled with words that did **not carry** any **useful information**. Several **data cleaning** steps were carried out, such as punctuation removal, stop word removal, and lemmatization, as well as **topic modelling** to help identify and remove clusters of tweets not related to car brands.
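A minimal sketch of the simpler cleaning steps (punctuation removal, stop word removal, and duplicate removal) is shown below, using only the standard library. The stop-word list is a tiny illustrative one, and the stripping of URLs and @-mentions is an assumption rather than a step named in the article; lemmatization and topic modelling are left out because they need an external NLP library.

```python
import re
import string

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "in", "on", "for", "this", "my", "rt"}

# Assumption: URLs and @-mentions carry no sentiment and are dropped.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase a tweet, drop URLs/mentions, strip punctuation and stop words."""
    text = URL_OR_MENTION.sub(" ", text.lower())
    text = text.translate(PUNCTUATION)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def clean_tweets(raw_tweets):
    """Normalize tweets and keep only the first of each group of duplicates.

    Two tweets count as duplicates when their normalized text is identical,
    which also catches retweets that differ only in case, links, or mentions.
    """
    seen = set()
    cleaned = []
    for tweet in raw_tweets:
        norm = normalize(tweet)
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append(norm)
    return cleaned
```

Comparing normalized text rather than raw text is a deliberate choice here: it treats a retweet with a different link or mention as the same tweet, which is one simple way to handle the "almost identical" cases.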

Next, a **sentiment model** was constructed using a **bag-of-words model** trained on **Amazon review data** [1] and the **Sentiment140** dataset [2]. The **final model** was an **ensemble model** consisting of **logistic regression**, **naive Bayes** and the **TextBlob** classifier. The **output** of the **ensemble model** was its **prediction** at a **specific target confidence level** (from 0 to 1), which was our main **hyperparameter**. These **models** achieved **precision** and **recall scores** above **80%** on the test sets. The overall **results** of the **two-week sentiment analysis** of the five car brands for one value of this hyperparameter are shown in the **following graphic**, with positive-sentiment tweets in green, negative in red, and neutral in grey.

**Fig. 1:** *Semi-pie plots of the total sentiments collected for the five automotive brands studied here (threshold value of 0.2). Green represents the positive, red the negative, and grey the neutral tweet percentages.*
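To make the role of the threshold hyperparameter concrete, here is one way an ensemble decision could be thresholded. This is a sketch under stated assumptions, not the article's actual rule: it assumes each model contributes a polarity score in [-1, 1] (TextBlob's default analyzer reports polarity on that scale, and a probabilistic classifier's P(positive) can be mapped onto it), that the scores are averaged with equal weight, and that the mean must exceed the threshold in magnitude to count as positive or negative. The function names are hypothetical.

```python
from statistics import mean


def proba_to_polarity(p_positive):
    """Map a classifier's P(positive) in [0, 1] onto a polarity score in [-1, 1]."""
    return 2.0 * p_positive - 1.0


def ensemble_label(polarities, threshold):
    """Average per-model polarity scores and label the tweet.

    Assumed rule (not taken from the article): the mean score must exceed
    `threshold` in magnitude to be called positive or negative; anything
    in between is labelled neutral.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    score = mean(polarities)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical outputs: two probabilistic models and one polarity score.
    scores = [proba_to_polarity(0.9), proba_to_polarity(0.7), 0.5]
    for threshold in (0.2, 0.6):
        print(threshold, ensemble_label(scores, threshold))
```

Under this rule, raising the threshold moves borderline tweets into the neutral class, which is why the share of grey in a plot like Fig. 1 depends on the chosen value.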

For each day, all **tweets** related to a specific **automotive company** were fed into the **trained machine learning model** with different **hyperparameter** values, producing **daily time series** of the percentage of **positive** and **negative** **sentiment tweets**. A **normalized, Bayesian-weighted** **average** of these sentiment **time series** with **different hyperparameter** values was then **calculated**, generating a **time series** representing the combined **normalized daily positive** and **negative** **sentiment** signals. In the **figure below**, we plot the **sentiment signals** for **Tesla** during the observation period.

**Fig. 2:** *The generated time series of the normalized sentiment signals for positive (top) and negative (bottom) tweets for Tesla over the time period investigated. The solid lines indicate the true positive/negative signals, while the dashed lines are the estimated false positive/negative signals.*
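The construction of the daily sentiment series can be sketched as follows. The first function turns labelled tweets into per-day positive and negative fractions; the second combines the series obtained at different threshold values into one. The article does not spell out how its Bayesian weights or its normalization were computed, so the sketch simply takes caller-supplied weights and normalizes them to sum to one. The function names and day keys are hypothetical.

```python
from collections import Counter, defaultdict


def daily_fractions(labelled_tweets):
    """Map each day to (fraction positive, fraction negative) of its tweets.

    `labelled_tweets` is an iterable of (day, label) pairs, with label one of
    "positive", "negative" or "neutral".
    """
    counts = defaultdict(Counter)
    for day, label in labelled_tweets:
        counts[day][label] += 1
    series = {}
    for day in sorted(counts):
        total = sum(counts[day].values())
        series[day] = (counts[day]["positive"] / total,
                       counts[day]["negative"] / total)
    return series


def weighted_average(series_by_threshold, weights):
    """Combine per-threshold daily series into a single weighted series.

    Assumption: `weights` maps each threshold to a non-negative weight chosen
    by the caller; they are normalized here so that they sum to one. Only days
    present in every input series are kept.
    """
    total_weight = sum(weights[t] for t in series_by_threshold)
    common_days = set.intersection(*(set(s) for s in series_by_threshold.values()))
    combined = {}
    for day in sorted(common_days):
        pos = sum(weights[t] * s[day][0] for t, s in series_by_threshold.items())
        neg = sum(weights[t] * s[day][1] for t, s in series_by_threshold.items())
        combined[day] = (pos / total_weight, neg / total_weight)
    return combined
```

A natural source for the weights would be how well the model performed at each threshold during training, but that choice is left open here.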

The **best estimate** of the **sentiment signal** (green for positive, red for negative) is shown in a **dark** color, while the **lighter** **color** indicates the **expected error** based on the metrics of the model during training. The **large spikes** correspond to **days** with **many tweets** of either **positive** or **negative** **sentiment**, while the **troughs** correspond to days with few such tweets.

With the **positive** and **negative** **sentiment time series** in hand, we now look at how to **relate** these **signals** to the **stock return data**. According to the **Capital Asset Pricing Model (CAPM)**, the **return** of a stock, \(\boldsymbol r\), is **proportional** to the **return** of the **market**, \(\boldsymbol{r_M}\), with **proportionality constant** \(\boldsymbol{\beta_M}\), plus a **small offset** \(\boldsymbol\alpha\):

$$\boldsymbol{r = \alpha + \beta_M r_M + \epsilon,}$$

where \(\boldsymbol\epsilon\) is **random noise**. To test the **effect** of **sentiment signals** on this **model**, we added two **factors** to the base **CAPM**: one for the **normalized positive sentiment time series** \(\boldsymbol P\), with **coefficient** \(\boldsymbol{\beta_P}\), and another for the **normalized negative sentiment time series** \(\boldsymbol N\), with **coefficient** \(\boldsymbol{\beta_N}\). If the **positive** and **negative signals** that we computed have **no relation** to the **returns of the stock**, we would **expect** these **two coefficients** to be consistent with **zero**. Furthermore, the **sign** of each **fitted coefficient** tells us whether the **sentiment signal tends** to **increase** or **decrease** the **return of the stock**:

$$\boldsymbol{r = \alpha + \beta_M r_M + \beta_P P + \beta_N N + \epsilon.}$$

The problem now becomes a **linear regression problem**. However, besides a **simple point estimate** of the **numerical values** of \(\boldsymbol\alpha\), \(\boldsymbol{\beta_M}\), \(\boldsymbol{\beta_P}\), and \(\boldsymbol{\beta_N}\), we would like to **know** the **uncertainty** of these **values**. For this purpose, we carried out a **Bayesian fit** of our **linear model**, which not only **generates** the **point estimates** of the **parameters**, but also **provides** the **probability distribution** of their **possible values**, which is used to rigorously **quantify** the **uncertainty** of the **fitted parameters**. Using the **sentiment** and **stock data** from the **two weeks** of the **proof-of-principle study**, the following **car brands** indicated a **sensitivity** to the **sentiment data** at the **68% confidence level**:

**Fig. 3:** *The table summarizes the direction of the correlations indicated by the 68% confidence region around the mean from the analysis carried out in this work. A green arrow indicates a positive correlation, a red arrow a negative one, and a grey band no association.*

The **green upward arrows** denote a **positive** factor, indicating that the **stock price tended** to **increase** when the given **sentiment** was **stronger**, while the **red downward arrows** denote a **negative association**. The **grey line** indicates that **no effect** could be established: zero lies within the **68% confidence region** around the **mean value** of the **distribution**. The **coefficient** related to the **market influence** was consistently the **largest factor**, and the known **CAPM beta values** from **stock databases** were **consistent** with the **uncertainty range** of the \(\boldsymbol{\beta_M}\) **parameter** found in this **project**. However, **more data** would be required to **verify** this. The \(\boldsymbol\alpha\) **parameters** for **each stock** were consistently **small** and **consistent with zero**, as **expected** from the **CAPM**.
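The fit and the arrow classification can be illustrated with a small closed-form example. This is a sketch under simplifying assumptions, not the analysis used in this work: it assumes a Gaussian prior on the coefficients and a known noise standard deviation, so the posterior is Gaussian and the band of the mean plus or minus one standard deviation covers about 68%. The data is synthetic, and `bayesian_linear_fit`, `association`, `prior_std`, and all numeric values are invented for illustration.

```python
import numpy as np


def bayesian_linear_fit(X, y, noise_std, prior_std=10.0):
    """Posterior mean and covariance of beta for y = X @ beta + eps.

    Assumes eps ~ N(0, noise_std^2) with known noise_std and an independent
    zero-mean Gaussian prior N(0, prior_std^2) on every coefficient.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    precision = X.T @ X / noise_std**2 + np.eye(X.shape[1]) / prior_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_std**2
    return mean, cov


def association(mean, std):
    """Label a coefficient by whether its mean +/- 1 std band excludes zero."""
    if mean - std > 0:
        return "positive"
    if mean + std < 0:
        return "negative"
    return "none"


# Synthetic stand-in data: every number below is invented for illustration.
rng = np.random.default_rng(0)
n_days = 200
r_M = rng.normal(0.0, 0.01, n_days)  # market return
P = rng.normal(0.0, 1.0, n_days)     # normalized positive sentiment signal
N = rng.normal(0.0, 1.0, n_days)     # normalized negative sentiment signal
true_beta = np.array([0.0, 1.2, 0.004, -0.003])  # alpha, beta_M, beta_P, beta_N
noise_std = 0.005

# Design matrix with a column of ones for the offset alpha.
X = np.column_stack([np.ones(n_days), r_M, P, N])
r = X @ true_beta + rng.normal(0.0, noise_std, n_days)

post_mean, post_cov = bayesian_linear_fit(X, r, noise_std)
post_std = np.sqrt(np.diag(post_cov))
for name, m, s in zip(["alpha", "beta_M", "beta_P", "beta_N"], post_mean, post_std):
    print(f"{name}: {m:+.4f} +/- {s:.4f} -> {association(m, s)}")
```

With only two weeks of daily returns, the posterior bands are much wider than in this 200-day synthetic example, which is why the 68% level rather than a stricter one is used to flag an association.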

Overall, in this **project** we **developed** the **tools** required to **build** a **data pipeline** that **collects tweets** about **specific companies** and carries out a **sentiment analysis** of those **tweets**. This is **followed by** a **statistical analysis** that determines the **sensitivity** of the **stock** to the **sentiment signals**, with **uncertainties** that **quantify** our **degree of confidence**. This **analysis** is **potentially** **useful** for **businesses**, allowing them to **track** and **account** for **sentiment factors** that may be **influencing** their **stock returns**.

Stay tuned for upcoming articles, where we will use a two-month data set to track how the sentiment-stock correlations vary with time, which may improve the statistical power of our analysis.

**[1]** Amazon Dataset reviews: http://jmcauley.ucsd.edu/data/amazon/

**[2]** Sentiment140 Labelled Twitter Dataset for model: http://help.sentiment140.com/for-students

**[3]** Yahoo Finance: https://finance.yahoo.com/

**[4]** Quantopian Sentiment Monitoring: https://www.youtube.com/watch?v=tYiKM1rIWx4&t=1928s

**[5]** Sentiment Analysis code: http://www.oscarjavierhernandez.com/other/2019/03/31/twitter_automotive_sentiment_analysis.html#about