This project was the capstone for a Data Visualization course (ICS 484). The objective was to compare Disney’s stock prices to the sentiment of posts and comments from Reddit’s r/Disney subreddit. I worked with Clark Whitehead, who found the dataset and led the sentiment analysis, and Kiko Whiteley, who was in charge of website design and visualizations for a different part of the dataset. My job was to build a pipeline running from downloading the stock market and Reddit data to producing fully processed CSV files ready to be graphed. This also included working with data-smoothing software to normalize the data for graphing.
At the time, this was the largest dataset I had worked with in a single project, so I learned many valuable skills in data pre-processing, cleaning, and organization. I also got comfortable working with APIs to process large amounts of data, as most steps required a separate API. Finding the right API for each step and adapting it to the project was another incredibly valuable experience. The project also let me practice writing robust, modular code that can be reused across different purposes and datasets.
The notebook I created sets up a detailed pipeline for gathering data from Reddit and the stock market, cleaning it, performing sentiment analysis, and graphing the results. The six major steps of this process are outlined below:
This is where the data is gathered, with the help of two different APIs. PMAW is used to gather data from Reddit, letting the user pull either a subreddit’s posts or the comments on those posts. The yfinance library pulls stock price data for any given ticker from Yahoo Finance.
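A rough sketch of this step, assuming the `pmaw` and `yfinance` packages are installed; the function names below are illustrative, not the notebook’s actual ones:

```python
def fetch_reddit(subreddit, after, before, kind="submissions", limit=10_000):
    """Pull posts or comments from a subreddit via the Pushshift API (PMAW)."""
    from pmaw import PushshiftAPI  # imported lazily; pip install pmaw
    api = PushshiftAPI()
    search = api.search_submissions if kind == "submissions" else api.search_comments
    # after/before are Unix timestamps bounding the query window
    return list(search(subreddit=subreddit, after=after, before=before, limit=limit))

def fetch_stock(ticker, start, end):
    """Download daily price history from Yahoo Finance."""
    import yfinance as yf  # imported lazily; pip install yfinance
    return yf.download(ticker, start=start, end=end)
```

For example, `fetch_reddit("Disney", after=1577836800, before=1609459200)` and `fetch_stock("DIS", "2020-01-01", "2020-12-31")` would both cover calendar year 2020.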
Once the Reddit data is pulled from the internet, it needs to be cleaned. Many posts/comments have been deleted or removed since originally being posted, and some have no content. After these are removed, the creation date is reformatted into a YYYY-MM-DD string.
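A minimal pandas version of this cleaning step; the column names (`body`, `created_utc`) follow Pushshift’s schema, and the notebook’s actual code may differ:

```python
import pandas as pd

def clean_reddit(df, text_col="body"):
    """Drop deleted/removed/empty entries and normalize dates to YYYY-MM-DD."""
    bad = {"[deleted]", "[removed]", ""}
    df = df[~df[text_col].fillna("").str.strip().isin(bad)].copy()
    # Pushshift stores creation time as a Unix timestamp
    df["date"] = pd.to_datetime(df["created_utc"], unit="s").dt.strftime("%Y-%m-%d")
    return df
```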
After the Reddit data is cleaned, we can perform sentiment analysis on it. A pre-trained BERT sentiment analysis model (BERT was developed by Google) can be downloaded and used for this step.
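One way to run this step is with the Hugging Face `transformers` pipeline and a BERT-based sentiment checkpoint; the specific model below is an assumption for illustration, not necessarily the one the project used:

```python
def score_sentiment(texts):
    """Score a list of strings from 1 (very negative) to 5 (very positive)."""
    from transformers import pipeline  # imported lazily; pip install transformers
    clf = pipeline("sentiment-analysis",
                   model="nlptown/bert-base-multilingual-uncased-sentiment")
    # This checkpoint emits labels like "4 stars"; keep just the integer rating.
    return [int(result["label"].split()[0]) for result in clf(texts)]
```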
Once the Reddit data is cleaned and passed through the sentiment analysis, we need to compare it to the stock market data. There are holes in the Reddit data since many posts/comments have been deleted or removed since their original upload, and there is no stock market data from weekends or holidays. Thus, we need to find a set of usable dates that both datasets have entries for. This step also removes all unnecessary fields from the data to prepare it for graphing.
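The date alignment amounts to an inner join on the shared date key; here is a sketch with assumed column names (`sentiment`, `close`):

```python
import pandas as pd

def align_on_dates(sentiment_df, stock_df):
    """Keep only dates present in both datasets and drop unneeded fields."""
    merged = sentiment_df.merge(stock_df, on="date", how="inner")
    return merged[["date", "sentiment", "close"]]
```

An inner join handles both kinds of gaps at once: dates with no surviving Reddit content and weekends/holidays with no trading data simply fall out of the result.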
The sentiment data alone is incredibly volatile and messy, so data smoothing is needed. The ASAP model for smoothing data works well here and is applied to the Reddit sentiment data. The source code for this step is the entirety of the open-source code from the ASAP model.
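ASAP works by searching for a moving-average window size that smooths out noise while preserving the series’ large-scale deviations. The stand-in below is just a fixed-window rolling mean to illustrate the shape of this step; it is not the actual ASAP code, which chooses the window automatically:

```python
import pandas as pd

def smooth(series, window=7):
    """Stand-in for the ASAP smoother: a centered rolling mean over `window` points."""
    # min_periods=1 keeps the endpoints defined instead of NaN
    return series.rolling(window, center=True, min_periods=1).mean()
```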
This step takes the output of step 4 and feeds it into step 5, producing a dataset that is ready to be plotted. It returns a dataframe containing the smoothed data, so it can be exported and plotted in something like Plotly.
The process outlined above was used on both posts and comments from r/Disney. The sentiment from these two datasets is graphed against Disney’s stock prices, as seen below.
You can view the whole code in the GitHub repository for this project.