This post provides a high-level overview of how Quordata queries and processes data through its machine learning models. Machine learning is a complex field, so supplemental reading is recommended to fully appreciate the body of work described here.
The process begins when the user enters a publicly traded company into the search bar. The query can be any company, but for the purposes of our Beta it is limited to 10 of the largest S&P 500 companies. When a query is submitted, the first step is to look through our database of cached results. If we have recently processed the query, the results are simply retrieved from the cache. These results come in a proprietary structure that is then formatted into a clean, readable UI, which will be discussed later.
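The cached-result check can be sketched in a few lines. This is a minimal illustration, assuming a simple in-memory cache with a freshness window; the names (`CACHE`, `TTL_SECONDS`, `lookup_cached`) and the one-hour cutoff are illustrative assumptions, not Quordata's actual code.

```python
import time

TTL_SECONDS = 3600  # treat results older than an hour as stale (assumed value)
CACHE = {}  # query -> (timestamp, proprietary result structure)

def cache_put(query, result):
    """Store a freshly processed result under a normalized query key."""
    CACHE[query.lower()] = (time.time(), result)

def lookup_cached(query):
    """Return a recent cached result, or None to trigger the full pipeline."""
    entry = CACHE.get(query.lower())
    if entry is None:
        return None
    ts, result = entry
    if time.time() - ts > TTL_SECONDS:
        return None  # stale entry: fall through to the real query path
    return result
```

A cache miss (a `None` return) is what kicks off the full lookup-and-model pipeline described next.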
If the query is not cached, the real work starts. Quordata continues to the NoSQL database and performs a reverse document lookup to find all entries related to the query. This data comes from various online sources, which currently include: Twitter, Reddit, LinkedIn, the Wall Street Journal, the New York Times, the Washington Post, and Yahoo Finance. The text data is then consolidated into CSV, JSON, and Parquet files for later processing. When a requested query has no data, Quordata prioritizes it for data collection. That process will be explained in further detail in another post.
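The consolidation step amounts to writing the looked-up entries out in several formats. A hedged sketch of the CSV and JSON cases, using only the standard library (Parquet would typically be written with pandas or pyarrow and is omitted here); the field names are illustrative, not Quordata's actual schema:

```python
import csv
import json

def consolidate(entries, csv_path, json_path):
    """Write the same records to CSV and JSON for later processing."""
    fieldnames = ["source", "text", "timestamp"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(entries)
    with open(json_path, "w") as f:
        json.dump(entries, f, indent=2)

# Example entries (invented for illustration)
entries = [
    {"source": "Twitter", "text": "AAPL to the moon", "timestamp": "2023-01-05"},
    {"source": "Reddit", "text": "Thoughts on Apple earnings?", "timestamp": "2023-01-06"},
]
consolidate(entries, "data.csv", "data.json")
```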
Consolidated data is then passed through to the first model, the spam model. Quordata combines text data with metadata to evaluate the likelihood of the data being spam. Metadata can include followers, retweets, and likes for tweets; upvotes, downvotes, replies, and awards for Reddit posts; and comments and page hits for news articles. Quordata uses this metadata alongside the text data to look for patterns that predict when something might be spam. Spam is not the same as irrelevant data; rather, spam is data posted for the express purpose of advertising or promotion, sent by a bot, or otherwise repeating a meaningless message.
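Pairing text with per-source metadata might look like the following. This is an illustrative sketch only: the field names and the simple engagement sums are assumptions, not Quordata's actual feature engineering.

```python
def build_features(item):
    """Flatten a data point into (text, engagement) for spam scoring.

    The per-source engagement formulas below are hypothetical examples
    of combining metadata into a single numeric signal.
    """
    text = item["text"]
    meta = item.get("metadata", {})
    if item["source"] == "twitter":
        engagement = meta.get("followers", 0) + meta.get("retweets", 0) + meta.get("likes", 0)
    elif item["source"] == "reddit":
        engagement = meta.get("upvotes", 0) - meta.get("downvotes", 0) + meta.get("replies", 0)
    else:  # news articles
        engagement = meta.get("comments", 0) + meta.get("page_hits", 0)
    return text, engagement
```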
Unless otherwise stated, all Quordata machine learning code is written in Python. The spam model utilizes the TensorFlow and Hugging Face libraries. It uses a pre-trained BERT model as a base, with transfer learning to adapt the model to our specific use case. We manually labeled thousands of data points following the rules above to feed into the model, which learns from this labeled data to accurately predict whether a data point is spam. The model is a binary classification model; that is, it predicts whether an input aligns with one of two classification tags: not spam (ham) or spam. The model actually outputs two values, the probability of the point being each label, produced by what is known as a softmax calculation over the model's raw scores. If the model is 60% confident that an input is spam, it returns the values 0.6 and (1 - 0.6) = 0.4, meaning it is 40% confident that the input is ham. Quordata then chooses the label with the higher confidence.
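The softmax-then-choose step can be sketched in plain Python. This is a minimal illustration, assuming the model emits two raw scores (logits), one per class; softmax turns them into probabilities that sum to 1, and the larger probability determines the label.

```python
import math

LABELS = ["ham", "spam"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the higher-probability label and both probabilities."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs
```

In a real TensorFlow model the softmax is typically the final layer; this standalone version just makes the arithmetic visible.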
Afterwards, the data is passed into the sentiment model. This model focuses purely on the text data; no metadata is used for labeling. Unlike the spam model, the sentiment model is a multi-class classification model with three possible labels: positive, negative, and neutral. The sentiment model also receives several thousand manually labeled data points for training. The model is based on a pre-trained SiEBERT model from the Hugging Face library, and Quordata again uses transfer learning to fine-tune it. For both the spam and sentiment models, our hyperparameters have been adjusted and continue to be analyzed for further adjustments. We choose the appropriate label by taking the highest of the three confidence values.
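Choosing among the three labels is a straightforward argmax. A tiny sketch, assuming the fine-tuned model returns one probability per label in a fixed order (the label ordering here is an assumption):

```python
SENTIMENT_LABELS = ["positive", "negative", "neutral"]

def pick_sentiment(probs):
    """Return the highest-probability sentiment label and its confidence."""
    i = probs.index(max(probs))
    return SENTIMENT_LABELS[i], probs[i]
```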
The last model utilized is the Biterm Topic Model (BTM). This model generates closely related topics for a given document set; in this case, our document set is all the data collected for the initial query. The document set is passed into the BTM, and the output is a collection of phrases that are related to each other within the scope of the document set. Multiple collections of phrases are gathered. For example, a query such as “Apple” may have one collection containing phrases like “ios, macbook, laptop” and another “apple music, itunes, ipod”. Each of these collections of phrases is used to predict a prevailing, unique topic about the base query. In this case, “Apple” might receive topics like “macbook” and “music”. Quordata returns the top five most relevant topics this way.
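To make the idea concrete, here is an illustrative sketch of turning phrase collections into ranked topics. The real BTM scores topics probabilistically over word co-occurrences; this toy version simply labels each collection with its most frequent word across the document set and ranks by that frequency, which is an assumption for illustration, not Quordata's actual ranking.

```python
from collections import Counter

def top_topics(collections, documents, k=5):
    """Label each phrase collection and return up to k topics by frequency."""
    # Count word occurrences across the whole document set.
    word_counts = Counter(w for doc in documents for w in doc.lower().split())
    labeled = []
    for phrases in collections:
        words = [w for phrase in phrases for w in phrase.split()]
        best = max(words, key=lambda w: word_counts[w])  # most frequent word wins
        labeled.append((best, word_counts[best]))
    labeled.sort(key=lambda t: t[1], reverse=True)
    return [name for name, _ in labeled[:k]]
```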
Now that our data has been labeled by the spam and sentiment models, Quordata is ready to generate insights. Various metrics such as average sentiment score, confidence levels, and source counts can be generated with simple arithmetic on the results, and the closely related topics are retrieved from the BTM. Quordata can also relate model scores to stock prices and trading volume, assuming the query is related to a publicly traded company. Stock data is received from Yahoo Finance, with backup libraries if necessary. Quordata can produce actionable insights based on historical performance in relation to the company’s sentiment. Using recent trending information, Quordata can also make hypotheses about what may be driving current price and sentiment action. These timeframes are based on the frequency of the data collected and the amount of data available for the given query.
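The "simple arithmetic" metrics might look like this. A hedged sketch with illustrative field names and a hypothetical +1/0/-1 sentiment scoring scheme (the original does not specify how labels map to numbers):

```python
from collections import Counter

SCORE = {"positive": 1, "neutral": 0, "negative": -1}  # assumed mapping

def summarize(labeled_points):
    """Compute average sentiment, mean confidence, and per-source counts."""
    n = len(labeled_points)
    avg_sentiment = sum(SCORE[p["label"]] for p in labeled_points) / n
    avg_confidence = sum(p["confidence"] for p in labeled_points) / n
    source_counts = Counter(p["source"] for p in labeled_points)
    return {"avg_sentiment": avg_sentiment,
            "avg_confidence": avg_confidence,
            "source_counts": source_counts}
```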
Finally, all this data is collected and passed through several different tools to generate easy-to-digest dashboards. Chart.js, HTML, CSS, Python’s Plotly library, and jQuery scripts are the main drivers behind the graphical user interface shown on a dashboard. When the GUI is ready, the server sends the results back to the user who initiated the query.