A Review of the Stock-investing ChatGPT Study

Many articles have been published lately about how successful ChatGPT has been at picking stocks. Some even say it can “pick stocks better than your fund manager”. I read the study, as per usual, and discovered that a lot of the articles exaggerated the study’s outcomes and that the results are not entirely realistic. I also suggest some tweaks and describe my own tests.

Before I elaborate, I want to make one thing absolutely clear: ChatGPT is a Large Language Model (LLM), not an investing artificial intelligence. An LLM can predict the next word in a sequence. They are used in a variety of natural language processing tasks, such as machine translation, speech recognition, and text summarization. By the way, these previous two sentences were created by ChatGPT. It’s worth considering that because of its ability to understand and predict text, it also has a strong capability to pick up on the sentiment and tone of text.

The study by Alejandro Lopez-Lira and Yuehua Tang used two different methods:

A proprietary software analytics tool examining relevancy scores
ChatGPT

The proprietary software uses data from the Center for Research in Security Prices (CRSP) to examine daily returns on stocks based on news headlines. For those unaware (NEWS FLASH), a lot of stock movement is based on news headlines, not so much all of those fancy trading indicators and technicals. For the study, each headline was given a relevance score from 0 to 100 reflecting its impact on the price of the stock. They examined 67,586 headlines across 4,138 companies and their impact within the next 24 hours. Headlines that were too similar to one another, with a similarity score above 0.6 based on the Restricted Damerau-Levenshtein distance, were ignored.
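To make the duplicate filter concrete, here is a minimal Python sketch of a Restricted Damerau-Levenshtein (optimal string alignment) distance and a near-duplicate filter built on it. The normalization (1 minus distance divided by the longer length), the lowercasing, and the function names are my assumptions for illustration; the study may compute its similarity score differently.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def drop_near_duplicates(headlines, threshold=0.6):
    """Keep a headline only if it is not too similar to one already kept."""
    kept = []
    for headline in headlines:
        if all(similarity(headline.lower(), k.lower()) <= threshold for k in kept):
            kept.append(headline)
    return kept
```

With made-up headlines, "Exxon raises dividend again" would be dropped as a near-duplicate of "Exxon raises dividend", while an unrelated headline would survive.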

For comparison, they utilized ChatGPT. They would enter the following prompt:

Pretend you are a financial expert. You are a financial expert with stock recommendation experience. Answer “YES” if good news, “NO” if bad news, or “UNKNOWN” if uncertain in the first line. Then elaborate with one short and concise sentence on the next line. Is this headline good or bad for the stock price of *company name* in the *term* term?

Headline: *headline*

One example is as follows:

Headline: Rimini Street Fined $630,000 in Case Against Oracle

ChatGPT Response:

YES
The fine against Rimini Street could potentially boost investor confidence in Oracle’s ability to protect its intellectual property and increase demand for its products and services.
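As a rough illustration of how a response like this can be turned into data, here is a small Python sketch that fills in the prompt template and reads the first line of the reply. The function names and the numeric mapping (YES = 1, NO = -1, UNKNOWN = 0) are assumptions for this example, and no API call is made here; the reply is passed in as plain text.

```python
PROMPT_TEMPLATE = (
    "Pretend you are a financial expert. You are a financial expert with stock "
    'recommendation experience. Answer "YES" if good news, "NO" if bad news, or '
    '"UNKNOWN" if uncertain in the first line. Then elaborate with one short and '
    "concise sentence on the next line. Is this headline good or bad for the "
    "stock price of {company} in the {term} term?\n\nHeadline: {headline}"
)

# Assumed numeric mapping, used only for illustration.
SCORES = {"YES": 1, "NO": -1, "UNKNOWN": 0}


def build_prompt(company: str, term: str, headline: str) -> str:
    """Fill the placeholders in the prompt quoted above."""
    return PROMPT_TEMPLATE.format(company=company, term=term, headline=headline)


def parse_response(text: str):
    """Return (answer, score, explanation) from a two-line reply."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return "UNKNOWN", 0, ""
    answer = lines[0].upper().strip(".\"'")
    if answer not in SCORES:
        answer = "UNKNOWN"  # anything unexpected is treated as uncertain
    return answer, SCORES[answer], " ".join(lines[1:])


prompt = build_prompt("Oracle", "short", "Rimini Street Fined $630,000 in Case Against Oracle")
answer, score, why = parse_response("YES\nThe fine could boost investor confidence in Oracle.")
```

Treating any unexpected first line as UNKNOWN is a conservative choice: it keeps a malformed reply from being counted as a buy or sell signal.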

Results:

ChatGPT predicted the stock price changes more accurately than their proprietary software. Below is their table of descriptive statistics, with the section you should focus on highlighted:

Then, we can see the cumulative returns of investing:

I highlighted some important text above: “Without Transaction Costs”. More important still is the investment of “$1”. Any stock trader could tell you that any stock under $1 is considered a penny stock, and probably wouldn’t yield too many headlines compared to a company like Exxon. In addition, those are typically OTC and there’s a small fee for those. In any case, this table reports selected accuracy, precision, recall, specificity, and F1 score metrics:

I think flipping a coin also works.

My Own Tests:

In C#, I built software that would read an RSS feed of Ethereum (cryptocurrency) news headlines for the hell of it and pull the price at the time of each headline from the Coinbase API. Then, I used ChatGPT’s API to determine the headline’s sentiment using the same prompt that the authors of the study used. Afterward, I would compare the price at different intervals throughout the day, such as 30 minutes, 1 hour, 2 hours, etc. It was all automated and ran on my VPS so I wouldn’t have to check in on it. Here are the results:
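The interval comparison can be sketched in a few lines of Python. This is an illustration only, not the C# code that ran on the VPS: the function names, the choice of "first sample at or after the target time", and the prices in the usage example are all made up.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Horizons named in the text; more could be added the same way.
HORIZONS = {
    "30m": timedelta(minutes=30),
    "1h": timedelta(hours=1),
    "2h": timedelta(hours=2),
}


def price_at(samples, when):
    """First sampled price at or after `when`; None if the samples end earlier.

    `samples` is a list of (timestamp, price) pairs sorted by timestamp.
    """
    times = [t for t, _ in samples]
    idx = bisect_left(times, when)
    return samples[idx][1] if idx < len(samples) else None


def changes_after(samples, headline_time):
    """Price change from the headline time to each horizon."""
    base = price_at(samples, headline_time)
    result = {}
    for name, delta in HORIZONS.items():
        later = price_at(samples, headline_time + delta)
        if base is None or later is None:
            result[name] = None  # not enough data for this horizon yet
        else:
            result[name] = round(later - base, 2)
    return result


# Made-up samples, purely to show the shape of the data.
t0 = datetime(2023, 5, 10, 12, 30)
samples = [
    (t0, 100.0),
    (t0 + timedelta(minutes=30), 101.5),
    (t0 + timedelta(minutes=60), 99.0),
    (t0 + timedelta(minutes=120), 102.25),
]
print(changes_after(samples, t0))
```

Returning None for a horizon that has no data yet keeps unfinished rows from being mistaken for a zero price change.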

I also want to mention that their GPT-4 method outperformed all other models. In my software, I used the text-davinci-003 model from the GPT-3.5 family. The GPT-4 API is too costly just for a study and my budget is near zero.

I won’t elaborate too much on the columns except “LEV”, which is a Levenshtein distance used to filter out similar headlines, like the study did. All negative, right? No, here are more:

However, I immediately discovered this was a bad idea. The buying price for Ethereum isn’t the same as the selling price. For example, Ethereum bought at $1,792.40 at 12:33 on 5/10/2023 would’ve been sold at $1,800.10 because of fees (and I’m not exact here, but you get the point).

So, I switched to, you guessed it, Exxon and utilized Yahoo Finance’s API:

I was immediately halted by a single observation: the changes were insignificant. Even though there were positive determinations by ChatGPT, the gains were pennies. In fact, they were so insubstantial that the “unknown” responses from ChatGPT had a better positive outcome. Of course, I’ll admit that the limited data set is something to raise an eyebrow over. “Not enough data, Ray!” The point I want to get across is that in order to make a profit of $19.40 from the positive trade highlighted above, I would’ve had to spend $10,463. Or, what about $194? That would be $104,630. Not realistic without significant financial backing, like a company’s.
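The scaling behind those numbers is simple proportion, sketched below in Python. It assumes the same percentage move applies at any size and ignores fees and slippage; the function name is made up for this example.

```python
def required_capital(target_profit, observed_profit, observed_capital):
    """Capital needed to hit `target_profit` at the observed rate of return."""
    rate = observed_profit / observed_capital  # 19.40 / 10463 is about 0.185%
    return target_profit / rate


rate_percent = round(19.40 / 10463 * 100, 3)
print(rate_percent)                                   # return on the trade, in percent
print(round(required_capital(194, 19.40, 10463), 2))  # capital for a $194 profit
```

A return of under a fifth of a percent per trade is why ten times the profit needs ten times the capital.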

What next?

I think Alejandro and Yuehua were on the right path for the prompt with ChatGPT. Even a predictability score of 0.51 is impressive. Obviously, I want more than one day’s observation of the data. Their study examined three months and I’ll try to do the same. The other thing I’m changing is the prompt; I want to be more specific and ask whether the outcome will have a significant impact or minimal impact. Here’s my new prompt:

You are a financial analyst. Based on the following headline, predict the impact the headline has on the stock price for Exxon. If the impact is significantly positive, then say "SIG" and only marginally positive then say "MIN". If the impact on the stock price is negative, then say "NEG". If the impact on the stock price is neither positive nor negative then say "UNKNOWN". Explain your answer. The headline reads:
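Because this prompt asks for an explanation without saying where the label should go, the label can show up anywhere in the reply. One way to pull it out is sketched below in Python; the regular expression, the fallback to UNKNOWN, and the sample replies are assumptions for illustration, not the parsing the app actually does.

```python
import re

# Case-sensitive on purpose, so words like "minimal" or "significant"
# in the explanation are not mistaken for a label.
_LABEL_RE = re.compile(r"\b(SIG|MIN|NEG|UNKNOWN)\b")


def extract_label(response: str) -> str:
    """Return the first label found in the reply, or UNKNOWN if none is present."""
    match = _LABEL_RE.search(response)
    return match.group(1) if match else "UNKNOWN"


# Hypothetical replies, not real model output.
print(extract_label("SIG. Record profits usually lift the share price."))
print(extract_label("The effect looks minimal, so MIN."))
print(extract_label("No clear effect on the stock."))
```

Taking the first match is a simplification; a reply that mentions two labels would need a stricter prompt or a closer look.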

Below are my API settings:

Below is an example output of headlines from the RSS feed and the output from ChatGPT:

The price when these headlines came out was about $105.16 – $105.82. As of writing, the after-market price is $106.26.

Come back in three months!

One Week Update

I couldn’t help myself but check on my little app. It’s not pretty, but this is what it looks like:

It’s only designed to run automatically on a VPS, so I really don’t need a nice interface. As long as it’s useable by me, that’s all that matters. And yes, it’s still called “Bitcoin – Stock price monitor”. Leave me alone. In any case, here are the results after one week (again, excuse the ugliness):

A minimal impact shows a 33.33% accuracy of a stock price increase while a significant impact has a 63.64% accuracy rate. Interestingly, an unknown impact had an increase 100% of the time. Not sure what to make of that. I’ll update in another month or so.
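For reference, percentages like these come from counting, per impact label, how often the price went up afterward. Here is a minimal Python sketch of that calculation; the function name and the rows are made up and are not my actual data.

```python
from collections import defaultdict


def hit_rate_by_label(rows):
    """Percent of rows per label where the price difference was positive.

    `rows` is an iterable of (label, price_difference) pairs.
    """
    ups = defaultdict(int)
    totals = defaultdict(int)
    for label, diff in rows:
        totals[label] += 1
        if diff > 0:
            ups[label] += 1
    return {label: round(100 * ups[label] / totals[label], 2) for label in totals}


# Made-up rows, purely to show the calculation.
rows = [
    ("SIG", 0.42), ("SIG", 0.10), ("SIG", -0.25), ("SIG", 0.31),
    ("MIN", -0.12), ("MIN", 0.08),
    ("UNKNOWN", 0.05),
]
print(hit_rate_by_label(rows))
```

Note how a label with a single row, like UNKNOWN here, lands on 0% or 100% no matter what, which is worth keeping in mind with small samples.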

One Month Update

Interestingly, my findings are very similar to the study’s and much different from my one-week update. Any impact that ChatGPT determined to be “significant” was accurate 55% of the time, while a “negative” impact was accurate 57% of the time. Is flipping a coin just as accurate?

What is most curious between the monthly update and the one-week update is that the “unknown” impact on a stock price has a higher accuracy of determining whether the stock price will increase. Why? Out of curiosity, I highlighted any “unknown” impact and peeked at its surroundings:

Notice that they are surrounded by “significant” impacts. I also noticed that some “Difference” values were 0 but were being counted as False. First, I had to update my Excel formula (also thanks to ChatGPT):

=IF([@Diff]>0,"True",IF([@Diff]<0,"False",IF(ABS([@Diff])<0.05,"Neither","")))
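Read branch by branch, the formula does the following, shown here as a Python sketch for clarity. One thing it makes visible: by the time the ABS check runs, the difference can only be exactly 0, so the 0.05 tolerance never comes into play. The second function is a hypothetical variant that checks the tolerance first, in case treating anything within five cents as flat is the goal; that intent is an assumption on my part.

```python
def outcome(diff: float) -> str:
    """Direct translation of the Excel formula, branch for branch."""
    if diff > 0:
        return "True"
    if diff < 0:
        return "False"
    if abs(diff) < 0.05:  # only reachable when diff is exactly 0
        return "Neither"
    return ""


def outcome_with_band(diff: float, band: float = 0.05) -> str:
    """Hypothetical variant: anything within +/- band counts as no movement."""
    if abs(diff) < band:
        return "Neither"
    return "True" if diff > 0 else "False"


for d in (0.0, 0.03, -0.03, 0.10):
    print(d, outcome(d), outcome_with_band(d))
```

Both versions label a difference of exactly 0 as "Neither"; they only disagree on small non-zero moves.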

Now we see our table a little differently:

After reviewing the results marked “Neither”, I realized each one fell on a federal holiday or a weekend. I removed the “Neither” rows and got the following table:

Unfortunately for data purposes, Exxon had a strong month, so most impact detections appear as a win. However, I detect a 15% increase from ChatGPT’s “Significant” impact versus its “Minimal” impact detection. What about the “Unknown” impact, though? I can’t say for sure without looking at the surrounding figures, and generally they move with the previous impact label. For example, the following set of data shows negative or minimal impact:

These were accurate and the “unknowns” fell into this section. A few days prior, there were two negative impacts, one minimal, and two significant:

While the price did go up early in the morning of June 15th, it eventually fell on the 16th, which is why we see the price drop on the significant and unknown impacts. A few days prior on the 13th, we see the following impact detections:

The “negative” detection showed up the following day and the price went down. So, what now?

I need to look at more than one stock for more accuracy
I need to change the frequency for pulling stock prices to daily rather than lower increments and utilize the Yahoo Finance API’s “high” and “low” stock price
I also detected some flaws in the Levenshtein logic I implemented that I need to address

Data results are available here: https://docs.google.com/spreadsheets/d/1hpBkJytF4cFKXjY13YrHAukUx0qQ-Nh4shUR-BvrWZs/edit?usp=sharing
