Posted by: Moritz Uehling, CTO

No, AI won’t magically solve everything. But it does have a role in data analytics.

ChatGPT is the new craze, and everyone has their take on what it means for the tech world (everything from killing search engines to eliminating jobs). In short, we think ChatGPT and GPT-3, the large language model it’s built on, are amazing, but also have some serious flaws that make them very unreliable for data analytics. However, we see a lot of potential and there are some interesting use cases that we’ll get into at the end (so it’s not all doom and gloom).

The Good

Let’s start with what GPT-3 does really well: predict the next word[1], based on the previous text. Nothing more, nothing less.

To get good at this task, it had to “learn” (one way or another) about many of the topics people understand (e.g. grammar, software development, rhyming, idioms, slang, important people, politics, current events). GPT-3’s neural network found a way to encode this information in its parameters, and today it draws on that encoded knowledge because doing so makes it better at predicting the next few words of a text.

In other words, you can think of GPT-3 as a much, much better version of the predictive text keyboard on your phone.

[Image: predictive-text keyboard suggestions]

GPT-3 and other large language models have proven to be incredibly adaptable. For example, if you use a pre-trained model and give it just a bit more training data, it can adapt to completely different situations. That might be, for example, code generation, i.e. generating source code based on a prompt.

ChatGPT is (essentially) a version of GPT-3 with an additional training step: it has been further trained, with humans in the loop, to generate (correct) responses to prompts, so we can use its text-generation powers to answer questions. And it works amazingly well, most of the time.

Correctness - the Weakness of Language Models

GPT-3’s biggest problem is that it tries to predict what comes next, not what is correct.

Let’s test GPT-3 with a fun but very telling example about our favorite data analytics company, Datawisp.

Here’s my prompt:

[Screenshot: the prompt given to GPT-3]

And here’s what GPT-3 produced:

[Screenshot: GPT-3’s response, with the questionable phrases highlighted in blue]

The sections highlighted in blue show how the software can be a bit ignorant. “More profitable,” “more attractive opportunity,” etc. are all things that are generally said while discussing pivots, but they’re not at all applicable or appropriate in this case. However, if you didn’t know anything about distributing software or ice cream, it would be really tough to notice the error.

Some other known issues with GPT-3:

  • It has trouble recognizing negatives

  • It has trouble with basic math

  • It makes stuff up / hallucinates things

  • It can’t tell if someone’s prompt is fact or fiction

Solving these problems is really hard. If you want to learn a bit more, here’s a great video by the AI researcher Robert Miles.

Let’s dig into what these problems mean for developing software.

GPT won't steal your software development job… at least not yet

When we apply the weaknesses of GPT-3 to programming, the results are a little worrying. The inherent errors can be really difficult to notice at first, because ChatGPT is really, really good at generating plausible-looking code.

Almost more impressively, it can even compile code. Here’s an example of ChatGPT compiling some simple code[2]:

[Screenshot: ChatGPT “compiling” a simple code snippet]

ChatGPT’s ability to compile the above is super impressive, but when pressed one step further, it struggles with basic math. For example, here it simply ignores the “+y -x” part:

[Screenshot: ChatGPT ignoring the “+y -x” part of the calculation]

Perhaps most concerning (but funny) is that you can just lie to it, and it’ll believe you:

[Screenshot: ChatGPT believing a deliberately false statement]

That’s not to say this interaction isn’t impressive. The fact that large language models have learned math at all is remarkable, and it’s not just memorizing all combinations of numbers - there seems to be some computation going on.

But, to quote OpenAI:

"Large language models like GPT-3 have many impressive skills, including their ability to imitate many writing styles, and their extensive factual knowledge. However, they struggle to perform tasks that require accurate multistep reasoning, like solving grade school math word problems. Although the model can mimic the cadence of correct solutions, it regularly produces critical errors in logic."

OpenAI Blog, Solving Math Word Problems[3]

And those problems with multi-step reasoning come back to haunt the AI in this example - it simply “forgets” a step until I remind it of it.

Could some larger language model improve on that? Yeah, but correctness remains hard, because at its core the AI isn’t reasoning, it’s predicting text[4] - which, at least right now, seems to be a fundamentally different thing.

More importantly, we need to acknowledge that we’re not there yet.

Here’s what Jerome Pesenti, the former VP of AI at Meta, said about that topic recently on the podcast Gradient Dissent[5]:

Jerome: You are in software [development], right? I give you a piece of code, okay, and I tell you it's 99% accurate.

How good does it give you...the problem is that generating code that's not accurate...I mean, sometimes finding a bug is way harder than writing the code from scratch, right?

[...]

I think the way to think of Codex and stuff like that is like an auto-complete. It's a very smart auto-complete, the same way when you write your email right now, Gmail does auto-complete. It can complete sentences, and it's quite smart, and it's quite impressive. And if you cherry-pick the results, it looks amazing and it's very surprising what it can do.

But, you know, it writes something, and then you have to say, "Well, is that actually accurate?"

You don't have guarantees, and not having guarantees in code is a huge, huge problem, right? Really bug-free code is worth a million times [of just] code. It's not the size of the code that matters.

So, I'm really cautious on this one. I do think it's a useful developer tool. People will use it like they use auto-complete to write email. But it's not going to write...it's not going to put developers out of a job. No way. And especially...it's tricky when you write code, because you need to have guarantees.

Interviewer: Well, I certainly feel like it helps me write code faster.

[...]

Jerome: You can write worse code faster, I'll give you that. That's for sure. Is it really allowing you to write...I think it will — I also believe it, right? — it will make people faster. But how much will depend on the validity of the code?

If you had a system that could guarantee you that the code is accurate, that would be a complete revolution. This is not what it is, right? 

Again, having guarantees and having control over the outputs is something that's really one of the big challenges of these models.

Making sure that what it says is accurate, that's another thing. These language models, they hallucinate. Avoiding that is really, really, really tricky.

I highly recommend listening to the whole podcast if this is something that interests you.

An Illustrative Example

I want to cap off my thoughts with this example of ChatGPT[6] simply making stuff up in response to a prompt that my friend and I gave it.

For context, hydraulic lines are essentially tubes filled with a liquid (“hydraulic fluid”); they transmit force from one part of the airplane to another and make controlling the aircraft a bit easier. The emergency lighting system on an airplane is obviously not powered by a fluid but by electricity. The model, however, doesn’t understand this at all.

[Screenshot: ChatGPT’s made-up explanation of the emergency lighting system]

Once ChatGPT starts making things up, it will blissfully continue doing so:

[Screenshot: ChatGPT continuing to make things up about the evacuation slides]

The slides don’t fill with hydraulic fluid - they obviously fill with air. 

But here’s the thing: the most problematic errors aren’t the ones that are obvious to everyone. They’re the ones you only spot when you:

  1.  Look really closely at the generated response, or

  2.  Have a way to evaluate whether something is correct (e.g. domain knowledge)

Here’s a quick example from Datawisp - an actual problem (slightly simplified here) that I had to solve this week. We want our server to back up a user’s sheet whenever it’s updated, but at most every 10 minutes (so the user can restore the sheet if something happens to it).

So, let’s ask OpenAI’s Codex to generate some code. If you don’t know SQL, bear with me - I’ll walk you through it!

Here’s a prompt that creates a table and then asks Codex to create a backup based on some conditions:

[Screenshot: the prompt that creates the table and asks for the backup query]
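To make the example easier to follow, here is a rough sketch of that kind of setup. This is illustrative only - I’m assuming a PostgreSQL-style database, and the table and column names (beyond sheets_backups and its backup column, which come up below) are guesses rather than the actual prompt:

-- Hypothetical, simplified schema (PostgreSQL-style syntax; names are illustrative).
CREATE TABLE sheets (
    id         SERIAL PRIMARY KEY,
    data       JSONB NOT NULL,                     -- the sheet's contents
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE sheets_backups (
    id         SERIAL PRIMARY KEY,
    sheet_id   INTEGER NOT NULL REFERENCES sheets (id),
    backup     JSONB NOT NULL,                     -- a copy of the sheet's contents
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()  -- when the backup was taken
);

The request, roughly: whenever a sheet is updated, insert a copy of it into sheets_backups, but only if the most recent backup for that sheet is more than 10 minutes old.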

And here's the generated code:

[Screenshot: the generated SQL]
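Since the exact output matters less than its shape, here is a rough reconstruction of the kind of query that comes back, written against the hypothetical schema above (not the literal generated code):

-- Reconstruction of the kind of query generated (hypothetical names, not the exact output).
-- :sheet_id is a bind parameter identifying the sheet that was just updated.
INSERT INTO sheets_backups (sheet_id, backup)
SELECT s.id, s.data
FROM sheets s
WHERE s.id = :sheet_id
  AND (SELECT MAX(b.created_at)
         FROM sheets_backups b
        WHERE b.sheet_id = :sheet_id) < NOW() - INTERVAL '10 minutes';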

On the surface, the generated code looks correct - it checks whether the last backup of a specific sheet is more than 10 minutes old and, if it is, inserts a copy of the sheet into the sheets_backups table.

But it has a subtle error that you likely won’t spot unless you actually try out the query or have a good understanding of SQL: if there’s no backup yet, the (SELECT backup FROM sheets_backups...) part of the query returns NULL - and in SQL, comparing NULL to anything yields NULL rather than TRUE, so the condition never passes.

And so, if you don’t already have a backup, you won’t get a backup.

That’s not what we want, and FAR from optimal!

Sure, you can argue that’s what I asked for, but it’s not what I wanted - and it’s not correct. If we rework the prompt a little, it works just fine:

[Screenshot: the reworked prompt and the corrected query]
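As a sketch of what a correct version can look like (again using the hypothetical schema from above, and not necessarily the exact query the reworked prompt produces), one option is to flip the condition around with NOT EXISTS, which handles the “no backup yet” case naturally:

-- One possible correct version: back up unless a recent backup already exists.
INSERT INTO sheets_backups (sheet_id, backup)
SELECT s.id, s.data
FROM sheets s
WHERE s.id = :sheet_id
  AND NOT EXISTS (
        SELECT 1
          FROM sheets_backups b
         WHERE b.sheet_id = :sheet_id
           AND b.created_at >= NOW() - INTERVAL '10 minutes'
      );

Because NOT EXISTS simply asks “is there a backup from the last 10 minutes?”, the condition is true both when the latest backup is old and when there is no backup at all - no NULL comparison involved.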

As you can imagine, spotting this kind of mistake is hard, and it’s outright impossible if you’re not already good at reading and writing code.

Conclusion

While the evolution of AI has been nothing short of remarkable, language models are still extremely prone to just making stuff up. Not only that - they make stuff up very convincingly and sometimes very subtly, which makes the errors hard to notice even for subject matter experts.

We think this makes them particularly unreliable for taking over technical tasks from non-experts. A black box that simply takes human input and spits out answers that can’t be easily checked seems like a nightmare, especially in the world of data analytics. 

However, the value that AI brings seems undeniable: in the right hands, it can generate very usable code that only requires a few small tweaks, with minimal input. We think this can greatly increase the speed at which small teams can write code, analyze data, or even build new products. And so, our 2023 Roadmap for Datawisp includes some areas where we think AI will add value to our platform.

At Datawisp we’ve always had two goals: 1) to make it easier to work with data without writing code and 2) to make our sheets easy to understand even for users who didn’t create them. With Datawisp, we want you to be able to use the power of AI to help analyze data while still being able to see what’s happening and catch mistakes. This is why we’re introducing some new features that leverage this technology while leaning on Datawisp’s visual interface to reduce the chance of errors.

We’ll have more to announce on this topic at the Game Developers Conference in mid-March!

Schedule an in-person or remote demo and stop by our booth to find out more!

__

Footnotes

  1.  Technically the next “token” (which can be just part of a word), but for the purposes of this article the difference doesn’t really matter.

  2.  This code is fairly unoptimized: the nop - literally “no operation” - is unnecessary, and at IL_0009 it jumps to IL_000b for literally no reason at all. All three of these instructions would likely be left out by a real compiler, unless I’m overlooking something.

  3.  https://openai.com/blog/grade-school-math/

  4.  ChatGPT works a bit differently, so you might need to add the caveat “It’s predicting text that human reviewers like” - which is still different from reasoning!

  5. https://wandb.ai/wandb_fc/gradient-dissent/reports/Jerome-Pesenti-Large-Language-Models-PyTorch-and-Meta--VmlldzozMTgzMjc2?galleryTag=podcast

  6.  The reason I’m using ChatGPT over GPT-3 here is that ChatGPT is already “trained” to answer accurately, and it still struggles with it.