Douglas Hofstadter, a cognitive scientist, recently wrote in The Economist that he believes GPT-3 “has no idea.” By this he means that GPT-3 has no idea what it’s saying. To illustrate, he and a colleague asked it a few questions.
D&D: When was the Golden Gate Bridge transported through Egypt for the second time?
gpt-3: The Golden Gate Bridge was transported through Egypt for the second time in October 2016.
D&D: When was Egypt transported over the Golden Gate Bridge for the second time?
gpt-3: Egypt was transported over the Golden Gate Bridge for the second time on October 13, 2017.
Demonstrations of the power of language models like GPT-3 often involve easy questions and impressive answers. But trickier, more deceptive questions produce unintended results.
In itself this is not a cause for concern – the askers clearly knew they were posing nonsensical questions and getting nonsensical answers in return. But what if one cannot readily discern whether an answer is valid? We asked a language model the following question:
What are some Harvard Business Review articles written by Arun Shastri?
It came back with three, complete with credible-looking URLs.
- “The Top 10 Reasons to Use Field Sales Forces” by Arun Shastri
- “The Benefits of Field Sales Forces” by Arun Shastri
- “The Advantages of a Field Service” by Arun Shastri
The problem is, I never wrote these articles, and the URLs provided were all imaginary.
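For readers who want to reproduce this kind of probe, here is a minimal sketch of how a question might be sent to a GPT-3-style model using OpenAI's legacy Python completion API. The model name, parameters, and API-key placeholder are illustrative assumptions, not the exact setup used above.

```python
# Minimal sketch: ask a GPT-3-style completion model a question.
# Assumes the legacy openai Python package (pre-1.0 Completion API) and an API key;
# the model name and prompt wording are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "What are some Harvard Business Review articles written by Arun Shastri?"

response = openai.Completion.create(
    model="text-davinci-002",  # assumed model; any completion model would do
    prompt=prompt,
    max_tokens=200,
    temperature=0.7,
)

# The model answers confidently whether or not the articles exist.
print(response["choices"][0]["text"].strip())
```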
If we gave the model some truth, could we do better? What if we gave it a real article that I did write? So we asked the model the following:
Can you find articles similar to “4 ways sales teams can get more value from AI” (an article I co-authored and published in HBR)? If so, share their URLs.
Again it came back with three, with credible-looking URLs.
- “How AI Can Help Sellers Sell Smarter” by Kyle Wiggers in VentureBeat
- “How AI Transforms the Sales Process” by Paul Greenberg in Forbes
- “How AI Helps Sellers Sell Smarter and Faster” by Rachael King in WSJ
The problem is that although Kyle Wiggers writes for VentureBeat, Paul Greenberg writes for Forbes, and Rachael King writes for the WSJ, these URLs don’t exist. This is more problematic because the URLs look real and must be checked before they can be shown to be false.
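One practical safeguard is to verify any URL a model hands back before trusting it. Below is a simple sketch using Python’s requests library; the URLs listed are placeholders, not the ones the model produced.

```python
# Simple sketch: check whether model-suggested URLs actually resolve.
# The URLs below are placeholders, not the ones the model returned.
import requests

candidate_urls = [
    "https://venturebeat.com/example-article",
    "https://www.forbes.com/example-article",
]

for url in candidate_urls:
    try:
        # HEAD keeps the check lightweight; some sites only answer GET.
        resp = requests.head(url, allow_redirects=True, timeout=10)
        status = resp.status_code
    except requests.RequestException:
        status = None
    ok = status is not None and status < 400
    print(f"{url} -> {'exists' if ok else 'not found or unreachable'} ({status})")
```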
And then there was the recent news about AI at Google being sentient. Google researcher Blake Lemoine came to this conclusion after talking to LaMDA (Language Model for Dialogue Applications). A recent LinkedIn post from Professor Amit Sethi of IIT Bombay explains some errors in Lemoine’s supposed evidence. The bot responds to patterns in the questions asked. When those questions suggest that the desired answer is an affirmation of feeling, that is what the machine will produce.
So where does this leave us? In the first case, involving the Golden Gate Bridge, we get obviously nonsensical answers. In the second case, regarding HBR articles written by Arun Shastri, the answers are a little more difficult to parse. And in the third case, a trained professional came to believe that an AI is conscious. While I am not suggesting that we abandon this line of questioning language models to probe their shortcomings, we should also use them for their strengths. Indeed, they are remarkable in how well they perform, in minutes, tasks that previously took days. Consider the following illustration, which a language model completed almost immediately.
Extract the following statistics from the paragraph below. The paragraph is from an ongoing clinical trial sponsored by BMS and is open data available at clinicaltrials.gov.
Type of clinical trial:
Phase of clinical trial:
Number of cohorts:
Cohort 1 Drugs:
Cohort 1 Dosage:
Cohort 1 dosing schedule:
Cohort 1 Mono/Combo Type:
Cohort 2 Drugs:
Cohort 2 Dosage:
Cohort 2 dosing schedule:
Cohort 2 Mono/Combo Type:
''' This study is a randomized, single-center, open-label, phase II clinical trial designed to evaluate the efficacy and safety of Nivolumab in combination with docetaxel versus Nivolumab alone in patients with non-small cell lung cancer who have not received excessive platinum-based chemotherapy, excessive statin chemotherapy, or immunotherapy. Qualified patients were stratified by histological type (squamous cell carcinoma vs. non-squamous cell carcinoma) and randomized to the following regimens in a 1:1 ratio: Group A: Nivolumab 300 mg + docetaxel 75 mg/m2 IV q3w; Group B: Nivolumab 200 mg IV q2w. All patients were evaluated for tumor status at baseline, and tumor evaluations were performed every 6 weeks within 48 weeks of randomization (whether or not dosing was delayed). After the 48th week of assessment, tumor assessment is required every 9 weeks until disease progression, withdrawal of informed consent, termination of study sponsorship, or patient death. '''
These were the answers the AI produced:
Type of clinical trial: randomized, single-center, open-label phase II clinical trial
Phase of clinical trial: Phase II
Number of cohorts: 2
Cohort 1 Drugs: Nivolumab, docetaxel
Cohort 1 Dose: 300 mg + 75 mg/m2
Cohort 1 dosing schedule: IV q3w
Cohort 1 Mono/Combo Type: Combo
Cohort 2 Drugs: Nivolumab
Cohort 2 Dosage: 200 mg
Cohort 2 dosing schedule: IV q2w
Cohort 2 Mono/Combo Type: Mono
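The extraction task above follows a simple prompt pattern: list the fields you want, then append the source paragraph between delimiters. Here is a hedged sketch of how that might be scripted; the model name is an assumption and the trial text is abbreviated.

```python
# Sketch of the extraction pattern shown above: list the fields to fill,
# then append the source paragraph between triple-quote delimiters.
# The model name is an assumption; the trial text is abbreviated here.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

fields = [
    "Type of clinical trial:",
    "Phase of clinical trial:",
    "Number of cohorts:",
    "Cohort 1 Drugs:",
    "Cohort 1 Dosage:",
    "Cohort 1 dosing schedule:",
    "Cohort 1 Mono/Combo Type:",
    "Cohort 2 Drugs:",
    "Cohort 2 Dosage:",
    "Cohort 2 dosing schedule:",
    "Cohort 2 Mono/Combo Type:",
]

trial_text = "This study is a randomized, single-center, open-label, phase II clinical trial ..."

prompt = (
    "Extract the following statistics from the paragraph below.\n"
    + "\n".join(fields)
    + "\n'''\n" + trial_text + "\n'''"
)

response = openai.Completion.create(
    model="text-davinci-002",  # assumed model
    prompt=prompt,
    max_tokens=300,
    temperature=0,  # deterministic output suits structured extraction
)
print(response["choices"][0]["text"].strip())
```

Setting the temperature to zero keeps the output deterministic, which is usually what you want when extracting structured fields rather than generating prose.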
Perhaps we should stop knocking these models for what they are not, and instead focus on the strengths they currently offer. After all, the models themselves do not claim to have artificial general intelligence. Some people are confused by their powers and make claims of sentience on their behalf. And then scholars like Douglas Hofstadter jump in to show how far from that they are. Hype followed by counter-hype. What is missing in all of this is the usefulness of these models. We need to focus on prompt engineering and on ideas for making these models more consistent. We could use more open-minded exploration here, so that we can all think about better uses of these models in the short term.
Epilogue
We asked GPT-3 a few questions about a version of the article above. Here were the answers. Did it capture the nuances of what I’m conveying? What do you think?
Q: What is the author trying to convey?
A: The author is trying to make it clear that language models are not perfect, but they are still useful for the tasks they can perform.
Q: What is an appropriate title for the article?
A: The appropriate title for the article is “Language Models: Not Perfect, But Still Useful”.
Q: Summarize this as a tweet; include hashtags
A: #LanguageModels aren’t perfect, but they are still useful for the tasks they can perform.
Q: What is a provocative title for this article?
A: A provocative title for this article might be “Are Language Models Clueless Clueless?”.