September 16, 2024

Can ChatGPT—and its successors—go from cool to tool?

Anyone reading Freedom to Tinker has seen examples of ChatGPT doing cool things.  One of my favorites is its amazing answer to this prompt: “write a biblical verse in the style of the King James Bible explaining how to remove a peanut butter sandwich from a VCR.”  Based in part on this kind of viral example, some people are making big claims about the impact that ChatGPT will have on the world in general, and science in particular.  A recent article in Nature, for example, claimed that “conversational AI is a game-changer for science.”  Maybe.  But, let’s not forget that there is a huge gap between writing funny instructions for removing food from home electronics and doing scientific research.  In other words, there’s a big difference between being cool and being a tool.  In this post, I’ll describe what happened when I tried to use ChatGPT in scientific research, specifically to help me do peer review for a scientific paper.  In short, ChatGPT didn’t help me do peer review at all; not one little bit.  Although this result was somewhat disappointing for me, I think the experience offers lessons about how we can assess future versions of ChatGPT and all the new AI tools that are coming.

OpenAI launched ChatGPT on November 30, 2022.  People immediately rushed to demonstrate its surprising capabilities, and it seems the more surprising the demonstration, the more viral it went. Like others, I was amazed when I first saw viral screenshots of ChatGPT in action.  After playing with ChatGPT for a while, I wondered whether I should stop all my work in order to focus on understanding and using ChatGPT—and all of its successors that are supposedly even bigger, better, and just around the corner.  As my kids, wife, and friends can tell you, it was one of the only things I wanted to talk about for a while.

Playing with ChatGPT

In the course of messing around with ChatGPT, however, I started to notice a pattern.  My conversations with it generally fell into two broad categories.  The first could be called “I wonder if” conversations.  For example, I wonder if ChatGPT can write a wedding toast in the style of Donald Trump (answer: yes, quite well actually).  These conversations were quite exciting.  The second type of conversation, however, started with a real problem that I had in my work.  These could be called “getting stuff done” conversations.  For these kinds of conversations, ChatGPT was much less exciting.  For computer programming tasks, for example, asking ChatGPT was about as helpful as searching the website Stack Overflow.  To be clear, this is incredibly impressive because Stack Overflow is amazing.  But, it is not that useful because Stack Overflow already exists.  In order to be a useful tool, ChatGPT has to be able to help us do real things better than our existing tools.

ChatGPT and scientific peer review

Around the time I was playing with ChatGPT, I received a request from a scientific journal to peer review a manuscript.  For non-academic readers, a bit of quick background might be helpful. When a team of researchers wants to publish a scientific paper, they send it to a journal and then the journal organizes a peer review.  During peer review, the scientific journal sends the manuscript to about three other researchers.  These researchers help the editor of the journal assess quality and help the authors improve.

Although peer review sounds good in theory, many academics know that the process can be a slog in practice.  Writing good reviews consumes a lot of time, and receiving reviews can be infuriating when it sometimes seems as if the reviewers have not even read your paper.  The challenges with peer review have led to complaints and some exploration of alternatives, including using other forms of natural language processing.

I personally think peer review serves an important purpose and can often work well; most of my papers are improved by peer review, even if the process can be frustrating at times.  Therefore, I tried to see if I could find a way to use ChatGPT as a tool to improve peer review.  Could ChatGPT write my review for me?  A first draft?  Could it make my review even a bit better or a bit faster?

Using ChatGPT for scientific peer review

Before starting this exploration, I began by thinking about the ethics of using ChatGPT for peer review.  I decided that it is reasonable under certain conditions, and I’ve written more about that in the ethics appendix at the end of this post.

Next I had to decide how I would measure success.  By this point, I had used ChatGPT enough that I knew that it could not write a real review for me.  So, I set a much simpler criterion: Could it help me at all?  To assess this, I wrote a review following my normal process.  Then I started interacting with ChatGPT to see if it would say something—anything—that would lead me to change my review.  ChatGPT is known to have a problem with hallucination, so I knew not to take anything it said too seriously, but I was at least hopeful that it would say something that would prime my imagination and lead me to write a better, more helpful review.

Immediately I ran into a mundane but very real problem.  The manuscript that I received from the journal was in a complicated PDF format, and I had to convert it to plain text in order to put it into ChatGPT.  Surprisingly, it took me about an hour to get it into a plain text file that I thought was fair to ChatGPT.  I found this part especially frustrating because I often spend a lot of time on the reverse of this process: getting plain text into a format suitable for a journal.

After I finally had the manuscript in a plain text format, I ran into a new problem: ChatGPT has a character limit.  To circumvent this hurdle, I tried putting chunks of the paper into ChatGPT and asking it to review the paper.  I tried many different prompts, including asking some of the questions the scientific journal asked me.  Generally, these prompts fell into two buckets: aesthetic (e.g., what are the most exciting parts of this paper?) and technical (e.g., what are some concerns about the manuscript’s statistical analysis?).  For both kinds of prompts, ChatGPT didn’t produce anything useful to me.  Instead, it often avoided aesthetic questions by providing summaries of the paper (which wasn’t very useful because I had already read the paper carefully).  For technical questions, ChatGPT avoided my question by falling back on general statements about common problems with statistical analysis.  These general answers were not bad; they were just not helpful for improving my review.  After about 20 minutes of trying and failing to get something interesting, I gave up and submitted my review with no changes.  After almost 90 minutes of extra work, my review was not improved one bit.
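For readers who want to try the same chunking workaround, here is a minimal sketch in Python of one way to split a plain text manuscript into pieces that fit under a character limit.  It is not the procedure I actually followed (I split the text by hand).  The function name chunk_text and the default limit of 4000 characters are assumptions made up for illustration; the real limit depends on the system and may change.

```python
def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks are built from whole paragraphs (separated by blank lines)
    where possible. The default limit is a made-up placeholder, not a
    documented ChatGPT value.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A paragraph longer than the limit is cut at hard boundaries.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Add the paragraph to the current chunk if it still fits.
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be pasted in with the same question.  Keeping paragraphs whole is a design choice meant to avoid cutting an argument mid-sentence, but a reviewer looking at one chunk at a time still cannot see how a claim in one section depends on evidence in another, which may be part of why this approach did not help me.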

One more try

This might seem like the end of this story, but it is not.  For some reason, I still believed that AI might help, and I was lucky enough to run into a prompt engineer named Mike Taylor.  Mike helps people write effective queries for ChatGPT and similar systems.  He suggested that I try GPT-3 (to get around some of ChatGPT’s filters) and try providing more context in my prompt.  Based on Mike’s advice, I created an account to use GPT-3 and I tried again (using the model “text-davinci-003”).  These results were a bit better, because GPT-3 was more willing to express an opinion.  But there was still nothing useful for my peer review.

Lessons learned

Overall, this experience taught me three things.  First, it increased my belief that ChatGPT is cool: it did some stuff that really surprised me during this process.  Second, it convinced me that ChatGPT (and GPT-3) are not obviously helpful for scientific peer review.  In other words, ChatGPT is not yet a good tool, at least for this task.  Finally, the experience revealed an important asymmetry: coolness is easy to assess but toolness is difficult to assess.  If I had started by asking ChatGPT to review the paper, I would have gotten words that superficially seemed like a review, and I would have been very impressed.  But, after doing a real review, it was easy for me to spot the limitations of ChatGPT’s responses.  My colleague Matt Weinberg also reported a similar feeling about the difficulty of separating ChatGPT’s fancy fluff from real insight.

There are some limitations to the conclusions that we should draw from my experience.  First, peer review is just one part of the scientific process, and it might even be a part that is especially hard for AI.  Even if we stick to the task of peer review, it is possible that someone who is both a scientific expert in the topic of this paper and an expert in prompt engineering could have gotten better results.  It is also possible that someone could get better results with a different kind of scientific paper.  Finally, and most importantly, ChatGPT and other large language models are improving, and it could be the case that they will work better in the future, or with a fine-tuned model trained specifically for peer review.

Looking beyond this activity and its limitations, I hope this example suggests an interesting new way to test large language models.  Peer review is an especially attractive testing ground for these models for three reasons.  First, peer review is a real task and improvements would have societal benefit; this is in contrast to some other large language model evaluation tasks that seem quite artificial.  Second, peer review already involves experts making careful evaluations of texts, so adding large language models into the mix seems to enable us to measure performance against world-class experts at very little additional cost.  Finally, peer review is a decentralized activity, so there is room for experimentation within strong ethics norms, and the assessments of the utility can be made by authors, reviewers, and journal editors (rather than AI companies or deep skeptics). 

It is undeniable that ChatGPT is cool, at least to me.  But will ChatGPT and other generative AI technologies really “redefine human knowledge, accelerate changes in the fabric of our reality, and reorganize politics and society” as Henry Kissinger, Eric Schmidt, and Dan Huttenlocher claimed in the Wall Street Journal?  If these systems are going to live up to those kinds of claims, they are probably going to have to become really useful for many important tasks.  For scientific peer review, at least, I don’t think we are there yet.  We could, however, design processes that would allow us to measure progress, which is important if we want independent assessments of ChatGPT and future generative AI systems.  Finally, this activity is a reminder that there is a big difference between being a powerful tool and a cool toy.

Ethics appendix

Before undertaking this activity, I considered the ethics of it, which is something I recommend to others in my book, Bit by Bit.  I decided that this would be reasonable to try under two conditions.  First, I wanted to ensure that whatever I did would meet my obligation to the authors, journal, and scientific community, at least as well as if I had done a normal peer review.  In other words, I wanted to match or exceed the standard of care.  Writing the review first and only then interacting with ChatGPT seems to ensure this standard.  Second, I wanted to ensure transparency, so I explained in my review what I had done.  Writing this blog post also helps ensure transparency-based accountability.  In fact, in the process of getting feedback on this post, I learned about an issue that I had not considered: whether the prompts I put into ChatGPT—which included parts of the paper—would become part of its future training data.  As of the time of this writing, it seems like the policies are rapidly changing, and the norms are unclear.  Future reviewers and journals could consider this issue during future explorations with ChatGPT and other large language models.  Peer review is an incredibly important process, and I hope that we can continue to explore ways to make it more useful for researchers, journals, and the scientific community.

I’d like to thank Robin Berjon, Denny Boyle, Sayash Kapoor, Karen Levy, Jakob Mökander, Arvind Narayanan, Karen Rouse, Mike Taylor, and Jevin West for helpful conversations.  All views expressed are, of course, my own.