Study Design with Generative AI

academia
methodology
genAI
Author

Jilly MacKay

Published

March 25, 2026

I have been working on a little research strand incorporating AI-generated data, and it’s made me think about the reproducibility steps you need when you incorporate AI into your methodology and methods. I thought I’d try to capture some of my thoughts in a blog post. So here we go!

What Kinds of AI?

Something I’m noticing a lot at the moment is that ‘AI’ covers a whole range of things: from the ‘several binary logistic regressions in a trenchcoat’ forms of machine learning, to self-trained large language models, to the big-name commercial large language models like OpenAI’s ChatGPT or Anthropic’s Claude, to the prompt-responding media generators like OpenAI’s Sora2 (at least, there used to be), or Google’s Gemini Veo and Nanobanana.

I think there is likely scope to include all of these in some aspect of study design, but the ones I’m focusing on at the moment are the disruptors like ChatGPT, Gemini and Claude.

I think these kinds of conversational genAI are the ones that people are most likely to turn to as a search engine substitute, and the ones that are often spoken about as having the most promise. They’re also the most accessible to Jane and Joe Public, so that’s what I’m focussing on here. My concern is not how someone with considerable research design knowledge might utilise bespoke models in novel ways, but how people who are time-pressed with limited research design knowledge might utilise commercial systems, and where the potential challenges are.

Potential AI Uses

I can see a handful of potential use cases for these AIs in study design and methods:

  • Planning a study design
  • Consulting on power analyses
  • Creating dummy data
  • Responding to surveys (generating data)
  • Generating charts and figures

All of these come with big caveats and risks, but I think they are places where many time-stretched researchers might start using AI, and thus they are places where we need to start thinking about how we make our methods more reproducible when we choose to use AI.

The Ethics Bit

I need to state first off that there are Big Problems with the use of AI for these tasks. Three issues in particular concern me.

  1. These types of AI were trained on stolen data and unrecognised labour
  2. These types of AI have significant environmental and ecological costs thanks to their power consumption
  3. These types of AI cannot ethically reason, are deeply biased, and can cause harm to their users

Reproducibility

I argue that genAI is fundamentally unreproducible. I still think Ted Chiang’s blurry jpeg analogy is one of the best explainers I’ve read on AI, and one of the key parts of this is that it is always trying to produce something that looks like what has been produced before. But it doesn’t start from the same place each time.

We know that genAI systems learn from the conversations they have, and we know that they are primed by the conversation they have had with you. I would not be surprised if they’re also primed by other data they hold on you if, for example, they are owned by Google and you have a Google account.
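To make the contrast concrete, here’s a minimal R sketch of what reproducible randomness looks like in ordinary statistical tooling. The numbers are arbitrary illustrations; the point is that commercial chat interfaces expose no equivalent of `set.seed()`, which is part of why their outputs can’t be pinned down.

```r
# Two unseeded draws differ, just as two identical prompts to a genAI
# can produce different answers
run_a <- rnorm(3, mean = 35, sd = 4)
run_b <- rnorm(3, mean = 35, sd = 4)
identical(run_a, run_b)  # FALSE

# Setting the seed pins the starting state, so the draws repeat exactly;
# commercial genAI gives you no handle like this
set.seed(2026)
run_c <- rnorm(3, mean = 35, sd = 4)
set.seed(2026)
run_d <- rnorm(3, mean = 35, sd = 4)
identical(run_c, run_d)  # TRUE
```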

Note on AI use for this blog post

I decided to use AI for this blog post because I wanted to demonstrate the kinds of answers you can get. I have used ELM (Edinburgh’s ChatGPT API) for a similar presentation at work, but the answers were not very sophisticated and you can’t share the conversations easily.

I opted to use Claude because I could share the link to the conversation directly, and that seemed to be the most useful approach to me. It is also the AI I’m most impressed with so far, and I thought it was worth demonstrating a good example. That said, I did run into the free prompt bumper, so this conversation took a few days (not necessarily a bad thing!)

Link to the Claude conversation referenced throughout / PDF Archive

I have decided for this conversation to give Claude only one prompt, and to only respond or clarify when it asks me to. You might argue this is unfair, but I think this is A) not an unusual pattern of behaviour for users and B) the best energy use case for a blog.

Claude, help me design my study

You can ask genAI to help design a study.

I’ve given Claude an example of a study that I want to run, and asked it to help me design the process.

I’m planning a research study, can you help me design this? Its an educational study. I have access to 30 participants for three weeks, and everyone needs to receive the control, a simple educational resource that takes 1 week to go through. I have an intervention, which is a complex educational resource which takes 5 days to do, but I can only give it to 7 people at once. How can I design this study to maximise my power?

I was initially very impressed with Claude’s answer, which asked me to consider whether order effects were likely, and offered a couple of extra considerations such as ensuring participants were randomly assigned to groups. It even offered a couple of diagrams which I liked.

A diagram of study design generated by Claude featuring 4 groups, with 3 receiving the intervention, and 1 serving as control

Claude’s description of my best study design

The final design it recommends does something I always like in study design: it uses unequal groups to boost the control numbers, which actually increases your statistical power.

Here’s where we start to find problems, though. Claude doesn’t seem to fully think its answer through before it replies (you and me both, friend): it initially recommends a design that has a flaw, although it subsequently corrects that and then adjusts based on my added restrictions. I think this could be a little confusing for people who are simply looking for quick answers.

Modifying your prompt with ‘please provide the single best answer and give your reasoning’ might help with this problem, but then, we also want our user to consider different options.

Also - have you spotted Claude’s mistake with its example above? It took me a minute . . .

It has me giving the intervention to 14 people in Week 3!

For what it’s worth, this is what I would have designed myself:

My design for the educational intervention study
|               | Week 1                | Week 2           | Week 3                |
|---------------|-----------------------|------------------|-----------------------|
| Grp 1 (n = 7) | (Test 1) Intervention | (Test 2) Control |                       |
| Grp 2 (n = 7) | (Test 1)              | Intervention     | (Test 2) Control      |
| Grp 3 (n = 7) | (Test 1)              | Control          | Intervention (Test 2) |
| Grp 4 (n = 9) | (Test 1)              | Control          | (Test 2)              |

This design features n = 14 receiving the intervention first and then the control (which they must receive), so we can compare intervention-first performance. We then have 7 with a potential order effect from receiving the control first, and 9 in our control-only group. Now, you might rightly say that your statistical power is limited here, and yes, it is, but I think that’s the ‘trick’ to this question. There’s no good answer.

We might expect Group 3 to do the best, because they’ll have had the Control before the intervention, but that’s actually evidence in favour of the intervention, because we can characterise the improvement associated with receiving both. Because everyone needs the control, and we can only give a few people the intervention, my design maximises the number of people receiving both, which is arguably a good thing. We want to have some good exit criteria to know when we should stop the study, for example if our data shows us a deleterious effect on knowledge (but this is extremely unlikely!)

I want to emphasise as well that this is not an unusual set of restrictions in an educational study. Often you only have access to your participants for a set amount of time, and often you need to make sure they get an existing intervention of known quantity. I have run trials a bit like this for human and animal participants, and I don’t think it’s a hugely unlikely scenario. It’s a fair test of the system.

Claude, help me with a power analysis

I’m going to test everyone before the trial and then after the trial with a 50 question multiple choice test. The average score is usually 35/50. What kind of effect size can I detect with this sample?

Claude again asks me for more information: the typical range of scores, what alpha level I’m interested in, and what effect size I want to detect. It then gives a pretty detailed answer with something that really impressed me: an interactive sensitivity chart.

. . . except the chart doesn’t work.

To be fair to Claude, my charts often don’t work first time either, but yet again here is genAI over-promising. I haven’t given it an easy study design, so I’m not surprised it’s struggling here, but still, it’s going to require a lot more conversation than I’m willing to invest my energy in.

It has correctly identified that we’d need to push the average score up from 35/50 to 38/50 to reach a significant result with our numbers. One of the decisions you might be able to make from this is to explore whether this is feasible or likely, and if not, is there a better way of designing your experiment, such as perhaps looking at qualitative measures (or using a Bayesian analysis!)
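As a sanity check on numbers like these, base R’s `power.t.test()` will do the arithmetic with no AI involved. The SD of 4 and the groups of roughly 15 below are my own illustrative assumptions (loosely matching the dummy data later in this post), not figures from the Claude conversation.

```r
# Power for a two-sample comparison of post-test scores, assuming
# (my assumptions) a true 3-point shift (35 -> 38), SD = 4, ~15 per group
res <- power.t.test(n = 15, delta = 3, sd = 4, sig.level = 0.05,
                    type = "two.sample")
res$power  # around 0.5: a 3-point effect sits right at the detection threshold

# Or solve for the smallest true shift detectable with 80% power
power.t.test(n = 15, power = 0.8, sd = 4, sig.level = 0.05)$delta
```

A true effect that just reaches the significance cut-off corresponds to roughly 50% power, which is consistent with Claude’s framing that 38/50 is the point where significance becomes reachable.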

I do think Claude’s walkthrough is better than most statistical power calculators, as it already knows the study design it’s talking about (albeit one that isn’t feasible). Statistical power is a tricky beast to get to know, and reviewers love asking about it, often in not very relevant ways. Statistical power is the probability of correctly rejecting the null hypothesis when it is false, and it depends on the effect size, your sample size, and what you’ve set as your significance threshold (e.g. p < .05). Power protects you from wrongly retaining the null hypothesis, in this case the hypothesis that the new intervention is no better than the control. But what’s the impact of making that error here? Participants still get an educational resource that works? Hoenig and Heisey have some good chat about this here that Claude doesn’t seem to consider.

But overall, yes, this is a reasonable use of genAI that I think could help you make some informative choices in your study design.

Claude, give us some dummy data

First off, I am a huge advocate for using dummy data, and you’ll see quite a few of my OSF repositories contain code for generating dummy data to test my designs with. This is absolutely something people should do when they’re thinking about designing a study.

For example, this is how I generated data for my design without thinking too much about it (there’s a much more efficient way, but if I’m not giving Claude extra chances I shouldn’t give myself them either!)

library(tidyverse)

# Pin the random number generator so this dummy data can be regenerated exactly
set.seed(42)

# Post-test scores for the three intervention groups (wide format)
sdy <- tibble(Grp1 = rnorm(7, 37, 4),
              Grp2 = rnorm(7, 37, 4),
              Grp3 = rnorm(7, 39, 2))

# Control-only group, already in long format
sd2 <- tibble(Group = "Grp4",
              score = rnorm(9, 35, 4))

# Reshape the intervention groups to long format to match
sdy <- sdy |> 
  pivot_longer(cols = Grp1:Grp3,
               names_to = "Group",
               values_to = "score")

# Combine the groups and round scores to whole marks
sdy <- rbind(sdy, sd2) |> 
  mutate(score = round(score, 0))

rm(sd2)
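For what it’s worth, the ‘much more efficient way’ I mentioned might look something like this: a sketch that drives the generation from one small table of group parameters (the sizes, means and SDs match the code above; the seed is my addition for reproducibility).

```r
library(tidyverse)

set.seed(123)  # pin the draws so the dummy data can be regenerated exactly

# One row of parameters per group, then generate straight to long format
params <- tibble(Group = c("Grp1", "Grp2", "Grp3", "Grp4"),
                 n     = c(7, 7, 7, 9),
                 mu    = c(37, 37, 39, 35),
                 sigma = c(4, 4, 2, 4))

sdy <- pmap_dfr(params, \(Group, n, mu, sigma)
                tibble(Group = Group,
                       score = round(rnorm(n, mu, sigma))))
```

Adding a new group is then a one-line change to `params` rather than new generation code.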

This is what Claude generated: study_dummy_data.csv

claude <- read.csv("study_dummy_data.csv", header = TRUE)

And they look like this:

sdy |> 
  ggplot(aes(x = Group, y = score, colour = Group)) +
  geom_boxplot() +
  geom_jitter(alpha = .4) +
  theme_classic() +
  labs(x = "Group", y = "Post Intervention Test Score", subtitle = "Post intervention score (Dummy data, R Generated)")

claude |> 
  ggplot(aes(x = as.factor(group), y = t2_score, colour = as.factor(group))) +
  geom_boxplot() +
  geom_jitter(alpha = .4) +
  theme_classic()+
  labs(x = "Group", y = "Post Intervention Test Score", subtitle = "Post intervention score (Dummy data, Claude Generated)")

Generating this data has already got me thinking - do we really need a pre-post design if we’re comparing the treatments? Maybe not. Pre-Post testing has a priming effect on our participants, so I think this is where I’d start cutting that out, and potentially combine Groups 1 and 2.

Do you need genAI for this? No, of course not; you could generate this with Excel, R, or software of choice very easily. Could genAI teach you this the first time you do it? Yeah, sure, but is it necessary? Did the AI’s response at this step give me any ideas about my design? Nope, it did not.

Hey Claude, answer this survey

Please read the following and choose the single best answer. How well do you think this conversation will teach people about study design? Extremely well. Somewhat well. Neither well nor poorly. Somewhat poorly. Extremely poorly

It gets verbose, but it thinks ‘Somewhat well’.

I think there are actually a few use cases here. I’m interested generally in how AI represents the answers to certain questions, for two reasons. One, I think it’s interesting to see a reflection of the training data and the model in the answer; two, I think this might save some time in places. I have utilised genAI in a literature review, for example, and I think you could incorporate it into a literature or document review as a summarising tool that you later check and review.

As above, I think it’s very helpful to workshop your prompts ahead of time and to ask in a standardised fashion. I do wonder how much it will learn from previous conversations, though.

Claude, let’s make some figures

I’m really interested in image generation by AI these days, in fact, I’m writing papers on it.

Can you generate me a chart showing a significant difference in this study design

A figure generated by Claude showing a significant difference in this study design

Claude’s chart

Now, Claude isn’t actually an image generator; it can only produce figures drawn from data that it itself generates. If you read the linked conversation, you’ll spot that it had to re-generate the sample data to make this chart, as the initial sample data happened to show a non-significant result.

I think there’s a concern here, and it’s worth flagging, that charts are easier than ever to falsify now. We need more than ever to show our working and how our charts were generated.

But there might be a good argument for creating not a chart but some other form of figure. I have found myself using Nanobanana (Google’s image generator) and being very impressed. One of the things I really like about Google’s image generation is that SynthID is embedded in all its media, which facilitates AI detection.

Here’s an example Gemini Nanobanana conversation and image

Create an educational illustration of the ladder of aggression with minimal text featuring a cartoon border collie

A figure generated by Google's Nanobanana showing a dog aggression ladder

Nanobanana’s Ladder of Aggression

This image is not perfect, but I think with a few additional prompts you could probably make something quite useful here. Is it worth the ethical cost and cutting out a real artist? In my opinion, probably not, but I can really see where cash-strapped NGOs and time-strapped researchers would appreciate this.

I think it’s worth noting that Hadley Wickham has created an API interface for Nanobanana in an attempt to make its images more reproducible for research.

Final thoughts

None of this requires genAI - in fact, all of this is stuff that should be achievable with a bit of consideration and a search engine, or access to a methodologist.

The common argument in favour of genAI is that it democratises access to this type of knowledge: that there may be, for example, someone in an NGO who wants to design a quick intervention study, doesn’t have access to a methodologist, and may not have the understanding or time needed to do a few Google searches.

My counter-argument is that if they don’t have the understanding needed to search traditionally, then they don’t have the understanding needed to fact-check AI and do this right. Using genAI well certainly requires a number of conversations with your genAI of choice, and the ecological impact of those conversations is cumulative.

I’m not sure this is worth it - but if you are going to use it, hopefully this has given some pointers. The big take homes:

  • Record everything you do
  • Share links to conversations where possible, but be aware that these products may be discontinued
  • Be prepared to share your workflow and host it openly as much as possible
  • Don’t rely on it