Introduction to science funding

my first forays into metascience; adjusted to adhere to ap sem irr

Where is the General Relativity of today? The suspicion that science is stagnating is not unfounded. Economists Bloom et al. (2020) find that the “number of researchers required today to achieve the famous doubling of computer chip density is more than 18 times larger than the number required in the early 1970s.” And this pattern, needing ever more scientists and funding for the same amount of progress, holds across domains. Economists debate the explanations, but there is no doubt that every minute and dollar is increasingly precious. Yet scientists now spend copious time applying for grants they might not receive, and the federal government distributes $159.8 billion annually for US R&D (NSF). If science so critically propels societal progress (by one estimate, 85% of economic growth since 1945; Press 2013), we should really want to know: are we allocating these resources efficiently?

I’ll focus on two sources of inefficiency: (1) transaction costs (time spent on the funding process rather than on research) and (2) selection errors (funding less valuable projects over more valuable ones).

The debate in science funding has two sides. Defenders of standard peer review argue that the process keeps taxpayer money accountable and that only domain experts can evaluate which work is promising. Reformers counter that the system has become bloated and overly conservative: it wastes researcher time for diminishing returns, and its low-risk incentives filter out the high-variance ideas needed for breakthroughs. This essay analyzes both perspectives. The conclusion (as usual) falls somewhere in the blurry middle.

1. Transaction costs of screening

von Hippel & von Hippel (2015) find that the average proposal takes 116 hours of principal-investigator (PI) time. Is this worth it?

The costs are massive.

The most rigorous cost study, Herbert et al. (2013) in BMJ Open, surveyed the researchers behind 3,727 NHMRC proposals in Australia and found they spent roughly 550 working years preparing them, equivalent to about $66 million AUD in salary time. With a 21% success rate, roughly 80% of that effort went into proposals that were never funded. The opportunity costs are enormous. (One caveat: an Australian study may not generalize perfectly to the NIH, but the scale of wasted effort is likely similar.)
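As a sanity check on the scale, here is a minimal arithmetic sketch using the figures quoted above; the working-days-per-year value is my own assumption, not a number from the paper.

```python
# Back-of-the-envelope arithmetic on the Herbert et al. (2013) figures quoted above.
proposals = 3727
working_years = 550
success_rate = 0.21
days_per_working_year = 220   # assumption for illustration, not from the paper

days_per_proposal = working_years * days_per_working_year / proposals
unfunded_share = 1 - success_rate   # share of proposals (and roughly of effort) never funded

print(f"~{days_per_proposal:.0f} working days per proposal")
print(f"~{unfunded_share:.0%} of proposal-writing effort went to unfunded applications")
```

Even under conservative assumptions, each proposal consumes on the order of a month of full-time work, most of it on applications that are never funded.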

Gordon and Poulin (2009) found that the total cost of the peer-review competition for Canada’s NSERC exceeded what it would cost to simply give every researcher about $30K per year with no competition at all. Here, selection cost more than the money being distributed.

But the screening does add value, up to a point.

Where the two sides often disagree is whether this cost buys anything. Defenders point to Li & Agha (2015), published in Science, which tracks roughly 130,000 NIH grants from 1980 to 2008. In this enormous sample, they find reviewers can identify better proposals: “A one–standard deviation worse peer-review score among awarded grants is associated with 15% fewer citations, 7% fewer publications, 19% fewer high-impact publications, and 14% fewer follow-on patents.” So the process is better than random.

But reformers argue that although Li & Agha shows peer review can separate bad proposals from good ones, it does not show that the system can discriminate among good proposals, which is where most of the competition actually happens.

Here, the evidence favors the reformers. Cole, Cole, and Simon, in their seminal 1981 paper in Science, found that above a certain quality threshold, outcomes depend more on who happens to sit on the panel that day than on the proposal itself. While 1981 is old, Pier et al. (2018) confirm the pattern in a modern study in which 43 reviewers independently scored the same 25 grants: inter-rater reliability was just 0.2, and by their estimate you would need roughly 12 reviewers per proposal just to reach 0.5 reliability. These studies suggest that which reviewers you draw matters more than the proposal.
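For intuition about that “how many reviewers would it take?” projection, the standard tool is the Spearman–Brown formula, which maps a single reviewer’s reliability to the reliability of an averaged panel. The sketch below shows only the mechanics; the reliabilities looped over are assumed illustrative values, not estimates taken from Pier et al., and the implied panel size is quite sensitive to which estimate you plug in.

```python
# Spearman-Brown projection: the reliability of the mean of k reviewers, each with
# single-reviewer reliability r, is R_k = k*r / (1 + (k-1)*r).
def reviewers_needed(r: float, target: float = 0.5) -> int:
    """Smallest panel size whose projected reliability reaches the target."""
    k = 1
    while (k * r) / (1 + (k - 1) * r) < target:
        k += 1
    return k

# Illustrative single-reviewer reliabilities (assumed values, not from the paper):
for r in (0.05, 0.08, 0.15):
    print(f"r = {r:.2f} -> {reviewers_needed(r)} reviewers needed to reach 0.5")
```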

So while the two sides often dispute the effectiveness of screening, their findings don’t really contradict each other. Li & Agha describes the lower end of the quality range, where peer review works well; at the higher end (say, distinguishing the 85th from the 95th percentile), the current approach breaks down. Is the extra scrutiny cost-effective? Gross and Bergstrom (2019) show that the additional effort of polishing a proposal from the 95th to the 98th percentile has a negative expected return, so time is being spent where it would be better used on the research itself.
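To see how a negative return at the top can arise, here is a toy back-of-the-envelope calculation. It is not the Gross and Bergstrom model, and every number in it is an assumption chosen only for illustration.

```python
# Toy calculation: does extra polishing near the top pay for itself in expectation?
extra_polish_hours = 40        # assumed extra hours spent polishing an already-strong proposal
prob_gain = 0.02               # assumed bump in award probability from that polish
grant_research_hours = 1500    # assumed research time the grant would buy if awarded

expected_hours_gained = prob_gain * grant_research_hours
net_hours = expected_hours_gained - extra_polish_hours
print(f"expected research hours bought by polishing: {expected_hours_gained:.0f}")
print(f"net hours after subtracting the polishing time: {net_hours:+.0f}")
# With these made-up numbers the net is negative: a small probability bump on an
# already-competitive proposal buys back less time than the polishing consumes.
```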

Why doesn’t the process correct itself?

Given these marginal returns, why don’t researchers simply spend less time grant-writing once they know that effort past a certain point is wasted? The von Hippel & von Hippel (2015) study found that the sheer quantity of proposals written, more than their quality, is what translates into funding. Universities also receive a slice of each grant, so they too reward grant-getting faculty. Facing these incentives, a rational researcher overinvests in grant writing, which lowers everyone’s collective output. The solution to such a cooperation problem has to come from structural changes to funding.
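Here is a minimal sketch of that rat-race dynamic. Only the 116-hour figure comes from the cited survey; the pool size, number of labs, and hours budget are assumptions for illustration.

```python
# Toy rat race: a fixed funding pool is split among identical labs, so when every lab
# writes more proposals the shares don't change, but the time left for research does.
HOURS_PER_PROPOSAL = 116       # per-proposal PI time, from von Hippel & von Hippel (2015)
ANNUAL_HOURS = 2000            # assumed hours a lab splits between grants and research
FUNDING_POOL = 1_000_000       # assumed fixed pool of dollars for the whole field
N_LABS = 100                   # assumed number of identical competing labs

def symmetric_outcome(proposals_per_lab: int):
    """Funding and research hours per lab when every lab submits the same number."""
    funding_share = FUNDING_POOL / N_LABS            # equal split of a fixed pool
    research_hours = ANNUAL_HOURS - proposals_per_lab * HOURS_PER_PROPOSAL
    return funding_share, research_hours

for k in (2, 5, 10):
    share, hours = symmetric_outcome(k)
    print(f"{k:>2} proposals/lab -> ${share:,.0f} each, {hours} research hours left")
# Funding per lab is identical in all three scenarios, yet collective research time
# shrinks; each lab still has a private incentive to add proposals to outcompete the
# others, which is the overinvestment described above.
```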

2. Selection errors (do incentives miss high-value work?)

This section asks: even if the process were free, would it fund the right research? Here the two perspectives differ more sharply.

Public science funding currently uses an accountability model reminiscent of contracting: money is tied to specific projects with predefined deliverables on short timelines. From the taxpayer’s perspective, this logically prevents slack and seems to ensure every dollar buys an increment of progress.

Reformers say that funding projects with predefined deliverables (as the NIH does) optimizes in the wrong direction, because scientific value is heavy-tailed: most of it comes from a few outlier discoveries.
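To make the heavy-tail premise concrete, here is a minimal simulation using a lognormal distribution as a stand-in for project value; the parameters are assumptions for illustration, not estimates from any cited study.

```python
# How much of the total value comes from the top 1% of projects under a heavy tail?
import random

random.seed(0)
values = sorted(random.lognormvariate(0, 2.0) for _ in range(100_000))
top_1_percent = values[int(0.99 * len(values)):]

share = sum(top_1_percent) / sum(values)
print(f"top 1% of projects account for ~{share:.0%} of total value")
# With sigma = 2.0 the top 1% carry on the order of a third of all value, which is the
# reformers' point: under a heavy tail, enabling rare outliers matters more than
# nudging up the median project.
```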

However, defenders aren’t sold on this premise. Citations are indeed heavy-tailed, but citations don’t equate to scientific value. And although penicillin and CRISPR contributed more than thousands of individual papers, we cannot rerun history: whether different funding mechanisms would actually have produced more such discoveries is, they point out, an untestable counterfactual.

Does peer review systematically miss outliers?

Wang, Veugelers, and Stephan (2017), in Research Policy, find that reviewers are reluctant to support unconventional ideas in front of their peers: championing a strange proposal can damage one’s reputation if it fails, while backing a conventional one carries little personal risk. This incentive to conform likely filters out exactly the high-variance bets that heavy-tailed payoffs favor.

Defenders might respond that filtering out weird ideas is the point, because most of them fail. This is a genuine tradeoff we still lack an answer to: is a lower-quality median worth a better tail?

The HHMI case is promising but disputed

The most-cited evidence is Azoulay et al.’s 2011 paper “Incentives and Creativity,” which compares the NIH (project-based, short review cycles, low failure tolerance) with HHMI (people-based, renewable every five years). HHMI investigators were 96% more likely to produce papers in the top 1% of the citation distribution, but they also flopped more often (35% more low-citation papers). This is exactly the outlier-dominated pattern: a modest sacrifice in the typical paper in exchange for more breakthroughs in the tail.

Though the Azoulay paper is often cited as a slam-dunk argument for funding people rather than projects, defenders point out that it is still an observational study. HHMI’s acceptance rate is under 1% (versus over 20% at the NIH), which means it picks from a pool of already-elite scientists. Even though Azoulay et al. worked hard to control for selection, HHMI investigators already had higher citation rates before the award, so we cannot cleanly separate the effect of the funding model from the effect of simply picking better people. Azoulay himself acknowledges that the study is suggestive rather than conclusive.

3. What does this mean?

We find evidence supporting three takeaways (held with varying confidence):

  1. Transaction costs likely outweigh the marginal selection benefits.
  2. Peer review does move the needle (thankfully), but diminishing returns mean we cannot simply scale up this bureaucracy.
    1. The studies from Herbert (2013), Gordon (2009), and Pier (2018) show that the current system consumes enormous amounts of researcher time in exchange for screening that is unreliable near the top of the quality distribution.
    2. Even defenders should take this seriously.
  3. There may be a bias against outlier work.
    1. Wang (2017) finds a conformity bias among reviewers, and Azoulay (2011) suggests people-based funding can produce more breakthroughs.
    2. However, causation is still not proven.

The correct response is to experiment with alternative funding models, such as modified lotteries. With limited resources, rigorously designing and testing those experiments is where the debate should go next, and it is what I’ll turn to in the TMP.