
Did AI Get It Right?

Reviewing screening outcomes for the 2025 Global Challenges
Published by Pooja Wagh and Rebecca Spens

Never miss an update from our AI for Good series: sign up for the AI for Good newsletter.

In this AI for Good series, we’ve been exploring the growing operational challenges facing social sector funders and how tech might help us respond to them. Catch up on part one and part two.

Earlier this year, MIT Solve partnered with Harvard Business School and the Foster School of Business at the University of Washington to design, test, and deploy an AI tool that helped staff screen applications to our annual Global Challenges. The tool assigned a “passing probability” that assessed how likely each solution was to advance to the next round of our evaluation pipeline based on our screening criteria. From there, the tool categorized the solutions as “Pass,” “Fail,” or “To Review,” the last of which would require a human staff member to make the decision.
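
To make that triage step concrete, here is a minimal sketch (in Python) of how probability cutoffs could map onto the three categories. The specific threshold values are assumptions for illustration only, not the ones used in the actual tool.

    # Minimal sketch of the triage step described above. The cutoff values
    # below are illustrative placeholders, not the thresholds Solve used.
    PASS_CUTOFF = 0.80  # assumed: auto-pass at or above this passing probability
    FAIL_CUTOFF = 0.20  # assumed: auto-fail at or below this passing probability

    def triage(passing_probability: float) -> str:
        """Map an AI-assigned passing probability to a screening category."""
        if passing_probability >= PASS_CUTOFF:
            return "Pass"
        if passing_probability <= FAIL_CUTOFF:
            return "Fail"
        return "To Review"  # routed to a human staff member for the final call

    print(triage(0.55))  # -> To Review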

Double (and triple) checking AI’s work

Because of the extensive training and testing we performed while developing the tool, we felt confident in its ability to make decisions, but we wanted to dive deeper:

  1. To see how “To Review” solutions fared later in the process. This would help us decide whether to tighten cutoffs in the future and reduce the amount of human screening time further.

  2. To check how well AI’s pass/fail calls matched human judgment. This would ensure performance remained consistent with what we saw during tool tuning.

  3. To compare how humans and AI explain why a solution failed. If aligned, this could open up ways to use AI-generated rationales directly in applicant feedback.

What We Learned

Most “To Review” solutions didn’t make the cut with humans

When the AI tool marked solutions as needing a closer look, human screeners usually ended up rejecting those submissions. That happened across all categories: Health (where 73% of “To Review” solutions were rejected by humans), Learning (72%), Climate (65%), and Economic Prosperity (59%). This confirmed that the majority of solutions tagged “To Review” weren’t going to pass, which makes sense because we were extremely careful about assigning outright fails. This also suggests that we could adjust thresholds to reduce the human review burden further.

Among the “To Review” pool, we also found that the solutions humans advanced had higher average passing probabilities than those they failed. This demonstrates strong alignment between human and AI judgment.

Overall, only 3% of solutions marked “To Review” by AI reached the semifinalist round, and under 1% became finalists or Solvers, underscoring that this category mainly contains solutions that aren’t strong enough to advance beyond the initial stages.

AI, like people, can make mistakes

When we audited 80 of the solutions that AI had passed or failed, humans aligned with the AI tool’s decision 70 times, an agreement rate of roughly 88%. Of the 10 disagreements, eight were cases where AI passed a solution that humans failed. This is a safer error direction than the inverse: failing something that should have passed. Of the two that AI failed but humans passed, one, upon re-review, turned out to have been incorrectly passed by the human reviewers, meaning the AI judgment was actually correct. The other was incorrectly failed by the AI, though reviewers agreed it would not have advanced to the semifinalist stage in any case. This is an important reminder that even extremely well-tuned and tested AI tools will not perform perfectly (which is also true of even the most experienced humans).
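
For illustration, the snippet below shows how an audit like this can be tallied: it counts agreements and splits disagreements by direction. The (AI decision, human decision) pairs are invented placeholders, not the actual audit sample.

    from collections import Counter

    # Hypothetical audit sample: (ai_decision, human_decision) pairs.
    audited = [
        ("Pass", "Pass"), ("Fail", "Fail"), ("Pass", "Fail"),
        ("Fail", "Pass"), ("Pass", "Pass"), ("Fail", "Fail"),
    ]

    outcomes = Counter(
        "agree" if ai == human
        else "ai_passed_human_failed" if ai == "Pass"
        else "ai_failed_human_passed"
        for ai, human in audited
    )

    agreement_rate = outcomes["agree"] / len(audited)
    print(outcomes)
    print(f"Agreement rate: {agreement_rate:.0%}")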

Lastly, when we looked at how solutions automatically passed by AI performed at the next stage of selection (reviews), we found a positive correlation between the AI-assigned passing probability and later human scores.

In short, higher passing probability from AI at the screening stage generally predicted stronger human ratings at the next stage of the selection process.
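
As a hedged illustration of how such a relationship can be checked, the sketch below computes a Spearman rank correlation between screening probabilities and later review scores. Both lists are made-up numbers, and the post does not say which correlation measure was actually used.

    from scipy.stats import spearmanr

    # Placeholder data: each position is one solution the AI auto-passed at
    # screening, with its passing probability and its later human review score.
    ai_passing_probability = [0.92, 0.81, 0.88, 0.75, 0.97, 0.79]
    human_review_score = [4.5, 3.8, 4.1, 3.2, 4.8, 3.9]

    rho, p_value = spearmanr(ai_passing_probability, human_review_score)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")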

Solid judgment vs. shakier justification

When both AI and humans failed a solution, humans felt that at least one “failure reason” selected by the AI tool was incorrect 53% of the time. The most commonly mis-selected failure reasons related to how well the idea fit the challenge or how the solution used technology. However, when looking at the AI’s overall qualitative rationale for rejection, human reviewers judged 63% to be clear and accurate enough to share directly with applicants, a stronger performance than for the AI-assigned failure reasons.

Looking Forward

Overall, our first year using AI in screening showed strong alignment with human judgment and helped us streamline the earliest stage of selection. For future cycles, we will consider adjusting thresholds so that the weaker applications are automatically filtered out, while continuing to safeguard against false negatives (i.e., solutions incorrectly failed at the screening stage).
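
One way to act on this, sketched below under assumptions of our own (the 1% cap on false negatives and the sample records are illustrative, not Solve’s settings): sweep candidate auto-fail cutoffs over past screening results and keep the strictest cutoff whose false-negative rate stays under the cap.

    # Sketch: pick the highest auto-fail cutoff whose false-negative rate
    # (solutions a human passed but the cutoff would auto-fail) stays under a cap.
    # The cap and the sample records below are illustrative assumptions.
    def pick_fail_cutoff(records, candidate_cutoffs, max_false_negative_rate=0.01):
        """records: list of (passing_probability, human_passed) pairs."""
        human_passes = sum(1 for _, passed in records if passed)
        best = None
        for cutoff in sorted(candidate_cutoffs):
            false_negatives = sum(
                1 for prob, passed in records if passed and prob <= cutoff
            )
            if human_passes and false_negatives / human_passes <= max_false_negative_rate:
                best = cutoff  # strictest cutoff so far that respects the cap
        return best

    history = [(0.05, False), (0.12, False), (0.18, False), (0.22, True), (0.35, True)]
    print(pick_fail_cutoff(history, [0.10, 0.15, 0.20, 0.25]))  # -> 0.2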

While we began exploring how AI-generated rationales might eventually be used in applicant feedback, we’re not ready to confidently implement this yet.

This was a big first step in responsibly integrating AI into the way we select social impact solutions. We’re excited to continue learning, testing, and sharing as we go.

Tags:

  • Innovation
  • AI for Good
