Research Guides: Issues of AI: AI Apps & Models

AI Apps, Models, Chatbots and LLMs (Large Language Models)

Which AI to Use Now: An Updated Opinionated Guide by Ethan Mollick - One Useful Thing
Released on January 26, 2025
Why Claude 3.5 Sonnet from Anthropic, Gemini from Google, and ChatGPT from OpenAI are still the best three options for most people.

Finding More Large Language Models on Benchmarks

Chatbot Arena (aka LMSYS)
"We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer."

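The Elo rating system mentioned in the Chatbot Arena description can be illustrated with a short sketch. This is a textbook Elo update, not Chatbot Arena's actual implementation; the starting rating of 1000, the K-factor of 32, and the model names and battle results are all assumptions chosen for illustration.

```python
# A minimal sketch of a standard Elo update after one head-to-head "battle".
# Assumptions (not from the source): K-factor of 32, starting rating of 1000,
# and hypothetical model names and votes.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b).

    score_a is 1.0 if A wins the vote, 0.0 if B wins, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical crowdsourced votes: (model A, model B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
battles = [("model_x", "model_y", 1.0), ("model_x", "model_y", 0.5)]

for a, b, score_a in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

# Leaderboard: highest rating first.
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```

Each vote moves rating points from the loser to the winner, and an upset against a higher-rated model moves more points than an expected win.
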
Humanity's Last Exam
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.

METR by Model Evaluation & Threat Research
AI companies and wider society want to understand the capabilities of frontier AI systems, and what risks they pose.

METR researches, develops and runs cutting-edge tests of AI capabilities, including broad autonomous capabilities and the ability of AI systems to conduct AI R&D.

SimpleBench
"We introduce SimpleBench, a multiple-choice text benchmark for LLMs where individuals with unspecialized (high school) knowledge outperform SOTA models. SimpleBench includes over 200 questions covering spatio-temporal reasoning, social intelligence, and what we call linguistic adversarial robustness (or trick questions). For the vast majority of text-based benchmarks LLMs outperform a non-specialized human, and increasingly, exceed expert human performance."

SWE-bench
"SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution."

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Top 10 LLM leaderboards by Shakudo
As of September 2024, several prominent LLM leaderboards are actively monitoring and evaluating the performance of large language models across a diverse range of benchmarks and tasks. These leaderboards provide invaluable insights into how different models stack up against each other, helping researchers and practitioners understand the strengths and weaknesses of various approaches. Here are some of the top LLM leaderboards you should know about.

Perplexity
"Perplexity's Copilot feature provides a guided AI search experience, allowing you to explore topics in depth and learn new things." --- Perplexity

Poe
A chatbot platform from Quora. Poe (Platform for Open Exploration) lets users access a range of large language models and bots.

There's an AI for That
This is a database of over 33,000 AI tools available for over 13,000 tasks ... (as of 4/14/2025). "Use our smart AI search to find the best and latest AI tools for any use case." --- There's an AI for That

You.com
"You.com's platform empowers users, regardless of their technical expertise, by allowing them to select from a wide array of LLMs and tailor their assistant's behavior through custom instructions." --- Michael Nuñez, VentureBeat, May 30, 2024

Consensus
"AI Search Engine for Research. Find & understand the best science, faster." --- Consensus

Connected Papers
"Connected Papers is a unique, visual tool to help researchers and applied scientists find and explore papers relevant to their field of work. To create each graph, we analyze an order of ~50,000 papers and select the few dozen with the strongest connections to the origin paper."

Emergent Mind
Search for papers, topics, authors, or questions, and Emergent Mind will find the most relevant computer science papers on arXiv and synthesize an answer with citations.

Semantic Scholar
"Semantic Scholar provides free, AI-driven search and discovery tools, and open resources for the global research community. We index over 200 million academic papers sourced from publisher partnerships, data providers, and web crawls."