mastodon.green is one of the many independent Mastodon servers you can use to participate in the fediverse.
Plant trees while you use Mastodon. A server originally for people in the EU, but now open for anyone in the world

#benchmarks

3 posts · 3 participants · 0 posts today

Meta is facing flak for using an experimental build of Llama 4 Maverick to inflate benchmark scores. The episode prompted an apology and a policy shift, now favoring the original version, which lags behind OpenAI's GPT-4o, Anthropic's Claude 3.5, and Google's Gemini 1.5. Meta explained that the experimental version was optimized for dialogue and excelled in LM Arena, though the reliability of that benchmark is debated. Meta clarifies that it tests many model variants and has released the open-source version of Llama 4. #AI #Meta #Llama4 #Benchmarks

Hello, clever Fediverse: I'm currently diving into a completely absurd #Rabbithole, and thanks to #LlmStudio and #Ollama my M1 MacBook has rediscovered its fan… Locally, #LLM work currently tops out at 8-12B parameters (32 GB RAM). Are there #Benchmarks anywhere that would please talk me out of the idea that an M4 with >48 GB RAM will be drastically better? Or would something else entirely be smarter? Or a different hobby? It has to be mobile (reachable), because my lifestyle is too unsettled for a desktop. Recommendations welcome in the comments.
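As a rough sanity check on those RAM limits, here is a hypothetical back-of-the-envelope sketch. The 1.2× overhead factor (for KV cache and runtime) and the quantization levels are assumptions for illustration, not measured values:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough estimate of resident memory (GB) for a local LLM,
    assuming the quantized weights dominate and a flat overhead
    multiplier covers KV cache and runtime allocations."""
    bytes_for_weights = params_billion * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

# A 12B model at 4-bit quantization: roughly 7 GB with overhead,
# comfortable on a 32 GB machine.
print(round(estimate_ram_gb(12, 4), 1))   # ~7.2

# The same model at 16-bit: close to 29 GB, which crowds out the OS
# on a 32 GB machine and explains the 8-12B practical ceiling.
print(round(estimate_ram_gb(12, 16), 1))  # ~28.8
```

Under these assumptions, >48 GB of RAM mainly buys room for larger or less-quantized models; it says nothing about tokens-per-second, which depends on memory bandwidth and the inference backend.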

Mashable: A new AI test is outwitting OpenAI, Google models, among others. “The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. The test, called ARC-AGI-2, is the second edition of the ARC-AGI benchmark; it tests models on general intelligence by challenging them to solve visual puzzles using pattern recognition, context […]

https://rbfirehose.com/2025/03/29/mashable-a-new-ai-test-is-outwitting-openai-google-models-among-others/


Ah, the classic tale of a coder thinking #SIMD would make their code fly 🚀, only to discover it trips over its own feet 👟. Our hero's memory seems as patchy as their #benchmarks, but fear not, the valuable lesson here is clear: #optimization is just a synonym for #headache. 🤦‍♂️
genna.win/blog/convolution-sim #coding #woes #lessons #HackerNews #ngated

genna.win · Performance optimization, and how to do it wrong | Just wing it. Optimization is hard. And sometimes, the compiler makes it even harder.
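The linked post's underlying lesson (measure the "optimized" code carefully, on realistic inputs, before trusting it) can be sketched with a minimal, hypothetical timing harness. The convolution function and input sizes here are invented for illustration and are not taken from the linked article:

```python
import timeit

def convolve_naive(signal, kernel):
    """Plain 1-D convolution (valid mode): the baseline any
    SIMD rewrite would have to beat."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

signal = [float(i % 7) for i in range(4096)]
kernel = [0.25, 0.5, 0.25]

# Best-of-several runs reduces noise from scheduling and caches;
# a single timing of a single run is how "patchy benchmarks" happen.
best = min(timeit.repeat(lambda: convolve_naive(signal, kernel),
                         repeat=5, number=10))
print(f"naive convolution, best of 5: {best:.4f}s for 10 runs")
```

The same harness, pointed at the vectorized variant, is what tells you whether the optimization actually helped on your data sizes.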

🔔 New Essay 🔔

"The Intelligent AI Coin: A Thought Experiment"

Open Access here: seanfobbe.com/posts/2025-02-21

Recent years have seen a concerning trend toward normalizing decision-making by Large Language Models (LLMs), including in the adoption of legislation, the writing of judicial opinions, and the routine administration of the rule of law. AI agents acting on behalf of human principals are supposed to lead us into a new age of productivity and convenience. The eloquence of AI-generated text and the narrative of super-human intelligence invite us to trust these systems more than we have trusted any human or algorithm before.

It is difficult to know whether a machine is actually intelligent because of problems with construct validity, plagiarism, reproducibility and transferability in AI benchmarks. Most people will either have to personally evaluate the usefulness of AI tools against the benchmark of their own lived experience or be forced to trust an expert.

To explain this conundrum, I propose the Intelligent AI Coin thought experiment and discuss four objections: restricting agents to low-value decisions, making AI decision-makers open source, adding a human-in-the-loop, and the general limits of trust in human agents.

@histodons @politicalscience

seanfobbe.com · [Essay] The Intelligent AI Coin: A Thought Experiment