mastodon.green is one of the many independent Mastodon servers you can use to participate in the fediverse.
Plant trees while you use Mastodon. A server originally for people in the EU, but now open for anyone in the world

#benchmarks

3 posts · 3 participants · 0 posts today

Meta is facing flak for using an experimental build of Llama 4 Maverick to inflate benchmark scores. The episode prompted an apology and a policy shift, now favoring the original version, which lags behind OpenAI's GPT-4o, Anthropic's Claude 3.5, and Google's Gemini 1.5. Meta explained that the experimental version was optimized for dialogue and excelled in LM Arena, though the reliability of that benchmark is debated. Meta clarifies that it tests many model variants and has released the open-source version of Llama 4. #AI #Meta #Llama4 #Benchmarks

Hello, clever Fediverse: I'm currently diving into a completely absurd #Rabbithole, and thanks to #LlmStudio and #Ollama my M1 MacBook has rediscovered its fan… Locally, #LLM work currently tops out at 8-12B parameters (32 GB RAM). Are there #Benchmarks anywhere that would please talk me out of the idea that an M4 with >48 GB RAM will be drastically better? Or would something else entirely be smarter? Or a different hobby? It has to be mobile (reachable), because my lifestyle is too unsettled for a desktop. Recommendations welcome in the comments.
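As a rough sanity check on those RAM limits, here is a hypothetical back-of-the-envelope sketch. The 1.2× overhead factor (for KV cache and runtime) and the quantization levels are assumptions for illustration, not measured values:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough estimate of resident memory (GB) for a local LLM,
    assuming the quantized weights dominate and a flat overhead
    multiplier covers KV cache and runtime allocations."""
    bytes_for_weights = params_billion * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

# A 12B model at 4-bit quantization: roughly 7 GB with overhead,
# comfortable on a 32 GB machine.
print(round(estimate_ram_gb(12, 4), 1))   # ~7.2

# The same model at 16-bit: close to 29 GB, which crowds out the OS
# on a 32 GB machine and explains the 8-12B practical ceiling.
print(round(estimate_ram_gb(12, 16), 1))  # ~28.8
```

Under these assumptions, >48 GB of RAM mainly buys room for larger or less-quantized models; it says nothing about tokens-per-second, which depends on memory bandwidth and the inference backend.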

Mashable: A new AI test is outwitting OpenAI, Google models, among others. “The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. The test, called ARC-AGI-2, is the second edition of the ARC-AGI benchmark; it tests models on general intelligence by challenging them to solve visual puzzles using pattern recognition, context […]

https://rbfirehose.com/2025/03/29/mashable-a-new-ai-test-is-outwitting-openai-google-models-among-others/


Ah, the classic tale of a coder thinking #SIMD would make their code fly 🚀, only to discover it trips over its own feet 👟. Our hero's memory seems as patchy as their #benchmarks, but fear not, the valuable lesson here is clear: #optimization is just a synonym for #headache. 🤦‍♂️
genna.win/blog/convolution-sim #coding #woes #lessons #HackerNews #ngated

genna.win · Performance optimization, and how to do it wrong | Just wing it. Optimization is hard. And sometimes, the compiler makes it even harder.
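The linked post's underlying lesson (measure the "optimized" code carefully, on realistic inputs, before trusting it) can be sketched with a minimal, hypothetical timing harness. The convolution function and input sizes here are invented for illustration and are not taken from the linked article:

```python
import timeit

def convolve_naive(signal, kernel):
    """Plain 1-D convolution (valid mode): the baseline any
    SIMD rewrite would have to beat."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

signal = [float(i % 7) for i in range(4096)]
kernel = [0.25, 0.5, 0.25]

# Best-of-several runs reduces noise from scheduling and caches;
# a single timing of a single run is how "patchy benchmarks" happen.
best = min(timeit.repeat(lambda: convolve_naive(signal, kernel),
                         repeat=5, number=10))
print(f"naive convolution, best of 5: {best:.4f}s for 10 runs")
```

The same harness, pointed at the vectorized variant, is what tells you whether the optimization actually helped on your data sizes.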

🔔 New Essay 🔔

"The Intelligent AI Coin: A Thought Experiment"

Open Access here: seanfobbe.com/posts/2025-02-21

Recent years have seen a concerning trend toward normalizing decision-making by Large Language Models (LLMs), including in the adoption of legislation, the writing of judicial opinions, and the routine administration of the rule of law. AI agents acting on behalf of human principals are supposed to lead us into a new age of productivity and convenience. The eloquence of AI-generated text and the narrative of super-human intelligence invite us to trust these systems more than we have trusted any human or algorithm before.

It is difficult to know whether a machine is actually intelligent because of problems with construct validity, plagiarism, reproducibility and transferability in AI benchmarks. Most people will either have to personally evaluate the usefulness of AI tools against the benchmark of their own lived experience or be forced to trust an expert.

To explain this conundrum, I propose the Intelligent AI Coin thought experiment and discuss four objections: restricting agents to low-value decisions, making AI decision-makers open source, adding a human-in-the-loop, and the general limits of trust in human agents.

@histodons @politicalscience

seanfobbe.com · [Essay] The Intelligent AI Coin: A Thought Experiment