If there's Intelligent Life out There

Optimizing LLMs to be great at particular tests backfires on Meta, Stability.


Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard seeks to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.


Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent ... June 26, 2024


Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and most daunting of all: high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.


The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes first, third, and 10th place with its handful of variants. Also appearing are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.


Tests to qualify on the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission on the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models to avoid a confusing glut of small LLMs.


As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second version.


Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high marks in the first. This came from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regression in real-world performance. This regression, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.



Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin has a handle on all the latest tech news.



bit_user
LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.
First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.


jp7189
I don't like the click-bait China vs. the world title. The fact is Qwen is open source, open weights and can be run anywhere. It can be (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first.


jp7189
bit_user said:
First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and capabilities you might be familiar with, if you study child development or animal intelligence.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.
We're creating tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.

