Hugging Face has achieved a remarkable AI breakthrough: vision-language models that run on devices as small as smartphones while outperforming predecessors that require large data centers.
The company's new SmolVLM-256M model, which requires less than one gigabyte of GPU memory, surpasses the performance of its Idefics 80B model from just 17 months ago, a system 300 times larger. This dramatic reduction in size alongside an increase in capability marks a watershed moment for practical AI deployment.
“When we released Idefics 80B in August 2023, we were the first company to open-source a vision-language model,” Andrés Marafioti, machine learning research engineer at Hugging Face, said in an exclusive interview with VentureBeat. “By achieving a 300x size reduction while improving performance, SmolVLM marks a breakthrough in vision-language models.”
Smaller AI models running on everyday devices
The development comes at a crucial moment for businesses struggling with the astronomical computing costs of implementing AI systems. The new SmolVLM models, available in 256M and 500M parameter sizes, process images and understand visual content at speeds previously unreachable in their size class.
The smallest version processes 16 examples per second while using only 15GB of RAM with a batch size of 64, making it very attractive for businesses looking to process large volumes of visual data. “For a mid-sized company that processes 1 million images per month, this translates into significant annual savings in computational costs,” Marafioti told VentureBeat. “The reduced memory footprint means businesses can deploy on more affordable cloud instances, reducing infrastructure costs.”
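The quoted figures lend themselves to a quick back-of-envelope check. Taking the stated throughput of 16 images per second and the hypothetical workload of 1 million images per month at face value:

```python
# Back-of-envelope compute estimate using the figures quoted above:
# 16 images/second throughput, 1 million images per month.
IMAGES_PER_SECOND = 16
IMAGES_PER_MONTH = 1_000_000

seconds_needed = IMAGES_PER_MONTH / IMAGES_PER_SECOND  # 62,500 seconds
hours_needed = seconds_needed / 3600                   # ~17.4 hours

print(f"~{hours_needed:.1f} GPU-hours per month")      # ~17.4 GPU-hours per month
```

At roughly 17 GPU-hours of monthly compute, on hardware with under 1GB of GPU memory, the workload fits comfortably on inexpensive cloud instances, which is the substance of Marafioti's cost claim.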
The development has already caught the attention of major technology players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document processing software. “While IBM certainly has access to massive computing resources, using small models like this allows them to efficiently process millions of documents at a fraction of the cost,” Marafioti said.
How Hugging Face reduced model size without compromising power
The efficiency gains come from technical innovations in both the vision processing and language components. The team moved from a 400M-parameter vision encoder to a 93M-parameter version and implemented more aggressive token compression techniques. These changes maintain high performance while reducing computational requirements.
For startups and small businesses, these developments can be transformative. “Startups can now launch sophisticated computer vision products in weeks instead of months, with infrastructure costs that were prohibitive just months ago,” Marafioti said.
The impact goes beyond cost savings, enabling entirely new applications. The models power advanced document search through ColPali, an algorithm that creates searchable databases from document archives. “They achieve performance very close to models 10x their size while significantly increasing the speed at which the database is created and searched, making enterprise-wide visual search accessible to businesses of all types for the first time,” Marafioti explained.
Why small AI models are the future of AI development
The breakthrough challenges conventional wisdom about the relationship between model size and capability. While many researchers believed that large models were necessary for advanced vision-language tasks, SmolVLM has shown that smaller, more efficient architectures can achieve similar results. The 500M parameter version achieves 90% of the performance of its 2.2B parameter sibling on key benchmarks.
Rather than suggesting an efficiency plateau, Marafioti sees these results as evidence of untapped potential: “Until now, the standard has been to release VLMs starting at 2B parameters, as if smaller models were not useful. We have proven that, in fact, models one-tenth the size can be very useful for businesses.”
This development comes amid growing concern about AI’s environmental impact and computational costs. By dramatically reducing the resources required for vision-language AI, Hugging Face’s innovation helps address both issues while making advanced AI capabilities accessible to a wider range of organizations.
The models are available open source, continuing Hugging Face’s tradition of increasing access to AI technology. This accessibility, combined with the models’ efficiency, could accelerate the adoption of vision-language AI in industries from healthcare to retail, where processing costs were previously prohibitive.
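Because the models are distributed openly on the Hugging Face Hub, a team can experiment with them using a few lines of the `transformers` library. The sketch below is illustrative only: the checkpoint name is an assumption based on Hub naming conventions, and should be verified against the model card before use.

```python
# Hedged sketch: captioning an image with a small open VLM via the
# `transformers` library. The checkpoint ID below is an assumption
# based on Hugging Face Hub naming conventions -- verify it first.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint name


def build_messages(question: str) -> list:
    """Build the chat-style message list a VLM processor's chat template
    expects: one user turn with an image placeholder and a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }]


def describe_image(image_path: str, question: str = "Describe this image.") -> str:
    """Load the model lazily (it is ~1GB) and generate an answer for one image."""
    from transformers import AutoModelForVision2Seq, AutoProcessor
    from PIL import Image

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

    image = Image.open(image_path)
    prompt = processor.apply_chat_template(
        build_messages(question), add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

The lazy import keeps the heavyweight dependencies out of module load, so the message-building helper can be reused (for example, in batch pipelines) without pulling in the model itself.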
In a field where bigger has long meant better, Hugging Face’s success suggests a new paradigm: the future of AI may not be found in ever-larger models running in distant data centers, but in agile, efficient systems that run right on our devices. As the industry grapples with questions of scale and sustainability, these smaller models may represent the biggest breakthrough yet.