Developed by the research groups of the MICS laboratory at CentraleSupélec in partnership with Illuin Technology, CroissantLLM is a new large language model (LLM). Fully open source, with 1.3 billion parameters, it runs efficiently on local consumer hardware, including entry-level PCs and smartphones. The model, available on the Hugging Face platform, will be presented this March 7 as part of the “Les Ateliers de l’IA” event at Paris La Défense.
CroissantLLM is presented as a sovereign, open, ethical and economical language model, the result of a collaboration between academia and industry, as Professor Céline Hudelot, director of the MICS laboratory, explains:
“This work is the result of a close collaboration between academia and industry, which illustrates the importance of synergy in advancing AI research. CroissantLLM is the fruit of work carried out by CentraleSupélec together with several renowned academic partners such as Sorbonne University, INESC-ID, Instituto Superior Técnico, Carnegie Mellon University and the Institut DATAIA. It was also made possible by Illuin Technology and by valuable support from industrial partners such as Unbabel, Diabolocom and EqualAI.”
The model was developed by a French research team and trained on the Jean Zay supercomputer operated by GENCI. Whereas most recent models are trained primarily on English corpora, the team chose to train CroissantLLM on equal amounts of French and English data, allowing the model to capture the specifics of the French language and culture.
To this end, the team collected more than 303 billion tokens of French data and 36 billion tokens of French-English translation data. The full training set contains 3 trillion tokens, more than was used to train Llama 2.
Based on the Llama architecture, CroissantLLM has 1.3 billion parameters, a size that allows fast inference on low-end GPU servers and decent speed on mobile devices or CPUs. The model is therefore accessible to a wide range of users, whether for specific industrial applications, translation or chat.
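As a rough sketch of what running the model locally could look like with the Hugging Face transformers library (the model identifier croissantllm/CroissantLLMBase is an assumption here; the exact name should be checked on the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face identifier for the base model
model_name = "croissantllm/CroissantLLMBase"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# ~1.3B parameters: small enough to load on a CPU or an entry-level GPU
model = AutoModelForCausalLM.from_pretrained(model_name)

# The model is bilingual, so French and English prompts both work
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```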
Model evaluations
To evaluate the model's performance in French, the researchers introduced FrenchBench, a benchmark made up of various classification and generation tasks. They also evaluated the model on standard English benchmarks.
The evaluations show that CroissantLLM delivers competitive performance in both English and French. On the French classification benchmarks, it significantly outperforms models of similar size trained mainly on monolingual English or French data, as well as multilingual models, and it even beats models three times its size, such as Bloom 3B, on most tasks.
For CroissantLLM Chat, the researchers fine-tuned CroissantLLM on chat data, including ChatGPT interactions, to improve its conversational capabilities in both languages.
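A minimal sketch of querying the chat variant, again with transformers (the identifier croissantllm/CroissantLLMChat-v0.1 and the availability of a chat template are assumptions to be verified on the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed identifier for the chat-tuned variant
model_name = "croissantllm/CroissantLLMChat-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Format the conversation with the tokenizer's chat template
messages = [{"role": "user", "content": "Explique-moi ce qu'est un croissant."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```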
A transparent and ethical model
The research team ensured that CroissantLLM complies with the rules set out in the recent EU AI Act, in order to make it an ethical model.
Manuel Faysse, one of the members of the research team, explains in a blog post on Hugging Face:
“The CroissantLLM initiative was designed with transparency in mind from the start. We validate 81% of the transparency criteria of the FMTI framework, beyond the scores of even the most open initiatives, publishing the data, the models, the training process and all the code used to filter the data and train the model.”
The models, datasets, training code and evaluation benchmarks are fully open source.
Article sources: Manuel Faysse's blog post.