18 November 2024

New publication on Portuguese NLP

To stimulate the future of open development of neural text generation in Portuguese, Nicholas Kluge Corrêa and colleagues present both GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens, and Tucano, a series of decoder-transformers natively pre-trained in Portuguese.

We are very pleased to announce the release of the latest research by Nicholas Kluge Corrêa and colleagues Aniket Sen, Sophia Falk, and Shiza Fatimah, which promises to further advance the open-source Portuguese NLP (natural language processing) community. In their study, following their previous work on the TeenyTinyLlama pair, the researchers tackle a major challenge in the world of NLP: creating capable language models for low-resource languages.

While languages like English have access to datasets with trillions of tokens, countless benchmarks for evaluation, and a new state-of-the-art model published every few weeks, Portuguese, like many other languages, lags far behind.

The key contributions of their study are the following:

The creation of GigaVerbo, a large, high-quality dataset for Portuguese language modeling. With this dataset, the team pushed language-model pretraining for Portuguese past the 500-billion-token mark.
The creation of auxiliary filters and datasets to help parse large-scale Portuguese text corpora.
The development of the Tucano series, a new collection of open-source, efficient, and effective language models for Portuguese. Tucano models outperform all multilingual models of comparable size, even surpassing Llama-3.2 models on several benchmarks; a usage sketch follows this list.
The critical comparison of available benchmarks, showing that many evaluations used by the Portuguese NLP community, when applied to foundation models natively pre-trained on Portuguese, correlate little or not at all with the number of tokens ingested during training.
All byproducts of the researchers' work, such as datasets, models, code, and logs, are open and freely available, paving the way for a fairer and more sustainable future for the Portuguese NLP community.
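
Because these artifacts are released openly (see the links below), a Tucano checkpoint can be tried in a few lines with the Hugging Face transformers library. What follows is a minimal sketch, not the authors' documented usage: the repository id, the prompt, and the sampling settings are assumptions made for illustration; consult the project's model cards for the exact identifiers.

```python
# Minimal sketch: sampling Portuguese text from a Tucano checkpoint
# with Hugging Face transformers. The repository id below is an
# assumption for illustration; check the project's model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TucanoBR/Tucano-1b1"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "A floresta amazônica é"  # "The Amazon rainforest is"
inputs = tokenizer(prompt, return_tensors="pt")

# The base Tucano models are decoder-only generators, so we simply
# sample a short continuation of the prompt.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

GigaVerbo itself, at roughly 200 billion tokens, is too large to download casually; one would likely stream a few examples instead, as sketched below, again under the assumption about the repository id.

```python
# Minimal sketch: streaming a handful of GigaVerbo examples rather
# than downloading the full corpus. Repository id is an assumption.
from datasets import load_dataset

gigaverbo = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)
for example in gigaverbo.take(3):
    print(example)
```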

Preprint

Models and datasets

GitHub
