Language Models

Repository of pre-trained Language Models and NLP models.

HF-LLM.rs 🦀

HF-LLM.rs 🦀 is a CLI tool for accessing Large Language Models (LLMs) hosted on Hugging Face, such as Llama 3.1, Mistral, Gemma 2, Cohere and many more. It lets you interact with various models, provide input, and receive responses in a terminal environment. (credit: Vaibhav Srivastav)

unstructured library | Get the JSON and HTML versions of any PDF (legal, financial, medical…), even PDFs with tables!

Speech-to-Text | Get a transcription WITH SPEAKERS from a large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)

Video-to-Audio | A notebook and Web APP to get an mp3 audio file from any YouTube video

Speech-to-Text | Quickly get a transcription of a large audio file in any language with "Faster-Whisper"

Course | ChatGPT Prompt Engineering for Developers

Document AI | Accuracy of layout fine-tuned models (LiLT and LayoutXLM base) on the DocLayNet base dataset (notebooks)

Document AI | Inference at paragraph level by using the association of 2 Document Understanding models (LiLT and LayoutXLM base fine-tuned on DocLayNet base dataset)

Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at paragraph level

Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level with LayoutXLM base

Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at line level

Document AI | Inference APP and fine-tuning notebook for Document Understanding at line level with LayoutXLM base

Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level

Document AI | Inference APP for Document Understanding at line level

Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset

DocLayNet image viewer APP

Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)

Speech-to-Text & AI | Transcription of any Portuguese audio with Whisper

AI & companies | Reduce the inference time of Transformer models with BetterTransformer

NLP & Code for everyone | Weighted loss function for text classification (multiclass)

NLP in companies | How I trained a Portuguese T5 model on the QA task in Google Colab

NLP | Models and Web App for Named Entity Recognition (NER) in the Brazilian legal domain

Finetuning of the specialized version of the language model BERTimbau on a token classification task (NER) with the dataset LeNER-Br

Finetuning of the language model BERTimbau on LeNER-Br text files

NLP in companies | Techniques to speed up Deep Learning models for inference in production

NLP in companies | Text recognition with Deep Learning in PDFs and images

NLP in companies | How to create a BERT Question-Answering (QA) model with improved performance using AdapterFusion?

NLP in companies | How to fine-tune a natural language model like BERT for the Question-Answering (QA) task with an Adapter?

NLP in companies | How to fine-tune a natural language model like BERT for the token classification task (NER) with an Adapter?

NLP in companies | How to adapt a natural language model like BERT to a new linguistic domain with an Adapter?

NLP | Question Answering model in any language based on BERT large (case study in Portuguese)

NLP | How to add a domain-specific vocabulary (new tokens) to a subword tokenizer already trained like BERT WordPiece

Summary: In some cases, it may be crucial to enrich the vocabulary of an already trained natural language model with vocabulary from a specialized domain (medicine, law, etc.) in order to perform new tasks (classification, NER, summary, translation, etc.). While the Hugging Face library allows you to easily add new tokens to the vocabulary of an existing tokenizer like BERT WordPiece, those tokens must be whole words, not subwords. This article explains why and how to obtain these new tokens from a specialized corpus.
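As a rough illustration of the idea in the summary above, the sketch below collects frequent whole words from a specialized corpus that are missing from an existing vocabulary. It is not the article's method: the function name `candidate_new_tokens`, the regex-based word splitting, the `min_count` threshold and the toy corpus and vocabulary are all assumptions made for the example.

```python
import re
from collections import Counter


def candidate_new_tokens(corpus, existing_vocab, min_count=2, top_k=100):
    """Return frequent whole words from `corpus` that are not in `existing_vocab`.

    Hypothetical helper for illustration only; a real pipeline would
    probably use a proper word tokenizer (for example spaCy) instead of a regex.
    """
    counts = Counter()
    for text in corpus:
        # Extract runs of letters (Unicode-aware), lowercased: whole words, not subwords
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    # Keep words that are frequent enough and absent from the current vocabulary
    candidates = [
        word
        for word, count in counts.most_common()
        if count >= min_count and word not in existing_vocab
    ]
    return candidates[:top_k]


# Toy medical corpus and a toy "already known" vocabulary (both made up)
corpus = [
    "The patient presented with tachycardia and dyspnea.",
    "Dyspnea and tachycardia improved after treatment.",
    "The patient was discharged.",
]
existing_vocab = {
    "the", "patient", "presented", "with", "and",
    "improved", "after", "treatment", "was", "discharged",
}

new_tokens = candidate_new_tokens(corpus, existing_vocab)
print(new_tokens)
```

With the Hugging Face `transformers` library, a list like `new_tokens` would typically be passed to `tokenizer.add_tokens(new_tokens)`, followed by `model.resize_token_embeddings(len(tokenizer))` so that the model gets embedding rows for the added tokens.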

NLP | Question Answering model in any language based on BERT base (case study in Portuguese)

Portuguese

I trained one Portuguese Bidirectional Language Model (PBLM) with the MultiFiT configuration on one NVIDIA V100 GPU on GCP.

WARNING: a bidirectional LM using the MultiFiT configuration is a good model for text classification, but with only 46 million parameters it is far from being an LM that can compete with GPT-2 or BERT on NLP tasks like text generation. This is my next step ;-)

Note: The training times shown in the tables on this page are the sum of the creation time of the fastai DataBunch (forward and backward) and the training time of the bidirectional model over 10 epochs. The download time of the Wikipedia corpus and its preparation time are not counted.

MultiFiT configuration (architecture: 4 QRNN layers with 1550 hidden units per layer / tokenizer: SentencePiece (15 000 tokens))

| PBLM | accuracy | perplexity | training time |
|---|---|---|---|
| forward | 39.68% | 21.76 | 8h |
| backward | 43.67% | 22.16 | 8h |
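To help read the perplexity column, here is a small sketch that converts between perplexity and cross-entropy loss. It assumes the usual convention that perplexity is the exponential of the mean per-token cross-entropy loss in nats; the source does not state how its perplexity values were computed, and the helper names are made up for the example.

```python
import math


def loss_from_perplexity(perplexity):
    """Mean per-token cross-entropy (in nats), assuming perplexity = exp(loss)."""
    return math.log(perplexity)


def perplexity_from_loss(loss):
    """Perplexity, assuming it is defined as exp(mean cross-entropy in nats)."""
    return math.exp(loss)


# Perplexity values reported for the Portuguese model in the table above
pblm_perplexity = {"forward": 21.76, "backward": 22.16}

for direction, ppl in pblm_perplexity.items():
    loss = loss_from_perplexity(ppl)
    print(f"{direction}: perplexity {ppl} <-> loss of about {loss:.2f} nats per token")
```

Under that assumption, a lower perplexity means the model spreads its probability mass over fewer plausible next tokens on average.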

[ WARNING ] The code of the notebook lm3-portuguese-classifier-olist.ipynb must be updated to use the SentencePiece model and vocab already trained for the Portuguese Language Model in the notebook lm3-portuguese.ipynb, as was done in the notebook lm3-portuguese-classifier-TCU-jurisprudencia.ipynb (see the explanations at the top of that notebook).

Here's an example of using the classifier to predict the category of a TCU legal text:

[Image: Using the classifier to predict the category of TCU legal texts]

French

I trained three French Bidirectional Language Models (FBLM) on one NVIDIA V100 GPU on GCP. The best one is the model trained with the MultiFiT configuration.

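| French Bidirectional Language Models (FBLM) | direction | accuracy | perplexity | training time |
|---|---|---|---|---|
| MultiFiT with 4 QRNN + SentencePiece (15 000 tokens) | forward | 43.77% | 16.09 | 8h40 |
| MultiFiT with 4 QRNN + SentencePiece (15 000 tokens) | backward | 49.29% | 16.58 | 8h10 |
| ULMFiT with 3 QRNN + SentencePiece (15 000 tokens) | forward | 40.99% | 19.96 | 5h30 |
| ULMFiT with 3 QRNN + SentencePiece (15 000 tokens) | backward | 47.19% | 19.47 | 5h30 |
| ULMFiT with 3 AWD-LSTM + spaCy (60 000 tokens) | forward | 36.44% | 25.62 | 11h |
| ULMFiT with 3 AWD-LSTM + spaCy (60 000 tokens) | backward | 42.65% | 27.09 | 11h |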

1. MultiFiT configuration (architecture: 4 QRNN layers with 1550 hidden units per layer / tokenizer: SentencePiece (15 000 tokens))

| FBLM | accuracy | perplexity | training time |
|---|---|---|---|
| forward | 43.77% | 16.09 | 8h40 |
| backward | 49.29% | 16.58 | 8h10 |

Here's an example of using the classifier to predict the sentiment of reviews of an Amazon product:

[Image: Using the classifier to predict the sentiment of reviews of an Amazon product]

2. Architecture QRNN / tokenizer SentencePiece

| FBLM | accuracy | perplexity | training time |
|---|---|---|---|
| forward | 40.99% | 19.96 | 5h30 |
| backward | 47.19% | 19.47 | 5h30 |

3. Architecture AWD-LSTM / tokenizer spaCy

| FBLM | accuracy | perplexity | training time |
|---|---|---|---|
| forward | 36.44% | 25.62 | 11h |
| backward | 42.65% | 27.09 | 11h |