IndoBERT Base P1

indobenchmark/indobert-base-p1

published Mar 2022 · updated May 2021

IndoBERT Base P1 is an embedding model that generates contextual embeddings for Indonesian text, trained on the Indo4B corpus using masked language modeling and next sentence prediction.

status

coming soon

API providers

downloads / mo

826.3K

license

mit

specs

Task	Feature Extraction
Architecture	BERT Base
Parameters	124.5M
License	MIT

about this model

IndoBERT Base (phase1 - uncased) is an embedding model that generates contextual representations of Indonesian text. It uses the BERT-base architecture (124.5M parameters) and was pretrained with masked language modeling and next sentence prediction objectives on the Indo4B dataset, 23.43 GB of text curated from publicly available Indonesian social media, blogs, news, and websites to ensure broad linguistic coverage. The model is released under the MIT license, supports PyTorch, TensorFlow, and JAX, and is tagged for Feature Extraction as its primary task.

The model serves as a baseline for the IndoNLU benchmark suite, introduced at AACL-IJCNLP 2020, where it was evaluated across twelve tasks ranging from single-sentence classification to sentence-pair sequence labeling, covering diverse domains and complexity levels. The official indobenchmark-toolkit repository provides evaluation and benchmarking utilities for this model.

It has been widely adopted within the Indonesian NLP ecosystem, with over 855,000 monthly downloads, 100 Spaces, and 139 finetuned variants.
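As a rough illustration of what "generating contextual representations" can look like in practice, the sketch below loads the checkpoint with the Hugging Face transformers library and reduces the per-token outputs to one vector per sentence. Mean pooling over non-padding tokens is an assumption made here for illustration; this page does not say which pooling strategy, if any, the authors recommend. The helper names mean_pool and embed are hypothetical.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(texts: List[str], model_id: str = "indobenchmark/indobert-base-p1") -> List[List[float]]:
    """Return one mean-pooled vector per input text.

    torch and transformers are imported locally so that mean_pool above
    can be used without them. The first call downloads the checkpoint.
    """
    import torch
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Pad to the longest text in the batch and truncate overlong inputs.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, hidden)

    # Pooling choice is an assumption, not something this page prescribes.
    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling embed(["Saya suka membaca buku."]) should return a list containing one vector. Using the first-token ([CLS]) vector instead of the mean is another common choice and may suit some downstream tasks better.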

best for

· Indonesian text classification
· Indonesian sentiment analysis
· Indonesian semantic similarity and paraphrase detection
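For the semantic similarity and paraphrase detection use case, one common way to compare two sentence embeddings is cosine similarity. The sketch below is a minimal standard-library version. The function name is hypothetical, and the choice of cosine similarity is an assumption rather than something this page specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A paraphrase detector built this way would still need a decision threshold chosen on labeled sentence pairs, and fine-tuning the model on the target task is likely to matter more than the similarity function itself.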

FAQ

What is IndoBERT Base P1 best used for?

It is best for generating contextual embeddings for Indonesian NLP tasks such as classification, sequence labeling, and similarity.

What license does IndoBERT Base P1 use?

It is released under the MIT license, allowing free use, modification, and distribution.

How can I call IndoBERT Base P1 via the gigarouter API?

The model is not yet live on gigarouter. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key and send the input text to the embeddings endpoint.

What input format does the model expect?

The model accepts Indonesian text strings; for embedding, provide the text in the input field of the API request.
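The request described in the two answers above might be assembled as in the sketch below, which builds (but does not send) an OpenAI-style embeddings request using only the standard library. The base URL is a placeholder, since this page does not give the real gigarouter address, and using the repository ID as the model string is also an assumption. The function name is hypothetical.

```python
import json
import urllib.request
from typing import List

# Placeholder: the actual gigarouter base URL is not stated on this page.
BASE_URL = "https://api.example.com/v1"
# Assumption: the hosted model is addressed by its repository ID.
MODEL_ID = "indobenchmark/indobert-base-p1"


def build_embeddings_request(
    texts: List[str], api_key: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request in the OpenAI embeddings format without sending it."""
    payload = {"model": MODEL_ID, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the model is available, the request could be sent with urllib.request.urlopen(request). If the response follows the OpenAI embeddings format, each vector would appear under data[i].embedding in the returned JSON.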

How does IndoBERT Base P1 compare to other Indonesian BERT models?

It is the base-sized member of the IndoBERT family, with 124.5M parameters, trained on the Indo4B dataset; large and lite variants are also available.

not yet live

We're benchmarking and onboarding IndoBERT Base P1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embeddings models

nomic-embed-text-v1.5