Fine-tuning
Netmind Power currently supports fine-tuning only through the API. In the future, users will be able to access this feature via the UI as well.
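Because fine-tuning is API-only for now, a job is created with an authenticated HTTP request. The sketch below is illustrative only: the base URL, the /v1/fine-tuning/jobs route, the payload field names (model, training_file, n_epochs), and the bearer-token auth scheme are assumptions, not the documented Netmind API. Consult the API reference for the actual routes and fields.

```python
import os

import requests

API_BASE = "https://api.netmind.ai"  # assumed base URL, not confirmed
API_KEY = os.environ["NETMIND_API_KEY"]  # assumed bearer-token auth

# Illustrative job-creation request; route and field names are guesses.
resp = requests.post(
    f"{API_BASE}/v1/fine-tuning/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "training_file": "my-dataset-id",  # placeholder dataset reference
        "n_epochs": 1,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```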
Fine-tuning Models
The following models are available to use with our fine-tuning API.
Training Precision Type indicates the precision type used during training for each model.
bf16 (bfloat16): This uses bf16 for all weights. Some large models on our platform use full bf16 training for better memory usage and training speed.
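For intuition on why bf16 helps: it keeps float32's 8-bit exponent (so the same dynamic range) but stores only 7 mantissa bits, halving memory per weight. A small PyTorch check, illustrative only and not part of the Netmind API:

```python
import torch

# bf16 keeps float32's 8-bit exponent (same dynamic range) but only
# 7 mantissa bits, so each element takes 2 bytes instead of 4.
x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.to(torch.bfloat16)

print(x32.element_size(), "->", x16.element_size())  # 4 -> 2 bytes per element
# Values survive the round-trip to within bf16's ~0.4% relative error.
print(torch.allclose(x32, x16.to(torch.float32), rtol=1e-2))
```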
LoRA Fine-tuning
| Model | Max Context Length | Training Precision Type | Epochs | Price (per 1M tokens) |
| --- | --- | --- | --- | --- |
| meta-llama/Meta-Llama-3-8B-Instruct | 8192 | bf16 | 1-4 | $0.48 |
| meta-llama/Llama-3.1-8B-Instruct | 8192 | bf16 | 1-4 | $0.48 |
| Qwen/Qwen2.5-7B-Instruct | 32768 | bf16 | 1-2 | $0.42 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 32768 | bf16 | 1-2 | $0.50 |
To request fine-tuning support for additional models, please email us at support@netmind.ai.
Pricing
Pricing for fine-tuning is based on model size, the number of training tokens, the number of validation tokens, the number of evaluations, and the number of epochs. In other words, the total number of tokens used in a job is n_epochs * n_tokens_per_dataset.
For example, if you start a "meta-llama/Meta-Llama-3-8B-Instruct" fine-tuning job with 1M tokens and 1 epoch, the cost will be $0.48.
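As a sanity check, the cost can be computed directly from the formula above. A minimal sketch; the prices mirror the LoRA fine-tuning table, and only the dataset token count and epoch count are inputs:

```python
# Price per 1M training tokens, taken from the LoRA fine-tuning table above.
PRICE_PER_1M_TOKENS = {
    "meta-llama/Meta-Llama-3-8B-Instruct": 0.48,
    "meta-llama/Llama-3.1-8B-Instruct": 0.48,
    "Qwen/Qwen2.5-7B-Instruct": 0.42,
    "meta-llama/Llama-3.2-11B-Vision-Instruct": 0.50,
}

def estimate_cost(model: str, n_tokens_per_dataset: int, n_epochs: int) -> float:
    """Estimate job cost: total tokens = n_epochs * n_tokens_per_dataset."""
    total_tokens = n_epochs * n_tokens_per_dataset
    return PRICE_PER_1M_TOKENS[model] * total_tokens / 1_000_000

# 1M tokens for 1 epoch on Meta-Llama-3-8B-Instruct -> $0.48, as in the example.
print(estimate_cost("meta-llama/Meta-Llama-3-8B-Instruct", 1_000_000, 1))
```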
Datasets on BNB Greenfield
BNB Greenfield is a decentralized storage and blockchain solution designed to harness decentralized technology for data ownership and the data economy.
We are excited to collaborate with BNB Greenfield to leverage its distributed storage capabilities. By hosting open-source datasets in public buckets, we make it easier for users to download or utilize these datasets directly in the platform's fine-tuning feature.
Looking ahead, we plan to continuously expand the dataset offerings, including publishing curated datasets created by our team. We are also exploring the possibility of allowing users to upload their own datasets and earn blockchain-based rewards.
Currently, we have deployed and support the following datasets for use with the fine-tuning feature:
OIG-small-chip2
OIG is a large-scale dataset of instructions created using data augmentation from a diverse collection of data sources and formatted in a dialogue style (<human>: … <bot>: … pairs). The goal of OIG is to help convert a language model pre-trained on large amounts of text into an instruction-following model. It is designed to support continued pre-training of a base model (e.g., GPT-NeoX-20B) that can later be fine-tuned on smaller, domain-specific datasets.
COIG
The Chinese Open Instruction Generalist (COIG) project provides a diverse set of high-quality Chinese instruction datasets to support the development of Chinese large language models (LLMs). It includes datasets for general instructions, exam scenarios, human value alignment, counterfactual correction chats, and programming tasks such as LeetCode problems. These resources aim to enhance instruction tuning and offer templates for building new Chinese datasets.
MULTIWOZ2_2
The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations spanning multiple domains and topics. MultiWOZ 2.1 (Eric et al., 2019) identified and fixed many erroneous annotations and user utterances in the original version, resulting in an improved version of the dataset. MultiWOZ 2.2 is yet another improved version, which fixes dialogue state annotation errors in 17.3% of the utterances on top of MultiWOZ 2.1, redefines the ontology by disallowing open vocabularies for slots with a large number of possible values (e.g., restaurant name, time of booking), and introduces standardized slot span annotations for these slots.
im-feeling-curious
This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. Tasks: Answering open-domain questions, generating random facts. Limitations: May contain commercial content, false information, bias, or outdated information.
You can access these public datasets through the following API endpoint: /v1/public-files.
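For example, listing the available public datasets might look like the request below. The /v1/public-files path comes from this page, but the base URL, auth header, and response shape are assumptions; check the API reference for the actual details.

```python
import os

import requests

API_BASE = "https://api.netmind.ai"  # assumed base URL, not confirmed
API_KEY = os.environ["NETMIND_API_KEY"]  # assumed bearer-token auth

resp = requests.get(
    f"{API_BASE}/v1/public-files",  # endpoint named on this page
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json():  # response assumed to be a JSON list of files
    print(item)
```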