Commit 70c005c

sarahyurick authored and NeMo Bot committed
Add feedback to tutorials (#1476)
* Add feedback to tutorials
  Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* clarify install instructions for classifier tutorials
  Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* byo classifiers
  Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
* add descriptions
  Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

---------

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
1 parent bc8a91f commit 70c005c

12 files changed, +438 -52 lines changed

tutorials/text/distributed-data-classification/README.md

Lines changed: 18 additions & 12 deletions
@@ -15,17 +15,23 @@ For more information about the classifiers, refer to our [Distributed Data Class
 
 <div align="center">
 
-| NeMo Curator Classifier | Hugging Face Page |
-| --- | --- |
-| `AegisClassifier` | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0) |
-| `ContentTypeClassifier` | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) |
-| `DomainClassifier` | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) |
-| `FineWebEduClassifier` | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) |
-| `FineWebMixtralEduClassifier` | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) |
-| `FineWebNemotronEduClassifier` | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) |
-| `InstructionDataGuardClassifier` | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) |
-| `MultilingualDomainClassifier` | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) |
-| `PromptTaskComplexityClassifier` | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) |
-| `QualityClassifier` | [quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) |
+| NeMo Curator Classifier | Description | Hugging Face Page |
+| --- | --- | --- |
+| `AegisClassifier` | Identify and categorize unsafe content per document | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) and [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0) |
+| `ContentTypeClassifier` | Categorize the type-of-speech per document | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) |
+| `DomainClassifier` | Categorize the domain per document | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) |
+| `FineWebEduClassifier` | Determine the educational value per document; this model was trained using annotations from Llama 3 70B-Instruct | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) |
+| `FineWebMixtralEduClassifier` | Determine the educational value per document; this model was trained using annotations from Mixtral 8x22B-Instruct | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) |
+| `FineWebNemotronEduClassifier` | Determine the educational value per document; this model was trained using annotations from Nemotron-4-340B-Instruct | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) |
+| `InstructionDataGuardClassifier` | Identify LLM poisoning attacks per document | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) |
+| `MultilingualDomainClassifier` | Categorize the domain per document; supports classification in 52 languages | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) |
+| `PromptTaskComplexityClassifier` | Classifies text prompts across task types and complexity dimensions | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) |
+| `QualityClassifier` | Categorize documents as high, medium, or low quality | [quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) |
 
 </div>
+
+Note that all classifiers support English text classification only, except the `MultilingualDomainClassifier`.
+
+## Bring Your Own Classifier
+
+Advanced users may want to integrate their own Hugging Face classifier(s) into NeMo Curator. Broadly, this requires creating a `CompositeStage` consisting of a CPU-based tokenizer stage and a GPU-based model inference stage. Refer to the [Text Classifiers README](https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/stages/text/classifiers#text-classifiers) for details about how to do this.
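
The two-stage pattern described in the added README text (a CPU tokenizer stage feeding a GPU inference stage, wrapped in one composite) can be illustrated without any NeMo Curator imports. The following is a minimal, library-free sketch: `ToyTokenizerStage`, `ToyModelStage`, and `ToyCompositeStage` are hypothetical names invented for illustration, not NeMo Curator classes, and the keyword rule only stands in for real GPU model inference. For the actual base classes and signatures, the linked Text Classifiers README remains the reference.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTokenizerStage:
    """Hypothetical CPU-side stage: lowercases and splits text into tokens."""

    max_length: int = 16

    def process(self, batch: list[dict]) -> list[dict]:
        # Truncate to max_length tokens, as a real tokenizer would cap sequence length.
        return [
            {**doc, "tokens": doc["text"].lower().split()[: self.max_length]}
            for doc in batch
        ]


@dataclass
class ToyModelStage:
    """Hypothetical stand-in for GPU inference: a keyword rule, not a real model."""

    keywords: frozenset = frozenset({"gpu", "cuda"})

    def process(self, batch: list[dict]) -> list[dict]:
        # Label a document "tech" if any of its tokens is a keyword.
        return [
            {**doc, "label": "tech" if self.keywords & set(doc["tokens"]) else "other"}
            for doc in batch
        ]


@dataclass
class ToyCompositeStage:
    """Hypothetical composite: runs its sub-stages in order on each batch."""

    stages: list = field(
        default_factory=lambda: [ToyTokenizerStage(), ToyModelStage()]
    )

    def process(self, batch: list[dict]) -> list[dict]:
        for stage in self.stages:
            batch = stage.process(batch)
        return batch


docs = [
    {"text": "GPU kernels and CUDA streams"},
    {"text": "A recipe for lemon cake"},
]
labeled = ToyCompositeStage().process(docs)
print([doc["label"] for doc in labeled])
```

Keeping tokenization and inference as separate stages is what lets a framework schedule them on different resources; the composite only fixes their order.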

tutorials/text/distributed-data-classification/aegis-classification.ipynb

Lines changed: 41 additions & 3 deletions
@@ -14,9 +14,9 @@
 " - Volta™ or higher (compute capability 7.0+)\n",
 " - CUDA 12.x\n",
 "\n",
-"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
+"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.\n",
 "\n",
-"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
+"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies (`text_cuda12` or `all`). Check proper installation with:"
 ]
 },
 {
@@ -25,12 +25,50 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Silence Curator logs via Loguru\n",
+"# First, silence Curator logs via Loguru\n",
 "import os\n",
 "\n",
 "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\""
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import nemo_curator\n",
+"\n",
+"nemo_curator.__version__ # should be >= 1.0.0"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can check that GPUs are available, then check that the `gpustat` dependency was installed:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!nvidia-smi"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import gpustat\n",
+"\n",
+"gpustat.__version__ # check gpu dependency is installed"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
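
The cells added in this hunk check the installation one import at a time, so a missing package stops the notebook with a traceback. As a complementary sketch, the same three checks (the `nemo_curator` version, the `gpustat` dependency, and `nvidia-smi`) can be gathered into one report using only the standard library. The helper names `release_tuple`, `meets_minimum`, and `environment_report` are hypothetical and not part of NeMo Curator, and the version comparison is deliberately simplified: it ignores pre-release suffixes, so `1.0.0rc1` would count as meeting `1.0.0`.

```python
import importlib.util
import re
import shutil


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '1.0.0rc1' -> (1, 0, 0)."""
    match = re.match(r"\d+(\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: str = "1.0.0") -> bool:
    """Compare release numbers only; pre-release suffixes are ignored (assumption)."""
    have, need = release_tuple(version), release_tuple(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def environment_report() -> dict[str, bool]:
    """Check each prerequisite without raising if something is missing."""
    return {
        "nemo_curator importable": importlib.util.find_spec("nemo_curator") is not None,
        "gpustat importable": importlib.util.find_spec("gpustat") is not None,
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
    }


report = environment_report()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")

if report["nemo_curator importable"]:
    import nemo_curator

    print("version >= 1.0.0:", meets_minimum(nemo_curator.__version__))
```

Finding `nvidia-smi` on the `PATH` only shows that the driver tools are installed; running `!nvidia-smi` as the notebook does is still the way to confirm that a GPU is actually visible.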

tutorials/text/distributed-data-classification/content-type-classification.ipynb

Lines changed: 42 additions & 4 deletions
@@ -12,23 +12,61 @@
 " - Volta™ or higher (compute capability 7.0+)\n",
 " - CUDA 12.x\n",
 "\n",
-"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
+"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.\n",
 "\n",
-"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
+"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies (`text_cuda12` or `all`). Check proper installation with:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# Silence Curator logs via Loguru\n",
+"# First, silence Curator logs via Loguru\n",
 "import os\n",
 "\n",
 "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\""
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import nemo_curator\n",
+"\n",
+"nemo_curator.__version__ # should be >= 1.0.0"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can check that GPUs are available, then check that the `gpustat` dependency was installed:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!nvidia-smi"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import gpustat\n",
+"\n",
+"gpustat.__version__ # check gpu dependency is installed"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},

tutorials/text/distributed-data-classification/domain-classification.ipynb

Lines changed: 42 additions & 4 deletions
@@ -12,23 +12,61 @@
 " - Volta™ or higher (compute capability 7.0+)\n",
 " - CUDA 12.x\n",
 "\n",
-"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
+"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.\n",
 "\n",
-"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
+"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies (`text_cuda12` or `all`). Check proper installation with:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# Silence Curator logs via Loguru\n",
+"# First, silence Curator logs via Loguru\n",
 "import os\n",
 "\n",
 "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\""
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import nemo_curator\n",
+"\n",
+"nemo_curator.__version__ # should be >= 1.0.0"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can check that GPUs are available, then check that the `gpustat` dependency was installed:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!nvidia-smi"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import gpustat\n",
+"\n",
+"gpustat.__version__ # check gpu dependency is installed"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},

tutorials/text/distributed-data-classification/fineweb-edu-classification.ipynb

Lines changed: 42 additions & 4 deletions
@@ -12,23 +12,61 @@
 " - Volta™ or higher (compute capability 7.0+)\n",
 " - CUDA 12.x\n",
 "\n",
-"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
+"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.\n",
 "\n",
-"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
+"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies (`text_cuda12` or `all`). Check proper installation with:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# Silence Curator logs via Loguru\n",
+"# First, silence Curator logs via Loguru\n",
 "import os\n",
 "\n",
 "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\""
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import nemo_curator\n",
+"\n",
+"nemo_curator.__version__ # should be >= 1.0.0"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can check that GPUs are available, then check that the `gpustat` dependency was installed:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!nvidia-smi"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import gpustat\n",
+"\n",
+"gpustat.__version__ # check gpu dependency is installed"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},

tutorials/text/distributed-data-classification/fineweb-mixtral-edu-classification.ipynb

Lines changed: 42 additions & 4 deletions
@@ -12,23 +12,61 @@
 " - Volta™ or higher (compute capability 7.0+)\n",
 " - CUDA 12.x\n",
 "\n",
-"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
+"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.\n",
 "\n",
-"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
+"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies (`text_cuda12` or `all`). Check proper installation with:"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"# Silence Curator logs via Loguru\n",
+"# First, silence Curator logs via Loguru\n",
 "import os\n",
 "\n",
 "os.environ[\"LOGURU_LEVEL\"] = \"ERROR\""
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import nemo_curator\n",
+"\n",
+"nemo_curator.__version__ # should be >= 1.0.0"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can check that GPUs are available, then check that the `gpustat` dependency was installed:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!nvidia-smi"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"import gpustat\n",
+"\n",
+"gpustat.__version__ # check gpu dependency is installed"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
