Merge pull request #1242 from bact/docs-dep

bact · web-flow · commit 610c219c8221 · 2026-01-30T12:37:32.000Z
Add "docs" dependencies
diff --git a/.github/workflows/codemeta2cff.yml b/.github/workflows/codemeta2cff.yml
@@ -2,7 +2,9 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileType: SOURCE
 
-name: Generate CITATION.cff from codemeta.json
+# Generate CITATION.cff from codemeta.json
+
+name: Generate CITATION.cff
 run-name: Generate CITATION.cff after ${{github.event_name}} by ${{github.actor}}
 
 on:
diff --git a/.github/workflows/deploy-docs.yml b/.github/workflows/deploy-docs.yml
@@ -1,4 +1,10 @@
-name: Deploy development documentation
+# SPDX-FileCopyrightText: 2026 PyThaiNLP Project
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileType: SOURCE
+
+# Deploy documentation to https://pythainlp.org/docs/
+
+name: Deploy dev docs
 on:
   push:
     branches:
@@ -14,18 +20,17 @@ on:
 jobs:
   deploy-docs:
     name: Build and deploy documentation
-    runs-on: ubuntu-24.04
+    runs-on: ubuntu-latest
     steps:
     - name: Checkout
       uses: actions/checkout@v6
     - name: Set up Python
       uses: actions/setup-python@v6
       with:
-        python-version: "3.10"
+        python-version: "3.12"
     - name: Install build tools and doc build tools
       run: |
         pip install --upgrade "pip<24.1" "setuptools>=69.0.0,<=73.0.1"
-        pip install boto smart_open sphinx sphinx-rtd-theme
       # pip<24.1 because https://github.com/omry/omegaconf/pull/1195
       # setuptools>=65.0.2 because https://github.com/pypa/setuptools/commit/d03da04e024ad4289342077eef6de40013630a44#diff-9ea6e1e3dde6d4a7e08c7c88eceed69ca745d0d2c779f8f85219b22266efff7fR1
       # setuptools<=73.0.1 because https://github.com/pypa/setuptools/issues/4620
@@ -35,7 +40,7 @@ jobs:
     #  run: |
     #    if [ -f docker_requirements.txt ]; then pip install -r docker_requirements.txt; fi
     - name: Install PyThaiNLP
-      run: pip install .
+      run: pip install ".[docs]"
     - name: Build sphinx documentation
       run: |
         cd docs && make html
diff --git a/README.md b/README.md
@@ -14,9 +14,11 @@
 [![Google Colab Badge](https://badgen.net/badge/Launch%20Quick%20Start%20Guide/on%20Google%20Colab/blue?icon=terminal)](https://colab.research.google.com/github/PyThaiNLP/tutorials/blob/master/source/notebooks/pythainlp_get_started.ipynb)
 [![Chat on Matrix](https://matrix.to/img/matrix-badge.svg)](https://matrix.to/#/#thainlp:matrix.org)
 
-PyThaiNLP is a Python package for text processing and linguistic analysis, similar to [NLTK](https://www.nltk.org/) with a focus on Thai language.
+PyThaiNLP is a Python package for text processing and linguistic analysis,
+similar to [NLTK](https://www.nltk.org/) with a focus on Thai language.
 
-PyThaiNLP เป็นไลบารีภาษาไพทอนสำหรับประมวลผลภาษาธรรมชาติ คล้ายกับ NLTK โดยเน้นภาษาไทย [ดูรายละเอียดภาษาไทยได้ที่ README_TH.MD](https://github.com/PyThaiNLP/pythainlp/blob/dev/README_TH.md)
+PyThaiNLP เป็นไลบารีภาษาไพทอนสำหรับประมวลผลภาษาธรรมชาติ คล้ายกับ NLTK โดยเน้นภาษาไทย
+[ดูรายละเอียดภาษาไทยได้ที่ README_TH.MD](https://github.com/PyThaiNLP/pythainlp/blob/dev/README_TH.md)
 
 ## Quick install
 
diff --git a/docs/api/generate.rst b/docs/api/generate.rst
@@ -2,71 +2,44 @@
 
 pythainlp.generate
 ==================
-The :class:`pythainlp.generate` module is a powerful tool for generating Thai text using PyThaiNLP. It includes several classes and functions that enable users to create text based on various language models and n-gram models.
 
-Modules
--------
-
-Unigram
-~~~~~~~
-.. autoclass:: Unigram
-   :members:
+The :mod:`pythainlp.generate` module provides classes and functions for generating Thai text using n-gram and neural language models.
 
-The :class:`Unigram` class provides functionality for generating text based on unigram language models. Unigrams are single words or tokens, and this class allows you to create text by selecting words probabilistically based on their frequencies in the training data.
+N-gram generators
+-----------------
 
-Bigram
-~~~~~~
-.. autoclass:: Bigram
+.. autoclass:: pythainlp.generate.Unigram
    :members:
 
-The :class:`Bigram` class is designed for generating text using bigram language models. Bigrams are sequences of two words, and this class enables you to generate text by predicting the next word based on the previous word's probability.
+.. autoclass:: pythainlp.generate.Bigram
+   :members:
 
-Trigram
-~~~~~~~
-.. autoclass:: Trigram
+.. autoclass:: pythainlp.generate.Trigram
    :members:
 
-The :class:`Trigram` class extends text generation to trigram language models. Trigrams consist of three consecutive words, and this class facilitates the creation of text by predicting the next word based on the two preceding words' probabilities.
+Thai2fit helper
+---------------
 
-pythainlp.generate.thai2fit.gen_sentence
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.generate.thai2fit.gen_sentence
    :noindex:
 
-The function :func:`pythainlp.generate.thai2fit.gen_sentence` offers a convenient way to generate sentences using the Thai2Vec language model. It takes a seed text as input and generates a coherent sentence based on the provided context.
+WangChanGLM
+-----------
 
-pythainlp.generate.wangchanglm.WangChanGLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: pythainlp.generate.wangchanglm.WangChanGLM
    :members:
 
-The :class:`WangChanGLM` class is a part of the `pythainlp.generate.wangchanglm` module, offering text generation capabilities. It includes methods for creating text using the WangChanGLM language model.
-
 Usage
-~~~~~
-
-To use the text generation capabilities provided by the `pythainlp.generate` module, follow these steps:
+-----
 
-1. Select the appropriate class or function based on the type of language model you want to use (Unigram, Bigram, Trigram, Thai2Vec, or WangChanGLM).
-
-2. Initialize the selected class or use the function with the necessary parameters.
-
-3. Call the appropriate methods to generate text based on the chosen model.
-
-4. Utilize the generated text for various applications, such as chatbots, content generation, and more.
+Choose the generator class or function for the model you want, initialize it with appropriate parameters, and call its generation methods. Generated text can be used for chatbots, content generation, or data augmentation.
 
 Example
-~~~~~~~
-
-Here's a simple example of how to generate text using the `Unigram` class:
+-------
 
 ::
    from pythainlp.generate import Unigram
-   
-   # Initialize the Unigram model
+
    unigram = Unigram()
-   
-   # Generate a sentence
    sentence = unigram.gen_sentence("สวัสดีครับ")
-   
    print(sentence)
diff --git a/docs/api/summarize.rst b/docs/api/summarize.rst
@@ -2,20 +2,20 @@
 
 pythainlp.summarize
 ===================
-The :class:`summarize` is Thai text summarizer.
+The :mod:`pythainlp.summarize` module provides functions for Thai text summarization and keyword extraction.
 
-Modules
--------
+Functions
+---------
 
-.. autofunction:: summarize
-.. autofunction:: extract_keywords
+.. autofunction:: pythainlp.summarize.summarize
+.. autofunction:: pythainlp.summarize.extract_keywords
 
-Keyword Extraction Engines
+Keyword extraction engines
 --------------------------
 
 KeyBERT
-+++++++
+-------
 
 .. automodule:: pythainlp.summarize.keybert
-.. autoclass::  pythainlp.summarize.keybert.KeyBERT
+.. autoclass:: pythainlp.summarize.keybert.KeyBERT
     :members:
diff --git a/docs/conf.py b/docs/conf.py
@@ -97,6 +97,7 @@
     "sphinx.ext.mathjax",
     "sphinx.ext.ifconfig",
     "sphinx.ext.viewcode",
+    "sphinx_copybutton",
 ]
 
 # Add any paths that contain templates here, relative to this directory.
@@ -116,7 +117,7 @@
 #
 # This is also used if you do content translation via gettext catalogs.
 # Usually you set "language" from the command line for these cases.
-language = None
+language = "en"
 
 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
diff --git a/docs/index.rst b/docs/index.rst
@@ -10,7 +10,7 @@ PyThaiNLP documentation
 
 PyThaiNLP is a Python library for Thai natural language processing (NLP).
 
-Website: `PyThaiNLP.github.io <https://pythainlp.org/>`_
+Website: `pythainlp.org <https://pythainlp.org/>`_
 
 
 .. toctree::
@@ -38,10 +38,10 @@ Indices and tables
 
 Citations
 =========
-If you use PyThaiNLP in your project or publication, please cite the library as follows
+If you use PyThaiNLP in your project or publication, please cite the library as follows:
 
     Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.
 
-Apache Software License 2.0
+License: Apache License 2.0
 
 Maintained by the PyThaiNLP team.
diff --git a/docs/notes/FAQ.rst b/docs/notes/FAQ.rst
@@ -1,6 +1,6 @@
 FAQ
 ===
 
-*Frequently Asked Questions about PyThaiNLP*
+*Frequently asked questions about PyThaiNLP*
 
-You can read the FAQ at `FAQ | PyThaiNLP GitHub <https://pythainlp.org/FAQ>`_
+Read the FAQ at `PyThaiNLP FAQ <https://pythainlp.org/FAQ>`_.
diff --git a/docs/notes/command_line.rst b/docs/notes/command_line.rst
@@ -1,7 +1,7 @@
-Command Line
+Command line
 ============
 
-You can use some thainlp functions directly from command line.
+You can use some ``thainlp`` functions directly from the command line.
 
 **Tokenization**::
 
@@ -24,7 +24,7 @@ You can use some thainlp functions directly from command line.
     $ thainlp tokenize sent "หลายปีที่ผ่านมา ชาวชุมชนโคกยาวหลายคนได้พากันย้ายออก บ้างก็เสียชีวิต บางคนถูกจำคุกในข้อบุกรุกป่าหรือแม้กระทั่งสูญหาย"
     หลายปีที่ผ่านมา @@ชาวชุมชนโคกยาวหลายคนได้พากันย้ายออก @@บ้างก็เสียชีวิต @@บางคนถูกจำคุกในข้อบุกรุกป่าหรือแม้กระทั่งสูญหาย@@
 
-**Part-Of-Speech tagging**::
+**Part-of-speech tagging**::
 
     pythainlp tagg pos [-s SEPARATOR] TEXT
 
diff --git a/docs/notes/getting_started.rst b/docs/notes/getting_started.rst
@@ -1,23 +1,23 @@
-Getting Started
+Getting started
 ===============
 
-PyThaiNLP is a Python library for natural language processing (NLP) of Thai language. With this package, you can perform NLP tasks such as text classification and text tokenization.
+PyThaiNLP is a Python library for Thai natural language processing (NLP). With this package you can perform common NLP tasks such as text classification and tokenization.
 
-**Tokenization Example**::
+**Tokenization example**::
 
     from pythainlp.tokenize import word_tokenize
 
     text = "โอเคบ่เรารักภาษาถิ่น"
     word_tokenize(text, engine="newmm")  # ['โอเค', 'บ่', 'เรา', 'รัก', 'ภาษาถิ่น']
-    word_tokenize(text, engine="icu")  # ['โอ', 'เค', 'บ่', 'เรา', 'รัก', 'ภาษา', 'ถิ่น']
+    word_tokenize(text, engine="icu")    # ['โอ', 'เค', 'บ่', 'เรา', 'รัก', 'ภาษา', 'ถิ่น']
 
-Thai has historically faced a lot of NLP challenges. A quick list of them include as follows:
+Thai NLP faces several challenges. A brief list includes:
 
-#. **Start-end of sentence marking** - This is arguably the biggest problem for the field of Thai NLP. The lack of end of sentence marking (EOS) makes it hard for researchers to create training sets, the basis of most research in this field. The root of the problem is two-pronged. In terms of writing system, Thai uses space to indicate both commas and periods. No letter indicates an end of a sentence. In terms of language use, Thais have a habit of starting their sentences with connector terms such as 'because', 'but', 'following', etc, making it often hard even for natives to decide where the end of sentence should be when translating.
+#. **Sentence boundary detection** — This is one of the biggest challenges in Thai NLP. The lack of explicit end-of-sentence markers makes it difficult to create training sets for many tasks. The issue is twofold: in the writing system, Thai punctuation and spacing do not always indicate sentence endings; in language use, sentences often begin with conjunctions such as 'because' or 'but', which can make sentence boundaries ambiguous even for native speakers.
 
-#. **Word segmentation** - Thai does not use space and word segmentation is not easy. It boils down to understanding the context and ruling out words that do not make sense. This is a similar issue that other Asian languages such as Japanese and Chinese face in different degrees. For languages with space, a similar but less extreme problem would be multi-word expressions, like the French word for potato — 'pomme de terre'. In Thai, the best known example is "ตา-กลม" and "ตาก-ลม". As of recent, new techniques that capture words, subwords, and letters in vectors seem poised to overcome to issue.
+#. **Word segmentation** — Thai does not use spaces to separate words, so segmentation is challenging. Solving it often requires understanding context to rule out unlikely word breaks. This is similar to issues in other Asian languages such as Japanese and Chinese. Recently, techniques that represent words, subwords, and characters as vectors (embeddings) have improved performance and help address this problem.
 
-Tutorial Notebooks
+Tutorial notebooks
 ==================
 - `PyThaiNLP Get Started <https://pythainlp.org/tutorials/notebooks/pythainlp-get-started.html>`_
 - `Other tutorials <https://pythainlp.org/tutorials/>`_
diff --git a/docs/notes/installation.rst b/docs/notes/installation.rst
diff --git a/docs/notes/license.rst b/docs/notes/license.rst
diff --git a/pyproject.toml b/pyproject.toml
diff --git a/pythainlp/ancient/aksonhan.py b/pythainlp/ancient/aksonhan.py
diff --git a/pythainlp/ancient/currency.py b/pythainlp/ancient/currency.py
